Automated Root-Cause Attribution for Fleet-Level Operational Anomalies
Causal Discovery, Shapley Attribution, Bayesian Diagnosis, and Streaming Analytics for Autonomous GSE Fleets
Last updated: 2026-04-11
Key Takeaway: When a 50-vehicle autonomous GSE fleet shows a 3x intervention rate spike at Airport B on Tuesday, the root cause could be any of 8+ categories: sensor degradation, stale maps, OTA model regression, weather shift, new aircraft type, construction, operator behavior, or software bug. Manual triage (reviewing rosbags, cross-referencing logs, interviewing operators) takes 4-48 hours per incident and does not scale past 20 vehicles or 2 airports. Automated root-cause attribution combines hierarchical anomaly detection (CUSUM/EWMA on fleet KPIs), causal discovery (NOTEARS on operational telemetry), Shapley-value decomposition (which airport/vehicle/component drives the anomaly), and Bayesian diagnosis trees (sequential evidence gathering to pinpoint causes). The system reduces Mean Time To Resolution (MTTR) from 8-48 hours to 15-90 minutes, cuts false-positive alerts by 60-80% through hierarchical filtering, and provides the causal evidence trail that EU PLD 2024/2853 and ISO 3691-4 increasingly require. Implementation cost: $55-95K over 16-22 weeks, with ROI at 15-25 vehicles as avoided downtime and faster resolution exceed system cost.
Table of Contents
- The Attribution Problem
- Anomaly Detection at Fleet Scale
- Causal Discovery from Operational Data
- Shapley-Value Attribution
- Bayesian Diagnosis Trees
- OTA Regression Attribution
- Map Staleness Attribution
- Environmental Attribution
- Implementation Architecture
- Cost-Benefit and Metrics
- Key Takeaways
- References
1. The Attribution Problem
1.1 Why Fleet-Level Root-Cause Analysis is Hard
A single vehicle anomaly has a tractable diagnosis path: review the rosbag, check sensor health, inspect the scene. Fleet-level anomalies are fundamentally different because causes propagate, interact, and hide across the combinatorial space of vehicles x airports x subsystems x time.
The dimensionality of fleet diagnosis:
Fleet anomaly search space:
Vehicles: 50-100 vehicles
Airports: 3-10 airports
Subsystems: 6 per vehicle (perception, localization, planning, control, sensors, compute)
Components: ~25 per vehicle (4-8 LiDAR, cameras, radar, thermal, GTSAM, Frenet, CAN, etc.)
Software: OTA version, map version, parameter set, calibration
Environment: Weather, time-of-day, traffic, construction, aircraft mix
Human: Operator behavior, shift patterns, training recency
Total search space: ~10^6 - 10^8 unique state combinations per day
Manual triage throughput: 1-3 root causes / analyst / day
Anomaly generation rate (50-vehicle fleet): 5-20 anomalies / day
Result: Manual triage creates a permanent and growing backlog
1.2 Root Cause Categories
Every fleet anomaly traces to one or more of eight root cause categories. These are not mutually exclusive -- a model regression exposed only during fog at a newly resurfaced stand involves three simultaneous causes.
| # | Root Cause Category | Example Signal | Typical Onset | Manual Triage Time |
|---|---|---|---|---|
| 1 | Sensor degradation | Point cloud density drop, intensity shift | Gradual (hours-days) | 1-4 hours |
| 2 | Map staleness | Localization error increase, path deviation | Step change (construction) or gradual (drift) | 2-8 hours |
| 3 | OTA model regression | Per-class mAP drop, new false-positive pattern | Step change (deployment time) | 4-24 hours |
| 4 | Weather/environment | Correlated cross-vehicle performance drop | Gradual-to-sudden | 30 min - 2 hours |
| 5 | New aircraft type | Detection failures near specific stands | Step change (airline schedule change) | 2-8 hours |
| 6 | Construction/layout change | Localization divergence in specific zone | Step change (construction start) | 4-16 hours |
| 7 | Operator behavior change | Intervention pattern shift without vehicle anomaly | Gradual (new operators) or step (policy change) | 4-48 hours |
| 8 | Software bug | Deterministic failure on specific input pattern | Step change (code deployment) | 2-24 hours |
1.3 Why Manual Triage Does Not Scale Past 20 Vehicles
Manual root-cause analysis follows a predictable workflow that breaks at fleet scale:
Manual triage workflow (per anomaly):
1. Receive alert (e.g., "Vehicle 23 intervention rate 4x baseline") 5 min
2. Query telemetry database for vehicle timeline 10-30 min
3. Download and review rosbag from event 20-60 min
4. Cross-reference with weather, OTA, map version 10-30 min
5. Compare with other vehicles at same airport 15-45 min
6. Compare with same vehicle at different times 15-45 min
7. Check sensor health logs 10-20 min
8. Formulate hypothesis and test 30-120 min
9. Write root-cause report 20-60 min
Total: 2.5 - 7 hours per anomaly (median ~4 hours)
With 1 analyst: 2 anomalies/day capacity
Fleet of 20 vehicles: ~5-10 anomalies/day (borderline manageable)
Fleet of 50 vehicles: ~15-30 anomalies/day (3-6x analyst capacity)
Fleet of 100 vehicles across 5 airports: ~40-80 anomalies/day (impossible)
The problem compounds because unresolved anomalies cascade. A sensor degradation left undiagnosed for 48 hours, because the analyst is busy triaging OTA regressions, can become a safety event; the analyst then spends two days on the safety investigation, during which ten more anomalies pile up.
Industry experience confirms this scaling wall:
| Company | Fleet Size at Manual Triage Breakdown | Solution |
|---|---|---|
| Waymo | ~100 vehicles (2019-2020) | Built centralized anomaly attribution system |
| Cruise | ~200 vehicles (2022) | Automated "trip diagnostics" pipeline |
| Nuro | ~50 vehicles (2021) | Hierarchical alert triage with auto-classification |
| Tesla | ~50,000 FSD vehicles (2023) | Fully automated fleet learning + regression system |
| Motional | ~150 vehicles (2023) | ML-based root cause classifier |
The reference airside AV stack will hit this wall at 15-25 vehicles across 2+ airports. Building attribution infrastructure before that point is significantly cheaper than retrofitting under operational pressure.
1.4 The Multi-Cause Problem
In practice, 30-40% of fleet anomalies have multiple contributing causes. Single-cause diagnosis leads to partial fixes that recur.
Example: The Tuesday morning problem at Airport B
Observed: 3x intervention rate, Tuesday 06:00-10:00, Airport B, 12 of 15 vehicles
Investigation reveals three co-occurring causes:
1. OTA v3.2.1 deployed Monday night (regression on personnel detection in fog)
2. Seasonal fog onset (October) degrades LiDAR range 20-30% at dawn
3. New Turkish Airlines A350 began operations Tuesday (wider wingspan,
different reflectivity profile than trained A320/B737)
Fixing only the OTA regression reduces interventions 40%, not 100%.
Fixing only the fog handling reduces interventions 20%.
Addressing the new aircraft type reduces interventions 15%.
All three together resolve 90%+ of the anomaly.
Interactions:
- OTA regression only manifests in fog (personnel detection robust in clear weather)
- Fog was tolerable before OTA regression (previous model handled reduced range)
- New aircraft type causes confusion only at specific stands (B5-B8)
The causes are not additive -- they interact multiplicatively.
This motivates the causal approach: correlation-based methods identify associations ("interventions correlate with fog") but cannot distinguish direct causes from confounders or identify interaction effects.
1.5 Relationship to Existing Systems
This document covers fleet-level anomaly attribution -- the "why" layer that sits above existing monitoring and below incident management. It consumes signals from, but does not duplicate, several existing systems:
| Existing System | Doc Reference | What It Provides to Attribution |
|---|---|---|
| Sensor health monitoring | sensor-degradation-health-monitoring.md | Per-sensor health scores, degradation type classification |
| Runtime verification | runtime-verification-monitoring.md | STL violation events, OOD detection scores, safety envelope status |
| Predictive maintenance | ../fleet-management/fleet-predictive-maintenance.md | Component RUL predictions, failure probability estimates |
| Data flywheel | data-flywheel-airside.md | Trigger events, retraining status, model performance metrics |
| Cloud backend | cloud-backend-infrastructure.md | Telemetry pipeline, data lake, Airflow orchestration |
| Causal reasoning | causal-reasoning-counterfactual.md | SCM methodology for planning decisions (this doc extends to fleet operations) |
2. Anomaly Detection at Fleet Scale
2.1 Fleet KPI Hierarchy
Anomaly detection requires a well-defined metric hierarchy. Each level aggregates from the one below, enabling drill-down from "fleet is unhealthy" to "LiDAR #3 on Vehicle #23 at Airport B has a contaminated lens."
FLEET KPI HIERARCHY
Level 0: Fleet
Metrics: fleet_intervention_rate, fleet_availability, fleet_mission_success_rate
Granularity: 1-hour rolling windows
Normal: intervention_rate < 0.5/100km, availability > 95%, success > 98%
Level 1: Airport
Metrics: airport_{X}_intervention_rate, airport_{X}_availability
Granularity: 1-hour rolling windows
Normal: within 2 sigma of airport historical baseline
Level 2: Vehicle
Metrics: vehicle_{ID}_intervention_rate, mission_completion, avg_speed
Granularity: per-mission and 4-hour rolling
Normal: within 3 sigma of vehicle historical baseline
Level 3: Subsystem
Metrics: perception_mAP, localization_cov_trace, planning_jerk,
control_tracking_error, sensor_health_score, compute_utilization
Granularity: per-mission, 30-min rolling
Normal: perception_mAP > 0.65, loc_cov_trace < 0.1, tracking_error < 0.15m
Level 4: Component
Metrics: lidar_{N}_point_count, lidar_{N}_max_range, gtsam_iteration_count,
centerpoint_latency, frenet_candidate_count, can_bus_error_rate
Granularity: 1-second (streaming), aggregated to 1-min
Normal: component-specific thresholds from commissioning baseline
2.2 Statistical Process Control Methods
Statistical Process Control (SPC) methods detect when a metric shifts from its baseline distribution. Two methods are complementary for fleet operations:
CUSUM (Cumulative Sum Control Chart)
CUSUM is optimal for detecting small persistent shifts in the mean -- exactly the signature of gradual sensor degradation or slow model drift.
class CUSUMDetector:
"""
Tabular CUSUM for detecting mean shifts in fleet KPIs.
Parameters:
k: allowance (slack) -- typically 0.5 * shift_to_detect (in sigma units)
h: decision threshold -- controls false-positive rate
h=4 gives ARL_0 ~168 (false alarm every 168 samples on average)
h=5 gives ARL_0 ~465
For fleet metrics sampled hourly:
h=4: false alarm every ~7 days (acceptable for Tier 1 alerts)
h=5: false alarm every ~19 days (appropriate for Tier 0 escalations)
"""
def __init__(self, k=0.5, h=4.0, baseline_mean=None, baseline_std=None):
self.k = k
self.h = h
self.mu_0 = baseline_mean # estimated from 30+ day commissioning period
self.sigma = baseline_std
self.S_pos = 0.0 # upper CUSUM statistic
self.S_neg = 0.0 # lower CUSUM statistic
self.run_length = 0
def update(self, x):
"""Process new observation. Returns (alarm, direction, run_length)."""
z = (x - self.mu_0) / self.sigma # standardize
self.S_pos = max(0, self.S_pos + z - self.k)
self.S_neg = max(0, self.S_neg - z - self.k)
self.run_length += 1
if self.S_pos > self.h:
self.S_pos = 0.0
return True, 'increase', self.run_length
elif self.S_neg > self.h:
self.S_neg = 0.0
return True, 'decrease', self.run_length
return False, None, self.run_length
def reset_baseline(self, new_mean, new_std):
"""Update baseline after confirmed operational change (e.g., new OTA)."""
self.mu_0 = new_mean
self.sigma = new_std
self.S_pos = 0.0
self.S_neg = 0.0
        self.run_length = 0
EWMA (Exponentially Weighted Moving Average)
EWMA smooths high-frequency noise while preserving trend information. It is better suited than CUSUM to metrics with high variance (e.g., per-mission intervention counts, which are approximately Poisson-distributed).
import math

class EWMADetector:
"""
EWMA control chart for noisy fleet metrics.
lambda_ (smoothing): 0.05-0.25 typical
0.05: sensitive to small shifts, slower to detect large shifts
0.25: detects large shifts quickly, less sensitive to small shifts
For intervention rates (high variance): lambda_=0.10
For sensor health scores (low variance): lambda_=0.20
"""
def __init__(self, lambda_=0.10, L=3.0, baseline_mean=None, baseline_std=None):
self.lambda_ = lambda_
self.L = L # control limit multiplier (3.0 = ~0.27% false alarm)
self.mu_0 = baseline_mean
self.sigma = baseline_std
self.z = baseline_mean # EWMA statistic
self.n = 0
def update(self, x):
"""Process new observation."""
self.n += 1
self.z = self.lambda_ * x + (1 - self.lambda_) * self.z
# Time-varying control limits (accounts for EWMA startup)
sigma_z = self.sigma * math.sqrt(
(self.lambda_ / (2 - self.lambda_)) *
(1 - (1 - self.lambda_) ** (2 * self.n))
)
ucl = self.mu_0 + self.L * sigma_z
lcl = self.mu_0 - self.L * sigma_z
if self.z > ucl:
return True, 'increase', self.z - self.mu_0
elif self.z < lcl:
return True, 'decrease', self.z - self.mu_0
        return False, None, self.z - self.mu_0
Which method for which KPI (a wiring sketch follows the table):
| KPI | Recommended Method | Rationale |
|---|---|---|
| Intervention rate (fleet/airport) | EWMA (lambda=0.10) | High variance (Poisson), need smoothing |
| Perception mAP (per-vehicle) | CUSUM (k=0.5, h=4) | Low variance, detect small persistent drops |
| Localization covariance trace | CUSUM (k=0.3, h=5) | Safety-critical, want low false-alarm rate |
| Sensor health score | EWMA (lambda=0.20) | Moderate variance, want trend sensitivity |
| Mission success rate | EWMA (lambda=0.15) | Binomial variance, need smoothing |
| Compute utilization | CUSUM (k=1.0, h=4) | Detect step changes from code changes |
| Planning jerk/tracking error | EWMA (lambda=0.10) | High variance from scenario diversity |
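To make the table concrete, here is a minimal wiring sketch using the CUSUMDetector and EWMADetector classes above. The baseline means and standard deviations are hypothetical placeholders; in practice they come from the 30+ day commissioning period.

```python
# Wiring sketch: one detector per KPI, parameterized from the table above.
# Baseline statistics below are hypothetical placeholders.
KPI_DETECTORS = {
    'intervention_rate':      EWMADetector(lambda_=0.10, baseline_mean=0.40, baseline_std=0.15),
    'perception_mAP':         CUSUMDetector(k=0.5, h=4.0, baseline_mean=0.72, baseline_std=0.03),
    'localization_cov_trace': CUSUMDetector(k=0.3, h=5.0, baseline_mean=0.05, baseline_std=0.01),
    'sensor_health_score':    EWMADetector(lambda_=0.20, baseline_mean=0.95, baseline_std=0.02),
    'mission_success_rate':   EWMADetector(lambda_=0.15, baseline_mean=0.98, baseline_std=0.01),
    'compute_utilization':    CUSUMDetector(k=1.0, h=4.0, baseline_mean=0.60, baseline_std=0.08),
    'planning_jerk':          EWMADetector(lambda_=0.10, baseline_mean=0.80, baseline_std=0.30),
}

def process_sample(kpi_name, value):
    """Route one observation to its detector; return an alarm dict or None."""
    alarm, direction, _ = KPI_DETECTORS[kpi_name].update(value)
    return {'kpi': kpi_name, 'direction': direction} if alarm else None
```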
2.3 Multivariate Anomaly Detection
Univariate monitors catch single-metric deviations. But fleet anomalies often manifest as coordinated shifts across multiple metrics that individually remain within bounds.
Example: Perception mAP drops 5% (within normal variance) AND localization covariance increases 15% (within normal variance) AND planning jerk increases 10% (within normal variance). Each alone is not alarming. Together they indicate a systematic problem.
Isolation Forest for Fleet Telemetry
Isolation Forest (Liu et al., 2008) is effective for high-dimensional fleet telemetry because it handles mixed feature types, requires no distributional assumptions, and scales linearly.
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler
import numpy as np
class FleetAnomalyDetector:
"""
Multivariate anomaly detection on fleet telemetry snapshots.
Each observation is a fixed-length feature vector representing
one vehicle's state over a time window (e.g., 1 hour).
"""
# Feature vector per vehicle per hour
FEATURES = [
'intervention_count', # count in window
'perception_mAP_mean', # mean mAP over window
'perception_mAP_std', # mAP stability
'localization_cov_trace_mean', # mean position uncertainty
'localization_cov_trace_max', # worst-case uncertainty
'planning_jerk_p95', # 95th percentile jerk
'control_tracking_error_mean', # mean lateral/longitudinal error
'sensor_health_min', # worst sensor health in window
'compute_gpu_util_mean', # GPU utilization
'compute_latency_p99', # 99th percentile end-to-end latency
'speed_mean', # average operating speed
'distance_km', # distance driven in window
'ota_version_hash', # categorical -> ordinal encoding
'map_age_hours', # hours since last map update
'weather_visibility_m', # METAR visibility
'weather_precip_intensity', # precipitation rate
'time_of_day_sin', # cyclic encoding of hour
'time_of_day_cos',
'airport_id', # categorical -> one-hot
]
def __init__(self, contamination=0.05, n_estimators=200):
self.model = IsolationForest(
contamination=contamination, # expect ~5% anomalous observations
n_estimators=n_estimators,
max_samples='auto',
random_state=42,
n_jobs=-1,
)
self.scaler = None # fit during training
def fit(self, historical_data: np.ndarray):
"""Fit on 30+ days of normal fleet operation data."""
self.scaler = RobustScaler().fit(historical_data)
scaled = self.scaler.transform(historical_data)
self.model.fit(scaled)
def score(self, current_snapshot: np.ndarray):
"""
Score current fleet state.
Returns anomaly_scores (lower = more anomalous, <0 = anomaly).
"""
scaled = self.scaler.transform(current_snapshot)
        return self.model.decision_function(scaled)
Autoencoder for Temporal Patterns
When the anomaly is not a point-in-time shift but a temporal pattern (e.g., oscillating localization error that correlates with aircraft departure schedule), a sequence autoencoder captures what Isolation Forest misses.
import torch
import torch.nn as nn
class FleetTemporalAutoencoder(nn.Module):
"""
LSTM autoencoder for detecting anomalous temporal patterns
in per-vehicle telemetry sequences.
Input: (batch, seq_len=24, n_features=19) -- 24 hours x 19 features
Output: reconstruction of input
Anomaly score: reconstruction error (MSE per feature, weighted)
"""
def __init__(self, n_features=19, hidden_dim=64, latent_dim=16):
super().__init__()
# Encoder
self.encoder_lstm = nn.LSTM(
input_size=n_features,
hidden_size=hidden_dim,
num_layers=2,
batch_first=True,
dropout=0.1,
)
self.encoder_fc = nn.Linear(hidden_dim, latent_dim)
# Decoder
self.decoder_fc = nn.Linear(latent_dim, hidden_dim)
self.decoder_lstm = nn.LSTM(
input_size=hidden_dim,
hidden_size=hidden_dim,
num_layers=2,
batch_first=True,
dropout=0.1,
)
self.output_fc = nn.Linear(hidden_dim, n_features)
def encode(self, x):
_, (h_n, _) = self.encoder_lstm(x)
return self.encoder_fc(h_n[-1])
def decode(self, z, seq_len):
h = self.decoder_fc(z).unsqueeze(1).repeat(1, seq_len, 1)
out, _ = self.decoder_lstm(h)
return self.output_fc(out)
def forward(self, x):
z = self.encode(x)
return self.decode(z, x.size(1))
def anomaly_score(self, x, feature_weights=None):
"""
Per-feature weighted reconstruction error.
feature_weights: higher weight for safety-critical features
e.g., perception_mAP: 2.0, localization_cov: 2.0,
intervention_count: 3.0, sensor_health: 1.5
"""
x_hat = self.forward(x)
mse_per_feature = ((x - x_hat) ** 2).mean(dim=1) # (batch, n_features)
if feature_weights is not None:
mse_per_feature = mse_per_feature * feature_weights
        return mse_per_feature.sum(dim=-1)  # scalar anomaly score per sample
2.4 Hierarchical Anomaly Detection
The key to avoiding alert fatigue is hierarchical detection that only escalates when lower levels cannot self-resolve.
HIERARCHICAL ANOMALY DETECTION FLOW
┌────────────────────────────────────────────────┐
│ LEVEL 0: Fleet-wide │
│ EWMA on fleet_intervention_rate │
│ Isolation Forest on fleet telemetry snapshot │
│ Check: Is the fleet as a whole anomalous? │
└───────────────┬────────────────────────────────┘
│ ALARM
▼
┌────────────────────────────────────────────────┐
│ LEVEL 1: Airport decomposition │
│ Per-airport CUSUM on intervention_rate │
│ Question: Which airport(s) drive the anomaly? │
│ Output: Airport B is anomalous, A and C normal │
└───────────────┬────────────────────────────────┘
│ AIRPORT B FLAGGED
▼
┌────────────────────────────────────────────────┐
│ LEVEL 2: Vehicle decomposition │
│ Per-vehicle CUSUM within Airport B │
│ Question: All vehicles or subset? │
│ Output: 12/15 vehicles affected │
│ Note: 12/15 -> likely environmental, not vehicle│
│ 3/15 -> likely vehicle-specific │
└───────────────┬────────────────────────────────┘
│ 12/15 VEHICLES -> ENVIRONMENTAL
▼
┌────────────────────────────────────────────────┐
│ LEVEL 3: Subsystem decomposition │
│ Per-subsystem metrics for affected vehicles │
│ Question: Perception, localization, or planning?│
│ Output: Perception mAP degraded on 10/12, │
│ Localization normal on all 12 │
└───────────────┬────────────────────────────────┘
│ PERCEPTION SUBSYSTEM
▼
┌────────────────────────────────────────────────┐
│ LEVEL 4: Component decomposition │
│ Per-sensor health, per-class mAP, per-range mAP│
│ Question: Which sensor or class or range band? │
│ Output: Personnel class mAP dropped 25%, │
│ other classes stable │
└───────────────┬────────────────────────────────┘
│ PERSONNEL DETECTION REGRESSION
▼
┌────────────────────────────────────────────────┐
│ ATTRIBUTION: Cross-reference with context │
│ OTA v3.2.1 deployed 2 days ago: YES │
│ Weather change: fog onset this week │
│ Conclusion: OTA regression on personnel class │
│ exacerbated by seasonal fog │
└────────────────────────────────────────────────┘
Fraction-of-vehicles heuristic for rapid triage (a code sketch follows the table):
| Fraction Affected | Likely Root Cause Category | Rationale |
|---|---|---|
| > 80% at one airport | Environment, map, airport-wide change | External factor affecting all vehicles equally |
| 30-80% at one airport | OTA regression, partial environment | Subset of scenarios or routes affected |
| < 30% at one airport | Vehicle-specific (sensor, calibration) | Individual vehicle hardware or config |
| > 50% fleet-wide | OTA regression, fleet-wide software bug | Only software changes propagate globally |
| Correlated with time | Weather, time-of-day, shift change | Temporal pattern indicates external driver |
| Correlated with zone | Map, construction, aircraft type at stands | Spatial pattern indicates localized cause |
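A sketch of the heuristic as code. The thresholds mirror the table; the boolean correlation flags are assumed to be computed upstream, and the category labels are illustrative.

```python
def triage_heuristic(frac_airport, frac_fleet,
                     time_correlated=False, zone_correlated=False):
    """Map anomaly spread patterns to likely root-cause categories
    per the table above. Returns hints, not a verdict."""
    hints = []
    if frac_fleet > 0.5:
        hints.append('ota_regression_or_fleet_wide_software_bug')
    if frac_airport > 0.8:
        hints.append('environment_map_or_airport_wide_change')
    elif frac_airport >= 0.3:
        hints.append('ota_regression_or_partial_environment')
    elif frac_airport > 0.0:
        hints.append('vehicle_specific_sensor_or_calibration')
    if time_correlated:
        hints.append('weather_time_of_day_or_shift_change')
    if zone_correlated:
        hints.append('map_construction_or_aircraft_type_at_stands')
    return hints

# Example: 85% of one airport's vehicles affected, no fleet-wide signal:
# triage_heuristic(0.85, 0.25) -> ['environment_map_or_airport_wide_change']
```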
2.5 Alert Fatigue Management
Alert fatigue is the primary failure mode of monitoring systems. The solution is aggressive filtering:
Alert tiering:
| Tier | Criteria | Notification | Response Time |
|---|---|---|---|
| P0: Safety | STL violation, collision/near-miss, geofence breach | Pager + SMS + dashboard | < 5 min |
| P1: Operational | Fleet-level KPI anomaly (CUSUM alarm on L0) | Slack + dashboard | < 1 hour |
| P2: Degradation | Airport-level or multi-vehicle subsystem anomaly | Dashboard + daily digest | < 4 hours |
| P3: Informational | Single-vehicle anomaly, single-sensor degradation | Daily digest | < 24 hours |
| P4: Analytics | Trend detected, no immediate impact | Weekly report | Next planning cycle |
Suppression rules (a minimal implementation sketch follows the list):
- Deduplication: Same root cause on same vehicle within 4 hours -> suppress subsequent alerts
- Known-issue suppression: Acknowledged anomalies with open tickets -> suppress until resolution
- Maintenance window: Alerts during scheduled maintenance -> suppress automatically
- Correlated suppression: If airport-level alert fires, suppress all vehicle-level alerts at that airport for 30 min (give the attribution system time to diagnose)
- Self-resolving: If CUSUM resets within 2 hours (anomaly self-resolves), downgrade to P4
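The suppression rules can be expressed as a small stateful filter. A minimal sketch, assuming a flat alert dict (vehicle_id, cause, level, airport_id, in_maintenance_window) and in-memory state; the self-resolving downgrade is assumed to happen at the detector layer.

```python
import time

class AlertSuppressor:
    """Stateful filter implementing the suppression rules above.
    The alert schema and in-memory state are illustrative assumptions."""
    DEDUP_WINDOW_S = 4 * 3600     # same cause + vehicle within 4 h
    AIRPORT_MUTE_S = 30 * 60      # correlated-suppression window

    def __init__(self):
        self.last_seen = {}           # (vehicle_id, cause) -> timestamp
        self.airport_mute_until = {}  # airport_id -> mute deadline
        self.open_tickets = set()     # acknowledged (vehicle_id, cause)

    def should_emit(self, alert, now=None):
        now = time.time() if now is None else now
        key = (alert['vehicle_id'], alert['cause'])
        if key in self.open_tickets:             # known-issue suppression
            return False
        if alert.get('in_maintenance_window'):   # maintenance suppression
            return False
        if (alert['level'] == 'vehicle' and
                now < self.airport_mute_until.get(alert['airport_id'], 0.0)):
            return False                         # correlated suppression
        if now - self.last_seen.get(key, float('-inf')) < self.DEDUP_WINDOW_S:
            return False                         # deduplication
        self.last_seen[key] = now
        if alert['level'] == 'airport':          # open a mute window
            self.airport_mute_until[alert['airport_id']] = now + self.AIRPORT_MUTE_S
        return True
```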
Target alert volume:
| Fleet Size | Raw Anomalies/Day | After Hierarchical Filtering | After Suppression | Analyst Load |
|---|---|---|---|---|
| 20 vehicles | 5-10 | 2-5 | 1-3 | Manageable (1 analyst) |
| 50 vehicles | 15-30 | 5-10 | 2-5 | Manageable (1-2 analysts) |
| 100 vehicles | 40-80 | 10-20 | 4-8 | Requires 2-3 analysts |
| 200 vehicles | 80-160 | 15-30 | 5-12 | Requires 3-4 analysts |
Without hierarchical filtering and suppression, a 200-vehicle fleet would require 40-80 analysts to manually triage every anomaly. With the system, 3-4 analysts handle the residual that automated attribution cannot resolve.
3. Causal Discovery from Operational Data
3.1 Why Causal, Not Just Correlational
Correlation identifies that fog and high intervention rates co-occur. Causation identifies that fog causes LiDAR range reduction, which causes perception mAP drop, which causes missed personnel detections, which causes operator interventions. The difference matters because:
- Actionable remediation: Correlation says "fog = bad." Causation says "deploy radar-primary mode during fog" or "retrain personnel detector with fog augmentation."
- Multi-cause disentanglement: When fog AND OTA change co-occur, correlation cannot separate their contributions. Causal analysis can.
- Regulatory evidence: EU PLD 2024/2853 requires demonstrating causal contribution to incidents, not mere correlation.
3.2 The Fleet Operations Causal DAG
The causal DAG for fleet operations is distinct from the planning-level DAG in causal-reasoning-counterfactual.md. It operates at a higher abstraction level with different timescales.
FLEET OPERATIONS CAUSAL DAG
Exogenous (uncontrollable):
WEATHER ──────────────────────┐
TIME_OF_DAY ──────────────────┤
AIRCRAFT_MIX ─────────────────┤
CONSTRUCTION ─────────────────┤
OPERATOR_EXPERIENCE ──────────┤
│
Controllable decisions: │
OTA_VERSION ──────────────┐ │
MAP_VERSION ──────────────┤ │
PARAMETER_SET ────────────┤ │
CALIBRATION ──────────────┤ │
│ │
▼ ▼
┌───────────────────┐
WEATHER ─────────→│ SENSOR_HEALTH │
TIME_OF_DAY ─────→│ (degradation rate │
CONSTRUCTION ────→│ per vehicle) │
└────────┬──────────┘
│
▼
OTA_VERSION ──────→ PERCEPTION_QUALITY ←── AIRCRAFT_MIX
SENSOR_HEALTH ───→ │ (new types = OOD)
│
▼
MAP_VERSION ──────→ LOCALIZATION_QUALITY
CONSTRUCTION ────→ │
WEATHER ─────────→ │ (wet surface = GPS multipath)
│
▼
PERCEPTION ──────→ PLANNING_QUALITY ←── PARAMETER_SET
LOCALIZATION ───→ │
│
▼
PLANNING ────────→ CONTROL_QUALITY ←── CALIBRATION
VEHICLE_HEALTH ─→ │
│
▼
CONTROL ─────────→ INTERVENTION_RATE ←── OPERATOR_EXPERIENCE
PERCEPTION ─────→ │ (threshold varies)
ODD_COMPLIANCE ─→ │
│
▼
MISSION_SUCCESS
Variable definitions for the DAG:
| Variable | Type | Source | Granularity |
|---|---|---|---|
| WEATHER | Continuous (visibility, precip, wind, temp) | METAR feed, on-vehicle sensors | 30 min (METAR), 1 Hz (vehicle) |
| TIME_OF_DAY | Cyclic (sin/cos encoding of hour) | System clock | 1 hour |
| AIRCRAFT_MIX | Categorical (type distribution at stands) | Airport FIDS/A-CDM | Per-flight |
| CONSTRUCTION | Binary/categorical per zone | Airport NOTAMs, fleet detection | Daily |
| OPERATOR_EXPERIENCE | Ordinal (hours on system) | HR/training database | Per-shift |
| OTA_VERSION | Categorical (model + code version hash) | Fleet management system | Per-deployment |
| MAP_VERSION | Ordinal (age in hours + version ID) | Map management system | Per-update |
| SENSOR_HEALTH | Continuous [0, 1] per sensor | Health monitor (1 Hz) | Aggregated hourly |
| PERCEPTION_QUALITY | Continuous (mAP, per-class, per-range) | Shadow evaluation or production metrics | Per-mission |
| LOCALIZATION_QUALITY | Continuous (cov trace, RMSE to ground truth) | GTSAM output | 10 Hz, aggregated per-mission |
| PLANNING_QUALITY | Continuous (jerk, tracking error, clearance) | Planning node output | Per-mission |
| INTERVENTION_RATE | Count per 100 km | Operator input log | Per-intervention |
| MISSION_SUCCESS | Binary | Fleet management system | Per-mission |
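To make step 1 of the discovery pipeline in Section 3.6 concrete, the expert-prior DAG above can be transcribed as a plain edge list. The representation below is an assumption for illustration; any graph-library format works.

```python
# Expert-prior edges transcribed from the DAG above, as (cause, effect).
FLEET_EXPERT_DAG = [
    ('WEATHER', 'SENSOR_HEALTH'), ('TIME_OF_DAY', 'SENSOR_HEALTH'),
    ('CONSTRUCTION', 'SENSOR_HEALTH'),
    ('OTA_VERSION', 'PERCEPTION_QUALITY'), ('SENSOR_HEALTH', 'PERCEPTION_QUALITY'),
    ('AIRCRAFT_MIX', 'PERCEPTION_QUALITY'),
    ('MAP_VERSION', 'LOCALIZATION_QUALITY'), ('CONSTRUCTION', 'LOCALIZATION_QUALITY'),
    ('WEATHER', 'LOCALIZATION_QUALITY'),
    ('PERCEPTION_QUALITY', 'PLANNING_QUALITY'), ('LOCALIZATION_QUALITY', 'PLANNING_QUALITY'),
    ('PARAMETER_SET', 'PLANNING_QUALITY'),
    ('PLANNING_QUALITY', 'CONTROL_QUALITY'), ('CALIBRATION', 'CONTROL_QUALITY'),
    ('VEHICLE_HEALTH', 'CONTROL_QUALITY'),
    ('CONTROL_QUALITY', 'INTERVENTION_RATE'), ('PERCEPTION_QUALITY', 'INTERVENTION_RATE'),
    ('OPERATOR_EXPERIENCE', 'INTERVENTION_RATE'), ('ODD_COMPLIANCE', 'INTERVENTION_RATE'),
    ('INTERVENTION_RATE', 'MISSION_SUCCESS'),
]
```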
3.3 NOTEARS for Structure Learning
NOTEARS (Zheng et al., 2018) formulates causal structure learning as a continuous optimization problem, converting the NP-hard combinatorial DAG search into a smooth optimization with an acyclicity constraint.
import numpy as np
from scipy.optimize import minimize
def notears_linear(X, lambda1=0.1, max_iter=100, h_tol=1e-8):
"""
NOTEARS: continuous optimization for DAG structure learning.
X: (n_samples, d_variables) -- fleet telemetry data
lambda1: L1 penalty (sparsity)
Returns: W (d x d) weighted adjacency matrix
W[i,j] != 0 means variable i directly causes variable j
For fleet operations with ~15-20 variables:
d=20 -> 400 parameters -> tractable (seconds on CPU)
Needs n >> d samples: 30 days x 24 hours x 50 vehicles = 36,000 samples
More than sufficient for d=20
"""
n, d = X.shape
def _loss(W):
"""Least-squares loss for linear SEM: X = X @ W + noise."""
M = X @ W
R = X - M
return 0.5 / n * (R ** 2).sum()
    def _h(W):
        """Acyclicity constraint: tr[(I + W∘W/d)^d] - d = 0 iff W is a DAG
        (polynomial surrogate for NOTEARS' tr(e^{W∘W}) - d)."""
        E = np.linalg.matrix_power(np.eye(d) + W * W / d, d)
        return np.trace(E) - d
def _objective(w_flat, rho, alpha):
W = w_flat.reshape(d, d)
loss = _loss(W)
h = _h(W)
penalty = lambda1 * np.abs(W).sum()
return loss + 0.5 * rho * h * h + alpha * h + penalty
# Augmented Lagrangian optimization
W = np.zeros((d, d))
rho, alpha = 1.0, 0.0
for i in range(max_iter):
        result = minimize(
            _objective, W.flatten(), args=(rho, alpha),
            method='L-BFGS-B',
            # jac omitted: _objective returns only the scalar loss, so
            # L-BFGS-B falls back to numerical gradients (slow; supply an
            # autograd gradient in production)
        )
W = result.x.reshape(d, d)
h = _h(W)
if abs(h) < h_tol:
break
alpha += rho * h
rho *= 10
# Threshold small edges
W[np.abs(W) < 0.1] = 0.0
    return W
Practical considerations for fleet data:
- Non-stationarity: Fleet data is non-stationary (seasonal trends, OTA changes, fleet growth). Fit NOTEARS on rolling 90-day windows and compare the learned DAGs across windows for structural stability (see the sketch after this list).
- Mixed data types: Fleet variables mix continuous (mAP), ordinal (map age), and categorical (OTA version). Use NOTEARS-NONLINEAR (Zheng et al., 2020) with MLP basis functions, or discretize continuous variables.
- Latent confounders: Some causes are unobserved (e.g., runway resurfacing affects tire grip and sensor vibration simultaneously, but "runway condition" is not directly measured). FCI algorithm (Spirtes et al., 2000) handles latent confounders by identifying ancestral relationships rather than direct causes.
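A minimal sketch of the rolling-window stability check from the first bullet. The 0.1 edge threshold matches notears_linear above; Jaccard similarity of edge sets is one reasonable metric, not the only choice.

```python
import numpy as np

def dag_edge_set(W, threshold=0.1):
    """Binarize a NOTEARS weight matrix into a set of directed edges."""
    d = W.shape[0]
    return {(i, j) for i in range(d) for j in range(d) if abs(W[i, j]) >= threshold}

def structural_stability(W_prev, W_curr, threshold=0.1):
    """Jaccard similarity of edge sets from two windows. Values near 1.0
    indicate stable structure; a sudden drop suggests non-stationarity
    (OTA change, season, fleet growth) and warrants a re-fit or review."""
    e1 = dag_edge_set(W_prev, threshold)
    e2 = dag_edge_set(W_curr, threshold)
    if not e1 and not e2:
        return 1.0
    return len(e1 & e2) / len(e1 | e2)
```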
3.4 PC Algorithm for Constraint-Based Discovery
The PC algorithm (Spirtes et al., 2000) discovers causal structure via conditional independence tests. It complements NOTEARS by making weaker parametric assumptions.
from causallearn.search.ConstraintBased.PC import pc
from causallearn.utils.cit import fisherz
def discover_fleet_dag(telemetry_df, alpha=0.05):
"""
PC algorithm for fleet causal structure discovery.
Uses Fisher-z test for conditional independence (assumes Gaussian;
for non-Gaussian, use KCI or RCIT kernel-based tests).
alpha: significance level for conditional independence tests
0.01 -> conservative (fewer edges, risk missing real causes)
0.05 -> moderate (standard choice)
0.10 -> aggressive (more edges, risk spurious causes)
"""
data = telemetry_df[FLEET_VARIABLES].values
cg = pc(
data,
alpha=alpha,
indep_test=fisherz,
stable=True, # order-independent (recommended for reproducibility)
uc_rule=0, # conservative orientation
uc_priority=-1, # background knowledge can override
)
    # Apply background knowledge (known causal directions):
    #   weather cannot be caused by vehicle state;
    #   OTA version cannot be caused by perception quality.
    # These constraints reduce false discoveries. Note: this dict is
    # illustrative and not consumed by pc() as written; in practice,
    # encode the constraints with causallearn's BackgroundKnowledge
    # class and pass them to the search.
    background_knowledge = {
        ('WEATHER', '*'): 'forbidden',      # nothing causes weather
        ('TIME_OF_DAY', '*'): 'forbidden',  # nothing causes time
        ('*', 'OTA_VERSION'): 'forbidden',  # OTA is a decision, not caused by operations
    }
    return cg.G, cg.sepset  # graph + separation sets for causal reasoning
3.5 Granger Causality for Temporal Relationships
Some causal relationships are purely temporal: an OTA deployment today causes intervention rate increase tomorrow. Granger causality captures these time-lagged relationships.
from statsmodels.tsa.stattools import grangercausalitytests
def fleet_granger_analysis(telemetry_df, max_lag=48, alpha=0.05):
"""
Pairwise Granger causality tests across fleet variables.
max_lag=48: test up to 48 hours of lag (covers OTA effects that
may take 1-2 days to manifest as interventions).
Returns: dictionary of significant Granger-causal relationships
with optimal lag and p-value.
"""
variables = [
'intervention_rate', 'perception_mAP', 'localization_cov',
'sensor_health_min', 'weather_visibility', 'ota_version_age',
'map_age_hours', 'compute_latency_p99',
]
results = {}
for cause_var in variables:
for effect_var in variables:
if cause_var == effect_var:
continue
data = telemetry_df[[effect_var, cause_var]].dropna()
try:
gc_result = grangercausalitytests(
data.values, maxlag=max_lag, verbose=False
)
# Find optimal lag (lowest p-value)
best_lag = min(
gc_result.keys(),
key=lambda lag: gc_result[lag][0]['ssr_ftest'][1]
)
p_value = gc_result[best_lag][0]['ssr_ftest'][1]
if p_value < alpha:
results[(cause_var, effect_var)] = {
'lag_hours': best_lag,
'p_value': p_value,
'f_stat': gc_result[best_lag][0]['ssr_ftest'][0],
}
except Exception:
continue # insufficient data or singular matrix
    return results
Expected Granger-causal relationships for an airside fleet (a usage sketch follows the table):
| Cause | Effect | Expected Lag | Mechanism |
|---|---|---|---|
| ota_version_age | intervention_rate | 6-48 hours | OTA deployment -> operational exposure -> incidents |
| weather_visibility | perception_mAP | 0-1 hours | Near-immediate (fog onset -> range reduction) |
| weather_visibility | sensor_health_min | 2-24 hours | Delayed (moisture -> lens contamination) |
| map_age_hours | localization_cov | 0-4 hours | Progressive (map drift -> localization uncertainty) |
| perception_mAP | intervention_rate | 0-2 hours | Near-immediate (missed detection -> intervention) |
| sensor_health_min | perception_mAP | 0-1 hours | Near-immediate (degraded sensor -> degraded perception) |
| compute_latency_p99 | planning_quality | 0 hours | Immediate (latency -> stale plan) |
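A usage sketch for the function above, assuming a hypothetical hourly pandas DataFrame telemetry_df with the listed columns. It checks a subset of discovered relationships against the expected lags in the table.

```python
# Hypothetical `telemetry_df`; EXPECTED_LAGS covers a subset of the table.
EXPECTED_LAGS = {
    ('ota_version_age', 'intervention_rate'): (6, 48),
    ('weather_visibility', 'perception_mAP'): (0, 1),
    ('map_age_hours', 'localization_cov'): (0, 4),
}

found = fleet_granger_analysis(telemetry_df, max_lag=48)
for (cause, effect), res in found.items():
    expected = EXPECTED_LAGS.get((cause, effect))
    if expected is None:
        status = 'unexpected pair -- review'
    else:
        lo, hi = expected
        status = 'as expected' if lo <= res['lag_hours'] <= hi else 'unexpected lag'
    print(f"{cause} -> {effect}: lag={res['lag_hours']}h, "
          f"p={res['p_value']:.4f} ({status})")
```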
3.6 Combining Discovery Methods
No single method is sufficient. The practical approach combines all three:
CAUSAL DISCOVERY PIPELINE
1. Expert prior DAG
- Encode known physics (weather -> sensor degradation)
- Encode known architecture (perception -> planning -> control)
- Encode known impossibilities (vehicle state cannot cause weather)
Result: skeleton with ~60-70% of edges pre-specified
2. NOTEARS on 90-day fleet data
- Discover data-driven edges not in expert prior
- Validate expert edges (are they supported by data?)
- Weight edges by coefficient magnitude
Result: data-augmented DAG
3. PC algorithm as validation
- Independent method with different assumptions
- Edges present in BOTH NOTEARS and PC have high confidence
- Edges in only one method flagged for human review
Result: confidence-annotated DAG
4. Granger causality for temporal structure
- Add lag information to each edge
- Discover temporal-only relationships missed by instantaneous methods
Result: temporally-annotated causal DAG
5. Human expert review
- Review edges present in data but not in prior (potential discovery)
- Review edges in prior but not in data (potential data limitation)
- Final DAG approved for production use
Result: validated fleet operations causal DAG
Update cadence: Re-run quarterly or after major fleet changes (new airport,
new vehicle type, major OTA)
4. Shapley-Value Attribution
4.1 The Attribution Question
Given that the fleet is anomalous, how much of the anomaly is attributable to each factor? This is precisely the Shapley value question from cooperative game theory.
The game: Players are the potential root causes (weather, OTA version, map age, sensor health, etc.). The "payoff" is the anomaly score. The Shapley value of each player is its fair marginal contribution to the anomaly, averaged over all possible coalitions.
Formal definition:
phi_i = (1/|N|!) * sum over all permutations pi of N:
[v(S_pi^i union {i}) - v(S_pi^i)]
where:
N = set of all potential causes
S_pi^i = set of causes that appear before i in permutation pi
v(S) = anomaly score when only causes in S are at anomalous levels
and all others are at baseline levels
4.2 SHAP for Anomaly Model Attribution
SHAP (SHapley Additive exPlanations) computes Shapley values efficiently for ML model predictions. Applied to the Isolation Forest anomaly model, it answers "which features contribute most to this observation being anomalous?"
import math
import numpy as np
import shap
class AnomalyAttributor:
"""
SHAP-based attribution for fleet anomaly scores.
Given an anomalous fleet state, decomposes the anomaly score
into per-feature contributions using Shapley values.
"""
def __init__(self, anomaly_model, background_data):
"""
anomaly_model: trained IsolationForest or autoencoder
background_data: representative normal fleet data (100-1000 samples)
"""
self.model = anomaly_model
# TreeExplainer for IsolationForest (exact Shapley, fast)
# KernelSHAP for autoencoder (approximate, slower)
if hasattr(anomaly_model, 'estimators_'):
self.explainer = shap.TreeExplainer(
anomaly_model,
data=background_data,
feature_perturbation='interventional', # causal, not observational
)
else:
            self.explainer = shap.KernelExplainer(
                anomaly_model.predict,
                background_data,
            )  # pass nsamples=500 (approximation budget) to shap_values()
def attribute(self, anomalous_observation):
"""
Returns per-feature Shapley values for the anomaly.
Positive value: feature contributes to anomaly (anomalous direction)
Negative value: feature is normal (counteracts anomaly)
Sum of all values = anomaly_score - expected_anomaly_score
"""
shap_values = self.explainer.shap_values(anomalous_observation)
# Rank features by absolute contribution
feature_names = FleetAnomalyDetector.FEATURES
contributions = sorted(
zip(feature_names, shap_values[0]),
key=lambda x: abs(x[1]),
reverse=True,
)
return contributions
def attribute_fleet(self, fleet_snapshot):
"""
Attribute fleet-level anomaly to airports and vehicles.
Uses distributed Shapley: treats each vehicle as a "player"
and computes its marginal contribution to the fleet anomaly.
"""
n_vehicles = len(fleet_snapshot)
vehicle_shapley = np.zeros(n_vehicles)
# For tractable fleet sizes (< 200), use permutation sampling
n_permutations = min(1000, math.factorial(n_vehicles))
for _ in range(n_permutations):
perm = np.random.permutation(n_vehicles)
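            # _fleet_anomaly_score is a stack-specific helper (not shown):
            # it scores an arbitrary coalition of vehicle rows as a fleet.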
prev_score = self._fleet_anomaly_score(fleet_snapshot[:0]) # empty
for i, vehicle_idx in enumerate(perm):
coalition = fleet_snapshot[perm[:i+1]]
curr_score = self._fleet_anomaly_score(coalition)
vehicle_shapley[vehicle_idx] += (curr_score - prev_score)
prev_score = curr_score
vehicle_shapley /= n_permutations
        return vehicle_shapley
4.3 Hierarchical Shapley Decomposition
For fleet attribution, a flat Shapley over all features x vehicles is computationally intractable. Hierarchical decomposition mirrors the KPI hierarchy:
HIERARCHICAL SHAPLEY DECOMPOSITION
Step 1: Fleet -> Airport attribution
Players: {Airport_A, Airport_B, Airport_C}
Game: v(S) = fleet anomaly score with only airports in S contributing
Result: "Airport B contributes 72% of fleet anomaly"
Cost: 2^3 = 8 evaluations (exact Shapley feasible)
Step 2: Airport B -> Vehicle attribution
Players: {Vehicle_1, ..., Vehicle_15} at Airport B
Game: v(S) = Airport B anomaly score with only vehicles in S
Result: "Vehicles 3, 7, 11 contribute 60% of Airport B anomaly"
Cost: ~1000 permutation samples (approximate Shapley)
Step 3: Per-vehicle -> Feature attribution
Players: {perception_mAP, localization_cov, sensor_health, ...}
Game: v(S) = vehicle anomaly score with only features in S at anomalous level
Result: "perception_mAP accounts for 45% of Vehicle 7's anomaly"
Cost: TreeSHAP = exact in O(TLD) per observation (milliseconds)
Total: Fleet anomaly decomposed as:
"72% Airport B, of which 60% from 3 vehicles,
of which perception mAP drop is the primary driver,
specifically personnel class detection in fog conditions."
4.4 Computational Cost and Approximations
| Method | Exact Cost | Approximation | Accuracy | Wall-Clock Time |
|---|---|---|---|---|
| TreeSHAP (per observation) | O(TLD) | Exact | 100% | < 10 ms |
| KernelSHAP (per observation) | O(2^M) | M=500 samples | ~95% | 1-5 seconds |
| Permutation SHAP (fleet) | O(N! * eval_cost) | 1000 permutations | ~90% | 5-30 seconds |
| Hierarchical SHAP (full) | O(levels * per_level) | Per above | ~85-90% | 10-60 seconds |
For real-time fleet dashboards, pre-compute SHAP values on the most recent anomalous snapshot and cache. Update every 15-60 minutes.
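Step 1 of the hierarchy involves only 2^3 = 8 coalition evaluations, so exact Shapley is feasible. A minimal sketch, assuming a caller-supplied value function v that maps a frozenset of airports to the fleet anomaly score when only those airports contribute (the names here are hypothetical):

```python
import math
from itertools import combinations

def exact_shapley(players, v):
    """Exact Shapley values for a small player set (e.g., airports).
    players: list of IDs; v: callable, frozenset -> anomaly score.
    Cost: O(2^n) evaluations of v -- feasible for n <= ~15 players."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for S in combinations(others, k):
                S = frozenset(S)
                w = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                phi[p] += w * (v(S | {p}) - v(S))
    return phi

# Hypothetical usage for Step 1 (airport attribution); normalizing phi by
# v(all) - v(empty) yields shares like "Airport B contributes 72%":
# phi = exact_shapley(['Airport_A', 'Airport_B', 'Airport_C'], fleet_anomaly_v)
```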
4.5 Shapley for Cross-Airport Comparison
A powerful application: when multiple airports operate, Shapley values answer "does Airport B's anomaly look like a known pattern?"
def compute_airport_shapley_profile(airport_data, reference_profiles, attributor):
"""
Compare current airport anomaly against known root-cause profiles.
reference_profiles: dictionary of known anomaly signatures
e.g., {
'ota_regression': [high_perception, normal_localization, normal_sensor, ...],
'fog_event': [moderate_perception, normal_localization, degraded_sensor, ...],
'map_stale': [normal_perception, high_localization, normal_sensor, ...],
}
Returns: similarity to each known profile (cosine similarity of SHAP vectors)
"""
current_shap = attributor.attribute(airport_data)
current_vec = np.array([v for _, v in current_shap])
similarities = {}
for cause_name, ref_vec in reference_profiles.items():
cos_sim = np.dot(current_vec, ref_vec) / (
np.linalg.norm(current_vec) * np.linalg.norm(ref_vec) + 1e-8
)
similarities[cause_name] = cos_sim
    return similarities  # highest similarity = most likely root cause category
5. Bayesian Diagnosis Trees
5.1 Sequential Diagnosis as Bayesian Inference
When the anomaly detection and Shapley attribution narrow the problem to a small set of hypotheses, Bayesian diagnosis systematically eliminates alternatives by querying the cheapest diagnostic tests first.
The diagnosis as a decision tree:
Fleet anomaly detected
│
┌──────────┴──────────┐
│ Check: How many │
│ vehicles affected? │
└──────────┬──────────┘
┌─────────┴─────────┐
│ │
> 50% < 50%
(fleet-wide) (subset)
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ Check: OTA │ │ Check: Same │
│ deployed in │ │ route/zone? │
│ last 72h? │ └──────┬──────┘
└──────┬──────┘ ┌─────┴─────┐
┌─────┴─────┐ │ │
YES NO YES NO
│ │ │ │
┌──────┴──┐ ┌────┴────┐ │ ┌──────┴──────┐
│Check: │ │Check: │ │ │Check: Same │
│Per-class│ │METAR │ │ │vehicle(s)? │
│mAP drop?│ │change? │ │ └──────┬──────┘
└────┬────┘ └────┬────┘ │ ┌─────┴─────┐
┌──┴──┐ ┌──┴──┐ │ YES NO
YES NO YES NO │ │ (random)
│ │ │ │ │ │
OTA Software Weather Map Sensor Investigate
regression bug change stale degradation further
5.2 Prior Probabilities from Fleet History
Prior probabilities are calibrated from historical fleet data. These update as the fleet matures and accumulates incidents.
import math

class BayesianDiagnosisEngine:
"""
Bayesian root-cause diagnosis with sequential evidence gathering.
Maintains posterior probabilities over 8 root cause hypotheses,
updates with evidence from diagnostic tests.
"""
# Prior probabilities (calibrated from industry data + fleet history)
# These should be updated quarterly from actual fleet incident data
ROOT_CAUSES = {
'sensor_degradation': 0.25, # most common in airside operations
'weather_environment': 0.20, # second most common
'ota_model_regression': 0.15, # happens with every OTA
'map_staleness': 0.12, # increases with construction frequency
'software_bug': 0.10, # decreases as codebase matures
'operator_behavior': 0.08, # new operators, policy changes
'new_aircraft_type': 0.05, # seasonal schedule changes
'construction_layout': 0.05, # airport development
}
# Likelihood functions: P(evidence | root_cause)
# Each test produces a binary or categorical result
# Likelihoods estimated from historical diagnosis data
LIKELIHOODS = {
'fraction_vehicles_affected': {
# P(>50% vehicles affected | cause)
'high': {
'sensor_degradation': 0.15, # usually individual vehicles
'weather_environment': 0.85, # affects all vehicles
'ota_model_regression': 0.75, # affects all vehicles with new OTA
'map_staleness': 0.60, # affects vehicles on stale-map routes
'software_bug': 0.70, # affects all vehicles with new code
'operator_behavior': 0.10, # individual operator variation
'new_aircraft_type': 0.30, # affects vehicles at specific stands
'construction_layout': 0.20, # affects vehicles in specific zones
},
},
'ota_deployed_recently': {
# P(OTA deployed in last 72h | cause)
'yes': {
'sensor_degradation': 0.10, # coincidental
'weather_environment': 0.10, # coincidental
'ota_model_regression': 0.95, # direct cause
'map_staleness': 0.10, # coincidental
'software_bug': 0.80, # code change introduces bug
'operator_behavior': 0.05, # coincidental
'new_aircraft_type': 0.05, # coincidental
'construction_layout': 0.05, # coincidental
},
},
'per_class_mAP_drop': {
# P(specific class mAP degraded | cause)
'yes': {
'sensor_degradation': 0.40, # affects all classes (range-dependent)
'weather_environment': 0.50, # affects range-dependent classes
'ota_model_regression': 0.85, # often class-specific regression
'map_staleness': 0.05, # map does not affect detection
'software_bug': 0.30, # can affect specific detection paths
'operator_behavior': 0.02, # does not affect perception
'new_aircraft_type': 0.70, # new type has different features
'construction_layout': 0.10, # may cause OOD detections
},
},
'metar_changed': {
# P(significant METAR change in last 24h | cause)
'yes': {
'sensor_degradation': 0.20, # weather accelerates degradation
'weather_environment': 0.90, # direct indicator
'ota_model_regression': 0.10, # coincidental
'map_staleness': 0.05, # coincidental
'software_bug': 0.05, # coincidental
'operator_behavior': 0.10, # weather may change behavior
'new_aircraft_type': 0.05, # coincidental
'construction_layout': 0.05, # coincidental
},
},
'localization_degraded': {
# P(localization quality degraded | cause)
'yes': {
'sensor_degradation': 0.40, # LiDAR degradation affects GTSAM
'weather_environment': 0.30, # GPS multipath in rain, fog
'ota_model_regression': 0.05, # OTA rarely changes localization
'map_staleness': 0.85, # primary indicator of map issues
'software_bug': 0.20, # can affect GTSAM pipeline
'operator_behavior': 0.02, # does not affect localization
'new_aircraft_type': 0.05, # does not affect localization
'construction_layout': 0.75, # layout change = map mismatch
},
},
'zone_correlated': {
# P(anomaly concentrated in specific zone | cause)
'yes': {
'sensor_degradation': 0.10, # not zone-dependent
'weather_environment': 0.15, # sometimes micro-climate zones
'ota_model_regression': 0.15, # usually global
'map_staleness': 0.70, # construction is localized
'software_bug': 0.10, # usually global
'operator_behavior': 0.20, # operators assigned to zones
'new_aircraft_type': 0.80, # new aircraft at specific stands
'construction_layout': 0.90, # construction is localized
},
},
}
def __init__(self):
self.posteriors = dict(self.ROOT_CAUSES) # start with priors
self.evidence_collected = []
def update(self, test_name, result):
"""
Bayesian update: P(cause | evidence) proportional to
P(evidence | cause) * P(cause)
"""
likelihoods = self.LIKELIHOODS[test_name][result]
# Bayes rule
new_posteriors = {}
for cause, prior in self.posteriors.items():
likelihood = likelihoods.get(cause, 0.5) # default uniform
new_posteriors[cause] = likelihood * prior
# Normalize
total = sum(new_posteriors.values())
for cause in new_posteriors:
new_posteriors[cause] /= total
self.posteriors = new_posteriors
self.evidence_collected.append((test_name, result))
return self.get_diagnosis()
def get_diagnosis(self):
"""Return ranked hypotheses with confidence."""
ranked = sorted(
self.posteriors.items(),
key=lambda x: x[1],
reverse=True
)
top_cause, top_prob = ranked[0]
second_cause, second_prob = ranked[1]
confidence = 'high' if top_prob > 0.6 else (
'medium' if top_prob > 0.35 else 'low'
)
return {
'top_hypothesis': top_cause,
'probability': top_prob,
'confidence': confidence,
'ratio_to_second': top_prob / max(second_prob, 1e-6),
'all_hypotheses': ranked,
'evidence_count': len(self.evidence_collected),
}
def next_best_test(self):
"""
Select the diagnostic test with highest expected information gain.
Uses entropy reduction: pick the test that minimizes expected
posterior entropy (equivalently, maximizes mutual information
between test result and root cause).
"""
remaining_tests = [
t for t in self.LIKELIHOODS.keys()
if t not in [e[0] for e in self.evidence_collected]
]
best_test = None
best_info_gain = -1
current_entropy = self._entropy(self.posteriors)
for test in remaining_tests:
expected_posterior_entropy = 0
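            # Note: the likelihood tables above list only one outcome per
            # test; a full implementation would also enumerate the
            # complementary outcome with likelihood 1 - P(result | cause).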
for result in self.LIKELIHOODS[test].keys():
# P(result) = sum over causes of P(result|cause) * P(cause)
p_result = sum(
self.LIKELIHOODS[test][result].get(c, 0.5) * p
for c, p in self.posteriors.items()
)
# Posterior after observing this result
temp_posteriors = {}
for cause, prior in self.posteriors.items():
l = self.LIKELIHOODS[test][result].get(cause, 0.5)
temp_posteriors[cause] = l * prior
total = sum(temp_posteriors.values())
for c in temp_posteriors:
temp_posteriors[c] /= max(total, 1e-10)
posterior_entropy = self._entropy(temp_posteriors)
expected_posterior_entropy += p_result * posterior_entropy
info_gain = current_entropy - expected_posterior_entropy
if info_gain > best_info_gain:
best_info_gain = info_gain
best_test = test
return best_test, best_info_gain
@staticmethod
def _entropy(distribution):
return -sum(
p * math.log2(p + 1e-10) for p in distribution.values()
        )
5.3 Worked Example: The Tuesday Morning Problem
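The trace below can be reproduced by driving the engine from Section 5.2 with the four test results; a minimal driver sketch (the exact posterior values follow from the likelihood tables above):

```python
# Driver sketch reproducing the diagnosis trace below.
engine = BayesianDiagnosisEngine()
for test, result in [
    ('fraction_vehicles_affected', 'high'),  # 12/15 vehicles
    ('ota_deployed_recently', 'yes'),        # v3.2.1, 2 days ago
    ('per_class_mAP_drop', 'yes'),           # personnel class -25%
    ('metar_changed', 'yes'),                # fog onset, visibility 800 m
]:
    diagnosis = engine.update(test, result)
    print(test, '->', diagnosis['top_hypothesis'],
          f"P={diagnosis['probability']:.2f} ({diagnosis['confidence']})")
# Between steps, engine.next_best_test() picks the next query by
# expected information gain, as shown in the trace.
```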
BAYESIAN DIAGNOSIS TRACE
Initial priors:
sensor_degradation: 0.25
weather_environment: 0.20
ota_model_regression: 0.15
map_staleness: 0.12
software_bug: 0.10
operator_behavior: 0.08
new_aircraft_type: 0.05
construction_layout: 0.05
Step 1: Check fraction of vehicles affected -> HIGH (12/15 = 80%)
Updated posteriors:
weather_environment: 0.37 (+0.17) <-- now leads
ota_model_regression: 0.24 (+0.09)
software_bug: 0.15 (+0.05)
map_staleness: 0.12 (stable)
sensor_degradation: 0.06 (-0.19) <-- eliminated (individual vehicle issue)
...
Confidence: medium (0.37, just above the 0.35 threshold)
Next best test: ota_deployed_recently (info gain: 0.41 bits)
Step 2: Check OTA deployed recently -> YES (v3.2.1, 2 days ago)
Updated posteriors:
ota_model_regression: 0.51 (+0.27) <-- now leads strongly
software_bug: 0.27 (+0.12)
weather_environment: 0.08 (-0.29) <-- weather unlikely given OTA timing
...
Confidence: medium (0.51)
Next best test: per_class_mAP_drop (info gain: 0.35 bits)
Step 3: Check per-class mAP -> YES (personnel class down 25%)
Updated posteriors:
ota_model_regression: 0.68 (+0.17) <-- high confidence
software_bug: 0.13 (-0.14)
new_aircraft_type: 0.06 (stable)
...
Confidence: high (0.68)
Ratio to second: 5.2x
Step 4: Check METAR change -> YES (fog onset, visibility 800m)
Updated posteriors:
ota_model_regression: 0.59 (-0.09)
weather_environment: 0.22 (+0.14) <-- weather is contributing factor
...
Note: Both OTA and weather are contributing. Multi-cause scenario.
DIAGNOSIS: OTA model regression on personnel class (primary, P=0.59)
exacerbated by seasonal fog onset (secondary, P=0.22)
RECOMMENDED ACTIONS:
1. Immediate: Rollback OTA v3.2.1 to v3.2.0 on Airport B fleet
2. Short-term: Retrain personnel detection with fog augmentation
3. Medium-term: Add fog-specific validation gate to CI/CD pipeline
Time to diagnosis: 4 automated tests, < 2 minutes
Manual triage equivalent: 8-16 hours
5.4 Multi-Hypothesis Tracking
When posterior remains split between two or more hypotheses after all automated tests, maintain parallel investigation tracks:
class MultiHypothesisTracker:
"""
When Bayesian diagnosis cannot reach high confidence,
maintain and investigate multiple hypotheses in parallel.
Trigger: top hypothesis probability < 0.50 after all automated tests
"""
def __init__(self, diagnosis_result, max_hypotheses=3):
self.hypotheses = [
(cause, prob)
for cause, prob in diagnosis_result['all_hypotheses'][:max_hypotheses]
]
self.investigations = {}
def generate_investigation_plan(self):
"""
For each hypothesis, generate the specific diagnostic steps
that would confirm or refute it.
"""
plans = {
'sensor_degradation': [
'Run per-sensor health diagnostic on all affected vehicles',
'Compare point cloud density across affected vs unaffected vehicles',
'Check cleaning/maintenance logs for affected vehicles',
],
'ota_model_regression': [
'Run shadow evaluation of v_old vs v_new on captured rosbags',
'Compare per-class mAP between canary and control vehicles',
'Check model training logs for data distribution changes',
],
'weather_environment': [
'Correlate intervention timestamps with METAR visibility',
'Check if interventions cluster in specific time-of-day',
'Compare same vehicles on clear-weather days (control)',
],
'map_staleness': [
'Compare current SLAM output to reference map for discrepancies',
'Check AMDB/NOTAM for recent changes at affected zones',
'Measure localization RMSE in affected vs unaffected zones',
],
'new_aircraft_type': [
'Cross-reference intervention locations with aircraft stand assignments',
'Check FIDS for new airline/aircraft type at affected stands',
'Review false-negative detections for unknown aircraft geometry',
],
'construction_layout': [
'Query NOTAM feed for active construction at airport',
'Compare fleet SLAM maps from this week vs last week',
'Check if interventions cluster in specific geographic zones',
],
'operator_behavior': [
'Compare intervention rates by operator ID',
'Check for new operators on shift during anomaly window',
'Review intervention justification notes for patterns',
],
'software_bug': [
'Check error logs for new exception types or increased error rates',
'Review code changes deployed in same OTA as model update',
'Run affected scenarios in SIL with verbose logging',
],
}
return {
cause: plans.get(cause, ['Manual investigation required'])
for cause, prob in self.hypotheses
        }
6. OTA Regression Attribution
6.1 The OTA Regression Problem
Every OTA model update risks regression on specific classes, scenarios, or conditions that were not adequately represented in the validation set. The CI/CD pipeline (see ../../40-runtime-systems/ml-deployment/av-cicd-devops-pipeline.md) catches many regressions pre-deployment, but production-time regressions occur because:
- Validation set coverage gap: Offline validation cannot cover all airport x weather x aircraft x time-of-day combinations
- Distribution shift: The training distribution may not match the current operational distribution (e.g., new seasonal fog pattern)
- Interaction effects: Model performs well on fog OR on new aircraft, but fails on fog AND new aircraft together
- Calibration shift: Model accuracy unchanged but confidence calibration drifts, causing different downstream decisions
6.2 A/B Testing Between OTA Versions
The gold standard for OTA regression detection is a controlled A/B test within the fleet.
OTA A/B TEST ARCHITECTURE
Airport B fleet (15 vehicles):
Control group (5 vehicles): Treatment group (10 vehicles):
OTA v3.2.0 (previous) OTA v3.2.1 (new)
Vehicle IDs: 1, 4, 7, 11, 14 Vehicle IDs: 2, 3, 5, 6, 8, 9, 10, 12, 13, 15
Assignment: stratified random
- Same route distribution between groups
- Same shift/operator distribution
- Same vehicle model mix
Duration: 7-14 days (minimum for statistical power)
Primary metric: intervention_rate (events per 100 km)
Secondary metrics: per_class_mAP, localization_cov, mission_success_rate
Analysis: two-sample t-test (parametric) or Mann-Whitney U (non-parametric)
Power: 80% to detect 50% intervention rate increase
Required sample size: ~200 missions per group (achievable in 3-5 days)
Automated regression detection from A/B results:
from scipy import stats
class OTARegressionDetector:
"""
Automated detection of OTA model regression from fleet A/B test.
"""
def __init__(self, significance_level=0.05, effect_size_threshold=0.20):
self.alpha = significance_level
self.min_effect = effect_size_threshold # 20% relative change
def detect_regression(self, control_metrics, treatment_metrics):
"""
Compare control (old OTA) vs treatment (new OTA) metrics.
Returns: regression detected (bool), details (dict)
"""
results = {}
for metric_name in control_metrics.columns:
control = control_metrics[metric_name].dropna()
treatment = treatment_metrics[metric_name].dropna()
# Mann-Whitney U test (non-parametric, handles non-normal distributions)
stat, p_value = stats.mannwhitneyu(
control, treatment, alternative='two-sided'
)
# Effect size (rank-biserial correlation)
n1, n2 = len(control), len(treatment)
effect_size = 1 - (2 * stat) / (n1 * n2)
# Relative change
relative_change = (treatment.mean() - control.mean()) / (
control.mean() + 1e-10
)
is_regression = (
p_value < self.alpha and
abs(relative_change) > self.min_effect and
self._is_degradation(metric_name, relative_change)
)
results[metric_name] = {
'p_value': p_value,
'effect_size': effect_size,
'relative_change': relative_change,
'control_mean': control.mean(),
'treatment_mean': treatment.mean(),
'is_regression': is_regression,
}
any_regression = any(r['is_regression'] for r in results.values())
return any_regression, results
@staticmethod
def _is_degradation(metric_name, relative_change):
"""Determine if change direction is a regression."""
# Higher is worse for these metrics
worse_if_higher = [
'intervention_rate', 'localization_cov', 'planning_jerk',
'control_tracking_error', 'compute_latency',
]
# Higher is better for these metrics
worse_if_lower = [
'perception_mAP', 'mission_success_rate', 'sensor_health',
]
if metric_name in worse_if_higher:
return relative_change > 0
elif metric_name in worse_if_lower:
return relative_change < 0
        return False
6.3 Per-Class and Per-Scenario Regression Detection
Overall mAP may improve while a specific class regresses -- the "accuracy paradox" for safety-critical systems.
class PerClassRegressionDetector:
"""
Detect per-class regression that may be masked by overall improvement.
Example: OTA v3.2.1 improves aircraft detection +3% but regresses
personnel detection -8%. Overall mAP: +1%. But personnel regression
is safety-critical and must be caught.
"""
SAFETY_CRITICAL_CLASSES = {
'personnel': {'min_mAP': 0.60, 'max_regression': 0.03},
'aircraft': {'min_mAP': 0.70, 'max_regression': 0.05},
'fod': {'min_mAP': 0.50, 'max_regression': 0.02},
}
def check_per_class(self, old_model_eval, new_model_eval):
"""
Compare per-class metrics between old and new model.
old_model_eval / new_model_eval: dict[class_name -> mAP]
"""
regressions = []
for cls, thresholds in self.SAFETY_CRITICAL_CLASSES.items():
old_mAP = old_model_eval.get(cls, 0)
new_mAP = new_model_eval.get(cls, 0)
delta = new_mAP - old_mAP
if delta < -thresholds['max_regression']:
regressions.append({
'class': cls,
'old_mAP': old_mAP,
'new_mAP': new_mAP,
'regression': abs(delta),
'severity': 'critical' if cls in ('personnel', 'fod') else 'high',
'below_minimum': new_mAP < thresholds['min_mAP'],
})
            # Flag classes below the absolute floor even without a large delta
            # (the regression entry above already carries 'below_minimum')
            if new_mAP < thresholds['min_mAP'] and delta >= -thresholds['max_regression']:
regressions.append({
'class': cls,
'new_mAP': new_mAP,
'minimum': thresholds['min_mAP'],
'severity': 'critical',
'reason': 'below_absolute_minimum',
})
        return regressions
6.4 Counterfactual OTA Analysis
For each intervention that occurred after an OTA update, ask: "Would this intervention have occurred on the previous model version?"
class CounterfactualOTAAnalysis:
"""
Counterfactual analysis: replay intervention events through
the previous model version to determine if OTA caused them.
Requires: rosbag capture from intervention events (standard trigger
in data-flywheel-airside.md) + previous model version available
in artifact registry.
"""
def analyze_intervention(self, rosbag_path, old_model, new_model):
"""
Replay intervention rosbag through both model versions.
Returns: counterfactual result
- 'ota_caused': intervention would NOT have occurred with old model
- 'ota_independent': intervention WOULD have occurred with old model
- 'ota_exacerbated': intervention occurred with both, but new model
responded later/worse
"""
        # Extract sensor data from rosbag (extract_sensor_frames, planning_sim,
        # check_intervention_needed, and compute_safety_margin are assumed
        # from the fleet replay tooling)
        sensor_data = extract_sensor_frames(rosbag_path)
# Run old model on same sensor data
old_detections = old_model.run_batch(sensor_data)
old_planning = planning_sim(old_detections)
old_would_intervene = check_intervention_needed(old_planning)
# Run new model on same sensor data
new_detections = new_model.run_batch(sensor_data)
new_planning = planning_sim(new_detections)
new_would_intervene = check_intervention_needed(new_planning)
if new_would_intervene and not old_would_intervene:
return 'ota_caused'
elif new_would_intervene and old_would_intervene:
# Check if new model was worse (later detection, smaller margin)
old_margin = compute_safety_margin(old_planning)
new_margin = compute_safety_margin(new_planning)
if new_margin < old_margin * 0.8:
return 'ota_exacerbated'
return 'ota_independent'
else:
            return 'ota_independent'
6.5 Automatic Rollback Triggers
OTA ROLLBACK DECISION MATRIX
| Per-class regression? | Safety-critical class? | Overall mAP impact | Action |
|---|---|---|---|
| YES | YES (personnel/FOD) | Any | ROLLBACK immediately |
| YES | NO (GSE, other) | Drop > 5% | ROLLBACK within 24h |
| YES | NO (GSE, other) | Drop <= 5% | MONITOR for 72h |
| NO | N/A | Drop > 3% | ROLLBACK within 24h |
| NO | N/A | Drop <= 3% | MONITOR for 7 days |
Rollback procedure:
1. Stop canary expansion (halt OTA to remaining vehicles)
2. Revert treatment group to control version (OTA v_old)
3. Capture all rosbags from anomalous period for analysis
4. Create Jira ticket with counterfactual analysis results
5. Block v_new from re-deployment until regression fixed
7. Map Staleness Attribution
7.1 Map Staleness Detection
Map staleness manifests as increased localization error, path deviations, and unexpected obstacles. The fleet itself is the best sensor for map changes (see hd-map-change-detection-maintenance.md for the maintenance pipeline). Here we focus on attributing anomalies to map staleness.
import numpy as np

class MapStalenessAttributor:
"""
Attributes fleet anomalies to map staleness by correlating
localization quality with map age and spatial patterns.
"""
def compute_staleness_score(self, fleet_telemetry, map_metadata):
"""
Score the likelihood that map staleness is causing the anomaly.
Signals:
1. Localization covariance increase correlated with map age
2. Spatial clustering of localization errors
3. GTSAM iteration count increase (harder to converge)
4. AMDB/NOTAM changes since last map update
5. Fleet SLAM discrepancies vs reference map
"""
signals = {}
# Signal 1: Map age vs localization quality correlation
map_ages = fleet_telemetry['map_age_hours']
loc_quality = fleet_telemetry['localization_cov_trace']
correlation = np.corrcoef(map_ages, loc_quality)[0, 1]
signals['age_loc_correlation'] = correlation
# Signal 2: Spatial clustering
error_locations = fleet_telemetry[
fleet_telemetry['localization_cov_trace'] >
fleet_telemetry['localization_cov_trace'].quantile(0.9)
][['x', 'y']]
if len(error_locations) > 10:
from sklearn.cluster import DBSCAN
clusters = DBSCAN(eps=50, min_samples=5).fit(error_locations)
n_clusters = len(set(clusters.labels_)) - (1 if -1 in clusters.labels_ else 0)
signals['spatial_clusters'] = n_clusters # > 0 suggests localized map issue
# Signal 3: GTSAM convergence difficulty
gtsam_iters = fleet_telemetry['gtsam_iteration_count']
signals['gtsam_iter_increase'] = (
gtsam_iters.tail(100).mean() / gtsam_iters.head(100).mean()
)
# Signal 4: NOTAM changes
active_notams = self.query_notams(map_metadata['airport_id'])
notams_since_map = [
n for n in active_notams
if n['effective_date'] > map_metadata['last_update']
]
signals['notam_changes'] = len(notams_since_map)
        # Composite score over Signals 1-4 (Signal 5, fleet SLAM discrepancy
        # vs reference map, would be folded in analogously when available)
score = (
0.30 * min(max(correlation, 0), 1.0) +
0.25 * min(signals.get('spatial_clusters', 0) / 3.0, 1.0) +
0.20 * min(max(signals['gtsam_iter_increase'] - 1.0, 0) / 2.0, 1.0) +
0.25 * min(signals['notam_changes'] / 5.0, 1.0)
)
        return score, signals
7.2 Construction Activity Detection
The fleet detects construction and layout changes from perception data before the map is updated.
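A minimal sketch of the tile-differencing step in the pipeline below; the 1 m occupancy-grid format, thresholds, and persistence counter are illustrative assumptions:

```python
import numpy as np

def flag_construction_tiles(current_occ, ref_occ, days_over, tile=10,
                            diff_threshold=0.3, persist_days=3):
    """
    current_occ / ref_occ: aligned 2D occupancy grids at 1 m resolution.
    days_over: persistent per-tile counter of consecutive days over threshold
               (updated in place, one call per daily fleet SLAM aggregate).
    Returns a boolean mask of 10m x 10m tiles flagged as potential construction.
    """
    n_rows, n_cols = current_occ.shape[0] // tile, current_occ.shape[1] // tile
    flagged = np.zeros((n_rows, n_cols), dtype=bool)
    for i in range(n_rows):
        for j in range(n_cols):
            block = (slice(i * tile, (i + 1) * tile),
                     slice(j * tile, (j + 1) * tile))
            diff = np.abs(current_occ[block] - ref_occ[block]).mean()
            days_over[i, j] = days_over[i, j] + 1 if diff > diff_threshold else 0
            flagged[i, j] = days_over[i, j] >= persist_days
    return flagged
```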
CONSTRUCTION DETECTION PIPELINE
Fleet SLAM maps (daily aggregate)
│
▼
Difference detection vs reference map
┌────────────────────────────────────────┐
│ For each 10m x 10m tile: │
│ diff = |current_occupancy - ref_map| │
│ if diff > threshold for 3+ days: │
│ → flag as potential construction │
└──────────────┬─────────────────────────┘
│
▼
Cross-reference with NOTAM feed
┌────────────────────────────────────────┐
│ Active construction NOTAM in zone? │
│ YES → confirmed construction │
│ NO → possible unreported change │
│ → escalate to operations │
└──────────────┬─────────────────────────┘
│
▼
Impact assessment
┌────────────────────────────────────────┐
│ Does construction zone overlap with: │
│ - Active routes? → reroute │
│ - Map features? → trigger map update │
│ - Localization landmarks? → alert │
└────────────────────────────────────────┘
7.3 Automatic Map Refresh Triggers
| Trigger | Threshold | Action | Priority |
|---|---|---|---|
| Localization RMSE in zone > 0.5m for > 4 hours | 0.5m sustained | Schedule map survey of zone | P2 |
| GTSAM iteration count increase > 50% | 50% over baseline | Schedule map survey | P3 |
| NOTAM issued for construction in mapped area | Any relevant NOTAM | Assess impact, schedule if needed | P2 |
| Map age > 90 days with no validation | 90 days | Mandatory resurvey or fleet validation | P3 |
| Fleet SLAM detects > 10 new static objects in zone | 10+ new objects | Trigger incremental map update | P2 |
| 3+ vehicles report path deviation in same zone | 3 vehicles | Emergency map investigation | P1 |
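A sketch of how these triggers might be evaluated per zone, ordered by priority; the zone_stats field names are hypothetical, chosen to mirror the table:

```python
def evaluate_map_refresh_triggers(zone_stats):
    """Return (priority, action) for the first matching trigger, else None."""
    if zone_stats.get('path_deviation_vehicles', 0) >= 3:
        return ('P1', 'emergency_map_investigation')
    if (zone_stats.get('loc_rmse_m', 0) > 0.5
            and zone_stats.get('rmse_sustained_hours', 0) > 4):
        return ('P2', 'schedule_zone_survey')
    if zone_stats.get('construction_notam_active', False):
        return ('P2', 'assess_impact_schedule_if_needed')
    if zone_stats.get('new_static_objects', 0) >= 10:
        return ('P2', 'incremental_map_update')
    if zone_stats.get('gtsam_iter_increase_pct', 0) > 50:
        return ('P3', 'schedule_map_survey')
    if zone_stats.get('map_age_days', 0) > 90:
        return ('P3', 'mandatory_resurvey_or_fleet_validation')
    return None
```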
8. Environmental Attribution
8.1 METAR Correlation Engine
METAR (Meteorological Aerodrome Report) is the standard aviation weather report, available every 30-60 minutes for every airport with IFR operations. Correlating fleet performance with METAR is the fastest environmental attribution method.
from scipy import stats

class METARCorrelationEngine:
"""
Correlate fleet KPIs with METAR weather observations.
METAR fields relevant to GSE operations:
- visibility_m: prevailing visibility in meters
- ceiling_ft: cloud base height (affects lighting)
- temp_c: temperature (affects sensor performance, tarmac)
- dewpoint_c: dew point (fog likelihood when temp-dewpoint < 3C)
- wind_speed_kt: wind speed (affects jet blast, FOD movement)
- wind_gust_kt: gust speed
- precip_type: RA (rain), SN (snow), FG (fog), BR (mist), HZ (haze)
- precip_intensity: - (light), (moderate), + (heavy)
"""
def correlate(self, fleet_kpis, metar_history, window_hours=24):
"""
Compute time-lagged correlations between METAR conditions
and fleet performance metrics.
"""
results = {}
metar_features = [
'visibility_m', 'temp_c', 'dewpoint_c', 'wind_speed_kt',
'precip_intensity_encoded', # 0=none, 1=light, 2=moderate, 3=heavy
]
kpi_features = [
'intervention_rate', 'perception_mAP', 'sensor_health_min',
'localization_cov_trace',
]
for metar_feat in metar_features:
for kpi_feat in kpi_features:
# Test multiple lags (0, 1, 2, 4, 8, 12, 24 hours)
for lag in [0, 1, 2, 4, 8, 12, 24]:
metar_lagged = metar_history[metar_feat].shift(lag)
kpi_aligned = fleet_kpis[kpi_feat]
valid = metar_lagged.notna() & kpi_aligned.notna()
if valid.sum() < 30:
continue
r, p = stats.pearsonr(
metar_lagged[valid], kpi_aligned[valid]
)
                    # Note: 5 METAR features x 4 KPIs x 7 lags = 140 tests;
                    # apply a multiple-comparison correction (e.g.
                    # Benjamini-Hochberg) before treating p < 0.05 as significant
                    if p < 0.05 and abs(r) > 0.3:
key = (metar_feat, kpi_feat, lag)
results[key] = {'r': r, 'p': p, 'lag_hours': lag}
        return results
8.2 Seasonal Drift Detection
Airside operations experience strong seasonal patterns that must be separated from anomalies.
SEASONAL PATTERNS IN AIRSIDE FLEET PERFORMANCE
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
────────────────────────────────────────────────────────────
mAP: Low Low Med Med High High Med High High Med Med Low
(fog, de-icing) (heat shimmer) (fog onset)
Int: High High Med Med Low Med Med Low Low Med Med High
rate (vis, ice) (thermal) (fog, early dark)
Sensor
health: Low Low Med High High Med Med High High Med Low Low
(glycol, salt) (heat stress) (debris, ice)
Key seasonal transitions (highest risk of misattribution):
- Oct: Fog onset -- first foggy day of season often causes spike
that looks like OTA regression if OTA was deployed in Sep
- Nov-Dec: De-icing onset -- sensor degradation accelerates
- Mar-Apr: Pollen season -- gradual sensor contamination
- Jun-Jul: Heat shimmer -- LiDAR range reduction on hot tarmac
Seasonal baseline adjustment:
class SeasonalBaselineAdjuster:
"""
Adjust fleet baselines for seasonal patterns to prevent
seasonal effects from being misattributed as anomalies.
Requires: 12+ months of fleet data for robust seasonal estimation.
For new airports: use data from similar airports (same climate zone)
as initial seasonal prior, update with local data.
"""
def __init__(self, historical_data, period='weekly'):
"""
Fit seasonal decomposition on historical fleet data.
Uses STL (Seasonal and Trend decomposition using Loess)
for robust decomposition that handles outliers.
"""
from statsmodels.tsa.seasonal import STL
self.decompositions = {}
for metric in historical_data.columns:
stl = STL(
historical_data[metric],
period=52 if period == 'weekly' else 365,
seasonal=13, # seasonal smoother window
trend=53, # trend smoother window
robust=True, # robust to outliers
)
self.decompositions[metric] = stl.fit()
    def get_adjusted_baseline(self, metric, date):
        """
        Return seasonally-adjusted baseline for given metric and date.
        Note: date must fall within the fitted history; for future dates,
        extrapolate the trend component or refit on updated data.
        """
decomp = self.decompositions[metric]
seasonal_component = decomp.seasonal[date]
trend_component = decomp.trend[date]
return trend_component + seasonal_component
def is_anomalous(self, metric, date, observed_value, n_sigma=3):
"""
Check if observed value is anomalous relative to
seasonally-adjusted baseline.
"""
baseline = self.get_adjusted_baseline(metric, date)
residual_std = self.decompositions[metric].resid.std()
z_score = (observed_value - baseline) / residual_std
        return abs(z_score) > n_sigma, z_score
8.3 Time-of-Day Effects
Several airside phenomena have strong diurnal patterns:
| Time Window | Environmental Effect | Impact on AV Operations |
|---|---|---|
| 04:00-07:00 | Dawn fog, dew, thermal inversion | LiDAR range reduction 20-40%, GPS multipath |
| 07:00-09:00 | Morning rush, de-icing (winter) | High traffic density, sensor contamination |
| 10:00-14:00 | Solar heating, heat shimmer | Tarmac heat shimmer, thermal camera saturation |
| 14:00-17:00 | Afternoon thunderstorms (summer) | Sudden visibility drops, standing water |
| 17:00-20:00 | Evening rush, low sun angle | Glare on cameras, long shadows in LiDAR |
| 20:00-04:00 | Night operations | No natural lighting, thermal-primary perception |
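These diurnal patterns argue for hour-of-day conditioned baselines, analogous to the seasonal adjustment in Section 8.2. A minimal pandas sketch (the helper name is hypothetical):

```python
import pandas as pd

def diurnal_baseline(fleet_kpis: pd.DataFrame, metric: str) -> pd.Series:
    """
    Median of `metric` per hour of day, computed over a DatetimeIndex.
    Compare observations against the baseline for their own hour so that
    dawn fog or low-sun glare is not flagged as a fleet anomaly.
    """
    return fleet_kpis[metric].groupby(fleet_kpis.index.hour).median()

# Usage sketch:
# baseline = diurnal_baseline(fleet_kpis, 'intervention_rate')
# expected = baseline[timestamp.hour]
```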
8.4 Airport-Specific Environmental Signatures
Each airport has unique environmental characteristics that create distinct anomaly patterns:
AIRPORT_ENVIRONMENTAL_PROFILES = {
'LHR': {
'primary_hazards': ['fog', 'rain', 'de-icing_Nov-Mar'],
'fog_frequency': 0.15, # fraction of operating hours
'de_icing_months': [11, 12, 1, 2, 3],
'heat_stress_rare': True,
'expected_mAP_reduction_fog': 0.15, # 15% mAP drop in fog is normal
},
'DXB': {
'primary_hazards': ['heat_shimmer', 'sand', 'humidity'],
'heat_months': [5, 6, 7, 8, 9],
'max_tarmac_temp_c': 65,
'sand_storm_frequency': 0.02,
'expected_thermal_throttling': True, # Orin throttles in summer
},
'NRT': {
'primary_hazards': ['typhoon_season', 'humidity', 'snow'],
'typhoon_months': [8, 9, 10],
'snow_months': [12, 1, 2],
'expected_mAP_reduction_rain': 0.10,
},
}
9. Implementation Architecture
9.1 System Architecture
FLEET ANOMALY ATTRIBUTION ARCHITECTURE
┌──────────────────────────────────────────────────────────────────────────┐
│ │
│ VEHICLE TIER (On-Vehicle, Orin) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Telemetry Emitter (1 Hz) │ │
│ │ - Perception metrics (mAP, per-class, latency) │ │
│ │ - Localization metrics (cov trace, GTSAM iters, RMSE) │ │
│ │ - Sensor health scores (per-sensor, per-check) │ │
│ │ - Planning metrics (jerk, clearance, candidate count) │ │
│ │ - Control metrics (tracking error, actuator status) │ │
│ │ - System metrics (GPU util, temp, memory, latency) │ │
│ │ - OTA/map version identifiers │ │
│ │ Format: Protobuf -> 5G/WiFi -> Kafka │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ STREAMING TIER (Airport Edge or Cloud) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Apache Kafka │ │
│ │ Topics: telemetry.{airport}.{vehicle}, alerts.{tier}, │ │
│ │ interventions.{airport}, metar.{airport} │ │
│ │ Retention: 7 days (hot), 90 days (cold via S3 sink) │ │
│ └────────────────────┬────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────▼────────────────────────────────────────────┐ │
│ │ Apache Flink (Stream Processing) │ │
│ │ │ │
│ │ Job 1: Per-vehicle CUSUM/EWMA (1-min windows) │ │
│ │ - Update SPC charts for each vehicle x metric │ │
│ │ - Emit to alerts.p3 on single-vehicle alarm │ │
│ │ │ │
│ │ Job 2: Per-airport aggregation (15-min windows) │ │
│ │ - Aggregate vehicle metrics to airport level │ │
│ │ - CUSUM on airport-level KPIs │ │
│ │ - Emit to alerts.p2 on airport alarm │ │
│ │ │ │
│ │ Job 3: Fleet-wide aggregation (1-hour windows) │ │
│ │ - Isolation Forest scoring on fleet snapshot │ │
│ │ - EWMA on fleet-wide intervention rate │ │
│ │ - Emit to alerts.p1 on fleet alarm │ │
│ │ │ │
│ │ Job 4: METAR enrichment │ │
│ │ - Join telemetry with METAR feed │ │
│ │ - Compute environmental context features │ │
│ │ │ │
│ │ Job 5: Alert suppression and deduplication │ │
│ │ - Apply suppression rules (Section 2.5) │ │
│ │ - Route to appropriate tier │ │
│ └────────────────────┬────────────────────────────────────────────┘ │
│ │ │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ ATTRIBUTION TIER (Cloud) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Attribution Engine (triggered by P1/P2 alerts) │ │
│ │ │ │
│ │ Step 1: Hierarchical decomposition │ │
│ │ Fleet -> Airport -> Vehicle -> Subsystem -> Component │ │
│ │ │ │
│ │ Step 2: Shapley attribution │ │
│ │ Per-feature SHAP values for anomalous observations │ │
│ │ Airport-level and vehicle-level Shapley decomposition │ │
│ │ │ │
│ │ Step 3: Bayesian diagnosis │ │
│ │ Sequential evidence gathering │ │
│ │ Prior update from fleet history │ │
│ │ Multi-hypothesis tracking if ambiguous │ │
│ │ │ │
│ │ Step 4: Context enrichment │ │
│ │ Cross-reference with OTA deployment log │ │
│ │ Cross-reference with METAR history │ │
│ │ Cross-reference with NOTAM feed │ │
│ │ Cross-reference with FIDS (flight schedule / aircraft types) │ │
│ │ │ │
│ │ Step 5: Report generation │ │
│ │ Root cause hypothesis with confidence │ │
│ │ Recommended actions │ │
│ │ Evidence trail for audit │ │
│ └────────────────────┬────────────────────────────────────────────┘ │
│ │ │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ BATCH TIER (Cloud, Airflow) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ DAG: daily_fleet_analysis (runs 02:00 UTC) │ │
│ │ Task 1: Aggregate 24h telemetry from S3 │ │
│ │ Task 2: Run Isolation Forest + autoencoder on daily snapshot │ │
│ │ Task 3: Update seasonal baselines (STL decomposition) │ │
│ │ Task 4: Granger causality analysis (90-day rolling window) │ │
│ │ Task 5: Generate daily fleet health report │ │
│ │ │ │
│ │ DAG: quarterly_causal_discovery (runs quarterly) │ │
│ │ Task 1: Run NOTEARS on 90-day fleet telemetry │ │
│ │ Task 2: Run PC algorithm for validation │ │
│ │ Task 3: Compare discovered DAG with prior DAG │ │
│ │ Task 4: Generate causal structure report for human review │ │
│ │ │ │
│ │ DAG: ota_regression_analysis (triggered by OTA deployment) │ │
│ │ Task 1: Define A/B groups │ │
│ │ Task 2: Collect 7-day metrics from control + treatment │ │
│ │ Task 3: Run regression detection tests │ │
│ │ Task 4: Counterfactual replay on captured rosbags │ │
│ │ Task 5: Generate OTA regression report │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ PRESENTATION TIER │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Grafana Dashboard │ │
│ │ Panel 1: Fleet KPI overview (SPC charts per level) │ │
│ │ Panel 2: Active anomalies with attribution status │ │
│ │ Panel 3: Shapley waterfall chart (feature contributions) │ │
│ │ Panel 4: Bayesian diagnosis progress (posterior evolution) │ │
│ │ Panel 5: OTA A/B comparison (treatment vs control) │ │
│ │ Panel 6: Map staleness heat map (per zone per airport) │ │
│ │ Panel 7: Environmental correlation matrix │ │
│ │ │ │
│ │ Slack Integration │ │
│ │ P0: @channel with one-click acknowledge │ │
│ │ P1: Attribution summary with drill-down link │ │
│ │ P2: Daily digest thread │ │
│ │ │ │
│ │ Jira Integration │ │
│ │ Auto-create ticket for P0/P1 with attribution evidence │ │
│ │ Link to related rosbags, SHAP plots, Bayesian trace │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────┘
9.2 Data Flow Summary
| Data Source | Volume | Latency | Storage | Retention |
|---|---|---|---|---|
| Vehicle telemetry (1 Hz) | 500 KB/vehicle/hour | < 5 sec | Kafka -> S3 | 7 days hot, 90 days cold |
| Interventions | 1-5 events/vehicle/day | < 1 sec | Kafka -> PostgreSQL | Permanent |
| METAR | 1 report/30 min/airport | < 5 min | PostgreSQL | Permanent |
| NOTAM | As issued | < 15 min | PostgreSQL | Permanent |
| Sensor health | 1 Hz (aggregated to 1 min) | < 30 sec | Kafka -> S3 | 30 days |
| OTA deployment log | Per deployment | < 1 min | PostgreSQL | Permanent |
| Map metadata | Per map update | < 1 min | PostgreSQL | Permanent |
| FIDS/A-CDM | Per flight event | < 5 min | PostgreSQL | 1 year |
| Attribution results | Per anomaly | < 5 min | PostgreSQL + S3 | Permanent |
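As a back-of-envelope check on the telemetry volumes above (a sketch using the 50-vehicle, 500 KB/vehicle/hour figures from the table):

```python
# Storage implied by 1 Hz telemetry for a 50-vehicle fleet
vehicles = 50
kb_per_vehicle_hour = 500

daily_gb = vehicles * kb_per_vehicle_hour * 24 / 1e6  # ~0.6 GB/day
hot_gb = 7 * daily_gb    # Kafka hot tier (7 days): ~4 GB
cold_gb = 90 * daily_gb  # S3 cold tier (90 days): ~54 GB
print(f"daily={daily_gb:.1f} GB, hot={hot_gb:.0f} GB, cold={cold_gb:.0f} GB")
```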
9.3 Integration Points
INTEGRATION WITH EXISTING REFERENCE AIRSIDE AV STACK SYSTEMS
┌────────────────────────────────────────────────────────────────┐
│ UPSTREAM (data sources) │
│ │
│ sensor-degradation-health-monitoring.md │
│ → Provides: per-sensor health scores, degradation type │
│ → Integration: Flink reads from health_monitor Kafka topic │
│ │
│ runtime-verification-monitoring.md │
│ → Provides: STL violation events, OOD scores │
│ → Integration: Flink reads from stl_monitor Kafka topic │
│ │
│ cloud-backend-infrastructure.md │
│ → Provides: data lake, Airflow, telemetry pipeline │
│ → Integration: Attribution is an additional Airflow DAG │
│ and Flink job set on existing infrastructure │
│ │
│ data-flywheel-airside.md │
│ → Provides: trigger events, model eval metrics │
│ → Integration: Attribution feeds edge-case mining priorities │
│ │
│ ../fleet-management/fleet-predictive-maintenance.md │
│ → Provides: component RUL, failure predictions │
│ → Integration: PHM signals are features in anomaly model │
├────────────────────────────────────────────────────────────────┤
│ DOWNSTREAM (consumers) │
│ │
│ ../ota/ota-fleet-management.md │
│ ← Consumes: OTA regression alerts, rollback triggers │
│ │
│ hd-map-change-detection-maintenance.md │
│ ← Consumes: map staleness alerts, refresh triggers │
│ │
│ ../../40-runtime-systems/monitoring-observability/hmi-operator-interface.md │
│ ← Consumes: attribution summaries for operator dashboard │
│ │
│ ../../40-runtime-systems/ml-deployment/av-cicd-devops-pipeline.md │
│ ← Consumes: per-class regression signals for CI gates │
│ │
│ weather-adaptive-odd-management.md │
│ ← Consumes: environmental attribution for ODD tuning │
└────────────────────────────────────────────────────────────────┘
9.4 Technology Stack
| Component | Technology | Rationale |
|---|---|---|
| Message bus | Apache Kafka | Already in cloud backend architecture. High throughput, replay capability |
| Stream processing | Apache Flink | Exactly-once semantics, event-time processing, sub-second latency |
| Batch processing | Apache Airflow | Already in cloud backend architecture. DAG scheduling, dependency management |
| Anomaly models | scikit-learn (IsolationForest), PyTorch (autoencoder) | Standard, well-tested implementations |
| Causal discovery | causal-learn (NOTEARS, PC), statsmodels (Granger) | Most mature Python causal inference library |
| SHAP computation | shap library | Standard implementation, TreeSHAP for exact computation |
| Database | PostgreSQL + TimescaleDB | Time-series extension for efficient temporal queries |
| Dashboard | Grafana | Already likely in stack. Custom panels via JSON API |
| Alerting | Grafana Alerting + PagerDuty | Multi-channel alert routing |
| Ticket management | Jira API | Auto-create tickets with evidence links |
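For the PostgreSQL + TimescaleDB row, a sketch of a plausible telemetry hypertable; the schema and names are illustrative, not prescribed by the stack:

```python
# Hypothetical TimescaleDB DDL for the 1 Hz telemetry stream
TELEMETRY_DDL = """
CREATE TABLE fleet_telemetry (
    time        TIMESTAMPTZ      NOT NULL,
    airport_id  TEXT             NOT NULL,
    vehicle_id  TEXT             NOT NULL,
    metric      TEXT             NOT NULL,
    value       DOUBLE PRECISION,
    ota_version TEXT,
    map_version TEXT
);
SELECT create_hypertable('fleet_telemetry', 'time');
-- Typical attribution query pattern: one airport, recent window
CREATE INDEX ON fleet_telemetry (airport_id, vehicle_id, time DESC);
"""
```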
10. Cost-Benefit and Metrics
10.1 Implementation Cost
| Phase | Scope | Duration | Cost |
|---|---|---|---|
| Phase 1: Anomaly detection | SPC (CUSUM/EWMA) on fleet KPIs, hierarchical decomposition, alert tiering, Grafana dashboards | 6-8 weeks | $15-25K |
| Phase 2: Shapley attribution | Isolation Forest training, TreeSHAP integration, hierarchical Shapley decomposition, feature attribution dashboard | 4-6 weeks | $12-20K |
| Phase 3: Bayesian diagnosis | Diagnosis engine, prior calibration, sequential evidence gathering, multi-hypothesis tracking | 4-6 weeks | $12-20K |
| Phase 4: OTA regression | A/B test framework, per-class regression detection, counterfactual replay, automatic rollback triggers | 4-6 weeks | $10-18K |
| Phase 5: Environmental + map | METAR correlation, seasonal baseline, map staleness scoring, construction detection | 3-4 weeks | $8-15K |
| Phase 6: Causal discovery | NOTEARS pipeline, PC validation, Granger analysis, quarterly update DAG | 3-4 weeks | $8-15K |
| Total | | 16-22 weeks | $55-95K |
Infrastructure cost (incremental to existing cloud backend):
| Component | Monthly Cost (50 vehicles) | Monthly Cost (200 vehicles) |
|---|---|---|
| Flink cluster (3 nodes) | $300-500 | $800-1,500 |
| TimescaleDB | $100-200 | $300-600 |
| Additional S3 storage | $50-100 | $200-400 |
| Grafana (hosted) | $100 | $200 |
| PagerDuty | $100-200 | $200-400 |
| Compute (batch attribution) | $100-200 | $300-600 |
| Total | $750-1,300/month | $1,800-3,500/month |
10.2 Benefit Quantification
MTTR reduction:
| Metric | Before Attribution System | After (Phase 1-2) | After (Full System) |
|---|---|---|---|
| Mean Time To Detection | 1-4 hours | 5-15 minutes | 2-10 minutes |
| Mean Time To Attribution | 4-24 hours | 30-120 minutes | 10-30 minutes |
| Mean Time To Resolution | 8-48 hours | 2-8 hours | 1-4 hours |
| False positive alerts/day (50 vehicles) | N/A (no alerts) | 3-5 | 1-2 |
| Anomalies requiring manual triage | 100% | 40-60% | 15-25% |
| Analyst headcount (50 vehicles) | 2-3 (reactive) | 1-2 | 1 |
| Analyst headcount (200 vehicles) | 8-12 (impossible) | 3-4 | 2-3 |
Financial impact (50-vehicle fleet):
| Benefit Category | Annual Value | Basis |
|---|---|---|
| Reduced downtime (faster MTTR) | $80-150K | 20-40 hours/month saved x $100-200/hour fleet-hour value |
| Avoided cascading failures | $30-60K | 2-5 prevented per year x $15-30K per cascade |
| Reduced analyst headcount | $50-100K | 1-2 fewer analysts needed |
| OTA regression early detection | $20-50K | 1-3 regressions caught 24-48 hours earlier per year |
| Improved fleet availability | $40-80K | 1-2% availability improvement x fleet revenue |
| Total annual benefit | $220-440K | |
| Total annual cost | $25-45K | Infrastructure + maintenance |
| Net annual benefit | $195-395K | |
| ROI (including implementation) | 2.5-5x in Year 1 | |
10.3 Performance Metrics for the Attribution System
| Metric | Target | Measurement Method |
|---|---|---|
| Attribution accuracy | > 80% correct root cause identification | Validated against manual investigation on sample |
| Attribution coverage | > 90% of P1/P2 anomalies get automated attribution | Fraction with high-confidence diagnosis |
| MTTD (detection) | < 15 min for P1, < 60 min for P2 | Time from anomaly onset to alert |
| MTTA (attribution) | < 30 min for single-cause, < 60 min for multi-cause | Time from alert to root cause hypothesis |
| False positive rate | < 5% of P1 alerts, < 15% of P2 alerts | Manual review of 10% sample |
| Alert volume | < 5 P1-P2 alerts/day for 50-vehicle fleet | Dashboard count |
| Bayesian diagnosis depth | < 4 tests to reach high confidence | Average tests before P > 0.60 |
| Seasonal adjustment accuracy | < 10% MAPE on seasonal baseline | Back-test against held-out year |
| Causal DAG stability | > 80% edge agreement across quarters | Compare quarterly NOTEARS outputs |
10.4 Break-Even Analysis
BREAK-EVEN ANALYSIS
Fixed cost: $55-95K implementation
Variable cost: $750-1,300/month infrastructure (50 vehicles)
Annual operating: $25-45K
Annual benefit per vehicle: $4,400-8,800 (from MTTR reduction, availability, analyst savings)
Break-even fleet size (Year 1, including implementation):
  Conservative: $55K / ($4,400/vehicle) = ~13 vehicles
  Moderate: $75K / ($6,600/vehicle) = ~11 vehicles
  Optimistic: $95K / ($8,800/vehicle) = ~11 vehicles
Break-even fleet size (Year 2+, operating cost only):
  Conservative: $45K / ($4,400/vehicle) = ~10 vehicles
  Moderate: $35K / ($6,600/vehicle) = ~5 vehicles
Conclusion: ROI-positive at roughly 10-15 vehicles. Build before the fleet
reaches 20 vehicles (the manual triage scaling wall).
11. Key Takeaways
Manual triage does not scale past 20 vehicles. At 50 vehicles, the anomaly generation rate (5-20/day) exceeds per-analyst triage capacity (1-3 root causes/day) by up to an order of magnitude. The backlog grows permanently, and undiagnosed anomalies cascade into safety events.
Hierarchical anomaly detection reduces alert volume 80-90%. Fleet -> airport -> vehicle -> subsystem -> component decomposition, combined with CUSUM/EWMA and Isolation Forest, filters noise before it reaches analysts. Target: < 5 actionable P1/P2 alerts per day for a 50-vehicle fleet.
Fraction-of-vehicles-affected is the single most informative diagnostic signal. If > 80% of vehicles at an airport are affected, the cause is environmental or software (fleet-wide). If < 30%, the cause is vehicle-specific (sensor, calibration). This one check eliminates half the hypothesis space in seconds.
Causal discovery (NOTEARS + PC + Granger) provides the structural backbone. The learned causal DAG encodes which variables can cause which, preventing spurious attribution (e.g., attributing a fog-caused intervention spike to a coincidental OTA update). Re-learn quarterly as fleet operations evolve.
Shapley values provide fair, decomposable attribution. SHAP decomposes the anomaly score into per-feature contributions with exact or near-exact computation. Hierarchical Shapley (fleet -> airport -> vehicle -> feature) makes the computation tractable for large fleets.
Bayesian diagnosis trees reach high confidence in 3-4 tests. Sequential evidence gathering, guided by information gain, reaches > 60% posterior on the true root cause typically within 3-4 automated tests (2-5 minutes), compared to 4-24 hours of manual investigation.
30-40% of fleet anomalies have multiple co-occurring causes. Single-cause diagnosis leads to partial fixes and recurrence. The Bayesian and causal framework explicitly handles multi-cause scenarios by tracking joint posteriors and interaction effects.
OTA regression is the most dangerous single cause because it is self-inflicted and fleet-wide. A/B testing between OTA versions, per-class regression detection, and counterfactual replay ("would this intervention have happened on the old model?") are non-negotiable for production fleet safety.
Seasonal baselines prevent misattribution. Without seasonal adjustment, the first fog of autumn looks like a system failure. STL decomposition on 12+ months of data creates robust seasonal baselines. For new airports, borrow seasonal profiles from same-climate airports.
EU PLD 2024/2853 creates legal urgency for causal attribution. The rebuttable presumption of causality means that if the operator of the reference airside AV stack cannot demonstrate what caused an incident, courts may presume the AV caused it. Automated causal attribution provides the evidence trail that regulatory defense requires.
Implementation cost $55-95K, break-even at 10-15 vehicles. The system pays for itself through MTTR reduction (8-48h to 15-90min), analyst headcount reduction (2-3 fewer for 50 vehicles), and avoided cascading failures ($30-60K/year).
Build before you need it. Retrofitting attribution under operational pressure (when the fleet hits 20+ vehicles and anomalies pile up) costs 2-3x more than building it proactively. The 16-22 week implementation timeline means starting at 10-15 vehicles to be ready at 20.
12. References
Anomaly Detection and SPC
- Page, E. S. (1954). "Continuous Inspection Schemes." Biometrika, 41(1/2), 100-115. [Original CUSUM paper]
- Roberts, S. W. (1959). "Control Chart Tests Based on Geometric Moving Averages." Technometrics, 1(3), 239-250. [EWMA control charts]
- Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest." ICDM, 413-422. [Isolation Forest for anomaly detection]
- Malhotra, P. et al. (2016). "LSTM-based Encoder-Decoder for Multi-sensor Anomaly Detection." ICML Workshop on Anomaly Detection, arXiv:1607.00148. [Autoencoder for time-series anomaly detection]
Causal Discovery
- Zheng, X. et al. (2018). "DAGs with NO TEARS: Continuous Optimization for Structure Learning." NeurIPS, 9472-9483. [NOTEARS algorithm]
- Zheng, X. et al. (2020). "Learning Sparse Nonparametric DAGs." AISTATS, 3414-3425. [NOTEARS-NONLINEAR]
- Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search. MIT Press. [PC algorithm, FCI algorithm]
- Granger, C. W. J. (1969). "Investigating Causal Relations by Econometric Models and Cross-spectral Methods." Econometrica, 37(3), 424-438. [Granger causality]
- Zheng, Y. et al. (2024). "Causal-learn: Causal Discovery in Python." JMLR, 25(60), 1-8. [causal-learn library]
Shapley Values and Attribution
- Lundberg, S. M. & Lee, S. I. (2017). "A Unified Approach to Interpreting Model Predictions." NeurIPS, 4765-4774. [SHAP framework]
- Lundberg, S. M. et al. (2020). "From local explanations to global understanding with explainable AI for trees." Nature Machine Intelligence, 2, 56-67. [TreeSHAP]
- Covert, I. et al. (2021). "Explaining by Removing: A Unified Framework for Model Explanation." JMLR, 22(209), 1-90. [Shapley for feature attribution theory]
Bayesian Diagnosis
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd edition. Cambridge University Press.
- Heckerman, D. (1995). "A Tutorial on Learning with Bayesian Networks." Technical Report MSR-TR-95-06, Microsoft Research.
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. [Bayesian network diagnosis]
Fleet Operations and AV Monitoring
- Uber ATG (2020). "Automated Root Cause Analysis for Autonomous Vehicle Failures." Internal technical report (disclosed in NTSB investigations).
- Waymo (2023). "Fleet-Level Safety Monitoring and Anomaly Detection." Waymo Safety Report, 4th Edition.
- ISO 3691-4:2023. "Industrial trucks -- Safety requirements and verification -- Part 4: Driverless industrial trucks and their systems."
- EU Product Liability Directive 2024/2853. "Directive on liability for defective products." (Transpose deadline December 2026.)
- ISO 26262:2018. "Road vehicles -- Functional safety." Parts 1-12.
Seasonal Decomposition and Environmental Analysis
- Cleveland, R. B. et al. (1990). "STL: A Seasonal-Trend Decomposition Procedure Based on Loess." Journal of Official Statistics, 6(1), 3-73.
- ICAO Annex 3 (2018). "Meteorological Service for International Air Navigation." 20th edition. [METAR standard]
- FAA Advisory Circular 00-45H (2016). "Aviation Weather Services." [METAR interpretation]
Time-Series Analysis
- Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press. [Granger causality, VAR models]
- Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice. 3rd edition. OTexts. [STL, ARIMA, seasonal adjustment]