Automated Root-Cause Attribution for Fleet-Level Operational Anomalies
Causal Discovery, Shapley Attribution, Bayesian Diagnosis, and Streaming Analytics for Autonomous GSE Fleets
Last updated: 2026-04-11
Key Takeaway: When a 50-vehicle autonomous GSE fleet shows a 3x intervention rate spike at Airport B on Tuesday, the root cause could be any of 8+ categories: sensor degradation, stale maps, OTA model regression, weather shift, new aircraft type, construction, operator behavior, or software bug. Manual triage (reviewing rosbags, cross-referencing logs, interviewing operators) takes 4-48 hours per incident and does not scale past 20 vehicles or 2 airports. Automated root-cause attribution combines hierarchical anomaly detection (CUSUM/EWMA on fleet KPIs), causal discovery (NOTEARS on operational telemetry), Shapley-value decomposition (which airport/vehicle/component drives the anomaly), and Bayesian diagnosis trees (sequential evidence gathering to pinpoint causes). The system reduces Mean Time To Resolution (MTTR) from 8-48 hours to 15-90 minutes, cuts false-positive alerts by 60-80% through hierarchical filtering, and provides the causal evidence trail that EU PLD 2024/2853 and ISO 3691-4 increasingly require. Implementation cost: $55-95K over 16-22 weeks, with ROI at 15-25 vehicles as avoided downtime and faster resolution exceed system cost.
Table of Contents
- The Attribution Problem
- Anomaly Detection at Fleet Scale
- Causal Discovery from Operational Data
- Shapley-Value Attribution
- Bayesian Diagnosis Trees
- OTA Regression Attribution
- Map Staleness Attribution
- Environmental Attribution
- Implementation Architecture
- Cost-Benefit and Metrics
- Key Takeaways
- References
1. The Attribution Problem
1.1 Why Fleet-Level Root-Cause Analysis is Hard
A single vehicle anomaly has a tractable diagnosis path: review the rosbag, check sensor health, inspect the scene. Fleet-level anomalies are fundamentally different because causes propagate, interact, and hide across the combinatorial space of vehicles x airports x subsystems x time.
The dimensionality of fleet diagnosis:
Fleet anomaly search space:
Vehicles: 50-100 vehicles
Airports: 3-10 airports
Subsystems: 6 per vehicle (perception, localization, planning, control, sensors, compute)
Components: ~25 per vehicle (4-8 LiDAR, cameras, radar, thermal, GTSAM, Frenet, CAN, etc.)
Software: OTA version, map version, parameter set, calibration
Environment: Weather, time-of-day, traffic, construction, aircraft mix
Human: Operator behavior, shift patterns, training recency
Total search space: ~10^6 - 10^8 unique state combinations per day
Manual triage throughput: 1-3 root causes / analyst / day
Anomaly generation rate (50-vehicle fleet): 5-20 anomalies / day
Result: Manual triage creates a permanent and growing backlog
1.2 Root Cause Categories
Every fleet anomaly traces to one or more of eight root cause categories. These are not mutually exclusive -- a model regression exposed only during fog at a newly resurfaced stand involves three simultaneous causes.
| # | Root Cause Category | Example Signal | Typical Onset | Manual Triage Time |
|---|---|---|---|---|
| 1 | Sensor degradation | Point cloud density drop, intensity shift | Gradual (hours-days) | 1-4 hours |
| 2 | Map staleness | Localization error increase, path deviation | Step change (construction) or gradual (drift) | 2-8 hours |
| 3 | OTA model regression | Per-class mAP drop, new false-positive pattern | Step change (deployment time) | 4-24 hours |
| 4 | Weather/environment | Correlated cross-vehicle performance drop | Gradual-to-sudden | 30 min - 2 hours |
| 5 | New aircraft type | Detection failures near specific stands | Step change (airline schedule change) | 2-8 hours |
| 6 | Construction/layout change | Localization divergence in specific zone | Step change (construction start) | 4-16 hours |
| 7 | Operator behavior change | Intervention pattern shift without vehicle anomaly | Gradual (new operators) or step (policy change) | 4-48 hours |
| 8 | Software bug | Deterministic failure on specific input pattern | Step change (code deployment) | 2-24 hours |
1.3 Why Manual Triage Does Not Scale Past 20 Vehicles
Manual root-cause analysis follows a predictable workflow that breaks at fleet scale:
Manual triage workflow (per anomaly):
1. Receive alert (e.g., "Vehicle 23 intervention rate 4x baseline") 5 min
2. Query telemetry database for vehicle timeline 10-30 min
3. Download and review rosbag from event 20-60 min
4. Cross-reference with weather, OTA, map version 10-30 min
5. Compare with other vehicles at same airport 15-45 min
6. Compare with same vehicle at different times 15-45 min
7. Check sensor health logs 10-20 min
8. Formulate hypothesis and test 30-120 min
9. Write root-cause report 20-60 min
Total: 2.5 - 7 hours per anomaly (median ~4 hours)
With 1 analyst: 2 anomalies/day capacity
Fleet of 20 vehicles: ~5-10 anomalies/day (borderline manageable)
Fleet of 50 vehicles: ~15-30 anomalies/day (3-6x analyst capacity)
Fleet of 100 vehicles across 5 airports: ~40-80 anomalies/day (impossible)
The problem compounds because unresolved anomalies cascade. A sensor degradation left undiagnosed for 48 hours, because the analyst is busy triaging OTA regressions, can become a safety event; the analyst then spends two days on the safety investigation, during which ten more anomalies pile up.
Industry experience confirms this scaling wall:
| Company | Fleet Size at Manual Triage Breakdown | Solution |
|---|---|---|
| Waymo | ~100 vehicles (2019-2020) | Built centralized anomaly attribution system |
| Cruise | ~200 vehicles (2022) | Automated "trip diagnostics" pipeline |
| Nuro | ~50 vehicles (2021) | Hierarchical alert triage with auto-classification |
| Tesla | ~50,000 FSD vehicles (2023) | Fully automated fleet learning + regression system |
| Motional | ~150 vehicles (2023) | ML-based root cause classifier |
The reference airside AV stack will hit this wall at 15-25 vehicles across 2+ airports. Building attribution infrastructure before that point is significantly cheaper than retrofitting under operational pressure.
1.4 The Multi-Cause Problem
In practice, 30-40% of fleet anomalies have multiple contributing causes. Single-cause diagnosis leads to partial fixes that recur.
Example: The Tuesday morning problem at Airport B
Observed: 3x intervention rate, Tuesday 06:00-10:00, Airport B, 12 of 15 vehicles
Investigation reveals three co-occurring causes:
1. OTA v3.2.1 deployed Monday night (regression on personnel detection in fog)
2. Seasonal fog onset (October) degrades LiDAR range 20-30% at dawn
3. New Turkish Airlines A350 began operations Tuesday (wider wingspan,
different reflectivity profile than trained A320/B737)
Fixing only the OTA regression reduces interventions 40%, not 100%.
Fixing only the fog handling reduces interventions 20%.
Addressing the new aircraft type reduces interventions 15%.
All three together resolve 90%+ of the anomaly.
Interactions:
- OTA regression only manifests in fog (personnel detection robust in clear weather)
- Fog was tolerable before OTA regression (previous model handled reduced range)
- New aircraft type causes confusion only at specific stands (B5-B8)
The causes are not additive -- they interact multiplicatively.
This motivates the causal approach: correlation-based methods identify associations ("interventions correlate with fog") but cannot distinguish direct causes from confounders or identify interaction effects.
1.5 Relationship to Existing Systems
This document covers fleet-level anomaly attribution -- the "why" layer that sits above existing monitoring and below incident management. It consumes signals from, but does not duplicate, several existing systems:
| Existing System | Doc Reference | What It Provides to Attribution |
|---|---|---|
| Sensor health monitoring | sensor-degradation-health-monitoring.md | Per-sensor health scores, degradation type classification |
| Runtime verification | runtime-verification-monitoring.md | STL violation events, OOD detection scores, safety envelope status |
| Predictive maintenance | ../fleet-management/fleet-predictive-maintenance.md | Component RUL predictions, failure probability estimates |
| Data flywheel | data-flywheel-airside.md | Trigger events, retraining status, model performance metrics |
| Cloud backend | cloud-backend-infrastructure.md | Telemetry pipeline, data lake, Airflow orchestration |
| Causal reasoning | causal-reasoning-counterfactual.md | SCM methodology for planning decisions (this doc extends to fleet operations) |
2. Anomaly Detection at Fleet Scale
2.1 Fleet KPI Hierarchy
Anomaly detection requires a well-defined metric hierarchy. Each level aggregates from the one below, enabling drill-down from "fleet is unhealthy" to "LiDAR #3 on Vehicle #23 at Airport B has a contaminated lens."
FLEET KPI HIERARCHY
Level 0: Fleet
Metrics: fleet_intervention_rate, fleet_availability, fleet_mission_success_rate
Granularity: 1-hour rolling windows
Normal: intervention_rate < 0.5/100km, availability > 95%, success > 98%
Level 1: Airport
Metrics: airport_{X}_intervention_rate, airport_{X}_availability
Granularity: 1-hour rolling windows
Normal: within 2 sigma of airport historical baseline
Level 2: Vehicle
Metrics: vehicle_{ID}_intervention_rate, mission_completion, avg_speed
Granularity: per-mission and 4-hour rolling
Normal: within 3 sigma of vehicle historical baseline
Level 3: Subsystem
Metrics: perception_mAP, localization_cov_trace, planning_jerk,
control_tracking_error, sensor_health_score, compute_utilization
Granularity: per-mission, 30-min rolling
Normal: perception_mAP > 0.65, loc_cov_trace < 0.1, tracking_error < 0.15m
Level 4: Component
Metrics: lidar_{N}_point_count, lidar_{N}_max_range, gtsam_iteration_count,
centerpoint_latency, frenet_candidate_count, can_bus_error_rate
Granularity: 1-second (streaming), aggregated to 1-min
Normal: component-specific thresholds from commissioning baseline
2.2 Statistical Process Control Methods
Statistical Process Control (SPC) methods detect when a metric shifts from its baseline distribution. Two methods are complementary for fleet operations:
CUSUM (Cumulative Sum Control Chart)
CUSUM is optimal for detecting small persistent shifts in the mean -- exactly the signature of gradual sensor degradation or slow model drift.
class CUSUMDetector:
"""
Tabular CUSUM for detecting mean shifts in fleet KPIs.
Parameters:
k: allowance (slack) -- typically 0.5 * shift_to_detect (in sigma units)
h: decision threshold -- controls false-positive rate
h=4 gives ARL_0 ~168 (false alarm every 168 samples on average)
h=5 gives ARL_0 ~465
For fleet metrics sampled hourly:
h=4: false alarm every ~7 days (acceptable for Tier 1 alerts)
h=5: false alarm every ~19 days (appropriate for Tier 0 escalations)
"""
def __init__(self, k=0.5, h=4.0, baseline_mean=None, baseline_std=None):
self.k = k
self.h = h
self.mu_0 = baseline_mean # estimated from 30+ day commissioning period
self.sigma = baseline_std
self.S_pos = 0.0 # upper CUSUM statistic
self.S_neg = 0.0 # lower CUSUM statistic
self.run_length = 0
def update(self, x):
"""Process new observation. Returns (alarm, direction, run_length)."""
z = (x - self.mu_0) / self.sigma # standardize
self.S_pos = max(0, self.S_pos + z - self.k)
self.S_neg = max(0, self.S_neg - z - self.k)
self.run_length += 1
if self.S_pos > self.h:
self.S_pos = 0.0
return True, 'increase', self.run_length
elif self.S_neg > self.h:
self.S_neg = 0.0
return True, 'decrease', self.run_length
return False, None, self.run_length
def reset_baseline(self, new_mean, new_std):
"""Update baseline after confirmed operational change (e.g., new OTA)."""
self.mu_0 = new_mean
self.sigma = new_std
self.S_pos = 0.0
self.S_neg = 0.0
        self.run_length = 0
EWMA (Exponentially Weighted Moving Average)
EWMA smooths high-frequency noise while preserving trend information. It is better suited than CUSUM to metrics with high variance (e.g., per-mission intervention counts, which are approximately Poisson-distributed).
import math

class EWMADetector:
"""
EWMA control chart for noisy fleet metrics.
lambda_ (smoothing): 0.05-0.25 typical
0.05: sensitive to small shifts, slower to detect large shifts
0.25: detects large shifts quickly, less sensitive to small shifts
For intervention rates (high variance): lambda_=0.10
For sensor health scores (low variance): lambda_=0.20
"""
def __init__(self, lambda_=0.10, L=3.0, baseline_mean=None, baseline_std=None):
self.lambda_ = lambda_
self.L = L # control limit multiplier (3.0 = ~0.27% false alarm)
self.mu_0 = baseline_mean
self.sigma = baseline_std
self.z = baseline_mean # EWMA statistic
self.n = 0
def update(self, x):
"""Process new observation."""
self.n += 1
self.z = self.lambda_ * x + (1 - self.lambda_) * self.z
# Time-varying control limits (accounts for EWMA startup)
sigma_z = self.sigma * math.sqrt(
(self.lambda_ / (2 - self.lambda_)) *
(1 - (1 - self.lambda_) ** (2 * self.n))
)
ucl = self.mu_0 + self.L * sigma_z
lcl = self.mu_0 - self.L * sigma_z
if self.z > ucl:
return True, 'increase', self.z - self.mu_0
elif self.z < lcl:
return True, 'decrease', self.z - self.mu_0
        return False, None, self.z - self.mu_0
Which method for which KPI (a wiring sketch follows the table):
| KPI | Recommended Method | Rationale |
|---|---|---|
| Intervention rate (fleet/airport) | EWMA (lambda=0.10) | High variance (Poisson), need smoothing |
| Perception mAP (per-vehicle) | CUSUM (k=0.5, h=4) | Low variance, detect small persistent drops |
| Localization covariance trace | CUSUM (k=0.3, h=5) | Safety-critical, want low false-alarm rate |
| Sensor health score | EWMA (lambda=0.20) | Moderate variance, want trend sensitivity |
| Mission success rate | EWMA (lambda=0.15) | Binomial variance, need smoothing |
| Compute utilization | CUSUM (k=1.0, h=4) | Detect step changes from code changes |
| Planning jerk/tracking error | EWMA (lambda=0.10) | High variance from scenario diversity |
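To make the table concrete, here is a minimal wiring sketch using the CUSUMDetector and EWMADetector classes above. The baseline means and standard deviations are hypothetical placeholders; in practice they come from the 30+ day commissioning period.

```python
# Wiring sketch: one detector per KPI, parameterized from the table above.
# Baseline statistics below are hypothetical placeholders.
KPI_DETECTORS = {
    'intervention_rate':      EWMADetector(lambda_=0.10, baseline_mean=0.40, baseline_std=0.15),
    'perception_mAP':         CUSUMDetector(k=0.5, h=4.0, baseline_mean=0.72, baseline_std=0.03),
    'localization_cov_trace': CUSUMDetector(k=0.3, h=5.0, baseline_mean=0.05, baseline_std=0.01),
    'sensor_health_score':    EWMADetector(lambda_=0.20, baseline_mean=0.95, baseline_std=0.02),
    'mission_success_rate':   EWMADetector(lambda_=0.15, baseline_mean=0.98, baseline_std=0.01),
    'compute_utilization':    CUSUMDetector(k=1.0, h=4.0, baseline_mean=0.60, baseline_std=0.08),
    'planning_jerk':          EWMADetector(lambda_=0.10, baseline_mean=0.80, baseline_std=0.30),
}

def process_sample(kpi_name, value):
    """Route one observation to its detector; return an alarm dict or None."""
    alarm, direction, _ = KPI_DETECTORS[kpi_name].update(value)
    return {'kpi': kpi_name, 'direction': direction} if alarm else None
```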
2.3 Multivariate Anomaly Detection
Univariate monitors catch single-metric deviations. But fleet anomalies often manifest as coordinated shifts across multiple metrics that individually remain within bounds.
Example: Perception mAP drops 5% (within normal variance) AND localization covariance increases 15% (within normal variance) AND planning jerk increases 10% (within normal variance). Each alone is not alarming. Together they indicate a systematic problem.
Isolation Forest for Fleet Telemetry
Isolation Forest (Liu et al., 2008) is effective for high-dimensional fleet telemetry because it handles mixed feature types, requires no distributional assumptions, and scales linearly.
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler
import numpy as np
class FleetAnomalyDetector:
"""
Multivariate anomaly detection on fleet telemetry snapshots.
Each observation is a fixed-length feature vector representing
one vehicle's state over a time window (e.g., 1 hour).
"""
# Feature vector per vehicle per hour
FEATURES = [
'intervention_count', # count in window
'perception_mAP_mean', # mean mAP over window
'perception_mAP_std', # mAP stability
'localization_cov_trace_mean', # mean position uncertainty
'localization_cov_trace_max', # worst-case uncertainty
'planning_jerk_p95', # 95th percentile jerk
'control_tracking_error_mean', # mean lateral/longitudinal error
'sensor_health_min', # worst sensor health in window
'compute_gpu_util_mean', # GPU utilization
'compute_latency_p99', # 99th percentile end-to-end latency
'speed_mean', # average operating speed
'distance_km', # distance driven in window
'ota_version_hash', # categorical -> ordinal encoding
'map_age_hours', # hours since last map update
'weather_visibility_m', # METAR visibility
'weather_precip_intensity', # precipitation rate
'time_of_day_sin', # cyclic encoding of hour
'time_of_day_cos',
'airport_id', # categorical -> one-hot
]
def __init__(self, contamination=0.05, n_estimators=200):
self.model = IsolationForest(
contamination=contamination, # expect ~5% anomalous observations
n_estimators=n_estimators,
max_samples='auto',
random_state=42,
n_jobs=-1,
)
self.scaler = None # fit during training
def fit(self, historical_data: np.ndarray):
"""Fit on 30+ days of normal fleet operation data."""
self.scaler = RobustScaler().fit(historical_data)
scaled = self.scaler.transform(historical_data)
self.model.fit(scaled)
def score(self, current_snapshot: np.ndarray):
"""
Score current fleet state.
Returns anomaly_scores (lower = more anomalous, <0 = anomaly).
"""
scaled = self.scaler.transform(current_snapshot)
        return self.model.decision_function(scaled)
Autoencoder for Temporal Patterns
When the anomaly is not a point-in-time shift but a temporal pattern (e.g., oscillating localization error that correlates with aircraft departure schedule), a sequence autoencoder captures what Isolation Forest misses.
import torch
import torch.nn as nn
class FleetTemporalAutoencoder(nn.Module):
"""
LSTM autoencoder for detecting anomalous temporal patterns
in per-vehicle telemetry sequences.
Input: (batch, seq_len=24, n_features=19) -- 24 hours x 19 features
Output: reconstruction of input
Anomaly score: reconstruction error (MSE per feature, weighted)
"""
def __init__(self, n_features=19, hidden_dim=64, latent_dim=16):
super().__init__()
# Encoder
self.encoder_lstm = nn.LSTM(
input_size=n_features,
hidden_size=hidden_dim,
num_layers=2,
batch_first=True,
dropout=0.1,
)
self.encoder_fc = nn.Linear(hidden_dim, latent_dim)
# Decoder
self.decoder_fc = nn.Linear(latent_dim, hidden_dim)
self.decoder_lstm = nn.LSTM(
input_size=hidden_dim,
hidden_size=hidden_dim,
num_layers=2,
batch_first=True,
dropout=0.1,
)
self.output_fc = nn.Linear(hidden_dim, n_features)
def encode(self, x):
_, (h_n, _) = self.encoder_lstm(x)
return self.encoder_fc(h_n[-1])
def decode(self, z, seq_len):
h = self.decoder_fc(z).unsqueeze(1).repeat(1, seq_len, 1)
out, _ = self.decoder_lstm(h)
return self.output_fc(out)
def forward(self, x):
z = self.encode(x)
return self.decode(z, x.size(1))
def anomaly_score(self, x, feature_weights=None):
"""
Per-feature weighted reconstruction error.
feature_weights: higher weight for safety-critical features
e.g., perception_mAP: 2.0, localization_cov: 2.0,
intervention_count: 3.0, sensor_health: 1.5
"""
x_hat = self.forward(x)
mse_per_feature = ((x - x_hat) ** 2).mean(dim=1) # (batch, n_features)
if feature_weights is not None:
mse_per_feature = mse_per_feature * feature_weights
        return mse_per_feature.sum(dim=-1)  # scalar anomaly score per sample
2.4 Hierarchical Anomaly Detection
The key to avoiding alert fatigue is hierarchical detection that only escalates when lower levels cannot self-resolve.
HIERARCHICAL ANOMALY DETECTION FLOW
┌────────────────────────────────────────────────┐
│ LEVEL 0: Fleet-wide │
│ EWMA on fleet_intervention_rate │
│ Isolation Forest on fleet telemetry snapshot │
│ Check: Is the fleet as a whole anomalous? │
└───────────────┬────────────────────────────────┘
│ ALARM
▼
┌────────────────────────────────────────────────┐
│ LEVEL 1: Airport decomposition │
│ Per-airport CUSUM on intervention_rate │
│ Question: Which airport(s) drive the anomaly? │
│ Output: Airport B is anomalous, A and C normal │
└───────────────┬────────────────────────────────┘
│ AIRPORT B FLAGGED
▼
┌────────────────────────────────────────────────┐
│ LEVEL 2: Vehicle decomposition │
│ Per-vehicle CUSUM within Airport B │
│ Question: All vehicles or subset? │
│ Output: 12/15 vehicles affected │
│ Note: 12/15 -> likely environmental, not vehicle│
│ 3/15 -> likely vehicle-specific │
└───────────────┬────────────────────────────────┘
│ 12/15 VEHICLES -> ENVIRONMENTAL
▼
┌────────────────────────────────────────────────┐
│ LEVEL 3: Subsystem decomposition │
│ Per-subsystem metrics for affected vehicles │
│ Question: Perception, localization, or planning?│
│ Output: Perception mAP degraded on 10/12, │
│ Localization normal on all 12 │
└───────────────┬────────────────────────────────┘
│ PERCEPTION SUBSYSTEM
▼
┌────────────────────────────────────────────────┐
│ LEVEL 4: Component decomposition │
│ Per-sensor health, per-class mAP, per-range mAP│
│ Question: Which sensor or class or range band? │
│ Output: Personnel class mAP dropped 25%, │
│ other classes stable │
└───────────────┬────────────────────────────────┘
│ PERSONNEL DETECTION REGRESSION
▼
┌────────────────────────────────────────────────┐
│ ATTRIBUTION: Cross-reference with context │
│ OTA v3.2.1 deployed 2 days ago: YES │
│ Weather change: fog onset this week │
│ Conclusion: OTA regression on personnel class │
│ exacerbated by seasonal fog │
└────────────────────────────────────────────────┘
Fraction-of-vehicles heuristic for rapid triage (a code sketch follows the table):
| Fraction Affected | Likely Root Cause Category | Rationale |
|---|---|---|
| > 80% at one airport | Environment, map, airport-wide change | External factor affecting all vehicles equally |
| 30-80% at one airport | OTA regression, partial environment | Subset of scenarios or routes affected |
| < 30% at one airport | Vehicle-specific (sensor, calibration) | Individual vehicle hardware or config |
| > 50% fleet-wide | OTA regression, fleet-wide software bug | Only software changes propagate globally |
| Correlated with time | Weather, time-of-day, shift change | Temporal pattern indicates external driver |
| Correlated with zone | Map, construction, aircraft type at stands | Spatial pattern indicates localized cause |
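A sketch of the heuristic as code. The thresholds mirror the table; the boolean correlation flags are assumed to be computed upstream, and the category labels are illustrative.

```python
def triage_heuristic(frac_airport, frac_fleet,
                     time_correlated=False, zone_correlated=False):
    """Map anomaly spread patterns to likely root-cause categories
    per the table above. Returns hints, not a verdict."""
    hints = []
    if frac_fleet > 0.5:
        hints.append('ota_regression_or_fleet_wide_software_bug')
    if frac_airport > 0.8:
        hints.append('environment_map_or_airport_wide_change')
    elif frac_airport >= 0.3:
        hints.append('ota_regression_or_partial_environment')
    elif frac_airport > 0.0:
        hints.append('vehicle_specific_sensor_or_calibration')
    if time_correlated:
        hints.append('weather_time_of_day_or_shift_change')
    if zone_correlated:
        hints.append('map_construction_or_aircraft_type_at_stands')
    return hints

# Example: 85% of one airport's vehicles affected, no fleet-wide signal:
# triage_heuristic(0.85, 0.25) -> ['environment_map_or_airport_wide_change']
```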
2.5 Alert Fatigue Management
Alert fatigue is the primary failure mode of monitoring systems. The solution is aggressive filtering:
Alert tiering:
| Tier | Criteria | Notification | Response Time |
|---|---|---|---|
| P0: Safety | STL violation, collision/near-miss, geofence breach | Pager + SMS + dashboard | < 5 min |
| P1: Operational | Fleet-level KPI anomaly (CUSUM alarm on L0) | Slack + dashboard | < 1 hour |
| P2: Degradation | Airport-level or multi-vehicle subsystem anomaly | Dashboard + daily digest | < 4 hours |
| P3: Informational | Single-vehicle anomaly, single-sensor degradation | Daily digest | < 24 hours |
| P4: Analytics | Trend detected, no immediate impact | Weekly report | Next planning cycle |
Suppression rules (a minimal implementation sketch follows the list):
- Deduplication: Same root cause on same vehicle within 4 hours -> suppress subsequent alerts
- Known-issue suppression: Acknowledged anomalies with open tickets -> suppress until resolution
- Maintenance window: Alerts during scheduled maintenance -> suppress automatically
- Correlated suppression: If airport-level alert fires, suppress all vehicle-level alerts at that airport for 30 min (give the attribution system time to diagnose)
- Self-resolving: If CUSUM resets within 2 hours (anomaly self-resolves), downgrade to P4
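The suppression rules can be expressed as a small stateful filter. A minimal sketch, assuming a flat alert dict (vehicle_id, cause, level, airport_id, in_maintenance_window) and in-memory state; the self-resolving downgrade is assumed to happen at the detector layer.

```python
import time

class AlertSuppressor:
    """Stateful filter implementing the suppression rules above.
    The alert schema and in-memory state are illustrative assumptions."""
    DEDUP_WINDOW_S = 4 * 3600     # same cause + vehicle within 4 h
    AIRPORT_MUTE_S = 30 * 60      # correlated-suppression window

    def __init__(self):
        self.last_seen = {}           # (vehicle_id, cause) -> timestamp
        self.airport_mute_until = {}  # airport_id -> mute deadline
        self.open_tickets = set()     # acknowledged (vehicle_id, cause)

    def should_emit(self, alert, now=None):
        now = time.time() if now is None else now
        key = (alert['vehicle_id'], alert['cause'])
        if key in self.open_tickets:             # known-issue suppression
            return False
        if alert.get('in_maintenance_window'):   # maintenance suppression
            return False
        if (alert['level'] == 'vehicle' and
                now < self.airport_mute_until.get(alert['airport_id'], 0.0)):
            return False                         # correlated suppression
        if now - self.last_seen.get(key, float('-inf')) < self.DEDUP_WINDOW_S:
            return False                         # deduplication
        self.last_seen[key] = now
        if alert['level'] == 'airport':          # open a mute window
            self.airport_mute_until[alert['airport_id']] = now + self.AIRPORT_MUTE_S
        return True
```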
Target alert volume:
| Fleet Size | Raw Anomalies/Day | After Hierarchical Filtering | After Suppression | Analyst Load |
|---|---|---|---|---|
| 20 vehicles | 5-10 | 2-5 | 1-3 | Manageable (1 analyst) |
| 50 vehicles | 15-30 | 5-10 | 2-5 | Manageable (1-2 analysts) |
| 100 vehicles | 40-80 | 10-20 | 4-8 | Requires 2-3 analysts |
| 200 vehicles | 80-160 | 15-30 | 5-12 | Requires 3-4 analysts |
Without hierarchical filtering and suppression, a 200-vehicle fleet would require 40-80 analysts to manually triage every anomaly. With the system, 3-4 analysts handle the residual that automated attribution cannot resolve.
3. Causal Discovery from Operational Data
3.1 Why Causal, Not Just Correlational
Correlation identifies that fog and high intervention rates co-occur. Causation identifies that fog causes LiDAR range reduction, which causes perception mAP drop, which causes missed personnel detections, which causes operator interventions. The difference matters because:
- Actionable remediation: Correlation says "fog = bad." Causation says "deploy radar-primary mode during fog" or "retrain personnel detector with fog augmentation."
- Multi-cause disentanglement: When fog AND OTA change co-occur, correlation cannot separate their contributions. Causal analysis can.
- Regulatory evidence: EU PLD 2024/2853 requires demonstrating causal contribution to incidents, not mere correlation.
3.2 The Fleet Operations Causal DAG
The causal DAG for fleet operations is distinct from the planning-level DAG in causal-reasoning-counterfactual.md. It operates at a higher abstraction level with different timescales.
FLEET OPERATIONS CAUSAL DAG
Exogenous (uncontrollable):
WEATHER ──────────────────────┐
TIME_OF_DAY ──────────────────┤
AIRCRAFT_MIX ─────────────────┤
CONSTRUCTION ─────────────────┤
OPERATOR_EXPERIENCE ──────────┤
│
Controllable decisions: │
OTA_VERSION ──────────────┐ │
MAP_VERSION ──────────────┤ │
PARAMETER_SET ────────────┤ │
CALIBRATION ──────────────┤ │
│ │
▼ ▼
┌───────────────────┐
WEATHER ─────────→│ SENSOR_HEALTH │
TIME_OF_DAY ─────→│ (degradation rate │
CONSTRUCTION ────→│ per vehicle) │
└────────┬──────────┘
│
▼
OTA_VERSION ──────→ PERCEPTION_QUALITY ←── AIRCRAFT_MIX
SENSOR_HEALTH ───→ │ (new types = OOD)
│
▼
MAP_VERSION ──────→ LOCALIZATION_QUALITY
CONSTRUCTION ────→ │
WEATHER ─────────→ │ (wet surface = GPS multipath)
│
▼
PERCEPTION ──────→ PLANNING_QUALITY ←── PARAMETER_SET
LOCALIZATION ───→ │
│
▼
PLANNING ────────→ CONTROL_QUALITY ←── CALIBRATION
VEHICLE_HEALTH ─→ │
│
▼
CONTROL ─────────→ INTERVENTION_RATE ←── OPERATOR_EXPERIENCE
PERCEPTION ─────→ │ (threshold varies)
ODD_COMPLIANCE ─→ │
│
▼
MISSION_SUCCESS
Variable definitions for the DAG:
| Variable | Type | Source | Granularity |
|---|---|---|---|
| WEATHER | Continuous (visibility, precip, wind, temp) | METAR feed, on-vehicle sensors | 30 min (METAR), 1 Hz (vehicle) |
| TIME_OF_DAY | Cyclic (sin/cos encoding of hour) | System clock | 1 hour |
| AIRCRAFT_MIX | Categorical (type distribution at stands) | Airport FIDS/A-CDM | Per-flight |
| CONSTRUCTION | Binary/categorical per zone | Airport NOTAMs, fleet detection | Daily |
| OPERATOR_EXPERIENCE | Ordinal (hours on system) | HR/training database | Per-shift |
| OTA_VERSION | Categorical (model + code version hash) | Fleet management system | Per-deployment |
| MAP_VERSION | Ordinal (age in hours + version ID) | Map management system | Per-update |
| SENSOR_HEALTH | Continuous [0, 1] per sensor | Health monitor (1 Hz) | Aggregated hourly |
| PERCEPTION_QUALITY | Continuous (mAP, per-class, per-range) | Shadow evaluation or production metrics | Per-mission |
| LOCALIZATION_QUALITY | Continuous (cov trace, RMSE to ground truth) | GTSAM output | 10 Hz, aggregated per-mission |
| PLANNING_QUALITY | Continuous (jerk, tracking error, clearance) | Planning node output | Per-mission |
| INTERVENTION_RATE | Count per 100 km | Operator input log | Per-intervention |
| MISSION_SUCCESS | Binary | Fleet management system | Per-mission |
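To make step 1 of the discovery pipeline in Section 3.6 concrete, the expert-prior DAG above can be transcribed as a plain edge list. The representation below is an assumption for illustration; any graph-library format works.

```python
# Expert-prior edges transcribed from the DAG above, as (cause, effect).
FLEET_EXPERT_DAG = [
    ('WEATHER', 'SENSOR_HEALTH'), ('TIME_OF_DAY', 'SENSOR_HEALTH'),
    ('CONSTRUCTION', 'SENSOR_HEALTH'),
    ('OTA_VERSION', 'PERCEPTION_QUALITY'), ('SENSOR_HEALTH', 'PERCEPTION_QUALITY'),
    ('AIRCRAFT_MIX', 'PERCEPTION_QUALITY'),
    ('MAP_VERSION', 'LOCALIZATION_QUALITY'), ('CONSTRUCTION', 'LOCALIZATION_QUALITY'),
    ('WEATHER', 'LOCALIZATION_QUALITY'),
    ('PERCEPTION_QUALITY', 'PLANNING_QUALITY'), ('LOCALIZATION_QUALITY', 'PLANNING_QUALITY'),
    ('PARAMETER_SET', 'PLANNING_QUALITY'),
    ('PLANNING_QUALITY', 'CONTROL_QUALITY'), ('CALIBRATION', 'CONTROL_QUALITY'),
    ('VEHICLE_HEALTH', 'CONTROL_QUALITY'),
    ('CONTROL_QUALITY', 'INTERVENTION_RATE'), ('PERCEPTION_QUALITY', 'INTERVENTION_RATE'),
    ('OPERATOR_EXPERIENCE', 'INTERVENTION_RATE'), ('ODD_COMPLIANCE', 'INTERVENTION_RATE'),
    ('INTERVENTION_RATE', 'MISSION_SUCCESS'),
]
```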
3.3 NOTEARS for Structure Learning
NOTEARS (Zheng et al., 2018) formulates causal structure learning as a continuous optimization problem, converting the NP-hard combinatorial DAG search into a smooth optimization with an acyclicity constraint.
import numpy as np
from scipy.optimize import minimize
def notears_linear(X, lambda1=0.1, max_iter=100, h_tol=1e-8):
"""
NOTEARS: continuous optimization for DAG structure learning.
X: (n_samples, d_variables) -- fleet telemetry data
lambda1: L1 penalty (sparsity)
Returns: W (d x d) weighted adjacency matrix
W[i,j] != 0 means variable i directly causes variable j
For fleet operations with ~15-20 variables:
d=20 -> 400 parameters -> tractable (seconds on CPU)
Needs n >> d samples: 30 days x 24 hours x 50 vehicles = 36,000 samples
More than sufficient for d=20
"""
n, d = X.shape
def _loss(W):
"""Least-squares loss for linear SEM: X = X @ W + noise."""
M = X @ W
R = X - M
return 0.5 / n * (R ** 2).sum()
    def _h(W):
        """Acyclicity constraint: tr[(I + W∘W/d)^d] - d = 0 iff W is a DAG
        (polynomial surrogate for NOTEARS' tr(e^{W∘W}) - d)."""
        E = np.linalg.matrix_power(np.eye(d) + W * W / d, d)
        return np.trace(E) - d
def _objective(w_flat, rho, alpha):
W = w_flat.reshape(d, d)
loss = _loss(W)
h = _h(W)
penalty = lambda1 * np.abs(W).sum()
return loss + 0.5 * rho * h * h + alpha * h + penalty
# Augmented Lagrangian optimization
W = np.zeros((d, d))
rho, alpha = 1.0, 0.0
for i in range(max_iter):
        result = minimize(
            _objective, W.flatten(), args=(rho, alpha),
            method='L-BFGS-B',
            # jac omitted: _objective returns only the scalar loss, so
            # L-BFGS-B falls back to numerical gradients (slow; supply an
            # autograd gradient in production)
        )
W = result.x.reshape(d, d)
h = _h(W)
if abs(h) < h_tol:
break
alpha += rho * h
rho *= 10
# Threshold small edges
W[np.abs(W) < 0.1] = 0.0
    return W
Practical considerations for fleet data:
- Non-stationarity: Fleet data is non-stationary (seasonal trends, OTA changes, fleet growth). Fit NOTEARS on rolling 90-day windows and compare the learned DAGs across windows for structural stability (see the sketch after this list).
- Mixed data types: Fleet variables mix continuous (mAP), ordinal (map age), and categorical (OTA version). Use NOTEARS-NONLINEAR (Zheng et al., 2020) with MLP basis functions, or discretize continuous variables.
- Latent confounders: Some causes are unobserved (e.g., runway resurfacing affects tire grip and sensor vibration simultaneously, but "runway condition" is not directly measured). FCI algorithm (Spirtes et al., 2000) handles latent confounders by identifying ancestral relationships rather than direct causes.
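A minimal sketch of the rolling-window stability check from the first bullet. The 0.1 edge threshold matches notears_linear above; Jaccard similarity of edge sets is one reasonable metric, not the only choice.

```python
import numpy as np

def dag_edge_set(W, threshold=0.1):
    """Binarize a NOTEARS weight matrix into a set of directed edges."""
    d = W.shape[0]
    return {(i, j) for i in range(d) for j in range(d) if abs(W[i, j]) >= threshold}

def structural_stability(W_prev, W_curr, threshold=0.1):
    """Jaccard similarity of edge sets from two windows. Values near 1.0
    indicate stable structure; a sudden drop suggests non-stationarity
    (OTA change, season, fleet growth) and warrants a re-fit or review."""
    e1 = dag_edge_set(W_prev, threshold)
    e2 = dag_edge_set(W_curr, threshold)
    if not e1 and not e2:
        return 1.0
    return len(e1 & e2) / len(e1 | e2)
```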
3.4 PC Algorithm for Constraint-Based Discovery
The PC algorithm (Spirtes et al., 2000) discovers causal structure via conditional independence tests. It complements NOTEARS by making weaker parametric assumptions.
from causallearn.search.ConstraintBased.PC import pc
from causallearn.utils.cit import fisherz
def discover_fleet_dag(telemetry_df, alpha=0.05):
"""
PC algorithm for fleet causal structure discovery.
Uses Fisher-z test for conditional independence (assumes Gaussian;
for non-Gaussian, use KCI or RCIT kernel-based tests).
alpha: significance level for conditional independence tests
0.01 -> conservative (fewer edges, risk missing real causes)
0.05 -> moderate (standard choice)
0.10 -> aggressive (more edges, risk spurious causes)
"""
data = telemetry_df[FLEET_VARIABLES].values
cg = pc(
data,
alpha=alpha,
indep_test=fisherz,
stable=True, # order-independent (recommended for reproducibility)
uc_rule=0, # conservative orientation
uc_priority=-1, # background knowledge can override
)
    # Apply background knowledge (known causal directions):
    #   weather cannot be caused by vehicle state;
    #   OTA version cannot be caused by perception quality.
    # These constraints reduce false discoveries. Note: this dict is
    # illustrative and not consumed by pc() as written; in practice,
    # encode the constraints with causallearn's BackgroundKnowledge
    # class and pass them to the search.
    background_knowledge = {
        ('WEATHER', '*'): 'forbidden',      # nothing causes weather
        ('TIME_OF_DAY', '*'): 'forbidden',  # nothing causes time
        ('*', 'OTA_VERSION'): 'forbidden',  # OTA is a decision, not caused by operations
    }
    return cg.G, cg.sepset  # graph + separation sets for causal reasoning
3.5 Granger Causality for Temporal Relationships
Some causal relationships are purely temporal: an OTA deployment today causes intervention rate increase tomorrow. Granger causality captures these time-lagged relationships.
from statsmodels.tsa.stattools import grangercausalitytests
def fleet_granger_analysis(telemetry_df, max_lag=48, alpha=0.05):
"""
Pairwise Granger causality tests across fleet variables.
max_lag=48: test up to 48 hours of lag (covers OTA effects that
may take 1-2 days to manifest as interventions).
Returns: dictionary of significant Granger-causal relationships
with optimal lag and p-value.
"""
variables = [
'intervention_rate', 'perception_mAP', 'localization_cov',
'sensor_health_min', 'weather_visibility', 'ota_version_age',
'map_age_hours', 'compute_latency_p99',
]
results = {}
for cause_var in variables:
for effect_var in variables:
if cause_var == effect_var:
continue
data = telemetry_df[[effect_var, cause_var]].dropna()
try:
gc_result = grangercausalitytests(
data.values, maxlag=max_lag, verbose=False
)
# Find optimal lag (lowest p-value)
best_lag = min(
gc_result.keys(),
key=lambda lag: gc_result[lag][0]['ssr_ftest'][1]
)
p_value = gc_result[best_lag][0]['ssr_ftest'][1]
if p_value < alpha:
results[(cause_var, effect_var)] = {
'lag_hours': best_lag,
'p_value': p_value,
'f_stat': gc_result[best_lag][0]['ssr_ftest'][0],
}
except Exception:
continue # insufficient data or singular matrix
    return results
Expected Granger-causal relationships for an airside fleet (a usage sketch follows the table):
| Cause | Effect | Expected Lag | Mechanism |
|---|---|---|---|
| ota_version_age | intervention_rate | 6-48 hours | OTA deployment -> operational exposure -> incidents |
| weather_visibility | perception_mAP | 0-1 hours | Near-immediate (fog onset -> range reduction) |
| weather_visibility | sensor_health_min | 2-24 hours | Delayed (moisture -> lens contamination) |
| map_age_hours | localization_cov | 0-4 hours | Progressive (map drift -> localization uncertainty) |
| perception_mAP | intervention_rate | 0-2 hours | Near-immediate (missed detection -> intervention) |
| sensor_health_min | perception_mAP | 0-1 hours | Near-immediate (degraded sensor -> degraded perception) |
| compute_latency_p99 | planning_quality | 0 hours | Immediate (latency -> stale plan) |
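A usage sketch for the function above, assuming a hypothetical hourly pandas DataFrame telemetry_df with the listed columns. It checks a subset of discovered relationships against the expected lags in the table.

```python
# Hypothetical `telemetry_df`; EXPECTED_LAGS covers a subset of the table.
EXPECTED_LAGS = {
    ('ota_version_age', 'intervention_rate'): (6, 48),
    ('weather_visibility', 'perception_mAP'): (0, 1),
    ('map_age_hours', 'localization_cov'): (0, 4),
}

found = fleet_granger_analysis(telemetry_df, max_lag=48)
for (cause, effect), res in found.items():
    expected = EXPECTED_LAGS.get((cause, effect))
    if expected is None:
        status = 'unexpected pair -- review'
    else:
        lo, hi = expected
        status = 'as expected' if lo <= res['lag_hours'] <= hi else 'unexpected lag'
    print(f"{cause} -> {effect}: lag={res['lag_hours']}h, "
          f"p={res['p_value']:.4f} ({status})")
```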
3.6 Combining Discovery Methods
No single method is sufficient. The practical approach combines all three:
CAUSAL DISCOVERY PIPELINE
1. Expert prior DAG
- Encode known physics (weather -> sensor degradation)
- Encode known architecture (perception -> planning -> control)
- Encode known impossibilities (vehicle state cannot cause weather)
Result: skeleton with ~60-70% of edges pre-specified
2. NOTEARS on 90-day fleet data
- Discover data-driven edges not in expert prior
- Validate expert edges (are they supported by data?)
- Weight edges by coefficient magnitude
Result: data-augmented DAG
3. PC algorithm as validation
- Independent method with different assumptions
- Edges present in BOTH NOTEARS and PC have high confidence
- Edges in only one method flagged for human review
Result: confidence-annotated DAG
4. Granger causality for temporal structure
- Add lag information to each edge
- Discover temporal-only relationships missed by instantaneous methods
Result: temporally-annotated causal DAG
5. Human expert review
- Review edges present in data but not in prior (potential discovery)
- Review edges in prior but not in data (potential data limitation)
- Final DAG approved for production use
Result: validated fleet operations causal DAG
Update cadence: Re-run quarterly or after major fleet changes (new airport,
new vehicle type, major OTA)
4. Shapley-Value Attribution
4.1 The Attribution Question
Given that the fleet is anomalous, how much of the anomaly is attributable to each factor? This is precisely the Shapley value question from cooperative game theory.
The game: Players are the potential root causes (weather, OTA version, map age, sensor health, etc.). The "payoff" is the anomaly score. The Shapley value of each player is its fair marginal contribution to the anomaly, averaged over all possible coalitions.
Formal definition:
phi_i = (1/|N|!) * sum over all permutations pi of N:
[v(S_pi^i union {i}) - v(S_pi^i)]
where:
N = set of all potential causes
S_pi^i = set of causes that appear before i in permutation pi
v(S) = anomaly score when only causes in S are at anomalous levels
and all others are at baseline levels
4.2 SHAP for Anomaly Model Attribution
SHAP (SHapley Additive exPlanations) computes Shapley values efficiently for ML model predictions. Applied to the Isolation Forest anomaly model, it answers "which features contribute most to this observation being anomalous?"
import math
import numpy as np
import shap
class AnomalyAttributor:
"""
SHAP-based attribution for fleet anomaly scores.
Given an anomalous fleet state, decomposes the anomaly score
into per-feature contributions using Shapley values.
"""
def __init__(self, anomaly_model, background_data):
"""
anomaly_model: trained IsolationForest or autoencoder
background_data: representative normal fleet data (100-1000 samples)
"""
self.model = anomaly_model
# TreeExplainer for IsolationForest (exact Shapley, fast)
# KernelSHAP for autoencoder (approximate, slower)
if hasattr(anomaly_model, 'estimators_'):
self.explainer = shap.TreeExplainer(
anomaly_model,
data=background_data,
feature_perturbation='interventional', # causal, not observational
)
else:
            self.explainer = shap.KernelExplainer(
                anomaly_model.predict,
                background_data,
            )  # pass nsamples=500 (approximation budget) to shap_values()
def attribute(self, anomalous_observation):
"""
Returns per-feature Shapley values for the anomaly.
Positive value: feature contributes to anomaly (anomalous direction)
Negative value: feature is normal (counteracts anomaly)
Sum of all values = anomaly_score - expected_anomaly_score
"""
shap_values = self.explainer.shap_values(anomalous_observation)
# Rank features by absolute contribution
feature_names = FleetAnomalyDetector.FEATURES
contributions = sorted(
zip(feature_names, shap_values[0]),
key=lambda x: abs(x[1]),
reverse=True,
)
return contributions
def attribute_fleet(self, fleet_snapshot):
"""
Attribute fleet-level anomaly to airports and vehicles.
Uses distributed Shapley: treats each vehicle as a "player"
and computes its marginal contribution to the fleet anomaly.
"""
n_vehicles = len(fleet_snapshot)
vehicle_shapley = np.zeros(n_vehicles)
# For tractable fleet sizes (< 200), use permutation sampling
n_permutations = min(1000, math.factorial(n_vehicles))
for _ in range(n_permutations):
perm = np.random.permutation(n_vehicles)
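            # _fleet_anomaly_score is a stack-specific helper (not shown):
            # it scores an arbitrary coalition of vehicle rows as a fleet.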
prev_score = self._fleet_anomaly_score(fleet_snapshot[:0]) # empty
for i, vehicle_idx in enumerate(perm):
coalition = fleet_snapshot[perm[:i+1]]
curr_score = self._fleet_anomaly_score(coalition)
vehicle_shapley[vehicle_idx] += (curr_score - prev_score)
prev_score = curr_score
vehicle_shapley /= n_permutations
        return vehicle_shapley
4.3 Hierarchical Shapley Decomposition
For fleet attribution, a flat Shapley over all features x vehicles is computationally intractable. Hierarchical decomposition mirrors the KPI hierarchy:
HIERARCHICAL SHAPLEY DECOMPOSITION
Step 1: Fleet -> Airport attribution
Players: {Airport_A, Airport_B, Airport_C}
Game: v(S) = fleet anomaly score with only airports in S contributing
Result: "Airport B contributes 72% of fleet anomaly"
Cost: 2^3 = 8 evaluations (exact Shapley feasible)
Step 2: Airport B -> Vehicle attribution
Players: {Vehicle_1, ..., Vehicle_15} at Airport B
Game: v(S) = Airport B anomaly score with only vehicles in S
Result: "Vehicles 3, 7, 11 contribute 60% of Airport B anomaly"
Cost: ~1000 permutation samples (approximate Shapley)
Step 3: Per-vehicle -> Feature attribution
Players: {perception_mAP, localization_cov, sensor_health, ...}
Game: v(S) = vehicle anomaly score with only features in S at anomalous level
Result: "perception_mAP accounts for 45% of Vehicle 7's anomaly"
Cost: TreeSHAP = exact in O(TLD) per observation (milliseconds)
Total: Fleet anomaly decomposed as:
"72% Airport B, of which 60% from 3 vehicles,
of which perception mAP drop is the primary driver,
specifically personnel class detection in fog conditions."
4.4 Computational Cost and Approximations
| Method | Exact Cost | Approximation | Accuracy | Wall-Clock Time |
|---|---|---|---|---|
| TreeSHAP (per observation) | O(TLD) | Exact | 100% | < 10 ms |
| KernelSHAP (per observation) | O(2^M) | M=500 samples | ~95% | 1-5 seconds |
| Permutation SHAP (fleet) | O(N! * eval_cost) | 1000 permutations | ~90% | 5-30 seconds |
| Hierarchical SHAP (full) | O(levels * per_level) | Per above | ~85-90% | 10-60 seconds |
For real-time fleet dashboards, pre-compute SHAP values on the most recent anomalous snapshot and cache. Update every 15-60 minutes.
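Step 1 of the hierarchy involves only 2^3 = 8 coalition evaluations, so exact Shapley is feasible. A minimal sketch, assuming a caller-supplied value function v that maps a frozenset of airports to the fleet anomaly score when only those airports contribute (the names here are hypothetical):

```python
import math
from itertools import combinations

def exact_shapley(players, v):
    """Exact Shapley values for a small player set (e.g., airports).
    players: list of IDs; v: callable, frozenset -> anomaly score.
    Cost: O(2^n) evaluations of v -- feasible for n <= ~15 players."""
    n = len(players)
    phi = {p: 0.0 for p in players}
    for p in players:
        others = [q for q in players if q != p]
        for k in range(n):
            for S in combinations(others, k):
                S = frozenset(S)
                w = math.factorial(k) * math.factorial(n - k - 1) / math.factorial(n)
                phi[p] += w * (v(S | {p}) - v(S))
    return phi

# Hypothetical usage for Step 1 (airport attribution); normalizing phi by
# v(all) - v(empty) yields shares like "Airport B contributes 72%":
# phi = exact_shapley(['Airport_A', 'Airport_B', 'Airport_C'], fleet_anomaly_v)
```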
4.5 Shapley for Cross-Airport Comparison
A powerful application: when multiple airports operate, Shapley values answer "does Airport B's anomaly look like a known pattern?"
def compute_airport_shapley_profile(airport_data, reference_profiles, attributor):
"""
Compare current airport anomaly against known root-cause profiles.
reference_profiles: dictionary of known anomaly signatures
e.g., {
'ota_regression': [high_perception, normal_localization, normal_sensor, ...],
'fog_event': [moderate_perception, normal_localization, degraded_sensor, ...],
'map_stale': [normal_perception, high_localization, normal_sensor, ...],
}
Returns: similarity to each known profile (cosine similarity of SHAP vectors)
"""
current_shap = attributor.attribute(airport_data)
current_vec = np.array([v for _, v in current_shap])
similarities = {}
for cause_name, ref_vec in reference_profiles.items():
cos_sim = np.dot(current_vec, ref_vec) / (
np.linalg.norm(current_vec) * np.linalg.norm(ref_vec) + 1e-8
)
similarities[cause_name] = cos_sim
    return similarities  # highest similarity = most likely root cause category
5. Bayesian Diagnosis Trees
5.1 Sequential Diagnosis as Bayesian Inference
When the anomaly detection and Shapley attribution narrow the problem to a small set of hypotheses, Bayesian diagnosis systematically eliminates alternatives by querying the cheapest diagnostic tests first.
The diagnosis as a decision tree:
Fleet anomaly detected
│
┌──────────┴──────────┐
│ Check: How many │
│ vehicles affected? │
└──────────┬──────────┘
┌─────────┴─────────┐
│ │
> 50% < 50%
(fleet-wide) (subset)
│ │
┌──────┴──────┐ ┌──────┴──────┐
│ Check: OTA │ │ Check: Same │
│ deployed in │ │ route/zone? │
│ last 72h? │ └──────┬──────┘
└──────┬──────┘ ┌─────┴─────┐
┌─────┴─────┐ │ │
YES NO YES NO
│ │ │ │
┌──────┴──┐ ┌────┴────┐ │ ┌──────┴──────┐
│Check: │ │Check: │ │ │Check: Same │
│Per-class│ │METAR │ │ │vehicle(s)? │
│mAP drop?│ │change? │ │ └──────┬──────┘
└────┬────┘ └────┬────┘ │ ┌─────┴─────┐
┌──┴──┐ ┌──┴──┐ │ YES NO
YES NO YES NO │ │ (random)
│ │ │ │ │ │
OTA Software Weather Map Sensor Investigate
regression bug change stale degradation further
5.2 Prior Probabilities from Fleet History
Prior probabilities are calibrated from historical fleet data. These update as the fleet matures and accumulates incidents.
import math

class BayesianDiagnosisEngine:
"""
Bayesian root-cause diagnosis with sequential evidence gathering.
Maintains posterior probabilities over 8 root cause hypotheses,
updates with evidence from diagnostic tests.
"""
# Prior probabilities (calibrated from industry data + fleet history)
# These should be updated quarterly from actual fleet incident data
ROOT_CAUSES = {
'sensor_degradation': 0.25, # most common in airside operations
'weather_environment': 0.20, # second most common
'ota_model_regression': 0.15, # happens with every OTA
'map_staleness': 0.12, # increases with construction frequency
'software_bug': 0.10, # decreases as codebase matures
'operator_behavior': 0.08, # new operators, policy changes
'new_aircraft_type': 0.05, # seasonal schedule changes
'construction_layout': 0.05, # airport development
}
# Likelihood functions: P(evidence | root_cause)
# Each test produces a binary or categorical result
# Likelihoods estimated from historical diagnosis data
LIKELIHOODS = {
'fraction_vehicles_affected': {
# P(>50% vehicles affected | cause)
'high': {
'sensor_degradation': 0.15, # usually individual vehicles
'weather_environment': 0.85, # affects all vehicles
'ota_model_regression': 0.75, # affects all vehicles with new OTA
'map_staleness': 0.60, # affects vehicles on stale-map routes
'software_bug': 0.70, # affects all vehicles with new code
'operator_behavior': 0.10, # individual operator variation
'new_aircraft_type': 0.30, # affects vehicles at specific stands
'construction_layout': 0.20, # affects vehicles in specific zones
},
},
'ota_deployed_recently': {
# P(OTA deployed in last 72h | cause)
'yes': {
'sensor_degradation': 0.10, # coincidental
'weather_environment': 0.10, # coincidental
'ota_model_regression': 0.95, # direct cause
'map_staleness': 0.10, # coincidental
'software_bug': 0.80, # code change introduces bug
'operator_behavior': 0.05, # coincidental
'new_aircraft_type': 0.05, # coincidental
'construction_layout': 0.05, # coincidental
},
},
'per_class_mAP_drop': {
# P(specific class mAP degraded | cause)
'yes': {
'sensor_degradation': 0.40, # affects all classes (range-dependent)
'weather_environment': 0.50, # affects range-dependent classes
'ota_model_regression': 0.85, # often class-specific regression
'map_staleness': 0.05, # map does not affect detection
'software_bug': 0.30, # can affect specific detection paths
'operator_behavior': 0.02, # does not affect perception
'new_aircraft_type': 0.70, # new type has different features
'construction_layout': 0.10, # may cause OOD detections
},
},
'metar_changed': {
# P(significant METAR change in last 24h | cause)
'yes': {
'sensor_degradation': 0.20, # weather accelerates degradation
'weather_environment': 0.90, # direct indicator
'ota_model_regression': 0.10, # coincidental
'map_staleness': 0.05, # coincidental
'software_bug': 0.05, # coincidental
'operator_behavior': 0.10, # weather may change behavior
'new_aircraft_type': 0.05, # coincidental
'construction_layout': 0.05, # coincidental
},
},
'localization_degraded': {
# P(localization quality degraded | cause)
'yes': {
'sensor_degradation': 0.40, # LiDAR degradation affects GTSAM
'weather_environment': 0.30, # GPS multipath in rain, fog
'ota_model_regression': 0.05, # OTA rarely changes localization
'map_staleness': 0.85, # primary indicator of map issues
'software_bug': 0.20, # can affect GTSAM pipeline
'operator_behavior': 0.02, # does not affect localization
'new_aircraft_type': 0.05, # does not affect localization
'construction_layout': 0.75, # layout change = map mismatch
},
},
'zone_correlated': {
# P(anomaly concentrated in specific zone | cause)
'yes': {
'sensor_degradation': 0.10, # not zone-dependent
'weather_environment': 0.15, # sometimes micro-climate zones
'ota_model_regression': 0.15, # usually global
'map_staleness': 0.70, # construction is localized
'software_bug': 0.10, # usually global
'operator_behavior': 0.20, # operators assigned to zones
'new_aircraft_type': 0.80, # new aircraft at specific stands
'construction_layout': 0.90, # construction is localized
},
},
}
def __init__(self):
self.posteriors = dict(self.ROOT_CAUSES) # start with priors
self.evidence_collected = []
def update(self, test_name, result):
"""
Bayesian update: P(cause | evidence) proportional to
P(evidence | cause) * P(cause)
"""
likelihoods = self.LIKELIHOODS[test_name][result]
# Bayes rule
new_posteriors = {}
for cause, prior in self.posteriors.items():
likelihood = likelihoods.get(cause, 0.5) # default uniform
new_posteriors[cause] = likelihood * prior
# Normalize
total = sum(new_posteriors.values())
for cause in new_posteriors:
new_posteriors[cause] /= total
self.posteriors = new_posteriors
self.evidence_collected.append((test_name, result))
return self.get_diagnosis()
def get_diagnosis(self):
"""Return ranked hypotheses with confidence."""
ranked = sorted(
self.posteriors.items(),
key=lambda x: x[1],
reverse=True
)
top_cause, top_prob = ranked[0]
second_cause, second_prob = ranked[1]
confidence = 'high' if top_prob > 0.6 else (
'medium' if top_prob > 0.35 else 'low'
)
return {
'top_hypothesis': top_cause,
'probability': top_prob,
'confidence': confidence,
'ratio_to_second': top_prob / max(second_prob, 1e-6),
'all_hypotheses': ranked,
'evidence_count': len(self.evidence_collected),
}
def next_best_test(self):
"""
Select the diagnostic test with highest expected information gain.
Uses entropy reduction: pick the test that minimizes expected
posterior entropy (equivalently, maximizes mutual information
between test result and root cause).
"""
remaining_tests = [
t for t in self.LIKELIHOODS.keys()
if t not in [e[0] for e in self.evidence_collected]
]
best_test = None
best_info_gain = -1
current_entropy = self._entropy(self.posteriors)
for test in remaining_tests:
expected_posterior_entropy = 0
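            # Note: the likelihood tables above list only one outcome per
            # test; a full implementation would also enumerate the
            # complementary outcome with likelihood 1 - P(result | cause).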
for result in self.LIKELIHOODS[test].keys():
# P(result) = sum over causes of P(result|cause) * P(cause)
p_result = sum(
self.LIKELIHOODS[test][result].get(c, 0.5) * p
for c, p in self.posteriors.items()
)
# Posterior after observing this result
temp_posteriors = {}
for cause, prior in self.posteriors.items():
l = self.LIKELIHOODS[test][result].get(cause, 0.5)
temp_posteriors[cause] = l * prior
total = sum(temp_posteriors.values())
for c in temp_posteriors:
temp_posteriors[c] /= max(total, 1e-10)
posterior_entropy = self._entropy(temp_posteriors)
expected_posterior_entropy += p_result * posterior_entropy
info_gain = current_entropy - expected_posterior_entropy
if info_gain > best_info_gain:
best_info_gain = info_gain
best_test = test
return best_test, best_info_gain
@staticmethod
def _entropy(distribution):
return -sum(
p * math.log2(p + 1e-10) for p in distribution.values()
        )
5.3 Worked Example: The Tuesday Morning Problem
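The trace below can be reproduced by driving the engine from Section 5.2 with the four test results; a minimal driver sketch (the exact posterior values follow from the likelihood tables above):

```python
# Driver sketch reproducing the diagnosis trace below.
engine = BayesianDiagnosisEngine()
for test, result in [
    ('fraction_vehicles_affected', 'high'),  # 12/15 vehicles
    ('ota_deployed_recently', 'yes'),        # v3.2.1, 2 days ago
    ('per_class_mAP_drop', 'yes'),           # personnel class -25%
    ('metar_changed', 'yes'),                # fog onset, visibility 800 m
]:
    diagnosis = engine.update(test, result)
    print(test, '->', diagnosis['top_hypothesis'],
          f"P={diagnosis['probability']:.2f} ({diagnosis['confidence']})")
# Between steps, engine.next_best_test() picks the next query by
# expected information gain, as shown in the trace.
```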
BAYESIAN DIAGNOSIS TRACE
Initial priors:
sensor_degradation: 0.25
weather_environment: 0.20
ota_model_regression: 0.15
map_staleness: 0.12
software_bug: 0.10
operator_behavior: 0.08
new_aircraft_type: 0.05
construction_layout: 0.05
Step 1: Check fraction of vehicles affected -> HIGH (12/15 = 80%)
Updated posteriors:
weather_environment: 0.37 (+0.17) <-- now leads
ota_model_regression: 0.24 (+0.09)
software_bug: 0.15 (+0.05)
map_staleness: 0.12 (stable)
sensor_degradation: 0.06 (-0.19) <-- eliminated (individual vehicle issue)
...
Confidence: medium (0.37, just above the 0.35 threshold)
Next best test: ota_deployed_recently (info gain: 0.41 bits)
Step 2: Check OTA deployed recently -> YES (v3.2.1, 2 days ago)
Updated posteriors:
ota_model_regression: 0.51 (+0.27) <-- now leads strongly
software_bug: 0.27 (+0.12)
weather_environment: 0.08 (-0.29) <-- weather unlikely given OTA timing
...
Confidence: medium (0.51)
Next best test: per_class_mAP_drop (info gain: 0.35 bits)
Step 3: Check per-class mAP -> YES (personnel class down 25%)
Updated posteriors:
ota_model_regression: 0.68 (+0.17) <-- high confidence
software_bug: 0.13 (-0.14)
new_aircraft_type: 0.06 (stable)
...
Confidence: high (0.68)
Ratio to second: 5.2x
Step 4: Check METAR change -> YES (fog onset, visibility 800m)
Updated posteriors:
ota_model_regression: 0.59 (-0.09)
weather_environment: 0.22 (+0.14) <-- weather is contributing factor
...
Note: Both OTA and weather are contributing. Multi-cause scenario.
DIAGNOSIS: OTA model regression on personnel class (primary, P=0.59)
exacerbated by seasonal fog onset (secondary, P=0.22)
RECOMMENDED ACTIONS:
1. Immediate: Rollback OTA v3.2.1 to v3.2.0 on Airport B fleet
2. Short-term: Retrain personnel detection with fog augmentation
3. Medium-term: Add fog-specific validation gate to CI/CD pipeline
Time to diagnosis: 4 automated tests, < 2 minutes
Manual triage equivalent: 8-16 hours
5.4 Multi-Hypothesis Tracking
When posterior remains split between two or more hypotheses after all automated tests, maintain parallel investigation tracks:
class MultiHypothesisTracker:
"""
When Bayesian diagnosis cannot reach high confidence,
maintain and investigate multiple hypotheses in parallel.
Trigger: top hypothesis probability < 0.50 after all automated tests
"""
def __init__(self, diagnosis_result, max_hypotheses=3):
self.hypotheses = [
(cause, prob)
for cause, prob in diagnosis_result['all_hypotheses'][:max_hypotheses]
]
self.investigations = {}
def generate_investigation_plan(self):
"""
For each hypothesis, generate the specific diagnostic steps
that would confirm or refute it.
"""
plans = {
'sensor_degradation': [
'Run per-sensor health diagnostic on all affected vehicles',
'Compare point cloud density across affected vs unaffected vehicles',
'Check cleaning/maintenance logs for affected vehicles',
],
'ota_model_regression': [
'Run shadow evaluation of v_old vs v_new on captured rosbags',
'Compare per-class mAP between canary and control vehicles',
'Check model training logs for data distribution changes',
],
'weather_environment': [
'Correlate intervention timestamps with METAR visibility',
'Check if interventions cluster in specific time-of-day',
'Compare same vehicles on clear-weather days (control)',
],
'map_staleness': [
'Compare current SLAM output to reference map for discrepancies',
'Check AMDB/NOTAM for recent changes at affected zones',
'Measure localization RMSE in affected vs unaffected zones',
],
'new_aircraft_type': [
'Cross-reference intervention locations with aircraft stand assignments',
'Check FIDS for new airline/aircraft type at affected stands',
'Review false-negative detections for unknown aircraft geometry',
],
'construction_layout': [
'Query NOTAM feed for active construction at airport',
'Compare fleet SLAM maps from this week vs last week',
'Check if interventions cluster in specific geographic zones',
],
'operator_behavior': [
'Compare intervention rates by operator ID',
'Check for new operators on shift during anomaly window',
'Review intervention justification notes for patterns',
],
'software_bug': [
'Check error logs for new exception types or increased error rates',
'Review code changes deployed in same OTA as model update',
'Run affected scenarios in SIL with verbose logging',
],
}
return {
cause: plans.get(cause, ['Manual investigation required'])
for cause, prob in self.hypotheses
        }
6. OTA Regression Attribution
6.1 The OTA Regression Problem
Every OTA model update risks regression on specific classes, scenarios, or conditions that were not adequately represented in the validation set. The CI/CD pipeline (see ../../40-runtime-systems/ml-deployment/av-cicd-devops-pipeline.md) catches many regressions pre-deployment, but production-time regressions occur because:
- Validation set coverage gap: Offline validation cannot cover all airport x weather x aircraft x time-of-day combinations
- Distribution shift: The training distribution may not match the current operational distribution (e.g., new seasonal fog pattern)
- Interaction effects: Model performs well on fog OR on new aircraft, but fails on fog AND new aircraft together
- Calibration shift: Model accuracy unchanged but confidence calibration drifts, causing different downstream decisions
6.2 A/B Testing Between OTA Versions
The gold standard for OTA regression detection is a controlled A/B test within the fleet.
OTA A/B TEST ARCHITECTURE
Airport B fleet (15 vehicles):
Control group (5 vehicles): Treatment group (10 vehicles):
OTA v3.2.0 (previous) OTA v3.2.1 (new)
Vehicle IDs: 1, 4, 7, 11, 14 Vehicle IDs: 2, 3, 5, 6, 8, 9, 10, 12, 13, 15
Assignment: stratified random
- Same route distribution between groups
- Same shift/operator distribution
- Same vehicle model mix
Duration: 7-14 days (minimum for statistical power)
Primary metric: intervention_rate (events per 100 km)
Secondary metrics: per_class_mAP, localization_cov, mission_success_rate
Analysis: two-sample t-test (parametric) or Mann-Whitney U (non-parametric)
Power: 80% to detect 50% intervention rate increase
Required sample size: ~200 missions per group (achievable in 3-5 days)
Automated regression detection from A/B results:
from scipy import stats
class OTARegressionDetector:
"""
Automated detection of OTA model regression from fleet A/B test.
"""
def __init__(self, significance_level=0.05, effect_size_threshold=0.20):
self.alpha = significance_level
self.min_effect = effect_size_threshold # 20% relative change
def detect_regression(self, control_metrics, treatment_metrics):
"""
Compare control (old OTA) vs treatment (new OTA) metrics.
Returns: regression detected (bool), details (dict)
"""
results = {}
for metric_name in control_metrics.columns:
control = control_metrics[metric_name].dropna()
treatment = treatment_metrics[metric_name].dropna()
# Mann-Whitney U test (non-parametric, handles non-normal distributions)
stat, p_value = stats.mannwhitneyu(
control, treatment, alternative='two-sided'
)
# Effect size (rank-biserial correlation)
n1, n2 = len(control), len(treatment)
effect_size = 1 - (2 * stat) / (n1 * n2)
# Relative change
relative_change = (treatment.mean() - control.mean()) / (
control.mean() + 1e-10
)
is_regression = (
p_value < self.alpha and
abs(relative_change) > self.min_effect and
self._is_degradation(metric_name, relative_change)
)
results[metric_name] = {
'p_value': p_value,
'effect_size': effect_size,
'relative_change': relative_change,
'control_mean': control.mean(),
'treatment_mean': treatment.mean(),
'is_regression': is_regression,
}
any_regression = any(r['is_regression'] for r in results.values())
return any_regression, results
@staticmethod
def _is_degradation(metric_name, relative_change):
"""Determine if change direction is a regression."""
# Higher is worse for these metrics
worse_if_higher = [
'intervention_rate', 'localization_cov', 'planning_jerk',
'control_tracking_error', 'compute_latency',
]
# Higher is better for these metrics
worse_if_lower = [
'perception_mAP', 'mission_success_rate', 'sensor_health',
]
if metric_name in worse_if_higher:
return relative_change > 0
elif metric_name in worse_if_lower:
return relative_change < 0
        return False
6.3 Per-Class and Per-Scenario Regression Detection
Overall mAP may improve while a specific class regresses -- the "accuracy paradox" for safety-critical systems.
class PerClassRegressionDetector:
"""
Detect per-class regression that may be masked by overall improvement.
Example: OTA v3.2.1 improves aircraft detection +3% but regresses
personnel detection -8%. Overall mAP: +1%. But personnel regression
is safety-critical and must be caught.
"""
SAFETY_CRITICAL_CLASSES = {
'personnel': {'min_mAP': 0.60, 'max_regression': 0.03},
'aircraft': {'min_mAP': 0.70, 'max_regression': 0.05},
'fod': {'min_mAP': 0.50, 'max_regression': 0.02},
}
def check_per_class(self, old_model_eval, new_model_eval):
"""
Compare per-class metrics between old and new model.
old_model_eval / new_model_eval: dict[class_name -> mAP]
"""
regressions = []
for cls, thresholds in self.SAFETY_CRITICAL_CLASSES.items():
old_mAP = old_model_eval.get(cls, 0)
new_mAP = new_model_eval.get(cls, 0)
delta = new_mAP - old_mAP
if delta < -thresholds['max_regression']:
regressions.append({
'class': cls,
'old_mAP': old_mAP,
'new_mAP': new_mAP,
'regression': abs(delta),
'severity': 'critical' if cls in ('personnel', 'fod') else 'high',
'below_minimum': new_mAP < thresholds['min_mAP'],
})
            # Flag classes below the absolute floor even without a large delta
            # (the regression entry above already carries 'below_minimum')
            if new_mAP < thresholds['min_mAP'] and delta >= -thresholds['max_regression']:
regressions.append({
'class': cls,
'new_mAP': new_mAP,
'minimum': thresholds['min_mAP'],
'severity': 'critical',
'reason': 'below_absolute_minimum',
})
        return regressions
6.4 Counterfactual OTA Analysis
For each intervention that occurred after an OTA update, ask: "Would this intervention have occurred on the previous model version?"
class CounterfactualOTAAnalysis:
"""
Counterfactual analysis: replay intervention events through
the previous model version to determine if OTA caused them.
Requires: rosbag capture from intervention events (standard trigger
in data-flywheel-airside.md) + previous model version available
in artifact registry.
"""
def analyze_intervention(self, rosbag_path, old_model, new_model):
"""
Replay intervention rosbag through both model versions.
Returns: counterfactual result
- 'ota_caused': intervention would NOT have occurred with old model
- 'ota_independent': intervention WOULD have occurred with old model
- 'ota_exacerbated': intervention occurred with both, but new model
responded later/worse
"""
        # Extract sensor data from rosbag (extract_sensor_frames, planning_sim,
        # check_intervention_needed, and compute_safety_margin are assumed
        # from the fleet replay tooling)
        sensor_data = extract_sensor_frames(rosbag_path)
# Run old model on same sensor data
old_detections = old_model.run_batch(sensor_data)
old_planning = planning_sim(old_detections)
old_would_intervene = check_intervention_needed(old_planning)
# Run new model on same sensor data
new_detections = new_model.run_batch(sensor_data)
new_planning = planning_sim(new_detections)
new_would_intervene = check_intervention_needed(new_planning)
if new_would_intervene and not old_would_intervene:
return 'ota_caused'
elif new_would_intervene and old_would_intervene:
# Check if new model was worse (later detection, smaller margin)
old_margin = compute_safety_margin(old_planning)
new_margin = compute_safety_margin(new_planning)
if new_margin < old_margin * 0.8:
return 'ota_exacerbated'
return 'ota_independent'
else:
            return 'ota_independent'
6.5 Automatic Rollback Triggers
OTA ROLLBACK DECISION MATRIX
| Per-class regression? | Safety-critical class? | Overall mAP impact | Action |
|---|---|---|---|
| YES | YES (personnel/FOD) | Any | ROLLBACK immediately |
| YES | NO (GSE, other) | Drop > 5% | ROLLBACK within 24h |
| YES | NO (GSE, other) | Drop <= 5% | MONITOR for 72h |
| NO | N/A | Drop > 3% | ROLLBACK within 24h |
| NO | N/A | Drop <= 3% | MONITOR for 7 days |
Rollback procedure:
1. Stop canary expansion (halt OTA to remaining vehicles)
2. Revert treatment group to control version (OTA v_old)
3. Capture all rosbags from anomalous period for analysis
4. Create Jira ticket with counterfactual analysis results
5. Block v_new from re-deployment until regression fixed
7. Map Staleness Attribution
7.1 Map Staleness Detection
Map staleness manifests as increased localization error, path deviations, and unexpected obstacles. The fleet itself is the best sensor for map changes (see hd-map-change-detection-maintenance.md for the maintenance pipeline). Here we focus on attributing anomalies to map staleness.
import numpy as np

class MapStalenessAttributor:
"""
Attributes fleet anomalies to map staleness by correlating
localization quality with map age and spatial patterns.
"""
def compute_staleness_score(self, fleet_telemetry, map_metadata):
"""
Score the likelihood that map staleness is causing the anomaly.
Signals:
1. Localization covariance increase correlated with map age
2. Spatial clustering of localization errors
3. GTSAM iteration count increase (harder to converge)
4. AMDB/NOTAM changes since last map update
5. Fleet SLAM discrepancies vs reference map
"""
signals = {}
# Signal 1: Map age vs localization quality correlation
map_ages = fleet_telemetry['map_age_hours']
loc_quality = fleet_telemetry['localization_cov_trace']
correlation = np.corrcoef(map_ages, loc_quality)[0, 1]
signals['age_loc_correlation'] = correlation
# Signal 2: Spatial clustering
error_locations = fleet_telemetry[
fleet_telemetry['localization_cov_trace'] >
fleet_telemetry['localization_cov_trace'].quantile(0.9)
][['x', 'y']]
if len(error_locations) > 10:
from sklearn.cluster import DBSCAN
clusters = DBSCAN(eps=50, min_samples=5).fit(error_locations)
n_clusters = len(set(clusters.labels_)) - (1 if -1 in clusters.labels_ else 0)
signals['spatial_clusters'] = n_clusters # > 0 suggests localized map issue
# Signal 3: GTSAM convergence difficulty
gtsam_iters = fleet_telemetry['gtsam_iteration_count']
signals['gtsam_iter_increase'] = (
gtsam_iters.tail(100).mean() / gtsam_iters.head(100).mean()
)
# Signal 4: NOTAM changes
active_notams = self.query_notams(map_metadata['airport_id'])
notams_since_map = [
n for n in active_notams
if n['effective_date'] > map_metadata['last_update']
]
signals['notam_changes'] = len(notams_since_map)
        # Composite score over Signals 1-4 (Signal 5, fleet SLAM discrepancy
        # vs reference map, would be folded in analogously when available)
score = (
0.30 * min(max(correlation, 0), 1.0) +
0.25 * min(signals.get('spatial_clusters', 0) / 3.0, 1.0) +
0.20 * min(max(signals['gtsam_iter_increase'] - 1.0, 0) / 2.0, 1.0) +
0.25 * min(signals['notam_changes'] / 5.0, 1.0)
)
        return score, signals
7.2 Construction Activity Detection
The fleet detects construction and layout changes from perception data before the map is updated.
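A minimal sketch of the tile-differencing step in the pipeline below; the 1 m occupancy-grid format, thresholds, and persistence counter are illustrative assumptions:

```python
import numpy as np

def flag_construction_tiles(current_occ, ref_occ, days_over, tile=10,
                            diff_threshold=0.3, persist_days=3):
    """
    current_occ / ref_occ: aligned 2D occupancy grids at 1 m resolution.
    days_over: persistent per-tile counter of consecutive days over threshold
               (updated in place, one call per daily fleet SLAM aggregate).
    Returns a boolean mask of 10m x 10m tiles flagged as potential construction.
    """
    n_rows, n_cols = current_occ.shape[0] // tile, current_occ.shape[1] // tile
    flagged = np.zeros((n_rows, n_cols), dtype=bool)
    for i in range(n_rows):
        for j in range(n_cols):
            block = (slice(i * tile, (i + 1) * tile),
                     slice(j * tile, (j + 1) * tile))
            diff = np.abs(current_occ[block] - ref_occ[block]).mean()
            days_over[i, j] = days_over[i, j] + 1 if diff > diff_threshold else 0
            flagged[i, j] = days_over[i, j] >= persist_days
    return flagged
```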
CONSTRUCTION DETECTION PIPELINE
Fleet SLAM maps (daily aggregate)
│
▼
Difference detection vs reference map
┌────────────────────────────────────────┐
│ For each 10m x 10m tile: │
│ diff = |current_occupancy - ref_map| │
│ if diff > threshold for 3+ days: │
│ → flag as potential construction │
└──────────────┬─────────────────────────┘
│
▼
Cross-reference with NOTAM feed
┌────────────────────────────────────────┐
│ Active construction NOTAM in zone? │
│ YES → confirmed construction │
│ NO → possible unreported change │
│ → escalate to operations │
└──────────────┬─────────────────────────┘
│
▼
Impact assessment
┌────────────────────────────────────────┐
│ Does construction zone overlap with: │
│ - Active routes? → reroute │
│ - Map features? → trigger map update │
│ - Localization landmarks? → alert │
└────────────────────────────────────────┘
7.3 Automatic Map Refresh Triggers
| Trigger | Threshold | Action | Priority |
|---|---|---|---|
| Localization RMSE in zone > 0.5m for > 4 hours | 0.5m sustained | Schedule map survey of zone | P2 |
| GTSAM iteration count increase > 50% | 50% over baseline | Schedule map survey | P3 |
| NOTAM issued for construction in mapped area | Any relevant NOTAM | Assess impact, schedule if needed | P2 |
| Map age > 90 days with no validation | 90 days | Mandatory resurvey or fleet validation | P3 |
| Fleet SLAM detects > 10 new static objects in zone | 10+ new objects | Trigger incremental map update | P2 |
| 3+ vehicles report path deviation in same zone | 3 vehicles | Emergency map investigation | P1 |
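A sketch of how these triggers might be evaluated per zone, ordered by priority; the zone_stats field names are hypothetical, chosen to mirror the table:

```python
def evaluate_map_refresh_triggers(zone_stats):
    """Return (priority, action) for the first matching trigger, else None."""
    if zone_stats.get('path_deviation_vehicles', 0) >= 3:
        return ('P1', 'emergency_map_investigation')
    if (zone_stats.get('loc_rmse_m', 0) > 0.5
            and zone_stats.get('rmse_sustained_hours', 0) > 4):
        return ('P2', 'schedule_zone_survey')
    if zone_stats.get('construction_notam_active', False):
        return ('P2', 'assess_impact_schedule_if_needed')
    if zone_stats.get('new_static_objects', 0) >= 10:
        return ('P2', 'incremental_map_update')
    if zone_stats.get('gtsam_iter_increase_pct', 0) > 50:
        return ('P3', 'schedule_map_survey')
    if zone_stats.get('map_age_days', 0) > 90:
        return ('P3', 'mandatory_resurvey_or_fleet_validation')
    return None
```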
8. Environmental Attribution
8.1 METAR Correlation Engine
METAR (Meteorological Aerodrome Report) is the standard aviation weather report, available every 30-60 minutes for every airport with IFR operations. Correlating fleet performance with METAR is the fastest environmental attribution method.
from scipy import stats

class METARCorrelationEngine:
"""
Correlate fleet KPIs with METAR weather observations.
METAR fields relevant to GSE operations:
- visibility_m: prevailing visibility in meters
- ceiling_ft: cloud base height (affects lighting)
- temp_c: temperature (affects sensor performance, tarmac)
- dewpoint_c: dew point (fog likelihood when temp-dewpoint < 3C)
- wind_speed_kt: wind speed (affects jet blast, FOD movement)
- wind_gust_kt: gust speed
- precip_type: RA (rain), SN (snow), FG (fog), BR (mist), HZ (haze)
- precip_intensity: - (light), (moderate), + (heavy)
"""
def correlate(self, fleet_kpis, metar_history, window_hours=24):
"""
Compute time-lagged correlations between METAR conditions
and fleet performance metrics.
"""
results = {}
metar_features = [
'visibility_m', 'temp_c', 'dewpoint_c', 'wind_speed_kt',
'precip_intensity_encoded', # 0=none, 1=light, 2=moderate, 3=heavy
]
kpi_features = [
'intervention_rate', 'perception_mAP', 'sensor_health_min',
'localization_cov_trace',
]
for metar_feat in metar_features:
for kpi_feat in kpi_features:
# Test multiple lags (0, 1, 2, 4, 8, 12, 24 hours)
for lag in [0, 1, 2, 4, 8, 12, 24]:
metar_lagged = metar_history[metar_feat].shift(lag)
kpi_aligned = fleet_kpis[kpi_feat]
valid = metar_lagged.notna() & kpi_aligned.notna()
if valid.sum() < 30:
continue
r, p = stats.pearsonr(
metar_lagged[valid], kpi_aligned[valid]
)
                    # Note: 5 METAR features x 4 KPIs x 7 lags = 140 tests;
                    # apply a multiple-comparison correction (e.g.
                    # Benjamini-Hochberg) before treating p < 0.05 as significant
                    if p < 0.05 and abs(r) > 0.3:
key = (metar_feat, kpi_feat, lag)
results[key] = {'r': r, 'p': p, 'lag_hours': lag}
        return results
8.2 Seasonal Drift Detection
Airside operations experience strong seasonal patterns that must be separated from anomalies.
SEASONAL PATTERNS IN AIRSIDE FLEET PERFORMANCE
Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
────────────────────────────────────────────────────────────
mAP: Low Low Med Med High High Med High High Med Med Low
(fog, de-icing) (heat shimmer) (fog onset)
Int: High High Med Med Low Med Med Low Low Med Med High
rate (vis, ice) (thermal) (fog, early dark)
Sensor
health: Low Low Med High High Med Med High High Med Low Low
(glycol, salt) (heat stress) (debris, ice)
Key seasonal transitions (highest risk of misattribution):
- Oct: Fog onset -- first foggy day of season often causes spike
that looks like OTA regression if OTA was deployed in Sep
- Nov-Dec: De-icing onset -- sensor degradation accelerates
- Mar-Apr: Pollen season -- gradual sensor contamination
- Jun-Jul: Heat shimmer -- LiDAR range reduction on hot tarmac
Seasonal baseline adjustment:
class SeasonalBaselineAdjuster:
"""
Adjust fleet baselines for seasonal patterns to prevent
seasonal effects from being misattributed as anomalies.
Requires: 12+ months of fleet data for robust seasonal estimation.
For new airports: use data from similar airports (same climate zone)
as initial seasonal prior, update with local data.
"""
def __init__(self, historical_data, period='weekly'):
"""
Fit seasonal decomposition on historical fleet data.
Uses STL (Seasonal and Trend decomposition using Loess)
for robust decomposition that handles outliers.
"""
from statsmodels.tsa.seasonal import STL
self.decompositions = {}
for metric in historical_data.columns:
stl = STL(
historical_data[metric],
period=52 if period == 'weekly' else 365,
seasonal=13, # seasonal smoother window
trend=53, # trend smoother window
robust=True, # robust to outliers
)
self.decompositions[metric] = stl.fit()
    def get_adjusted_baseline(self, metric, date):
        """
        Return seasonally-adjusted baseline for given metric and date.
        Note: date must fall within the fitted history; for future dates,
        extrapolate the trend component or refit on updated data.
        """
decomp = self.decompositions[metric]
seasonal_component = decomp.seasonal[date]
trend_component = decomp.trend[date]
return trend_component + seasonal_component
def is_anomalous(self, metric, date, observed_value, n_sigma=3):
"""
Check if observed value is anomalous relative to
seasonally-adjusted baseline.
"""
baseline = self.get_adjusted_baseline(metric, date)
residual_std = self.decompositions[metric].resid.std()
z_score = (observed_value - baseline) / residual_std
        return abs(z_score) > n_sigma, z_score
8.3 Time-of-Day Effects
Several airside phenomena have strong diurnal patterns:
| Time Window | Environmental Effect | Impact on AV Operations |
|---|---|---|
| 04:00-07:00 | Dawn fog, dew, thermal inversion | LiDAR range reduction 20-40%, GPS multipath |
| 07:00-09:00 | Morning rush, de-icing (winter) | High traffic density, sensor contamination |
| 10:00-14:00 | Solar heating, heat shimmer | Tarmac heat shimmer, thermal camera saturation |
| 14:00-17:00 | Afternoon thunderstorms (summer) | Sudden visibility drops, standing water |
| 17:00-20:00 | Evening rush, low sun angle | Glare on cameras, long shadows in LiDAR |
| 20:00-04:00 | Night operations | No natural lighting, thermal-primary perception |
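These diurnal patterns argue for hour-of-day conditioned baselines, analogous to the seasonal adjustment in Section 8.2. A minimal pandas sketch (the helper name is hypothetical):

```python
import pandas as pd

def diurnal_baseline(fleet_kpis: pd.DataFrame, metric: str) -> pd.Series:
    """
    Median of `metric` per hour of day, computed over a DatetimeIndex.
    Compare observations against the baseline for their own hour so that
    dawn fog or low-sun glare is not flagged as a fleet anomaly.
    """
    return fleet_kpis[metric].groupby(fleet_kpis.index.hour).median()

# Usage sketch:
# baseline = diurnal_baseline(fleet_kpis, 'intervention_rate')
# expected = baseline[timestamp.hour]
```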
8.4 Airport-Specific Environmental Signatures
Each airport has unique environmental characteristics that create distinct anomaly patterns:
AIRPORT_ENVIRONMENTAL_PROFILES = {
'LHR': {
'primary_hazards': ['fog', 'rain', 'de-icing_Nov-Mar'],
'fog_frequency': 0.15, # fraction of operating hours
'de_icing_months': [11, 12, 1, 2, 3],
'heat_stress_rare': True,
'expected_mAP_reduction_fog': 0.15, # 15% mAP drop in fog is normal
},
'DXB': {
'primary_hazards': ['heat_shimmer', 'sand', 'humidity'],
'heat_months': [5, 6, 7, 8, 9],
'max_tarmac_temp_c': 65,
'sand_storm_frequency': 0.02,
'expected_thermal_throttling': True, # Orin throttles in summer
},
'NRT': {
'primary_hazards': ['typhoon_season', 'humidity', 'snow'],
'typhoon_months': [8, 9, 10],
'snow_months': [12, 1, 2],
'expected_mAP_reduction_rain': 0.10,
},
}
9. Implementation Architecture
9.1 System Architecture
FLEET ANOMALY ATTRIBUTION ARCHITECTURE
┌──────────────────────────────────────────────────────────────────────────┐
│ │
│ VEHICLE TIER (On-Vehicle, Orin) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Telemetry Emitter (1 Hz) │ │
│ │ - Perception metrics (mAP, per-class, latency) │ │
│ │ - Localization metrics (cov trace, GTSAM iters, RMSE) │ │
│ │ - Sensor health scores (per-sensor, per-check) │ │
│ │ - Planning metrics (jerk, clearance, candidate count) │ │
│ │ - Control metrics (tracking error, actuator status) │ │
│ │ - System metrics (GPU util, temp, memory, latency) │ │
│ │ - OTA/map version identifiers │ │
│ │ Format: Protobuf -> 5G/WiFi -> Kafka │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ STREAMING TIER (Airport Edge or Cloud) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Apache Kafka │ │
│ │ Topics: telemetry.{airport}.{vehicle}, alerts.{tier}, │ │
│ │ interventions.{airport}, metar.{airport} │ │
│ │ Retention: 7 days (hot), 90 days (cold via S3 sink) │ │
│ └────────────────────┬────────────────────────────────────────────┘ │
│ │ │
│ ┌────────────────────▼────────────────────────────────────────────┐ │
│ │ Apache Flink (Stream Processing) │ │
│ │ │ │
│ │ Job 1: Per-vehicle CUSUM/EWMA (1-min windows) │ │
│ │ - Update SPC charts for each vehicle x metric │ │
│ │ - Emit to alerts.p3 on single-vehicle alarm │ │
│ │ │ │
│ │ Job 2: Per-airport aggregation (15-min windows) │ │
│ │ - Aggregate vehicle metrics to airport level │ │
│ │ - CUSUM on airport-level KPIs │ │
│ │ - Emit to alerts.p2 on airport alarm │ │
│ │ │ │
│ │ Job 3: Fleet-wide aggregation (1-hour windows) │ │
│ │ - Isolation Forest scoring on fleet snapshot │ │
│ │ - EWMA on fleet-wide intervention rate │ │
│ │ - Emit to alerts.p1 on fleet alarm │ │
│ │ │ │
│ │ Job 4: METAR enrichment │ │
│ │ - Join telemetry with METAR feed │ │
│ │ - Compute environmental context features │ │
│ │ │ │
│ │ Job 5: Alert suppression and deduplication │ │
│ │ - Apply suppression rules (Section 2.5) │ │
│ │ - Route to appropriate tier │ │
│ └────────────────────┬────────────────────────────────────────────┘ │
│ │ │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ ATTRIBUTION TIER (Cloud) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Attribution Engine (triggered by P1/P2 alerts) │ │
│ │ │ │
│ │ Step 1: Hierarchical decomposition │ │
│ │ Fleet -> Airport -> Vehicle -> Subsystem -> Component │ │
│ │ │ │
│ │ Step 2: Shapley attribution │ │
│ │ Per-feature SHAP values for anomalous observations │ │
│ │ Airport-level and vehicle-level Shapley decomposition │ │
│ │ │ │
│ │ Step 3: Bayesian diagnosis │ │
│ │ Sequential evidence gathering │ │
│ │ Prior update from fleet history │ │
│ │ Multi-hypothesis tracking if ambiguous │ │
│ │ │ │
│ │ Step 4: Context enrichment │ │
│ │ Cross-reference with OTA deployment log │ │
│ │ Cross-reference with METAR history │ │
│ │ Cross-reference with NOTAM feed │ │
│ │ Cross-reference with FIDS (flight schedule / aircraft types) │ │
│ │ │ │
│ │ Step 5: Report generation │ │
│ │ Root cause hypothesis with confidence │ │
│ │ Recommended actions │ │
│ │ Evidence trail for audit │ │
│ └────────────────────┬────────────────────────────────────────────┘ │
│ │ │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ BATCH TIER (Cloud, Airflow) │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ DAG: daily_fleet_analysis (runs 02:00 UTC) │ │
│ │ Task 1: Aggregate 24h telemetry from S3 │ │
│ │ Task 2: Run Isolation Forest + autoencoder on daily snapshot │ │
│ │ Task 3: Update seasonal baselines (STL decomposition) │ │
│ │ Task 4: Granger causality analysis (90-day rolling window) │ │
│ │ Task 5: Generate daily fleet health report │ │
│ │ │ │
│ │ DAG: quarterly_causal_discovery (runs quarterly) │ │
│ │ Task 1: Run NOTEARS on 90-day fleet telemetry │ │
│ │ Task 2: Run PC algorithm for validation │ │
│ │ Task 3: Compare discovered DAG with prior DAG │ │
│ │ Task 4: Generate causal structure report for human review │ │
│ │ │ │
│ │ DAG: ota_regression_analysis (triggered by OTA deployment) │ │
│ │ Task 1: Define A/B groups │ │
│ │ Task 2: Collect 7-day metrics from control + treatment │ │
│ │ Task 3: Run regression detection tests │ │
│ │ Task 4: Counterfactual replay on captured rosbags │ │
│ │ Task 5: Generate OTA regression report │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
├──────────────────────────────────────────────────────────────────────────┤
│ │
│ PRESENTATION TIER │
│ ┌─────────────────────────────────────────────────────────────────┐ │
│ │ Grafana Dashboard │ │
│ │ Panel 1: Fleet KPI overview (SPC charts per level) │ │
│ │ Panel 2: Active anomalies with attribution status │ │
│ │ Panel 3: Shapley waterfall chart (feature contributions) │ │
│ │ Panel 4: Bayesian diagnosis progress (posterior evolution) │ │
│ │ Panel 5: OTA A/B comparison (treatment vs control) │ │
│ │ Panel 6: Map staleness heat map (per zone per airport) │ │
│ │ Panel 7: Environmental correlation matrix │ │
│ │ │ │
│ │ Slack Integration │ │
│ │ P0: @channel with one-click acknowledge │ │
│ │ P1: Attribution summary with drill-down link │ │
│ │ P2: Daily digest thread │ │
│ │ │ │
│ │ Jira Integration │ │
│ │ Auto-create ticket for P0/P1 with attribution evidence │ │
│ │ Link to related rosbags, SHAP plots, Bayesian trace │ │
│ └─────────────────────────────────────────────────────────────────┘ │
│ │
└──────────────────────────────────────────────────────────────────────────┘
9.2 Data Flow Summary
| Data Source | Volume | Latency | Storage | Retention |
|---|---|---|---|---|
| Vehicle telemetry (1 Hz) | 500 KB/vehicle/hour | < 5 sec | Kafka -> S3 | 7 days hot, 90 days cold |
| Interventions | 1-5 events/vehicle/day | < 1 sec | Kafka -> PostgreSQL | Permanent |
| METAR | 1 report/30 min/airport | < 5 min | PostgreSQL | Permanent |
| NOTAM | As issued | < 15 min | PostgreSQL | Permanent |
| Sensor health | 1 Hz (aggregated to 1 min) | < 30 sec | Kafka -> S3 | 30 days |
| OTA deployment log | Per deployment | < 1 min | PostgreSQL | Permanent |
| Map metadata | Per map update | < 1 min | PostgreSQL | Permanent |
| FIDS/A-CDM | Per flight event | < 5 min | PostgreSQL | 1 year |
| Attribution results | Per anomaly | < 5 min | PostgreSQL + S3 | Permanent |
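As a back-of-envelope check on the telemetry volumes above (a sketch using the 50-vehicle, 500 KB/vehicle/hour figures from the table):

```python
# Storage implied by 1 Hz telemetry for a 50-vehicle fleet
vehicles = 50
kb_per_vehicle_hour = 500

daily_gb = vehicles * kb_per_vehicle_hour * 24 / 1e6  # ~0.6 GB/day
hot_gb = 7 * daily_gb    # Kafka hot tier (7 days): ~4 GB
cold_gb = 90 * daily_gb  # S3 cold tier (90 days): ~54 GB
print(f"daily={daily_gb:.1f} GB, hot={hot_gb:.0f} GB, cold={cold_gb:.0f} GB")
```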
9.3 Integration Points
INTEGRATION WITH EXISTING REFERENCE AIRSIDE AV STACK SYSTEMS
┌────────────────────────────────────────────────────────────────┐
│ UPSTREAM (data sources) │
│ │
│ sensor-degradation-health-monitoring.md │
│ → Provides: per-sensor health scores, degradation type │
│ → Integration: Flink reads from health_monitor Kafka topic │
│ │
│ runtime-verification-monitoring.md │
│ → Provides: STL violation events, OOD scores │
│ → Integration: Flink reads from stl_monitor Kafka topic │
│ │
│ cloud-backend-infrastructure.md │
│ → Provides: data lake, Airflow, telemetry pipeline │
│ → Integration: Attribution is an additional Airflow DAG │
│ and Flink job set on existing infrastructure │
│ │
│ data-flywheel-airside.md │
│ → Provides: trigger events, model eval metrics │
│ → Integration: Attribution feeds edge-case mining priorities │
│ │
│ ../fleet-management/fleet-predictive-maintenance.md │
│ → Provides: component RUL, failure predictions │
│ → Integration: PHM signals are features in anomaly model │
├────────────────────────────────────────────────────────────────┤
│ DOWNSTREAM (consumers) │
│ │
│ ../ota/ota-fleet-management.md │
│ ← Consumes: OTA regression alerts, rollback triggers │
│ │
│ hd-map-change-detection-maintenance.md │
│ ← Consumes: map staleness alerts, refresh triggers │
│ │
│ ../../40-runtime-systems/monitoring-observability/hmi-operator-interface.md │
│ ← Consumes: attribution summaries for operator dashboard │
│ │
│ ../../40-runtime-systems/ml-deployment/av-cicd-devops-pipeline.md │
│ ← Consumes: per-class regression signals for CI gates │
│ │
│ weather-adaptive-odd-management.md │
│ ← Consumes: environmental attribution for ODD tuning │
└────────────────────────────────────────────────────────────────┘
9.4 Technology Stack
| Component | Technology | Rationale |
|---|---|---|
| Message bus | Apache Kafka | Already in cloud backend architecture. High throughput, replay capability |
| Stream processing | Apache Flink | Exactly-once semantics, event-time processing, sub-second latency |
| Batch processing | Apache Airflow | Already in cloud backend architecture. DAG scheduling, dependency management |
| Anomaly models | scikit-learn (IsolationForest), PyTorch (autoencoder) | Standard, well-tested implementations |
| Causal discovery | causal-learn (NOTEARS, PC), statsmodels (Granger) | Most mature Python causal inference library |
| SHAP computation | shap library | Standard implementation, TreeSHAP for exact computation |
| Database | PostgreSQL + TimescaleDB | Time-series extension for efficient temporal queries |
| Dashboard | Grafana | Already likely in stack. Custom panels via JSON API |
| Alerting | Grafana Alerting + PagerDuty | Multi-channel alert routing |
| Ticket management | Jira API | Auto-create tickets with evidence links |
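For the PostgreSQL + TimescaleDB row, a sketch of a plausible telemetry hypertable; the schema and names are illustrative, not prescribed by the stack:

```python
# Hypothetical TimescaleDB DDL for the 1 Hz telemetry stream
TELEMETRY_DDL = """
CREATE TABLE fleet_telemetry (
    time        TIMESTAMPTZ      NOT NULL,
    airport_id  TEXT             NOT NULL,
    vehicle_id  TEXT             NOT NULL,
    metric      TEXT             NOT NULL,
    value       DOUBLE PRECISION,
    ota_version TEXT,
    map_version TEXT
);
SELECT create_hypertable('fleet_telemetry', 'time');
-- Typical attribution query pattern: one airport, recent window
CREATE INDEX ON fleet_telemetry (airport_id, vehicle_id, time DESC);
"""
```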
10. Cost-Benefit and Metrics
10.1 Implementation Cost
| Phase | Scope | Duration | Cost |
|---|---|---|---|
| Phase 1: Anomaly detection | SPC (CUSUM/EWMA) on fleet KPIs, hierarchical decomposition, alert tiering, Grafana dashboards | 6-8 weeks | $15-25K |
| Phase 2: Shapley attribution | Isolation Forest training, TreeSHAP integration, hierarchical Shapley decomposition, feature attribution dashboard | 4-6 weeks | $12-20K |
| Phase 3: Bayesian diagnosis | Diagnosis engine, prior calibration, sequential evidence gathering, multi-hypothesis tracking | 4-6 weeks | $12-20K |
| Phase 4: OTA regression | A/B test framework, per-class regression detection, counterfactual replay, automatic rollback triggers | 4-6 weeks | $10-18K |
| Phase 5: Environmental + map | METAR correlation, seasonal baseline, map staleness scoring, construction detection | 3-4 weeks | $8-15K |
| Phase 6: Causal discovery | NOTEARS pipeline, PC validation, Granger analysis, quarterly update DAG | 3-4 weeks | $8-15K |
| Total | | 16-22 weeks | $55-95K |
Infrastructure cost (incremental to existing cloud backend):
| Component | Monthly Cost (50 vehicles) | Monthly Cost (200 vehicles) |
|---|---|---|
| Flink cluster (3 nodes) | $300-500 | $800-1,500 |
| TimescaleDB | $100-200 | $300-600 |
| Additional S3 storage | $50-100 | $200-400 |
| Grafana (hosted) | $100 | $200 |
| PagerDuty | $100-200 | $200-400 |
| Compute (batch attribution) | $100-200 | $300-600 |
| Total | $750-1,300/month | $1,800-3,500/month |
10.2 Benefit Quantification
MTTR reduction:
| Metric | Before Attribution System | After (Phase 1-2) | After (Full System) |
|---|---|---|---|
| Mean Time To Detection | 1-4 hours | 5-15 minutes | 2-10 minutes |
| Mean Time To Attribution | 4-24 hours | 30-120 minutes | 10-30 minutes |
| Mean Time To Resolution | 8-48 hours | 2-8 hours | 1-4 hours |
| False positive alerts/day (50 vehicles) | N/A (no alerts) | 3-5 | 1-2 |
| Anomalies requiring manual triage | 100% | 40-60% | 15-25% |
| Analyst headcount (50 vehicles) | 2-3 (reactive) | 1-2 | 1 |
| Analyst headcount (200 vehicles) | 8-12 (impossible) | 3-4 | 2-3 |
Financial impact (50-vehicle fleet):
| Benefit Category | Annual Value | Basis |
|---|---|---|
| Reduced downtime (faster MTTR) | $80-150K | 20-40 hours/month saved x $100-200/hour fleet-hour value |
| Avoided cascading failures | $30-60K | 2-5 prevented per year x $15-30K per cascade |
| Reduced analyst headcount | $50-100K | 1-2 fewer analysts needed |
| OTA regression early detection | $20-50K | 1-3 regressions caught 24-48 hours earlier per year |
| Improved fleet availability | $40-80K | 1-2% availability improvement x fleet revenue |
| Total annual benefit | $220-440K | |
| Total annual cost | $25-45K | Infrastructure + maintenance |
| Net annual benefit | $195-395K | |
| ROI (including implementation) | 2.5-5x in Year 1 | |
10.3 Performance Metrics for the Attribution System
| Metric | Target | Measurement Method |
|---|---|---|
| Attribution accuracy | > 80% correct root cause identification | Validated against manual investigation on sample |
| Attribution coverage | > 90% of P1/P2 anomalies get automated attribution | Fraction with high-confidence diagnosis |
| MTTD (detection) | < 15 min for P1, < 60 min for P2 | Time from anomaly onset to alert |
| MTTA (attribution) | < 30 min for single-cause, < 60 min for multi-cause | Time from alert to root cause hypothesis |
| False positive rate | < 5% of P1 alerts, < 15% of P2 alerts | Manual review of 10% sample |
| Alert volume | < 5 P1-P2 alerts/day for 50-vehicle fleet | Dashboard count |
| Bayesian diagnosis depth | < 4 tests to reach high confidence | Average tests before P > 0.60 |
| Seasonal adjustment accuracy | < 10% MAPE on seasonal baseline | Back-test against held-out year |
| Causal DAG stability | > 80% edge agreement across quarters | Compare quarterly NOTEARS outputs |
10.4 Break-Even Analysis
BREAK-EVEN ANALYSIS
Fixed cost: $55-95K implementation
Variable cost: $750-1,300/month infrastructure (50 vehicles)
Annual operating: $25-45K
Annual benefit per vehicle: $4,400-8,800 (from MTTR reduction, availability, analyst savings)
Break-even fleet size (Year 1, including implementation):
  Conservative: $55K / ($4,400/vehicle) = ~13 vehicles
  Moderate: $75K / ($6,600/vehicle) = ~11 vehicles
  Optimistic: $95K / ($8,800/vehicle) = ~11 vehicles
Break-even fleet size (Year 2+, operating cost only):
  Conservative: $45K / ($4,400/vehicle) = ~10 vehicles
  Moderate: $35K / ($6,600/vehicle) = ~5 vehicles
Conclusion: ROI-positive at roughly 10-15 vehicles. Build before the fleet
reaches 20 vehicles (the manual triage scaling wall).
11. Key Takeaways
Manual triage does not scale past 20 vehicles. At 50 vehicles, the anomaly generation rate (5-20/day) exceeds per-analyst triage capacity (1-3 root causes/day) by up to an order of magnitude. The backlog grows permanently, and undiagnosed anomalies cascade into safety events.
Hierarchical anomaly detection reduces alert volume 80-90%. Fleet -> airport -> vehicle -> subsystem -> component decomposition, combined with CUSUM/EWMA and Isolation Forest, filters noise before it reaches analysts. Target: < 5 actionable P1/P2 alerts per day for a 50-vehicle fleet.
Fraction-of-vehicles-affected is the single most informative diagnostic signal. If > 80% of vehicles at an airport are affected, the cause is environmental or software (fleet-wide). If < 30%, the cause is vehicle-specific (sensor, calibration). This one check eliminates half the hypothesis space in seconds.
Causal discovery (NOTEARS + PC + Granger) provides the structural backbone. The learned causal DAG encodes which variables can cause which, preventing spurious attribution (e.g., attributing a fog-caused intervention spike to a coincidental OTA update). Re-learn quarterly as fleet operations evolve.
Shapley values provide fair, decomposable attribution. SHAP decomposes the anomaly score into per-feature contributions with exact or near-exact computation. Hierarchical Shapley (fleet -> airport -> vehicle -> feature) makes the computation tractable for large fleets.
Bayesian diagnosis trees reach high confidence in 3-4 tests. Sequential evidence gathering, guided by information gain, reaches > 60% posterior on the true root cause typically within 3-4 automated tests (2-5 minutes), compared to 4-24 hours of manual investigation.
30-40% of fleet anomalies have multiple co-occurring causes. Single-cause diagnosis leads to partial fixes and recurrence. The Bayesian and causal framework explicitly handles multi-cause scenarios by tracking joint posteriors and interaction effects.
OTA regression is the most dangerous single cause because it is self-inflicted and fleet-wide. A/B testing between OTA versions, per-class regression detection, and counterfactual replay ("would this intervention have happened on the old model?") are non-negotiable for production fleet safety.
Seasonal baselines prevent misattribution. Without seasonal adjustment, the first fog of autumn looks like a system failure. STL decomposition on 12+ months of data creates robust seasonal baselines. For new airports, borrow seasonal profiles from same-climate airports.
EU PLD 2024/2853 creates legal urgency for causal attribution. The rebuttable presumption of causality means that if the operator of the reference airside AV stack cannot demonstrate what caused an incident, courts may presume the AV caused it. Automated causal attribution provides the evidence trail that regulatory defense requires.
Implementation cost $55-95K, break-even at 10-15 vehicles. The system pays for itself through MTTR reduction (8-48h to 15-90min), analyst headcount reduction (2-3 fewer for 50 vehicles), and avoided cascading failures ($30-60K/year).
Build before you need it. Retrofitting attribution under operational pressure (when the fleet hits 20+ vehicles and anomalies pile up) costs 2-3x more than building it proactively. The 16-22 week implementation timeline means starting at 10-15 vehicles to be ready at 20.
12. References
Anomaly Detection and SPC
- Page, E. S. (1954). "Continuous Inspection Schemes." Biometrika, 41(1/2), 100-115. [Original CUSUM paper]
- Roberts, S. W. (1959). "Control Chart Tests Based on Geometric Moving Averages." Technometrics, 1(3), 239-250. [EWMA control charts]
- Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest." ICDM, 413-422. [Isolation Forest for anomaly detection]
- Malhotra, P. et al. (2016). "LSTM-based Encoder-Decoder for Multi-sensor Anomaly Detection." ICML Workshop on Anomaly Detection, arXiv:1607.00148. [Autoencoder for time-series anomaly detection]
Causal Discovery
- Zheng, X. et al. (2018). "DAGs with NO TEARS: Continuous Optimization for Structure Learning." NeurIPS, 9472-9483. [NOTEARS algorithm]
- Zheng, X. et al. (2020). "Learning Sparse Nonparametric DAGs." AISTATS, 3414-3425. [NOTEARS-NONLINEAR]
- Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search. MIT Press. [PC algorithm, FCI algorithm]
- Granger, C. W. J. (1969). "Investigating Causal Relations by Econometric Models and Cross-spectral Methods." Econometrica, 37(3), 424-438. [Granger causality]
- Zheng, Y. et al. (2024). "Causal-learn: Causal Discovery in Python." JMLR, 25(60), 1-8. [causal-learn library]
Shapley Values and Attribution
- Lundberg, S. M. & Lee, S. I. (2017). "A Unified Approach to Interpreting Model Predictions." NeurIPS, 4765-4774. [SHAP framework]
- Lundberg, S. M. et al. (2020). "From local explanations to global understanding with explainable AI for trees." Nature Machine Intelligence, 2, 56-67. [TreeSHAP]
- Covert, I. et al. (2021). "Explaining by Removing: A Unified Framework for Model Explanation." JMLR, 22(209), 1-90. [Shapley for feature attribution theory]
Bayesian Diagnosis
- Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd edition. Cambridge University Press.
- Heckerman, D. (1995). "A Tutorial on Learning with Bayesian Networks." Technical Report MSR-TR-95-06, Microsoft Research.
- Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. [Bayesian network diagnosis]
Fleet Operations and AV Monitoring
- Uber ATG (2020). "Automated Root Cause Analysis for Autonomous Vehicle Failures." Internal technical report (disclosed in NTSB investigations).
- Waymo (2023). "Fleet-Level Safety Monitoring and Anomaly Detection." Waymo Safety Report, 4th Edition.
- ISO 3691-4:2023. "Industrial trucks -- Safety requirements and verification -- Part 4: Driverless industrial trucks and their systems."
- EU Product Liability Directive 2024/2853. "Directive on liability for defective products." (Transpose deadline December 2026.)
- ISO 26262:2018. "Road vehicles -- Functional safety." Parts 1-12.
Seasonal Decomposition and Environmental Analysis
- Cleveland, R. B. et al. (1990). "STL: A Seasonal-Trend Decomposition Procedure Based on Loess." Journal of Official Statistics, 6(1), 3-73.
- ICAO Annex 3 (2018). "Meteorological Service for International Air Navigation." 20th edition. [METAR standard]
- FAA Advisory Circular 00-45H (2016). "Aviation Weather Services." [METAR interpretation]
Time-Series Analysis
- Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press. [Granger causality, VAR models]
- Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice. 3rd edition. OTexts. [STL, ARIMA, seasonal adjustment]