
Automated Root-Cause Attribution for Fleet-Level Operational Anomalies

Causal Discovery, Shapley Attribution, Bayesian Diagnosis, and Streaming Analytics for Autonomous GSE Fleets

Last updated: 2026-04-11


Key Takeaway: When a 50-vehicle autonomous GSE fleet shows a 3x intervention rate spike at Airport B on Tuesday, the root cause could be any of 8+ categories: sensor degradation, stale maps, OTA model regression, weather shift, new aircraft type, construction, operator behavior, or software bug. Manual triage (reviewing rosbags, cross-referencing logs, interviewing operators) takes 4-48 hours per incident and does not scale past 20 vehicles or 2 airports. Automated root-cause attribution combines hierarchical anomaly detection (CUSUM/EWMA on fleet KPIs), causal discovery (NOTEARS on operational telemetry), Shapley-value decomposition (which airport/vehicle/component drives the anomaly), and Bayesian diagnosis trees (sequential evidence gathering to pinpoint causes). The system reduces Mean Time To Resolution (MTTR) from 8-48 hours to 15-90 minutes, cuts false-positive alerts by 60-80% through hierarchical filtering, and provides the causal evidence trail that EU PLD 2024/2853 and ISO 3691-4 increasingly require. Implementation cost: $55-95K over 16-22 weeks, with ROI at 15-25 vehicles as avoided downtime and faster resolution exceed system cost.


Table of Contents

  1. The Attribution Problem
  2. Anomaly Detection at Fleet Scale
  3. Causal Discovery from Operational Data
  4. Shapley-Value Attribution
  5. Bayesian Diagnosis Trees
  6. OTA Regression Attribution
  7. Map Staleness Attribution
  8. Environmental Attribution
  9. Implementation Architecture
  10. Cost-Benefit and Metrics
  11. Key Takeaways
  12. References

1. The Attribution Problem

1.1 Why Fleet-Level Root-Cause Analysis is Hard

A single vehicle anomaly has a tractable diagnosis path: review the rosbag, check sensor health, inspect the scene. Fleet-level anomalies are fundamentally different because causes propagate, interact, and hide across the combinatorial space of vehicles x airports x subsystems x time.

The dimensionality of fleet diagnosis:

Fleet anomaly search space:

  Vehicles:      50-100 vehicles
  Airports:      3-10 airports
  Subsystems:    6 per vehicle (perception, localization, planning, control, sensors, compute)
  Components:    ~25 per vehicle (4-8 LiDAR, cameras, radar, thermal, GTSAM, Frenet, CAN, etc.)
  Software:      OTA version, map version, parameter set, calibration
  Environment:   Weather, time-of-day, traffic, construction, aircraft mix
  Human:         Operator behavior, shift patterns, training recency

  Total search space: ~10^6 - 10^8 unique state combinations per day
  
  Manual triage throughput: 1-3 root causes / analyst / day
  Anomaly generation rate (50-vehicle fleet): 5-20 anomalies / day
  
  Result: Manual triage creates a permanent and growing backlog

1.2 Root Cause Categories

Every fleet anomaly traces to one or more of eight root cause categories. These are not mutually exclusive -- a model regression exposed only during fog at a newly resurfaced stand involves three simultaneous causes.

| # | Root Cause Category | Example Signal | Typical Onset | Manual Triage Time |
|---|---|---|---|---|
| 1 | Sensor degradation | Point cloud density drop, intensity shift | Gradual (hours-days) | 1-4 hours |
| 2 | Map staleness | Localization error increase, path deviation | Step change (construction) or gradual (drift) | 2-8 hours |
| 3 | OTA model regression | Per-class mAP drop, new false-positive pattern | Step change (deployment time) | 4-24 hours |
| 4 | Weather/environment | Correlated cross-vehicle performance drop | Gradual-to-sudden | 30 min - 2 hours |
| 5 | New aircraft type | Detection failures near specific stands | Step change (airline schedule change) | 2-8 hours |
| 6 | Construction/layout change | Localization divergence in specific zone | Step change (construction start) | 4-16 hours |
| 7 | Operator behavior change | Intervention pattern shift without vehicle anomaly | Gradual (new operators) or step (policy change) | 4-48 hours |
| 8 | Software bug | Deterministic failure on specific input pattern | Step change (code deployment) | 2-24 hours |

1.3 Why Manual Triage Does Not Scale Past 20 Vehicles

Manual root-cause analysis follows a predictable workflow that breaks at fleet scale:

Manual triage workflow (per anomaly):

  1. Receive alert (e.g., "Vehicle 23 intervention rate 4x baseline")     5 min
  2. Query telemetry database for vehicle timeline                        10-30 min
  3. Download and review rosbag from event                                20-60 min
  4. Cross-reference with weather, OTA, map version                       10-30 min
  5. Compare with other vehicles at same airport                          15-45 min
  6. Compare with same vehicle at different times                         15-45 min
  7. Check sensor health logs                                              10-20 min
  8. Formulate hypothesis and test                                         30-120 min
  9. Write root-cause report                                               20-60 min
  
  Total: 2.5 - 7 hours per anomaly (median ~4 hours)
  
  With 1 analyst: 2 anomalies/day capacity
  Fleet of 20 vehicles: ~5-10 anomalies/day (borderline manageable)
  Fleet of 50 vehicles: ~15-30 anomalies/day (3-6x analyst capacity)
  Fleet of 100 vehicles across 5 airports: ~40-80 anomalies/day (impossible)

The problem compounds because unresolved anomalies cascade. A sensor degradation left undiagnosed for 48 hours, because the analyst is busy triaging OTA regressions, can escalate into a safety event. The analyst then spends 2 days on the safety investigation, during which 10 more anomalies pile up.

Industry experience confirms this scaling wall:

| Company | Fleet Size at Manual Triage Breakdown | Solution |
|---|---|---|
| Waymo | ~100 vehicles (2019-2020) | Built centralized anomaly attribution system |
| Cruise | ~200 vehicles (2022) | Automated "trip diagnostics" pipeline |
| Nuro | ~50 vehicles (2021) | Hierarchical alert triage with auto-classification |
| Tesla | ~50,000 FSD vehicles (2023) | Fully automated fleet learning + regression system |
| Motional | ~150 vehicles (2023) | ML-based root cause classifier |

The reference airside AV stack will hit this wall at 15-25 vehicles across 2+ airports. Building attribution infrastructure before that point is significantly cheaper than retrofitting it under operational pressure.

1.4 The Multi-Cause Problem

In practice, 30-40% of fleet anomalies have multiple contributing causes. Single-cause diagnosis leads to partial fixes that recur.

Example: The Tuesday morning problem at Airport B

Observed: 3x intervention rate, Tuesday 06:00-10:00, Airport B, 12 of 15 vehicles

Investigation reveals three co-occurring causes:
  1. OTA v3.2.1 deployed Monday night (regression on personnel detection in fog)
  2. Seasonal fog onset (October) degrades LiDAR range 20-30% at dawn
  3. New Turkish Airlines A350 began operations Tuesday (wider wingspan,
     different reflectivity profile than trained A320/B737)

Fixing only the OTA regression reduces interventions 40%, not 100%.
Fixing only the fog handling reduces interventions 20%.
Addressing the new aircraft type reduces interventions 15%.
All three together resolve 90%+ of the anomaly.

Interactions:
  - OTA regression only manifests in fog (personnel detection robust in clear weather)
  - Fog was tolerable before OTA regression (previous model handled reduced range)
  - New aircraft type causes confusion only at specific stands (B5-B8)
  
  The causes are not additive -- they interact multiplicatively.

This motivates the causal approach: correlation-based methods identify associations ("interventions correlate with fog") but cannot distinguish direct causes from confounders or identify interaction effects.

1.5 Relationship to Existing Systems

This document covers fleet-level anomaly attribution -- the "why" layer that sits above existing monitoring and below incident management. It consumes signals from, but does not duplicate, several existing systems:

| Existing System | Doc Reference | What It Provides to Attribution |
|---|---|---|
| Sensor health monitoring | sensor-degradation-health-monitoring.md | Per-sensor health scores, degradation type classification |
| Runtime verification | runtime-verification-monitoring.md | STL violation events, OOD detection scores, safety envelope status |
| Predictive maintenance | ../fleet-management/fleet-predictive-maintenance.md | Component RUL predictions, failure probability estimates |
| Data flywheel | data-flywheel-airside.md | Trigger events, retraining status, model performance metrics |
| Cloud backend | cloud-backend-infrastructure.md | Telemetry pipeline, data lake, Airflow orchestration |
| Causal reasoning | causal-reasoning-counterfactual.md | SCM methodology for planning decisions (this doc extends it to fleet operations) |

2. Anomaly Detection at Fleet Scale

2.1 Fleet KPI Hierarchy

Anomaly detection requires a well-defined metric hierarchy. Each level aggregates from the one below, enabling drill-down from "fleet is unhealthy" to "LiDAR #3 on Vehicle #23 at Airport B has a contaminated lens."

FLEET KPI HIERARCHY

Level 0: Fleet
  Metrics: fleet_intervention_rate, fleet_availability, fleet_mission_success_rate
  Granularity: 1-hour rolling windows
  Normal: intervention_rate < 0.5/100km, availability > 95%, success > 98%

Level 1: Airport  
  Metrics: airport_{X}_intervention_rate, airport_{X}_availability
  Granularity: 1-hour rolling windows
  Normal: within 2 sigma of airport historical baseline

Level 2: Vehicle
  Metrics: vehicle_{ID}_intervention_rate, mission_completion, avg_speed
  Granularity: per-mission and 4-hour rolling
  Normal: within 3 sigma of vehicle historical baseline

Level 3: Subsystem
  Metrics: perception_mAP, localization_cov_trace, planning_jerk,
           control_tracking_error, sensor_health_score, compute_utilization
  Granularity: per-mission, 30-min rolling
  Normal: perception_mAP > 0.65, loc_cov_trace < 0.1, tracking_error < 0.15m

Level 4: Component
  Metrics: lidar_{N}_point_count, lidar_{N}_max_range, gtsam_iteration_count,
           centerpoint_latency, frenet_candidate_count, can_bus_error_rate
  Granularity: 1-second (streaming), aggregated to 1-min
  Normal: component-specific thresholds from commissioning baseline
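
The hierarchy can be captured as a declarative config that the detection layer walks when drilling down. A minimal sketch, assuming a simple (operator, threshold) encoding for the normal ranges; the KPILevel schema itself is illustrative, not part of the stack:

python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple

@dataclass
class KPILevel:
    """One level of the fleet KPI hierarchy (names/thresholds from above)."""
    name: str
    metrics: List[str]
    window: str                                         # aggregation window
    normal: Dict[str, Tuple[str, float]] = field(default_factory=dict)

KPI_HIERARCHY = [
    KPILevel('fleet', ['fleet_intervention_rate', 'fleet_availability',
                       'fleet_mission_success_rate'],
             window='1h',
             normal={'fleet_intervention_rate': ('<', 0.5),   # per 100 km
                     'fleet_availability': ('>', 0.95),
                     'fleet_mission_success_rate': ('>', 0.98)}),
    KPILevel('airport', ['airport_intervention_rate', 'airport_availability'],
             window='1h',
             normal={'airport_intervention_rate': ('sigma', 2.0)}),  # vs baseline
    KPILevel('vehicle', ['vehicle_intervention_rate', 'mission_completion',
                         'avg_speed'],
             window='4h',
             normal={'vehicle_intervention_rate': ('sigma', 3.0)}),
    KPILevel('subsystem', ['perception_mAP', 'localization_cov_trace',
                           'planning_jerk', 'control_tracking_error'],
             window='30min',
             normal={'perception_mAP': ('>', 0.65),
                     'localization_cov_trace': ('<', 0.1),
                     'control_tracking_error': ('<', 0.15)}),  # meters
    KPILevel('component', ['lidar_point_count', 'lidar_max_range',
                           'can_bus_error_rate'],
             window='1min',
             normal={}),  # thresholds come from the commissioning baseline
]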

2.2 Statistical Process Control Methods

Statistical Process Control (SPC) methods detect when a metric shifts from its baseline distribution. Two methods are complementary for fleet operations:

CUSUM (Cumulative Sum Control Chart)

CUSUM is optimal for detecting small persistent shifts in the mean -- exactly the signature of gradual sensor degradation or slow model drift.

python
class CUSUMDetector:
    """
    Tabular CUSUM for detecting mean shifts in fleet KPIs.
    
    Parameters:
        k: allowance (slack) -- typically 0.5 * shift_to_detect (in sigma units)
        h: decision threshold -- controls false-positive rate
           h=4 gives ARL_0 ~168 (false alarm every 168 samples on average)
           h=5 gives ARL_0 ~465
    
    For fleet metrics sampled hourly:
        h=4: false alarm every ~7 days (acceptable for Tier 1 alerts)
        h=5: false alarm every ~19 days (appropriate for Tier 0 escalations)
    """
    
    def __init__(self, k=0.5, h=4.0, baseline_mean=None, baseline_std=None):
        self.k = k
        self.h = h
        self.mu_0 = baseline_mean    # estimated from 30+ day commissioning period
        self.sigma = baseline_std
        self.S_pos = 0.0             # upper CUSUM statistic
        self.S_neg = 0.0             # lower CUSUM statistic
        self.run_length = 0
    
    def update(self, x):
        """Process new observation. Returns (alarm, direction, run_length)."""
        z = (x - self.mu_0) / self.sigma   # standardize
        
        self.S_pos = max(0, self.S_pos + z - self.k)
        self.S_neg = max(0, self.S_neg - z - self.k)
        self.run_length += 1
        
        if self.S_pos > self.h:
            self.S_pos = 0.0
            return True, 'increase', self.run_length
        elif self.S_neg > self.h:
            self.S_neg = 0.0
            return True, 'decrease', self.run_length
        
        return False, None, self.run_length
    
    def reset_baseline(self, new_mean, new_std):
        """Update baseline after confirmed operational change (e.g., new OTA)."""
        self.mu_0 = new_mean
        self.sigma = new_std
        self.S_pos = 0.0
        self.S_neg = 0.0
        self.run_length = 0

EWMA (Exponentially Weighted Moving Average)

EWMA smooths out high-frequency noise while preserving trend information. It is better suited than CUSUM for metrics with high variance (e.g., per-mission intervention counts, which are Poisson-distributed).

python
import math

class EWMADetector:
    """
    EWMA control chart for noisy fleet metrics.
    
    lambda_ (smoothing): 0.05-0.25 typical
        0.05: sensitive to small shifts, slower to detect large shifts
        0.25: detects large shifts quickly, less sensitive to small shifts
    
    For intervention rates (high variance): lambda_=0.10
    For sensor health scores (low variance): lambda_=0.20
    """
    
    def __init__(self, lambda_=0.10, L=3.0, baseline_mean=None, baseline_std=None):
        self.lambda_ = lambda_
        self.L = L                   # control limit multiplier (3.0 = ~0.27% false alarm)
        self.mu_0 = baseline_mean
        self.sigma = baseline_std
        self.z = baseline_mean       # EWMA statistic
        self.n = 0
    
    def update(self, x):
        """Process new observation."""
        self.n += 1
        self.z = self.lambda_ * x + (1 - self.lambda_) * self.z
        
        # Time-varying control limits (accounts for EWMA startup)
        sigma_z = self.sigma * math.sqrt(
            (self.lambda_ / (2 - self.lambda_)) * 
            (1 - (1 - self.lambda_) ** (2 * self.n))
        )
        
        ucl = self.mu_0 + self.L * sigma_z
        lcl = self.mu_0 - self.L * sigma_z
        
        if self.z > ucl:
            return True, 'increase', self.z - self.mu_0
        elif self.z < lcl:
            return True, 'decrease', self.z - self.mu_0
        
        return False, None, self.z - self.mu_0

Which method for which KPI:

| KPI | Recommended Method | Rationale |
|---|---|---|
| Intervention rate (fleet/airport) | EWMA (lambda=0.10) | High variance (Poisson), need smoothing |
| Perception mAP (per-vehicle) | CUSUM (k=0.5, h=4) | Low variance, detect small persistent drops |
| Localization covariance trace | CUSUM (k=0.3, h=5) | Safety-critical, want low false-alarm rate |
| Sensor health score | EWMA (lambda=0.20) | Moderate variance, want trend sensitivity |
| Mission success rate | EWMA (lambda=0.15) | Binomial variance, need smoothing |
| Compute utilization | CUSUM (k=1.0, h=4) | Detect step changes from code changes |
| Planning jerk/tracking error | EWMA (lambda=0.10) | High variance from scenario diversity |
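
As a sketch, the table translates directly into detector instances using the CUSUMDetector and EWMADetector classes above (the baseline values are placeholders estimated during a commissioning period):

python
# One detector per KPI, parameterized per the table above.
DETECTOR_CONFIG = {
    'intervention_rate':      ('ewma',  {'lambda_': 0.10}),
    'perception_mAP':         ('cusum', {'k': 0.5, 'h': 4.0}),
    'localization_cov_trace': ('cusum', {'k': 0.3, 'h': 5.0}),
    'sensor_health_score':    ('ewma',  {'lambda_': 0.20}),
    'mission_success_rate':   ('ewma',  {'lambda_': 0.15}),
    'compute_utilization':    ('cusum', {'k': 1.0, 'h': 4.0}),
    'planning_jerk':          ('ewma',  {'lambda_': 0.10}),
}

def build_detectors(baselines):
    """baselines: {kpi: (mean, std)} estimated from 30+ days of normal ops."""
    detectors = {}
    for kpi, (kind, params) in DETECTOR_CONFIG.items():
        mean, std = baselines[kpi]
        cls = EWMADetector if kind == 'ewma' else CUSUMDetector
        detectors[kpi] = cls(baseline_mean=mean, baseline_std=std, **params)
    return detectors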

2.3 Multivariate Anomaly Detection

Univariate monitors catch single-metric deviations. But fleet anomalies often manifest as coordinated shifts across multiple metrics that individually remain within bounds.

Example: Perception mAP drops 5% (within normal variance) AND localization covariance increases 15% (within normal variance) AND planning jerk increases 10% (within normal variance). Each alone is not alarming. Together they indicate a systematic problem.

Isolation Forest for Fleet Telemetry

Isolation Forest (Liu et al., 2008) is effective for high-dimensional fleet telemetry because it handles mixed feature types, requires no distributional assumptions, and scales linearly.

python
from sklearn.ensemble import IsolationForest
from sklearn.preprocessing import RobustScaler
import numpy as np

class FleetAnomalyDetector:
    """
    Multivariate anomaly detection on fleet telemetry snapshots.
    
    Each observation is a fixed-length feature vector representing
    one vehicle's state over a time window (e.g., 1 hour).
    """
    
    # Feature vector per vehicle per hour
    FEATURES = [
        'intervention_count',          # count in window
        'perception_mAP_mean',         # mean mAP over window
        'perception_mAP_std',          # mAP stability
        'localization_cov_trace_mean', # mean position uncertainty
        'localization_cov_trace_max',  # worst-case uncertainty
        'planning_jerk_p95',           # 95th percentile jerk
        'control_tracking_error_mean', # mean lateral/longitudinal error
        'sensor_health_min',           # worst sensor health in window
        'compute_gpu_util_mean',       # GPU utilization
        'compute_latency_p99',         # 99th percentile end-to-end latency
        'speed_mean',                  # average operating speed
        'distance_km',                 # distance driven in window
        'ota_version_hash',            # categorical -> ordinal encoding
        'map_age_hours',               # hours since last map update
        'weather_visibility_m',        # METAR visibility
        'weather_precip_intensity',    # precipitation rate
        'time_of_day_sin',            # cyclic encoding of hour
        'time_of_day_cos',
        'airport_id',                  # categorical -> one-hot
    ]
    
    def __init__(self, contamination=0.05, n_estimators=200):
        self.model = IsolationForest(
            contamination=contamination,  # expect ~5% anomalous observations
            n_estimators=n_estimators,
            max_samples='auto',
            random_state=42,
            n_jobs=-1,
        )
        self.scaler = None  # fit during training
    
    def fit(self, historical_data: np.ndarray):
        """Fit on 30+ days of normal fleet operation data."""
        self.scaler = RobustScaler().fit(historical_data)
        scaled = self.scaler.transform(historical_data)
        self.model.fit(scaled)
    
    def score(self, current_snapshot: np.ndarray):
        """
        Score current fleet state.
        Returns anomaly_scores (lower = more anomalous, <0 = anomaly).
        """
        scaled = self.scaler.transform(current_snapshot)
        return self.model.decision_function(scaled)

Autoencoder for Temporal Patterns

When the anomaly is not a point-in-time shift but a temporal pattern (e.g., oscillating localization error that correlates with aircraft departure schedule), a sequence autoencoder captures what Isolation Forest misses.

python
import torch
import torch.nn as nn

class FleetTemporalAutoencoder(nn.Module):
    """
    LSTM autoencoder for detecting anomalous temporal patterns
    in per-vehicle telemetry sequences.
    
    Input: (batch, seq_len=24, n_features=19)  -- 24 hours x 19 features
    Output: reconstruction of input
    Anomaly score: reconstruction error (MSE per feature, weighted)
    """
    
    def __init__(self, n_features=19, hidden_dim=64, latent_dim=16):
        super().__init__()
        
        # Encoder
        self.encoder_lstm = nn.LSTM(
            input_size=n_features,
            hidden_size=hidden_dim,
            num_layers=2,
            batch_first=True,
            dropout=0.1,
        )
        self.encoder_fc = nn.Linear(hidden_dim, latent_dim)
        
        # Decoder
        self.decoder_fc = nn.Linear(latent_dim, hidden_dim)
        self.decoder_lstm = nn.LSTM(
            input_size=hidden_dim,
            hidden_size=hidden_dim,
            num_layers=2,
            batch_first=True,
            dropout=0.1,
        )
        self.output_fc = nn.Linear(hidden_dim, n_features)
    
    def encode(self, x):
        _, (h_n, _) = self.encoder_lstm(x)
        return self.encoder_fc(h_n[-1])
    
    def decode(self, z, seq_len):
        h = self.decoder_fc(z).unsqueeze(1).repeat(1, seq_len, 1)
        out, _ = self.decoder_lstm(h)
        return self.output_fc(out)
    
    def forward(self, x):
        z = self.encode(x)
        return self.decode(z, x.size(1))
    
    def anomaly_score(self, x, feature_weights=None):
        """
        Per-feature weighted reconstruction error.
        
        feature_weights: higher weight for safety-critical features
          e.g., perception_mAP: 2.0, localization_cov: 2.0, 
                intervention_count: 3.0, sensor_health: 1.5
        """
        x_hat = self.forward(x)
        mse_per_feature = ((x - x_hat) ** 2).mean(dim=1)  # (batch, n_features)
        
        if feature_weights is not None:
            mse_per_feature = mse_per_feature * feature_weights
        
        return mse_per_feature.sum(dim=-1)  # scalar anomaly score per sample

2.4 Hierarchical Anomaly Detection

The key to avoiding alert fatigue is hierarchical detection that only escalates when lower levels cannot self-resolve.

HIERARCHICAL ANOMALY DETECTION FLOW

       ┌────────────────────────────────────────────────┐
       │ LEVEL 0: Fleet-wide                             │
       │ EWMA on fleet_intervention_rate                 │
       │ Isolation Forest on fleet telemetry snapshot     │
       │ Check: Is the fleet as a whole anomalous?       │
       └───────────────┬────────────────────────────────┘
                       │ ALARM

       ┌────────────────────────────────────────────────┐
       │ LEVEL 1: Airport decomposition                  │
       │ Per-airport CUSUM on intervention_rate           │
       │ Question: Which airport(s) drive the anomaly?   │
       │ Output: Airport B is anomalous, A and C normal  │
       └───────────────┬────────────────────────────────┘
                       │ AIRPORT B FLAGGED

       ┌────────────────────────────────────────────────┐
       │ LEVEL 2: Vehicle decomposition                  │
       │ Per-vehicle CUSUM within Airport B              │
       │ Question: All vehicles or subset?               │
       │ Output: 12/15 vehicles affected                 │
       │ Note: 12/15 -> likely environmental, not vehicle│
       │       3/15 -> likely vehicle-specific            │
       └───────────────┬────────────────────────────────┘
                       │ 12/15 VEHICLES -> ENVIRONMENTAL

       ┌────────────────────────────────────────────────┐
       │ LEVEL 3: Subsystem decomposition                │
       │ Per-subsystem metrics for affected vehicles     │
       │ Question: Perception, localization, or planning?│
       │ Output: Perception mAP degraded on 10/12,       │
       │         Localization normal on all 12            │
       └───────────────┬────────────────────────────────┘
                       │ PERCEPTION SUBSYSTEM

       ┌────────────────────────────────────────────────┐
       │ LEVEL 4: Component decomposition                │
       │ Per-sensor health, per-class mAP, per-range mAP│
       │ Question: Which sensor or class or range band?  │
       │ Output: Personnel class mAP dropped 25%,        │
       │         other classes stable                     │
       └───────────────┬────────────────────────────────┘
                       │ PERSONNEL DETECTION REGRESSION

       ┌────────────────────────────────────────────────┐
       │ ATTRIBUTION: Cross-reference with context       │
       │ OTA v3.2.1 deployed 2 days ago: YES             │
       │ Weather change: fog onset this week              │
       │ Conclusion: OTA regression on personnel class    │
       │             exacerbated by seasonal fog          │
       └────────────────────────────────────────────────┘

Fraction-of-vehicles heuristic for rapid triage:

| Fraction Affected | Likely Root Cause Category | Rationale |
|---|---|---|
| > 80% at one airport | Environment, map, airport-wide change | External factor affecting all vehicles equally |
| 30-80% at one airport | OTA regression, partial environment | Subset of scenarios or routes affected |
| < 30% at one airport | Vehicle-specific (sensor, calibration) | Individual vehicle hardware or config |
| > 50% fleet-wide | OTA regression, fleet-wide software bug | Only software changes propagate globally |
| Correlated with time | Weather, time-of-day, shift change | Temporal pattern indicates external driver |
| Correlated with zone | Map, construction, aircraft type at stands | Spatial pattern indicates localized cause |
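
As a sketch, the heuristic reduces to a few comparisons (the returned category labels are illustrative):

python
def triage_by_fraction(n_affected, n_total, fleet_wide=False):
    """Map the fraction of affected vehicles to a likely root-cause category."""
    frac = n_affected / n_total
    if fleet_wide and frac > 0.5:
        return 'ota_regression_or_software_bug'    # only software propagates globally
    if frac >= 0.8:                                # > 80% band (inclusive, so 12/15 qualifies)
        return 'environment_map_or_airport_change' # external factor hits all vehicles
    if frac >= 0.3:
        return 'ota_regression_or_partial_environment'  # subset of scenarios/routes
    return 'vehicle_specific'                      # sensor, calibration, config

# Example: 12 of 15 vehicles at Airport B -> environmental/airport-wide cause
assert triage_by_fraction(12, 15) == 'environment_map_or_airport_change'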

2.5 Alert Fatigue Management

Alert fatigue is the primary failure mode of monitoring systems. The solution is aggressive filtering:

Alert tiering:

| Tier | Criteria | Notification | Response Time |
|---|---|---|---|
| P0: Safety | STL violation, collision/near-miss, geofence breach | Pager + SMS + dashboard | < 5 min |
| P1: Operational | Fleet-level KPI anomaly (CUSUM alarm on L0) | Slack + dashboard | < 1 hour |
| P2: Degradation | Airport-level or multi-vehicle subsystem anomaly | Dashboard + daily digest | < 4 hours |
| P3: Informational | Single-vehicle anomaly, single-sensor degradation | Daily digest | < 24 hours |
| P4: Analytics | Trend detected, no immediate impact | Weekly report | Next planning cycle |

Suppression rules (a code sketch follows the list):

  1. Deduplication: Same root cause on same vehicle within 4 hours -> suppress subsequent alerts
  2. Known-issue suppression: Acknowledged anomalies with open tickets -> suppress until resolution
  3. Maintenance window: Alerts during scheduled maintenance -> suppress automatically
  4. Correlated suppression: If airport-level alert fires, suppress all vehicle-level alerts at that airport for 30 min (give the attribution system time to diagnose)
  5. Self-resolving: If CUSUM resets within 2 hours (anomaly self-resolves), downgrade to P4
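
A minimal sketch of rules 1-4 applied to an alert stream; the alert/state schema is an assumption, and rule 5's downgrade is handled elsewhere since it rewrites severity rather than suppressing:

python
from datetime import timedelta

def should_suppress(alert, state):
    """
    alert: dict with 'root_cause', 'vehicle_id', 'airport_id', 'timestamp'
           (datetime), and 'level' ('fleet'/'airport'/'vehicle')
    state: dict tracking recent alerts, open tickets, maintenance windows,
           and the start time of any active airport-level alarm
    """
    ts = alert['timestamp']
    # 1. Deduplication: same cause on same vehicle within 4 hours
    key = (alert['root_cause'], alert['vehicle_id'])
    last = state['last_seen'].get(key)
    if last and ts - last < timedelta(hours=4):
        return True
    # 2. Known issue with an open ticket
    if alert['root_cause'] in state['open_tickets']:
        return True
    # 3. Scheduled maintenance window for this vehicle
    for start, end in state['maintenance_windows'].get(alert['vehicle_id'], []):
        if start <= ts <= end:
            return True
    # 4. Airport-level alarm active: suppress vehicle-level alerts for 30 min
    airport_alarm = state['airport_alarms'].get(alert['airport_id'])
    if (alert['level'] == 'vehicle' and airport_alarm
            and ts - airport_alarm < timedelta(minutes=30)):
        return True
    state['last_seen'][key] = ts
    return False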

Target alert volume:

| Fleet Size | Raw Anomalies/Day | After Hierarchical Filtering | After Suppression | Analyst Load |
|---|---|---|---|---|
| 20 vehicles | 5-10 | 2-5 | 1-3 | Manageable (1 analyst) |
| 50 vehicles | 15-30 | 5-10 | 2-5 | Manageable (1-2 analysts) |
| 100 vehicles | 40-80 | 10-20 | 4-8 | Requires 2-3 analysts |
| 200 vehicles | 80-160 | 15-30 | 5-12 | Requires 3-4 analysts |

Without hierarchical filtering and suppression, a 200-vehicle fleet would require 40-80 analysts to manually triage every anomaly. With the system, 3-4 analysts handle the residual that automated attribution cannot resolve.


3. Causal Discovery from Operational Data

3.1 Why Causal, Not Just Correlational

Correlation identifies that fog and high intervention rates co-occur. Causation identifies that fog causes LiDAR range reduction, which causes perception mAP drop, which causes missed personnel detections, which causes operator interventions. The difference matters because:

  1. Actionable remediation: Correlation says "fog = bad." Causation says "deploy radar-primary mode during fog" or "retrain personnel detector with fog augmentation."
  2. Multi-cause disentanglement: When fog AND OTA change co-occur, correlation cannot separate their contributions. Causal analysis can.
  3. Regulatory evidence: EU PLD 2024/2853 requires demonstrating causal contribution to incidents, not mere correlation.

3.2 The Fleet Operations Causal DAG

The causal DAG for fleet operations is distinct from the planning-level DAG in causal-reasoning-counterfactual.md. It operates at a higher abstraction level with different timescales.

FLEET OPERATIONS CAUSAL DAG

Exogenous (uncontrollable):
  WEATHER ──────────────────────┐
  TIME_OF_DAY ──────────────────┤
  AIRCRAFT_MIX ─────────────────┤
  CONSTRUCTION ─────────────────┤
  OPERATOR_EXPERIENCE ──────────┤

Controllable decisions:         │
  OTA_VERSION ──────────────┐   │
  MAP_VERSION ──────────────┤   │
  PARAMETER_SET ────────────┤   │
  CALIBRATION ──────────────┤   │
                            │   │
                            ▼   ▼
                    ┌───────────────────┐
  WEATHER ─────────→│ SENSOR_HEALTH     │
  TIME_OF_DAY ─────→│ (degradation rate │
  CONSTRUCTION ────→│  per vehicle)     │
                    └────────┬──────────┘


  OTA_VERSION ──────→ PERCEPTION_QUALITY ←── AIRCRAFT_MIX
  SENSOR_HEALTH ───→        │                (new types = OOD)


  MAP_VERSION ──────→ LOCALIZATION_QUALITY
  CONSTRUCTION ────→        │
  WEATHER ─────────→        │ (wet surface = GPS multipath)


  PERCEPTION ──────→ PLANNING_QUALITY ←── PARAMETER_SET
  LOCALIZATION ───→         │


  PLANNING ────────→ CONTROL_QUALITY ←── CALIBRATION
  VEHICLE_HEALTH ─→         │


  CONTROL ─────────→ INTERVENTION_RATE ←── OPERATOR_EXPERIENCE
  PERCEPTION ─────→         │                (threshold varies)
  ODD_COMPLIANCE ─→         │


                    MISSION_SUCCESS

Variable definitions for the DAG:

| Variable | Type | Source | Granularity |
|---|---|---|---|
| WEATHER | Continuous (visibility, precip, wind, temp) | METAR feed, on-vehicle sensors | 30 min (METAR), 1 Hz (vehicle) |
| TIME_OF_DAY | Cyclic (sin/cos encoding of hour) | System clock | 1 hour |
| AIRCRAFT_MIX | Categorical (type distribution at stands) | Airport FIDS/A-CDM | Per-flight |
| CONSTRUCTION | Binary/categorical per zone | Airport NOTAMs, fleet detection | Daily |
| OPERATOR_EXPERIENCE | Ordinal (hours on system) | HR/training database | Per-shift |
| OTA_VERSION | Categorical (model + code version hash) | Fleet management system | Per-deployment |
| MAP_VERSION | Ordinal (age in hours + version ID) | Map management system | Per-update |
| SENSOR_HEALTH | Continuous [0, 1] per sensor | Health monitor (1 Hz) | Aggregated hourly |
| PERCEPTION_QUALITY | Continuous (mAP, per-class, per-range) | Shadow evaluation or production metrics | Per-mission |
| LOCALIZATION_QUALITY | Continuous (cov trace, RMSE to ground truth) | GTSAM output | 10 Hz, aggregated per-mission |
| PLANNING_QUALITY | Continuous (jerk, tracking error, clearance) | Planning node output | Per-mission |
| INTERVENTION_RATE | Count per 100 km | Operator input log | Per-intervention |
| MISSION_SUCCESS | Binary | Fleet management system | Per-mission |
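
For downstream attribution and diagnosis the DAG needs a machine-readable form. A minimal sketch using networkx, with the edge list transcribed (abbreviated) from the diagram above:

python
import networkx as nx

# Edges follow the fleet operations DAG above (abbreviated).
FLEET_DAG_EDGES = [
    ('WEATHER', 'SENSOR_HEALTH'), ('TIME_OF_DAY', 'SENSOR_HEALTH'),
    ('CONSTRUCTION', 'SENSOR_HEALTH'),
    ('OTA_VERSION', 'PERCEPTION_QUALITY'),
    ('SENSOR_HEALTH', 'PERCEPTION_QUALITY'),
    ('AIRCRAFT_MIX', 'PERCEPTION_QUALITY'),
    ('MAP_VERSION', 'LOCALIZATION_QUALITY'),
    ('CONSTRUCTION', 'LOCALIZATION_QUALITY'),
    ('WEATHER', 'LOCALIZATION_QUALITY'),
    ('PERCEPTION_QUALITY', 'PLANNING_QUALITY'),
    ('LOCALIZATION_QUALITY', 'PLANNING_QUALITY'),
    ('PARAMETER_SET', 'PLANNING_QUALITY'),
    ('PLANNING_QUALITY', 'CONTROL_QUALITY'),
    ('CALIBRATION', 'CONTROL_QUALITY'),
    ('VEHICLE_HEALTH', 'CONTROL_QUALITY'),
    ('CONTROL_QUALITY', 'INTERVENTION_RATE'),
    ('PERCEPTION_QUALITY', 'INTERVENTION_RATE'),
    ('ODD_COMPLIANCE', 'INTERVENTION_RATE'),
    ('OPERATOR_EXPERIENCE', 'INTERVENTION_RATE'),
    ('INTERVENTION_RATE', 'MISSION_SUCCESS'),
]

fleet_dag = nx.DiGraph(FLEET_DAG_EDGES)
assert nx.is_directed_acyclic_graph(fleet_dag)

# Example query: all upstream causes of INTERVENTION_RATE
causes = nx.ancestors(fleet_dag, 'INTERVENTION_RATE')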

3.3 NOTEARS for Structure Learning

NOTEARS (Zheng et al., 2018) formulates causal structure learning as a continuous optimization problem, converting the NP-hard combinatorial DAG search into a smooth optimization with an acyclicity constraint.

python
import numpy as np
from scipy.optimize import minimize

def notears_linear(X, lambda1=0.1, max_iter=100, h_tol=1e-8):
    """
    NOTEARS: continuous optimization for DAG structure learning.
    
    X: (n_samples, d_variables) -- fleet telemetry data
    lambda1: L1 penalty (sparsity)
    
    Returns: W (d x d) weighted adjacency matrix
             W[i,j] != 0 means variable i directly causes variable j
    
    For fleet operations with ~15-20 variables:
        d=20 -> 400 parameters -> tractable (seconds on CPU)
        Needs n >> d samples: 30 days x 24 hours x 50 vehicles = 36,000 samples
        More than sufficient for d=20
    """
    n, d = X.shape
    
    def _loss(W):
        """Least-squares loss for linear SEM: X = X @ W + noise."""
        M = X @ W
        R = X - M
        return 0.5 / n * (R ** 2).sum()
    
    def _h(W):
        """
        Acyclicity constraint: h(W) = tr[(I + W*W/d)^d] - d, a polynomial
        surrogate for tr(e^{W o W}) - d; equals 0 iff W encodes a DAG.
        """
        E = np.linalg.matrix_power(np.eye(d) + W * W / d, d)
        return np.trace(E) - d
    
    def _objective(w_flat, rho, alpha):
        W = w_flat.reshape(d, d)
        loss = _loss(W)
        h = _h(W)
        penalty = lambda1 * np.abs(W).sum()
        return loss + 0.5 * rho * h * h + alpha * h + penalty
    
    # Augmented Lagrangian optimization
    W = np.zeros((d, d))
    rho, alpha = 1.0, 0.0
    
    for _ in range(max_iter):
        result = minimize(
            _objective, W.flatten(), args=(rho, alpha),
            method='L-BFGS-B',
            # no analytic jac supplied: scipy falls back to finite differences
            # (slow but correct); in practice, supply the gradient via autograd
        )
        W = result.x.reshape(d, d)
        h = _h(W)
        
        if abs(h) < h_tol:
            break
        
        alpha += rho * h
        rho *= 10
    
    # Threshold small edges
    W[np.abs(W) < 0.1] = 0.0
    return W

Practical considerations for fleet data:

  1. Non-stationarity: Fleet data is non-stationary (seasonal, OTA changes, fleet growth). Fit NOTEARS on rolling 90-day windows and compare the learned DAGs across windows for structural stability (see the sketch after this list).
  2. Mixed data types: Fleet variables mix continuous (mAP), ordinal (map age), and categorical (OTA version). Use NOTEARS-NONLINEAR (Zheng et al., 2020) with MLP basis functions, or discretize continuous variables.
  3. Latent confounders: Some causes are unobserved (e.g., runway resurfacing affects tire grip and sensor vibration simultaneously, but "runway condition" is not directly measured). FCI algorithm (Spirtes et al., 2000) handles latent confounders by identifying ancestral relationships rather than direct causes.
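
For the rolling-window stability check in item 1, a minimal sketch (W_windows is a list of NOTEARS adjacency matrices, one per 90-day window):

python
import numpy as np

def edge_stability(W_windows, threshold=0.1):
    """
    Fraction of rolling windows in which each candidate edge appears.
    Edges with stability near 1.0 are structurally stable; edges that
    flicker across windows are flagged for review.
    """
    presence = np.stack([np.abs(W) > threshold for W in W_windows])
    return presence.mean(axis=0)   # (d, d) matrix of stabilities in [0, 1]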

3.4 PC Algorithm for Constraint-Based Discovery

The PC algorithm (Spirtes et al., 2000) discovers causal structure via conditional independence tests. It complements NOTEARS by making weaker parametric assumptions.

python
from causallearn.search.ConstraintBased.PC import pc
from causallearn.utils.cit import fisherz

def discover_fleet_dag(telemetry_df, alpha=0.05):
    """
    PC algorithm for fleet causal structure discovery.
    
    Uses Fisher-z test for conditional independence (assumes Gaussian;
    for non-Gaussian, use KCI or RCIT kernel-based tests).
    
    alpha: significance level for conditional independence tests
           0.01 -> conservative (fewer edges, risk missing real causes)
           0.05 -> moderate (standard choice)
           0.10 -> aggressive (more edges, risk spurious causes)
    """
    data = telemetry_df[FLEET_VARIABLES].values
    
    cg = pc(
        data,
        alpha=alpha,
        indep_test=fisherz,
        stable=True,     # order-independent (recommended for reproducibility)
        uc_rule=0,       # conservative orientation
        uc_priority=-1,  # background knowledge can override
    )
    
    # Background knowledge (known causal directions) constrains the search:
    # weather cannot be caused by vehicle state, and OTA version is a
    # deployment decision, not caused by operations. causallearn accepts
    # such constraints via a BackgroundKnowledge object; the dict below is
    # a schematic stand-in for that mechanism.
    background_knowledge = {
        ('WEATHER', '*'): 'forbidden',      # nothing causes weather
        ('TIME_OF_DAY', '*'): 'forbidden',  # nothing causes time
        ('*', 'OTA_VERSION'): 'forbidden',  # OTA is a decision, not caused by operations
    }
    
    return cg.G, cg.sepset  # graph + separation sets for causal reasoning

3.5 Granger Causality for Temporal Relationships

Some causal relationships are purely temporal: an OTA deployment today causes intervention rate increase tomorrow. Granger causality captures these time-lagged relationships.

python
from statsmodels.tsa.stattools import grangercausalitytests

def fleet_granger_analysis(telemetry_df, max_lag=48, alpha=0.05):
    """
    Pairwise Granger causality tests across fleet variables.
    
    max_lag=48: test up to 48 hours of lag (covers OTA effects that
                may take 1-2 days to manifest as interventions).
    
    Returns: dictionary of significant Granger-causal relationships
             with optimal lag and p-value.
    """
    variables = [
        'intervention_rate', 'perception_mAP', 'localization_cov',
        'sensor_health_min', 'weather_visibility', 'ota_version_age',
        'map_age_hours', 'compute_latency_p99',
    ]
    
    results = {}
    for cause_var in variables:
        for effect_var in variables:
            if cause_var == effect_var:
                continue
            
            data = telemetry_df[[effect_var, cause_var]].dropna()
            
            try:
                gc_result = grangercausalitytests(
                    data.values, maxlag=max_lag, verbose=False
                )
                
                # Find optimal lag (lowest p-value)
                best_lag = min(
                    gc_result.keys(),
                    key=lambda lag: gc_result[lag][0]['ssr_ftest'][1]
                )
                p_value = gc_result[best_lag][0]['ssr_ftest'][1]
                
                if p_value < alpha:
                    results[(cause_var, effect_var)] = {
                        'lag_hours': best_lag,
                        'p_value': p_value,
                        'f_stat': gc_result[best_lag][0]['ssr_ftest'][0],
                    }
            except Exception:
                continue  # insufficient data or singular matrix
    
    return results

Expected Granger-causal relationships for airside fleet:

| Cause | Effect | Expected Lag | Mechanism |
|---|---|---|---|
| ota_version_age | intervention_rate | 6-48 hours | OTA deployment -> operational exposure -> incidents |
| weather_visibility | perception_mAP | 0-1 hours | Near-immediate (fog onset -> range reduction) |
| weather_visibility | sensor_health_min | 2-24 hours | Delayed (moisture -> lens contamination) |
| map_age_hours | localization_cov | 0-4 hours | Progressive (map drift -> localization uncertainty) |
| perception_mAP | intervention_rate | 0-2 hours | Near-immediate (missed detection -> intervention) |
| sensor_health_min | perception_mAP | 0-1 hours | Near-immediate (degraded sensor -> degraded perception) |
| compute_latency_p99 | planning_quality | 0 hours | Immediate (latency -> stale plan) |

3.6 Combining Discovery Methods

No single method is sufficient. The practical approach combines all three:

CAUSAL DISCOVERY PIPELINE

  1. Expert prior DAG
     - Encode known physics (weather -> sensor degradation)
     - Encode known architecture (perception -> planning -> control)
     - Encode known impossibilities (vehicle state cannot cause weather)
     Result: skeleton with ~60-70% of edges pre-specified

  2. NOTEARS on 90-day fleet data
     - Discover data-driven edges not in expert prior
     - Validate expert edges (are they supported by data?)
     - Weight edges by coefficient magnitude
     Result: data-augmented DAG

  3. PC algorithm as validation
     - Independent method with different assumptions
     - Edges present in BOTH NOTEARS and PC have high confidence
     - Edges in only one method flagged for human review
     Result: confidence-annotated DAG

  4. Granger causality for temporal structure
     - Add lag information to each edge
     - Discover temporal-only relationships missed by instantaneous methods
     Result: temporally-annotated causal DAG

  5. Human expert review
     - Review edges present in data but not in prior (potential discovery)
     - Review edges in prior but not in data (potential data limitation)
     - Final DAG approved for production use
     Result: validated fleet operations causal DAG
     
  Update cadence: Re-run quarterly or after major fleet changes (new airport, 
                  new vehicle type, major OTA)
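
Step 3's consensus rule can be made concrete. A minimal sketch, assuming the PC output has already been converted to a boolean adjacency matrix and NOTEARS returned the weighted matrix W from Section 3.3:

python
def consensus_edges(W_notears, pc_adjacency, names, w_threshold=0.1):
    """
    Annotate each candidate edge with a confidence tier:
      'high'   -- present in both the NOTEARS and PC outputs
      'review' -- present in exactly one method (flag for human review)
    pc_adjacency: boolean (d, d) matrix derived from the PC graph.
    """
    d = len(names)
    annotated = {}
    for i in range(d):
        for j in range(d):
            in_notears = abs(W_notears[i, j]) > w_threshold
            in_pc = bool(pc_adjacency[i][j])
            if in_notears and in_pc:
                annotated[(names[i], names[j])] = 'high'
            elif in_notears or in_pc:
                annotated[(names[i], names[j])] = 'review'
    return annotated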

4. Shapley-Value Attribution

4.1 The Attribution Question

Given that the fleet is anomalous, how much of the anomaly is attributable to each factor? This is precisely the Shapley value question from cooperative game theory.

The game: Players are the potential root causes (weather, OTA version, map age, sensor health, etc.). The "payoff" is the anomaly score. The Shapley value of each player is its fair marginal contribution to the anomaly, averaged over all possible coalitions.

Formal definition:

phi_i = (1/|N|!) * sum over all permutations pi of N:
    [v(S_pi^i union {i}) - v(S_pi^i)]

where:
  N = set of all potential causes
  S_pi^i = set of causes that appear before i in permutation pi
  v(S) = anomaly score when only causes in S are at anomalous levels
         and all others are at baseline levels
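
For small player sets, such as the three airports in Section 4.3, the definition can be evaluated exactly by enumerating permutations. A minimal sketch with hypothetical coalition scores:

python
from itertools import permutations

def exact_shapley(players, v):
    """
    Exact Shapley values by permutation enumeration.
    v: callable mapping a frozenset of players to an anomaly score.
    Feasible only for small |N| (e.g., 3 airports -> 6 permutations).
    """
    phi = {p: 0.0 for p in players}
    perms = list(permutations(players))
    for perm in perms:
        coalition = frozenset()
        for p in perm:
            phi[p] += v(coalition | {p}) - v(coalition)
            coalition = coalition | {p}
    return {p: val / len(perms) for p, val in phi.items()}

# Example: v(S) = fleet anomaly score with only airports in S anomalous
# (the scores below are hypothetical illustration values)
scores = {
    frozenset(): 0.0,
    frozenset('A'): 0.1,  frozenset('B'): 0.6,  frozenset('C'): 0.1,
    frozenset('AB'): 0.7, frozenset('AC'): 0.2, frozenset('BC'): 0.7,
    frozenset('ABC'): 0.8,
}
phi = exact_shapley(['A', 'B', 'C'], scores.__getitem__)
# phi['B'] dominates -> "Airport B contributes most of the fleet anomaly"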

4.2 SHAP for Anomaly Model Attribution

SHAP (SHapley Additive exPlanations) computes Shapley values efficiently for ML model predictions. Applied to the Isolation Forest anomaly model, it answers "which features contribute most to this observation being anomalous?"

python
import math
import shap
import numpy as np

class AnomalyAttributor:
    """
    SHAP-based attribution for fleet anomaly scores.
    
    Given an anomalous fleet state, decomposes the anomaly score
    into per-feature contributions using Shapley values.
    """
    
    def __init__(self, anomaly_model, background_data):
        """
        anomaly_model: trained IsolationForest or autoencoder
        background_data: representative normal fleet data (100-1000 samples)
        """
        self.model = anomaly_model
        
        # TreeExplainer for IsolationForest (exact Shapley, fast)
        # KernelSHAP for autoencoder (approximate, slower)
        if hasattr(anomaly_model, 'estimators_'):
            self.explainer = shap.TreeExplainer(
                anomaly_model, 
                data=background_data,
                feature_perturbation='interventional',  # causal, not observational
            )
        else:
            self.explainer = shap.KernelExplainer(
                anomaly_model.predict,
                background_data,
            )  # pass nsamples (e.g., 500) to shap_values() to bound cost
    
    def attribute(self, anomalous_observation):
        """
        Returns per-feature Shapley values for the anomaly.
        
        Positive value: feature contributes to anomaly (anomalous direction)
        Negative value: feature is normal (counteracts anomaly)
        Sum of all values = anomaly_score - expected_anomaly_score
        """
        shap_values = self.explainer.shap_values(anomalous_observation)
        
        # Rank features by absolute contribution
        feature_names = FleetAnomalyDetector.FEATURES
        contributions = sorted(
            zip(feature_names, shap_values[0]),
            key=lambda x: abs(x[1]),
            reverse=True,
        )
        
        return contributions
    
    def attribute_fleet(self, fleet_snapshot):
        """
        Attribute fleet-level anomaly to airports and vehicles.
        
        Uses distributed Shapley: treats each vehicle as a "player"
        and computes its marginal contribution to the fleet anomaly.
        """
        n_vehicles = len(fleet_snapshot)
        vehicle_shapley = np.zeros(n_vehicles)
        
        # For tractable fleet sizes (< 200), use permutation sampling
        n_permutations = min(1000, math.factorial(n_vehicles))
        
        for _ in range(n_permutations):
            perm = np.random.permutation(n_vehicles)
            prev_score = self._fleet_anomaly_score(fleet_snapshot[:0])  # empty
            
            for i, vehicle_idx in enumerate(perm):
                coalition = fleet_snapshot[perm[:i+1]]
                curr_score = self._fleet_anomaly_score(coalition)
                vehicle_shapley[vehicle_idx] += (curr_score - prev_score)
                prev_score = curr_score
        
        vehicle_shapley /= n_permutations
        return vehicle_shapley
    
    def _fleet_anomaly_score(self, coalition):
        """
        Anomaly score for a coalition of vehicles (assumed helper: mean
        per-vehicle IsolationForest score, sign-flipped so higher = more
        anomalous; an empty coalition scores 0).
        """
        if len(coalition) == 0:
            return 0.0
        return float(-self.model.decision_function(coalition).mean())

4.3 Hierarchical Shapley Decomposition

For fleet attribution, a flat Shapley over all features x vehicles is computationally intractable. Hierarchical decomposition mirrors the KPI hierarchy:

HIERARCHICAL SHAPLEY DECOMPOSITION

Step 1: Fleet -> Airport attribution
  Players: {Airport_A, Airport_B, Airport_C}
  Game: v(S) = fleet anomaly score with only airports in S contributing
  Result: "Airport B contributes 72% of fleet anomaly"
  Cost: 2^3 = 8 evaluations (exact Shapley feasible)

Step 2: Airport B -> Vehicle attribution
  Players: {Vehicle_1, ..., Vehicle_15} at Airport B
  Game: v(S) = Airport B anomaly score with only vehicles in S
  Result: "Vehicles 3, 7, 11 contribute 60% of Airport B anomaly"
  Cost: ~1000 permutation samples (approximate Shapley)

Step 3: Per-vehicle -> Feature attribution
  Players: {perception_mAP, localization_cov, sensor_health, ...}
  Game: v(S) = vehicle anomaly score with only features in S at anomalous level
  Result: "perception_mAP accounts for 45% of Vehicle 7's anomaly"
  Cost: TreeSHAP = exact in O(TLD) per observation (milliseconds)

Total: Fleet anomaly decomposed as:
  "72% Airport B, of which 60% from 3 vehicles,
   of which perception mAP drop is the primary driver,
   specifically personnel class detection in fog conditions."

4.4 Computational Cost and Approximations

| Method | Exact Cost | Approximation | Accuracy | Wall-Clock Time |
|---|---|---|---|---|
| TreeSHAP (per observation) | O(TLD) | Exact | 100% | < 10 ms |
| KernelSHAP (per observation) | O(2^M) | M=500 samples | ~95% | 1-5 seconds |
| Permutation SHAP (fleet) | O(N! * eval_cost) | 1000 permutations | ~90% | 5-30 seconds |
| Hierarchical SHAP (full) | O(levels * per_level) | Per above | ~85-90% | 10-60 seconds |

For real-time fleet dashboards, pre-compute SHAP values on the most recent anomalous snapshot and cache them. Update every 15-60 minutes.
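
A minimal caching wrapper sketch along those lines (class and parameter names are illustrative):

python
import time

class CachedAttribution:
    """Serve the latest SHAP attribution to dashboards; recompute on a timer."""
    
    def __init__(self, attributor, refresh_s=1800):   # 30-minute default
        self.attributor = attributor                   # AnomalyAttributor (4.2)
        self.refresh_s = refresh_s
        self._cached = None
        self._stamp = 0.0
    
    def get(self, latest_snapshot):
        if self._cached is None or time.time() - self._stamp > self.refresh_s:
            self._cached = self.attributor.attribute(latest_snapshot)
            self._stamp = time.time()
        return self._cached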

4.5 Shapley for Cross-Airport Comparison

A powerful application: when multiple airports operate, Shapley values answer "does Airport B's anomaly look like a known pattern?"

python
def compute_airport_shapley_profile(airport_data, reference_profiles, attributor):
    """
    Compare current airport anomaly against known root-cause profiles.
    
    attributor: a fitted AnomalyAttributor (Section 4.2)
    reference_profiles: dictionary of known anomaly signatures
      e.g., {
        'ota_regression': [high_perception, normal_localization, normal_sensor, ...],
        'fog_event': [moderate_perception, normal_localization, degraded_sensor, ...],
        'map_stale': [normal_perception, high_localization, normal_sensor, ...],
      }
    
    Returns: similarity to each known profile (cosine similarity of SHAP vectors)
    """
    current_shap = attributor.attribute(airport_data)
    current_vec = np.array([v for _, v in current_shap])
    
    similarities = {}
    for cause_name, ref_vec in reference_profiles.items():
        cos_sim = np.dot(current_vec, ref_vec) / (
            np.linalg.norm(current_vec) * np.linalg.norm(ref_vec) + 1e-8
        )
        similarities[cause_name] = cos_sim
    
    return similarities  # highest similarity = most likely root cause category

5. Bayesian Diagnosis Trees

5.1 Sequential Diagnosis as Bayesian Inference

When the anomaly detection and Shapley attribution narrow the problem to a small set of hypotheses, Bayesian diagnosis systematically eliminates alternatives by querying the cheapest diagnostic tests first.

The diagnosis as a decision tree:

                        Fleet anomaly detected

                    ┌──────────┴──────────┐
                    │ Check: How many      │
                    │ vehicles affected?   │
                    └──────────┬──────────┘
                     ┌─────────┴─────────┐
                     │                    │
                  > 50%                < 50%
                (fleet-wide)        (subset)
                     │                    │
              ┌──────┴──────┐      ┌──────┴──────┐
              │ Check: OTA   │      │ Check: Same  │
              │ deployed in  │      │ route/zone?  │
              │ last 72h?    │      └──────┬──────┘
              └──────┬──────┘        ┌─────┴─────┐
               ┌─────┴─────┐         │            │
              YES          NO      YES           NO
               │            │        │            │
        ┌──────┴──┐   ┌────┴────┐   │     ┌──────┴──────┐
        │Check:   │   │Check:   │   │     │Check: Same  │
        │Per-class│   │METAR    │   │     │vehicle(s)?  │
        │mAP drop?│   │change?  │   │     └──────┬──────┘
        └────┬────┘   └────┬────┘   │      ┌─────┴─────┐
          ┌──┴──┐       ┌──┴──┐     │    YES           NO
         YES   NO     YES   NO     │      │         (random)
          │     │      │     │      │      │
       OTA   Software  Weather  Map    Sensor    Investigate
    regression  bug    change  stale  degradation  further

5.2 Prior Probabilities from Fleet History

Prior probabilities are calibrated from historical fleet data. These update as the fleet matures and accumulates incidents.

python
import math

class BayesianDiagnosisEngine:
    """
    Bayesian root-cause diagnosis with sequential evidence gathering.
    
    Maintains posterior probabilities over 8 root cause hypotheses,
    updates with evidence from diagnostic tests.
    """
    
    # Prior probabilities (calibrated from industry data + fleet history)
    # These should be updated quarterly from actual fleet incident data
    ROOT_CAUSES = {
        'sensor_degradation':  0.25,   # most common in airside operations
        'weather_environment':  0.20,   # second most common
        'ota_model_regression': 0.15,   # happens with every OTA
        'map_staleness':        0.12,   # increases with construction frequency
        'software_bug':         0.10,   # decreases as codebase matures
        'operator_behavior':    0.08,   # new operators, policy changes
        'new_aircraft_type':    0.05,   # seasonal schedule changes
        'construction_layout':  0.05,   # airport development
    }
    
    # Likelihood functions: P(evidence | root_cause)
    # Each test produces a binary or categorical result
    # Likelihoods estimated from historical diagnosis data
    
    LIKELIHOODS = {
        'fraction_vehicles_affected': {
            # P(>50% vehicles affected | cause)
            'high': {
                'sensor_degradation':  0.15,   # usually individual vehicles
                'weather_environment':  0.85,   # affects all vehicles
                'ota_model_regression': 0.75,   # affects all vehicles with new OTA
                'map_staleness':        0.60,   # affects vehicles on stale-map routes
                'software_bug':         0.70,   # affects all vehicles with new code
                'operator_behavior':    0.10,   # individual operator variation
                'new_aircraft_type':    0.30,   # affects vehicles at specific stands
                'construction_layout':  0.20,   # affects vehicles in specific zones
            },
        },
        'ota_deployed_recently': {
            # P(OTA deployed in last 72h | cause)
            'yes': {
                'sensor_degradation':  0.10,   # coincidental
                'weather_environment':  0.10,   # coincidental
                'ota_model_regression': 0.95,   # direct cause
                'map_staleness':        0.10,   # coincidental
                'software_bug':         0.80,   # code change introduces bug
                'operator_behavior':    0.05,   # coincidental
                'new_aircraft_type':    0.05,   # coincidental
                'construction_layout':  0.05,   # coincidental
            },
        },
        'per_class_mAP_drop': {
            # P(specific class mAP degraded | cause)
            'yes': {
                'sensor_degradation':  0.40,   # affects all classes (range-dependent)
                'weather_environment':  0.50,   # affects range-dependent classes
                'ota_model_regression': 0.85,   # often class-specific regression
                'map_staleness':        0.05,   # map does not affect detection
                'software_bug':         0.30,   # can affect specific detection paths
                'operator_behavior':    0.02,   # does not affect perception
                'new_aircraft_type':    0.70,   # new type has different features
                'construction_layout':  0.10,   # may cause OOD detections
            },
        },
        'metar_changed': {
            # P(significant METAR change in last 24h | cause)
            'yes': {
                'sensor_degradation':  0.20,   # weather accelerates degradation
                'weather_environment':  0.90,   # direct indicator
                'ota_model_regression': 0.10,   # coincidental
                'map_staleness':        0.05,   # coincidental
                'software_bug':         0.05,   # coincidental
                'operator_behavior':    0.10,   # weather may change behavior
                'new_aircraft_type':    0.05,   # coincidental
                'construction_layout':  0.05,   # coincidental
            },
        },
        'localization_degraded': {
            # P(localization quality degraded | cause)
            'yes': {
                'sensor_degradation':  0.40,   # LiDAR degradation affects GTSAM
                'weather_environment':  0.30,   # GPS multipath in rain, fog
                'ota_model_regression': 0.05,   # OTA rarely changes localization
                'map_staleness':        0.85,   # primary indicator of map issues
                'software_bug':         0.20,   # can affect GTSAM pipeline
                'operator_behavior':    0.02,   # does not affect localization
                'new_aircraft_type':    0.05,   # does not affect localization
                'construction_layout':  0.75,   # layout change = map mismatch
            },
        },
        'zone_correlated': {
            # P(anomaly concentrated in specific zone | cause)
            'yes': {
                'sensor_degradation':  0.10,   # not zone-dependent
                'weather_environment':  0.15,   # sometimes micro-climate zones
                'ota_model_regression': 0.15,   # usually global
                'map_staleness':        0.70,   # construction is localized
                'software_bug':         0.10,   # usually global
                'operator_behavior':    0.20,   # operators assigned to zones
                'new_aircraft_type':    0.80,   # new aircraft at specific stands
                'construction_layout':  0.90,   # construction is localized
            },
        },
    }
    
    def __init__(self):
        self.posteriors = dict(self.ROOT_CAUSES)  # start with priors
        self.evidence_collected = []
    
    def update(self, test_name, result):
        """
        Bayesian update: P(cause | evidence) proportional to
        P(evidence | cause) * P(cause)
        """
        likelihoods = self.LIKELIHOODS[test_name][result]
        
        # Bayes rule
        new_posteriors = {}
        for cause, prior in self.posteriors.items():
            likelihood = likelihoods.get(cause, 0.5)  # default uniform
            new_posteriors[cause] = likelihood * prior
        
        # Normalize
        total = sum(new_posteriors.values())
        for cause in new_posteriors:
            new_posteriors[cause] /= total
        
        self.posteriors = new_posteriors
        self.evidence_collected.append((test_name, result))
        
        return self.get_diagnosis()
    
    def get_diagnosis(self):
        """Return ranked hypotheses with confidence."""
        ranked = sorted(
            self.posteriors.items(), 
            key=lambda x: x[1], 
            reverse=True
        )
        
        top_cause, top_prob = ranked[0]
        second_cause, second_prob = ranked[1]
        
        confidence = 'high' if top_prob > 0.6 else (
            'medium' if top_prob > 0.35 else 'low'
        )
        
        return {
            'top_hypothesis': top_cause,
            'probability': top_prob,
            'confidence': confidence,
            'ratio_to_second': top_prob / max(second_prob, 1e-6),
            'all_hypotheses': ranked,
            'evidence_count': len(self.evidence_collected),
        }
    
    def next_best_test(self):
        """
        Select the diagnostic test with highest expected information gain.
        
        Uses entropy reduction: pick the test that minimizes expected
        posterior entropy (equivalently, maximizes mutual information
        between test result and root cause).
        """
        remaining_tests = [
            t for t in self.LIKELIHOODS.keys()
            if t not in [e[0] for e in self.evidence_collected]
        ]
        
        best_test = None
        best_info_gain = -1
        
        current_entropy = self._entropy(self.posteriors)
        
        for test in remaining_tests:
            expected_posterior_entropy = 0
            
            for result in self.LIKELIHOODS[test].keys():
                # P(result) = sum over causes of P(result|cause) * P(cause)
                p_result = sum(
                    self.LIKELIHOODS[test][result].get(c, 0.5) * p
                    for c, p in self.posteriors.items()
                )
                
                # Posterior after observing this result
                temp_posteriors = {}
                for cause, prior in self.posteriors.items():
                    l = self.LIKELIHOODS[test][result].get(cause, 0.5)
                    temp_posteriors[cause] = l * prior
                total = sum(temp_posteriors.values())
                for c in temp_posteriors:
                    temp_posteriors[c] /= max(total, 1e-10)
                
                posterior_entropy = self._entropy(temp_posteriors)
                expected_posterior_entropy += p_result * posterior_entropy
            
            info_gain = current_entropy - expected_posterior_entropy
            if info_gain > best_info_gain:
                best_info_gain = info_gain
                best_test = test
        
        return best_test, best_info_gain
    
    @staticmethod
    def _entropy(distribution):
        return -sum(
            p * math.log2(p + 1e-10) for p in distribution.values()
        )

5.3 Worked Example: The Tuesday Morning Problem

BAYESIAN DIAGNOSIS TRACE

Initial priors:
  sensor_degradation:  0.25
  weather_environment:  0.20
  ota_model_regression: 0.15
  map_staleness:        0.12
  software_bug:         0.10
  operator_behavior:    0.08
  new_aircraft_type:    0.05
  construction_layout:  0.05

Step 1: Check fraction of vehicles affected -> HIGH (12/15 = 80%)
  Updated posteriors:
  weather_environment:  0.37  (+0.17)  <-- now leads
  ota_model_regression: 0.24  (+0.09)
  software_bug:         0.15  (+0.05)
  map_staleness:        0.12  (stable)
  sensor_degradation:   0.06  (-0.19)  <-- eliminated (individual vehicle issue)
  ...
  Confidence: low (0.37)
  Next best test: ota_deployed_recently (info gain: 0.41 bits)

Step 2: Check OTA deployed recently -> YES (v3.2.1, 2 days ago)
  Updated posteriors:
  ota_model_regression: 0.51  (+0.27)  <-- now leads strongly
  software_bug:         0.27  (+0.12)
  weather_environment:  0.08  (-0.29)  <-- weather unlikely given OTA timing
  ...
  Confidence: medium (0.51)
  Next best test: per_class_mAP_drop (info gain: 0.35 bits)

Step 3: Check per-class mAP -> YES (personnel class down 25%)
  Updated posteriors:
  ota_model_regression: 0.68  (+0.17)  <-- high confidence
  software_bug:         0.13  (-0.14)
  new_aircraft_type:    0.06  (stable)
  ...
  Confidence: high (0.68)
  Ratio to second: 5.2x

Step 4: Check METAR change -> YES (fog onset, visibility 800m)
  Updated posteriors:
  ota_model_regression: 0.59  (-0.09)
  weather_environment:  0.22  (+0.14)  <-- weather is contributing factor
  ...
  Note: Both OTA and weather are contributing. Multi-cause scenario.

DIAGNOSIS: OTA model regression on personnel class (primary, P=0.59)
           exacerbated by seasonal fog onset (secondary, P=0.22)

RECOMMENDED ACTIONS:
  1. Immediate: Rollback OTA v3.2.1 to v3.2.0 on Airport B fleet
  2. Short-term: Retrain personnel detection with fog augmentation
  3. Medium-term: Add fog-specific validation gate to CI/CD pipeline

Time to diagnosis: 4 automated tests, < 2 minutes
Manual triage equivalent: 8-16 hours

5.4 Multi-Hypothesis Tracking

When the posterior remains split across two or more hypotheses after all automated tests have run, maintain parallel investigation tracks:

python
class MultiHypothesisTracker:
    """
    When Bayesian diagnosis cannot reach high confidence,
    maintain and investigate multiple hypotheses in parallel.
    
    Trigger: top hypothesis probability < 0.50 after all automated tests
    """
    
    def __init__(self, diagnosis_result, max_hypotheses=3):
        self.hypotheses = [
            (cause, prob) 
            for cause, prob in diagnosis_result['all_hypotheses'][:max_hypotheses]
        ]
        self.investigations = {}
    
    def generate_investigation_plan(self):
        """
        For each hypothesis, generate the specific diagnostic steps
        that would confirm or refute it.
        """
        plans = {
            'sensor_degradation': [
                'Run per-sensor health diagnostic on all affected vehicles',
                'Compare point cloud density across affected vs unaffected vehicles',
                'Check cleaning/maintenance logs for affected vehicles',
            ],
            'ota_model_regression': [
                'Run shadow evaluation of v_old vs v_new on captured rosbags',
                'Compare per-class mAP between canary and control vehicles',
                'Check model training logs for data distribution changes',
            ],
            'weather_environment': [
                'Correlate intervention timestamps with METAR visibility',
                'Check if interventions cluster in specific time-of-day',
                'Compare same vehicles on clear-weather days (control)',
            ],
            'map_staleness': [
                'Compare current SLAM output to reference map for discrepancies',
                'Check AMDB/NOTAM for recent changes at affected zones',
                'Measure localization RMSE in affected vs unaffected zones',
            ],
            'new_aircraft_type': [
                'Cross-reference intervention locations with aircraft stand assignments',
                'Check FIDS for new airline/aircraft type at affected stands',
                'Review false-negative detections for unknown aircraft geometry',
            ],
            'construction_layout': [
                'Query NOTAM feed for active construction at airport',
                'Compare fleet SLAM maps from this week vs last week',
                'Check if interventions cluster in specific geographic zones',
            ],
            'operator_behavior': [
                'Compare intervention rates by operator ID',
                'Check for new operators on shift during anomaly window',
                'Review intervention justification notes for patterns',
            ],
            'software_bug': [
                'Check error logs for new exception types or increased error rates',
                'Review code changes deployed in same OTA as model update',
                'Run affected scenarios in SIL with verbose logging',
            ],
        }
        
        return {
            cause: plans.get(cause, ['Manual investigation required'])
            for cause, prob in self.hypotheses
        }
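
A usage sketch: the `diagnosis_result` dict below mirrors the shape returned by `get_diagnosis()` above, with illustrative probabilities that fall under the 0.50 trigger.

python
# Illustrative only: probabilities assumed for demonstration.
diagnosis_result = {
    'top_hypothesis': 'ota_model_regression',
    'probability': 0.42,                      # below the 0.50 trigger
    'all_hypotheses': [
        ('ota_model_regression', 0.42),
        ('weather_environment', 0.31),
        ('software_bug', 0.12),
    ],
}

if diagnosis_result['probability'] < 0.50:
    tracker = MultiHypothesisTracker(diagnosis_result)
    for cause, plan in tracker.generate_investigation_plan().items():
        print(f"{cause}:")
        for step in plan:
            print(f"  - {step}")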

6. OTA Regression Attribution

6.1 The OTA Regression Problem

Every OTA model update risks regression on specific classes, scenarios, or conditions that were not adequately represented in the validation set. The CI/CD pipeline (see ../../40-runtime-systems/ml-deployment/av-cicd-devops-pipeline.md) catches many regressions pre-deployment, but production-time regressions occur because:

  1. Validation set coverage gap: Offline validation cannot cover all airport x weather x aircraft x time-of-day combinations
  2. Distribution shift: The training distribution may not match the current operational distribution (e.g., new seasonal fog pattern)
  3. Interaction effects: Model performs well on fog OR on new aircraft, but fails on fog AND new aircraft together
  4. Calibration shift: Model accuracy is unchanged, but confidence calibration drifts, causing different downstream decisions (a minimal calibration check is sketched below)
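
Calibration shift (cause 4) is the easiest to miss because accuracy metrics stay flat. A minimal expected-calibration-error (ECE) sketch, under the assumption that per-detection confidences and match/miss labels are logged for both model versions; the synthetic arrays below are illustrative only:

python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=10):
    """Binned ECE: bin-mass-weighted |mean confidence - empirical accuracy|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    bins = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(bins[:-1], bins[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            ece += mask.mean() * abs(confidences[mask].mean() - correct[mask].mean())
    return ece

# Synthetic illustration: identical accuracy, shifted confidence.
rng = np.random.default_rng(42)
labels = rng.random(5000) < 0.80                     # 80% of detections are correct
ece_old = expected_calibration_error(np.full(5000, 0.80), labels)  # well calibrated
ece_new = expected_calibration_error(np.full(5000, 0.95), labels)  # overconfident
print(f"ECE old={ece_old:.3f}, new={ece_new:.3f}")   # ~0.15 drift despite equal accuracy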

6.2 A/B Testing Between OTA Versions

The gold standard for OTA regression detection is a controlled A/B test within the fleet.

OTA A/B TEST ARCHITECTURE

Airport B fleet (15 vehicles):

  Control group (5 vehicles):     Treatment group (10 vehicles):
  OTA v3.2.0 (previous)           OTA v3.2.1 (new)
  Vehicle IDs: 1, 4, 7, 11, 14   Vehicle IDs: 2, 3, 5, 6, 8, 9, 10, 12, 13, 15
  
  Assignment: stratified random
  - Same route distribution between groups
  - Same shift/operator distribution
  - Same vehicle model mix
  
  Duration: 7-14 days (minimum for statistical power)
  
  Primary metric: intervention_rate (events per 100 km)
  Secondary metrics: per_class_mAP, localization_cov, mission_success_rate
  
  Analysis: two-sample t-test (parametric) or Mann-Whitney U (non-parametric)
  Power: 80% to detect 50% intervention rate increase
  Required sample size: ~200 missions per group (achievable in 3-5 days)
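
The ~200-missions-per-group figure can be sanity-checked with a standard power calculation. A sketch using statsmodels, assuming the 50% relative intervention-rate increase corresponds to a standardized effect size (Cohen's d) of roughly 0.28 at mission level -- an assumption that should be re-derived from your own fleet's mission-to-mission variance:

python
from statsmodels.stats.power import TTestIndPower

# Assumed effect size for a 50% relative intervention-rate increase,
# given mission-level variance. Tune to your fleet's data.
effect_size = 0.28

n_per_group = TTestIndPower().solve_power(
    effect_size=effect_size,
    alpha=0.05,
    power=0.80,
    alternative='two-sided',
)
print(f"Required missions per group: {n_per_group:.0f}")  # ~200 for d = 0.28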

Automated regression detection from A/B results:

python
from scipy import stats

class OTARegressionDetector:
    """
    Automated detection of OTA model regression from fleet A/B test.
    """
    
    def __init__(self, significance_level=0.05, effect_size_threshold=0.20):
        self.alpha = significance_level
        self.min_effect = effect_size_threshold  # 20% relative change
    
    def detect_regression(self, control_metrics, treatment_metrics):
        """
        Compare control (old OTA) vs treatment (new OTA) metrics.
        
        Returns: regression detected (bool), details (dict)
        """
        results = {}
        
        for metric_name in control_metrics.columns:
            control = control_metrics[metric_name].dropna()
            treatment = treatment_metrics[metric_name].dropna()
            
            # Mann-Whitney U test (non-parametric, handles non-normal distributions)
            stat, p_value = stats.mannwhitneyu(
                control, treatment, alternative='two-sided'
            )
            
            # Effect size (rank-biserial correlation)
            n1, n2 = len(control), len(treatment)
            effect_size = 1 - (2 * stat) / (n1 * n2)
            
            # Relative change
            relative_change = (treatment.mean() - control.mean()) / (
                control.mean() + 1e-10
            )
            
            is_regression = (
                p_value < self.alpha and
                abs(relative_change) > self.min_effect and
                self._is_degradation(metric_name, relative_change)
            )
            
            results[metric_name] = {
                'p_value': p_value,
                'effect_size': effect_size,
                'relative_change': relative_change,
                'control_mean': control.mean(),
                'treatment_mean': treatment.mean(),
                'is_regression': is_regression,
            }
        
        any_regression = any(r['is_regression'] for r in results.values())
        return any_regression, results
    
    @staticmethod
    def _is_degradation(metric_name, relative_change):
        """Determine if change direction is a regression."""
        # Higher is worse for these metrics
        worse_if_higher = [
            'intervention_rate', 'localization_cov', 'planning_jerk',
            'control_tracking_error', 'compute_latency',
        ]
        # Higher is better for these metrics
        worse_if_lower = [
            'perception_mAP', 'mission_success_rate', 'sensor_health',
        ]
        
        if metric_name in worse_if_higher:
            return relative_change > 0
        elif metric_name in worse_if_lower:
            return relative_change < 0
        return False
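
A usage sketch with hypothetical per-mission metrics in pandas DataFrames (one row per mission, one column per metric; the numbers are illustrative):

python
import pandas as pd

# Hypothetical A/B metrics, one row per mission.
control_metrics = pd.DataFrame({
    'intervention_rate': [0.8, 1.1, 0.9, 0.7, 1.0] * 40,     # v3.2.0
    'perception_mAP':    [0.74, 0.72, 0.75, 0.73, 0.74] * 40,
})
treatment_metrics = pd.DataFrame({
    'intervention_rate': [1.4, 1.6, 1.2, 1.5, 1.3] * 40,     # v3.2.1
    'perception_mAP':    [0.73, 0.71, 0.74, 0.72, 0.73] * 40,
})

detector = OTARegressionDetector()
regression, details = detector.detect_regression(control_metrics, treatment_metrics)
if regression:
    for metric, r in details.items():
        if r['is_regression']:
            print(f"REGRESSION {metric}: {r['relative_change']:+.1%} (p={r['p_value']:.4f})")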

6.3 Per-Class and Per-Scenario Regression Detection

Overall mAP may improve while a specific class regresses -- the "accuracy paradox" for safety-critical systems.

python
class PerClassRegressionDetector:
    """
    Detect per-class regression that may be masked by overall improvement.
    
    Example: OTA v3.2.1 improves aircraft detection +3% but regresses
    personnel detection -8%. Overall mAP: +1%. But personnel regression
    is safety-critical and must be caught.
    """
    
    SAFETY_CRITICAL_CLASSES = {
        'personnel': {'min_mAP': 0.60, 'max_regression': 0.03},
        'aircraft':  {'min_mAP': 0.70, 'max_regression': 0.05},
        'fod':       {'min_mAP': 0.50, 'max_regression': 0.02},
    }
    
    def check_per_class(self, old_model_eval, new_model_eval):
        """
        Compare per-class metrics between old and new model.
        
        old_model_eval / new_model_eval: dict[class_name -> mAP]
        """
        regressions = []
        
        for cls, thresholds in self.SAFETY_CRITICAL_CLASSES.items():
            old_mAP = old_model_eval.get(cls, 0)
            new_mAP = new_model_eval.get(cls, 0)
            
            delta = new_mAP - old_mAP
            
            if delta < -thresholds['max_regression']:
                regressions.append({
                    'class': cls,
                    'old_mAP': old_mAP,
                    'new_mAP': new_mAP,
                    'regression': abs(delta),
                    'severity': 'critical' if cls in ('personnel', 'fod') else 'high',
                    'below_minimum': new_mAP < thresholds['min_mAP'],
                })
            
            # Report the absolute-minimum violation separately only when no
            # relative regression was recorded above (that record already
            # carries the below_minimum flag).
            elif new_mAP < thresholds['min_mAP']:
                regressions.append({
                    'class': cls,
                    'new_mAP': new_mAP,
                    'minimum': thresholds['min_mAP'],
                    'severity': 'critical',
                    'reason': 'below_absolute_minimum',
                })
        
        return regressions

6.4 Counterfactual OTA Analysis

For each intervention that occurred after an OTA update, ask: "Would this intervention have occurred on the previous model version?"

python
class CounterfactualOTAAnalysis:
    """
    Counterfactual analysis: replay intervention events through
    the previous model version to determine if OTA caused them.
    
    Requires: rosbag capture from intervention events (standard trigger
    in data-flywheel-airside.md) + previous model version available
    in artifact registry.
    """
    
    def analyze_intervention(self, rosbag_path, old_model, new_model):
        """
        Replay intervention rosbag through both model versions.
        
        Returns: counterfactual result
          - 'ota_caused': intervention would NOT have occurred with old model
          - 'ota_independent': intervention WOULD have occurred with old model
          - 'ota_exacerbated': intervention occurred with both, but new model
                               responded later/worse
        """
        # Extract sensor data from rosbag
        sensor_data = extract_sensor_frames(rosbag_path)
        
        # Run old model on same sensor data
        old_detections = old_model.run_batch(sensor_data)
        old_planning = planning_sim(old_detections)
        old_would_intervene = check_intervention_needed(old_planning)
        
        # Run new model on same sensor data
        new_detections = new_model.run_batch(sensor_data)
        new_planning = planning_sim(new_detections)
        new_would_intervene = check_intervention_needed(new_planning)
        
        if new_would_intervene and not old_would_intervene:
            return 'ota_caused'
        elif new_would_intervene and old_would_intervene:
            # Check if new model was worse (later detection, smaller margin)
            old_margin = compute_safety_margin(old_planning)
            new_margin = compute_safety_margin(new_planning)
            if new_margin < old_margin * 0.8:
                return 'ota_exacerbated'
            return 'ota_independent'
        else:
            return 'ota_independent'

6.5 Automatic Rollback Triggers

OTA ROLLBACK DECISION MATRIX

| Per-class regression? | Safety-critical class? | Overall mAP impact | Action |
|---|---|---|---|
| Yes | Yes (personnel/FOD) | Any | ROLLBACK immediately |
| Yes | No (GSE, other) | Drop > 5% | ROLLBACK within 24h |
| Yes | No | Drop <= 5% | MONITOR for 72h |
| No | N/A | Drop > 3% | ROLLBACK within 24h |
| No | N/A | Drop <= 3% | MONITOR for 7 days |

Rollback procedure:
  1. Stop canary expansion (halt OTA to remaining vehicles)
  2. Revert treatment group to control version (OTA v_old)
  3. Capture all rosbags from anomalous period for analysis
  4. Create Jira ticket with counterfactual analysis results
  5. Block v_new from re-deployment until regression fixed
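
A minimal sketch of the decision matrix as code, with the thresholds from the table above (function name and severity labels are assumptions):

python
def rollback_decision(per_class_regression, safety_critical, overall_map_drop):
    """
    per_class_regression: bool -- any per-class regression detected
    safety_critical: bool -- the regressed class is personnel or FOD
    overall_map_drop: float -- overall mAP drop as a positive fraction (0.04 = 4%)
    """
    if per_class_regression and safety_critical:
        return 'ROLLBACK_IMMEDIATELY'
    if per_class_regression:
        return 'ROLLBACK_24H' if overall_map_drop > 0.05 else 'MONITOR_72H'
    return 'ROLLBACK_24H' if overall_map_drop > 0.03 else 'MONITOR_7D'

# Example: personnel class regressed -> immediate rollback regardless of overall mAP.
assert rollback_decision(True, True, 0.0) == 'ROLLBACK_IMMEDIATELY'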

7. Map Staleness Attribution

7.1 Map Staleness Detection

Map staleness manifests as increased localization error, path deviations, and unexpected obstacles. The fleet itself is the best sensor for map changes (see hd-map-change-detection-maintenance.md for the maintenance pipeline). Here we focus on attributing anomalies to map staleness.

python
import numpy as np

class MapStalenessAttributor:
    """
    Attributes fleet anomalies to map staleness by correlating
    localization quality with map age and spatial patterns.
    """
    
    def compute_staleness_score(self, fleet_telemetry, map_metadata):
        """
        Score the likelihood that map staleness is causing the anomaly.
        
        Signals:
          1. Localization covariance increase correlated with map age
          2. Spatial clustering of localization errors
          3. GTSAM iteration count increase (harder to converge)
          4. AMDB/NOTAM changes since last map update
          5. Fleet SLAM discrepancies vs reference map
        """
        signals = {}
        
        # Signal 1: Map age vs localization quality correlation
        map_ages = fleet_telemetry['map_age_hours']
        loc_quality = fleet_telemetry['localization_cov_trace']
        correlation = np.corrcoef(map_ages, loc_quality)[0, 1]
        signals['age_loc_correlation'] = correlation
        
        # Signal 2: Spatial clustering
        error_locations = fleet_telemetry[
            fleet_telemetry['localization_cov_trace'] > 
            fleet_telemetry['localization_cov_trace'].quantile(0.9)
        ][['x', 'y']]
        
        if len(error_locations) > 10:
            from sklearn.cluster import DBSCAN
            clusters = DBSCAN(eps=50, min_samples=5).fit(error_locations)
            n_clusters = len(set(clusters.labels_)) - (1 if -1 in clusters.labels_ else 0)
            signals['spatial_clusters'] = n_clusters  # > 0 suggests localized map issue
        
        # Signal 3: GTSAM convergence difficulty
        gtsam_iters = fleet_telemetry['gtsam_iteration_count']
        signals['gtsam_iter_increase'] = (
            gtsam_iters.tail(100).mean() / gtsam_iters.head(100).mean()
        )
        
        # Signal 4: NOTAM changes since the last map update.
        # (query_notams is assumed to wrap the airport's NOTAM feed client.)
        active_notams = self.query_notams(map_metadata['airport_id'])
        notams_since_map = [
            n for n in active_notams
            if n['effective_date'] > map_metadata['last_update']
        ]
        signals['notam_changes'] = len(notams_since_map)
        
        # Composite score
        score = (
            0.30 * min(max(correlation, 0), 1.0) +
            0.25 * min(signals.get('spatial_clusters', 0) / 3.0, 1.0) +
            0.20 * min(max(signals['gtsam_iter_increase'] - 1.0, 0) / 2.0, 1.0) +
            0.25 * min(signals['notam_changes'] / 5.0, 1.0)
        )
        
        return score, signals

7.2 Construction Activity Detection

The fleet detects construction and layout changes from perception data before the map is updated.

CONSTRUCTION DETECTION PIPELINE

  Fleet SLAM maps (daily aggregate)
                 │
                 ▼
  Difference detection vs reference map
  ┌────────────────────────────────────────┐
  │ For each 10m x 10m tile:               │
  │   diff = |current_occupancy - ref_map| │
  │   if diff > threshold for 3+ days:     │
  │     → flag as potential construction    │
  └──────────────┬─────────────────────────┘
                 │
                 ▼
  Cross-reference with NOTAM feed
  ┌────────────────────────────────────────┐
  │ Active construction NOTAM in zone?     │
  │   YES → confirmed construction         │
  │   NO  → possible unreported change     │
  │         → escalate to operations       │
  └──────────────┬─────────────────────────┘
                 │
                 ▼
  Impact assessment
  ┌────────────────────────────────────────┐
  │ Does construction zone overlap with:   │
  │   - Active routes? → reroute           │
  │   - Map features? → trigger map update │
  │   - Localization landmarks? → alert    │
  └────────────────────────────────────────┘

7.3 Automatic Map Refresh Triggers

| Trigger | Threshold | Action | Priority |
|---|---|---|---|
| Localization RMSE in zone > 0.5 m for > 4 hours | 0.5 m sustained | Schedule map survey of zone | P2 |
| GTSAM iteration count increase > 50% | 50% over baseline | Schedule map survey | P3 |
| NOTAM issued for construction in mapped area | Any relevant NOTAM | Assess impact, schedule if needed | P2 |
| Map age > 90 days with no validation | 90 days | Mandatory resurvey or fleet validation | P3 |
| Fleet SLAM detects > 10 new static objects in zone | 10+ new objects | Trigger incremental map update | P2 |
| 3+ vehicles report path deviation in same zone | 3 vehicles | Emergency map investigation | P1 |
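
A sketch of the trigger table as an evaluation function; the `zone_stats` field names are assumptions about what the telemetry aggregation exposes:

python
def map_refresh_triggers(zone_stats):
    """Evaluate the refresh triggers above for one zone; returns (priority, action) pairs."""
    fired = []
    if zone_stats.get('loc_rmse_m', 0) > 0.5 and zone_stats.get('rmse_hours', 0) > 4:
        fired.append(('P2', 'Schedule map survey of zone'))
    if zone_stats.get('gtsam_iter_increase', 1.0) > 1.5:
        fired.append(('P3', 'Schedule map survey'))
    if zone_stats.get('relevant_notams', 0) > 0:
        fired.append(('P2', 'Assess impact, schedule if needed'))
    if zone_stats.get('map_age_days', 0) > 90 and not zone_stats.get('validated', False):
        fired.append(('P3', 'Mandatory resurvey or fleet validation'))
    if zone_stats.get('new_static_objects', 0) >= 10:
        fired.append(('P2', 'Trigger incremental map update'))
    if zone_stats.get('vehicles_with_path_deviation', 0) >= 3:
        fired.append(('P1', 'Emergency map investigation'))
    return sorted(fired)  # P1 sorts first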

8. Environmental Attribution

8.1 METAR Correlation Engine

METAR (Meteorological Aerodrome Report) is the standard aviation weather report, available every 30-60 minutes for every airport with IFR operations. Correlating fleet performance with METAR is the fastest environmental attribution method.

python
from scipy import stats

class METARCorrelationEngine:
    """
    Correlate fleet KPIs with METAR weather observations.
    
    METAR fields relevant to GSE operations:
      - visibility_m: prevailing visibility in meters
      - ceiling_ft: cloud base height (affects lighting)
      - temp_c: temperature (affects sensor performance, tarmac)
      - dewpoint_c: dew point (fog likelihood when temp-dewpoint < 3C)
      - wind_speed_kt: wind speed (affects jet blast, FOD movement)
      - wind_gust_kt: gust speed
      - precip_type: RA (rain), SN (snow), FG (fog), BR (mist), HZ (haze)
      - precip_intensity: '-' = light, no prefix = moderate, '+' = heavy
    """
    
    def correlate(self, fleet_kpis, metar_history, window_hours=24):
        """
        Compute time-lagged correlations between METAR conditions
        and fleet performance metrics.
        """
        results = {}
        
        metar_features = [
            'visibility_m', 'temp_c', 'dewpoint_c', 'wind_speed_kt',
            'precip_intensity_encoded',  # 0=none, 1=light, 2=moderate, 3=heavy
        ]
        
        kpi_features = [
            'intervention_rate', 'perception_mAP', 'sensor_health_min',
            'localization_cov_trace',
        ]
        
        for metar_feat in metar_features:
            for kpi_feat in kpi_features:
                # Test multiple lags (0, 1, 2, 4, 8, 12, 24 hours)
                for lag in [0, 1, 2, 4, 8, 12, 24]:
                    metar_lagged = metar_history[metar_feat].shift(lag)
                    kpi_aligned = fleet_kpis[kpi_feat]
                    
                    valid = metar_lagged.notna() & kpi_aligned.notna()
                    if valid.sum() < 30:
                        continue
                    
                    r, p = stats.pearsonr(
                        metar_lagged[valid], kpi_aligned[valid]
                    )
                    
                    if p < 0.05 and abs(r) > 0.3:
                        key = (metar_feat, kpi_feat, lag)
                        results[key] = {'r': r, 'p': p, 'lag_hours': lag}
        
        return results
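
With 5 METAR features x 4 KPIs x 7 lags, the engine runs up to 140 hypothesis tests, so some p < 0.05 hits are expected by chance alone. A sketch of Benjamini-Hochberg FDR control over the collected results (the 10% FDR level is an assumption to tune):

python
from statsmodels.stats.multitest import multipletests

def fdr_filter(correlation_results, fdr=0.10):
    """Keep only METAR-KPI correlations that survive Benjamini-Hochberg correction."""
    keys = list(correlation_results.keys())
    p_values = [correlation_results[k]['p'] for k in keys]
    reject, p_adjusted, _, _ = multipletests(p_values, alpha=fdr, method='fdr_bh')
    return {
        k: {**correlation_results[k], 'p_adjusted': p_adj}
        for k, keep, p_adj in zip(keys, reject, p_adjusted)
        if keep
    }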

8.2 Seasonal Drift Detection

Airside operations experience strong seasonal patterns that must be separated from anomalies.

SEASONAL PATTERNS IN AIRSIDE FLEET PERFORMANCE

     Jan  Feb  Mar  Apr  May  Jun  Jul  Aug  Sep  Oct  Nov  Dec
     ────────────────────────────────────────────────────────────
mAP: Low  Low  Med  Med  High High Med  High High Med  Med  Low
     (fog, de-icing)              (heat shimmer)    (fog onset)
     
Int: High High Med  Med  Low  Med  Med  Low  Low  Med  Med  High
rate (vis, ice)                (thermal)           (fog, early dark)

Sensor
health: Low  Low  Med  High High Med  Med  High High Med  Low  Low
        (glycol, salt)        (heat stress)       (debris, ice)

Key seasonal transitions (highest risk of misattribution):
  - Oct: Fog onset -- first foggy day of season often causes spike
         that looks like OTA regression if OTA was deployed in Sep
  - Nov-Dec: De-icing onset -- sensor degradation accelerates
  - Mar-Apr: Pollen season -- gradual sensor contamination
  - Jun-Jul: Heat shimmer -- LiDAR range reduction on hot tarmac

Seasonal baseline adjustment:

python
class SeasonalBaselineAdjuster:
    """
    Adjust fleet baselines for seasonal patterns to prevent
    seasonal effects from being misattributed as anomalies.
    
    Requires: 12+ months of fleet data for robust seasonal estimation.
    For new airports: use data from similar airports (same climate zone)
    as initial seasonal prior, update with local data.
    """
    
    def __init__(self, historical_data, period='weekly'):
        """
        Fit seasonal decomposition on historical fleet data.
        
        Uses STL (Seasonal and Trend decomposition using Loess)
        for robust decomposition that handles outliers.
        """
        from statsmodels.tsa.seasonal import STL
        
        stl_period = 52 if period == 'weekly' else 365
        # The trend window must be an odd integer larger than the period.
        trend_window = stl_period + 1 + (stl_period % 2)
        
        self.decompositions = {}
        for metric in historical_data.columns:
            stl = STL(
                historical_data[metric],
                period=stl_period,
                seasonal=13,          # seasonal smoother window (odd)
                trend=trend_window,   # 53 for weekly data, 367 for daily
                robust=True,          # robust to outliers
            )
            self.decompositions[metric] = stl.fit()
    
    def get_adjusted_baseline(self, metric, date):
        """Return seasonally-adjusted baseline for given metric and date."""
        decomp = self.decompositions[metric]
        seasonal_component = decomp.seasonal[date]
        trend_component = decomp.trend[date]
        return trend_component + seasonal_component
    
    def is_anomalous(self, metric, date, observed_value, n_sigma=3):
        """
        Check if observed value is anomalous relative to
        seasonally-adjusted baseline.
        """
        baseline = self.get_adjusted_baseline(metric, date)
        residual_std = self.decompositions[metric].resid.std()
        
        z_score = (observed_value - baseline) / residual_std
        return abs(z_score) > n_sigma, z_score
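
A usage sketch on synthetic data; the sinusoidal weekly series with a winter peak is illustrative only:

python
import numpy as np
import pandas as pd

# Two years of weekly fleet intervention-rate data with a seasonal winter peak.
idx = pd.date_range('2024-01-01', periods=104, freq='W')
rng = np.random.default_rng(0)
series = 1.0 + 0.4 * np.cos(2 * np.pi * np.arange(104) / 52) + rng.normal(0, 0.05, 104)
history = pd.DataFrame({'intervention_rate': series}, index=idx)

adjuster = SeasonalBaselineAdjuster(history, period='weekly')

# A 1.3 events/100 km reading in midwinter may be seasonal; the same reading
# in midsummer (week 78, seasonal trough) is flagged as anomalous.
is_anom, z = adjuster.is_anomalous('intervention_rate', idx[78], observed_value=1.3)
print(f"anomalous={is_anom}, z={z:.1f}")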

8.3 Time-of-Day Effects

Several airside phenomena have strong diurnal patterns:

| Time Window | Environmental Effect | Impact on AV Operations |
|---|---|---|
| 04:00-07:00 | Dawn fog, dew, thermal inversion | LiDAR range reduction 20-40%, GPS multipath |
| 07:00-09:00 | Morning rush, de-icing (winter) | High traffic density, sensor contamination |
| 10:00-14:00 | Solar heating, heat shimmer | Tarmac heat shimmer, thermal camera saturation |
| 14:00-17:00 | Afternoon thunderstorms (summer) | Sudden visibility drops, standing water |
| 17:00-20:00 | Evening rush, low sun angle | Glare on cameras, long shadows in LiDAR |
| 20:00-04:00 | Night operations | No natural lighting, thermal-primary perception |

8.4 Airport-Specific Environmental Signatures

Each airport has unique environmental characteristics that create distinct anomaly patterns:

python
AIRPORT_ENVIRONMENTAL_PROFILES = {
    'LHR': {
        'primary_hazards': ['fog', 'rain', 'de-icing_Nov-Mar'],
        'fog_frequency': 0.15,  # fraction of operating hours
        'de_icing_months': [11, 12, 1, 2, 3],
        'heat_stress_rare': True,
        'expected_mAP_reduction_fog': 0.15,  # 15% mAP drop in fog is normal
    },
    'DXB': {
        'primary_hazards': ['heat_shimmer', 'sand', 'humidity'],
        'heat_months': [5, 6, 7, 8, 9],
        'max_tarmac_temp_c': 65,
        'sand_storm_frequency': 0.02,
        'expected_thermal_throttling': True,  # Orin throttles in summer
    },
    'NRT': {
        'primary_hazards': ['typhoon_season', 'humidity', 'snow'],
        'typhoon_months': [8, 9, 10],
        'snow_months': [12, 1, 2],
        'expected_mAP_reduction_rain': 0.10,
    },
}
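
A sketch showing how these profiles feed attribution: weather-conditioned expected baselines prevent a normal fog-day mAP drop at LHR from being flagged as a model regression. The function name and the `conditions` set are assumptions:

python
def expected_mAP_baseline(airport_id, clear_weather_mAP, conditions):
    """Adjust the clear-weather mAP baseline for airport-normal weather effects."""
    profile = AIRPORT_ENVIRONMENTAL_PROFILES.get(airport_id, {})
    baseline = clear_weather_mAP
    if 'fog' in conditions:
        baseline *= 1.0 - profile.get('expected_mAP_reduction_fog', 0.0)
    if 'rain' in conditions:
        baseline *= 1.0 - profile.get('expected_mAP_reduction_rain', 0.0)
    return baseline

# LHR in fog: 0.75 clear-weather mAP -> ~0.64 expected. An observed 0.63 is
# weather-normal; the same 0.63 on a clear day would suggest regression.
print(expected_mAP_baseline('LHR', 0.75, conditions={'fog'}))  # ~0.6375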

9. Implementation Architecture

9.1 System Architecture

FLEET ANOMALY ATTRIBUTION ARCHITECTURE

┌──────────────────────────────────────────────────────────────────────────┐
│                                                                          │
│  VEHICLE TIER (On-Vehicle, Orin)                                        │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ Telemetry Emitter (1 Hz)                                        │    │
│  │ - Perception metrics (mAP, per-class, latency)                  │    │
│  │ - Localization metrics (cov trace, GTSAM iters, RMSE)          │    │
│  │ - Sensor health scores (per-sensor, per-check)                  │    │
│  │ - Planning metrics (jerk, clearance, candidate count)           │    │
│  │ - Control metrics (tracking error, actuator status)             │    │
│  │ - System metrics (GPU util, temp, memory, latency)              │    │
│  │ - OTA/map version identifiers                                    │    │
│  │ Format: Protobuf -> 5G/WiFi -> Kafka                           │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  STREAMING TIER (Airport Edge or Cloud)                                 │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ Apache Kafka                                                     │    │
│  │ Topics: telemetry.{airport}.{vehicle}, alerts.{tier},           │    │
│  │         interventions.{airport}, metar.{airport}                │    │
│  │ Retention: 7 days (hot), 90 days (cold via S3 sink)             │    │
│  └────────────────────┬────────────────────────────────────────────┘    │
│                       │                                                  │
│  ┌────────────────────▼────────────────────────────────────────────┐    │
│  │ Apache Flink (Stream Processing)                                 │    │
│  │                                                                   │    │
│  │ Job 1: Per-vehicle CUSUM/EWMA (1-min windows)                   │    │
│  │   - Update SPC charts for each vehicle x metric                  │    │
│  │   - Emit to alerts.p3 on single-vehicle alarm                   │    │
│  │                                                                   │    │
│  │ Job 2: Per-airport aggregation (15-min windows)                  │    │
│  │   - Aggregate vehicle metrics to airport level                   │    │
│  │   - CUSUM on airport-level KPIs                                  │    │
│  │   - Emit to alerts.p2 on airport alarm                          │    │
│  │                                                                   │    │
│  │ Job 3: Fleet-wide aggregation (1-hour windows)                  │    │
│  │   - Isolation Forest scoring on fleet snapshot                   │    │
│  │   - EWMA on fleet-wide intervention rate                        │    │
│  │   - Emit to alerts.p1 on fleet alarm                            │    │
│  │                                                                   │    │
│  │ Job 4: METAR enrichment                                          │    │
│  │   - Join telemetry with METAR feed                               │    │
│  │   - Compute environmental context features                       │    │
│  │                                                                   │    │
│  │ Job 5: Alert suppression and deduplication                       │    │
│  │   - Apply suppression rules (Section 2.5)                       │    │
│  │   - Route to appropriate tier                                    │    │
│  └────────────────────┬────────────────────────────────────────────┘    │
│                       │                                                  │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  ATTRIBUTION TIER (Cloud)                                               │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ Attribution Engine (triggered by P1/P2 alerts)                   │    │
│  │                                                                   │    │
│  │ Step 1: Hierarchical decomposition                               │    │
│  │   Fleet -> Airport -> Vehicle -> Subsystem -> Component          │    │
│  │                                                                   │    │
│  │ Step 2: Shapley attribution                                      │    │
│  │   Per-feature SHAP values for anomalous observations             │    │
│  │   Airport-level and vehicle-level Shapley decomposition          │    │
│  │                                                                   │    │
│  │ Step 3: Bayesian diagnosis                                       │    │
│  │   Sequential evidence gathering                                   │    │
│  │   Prior update from fleet history                                │    │
│  │   Multi-hypothesis tracking if ambiguous                         │    │
│  │                                                                   │    │
│  │ Step 4: Context enrichment                                       │    │
│  │   Cross-reference with OTA deployment log                        │    │
│  │   Cross-reference with METAR history                             │    │
│  │   Cross-reference with NOTAM feed                                │    │
│  │   Cross-reference with FIDS (flight schedule / aircraft types)   │    │
│  │                                                                   │    │
│  │ Step 5: Report generation                                        │    │
│  │   Root cause hypothesis with confidence                          │    │
│  │   Recommended actions                                             │    │
│  │   Evidence trail for audit                                        │    │
│  └────────────────────┬────────────────────────────────────────────┘    │
│                       │                                                  │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  BATCH TIER (Cloud, Airflow)                                            │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ DAG: daily_fleet_analysis (runs 02:00 UTC)                      │    │
│  │   Task 1: Aggregate 24h telemetry from S3                       │    │
│  │   Task 2: Run Isolation Forest + autoencoder on daily snapshot   │    │
│  │   Task 3: Update seasonal baselines (STL decomposition)         │    │
│  │   Task 4: Granger causality analysis (90-day rolling window)    │    │
│  │   Task 5: Generate daily fleet health report                    │    │
│  │                                                                   │    │
│  │ DAG: quarterly_causal_discovery (runs quarterly)                 │    │
│  │   Task 1: Run NOTEARS on 90-day fleet telemetry                 │    │
│  │   Task 2: Run PC algorithm for validation                       │    │
│  │   Task 3: Compare discovered DAG with prior DAG                 │    │
│  │   Task 4: Generate causal structure report for human review      │    │
│  │                                                                   │    │
│  │ DAG: ota_regression_analysis (triggered by OTA deployment)       │    │
│  │   Task 1: Define A/B groups                                      │    │
│  │   Task 2: Collect 7-day metrics from control + treatment         │    │
│  │   Task 3: Run regression detection tests                        │    │
│  │   Task 4: Counterfactual replay on captured rosbags              │    │
│  │   Task 5: Generate OTA regression report                        │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
├──────────────────────────────────────────────────────────────────────────┤
│                                                                          │
│  PRESENTATION TIER                                                      │
│  ┌─────────────────────────────────────────────────────────────────┐    │
│  │ Grafana Dashboard                                                │    │
│  │   Panel 1: Fleet KPI overview (SPC charts per level)            │    │
│  │   Panel 2: Active anomalies with attribution status             │    │
│  │   Panel 3: Shapley waterfall chart (feature contributions)      │    │
│  │   Panel 4: Bayesian diagnosis progress (posterior evolution)    │    │
│  │   Panel 5: OTA A/B comparison (treatment vs control)            │    │
│  │   Panel 6: Map staleness heat map (per zone per airport)        │    │
│  │   Panel 7: Environmental correlation matrix                     │    │
│  │                                                                   │    │
│  │ Slack Integration                                                │    │
│  │   P0: @channel with one-click acknowledge                       │    │
│  │   P1: Attribution summary with drill-down link                  │    │
│  │   P2: Daily digest thread                                       │    │
│  │                                                                   │    │
│  │ Jira Integration                                                 │    │
│  │   Auto-create ticket for P0/P1 with attribution evidence        │    │
│  │   Link to related rosbags, SHAP plots, Bayesian trace           │    │
│  └─────────────────────────────────────────────────────────────────┘    │
│                                                                          │
└──────────────────────────────────────────────────────────────────────────┘
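
A sketch of topic provisioning for the streaming tier, matching the topic naming and 7-day hot retention in the diagram. It assumes the confluent-kafka Python client and a placeholder broker address:

python
from confluent_kafka.admin import AdminClient, NewTopic

# Placeholder broker address -- substitute your edge/cloud Kafka endpoint.
admin = AdminClient({'bootstrap.servers': 'kafka.fleet.internal:9092'})

SEVEN_DAYS_MS = str(7 * 24 * 3600 * 1000)

topics = [
    # telemetry.{airport}.{vehicle}: one example instantiation shown.
    NewTopic('telemetry.airport-b.vehicle-07', num_partitions=1,
             replication_factor=3, config={'retention.ms': SEVEN_DAYS_MS}),
    NewTopic('interventions.airport-b', num_partitions=4,
             replication_factor=3, config={'retention.ms': SEVEN_DAYS_MS}),
    NewTopic('alerts.p1', num_partitions=1,
             replication_factor=3, config={'retention.ms': SEVEN_DAYS_MS}),
]

# create_topics() is asynchronous; block on each future to surface errors.
for topic, future in admin.create_topics(topics).items():
    future.result()
    print(f"created {topic}")

The 90-day cold retention in the diagram is handled by the S3 sink, not topic configuration.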

9.2 Data Flow Summary

| Data Source | Volume | Latency | Storage | Retention |
|---|---|---|---|---|
| Vehicle telemetry (1 Hz) | 500 KB/vehicle/hour | < 5 sec | Kafka -> S3 | 7 days hot, 90 days cold |
| Interventions | 1-5 events/vehicle/day | < 1 sec | Kafka -> PostgreSQL | Permanent |
| METAR | 1 report/30 min/airport | < 5 min | PostgreSQL | Permanent |
| NOTAM | As issued | < 15 min | PostgreSQL | Permanent |
| Sensor health | 1 Hz (aggregated to 1 min) | < 30 sec | Kafka -> S3 | 30 days |
| OTA deployment log | Per deployment | < 1 min | PostgreSQL | Permanent |
| Map metadata | Per map update | < 1 min | PostgreSQL | Permanent |
| FIDS/A-CDM | Per flight event | < 5 min | PostgreSQL | 1 year |
| Attribution results | Per anomaly | < 5 min | PostgreSQL + S3 | Permanent |

9.3 Integration Points

INTEGRATION WITH EXISTING REFERENCE AIRSIDE AV STACK SYSTEMS

┌────────────────────────────────────────────────────────────────┐
│ UPSTREAM (data sources)                                        │
│                                                                 │
│ sensor-degradation-health-monitoring.md                         │
│   → Provides: per-sensor health scores, degradation type        │
│   → Integration: Flink reads from health_monitor Kafka topic    │
│                                                                 │
│ runtime-verification-monitoring.md                              │
│   → Provides: STL violation events, OOD scores                  │
│   → Integration: Flink reads from stl_monitor Kafka topic       │
│                                                                 │
│ cloud-backend-infrastructure.md                                 │
│   → Provides: data lake, Airflow, telemetry pipeline            │
│   → Integration: Attribution is an additional Airflow DAG       │
│                   and Flink job set on existing infrastructure   │
│                                                                 │
│ data-flywheel-airside.md                                        │
│   → Provides: trigger events, model eval metrics                │
│   → Integration: Attribution feeds edge-case mining priorities   │
│                                                                 │
│ ../fleet-management/fleet-predictive-maintenance.md                                 │
│   → Provides: component RUL, failure predictions                │
│   → Integration: PHM signals are features in anomaly model      │
├────────────────────────────────────────────────────────────────┤
│ DOWNSTREAM (consumers)                                          │
│                                                                 │
│ ../ota/ota-fleet-management.md                                         │
│   ← Consumes: OTA regression alerts, rollback triggers          │
│                                                                 │
│ hd-map-change-detection-maintenance.md                          │
│   ← Consumes: map staleness alerts, refresh triggers            │
│                                                                 │
│ ../../40-runtime-systems/monitoring-observability/hmi-operator-interface.md                                       │
│   ← Consumes: attribution summaries for operator dashboard      │
│                                                                 │
│ ../../40-runtime-systems/ml-deployment/av-cicd-devops-pipeline.md                                      │
│   ← Consumes: per-class regression signals for CI gates         │
│                                                                 │
│ weather-adaptive-odd-management.md                              │
│   ← Consumes: environmental attribution for ODD tuning          │
└────────────────────────────────────────────────────────────────┘

9.4 Technology Stack

| Component | Technology | Rationale |
|---|---|---|
| Message bus | Apache Kafka | Already in cloud backend architecture. High throughput, replay capability |
| Stream processing | Apache Flink | Exactly-once semantics, event-time processing, sub-second latency |
| Batch processing | Apache Airflow | Already in cloud backend architecture. DAG scheduling, dependency management |
| Anomaly models | scikit-learn (IsolationForest), PyTorch (autoencoder) | Standard, well-tested implementations |
| Causal discovery | causal-learn (NOTEARS, PC), statsmodels (Granger) | Most mature Python causal inference library |
| SHAP computation | shap library | Standard implementation, TreeSHAP for exact computation |
| Database | PostgreSQL + TimescaleDB | Time-series extension for efficient temporal queries |
| Dashboard | Grafana | Already likely in stack. Custom panels via JSON API |
| Alerting | Grafana Alerting + PagerDuty | Multi-channel alert routing |
| Ticket management | Jira API | Auto-create tickets with evidence links |

10. Cost-Benefit and Metrics

10.1 Implementation Cost

| Phase | Scope | Duration | Cost |
|---|---|---|---|
| Phase 1: Anomaly detection | SPC (CUSUM/EWMA) on fleet KPIs, hierarchical decomposition, alert tiering, Grafana dashboards | 6-8 weeks | $15-25K |
| Phase 2: Shapley attribution | Isolation Forest training, TreeSHAP integration, hierarchical Shapley decomposition, feature attribution dashboard | 4-6 weeks | $12-20K |
| Phase 3: Bayesian diagnosis | Diagnosis engine, prior calibration, sequential evidence gathering, multi-hypothesis tracking | 4-6 weeks | $12-20K |
| Phase 4: OTA regression | A/B test framework, per-class regression detection, counterfactual replay, automatic rollback triggers | 4-6 weeks | $10-18K |
| Phase 5: Environmental + map | METAR correlation, seasonal baseline, map staleness scoring, construction detection | 3-4 weeks | $8-15K |
| Phase 6: Causal discovery | NOTEARS pipeline, PC validation, Granger analysis, quarterly update DAG | 3-4 weeks | $8-15K |
| Total | | 16-22 weeks | $55-95K |

Infrastructure cost (incremental to existing cloud backend):

| Component | Monthly Cost (50 vehicles) | Monthly Cost (200 vehicles) |
|---|---|---|
| Flink cluster (3 nodes) | $300-500 | $800-1,500 |
| TimescaleDB | $100-200 | $300-600 |
| Additional S3 storage | $50-100 | $200-400 |
| Grafana (hosted) | $100 | $200 |
| PagerDuty | $100-200 | $200-400 |
| Compute (batch attribution) | $100-200 | $300-600 |
| Total | $750-1,300/month | $1,800-3,500/month |

10.2 Benefit Quantification

MTTR reduction:

| Metric | Before Attribution System | After (Phase 1-2) | After (Full System) |
|---|---|---|---|
| Mean Time To Detection | 1-4 hours | 5-15 minutes | 2-10 minutes |
| Mean Time To Attribution | 4-24 hours | 30-120 minutes | 10-30 minutes |
| Mean Time To Resolution | 8-48 hours | 2-8 hours | 1-4 hours |
| False positive alerts/day (50 vehicles) | N/A (no alerts) | 3-5 | 1-2 |
| Anomalies requiring manual triage | 100% | 40-60% | 15-25% |
| Analyst headcount (50 vehicles) | 2-3 (reactive) | 1-2 | 1 |
| Analyst headcount (200 vehicles) | 8-12 (impossible) | 3-4 | 2-3 |

Financial impact (50-vehicle fleet):

| Benefit Category | Annual Value | Basis |
|---|---|---|
| Reduced downtime (faster MTTR) | $80-150K | 20-40 hours/month saved x $100-200/hour fleet-hour value |
| Avoided cascading failures | $30-60K | 2-5 prevented per year x $15-30K per cascade |
| Reduced analyst headcount | $50-100K | 1-2 fewer analysts needed |
| OTA regression early detection | $20-50K | 1-3 regressions caught 24-48 hours earlier per year |
| Improved fleet availability | $40-80K | 1-2% availability improvement x fleet revenue |
| Total annual benefit | $220-440K | |
| Total annual cost | $25-45K | Infrastructure + maintenance |
| Net annual benefit | $195-395K | |
| ROI (including implementation) | 2.5-5x in Year 1 | |

10.3 Performance Metrics for the Attribution System

| Metric | Target | Measurement Method |
|---|---|---|
| Attribution accuracy | > 80% correct root cause identification | Validated against manual investigation on sample |
| Attribution coverage | > 90% of P1/P2 anomalies get automated attribution | Fraction with high-confidence diagnosis |
| MTTD (detection) | < 15 min for P1, < 60 min for P2 | Time from anomaly onset to alert |
| MTTA (attribution) | < 30 min for single-cause, < 60 min for multi-cause | Time from alert to root cause hypothesis |
| False positive rate | < 5% of P1 alerts, < 15% of P2 alerts | Manual review of 10% sample |
| Alert volume | < 5 P1-P2 alerts/day for 50-vehicle fleet | Dashboard count |
| Bayesian diagnosis depth | < 4 tests to reach high confidence | Average tests before P > 0.60 |
| Seasonal adjustment accuracy | < 10% MAPE on seasonal baseline | Back-test against held-out year |
| Causal DAG stability | > 80% edge agreement across quarters | Compare quarterly NOTEARS outputs |

10.4 Break-Even Analysis

BREAK-EVEN ANALYSIS

Fixed cost: $55-95K implementation
Variable cost: $750-1,300/month infrastructure (50 vehicles)
Annual operating: $25-45K

Annual benefit per vehicle: $4,400-8,800 (from MTTR reduction, availability, analyst savings)

Break-even fleet size (Year 1 including implementation):
  Conservative: $55K / ($4,400/vehicle) = ~13 vehicles
  Moderate:     $75K / ($6,600/vehicle) = ~12 vehicles
  Optimistic:   $95K / ($8,800/vehicle) = ~11 vehicles

Break-even fleet size (Year 2+, operating cost only):
  Conservative: $45K / ($4,400/vehicle) = ~11 vehicles
  Moderate:     $35K / ($6,600/vehicle) = ~6 vehicles

Conclusion: ROI-positive at 10-15 vehicles. Build before fleet 
reaches 20 vehicles (the manual triage scaling wall).

11. Key Takeaways

  1. Manual triage does not scale past 20 vehicles. At 50 vehicles, anomaly generation rate (15-30/day) exceeds analyst capacity (2-3/day) by an order of magnitude. The backlog grows permanently, and undiagnosed anomalies cascade into safety events.

  2. Hierarchical anomaly detection reduces alert volume 80-90%. Fleet -> airport -> vehicle -> subsystem -> component decomposition, combined with CUSUM/EWMA and Isolation Forest, filters noise before it reaches analysts. Target: < 5 actionable P1/P2 alerts per day for a 50-vehicle fleet.

  3. Fraction-of-vehicles-affected is the single most informative diagnostic signal. If > 80% of vehicles at an airport are affected, the cause is environmental or software (fleet-wide). If < 30%, the cause is vehicle-specific (sensor, calibration). This one check eliminates half the hypothesis space in seconds.

  4. Causal discovery (NOTEARS + PC + Granger) provides the structural backbone. The learned causal DAG encodes which variables can cause which, preventing spurious attribution (e.g., attributing a fog-caused intervention spike to a coincidental OTA update). Re-learn quarterly as fleet operations evolve.

  5. Shapley values provide fair, decomposable attribution. SHAP decomposes the anomaly score into per-feature contributions with exact or near-exact computation. Hierarchical Shapley (fleet -> airport -> vehicle -> feature) makes the computation tractable for large fleets.

  6. Bayesian diagnosis trees reach high confidence in 3-4 tests. Sequential evidence gathering, guided by information gain, reaches > 60% posterior on the true root cause in typically 3-4 automated tests (2-5 minutes), compared to 4-24 hours of manual investigation.

  7. 30-40% of fleet anomalies have multiple co-occurring causes. Single-cause diagnosis leads to partial fixes and recurrence. The Bayesian and causal framework explicitly handles multi-cause scenarios by tracking joint posteriors and interaction effects.

  8. OTA regression is the most dangerous single cause because it is self-inflicted and fleet-wide. A/B testing between OTA versions, per-class regression detection, and counterfactual replay ("would this intervention have happened on the old model?") are non-negotiable for production fleet safety.

  9. Seasonal baselines prevent misattribution. Without seasonal adjustment, the first fog of autumn looks like a system failure. STL decomposition on 12+ months of data creates robust seasonal baselines. For new airports, borrow seasonal profiles from same-climate airports.

  10. EU PLD 2024/2853 creates legal urgency for causal attribution. The rebuttable presumption of causality means that if the fleet operator cannot demonstrate what caused an incident, courts may presume the AV caused it. Automated causal attribution provides the evidence trail that regulatory defense requires.

  11. Implementation cost $55-95K, break-even at 10-15 vehicles. The system pays for itself through MTTR reduction (8-48h to 15-90min), analyst headcount reduction (2-3 fewer for 50 vehicles), and avoided cascading failures ($30-60K/year).

  12. Build before you need it. Retrofitting attribution under operational pressure (when the fleet hits 20+ vehicles and anomalies pile up) costs 2-3x more than building it proactively. The 16-22 week implementation timeline means starting at 10-15 vehicles to be ready at 20.


12. References

Anomaly Detection and SPC

  1. Page, E. S. (1954). "Continuous Inspection Schemes." Biometrika, 41(1/2), 100-115. [Original CUSUM paper]
  2. Roberts, S. W. (1959). "Control Chart Tests Based on Geometric Moving Averages." Technometrics, 1(3), 239-250. [EWMA control charts]
  3. Liu, F. T., Ting, K. M., & Zhou, Z. H. (2008). "Isolation Forest." ICDM, 413-422. [Isolation Forest for anomaly detection]
  4. Malhotra, P. et al. (2016). "LSTM-based Encoder-Decoder for Multi-sensor Anomaly Detection." ICML Workshop on Anomaly Detection, arXiv:1607.00148. [Autoencoder for time-series anomaly detection]

Causal Discovery

  1. Zheng, X. et al. (2018). "DAGs with NO TEARS: Continuous Optimization for Structure Learning." NeurIPS, 9472-9483. [NOTEARS algorithm]
  2. Zheng, X. et al. (2020). "Learning Sparse Nonparametric DAGs." AISTATS, 3414-3425. [NOTEARS-NONLINEAR]
  3. Spirtes, P., Glymour, C., & Scheines, R. (2000). Causation, Prediction, and Search. MIT Press. [PC algorithm, FCI algorithm]
  4. Granger, C. W. J. (1969). "Investigating Causal Relations by Econometric Models and Cross-spectral Methods." Econometrica, 37(3), 424-438. [Granger causality]
  5. Zheng, Y. et al. (2024). "Causal-learn: Causal Discovery in Python." JMLR, 25(60), 1-8. [causal-learn library]

Shapley Values and Attribution

  1. Lundberg, S. M. & Lee, S. I. (2017). "A Unified Approach to Interpreting Model Predictions." NeurIPS, 4765-4774. [SHAP framework]
  2. Lundberg, S. M. et al. (2020). "From local explanations to global understanding with explainable AI for trees." Nature Machine Intelligence, 2, 56-67. [TreeSHAP]
  3. Covert, I. et al. (2021). "Explaining by Removing: A Unified Framework for Model Explanation." JMLR, 22(209), 1-90. [Shapley for feature attribution theory]

Bayesian Diagnosis

  1. Pearl, J. (2009). Causality: Models, Reasoning, and Inference. 2nd edition. Cambridge University Press.
  2. Heckerman, D. (1995). "A Tutorial on Learning with Bayesian Networks." Technical Report MSR-TR-95-06, Microsoft Research.
  3. Murphy, K. P. (2012). Machine Learning: A Probabilistic Perspective. MIT Press. [Bayesian network diagnosis]

Fleet Operations and AV Monitoring

  1. Uber ATG (2020). "Automated Root Cause Analysis for Autonomous Vehicle Failures." Internal technical report (disclosed in NTSB investigations).
  2. Waymo (2023). "Fleet-Level Safety Monitoring and Anomaly Detection." Waymo Safety Report, 4th Edition.
  3. ISO 3691-4:2023. "Industrial trucks -- Safety requirements and verification -- Part 4: Driverless industrial trucks and their systems."
  4. EU Product Liability Directive 2024/2853. "Directive on liability for defective products." (Transposition deadline December 2026.)
  5. ISO 26262:2018. "Road vehicles -- Functional safety." Parts 1-12.

Seasonal Decomposition and Environmental Analysis

  1. Cleveland, R. B. et al. (1990). "STL: A Seasonal-Trend Decomposition Procedure Based on Loess." Journal of Official Statistics, 6(1), 3-73.
  2. ICAO Annex 3 (2018). "Meteorological Service for International Air Navigation." 20th edition. [METAR standard]
  3. FAA Advisory Circular 00-45H (2016). "Aviation Weather Services." [METAR interpretation]

Time-Series Analysis

  1. Hamilton, J. D. (1994). Time Series Analysis. Princeton University Press. [Granger causality, VAR models]
  2. Hyndman, R. J. & Athanasopoulos, G. (2021). Forecasting: Principles and Practice. 3rd edition. OTexts. [STL, ARIMA, seasonal adjustment]
