
Failure Modes and Safety Analysis for World-Model-Based AV

Taxonomy of Failures, Edge Cases, and Mitigation Strategies


1. World Model Failure Taxonomy

1.1 Hallucination

The world model predicts objects or events that don't exist in reality.

| Type | Description | Example | Severity | Detection |
|---|---|---|---|---|
| Phantom objects | Predicts occupied space that is actually free | "Ghost vehicle" appears in prediction | Medium (causes unnecessary braking) | Compare prediction vs. next observation |
| Object duplication | Same object appears multiple times | Aircraft predicted in two positions | Medium | Consistency check across frames |
| Temporal hallucination | Frame ordering errors, events appear out of sequence | Pushback predicted before doors close | Low-Medium | Check physical plausibility |
| Causal hallucination | Incorrect cause-effect chain | "Vehicle braked because of traffic light" (no traffic lights on apron) | Low | Domain-specific plausibility check |

Mitigation:

  • Ensemble disagreement: if 3 model copies disagree on an object's existence, flag as hallucination
  • Persistence filter: require predicted objects to appear in 3+ consecutive frames
  • VQ-VAE reconstruction quality: high reconstruction error → model is uncertain → flag
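
A minimal sketch of how the ensemble-disagreement and persistence checks above could be combined into a single mask over predicted occupancy grids. The array shapes, thresholds, and the choice to work on binary occupancy cells are assumptions for illustration, not part of any specific model:

```python
import numpy as np

def hallucination_mask(ensemble_preds, recent_preds, persist_frames=3):
    """Flag predicted-occupied cells that look like hallucinations.

    ensemble_preds: list of H x W {0,1} occupancy predictions, one per model copy
    recent_preds:   list of the last few H x W {0,1} predictions (newest last)
    A cell is flagged when some copy predicts it occupied but either the copies
    disagree, or the cell has not persisted across `persist_frames` frames.
    """
    stack = np.stack(ensemble_preds).astype(bool)                 # (E, H, W)
    predicted_occupied = stack.any(axis=0)                        # at least one copy says "occupied"
    ensemble_disagrees = predicted_occupied & ~stack.all(axis=0)  # copies do not all agree

    hist = np.stack(recent_preds[-persist_frames:]).astype(bool)  # (k, H, W)
    not_persistent = ~hist.all(axis=0)                            # missing in a recent frame

    return predicted_occupied & (ensemble_disagrees | not_persistent)
```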

1.2 Mode Collapse

The world model predicts only the average future, missing multi-modal possibilities.

| Manifestation | Description | Impact |
|---|---|---|
| Mean trajectory | Predicts average of possible futures | Vehicle predicted to stop halfway instead of "go left" OR "go right" |
| Missing rare modes | Doesn't predict low-probability events | Emergency vehicle not predicted |
| Overconfidence | Assigns high probability to a single future | Doesn't plan for alternative outcomes |

Mitigation:

  • Use diffusion or flow matching (inherently multi-modal) instead of regression
  • Sample multiple futures and evaluate each
  • Train on balanced datasets (oversample rare scenarios)
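
To act on the second bullet, one option is to draw several stochastic rollouts and score a candidate plan against the worst sample rather than the mean. A sketch under assumed interfaces (`world_model.sample_future`, a user-supplied `cost_fn`):

```python
def robust_plan_cost(plan, world_model, state, cost_fn, n_samples=8):
    """Score a candidate plan against several sampled futures.

    world_model.sample_future(state, plan) is an assumed interface returning one
    stochastic rollout; cost_fn(plan, future) scores it (e.g. collision risk).
    Planning against the worst sample keeps a mode-collapsed "average future"
    from hiding a dangerous branch.
    """
    costs = sorted(cost_fn(plan, world_model.sample_future(state, plan))
                   for _ in range(n_samples))
    return costs[-1]   # worst case; a softer variant uses e.g. the 90th percentile
```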

1.3 Temporal Drift

Predictions diverge from reality over longer horizons due to compounding errors.

Prediction error at time step k:
  ε_k ≈ ε_1 × γ^(k-1)    (exponential growth for γ > 1)

For autoregressive models (OccWorld, DrivingGPT):
  Error compounds because each prediction feeds into the next

For diffusion models (GAIA-2, DriveDreamer):
  Less drift because entire sequence is generated at once

Mitigation:

  • Limit prediction horizon (2-4s is practical, beyond 4s reliability drops)
  • Confidence decay: weight predictions by 1/k for planning cost
  • Re-predict every cycle (don't rely on old predictions)
  • Shortcut forcing (DreamerV4): predict directly to future, skip intermediate steps
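
The confidence-decay bullet can be implemented as a simple 1/k weighting of per-step planning costs, so that heavily drifted late-horizon predictions cannot dominate near-term ones. A small illustrative sketch:

```python
def horizon_weighted_cost(step_costs):
    """Combine per-step planning costs with 1/k confidence decay.

    step_costs[k-1] is the cost contribution evaluated at prediction step k.
    Later steps carry less weight because their predictions have drifted more.
    """
    return sum(cost / k for k, cost in enumerate(step_costs, start=1))

# Example: identical raw cost at every step, but step 1 counts 4x as much as step 4.
print(horizon_weighted_cost([1.0, 1.0, 1.0, 1.0]))   # 1 + 1/2 + 1/3 + 1/4 ≈ 2.08
```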

1.4 Action Infidelity

The world model prediction doesn't correctly reflect the ego action.

Problem: You feed trajectory τ_A, but the prediction looks the same as for τ_B
→ The world model learned to ignore the action input

ACT-Bench finding: Vista achieves only 30.72% action fidelity
→ Most driving world models have this problem

Mitigation:

  • Action-conditioned training (Drive-OccWorld explicitly conditions on action)
  • Action dropout during training (force model to use action when available)
  • Evaluate action fidelity separately from visual quality
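
One rough way to evaluate action fidelity separately from visual quality is to probe whether the rollout actually changes when the conditioning trajectory changes. This is only a sanity check, not the ACT-Bench protocol; `world_model.rollout` is an assumed interface returning predicted ego positions of equal length for both rollouts:

```python
import numpy as np

def action_sensitivity(world_model, state, traj_a, traj_b):
    """Crude action-fidelity probe.

    Roll the model out under two distinct candidate trajectories and measure how
    far apart the predicted ego positions end up. A value near zero means the
    model is ignoring its action input.
    """
    pred_a = np.asarray(world_model.rollout(state, traj_a))   # (T, 2) predicted x, y
    pred_b = np.asarray(world_model.rollout(state, traj_b))
    return float(np.linalg.norm(pred_a - pred_b, axis=1).mean())
```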

2. Sensor Failure Modes

2.1 LiDAR Degradation

| Condition | Effect on LiDAR | Detection | Response |
|---|---|---|---|
| Heavy rain | Point density drops 30-50%, spurious returns | Point count monitoring | Switch to radar-primary |
| Fog | Range reduced, backscatter noise | Intensity pattern analysis | Reduce ODD, slow down |
| De-icing spray | Lens contamination, complete blockage | Sudden point count drop to near-zero | Emergency stop, wait for clearing |
| Snow on sensor | Partial/full blockage | Asymmetric point density | Activate heater, alert |
| Jet blast vibration | Misalignment, noisy returns | IMU vibration detection | Stop, recalibrate |
| Sun glare on tarmac | Spurious high-intensity returns | Intensity outlier detection | Filter, use geometric features only |
| Standing water | Mirror reflections, phantom ground plane | Multi-return detection | Raise ground filter threshold |
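
A sketch of the point-count monitoring that several of the detection entries above rely on. The thresholds are placeholders to be tuned per sensor and site:

```python
from collections import deque

class PointCountMonitor:
    """Track LiDAR return counts and classify degradation.

    Thresholds are illustrative: a drop to near zero suggests lens blockage
    (e.g. de-icing spray), a 30-50% drop is consistent with heavy rain.
    """
    def __init__(self, window=10, blockage_ratio=0.05, rain_ratio=0.7):
        self.baseline = deque(maxlen=window)    # recent counts under healthy conditions
        self.blockage_ratio = blockage_ratio
        self.rain_ratio = rain_ratio

    def update(self, point_count):
        if not self.baseline:
            self.baseline.append(point_count)
            return "ok"
        ref = sum(self.baseline) / len(self.baseline)
        ratio = point_count / max(ref, 1)
        if ratio < self.blockage_ratio:
            return "blockage"    # near-zero returns: stop and wait for clearing
        if ratio < self.rain_ratio:
            return "degraded"    # e.g. heavy rain: switch to radar-primary
        self.baseline.append(point_count)   # only adapt the baseline on healthy frames
        return "ok"
```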

2.2 GPS/GNSS Degradation

| Condition | Effect | Detection | Response |
|---|---|---|---|
| Multipath near terminals | Position error 2-10 m | HDOP monitoring, position jumping | Switch to LiDAR SLAM localization |
| Near large aircraft | Signal shadowing | Satellite count drop | Use UWB beacons as backup |
| RF interference | Complete GPS loss | PVT status monitoring | Dead reckoning (IMU + wheel odometry) |

2.3 Camera Degradation (When Added)

| Condition | Effect | Detection | Response |
|---|---|---|---|
| Night + poor apron lighting | Low SNR, missed detections | Brightness histogram | HDR mode, rely on LiDAR |
| De-icing glycol on lens | Blurred/obscured image | Sharpness metric drop | LiDAR-only mode |
| Sun in frame | Bloom, saturation | Exposure analysis | Mask affected regions |
| Vibration blur | Motion blur at low shutter speed | IMU correlation | Increase shutter speed |

3. Software and System Failures

3.1 GPU/Inference Failures

| Failure | Impact | Detection | Response |
|---|---|---|---|
| CUDA OOM | Model inference crashes | Try/catch CUDA errors | Fall back to current stack |
| TensorRT engine corruption | Wrong outputs | Output range validation | Reload engine, restart node |
| GPU thermal throttling | Latency increases | Temperature monitoring | Reduce model complexity, use Lite tier |
| Inference timeout | Stale predictions | Watchdog timer | Use last-known prediction, alert |
| NaN in outputs | Unpredictable behavior | NaN check on every output | Discard, use fallback |
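
A hedged sketch wrapping a single inference call with the checks listed above (CUDA error handling, NaN check, output range validation). `model.infer` and `fallback` are assumed interfaces, and the output is assumed to be a numpy array or torch tensor:

```python
import math

def guarded_inference(model, inputs, fallback, expected_range=(-1e3, 1e3)):
    """Run world-model inference with runtime output checks.

    Any CUDA error, NaN, or out-of-range value triggers the fallback (the
    current non-world-model stack) instead of letting a bad prediction
    reach the planner.
    """
    try:
        output = model.infer(inputs)
    except RuntimeError as err:            # CUDA OOM surfaces as a RuntimeError
        return fallback(inputs), f"inference error: {err}"

    values = output.flatten().tolist()
    lo, hi = expected_range
    if any(math.isnan(v) for v in values):
        return fallback(inputs), "NaN in outputs"
    if any(not (lo <= v <= hi) for v in values):
        return fallback(inputs), "output range violation"
    return output, "ok"
```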

3.2 ROS Communication Failures

| Failure | Impact | Detection | Response |
|---|---|---|---|
| Topic dropout | Missing sensor data | Heartbeat monitoring | Use last-known data, degrade gracefully |
| Clock skew | Misaligned sensor fusion | TF timestamp validation | Reject out-of-sync data |
| Message queue overflow | Stale data | Queue size monitoring | Drop old messages, process latest |
| Node crash | Component offline | Lifecycle management | Auto-restart, log incident |
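
A sketch of heartbeat and timestamp monitoring for the first two rows, written as a plain helper rather than a full ROS node; topic names, timeouts, and the skew tolerance are placeholders:

```python
class TopicHealthMonitor:
    """Track last-message times per topic and flag dropouts and clock skew.

    All timestamps are assumed to share one clock (e.g. ROS time on every node).
    """

    def __init__(self, timeouts, max_skew=0.1):
        self.timeouts = timeouts      # e.g. {"/lidar/points": 0.2, "/gnss/fix": 0.5}
        self.max_skew = max_skew      # seconds of header-vs-receive mismatch tolerated
        self.last_seen = {}

    def on_message(self, topic, header_stamp, receive_time):
        """Record a message; return False if it should be rejected as out-of-sync."""
        self.last_seen[topic] = receive_time
        return abs(receive_time - header_stamp) <= self.max_skew

    def stale_topics(self, now):
        """Topics whose last message is older than their allowed interval."""
        return [t for t, limit in self.timeouts.items()
                if now - self.last_seen.get(t, float("-inf")) > limit]
```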

4. Airside-Specific Failure Scenarios

4.1 Critical Scenarios

| Scenario | Why It's Hard | Current Stack Handles? | World Model Helps? |
|---|---|---|---|
| Ground crew walks behind aircraft nose | Occluded by aircraft body | No (can't see them) | Yes (predicts pedestrian emergence from occlusion) |
| Aircraft pushback starts unexpectedly | Large object starts moving | No (reactive only) | Yes (predicts pushback from turnaround phase) |
| FOD on taxiway | Small object, high speed (taxi) | No (no FOD detection) | Yes (anomaly in occupancy prediction) |
| Emergency vehicle approaching | Must yield immediately | Partially (if detected) | Yes (predicts emergency vehicle trajectory) |
| Jet blast from engine startup | Invisible hazard | No (no jet blast awareness) | Yes (ADS-B + aircraft type → hazard zone) |
| De-icing spray hits sensors | Sudden sensor degradation | No (relies on all sensors) | Yes (OOD detection, graceful degradation) |
| Two GSE approaching same stand | Coordination required | No (no multi-agent reasoning) | Yes (shared world model predicts conflict) |
| Construction zone not in map | Map doesn't match reality | Partial (if obstacles detected) | Yes (NOTAM integration + anomaly detection) |

4.2 Long-Tail Distribution

The "long tail" of rare scenarios is particularly challenging for airside operations:

Frequency distribution of airside scenarios:
  90%: Normal operations (straight driving, parking, loading)
  9%:  Common variations (weather, night, busy apron)
  0.9%: Unusual (emergency vehicle, equipment failure, bird strike)
  0.1%: Rare (aircraft abort, tire blowout, fuel spill)
  0.01%: Extremely rare (runway incursion, security incident)

The world model must handle ALL of these safely.
Training data will cover the top 90% well, partially cover 9%,
and barely cover the rest.

Strategy for the long tail:

  1. Adversarial scenario generation: Use world model to imagine worst cases (SafeDreamer)
  2. Foundation model generalization: VLAs can reason about novel situations via language
  3. Conservative fallback: When uncertain, stop and request teleoperation
  4. Scenario mining from fleet: Automatically extract rare events from continuous operation

5. SOTIF Analysis for World Models

5.1 ISO 21448 SOTIF Framework

SOTIF (Safety of the Intended Functionality) addresses failures that arise from the intended behavior of the system, not from hardware/software faults.

Four-quadrant model:

                Known          Unknown
Safe          1: Known Safe    3: Unknown Safe
              (normal ops)     (untested but OK)

Unsafe        2: Known Unsafe  4: Unknown Unsafe
              (identified      (unidentified
               triggers)        triggers — THE DANGER)

Goal: Minimize areas 3 and 4 (move them to 1 and 2).

5.2 Triggering Conditions for World Models

| Triggering Condition | SOTIF Quadrant | Functional Insufficiency | Mitigation |
|---|---|---|---|
| LiDAR in heavy rain | 2 (Known Unsafe) | Reduced point density → occupancy prediction fails | Radar fusion, weather-adaptive thresholds |
| Novel aircraft type not in training data | 4 (Unknown Unsafe) | World model doesn't know how this aircraft behaves | Open-vocabulary detection; occupancy is class-agnostic |
| GPS multipath near terminal | 2 (Known Unsafe) | Localization error → wrong BEV alignment | LiDAR SLAM fallback |
| Simultaneous pushback of adjacent aircraft | 4 (Unknown Unsafe) | World model never trained on this scenario | Adversarial scenario generation, test in sim |
| World model predicts static but aircraft moves | 2 (Known Unsafe) | Model lag, insufficient turnaround context | A-CDM integration, shorter prediction horizon |
| Reflective wet tarmac at night | 4 (Unknown Unsafe) | LiDAR phantom returns → occupancy hallucination | Standing water detection, multi-frame consistency |

5.3 SOTIF Verification Strategy

1. Identify triggering conditions (above table)
2. For each: estimate probability × severity → risk level
3. High-risk items: test extensively in simulation
4. Build test scenarios covering each triggering condition
5. Monitor in shadow mode: does the trigger actually cause failure?
6. Iterate until residual risk is below threshold
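
Step 2 can be kept as a small, auditable structure in code. The entries below are placeholders on a 1-5 scale, not an actual assessment:

```python
# Placeholder triggering conditions; probability and severity on a 1-5 scale.
triggering_conditions = [
    {"name": "LiDAR in heavy rain",            "probability": 4, "severity": 3},
    {"name": "Novel aircraft type",            "probability": 2, "severity": 4},
    {"name": "GPS multipath near terminal",    "probability": 4, "severity": 2},
    {"name": "Simultaneous adjacent pushback", "probability": 1, "severity": 5},
]

def risk_ranking(conditions):
    """Rank triggering conditions by risk = probability x severity (step 2 above)."""
    return sorted(conditions,
                  key=lambda c: c["probability"] * c["severity"],
                  reverse=True)

for c in risk_ranking(triggering_conditions):
    print(c["name"], c["probability"] * c["severity"])
```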

6. Formal Verification Limits

6.1 What CAN Be Formally Verified

| Property | Method | Feasibility |
|---|---|---|
| RSS safety envelope | Mathematical proof (Mobileye) | High (rules are simple) |
| Bounded output range | Neural network verification (alpha-beta-CROWN) | Medium (for small networks) |
| Geofence compliance | Polygon containment check | High (geometric) |
| Speed limit compliance | Simple threshold check | High (trivial) |
| Watchdog timeout | Formal timing analysis | High (well-understood) |

6.2 What CANNOT Be Formally Verified

| Property | Why Not | Alternative |
|---|---|---|
| World model prediction accuracy | Model is too large for verification tools | Statistical testing + conformal prediction |
| Correct behavior in all scenarios | Infinite scenario space | Risk-based testing + SOTIF analysis |
| No hallucination, ever | Generative models can always hallucinate | Runtime detection + ensemble disagreement |
| OOD detection completeness | Unknown unknowns | Multi-layer detection + conservative fallback |

6.3 The Verification Gap

Formal verification can handle: ~10^6 parameters (small networks)
World models have: ~10^8 parameters (100M+)

Gap: 100x

Approaches to bridge:
1. Verify the safety monitor (small) not the world model (large)
2. Verify properties of the combined system (Simplex guarantees)
3. Statistical verification with conformal prediction bounds
4. Runtime monitoring as a "continuous verification" substitute
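
A sketch of approach 3: split conformal prediction turns held-out prediction errors into a distribution-free bound that holds with probability 1 - alpha, which is the kind of statistical statement that can stand in for formal proofs about the world model itself. The error metric and calibration set are whatever the evaluation pipeline provides:

```python
import math

def conformal_error_bound(calibration_errors, alpha=0.05):
    """Split conformal bound on prediction error.

    calibration_errors: per-sample errors (e.g. occupancy loss or ADE) on a
    held-out calibration set the model never trained on.
    Returns a threshold q such that a new error exceeds q with probability
    at most alpha, assuming calibration and deployment data are exchangeable.
    """
    n = len(calibration_errors)
    rank = math.ceil((n + 1) * (1 - alpha))       # conformal quantile index
    return sorted(calibration_errors)[min(rank, n) - 1]

# Example: with 1000 calibration errors and alpha = 0.05,
# the 951st smallest error is the 95% bound.
```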

7. Defense-in-Depth Safety Architecture

Layer 0: DESIGN
├── World model trained with safety-aware objectives (SafeDreamer)
├── Occupancy prediction with calibrated uncertainty
└── RSS constraints built into planning cost function

Layer 1: RUNTIME MONITORING
├── OOD detection (ensemble disagreement + reconstruction error)
├── Prediction consistency check (temporal, spatial)
├── Sensor health monitoring (per-sensor diagnostics)
└── Confidence calibration (conformal prediction)

Layer 2: SAFETY CONTROLLER
├── RSS envelope check on every proposed trajectory
├── Occupancy collision check (predicted + current)
├── Geofence check (NOTAM zones, airport boundary)
└── Speed limit enforcement

Layer 3: SIMPLEX ARBITRATION
├── If Layer 1 or 2 fails → switch to fallback stack
├── Fallback stack: proven Frenet planner (current reference airside AV stack)
├── Hysteresis prevents rapid switching
└── All transitions logged

Layer 4: GRACEFUL DEGRADATION
├── Reduced capability mode (slow, wide margins)
├── Controlled stop (safe position, parking brake)
├── Teleoperation request (remote operator)
└── Hardware e-stop (physical button)

Layer 5: PHYSICAL SAFETY
├── Mechanical speed limiter
├── Hardware e-stop circuit (independent of software)
├── Bumper/contact sensors
└── Emergency lighting and horn

Key principle: No single layer failure leads to an unsafe outcome. An adversary (or bug) must defeat ALL layers simultaneously to cause harm.
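
To make Layer 2 concrete, a sketch of the RSS longitudinal safe-following-distance check from Shalev-Shwartz et al.; the response time and acceleration parameters below are illustrative defaults, not values calibrated for airside GSE:

```python
def rss_safe_longitudinal_distance(v_rear, v_front, rho=0.5,
                                   a_accel_max=1.0, b_brake_min=2.0, b_brake_max=4.0):
    """RSS minimum safe following distance (Shalev-Shwartz et al.).

    v_rear, v_front: speeds of the rear (ego) and front vehicle in m/s.
    rho: response time in s; accelerations in m/s^2 (illustrative values).
    """
    v_rear_after = v_rear + rho * a_accel_max          # worst case: ego accelerates during rho
    d = (v_rear * rho
         + 0.5 * a_accel_max * rho ** 2
         + v_rear_after ** 2 / (2 * b_brake_min)       # ego then brakes only gently
         - v_front ** 2 / (2 * b_brake_max))           # front vehicle brakes hard
    return max(d, 0.0)

def rss_check(gap, v_rear, v_front):
    """Layer 2 veto: reject the trajectory if the actual gap is below the RSS minimum."""
    return gap >= rss_safe_longitudinal_distance(v_rear, v_front)
```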


Sources

  • ISO 21448:2022 "Road vehicles — Safety of the intended functionality"
  • ISO/PAS 8800:2024 "Road vehicles — Safety and artificial intelligence"
  • Shalev-Shwartz et al. "On a Formal Model of Safe and Scalable Self-driving Cars" (RSS)
  • Katz et al. "Marabou: A Framework for Verification and Analysis of Deep Neural Networks"
  • NTSB accident reports on AV incidents (Uber ATG, Cruise)
  • Burton et al. "Mind the gaps: Assuring the safety of autonomous systems from an engineering perspective"
  • AMLAS Methodology (University of York)
  • UL 4600 Standard for Evaluation of Autonomous Products

Public research notes collected from public sources.