World-Model-Powered Autonomous Vehicle Stack: Airside Reference ODD
Design Specification
Date: 2026-03-21
Type: Research Design — Next-Generation AV Stack with Airport Airside as Reference ODD
Context: Greenfield parallel track alongside the existing reference airside ROS Noetic stack (Simplex architecture)
Approach: Dual-Track Foundation (Data Engine + World Model Stack)
Domain stance: The concrete examples are airside because that is the best-developed reference ODD. The architecture should keep domain context explicit so the same stack can be re-evaluated for road, warehouse, yard, port, mining, construction, agriculture, delivery-robot, and campus deployments.
1. Problem Statement
The current reference airside ADS stack is a production-tested, modular ROS Noetic pipeline optimized for waypoint-following on airport airside. It works — but it has fundamental limitations that world models and foundation AI can address:
| Current Limitation | Root Cause | Impact |
|---|---|---|
| Can only detect 3 object types (deck, ULD, trailer) | Hand-crafted RANSAC perception, no learned models | Cannot detect aircraft, personnel, FOD, other GSE |
| No prediction of other agents' behavior | Frenet planner treats obstacles as static | Reactive-only — cannot anticipate pushback, pedestrian intent |
| Requires per-airport waypoint authoring + PCD maps | No generalization capability | Weeks of setup per new airport deployment |
| Kinematic-only simulation (no 3D scene, no sensors) | Python bicycle model, no rendering | Cannot test perception, cannot generate diverse scenarios |
| No explainability for decisions | Rule-based FSM with no reasoning trace | Hard to build safety cases, impossible to debug edge cases |
| LiDAR-only perception | No camera integration | Misses color, texture, signage, markings — critical airside info |
| No adverse condition handling | No rain/fog/de-icing compensation beyond SOR filter | Degraded or unsafe operation in real airport weather |
The thesis: A world model that learns the dynamics of the airside environment — how aircraft move, how ground crews behave, how turnarounds sequence — combined with foundation model perception and VLA reasoning, can address all of these simultaneously.
2. System Architecture
2.1 Simplex Architecture Overview
┌─────────────────────────────────────────────────────────────────────────┐
│ VEHICLE HARDWARE LAYER │
│ Sensors: LiDAR (4-8x RoboSense) + IMU + RTK-GPS + Wheel Encoders │
│ Future: + Surround Cameras (6-8x) + 4D Radar (2-4x) + ADS-B + UWB │
│ Actuators: Hydraulic/Electric steering + braking via CAN │
└───────────────────────────────┬─────────────────────────────────────────┘
│ Raw sensor data
▼
┌─────────────────────────────────────────────────────────────────────────┐
│ SENSOR ABSTRACTION LAYER │
│ Unified multi-modal representation regardless of sensor configuration │
│ Handles: timestamping, calibration, ego-motion compensation │
│ Output: Synchronized, ego-compensated sensor bundle @ 10-20Hz │
└──────────┬────────────────────────────────────────────────┬─────────────┘
│ │
▼ ▼
┌──────────────────────────┐ ┌──────────────────────────┐
│ NEW STACK │ │ CURRENT STACK │
│ (High-Performance) │ │ (Verified Fallback) │
│ │ │ │
│ ┌─────────────────────┐ │ │ ┌─────────────────────┐ │
│ │ Perception Backbone │ │ │ │ LiDAR Perception │ │
│ │ BEV Encoder │ │ │ │ (RANSAC pipeline) │ │
│ │ Foundation Features │ │ │ └─────────┬───────────┘ │
│ └─────────┬───────────┘ │ │ │ │
│ │ │ │ ┌─────────▼───────────┐ │
│ ┌─────────▼───────────┐ │ │ │ GTSAM Localization │ │
│ │ World Model │ │ │ └─────────┬───────────┘ │
│ │ 4D Occupancy Pred. │ │ │ │ │
│ │ Motion Forecasting │ │ │ ┌─────────▼───────────┐ │
│ │ Scene Understanding │ │ │ │ Frenet Planner │ │
│ └─────────┬───────────┘ │ │ │ Behavior FSM │ │
│ │ │ │ └─────────┬───────────┘ │
│ ┌─────────▼───────────┐ │ │ │ │
│ │ Planning / VLA │ │ │ ┌─────────▼───────────┐ │
│ │ Trajectory Gen. │ │ │ │ Vehicle Interface │ │
│ │ Language Reasoning │ │ │ │ (CAN gateway) │ │
│ └─────────┬───────────┘ │ │ └─────────┬───────────┘ │
│ │ │ │ │ │
└────────────┼───────────────┘ └────────────┼──────────────┘
│ │
▼ ▼
┌─────────────────────────────────────────────────────────────────────────┐
│ SAFETY MONITOR │
│ │
│ ┌─────────────┐ ┌─────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ OOD │ │ RSS Safety │ │ Confidence │ │ Watchdog / │ │
│ │ Detection │ │ Envelope │ │ Calibration │ │ Heartbeat │ │
│ │ (ensemble) │ │ Check │ │ Threshold │ │ Monitor │ │
│ └──────┬──────┘ └──────┬──────┘ └──────┬───────┘ └──────┬───────┘ │
│ └────────────────┼───────────────┬─┘ │ │
│ ▼ ▼ │ │
│ ┌──────────────┐ │ │
│ │ ARBITRATOR │◄──────────────────────────┘ │
│ │ Select: NEW │ │
│ │ or FALLBACK│ │
│ └──────┬───────┘ │
└──────────────────────────┼──────────────────────────────────────────────┘
│
▼
Vehicle Actuators
2.2 Design Principles
- Compute-agnostic: Every component specifies its interface (input tensors, output tensors) not its runtime. Same model runs on Orin, Thor, or cloud via ONNX/TensorRT/Triton.
- Sensor-agnostic: The BEV encoder accepts a configurable sensor manifest. Start with LiDAR-only, add cameras and radar without architecture changes.
- Graceful degradation: If the new stack fails, hesitates, or detects OOD, the current stack takes over seamlessly. The vehicle never stops being safe.
- Data flywheel: Every meter driven generates training data. The system improves from its own operation.
- Domain-aware: The stack accepts domain context as first-class input. For airside this means A-CDM, NOTAM, ADS-B, and A-SMGCS; for other domains it may mean WMS/TMS systems, yard-management systems, mine dispatch, port TOS, farm boundaries, or road maps and traffic rules.
3. Component Design
3.1 Sensor Abstraction Layer
Purpose: Decouple the world model stack from specific sensor hardware. Provide a unified, synchronized, ego-compensated sensor bundle.
Design:
from dataclasses import dataclass
from typing import Dict, Optional

# SE3, Twist, PointCloud, LiDARMeta, Image, K (3x3 intrinsics), RadarPointCloud,
# and AirportContext are the stack's own geometry/sensor types.
@dataclass
class SensorBundle:
    timestamp: float                   # Unified timestamp (ns)
    ego_pose: SE3                      # Ego vehicle pose in world frame
    ego_velocity: Twist                # Linear + angular velocity
    # LiDAR (always present)
    pointcloud: PointCloud             # Aggregated, ego-compensated, in vehicle frame
    pointcloud_metadata: LiDARMeta     # Sensor positions, beam config, intensity stats
    # Camera (optional, added incrementally)
    images: Optional[Dict[str, Image]] = None           # Named cameras: front, left, right, back, etc.
    camera_intrinsics: Optional[Dict[str, K]] = None    # Per-camera calibration
    camera_extrinsics: Optional[Dict[str, SE3]] = None  # Per-camera mounting pose
    # Radar (optional, future)
    radar_points: Optional[RadarPointCloud] = None      # 4D radar: x, y, z, doppler velocity
    # Airport context (optional, from integration layer)
    airport_context: Optional[AirportContext] = None    # ADS-B targets, NOTAM zones, flight schedule
Key decisions:
- Ego-motion compensation at this layer — the world model receives motion-corrected data, never raw. Uses IMU + wheel odometry for deskewing (same as your current LidarProcessorNodelet).
- Variable-rate fusion — LiDAR at 10Hz, cameras at 10-30Hz, radar at 13Hz, GPS at 2Hz. The bundle is produced at the LiDAR rate, with the most recent data from each sensor interpolated to the LiDAR timestamp (see the sketch after this list).
- Graceful absence — Every non-LiDAR field is Optional. The downstream BEV encoder handles missing modalities via masked attention or zero-padding.
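The variable-rate fusion rule can be made concrete with a small sketch. It is illustrative only: the buffer layout, staleness budgets, and helper names are assumptions, and a production node would interpolate poses to the LiDAR timestamp rather than take the nearest sample.

```python
# A minimal sketch, not the shipped node: per-sensor ring buffers of
# (timestamp_ns, payload) pairs; helper and field names are assumptions.
import bisect
from typing import Any, List, Optional, Tuple

Sample = Tuple[float, Any]

def latest_fresh(buf: List[Sample], t_ns: float,
                 max_staleness_ns: float) -> Optional[Any]:
    """Most recent sample at or before t_ns, or None if missing or too stale."""
    i = bisect.bisect_right([ts for ts, _ in buf], t_ns)
    if i == 0:
        return None
    ts, payload = buf[i - 1]
    return payload if (t_ns - ts) <= max_staleness_ns else None

def assemble_bundle(t_lidar_ns: float, pointcloud: Any,
                    camera_buf: List[Sample], radar_buf: List[Sample]) -> dict:
    """One bundle per LiDAR frame; non-LiDAR modalities degrade to None."""
    return {
        "timestamp": t_lidar_ns,
        "pointcloud": pointcloud,                                     # always present
        "images": latest_fresh(camera_buf, t_lidar_ns, 0.10e9),       # 10-30Hz cameras
        "radar_points": latest_fresh(radar_buf, t_lidar_ns, 0.15e9),  # ~13Hz radar
    }
```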
3.2 Perception Backbone
Purpose: Transform raw sensor data into a unified Bird's-Eye-View (BEV) feature representation that the world model consumes.
Architecture:
SensorBundle
│
├── LiDAR Branch
│ ├── Voxelization (pillar-based, configurable resolution)
│ ├── PointPillars / CenterPoint encoder
│ │ OR VoxelNet backbone
│ └── BEV features (C × H × W)
│
├── Camera Branch (when available)
│ ├── DINOv2 backbone (frozen or LoRA-tuned)
│ ├── LSS (Lift-Splat-Shoot) view transform
│ │ OR BEVFormer cross-attention
│ └── BEV features (C × H × W)
│
├── Radar Branch (when available)
│ ├── Pillar encoding of radar points
│ └── BEV features (C × H × W)
│
└── BEV Fusion
├── Concatenation + 1x1 Conv (simple)
│ OR Transformer cross-attention (rich)
└── Fused BEV features (C_fused × H × W)
Key decisions:
- BEV resolution: 0.2m/pixel, 100m × 100m range = 500 × 500 grid. Sufficient for airside speeds (5-30 km/h). Configurable for larger ranges when tracking aircraft at distance.
- DINOv2 backbone for cameras: Self-supervised pre-training on large-scale curated image data provides rich semantic features without airside-specific training. DINO-based pre-training has shown ~9 percentage point route completion improvement over supervised baselines in CARLA simulation (62.18% vs. 53.20%). Frozen initially, LoRA fine-tuned on airside data later.
- Open-vocabulary detection head: Grounding DINO or YOLO-World attached to camera features for zero-shot detection of airside-specific objects. Text prompts: "baggage tractor", "belt loader", "aircraft nose gear", "ground crew in hi-vis", "FOD on tarmac". This gives you detection of 30+ GSE types and 100+ aircraft variants without any airside-specific training data.
- LiDAR-only mode: When cameras aren't available, the BEV encoder uses only the LiDAR branch. The world model still works — it just has geometric features without semantic richness. A fusion sketch that tolerates absent modalities follows below.
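A minimal PyTorch sketch of that "graceful absence" behaviour, using the simple concatenation + 1x1 conv fusion variant with zero-padding; the channel widths and module names are illustrative assumptions, not the actual implementation.

```python
# Minimal sketch of BEV fusion with optional modalities: concatenate whichever
# modality maps exist, zero-fill the rest, fuse with a 1x1 conv.
import torch
import torch.nn as nn

class BEVFusion(nn.Module):
    def __init__(self, c_lidar=64, c_cam=64, c_radar=32, c_fused=128):
        super().__init__()
        self.widths = {"lidar": c_lidar, "camera": c_cam, "radar": c_radar}
        self.fuse = nn.Conv2d(sum(self.widths.values()), c_fused, kernel_size=1)

    def forward(self, feats: dict) -> torch.Tensor:
        """feats: modality name -> (B, C_mod, H, W) tensor, or None if absent."""
        ref = next(f for f in feats.values() if f is not None)  # LiDAR always present
        b, _, h, w = ref.shape
        maps = []
        for name, c in self.widths.items():
            f = feats.get(name)
            if f is None:
                # Missing modality: zero-pad so the fused channel layout is stable.
                f = ref.new_zeros((b, c, h, w))
            maps.append(f)
        return self.fuse(torch.cat(maps, dim=1))

# LiDAR-only mode (Phase 1): camera and radar slots are simply None.
# fused = BEVFusion()({"lidar": lidar_bev, "camera": None, "radar": None})
```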
3.3 World Model Core
This is the heart of the new stack. The world model predicts the future state of the entire scene — not individual objects, but the full 3D environment.
Architecture: 4D Occupancy World Model
History (t-T ... t)
│
┌─────────────────────▼──────────────────────┐
│ ENCODER │
│ BEV features → VQ-VAE tokenization │
│ Codebook: 512-1024 entries │
│ Each BEV frame → sequence of discrete tokens│
└─────────────────────┬──────────────────────┘
│ Token sequence
┌─────────────────────▼──────────────────────┐
│ DYNAMICS MODEL │
│ Autoregressive Transformer (GPT-style) │
│ OR Mamba SSM (O(n) vs O(n^2)) │
│ │
│ Inputs: │
│ - Past scene tokens (t-T ... t) │
│ - Ego action (candidate trajectory) │
│ - Airport context embedding (optional) │
│ │
│ Outputs: │
│ - Future scene tokens (t+1 ... t+K) │
│ - Per-token confidence scores │
└─────────────────────┬──────────────────────┘
│ Future tokens
┌─────────────────────▼──────────────────────┐
│ DECODER │
│ Tokens → 3D Occupancy Grid │
│ Resolution: 0.2m × 0.2m × 0.2m │
│ Height: -1m to 5m (30 voxels) │
│ Semantic channels (optional): │
│ - Occupied / Free / Unknown │
│ - Hazard (jet blast, FOD) │
│ - Dynamic (moving) / Static │
└─────────────────────┬──────────────────────┘
│
4D Occupancy Prediction
(3D space × K future timesteps)
Why occupancy over object detection for airside:
| Property | Object Detection | Occupancy Prediction |
|---|---|---|
| Novel objects (new GSE type, unusual aircraft) | Fails — not in training classes | Works — just occupied voxels |
| Jet blast zone | Cannot represent | Hazard occupancy channel |
| FOD (small debris) | Needs tiny-object detector | Ground-level occupancy anomaly |
| Partial occlusion | Lost detections | Partial occupancy preserved |
| Aircraft (huge objects) | Bounding box is mostly empty | Accurate shape representation |
| Class-agnostic safety | Needs per-class behavior rules | "Don't drive into occupied space" — one rule |
Model sizing (compute-agnostic tiers):
| Tier | Parameters | Prediction Horizon | Latency (Orin) | Latency (Thor) |
|---|---|---|---|---|
| Lite | 50M | 2s (10 steps @ 5Hz) | ~80ms | ~20ms |
| Base | 200M | 4s (20 steps) | ~200ms | ~50ms |
| Large | 500M+ | 8s (40 steps) | Cloud/edge only | ~150ms |
Pre-training strategy:
- Pre-train on nuScenes + Waymo occupancy (300K+ labeled scenes exist)
- Self-supervised pre-training on your airside LiDAR bags (predict masked future frames; sketched below)
- Fine-tune on labeled airside data (auto-labeled via foundation models)
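A minimal sketch of the self-supervised objective in the second step, assuming an OccWorld-style VQ-VAE tokenizer and an autoregressive dynamics model; the `tokenizer`/`dynamics` interfaces are stand-ins, not a specific library API.

```python
# Sketch: next-token prediction on VQ codes of future BEV frames. No labels
# needed; the model learns airside dynamics from raw LiDAR bags alone.
import torch
import torch.nn.functional as F

def masked_future_loss(dynamics, tokenizer, bev_history, bev_future):
    """bev_history: (B, T, C, H, W) past frames; bev_future: (B, K, C, H, W)."""
    with torch.no_grad():
        past_tokens = tokenizer.encode(bev_history)    # (B, T*N) discrete codes
        target_tokens = tokenizer.encode(bev_future)   # (B, K*N) targets
    # Autoregressive rollout conditioned on the past; returns per-token logits.
    logits = dynamics(past_tokens, horizon=target_tokens.shape[1])  # (B, K*N, V)
    return F.cross_entropy(logits.reshape(-1, logits.shape[-1]),
                           target_tokens.reshape(-1))
```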
3.4 Planning Layer
Two operating modes, selected based on available compute and maturity:
Mode 1: World-Model-Augmented Frenet Planner (Near-term)
Your existing Frenet planner generates 420 trajectory candidates per cycle. Currently it scores them against static obstacles. The world model upgrades this:
Frenet Trajectory Generator (existing, 420 candidates)
│
│ For each candidate trajectory:
│
├── Feed trajectory as ego-action to World Model
│ → Get predicted future occupancy for that action
│ → Score: collision probability over prediction horizon
│ → Score: proximity to hazard occupancy (jet blast, etc.)
│ → Score: smoothness of predicted future state
│
├── Existing scores (path smoothness, lateral deviation, etc.)
│
└── Weighted combination → Select best trajectory
This is a drop-in upgrade to your existing planner. The world model acts as a learned cost function for trajectory evaluation. Your Frenet generator, Stanley controller, and vehicle interface are unchanged.
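A sketch of that learned cost term, assuming the world model exposes an action-conditioned occupancy rollout; `predict_occupancy`, the channel dictionary, the footprint rasterizer (passed in), and the weights are illustrative assumptions.

```python
# Sketch: world model as a learned cost term on top of existing Frenet scores.
import numpy as np

def score_candidates(candidates, frenet_scores, world_model, rasterize_footprint,
                     w_collision=10.0, w_hazard=3.0):
    """candidates: list of (K, 4) arrays [x, y, yaw, v]; frenet_scores: the
    existing smoothness/deviation costs. Lowest total cost wins."""
    total = np.asarray(frenet_scores, dtype=float)
    for i, traj in enumerate(candidates):
        # Action-conditioned rollout: predicted future occupancy for this action.
        occ = world_model.predict_occupancy(ego_action=traj)
        # occ: {"occupied": (K, H, W), "hazard": (K, H, W)} probability grids.
        swept = rasterize_footprint(traj, occ["occupied"].shape)  # bool mask
        total[i] += w_collision * float((occ["occupied"] * swept).max())
        total[i] += w_hazard * float((occ["hazard"] * swept).max())
    return int(np.argmin(total))
```

In this arrangement the Frenet generator's 420 candidates and existing scores are untouched; the world model only contributes two extra cost terms.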
Mode 2: Learned Planner / VLA (Medium-term)
Replace the Frenet generator entirely with a learned policy:
World Model (4D Occupancy + Scene Understanding)
│
▼
┌────────────────────────────────────┐
│ VLA Planning Head │
│ │
│ Option A: Alpamayo-R1-10B (8.2B backbone + 2.3B action expert) │
│ - Camera + LiDAR BEV input │
│ - Chain-of-Causation reasoning │
│ - Trajectory + explanation │
│ - Text-guided planning │
│ │
│ Option B: Diffusion Policy (~100M) │
│ - BEV + occupancy input │
│ - Denoising trajectory gen │
│ - Multi-modal futures │
│ - Like DiffusionDrive (CVPR'25) │
│ │
│ Option C: MPC in Latent Space │
│ - TD-MPC2 style │
│ - CEM optimization in world │
│ model latent space │
│ - Implicit value function │
└──────────────┬─────────────────────┘
│
▼
Trajectory (x, y, yaw, velocity, timestamp)
│
▼
Stanley Controller (existing) → Vehicle
Alpamayo integration path: When cameras are added, Alpamayo becomes the natural planning head. Its Chain-of-Causation reasoning generates both the trajectory and a natural language explanation ("Slowing because ground crew is crossing ahead of aircraft nose. Jet blast zone boundary in 15m."). This is invaluable for:
- Safety case evidence (every decision is logged with reasoning)
- Operator trust (the vehicle explains itself)
- Regulatory compliance (EASA AI Roadmap requires explainability)
- Debugging (replay and read why the vehicle did what it did)
3.5 Airport Integration Layer
Purpose: Ingest airport-specific data sources that no road AV stack handles. This is a unique competitive advantage.
┌─────────────────────────────────────────────────┐
│ AIRPORT CONTEXT MANAGER │
│ │
│ ┌─────────────┐ ┌──────────────┐ │
│ │ ADS-B │ │ A-CDM / │ │
│ │ Receiver │ │ AODB │ │
│ │ (1090ES) │ │ (REST API) │ │
│ │ │ │ │ │
│ │ → Aircraft │ │ → Flight │ │
│ │ positions │ │ schedule │ │
│ │ altitudes │ │ → TOBT/TSAT │ │
│ │ callsigns │ │ → Gate │ │
│ └──────┬──────┘ │ assignments│ │
│ │ └──────┬───────┘ │
│ │ │ │
│ ┌──────▼────────────────▼───────┐ │
│ │ CONTEXT FUSER │ │
│ │ │ │
│ │ Produces AirportContext: │ │
│ │ - Active aircraft positions │ │
│ │ - Predicted pushback times │ │
│ │ - Active NOTAM zones │ │
│ │ - Jet blast hazard zones │ │
│ │ (computed from aircraft │ │
│ │ type + engine status) │ │
│ │ - Turnaround phase per gate │ │
│ │ - Dynamic restricted areas │ │
│ └──────────────┬────────────────┘ │
│ ┌──────────────▼────────────────┐ │
│ │ NOTAM │ │ AIXM/AMXM │ │
│ │ Parser │ │ Airport │ │
│ │ → Dynamic │ │ Model │ │
│ │ geofences │ │ → Static │ │
│ │ closures │ │ geometry │ │
│ │ restrictions│ │ taxiways │ │
│ └──────────────┘ │ markings │ │
│ └─────────────┘ │
└─────────────────────────────────────────────────┘
How airport context feeds the world model:
The AirportContext is encoded as a conditioning signal for the world model:
- ADS-B aircraft positions → injected as occupied regions in the BEV with velocity vectors. The world model learns to predict aircraft movement given these priors.
- Flight schedule + turnaround phase → embedded as a context token. The world model learns temporal priors: "Aircraft at gate B7 is in phase 3 (cargo loading), pushback expected in 12 minutes."
- NOTAM zones → rendered as additional BEV channels (restricted/closed). The planner treats these as hard constraints.
- Jet blast prediction → computed from aircraft type + engine configuration (lookup table). Rendered as hazard occupancy extending behind aircraft.
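A sketch of rasterising these signals into conditioning channels on the Section 3.2 grid (0.2 m/px, 500 × 500). The `AirportContext` field names and the jet-blast envelope helper (passed in here) are assumptions, not the actual interface.

```python
# Sketch: AirportContext -> extra BEV channels that condition the world model.
import numpy as np

GRID, RES = 500, 0.2   # 100m x 100m BEV at 0.2 m/px

def world_to_px(xy, ego_xy):
    """World-frame (x, y) point -> clipped (row, col) on the ego-centred grid."""
    rc = ((np.asarray(xy) - np.asarray(ego_xy)) / RES + GRID // 2).astype(int)
    return np.clip(rc, 0, GRID - 1)

def render_context_channels(ctx, ego_xy, jet_blast_envelope):
    """Returns (3, H, W): ADS-B aircraft prior, NOTAM closures, jet blast hazard."""
    chans = np.zeros((3, GRID, GRID), dtype=np.float32)
    for ac in ctx.aircraft:                       # ADS-B targets
        r, c = world_to_px(ac.position_xy, ego_xy)
        chans[0, r, c] = 1.0                      # coarse prior; the world model refines
    for zone in ctx.notam_zones:                  # dynamic geofences -> hard constraint
        for xy in zone.boundary_xy:
            r, c = world_to_px(xy, ego_xy)
            chans[1, r, c] = 1.0
    for ac in ctx.aircraft:                       # CFD lookup per type (Section 6.1)
        if ac.engines_running:
            for xy in jet_blast_envelope(ac.aircraft_type, ac.heading, ac.position_xy):
                r, c = world_to_px(xy, ego_xy)
                chans[2, r, c] = 1.0
    return chans
```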
3.6 Data Engine
Purpose: Transform your scattered bag files into a structured, labeled, continuously growing airside driving dataset.
┌─────────────────────────────────────────────────────────────────┐
│ DATA ENGINE │
│ │
│ STAGE 1: INGEST │
│ ┌─────────────┐ │
│ │ Bag Scanner │ Crawl storage, index all .bag files │
│ │ │ Extract: duration, topics, sensor config, │
│ │ │ GPS trace, vehicle ID, date │
│ └──────┬──────┘ │
│ │ │
│ STAGE 2: EXTRACT │
│ ┌──────▼──────┐ │
│ │ Scene │ Split bags into scenes (30-60s segments) │
│ │ Extractor │ Extract: pointclouds, images (if any), │
│ │ │ poses, CAN data, IMU, GPS │
│ │ │ Align to common timeline │
│ │ │ Export to efficient format (e.g., zarr, lance) │
│ └──────┬──────┘ │
│ │ │
│ STAGE 3: AUTO-LABEL │
│ ┌──────▼──────┐ │
│ │ Foundation │ LiDAR: CenterPoint/PointPillars 3D detection │
│ │ Model │ Camera (if available): Grounding DINO + SAM │
│ │ Labeler │ Occupancy: self-supervised from LiDAR │
│ │ │ Scenario tags: speed, proximity, weather, │
│ │ │ turnaround phase, near-miss │
│ └──────┬──────┘ │
│ │ │
│ STAGE 4: CURATE │
│ ┌──────▼──────┐ │
│ │ Scenario │ Balance dataset by scenario type │
│ │ Miner │ Prioritize rare/interesting events │
│ │ │ Flag low-quality labels for human review │
│ │ │ Applied Intuition-style CLIP retrieval: │
│ │ │ "aircraft pushback with ground crew nearby" │
│ └──────┬──────┘ │
│ │ │
│ STAGE 5: RECONSTRUCT │
│ ┌──────▼──────┐ │
│ │ 3DGS Digital │ Build 3D Gaussian Splatting reconstruction │
│ │ Twin Builder │ of operating airports from accumulated data │
│ │ │ Enable: novel view synthesis, counterfactual │
│ │ │ scenario injection, sensor simulation │
│ └──────┬──────┘ │
│ │ │
│ STAGE 6: GENERATE │
│ ┌──────▼──────┐ │
│ │ Synthetic │ Use Cosmos / world model to generate: │
│ │ Data Gen │ - Rare scenarios (near-collision, FOD, de-icing)│
│ │ │ - Weather variations (rain, fog, night) │
│ │ │ - New airport layouts (from AIXM geometry) │
│ │ │ - Adversarial scenarios (SafeDreamer-style) │
│ └─────────────┘ │
└─────────────────────────────────────────────────────────────────┘
Data flywheel: Once vehicles run the new stack in shadow mode, every drive generates:
- Sensor data (training input)
- Current stack decisions (baseline labels)
- New stack decisions (comparison)
- Safety monitor activations (edge case mining gold)
- World model prediction errors (active learning signal — collect more data where the model is wrong)
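A small sketch of turning the last two signals into a training queue; the log-record fields and weighting are illustrative assumptions.

```python
# Sketch: rank shadow-mode scenes by surprise and stack disagreement, then
# queue the worst for the next training batch (active learning).
def mine_edge_cases(log_records, top_k=100):
    """log_records: dicts with 'scene_id', 'occ_iou' (predicted vs. observed
    occupancy) and 'stacks_disagree' (new vs. current stack chose differently)."""
    def priority(rec):
        surprise = 1.0 - rec["occ_iou"]            # world model was wrong here
        return surprise + (0.5 if rec["stacks_disagree"] else 0.0)
    ranked = sorted(log_records, key=priority, reverse=True)
    return [rec["scene_id"] for rec in ranked[:top_k]]
```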
3.7 Simulation Stack
Multi-fidelity, building on your existing kinematic sim:
LEVEL 0: KINEMATIC SIM (existing airside_python_sim)
├── Bicycle model, validated 4.1% error
├── Tests nav stack logic (waypoints, FSM, zone management)
├── Unchanged — keeps testing current stack
└── Add: world model prediction quality regression tests
LEVEL 1: SCENARIO SIM (new — world model in the loop)
├── World model generates future occupancy from past data
├── Replay real bag data, inject counterfactual actions
├── "What if I had turned left?" → world model predicts outcome
├── Closed-loop: planner proposes action → world model predicts
│ → planner re-plans → repeat for full scenario
└── Key metric: collision rate, goal completion, prediction error
LEVEL 2: NEURAL SIM (new — learned sensor simulation)
├── 3DGS digital twin of airport
├── Render synthetic LiDAR + camera from any viewpoint
├── Insert dynamic agents (aircraft, GSE, pedestrians)
├── Vary: weather, lighting, time of day, surface conditions
├── Full closed-loop with rendered sensor data
└── Enables testing perception + world model + planning end-to-end
LEVEL 3: SHADOW MODE (on real vehicle)
├── New stack runs alongside current stack, no actuator access
├── Compare: new stack decisions vs. current stack actions
├── Log: world model predictions vs. reality (prediction error)
├── Mine: disagreements between stacks = edge cases
└── Graduate to: new stack drives, current stack monitors
3.8 Safety Architecture
┌──────────────────────────────────────────────────────────────┐
│ SAFETY MONITOR │
│ │
│ LAYER 1: INPUT VALIDATION │
│ ├── Sensor health check (all LiDARs alive, IMU sane, etc.) │
│ ├── Ego-pose sanity (velocity within physical bounds) │
│ └── Clock synchronization check │
│ │
│ LAYER 2: WORLD MODEL CONFIDENCE │
│ ├── OOD Detection │
│ │ ├── Ensemble disagreement (N world model copies) │
│ │ ├── Reconstruction error (VQ-VAE decoder quality) │
│ │ └── Input distribution monitoring (Mahalanobis distance)│
│ ├── Prediction Confidence │
│ │ ├── Per-voxel occupancy entropy │
│ │ ├── Temporal consistency (prediction stability) │
│ │ └── Calibration check (predicted vs. observed error) │
│ └── Threshold: if confidence < tau → FALLBACK │
│ │
│ LAYER 3: TRAJECTORY SAFETY │
│ ├── RSS (Responsibility-Sensitive Safety) envelope │
│ │ ├── Safe longitudinal distance (response time + braking)│
│ │ ├── Safe lateral distance │
│ │ ├── Right-of-way rules (aircraft always have priority) │
│ │ └── Proper response to restricted zones │
│ ├── Occupancy collision check │
│ │ └── Trajectory swept volume vs. predicted occupancy │
│ └── Violation: if trajectory breaches RSS → FALLBACK │
│ │
│ LAYER 4: OPERATIONAL BOUNDS │
│ ├── Geofence check (NOTAM zones, airport boundary) │
│ ├── Speed limits (zone-based, from zone_manager) │
│ ├── Teleoperation heartbeat (remote operator alive?) │
│ └── System health (system_minder integration) │
│ │
│ ARBITRATION LOGIC: │
│ ├── ALL layers pass → New Stack drives │
│ ├── ANY layer fails → Current Stack drives (graceful) │
│ ├── Critical failure → Controlled stop + teleoperation │
│ └── All decisions logged with full reasoning trace │
└──────────────────────────────────────────────────────────────┘
Graceful degradation hierarchy:
- Full autonomy (new stack) — everything nominal
- Fallback autonomy (current stack) — new stack uncertain, current stack drives
- Reduced capability — both stacks uncertain, slow to safe speed, increase safety margins
- Controlled stop — safe position, engage parking brake
- Teleoperation — remote operator takes control via Fernride-style interface
3.9 Latency Budget
At airside speeds (5-30 km/h = 1.4-8.3 m/s), a 200ms total pipeline latency means the vehicle travels 0.28-1.67m before reacting. For comparison, the current stack runs planning at 50Hz (20ms cycle).
End-to-end latency budget (sensor-to-actuator):
| Component | Budget | Notes |
|---|---|---|
| Sensor aggregation + ego-compensation | 10ms | Existing LidarProcessorNodelet does this in ~5ms |
| BEV encoding (LiDAR branch) | 20ms | PointPillars on Orin |
| BEV encoding (camera branch, when added) | 30ms | DINOv2 ViT-B + LSS |
| World model inference (Lite tier, 50M) | 50ms | 2s horizon, 10 steps |
| Planning (trajectory scoring or learned) | 20ms | 420 Frenet candidates or diffusion |
| Safety monitor (all layers) | 10ms | RSS check + OOD ensemble |
| Arbitration + command output | 5ms | Negligible |
| Total | ~145ms | End-to-end latency; stages run pipelined, so the 10Hz loop rate is sustained |
World model cadence: The world model runs at 5Hz (every other sensor bundle at the 10Hz input rate). Between world model updates, the planner uses the most recent occupancy prediction with ego-motion extrapolation, as sketched below. This is analogous to how the current stack uses /odom/fused at 20Hz but /odom/fused/high_rate at 50Hz for interpolation.
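A sketch of that extrapolation step, assuming an SE(2) ego-motion delta from the 50Hz odometry and nearest-neighbour resampling on the BEV grid; the frame conventions are illustrative.

```python
# Sketch: re-express the latest 5Hz occupancy slice in the current ego frame.
import numpy as np

def warp_occupancy(occ: np.ndarray, d_pose, res: float = 0.2) -> np.ndarray:
    """occ: (H, W) grid in the frame where it was predicted; d_pose: (dx, dy,
    dyaw) ego motion since the prediction was made."""
    h, w = occ.shape
    dx, dy, dyaw = d_pose
    rows, cols = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    # Current-frame cell centres in metres (ego at the grid centre).
    px = (cols - w // 2) * res
    py = (rows - h // 2) * res
    # Map current-frame points back into the prediction frame: q = R(dyaw) p + t.
    c, s = np.cos(dyaw), np.sin(dyaw)
    qx = c * px - s * py + dx
    qy = s * px + c * py + dy
    src_c = np.clip((qx / res + w // 2).astype(int), 0, w - 1)
    src_r = np.clip((qy / res + h // 2).astype(int), 0, h - 1)
    return occ[src_r, src_c]
```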
Target loop rates:
- Perception + world model: 5-10Hz (new stack)
- Planning: 10-20Hz (reuses latest world model prediction)
- Safety monitor: 20-50Hz (lightweight checks, matches current stack rate)
- Control (Stanley + vehicle interface): 50Hz (unchanged)
3.10 Memory and VRAM Budget
| Component | VRAM/RAM | Notes |
|---|---|---|
| BEV Encoder (LiDAR, PointPillars) | ~500MB VRAM | Standard deployment size |
| BEV Encoder (Camera, DINOv2-B + LSS) | ~2GB VRAM | ViT-B frozen backbone |
| World Model (OccWorld Lite, 50M) | ~800MB VRAM | FP16 inference |
| World Model (OccWorld Base, 200M) | ~2GB VRAM | FP16 inference |
| Safety Monitor (3x ensemble) | ~1.5GB VRAM | 3 copies of lightweight heads |
| VQ-VAE tokenizer/decoder | ~400MB VRAM | Shared with world model |
| Current stack (GTSAM + VGICP) | ~4GB VRAM | Already allocated, GPU localization |
| OS + ROS + buffers | ~4GB RAM | System overhead |
| Sensor data buffers | ~2GB RAM | Point clouds + images in pipeline |
Compute tier feasibility:
| Platform | VRAM | Fits LiDAR-only (Phase 1) | Fits Multi-modal (Phase 2) | Fits VLA (Phase 3) |
|---|---|---|---|---|
| Orin 32GB | 32GB unified | Yes (~13GB total: 7GB GPU + 6GB system) | Yes (~17GB total) | No (10B model needs ~20GB+ GPU alone) |
| Orin 64GB | 64GB unified | Yes | Yes | Marginal (INT4 quantized) |
| Thor | 128GB+ | Yes | Yes | Yes |
| Cloud/edge | Unlimited | Yes | Yes | Yes |
3.11 Localization Strategy
The new stack reuses the existing GTSAM localization — it is already production-tested and provides high-quality ego-pose via GPU VGICP scan-matching.
Rationale: Localization is a solved problem in this stack. The GTSAM factor graph (VGICP + IMU + GPS + wheel odometry) outputs /odom/fused at 20Hz and /odom/fused/high_rate at 50Hz. Both stacks share this single localization source.
Architecture:
GTSAM Localization (existing, shared)
│
├──→ /odom/fused @ 20Hz ──→ Current Stack (planning)
│ ──→ New Stack (BEV encoder ego-compensation)
│
└──→ /odom/fused/high_rate @ 50Hz ──→ Both stacks (control loop)
Future enhancement path:
- Phase 2+: The world model's implicit ego-motion prediction can serve as an additional factor in the GTSAM graph (learned odometry factor alongside VGICP)
- Phase 3+: Visual localization from cameras can add a visual factor to GTSAM (place recognition, visual odometry)
- The localization system is never replaced, only enriched with additional factors
3.12 Arbitration State Machine
┌──────────────┐
startup ──────→│ INITIALIZING │
│ (both stacks │
│ warming up) │
└──────┬───────┘
│ both stacks healthy
▼
┌──────────────┐
┌──────│ NEW_STACK │◄────────────────┐
│ │ DRIVING │ │
│ └──────┬───────┘ │
│ │ any safety layer fails │ all safety layers pass
│ ▼ │ for T_promote seconds
│ ┌──────────────┐ │ (hysteresis: 2s default)
│ │ FALLBACK │──────────────────┘
│ │ DRIVING │
│ └──────┬───────┘
│ │ fallback stack also uncertain
│ ▼
│ ┌──────────────┐
│ │ REDUCED │──→ (recovery) ──→ FALLBACK_DRIVING
│ │ CAPABILITY │
│ │ (slow, wide │
│ │ margins) │
│ └──────┬───────┘
│ │ critical failure
│ ▼
│ ┌──────────────┐
│ │ CONTROLLED │──→ (operator connects) ──→ TELEOPERATION
│ │ STOP │
│ └──────────────┘
│
│ critical failure (any state)
└──────────────────────────→ CONTROLLED_STOP
Key design decisions:
- Both stacks always run hot. The current stack never cold-starts. It continuously processes sensor data and computes trajectories, even when the new stack is driving. Handoff latency is one control cycle (20ms).
- Hysteresis on promotion: The new stack must pass all safety checks for T_promote consecutive seconds (default: 2s) before being promoted back from FALLBACK to NEW_STACK_DRIVING. This prevents rapid oscillation; see the sketch after this list.
- Mid-trajectory rejection: If the safety monitor rejects a new-stack trajectory mid-execution, the arbitrator immediately switches to the current stack's latest trajectory. The current stack's trajectory is always available because it runs continuously.
- All transitions are logged with full sensor snapshots, world model predictions, safety monitor scores, and the arbitration decision. This creates a rich dataset for debugging and improvement.
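A minimal sketch of the promotion hysteresis, with state names matching the diagram; the timing interface is illustrative.

```python
# Sketch: demote immediately on any safety failure, promote only after a
# sustained clean window (hysteresis).
import time

class Arbitrator:
    T_PROMOTE_S = 2.0   # new stack must stay clean this long before re-promotion

    def __init__(self) -> None:
        self.state = "FALLBACK_DRIVING"
        self._clean_since = None

    def step(self, all_safety_layers_pass: bool, now: float = None) -> str:
        now = time.monotonic() if now is None else now
        if not all_safety_layers_pass:
            # Any failing layer demotes immediately and resets the clean timer.
            self.state, self._clean_since = "FALLBACK_DRIVING", None
        elif self.state == "FALLBACK_DRIVING":
            self._clean_since = now if self._clean_since is None else self._clean_since
            if now - self._clean_since >= self.T_PROMOTE_S:
                self.state = "NEW_STACK_DRIVING"
        return self.state   # whichever stack is selected, both keep running hot
```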
3.13 Data Volume Estimates
Current state (estimated):
| Data Source | Estimated Volume | Notes |
|---|---|---|
| Bag files (scattered) | 2-10 TB | Based on ~60GB visible in workspace, likely more on vehicle/server storage |
| PCD maps | ~500MB | 2 maps in workspace |
| Scenarios (YAML) | <1MB | Kinematic sim test scenarios |
Target dataset sizes:
| Phase | Target | Source | Storage |
|---|---|---|---|
| Phase 0 | 50-200 hours extracted scenes | Existing bags, organized and indexed | 1-5 TB (zarr/lance format) |
| Phase 1 | 500+ hours with auto-labels | Shadow mode collection + auto-labeling | 10-20 TB |
| Phase 2 | 2,000+ hours (real + synthetic) | Fleet collection + Cosmos/3DGS generation | 50-100 TB |
| Phase 3 | 5,000+ hours | Multi-airport fleet + scenario generation | 100-200 TB |
Per-vehicle data rate: ~1-2 TB/hour raw (4-8 LiDAR @ 10Hz + future cameras). Extracted scenes (downsampled, compressed) are ~50-100 GB/hour.
4. World Model Training Pipeline
4.1 Pre-training (Transfer from Road Driving)
Phase 1: Foundation (no airside data needed)
├── Download nuScenes + Waymo Open Dataset occupancy labels
├── Pre-train BEV encoder on road LiDAR data
├── Pre-train occupancy prediction transformer on road scenes
├── Pre-train VQ-VAE tokenizer on road BEV features
└── Result: World model that understands general driving dynamics
Phase 2: Domain Adaptation (your unorganized data)
├── Run data engine Stage 1-3 on your bag files
├── Self-supervised pre-training: mask future LiDAR frames, predict
├── No labels needed — the world model learns airside dynamics
│ from reconstruction loss alone
├── Fine-tune BEV encoder on airside LiDAR distribution
└── Result: World model adapted to airport geometry and dynamics
Phase 3: Supervised Fine-tuning (labeled airside data)
├── Auto-label with foundation models (Stage 3 of data engine)
├── Human review of auto-labels for quality-critical scenarios
├── Train occupancy prediction with ground truth occupancy
├── Train detection heads with labeled objects
└── Result: World model with airside-specific object understanding
Phase 4: RL Post-training (simulation)
├── SafeDreamer-style constrained RL in world model imagination
├── Reward: goal completion + smoothness
├── Constraint: zero collisions, RSS compliance, zone respect
├── Use adversarial scenario generation for robustness
└── Result: Planning policy optimized for airside safety
4.2 Continual Learning (Fleet Operation)
Shadow Mode Operation
├── New stack runs, current stack drives
├── Log: world model predictions, new stack decisions
├── Measure: prediction error (predicted occupancy vs. observed)
├── Mine: high-error scenes → priority training data
├── Weekly: retrain world model on accumulated data
├── Monthly: evaluate new model in simulation → promote or reject
└── Gradually: transfer control from current → new stack
based on confidence metrics and safety record
5. Key Model Choices and Alternatives
5.1 World Model Architecture Decision
| Option | Architecture | Params | Pros | Cons | Recommendation |
|---|---|---|---|---|---|
| OccWorld | VQ-VAE + GPT transformer (ECCV 2024) | 50-200M | Proven on nuScenes, open-source, class-agnostic | Autoregressive = sequential, slower for long horizons | Start here |
| Drive-OccWorld | Action-conditioned occupancy | 100-300M | Action-conditioned (what the planner needs), 33% better than UniAD | AAAI 2025, less mature | Upgrade to |
| Copilot4D | VQ-VAE + discrete diffusion on LiDAR (ICLR 2024) | ~52M published (39M WM + 13M tokenizer) | Native LiDAR (perfect for you), Waabi-backed | Point cloud output (not occupancy), no open code | Evaluate |
| OccSora | Diffusion on 4D occupancy | 300-500M | Longest horizon (16s), highest fidelity | Diffusion = slow inference, needs strong GPU | Long-term |
| Mamba-based | SSM for temporal modeling | 50-200M | O(n) complexity, longer horizons feasible | Newer, less proven for driving | Research track |
Recommendation: Start with OccWorld (open-source, proven, fits Orin). Upgrade to Drive-OccWorld when action-conditioned prediction is needed for planning. Evaluate Mamba-based architectures for long-horizon efficiency.
5.2 Planning Head Decision
| Option | Type | Params | Pros | Cons | When |
|---|---|---|---|---|---|
| World-model-augmented Frenet | Hybrid | 0 (reuses existing) | Zero-risk, drop-in upgrade, keeps proven planner | Limited by Frenet candidate space | Phase 1 |
| DiffusionDrive | Diffusion policy (CVPR 2025 Highlight) | ~50-100M published (ResNet-34 backbone) | Multi-modal trajectories, real-time truncated diffusion | Needs training on airside data | Phase 2 |
| TD-MPC2 in latent space | MPC + learned value | 300M | Latent MPC, proven across 104 tasks, single hyperparameter set | Implicit world model may conflict with explicit one | Phase 2 alt |
| Alpamayo VLA | Vision-Language-Action | 10.5B | Explainable, text-guided, NVIDIA ecosystem | Needs cameras, heavy compute, fine-tuning complex | Phase 3 |
5.3 Perception Backbone Decision
| Option | Modality | Params | Pros | Cons | When |
|---|---|---|---|---|---|
| PointPillars | LiDAR | 5M | Fast, proven, good baseline | Limited semantic understanding | Phase 1 |
| CenterPoint | LiDAR | 10M | Better 3D detection, center-based | Slightly heavier | Phase 1 alt |
| BEVFusion | LiDAR + Camera | 50-100M | Multi-modal, SOTA fusion | Needs cameras | Phase 2 |
| DINOv2 + LSS | Camera-primary | 300M | Rich semantic features, foundation model | Needs cameras | Phase 2 |
| UniPAD | LiDAR + Camera | 200M | Unified pre-training, CVPR 2024 | Complex training pipeline | Phase 3 |
6. Airside-Specific Innovations
6.1 Jet Blast Hazard Prediction
No road AV system needs this. Your world model should predict jet blast zones as hazard occupancy:
Input: Aircraft type (from ADS-B) + Engine status (from ATC/A-CDM)
→ Lookup: jet blast envelope (CFD-derived lookup table per aircraft type)
→ Render: hazard occupancy extending behind aircraft engines
→ World model: learns to predict jet blast evolution over time
→ Planner: treats hazard occupancy same as physical obstacles
6.2 FOD Detection via Occupancy Anomaly
FOD (Foreign Object Debris) is a critical airside concern. Traditional approaches need purpose-built detectors. Occupancy world models offer a natural solution:
Ground plane occupancy at t-1: empty
Ground plane occupancy at t: occupied (small region, low height)
→ Anomaly: unexpected ground-level occupancy
→ Flag as potential FOD
→ If persistent across multiple frames: confirmed FOD
→ Alert operator + avoid zone
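A minimal sketch of this persistence check, assuming per-frame ground-level occupancy slices and a known-static baseline; the thresholds and the whole-mask area check (a stand-in for per-blob filtering) are illustrative assumptions.

```python
# Sketch: flag small, persistent, unexpected ground-level occupancy as FOD.
import numpy as np
from collections import deque

class FODDetector:
    def __init__(self, persist_frames=5, max_area_px=25):
        self.history = deque(maxlen=persist_frames)
        self.max_area_px = max_area_px          # FOD is small: a few 0.2m cells

    def update(self, ground_occ: np.ndarray, static_map: np.ndarray) -> np.ndarray:
        """ground_occ: (H, W) current ground-level occupancy probabilities;
        static_map: known-static baseline. Returns a confirmed-FOD mask."""
        anomaly = (ground_occ > 0.5) & (static_map < 0.1)   # unexpected occupancy
        self.history.append(anomaly)
        if len(self.history) < self.history.maxlen:
            return np.zeros_like(anomaly)       # not yet persistent across frames
        persistent = np.logical_and.reduce(list(self.history))
        if persistent.sum() > self.max_area_px:
            return np.zeros_like(persistent)    # too large for FOD; handled by tracking
        return persistent                       # alert operator + avoid this zone
```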
6.3 Turnaround Sequence Prediction
The world model can learn the temporal structure of aircraft turnarounds:
Turnaround phases: Arrival → Chocks On → Doors Open → Unloading → Loading → Doors Close → Pushback
World model + flight schedule context:
→ Predicts which phase each gate is in
→ Predicts when pushback will happen (± 2-5 min)
→ Planner routes GSE to arrive just-in-time
→ Reduces apron congestion and idle time
6.4 Ground Control Instruction Following (VLA Phase)
When Alpamayo or equivalent VLA is integrated:
Digital taxi instruction: "Tractor 7, proceed via Alpha to Stand B12, hold short of Bravo"
→ VLA parses instruction into route constraint
→ Plans trajectory: follow taxiway Alpha → hold at Bravo intersection
→ Monitors: ATC clearance to cross Bravo
→ Explains: "Holding short of Bravo as instructed. Awaiting clearance."
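In the VLA phase the structured route constraint would come from the model itself; the regex below is only a stand-in to show the target schema, and both the `RouteConstraint` fields and the grammar are assumptions.

```python
# Stand-in parser illustrating the structured constraint a VLA would emit
# from a digital taxi instruction.
import re
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class RouteConstraint:
    callsign: str
    via_taxiways: List[str]        # e.g. ["Alpha"]
    destination: str               # e.g. "Stand B12"
    hold_short_of: Optional[str]   # e.g. "Bravo"; crossing cleared separately by ATC

PATTERN = re.compile(
    r"(?P<callsign>[\w ]+?), proceed via (?P<via>\w+) to (?P<dest>[\w ]+?)"
    r"(?:, hold short of (?P<hold>\w+))?$")

def parse_instruction(text: str) -> Optional[RouteConstraint]:
    m = PATTERN.match(text)
    if m is None:
        return None                # unparseable: request readback / defer to operator
    return RouteConstraint(m["callsign"], [m["via"]], m["dest"], m["hold"])

# parse_instruction("Tractor 7, proceed via Alpha to Stand B12, hold short of Bravo")
# -> RouteConstraint('Tractor 7', ['Alpha'], 'Stand B12', 'Bravo')
```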
7. Research Gaps and Open Questions
7.1 Unsolved Problems
| Gap | Impact | Mitigation Strategy |
|---|---|---|
| No public airside driving dataset | Can't benchmark against others | Your data becomes the first — publish a subset? |
| Occupancy world models not yet deployed on any real vehicle | You'd be first | Start with world-model-augmented Frenet (low risk) |
| Airside regulatory framework for ML-based AV is immature | Certification uncertainty | Build safety case with AMLAS + UL 4600, engage FAA early |
| LiDAR-only world models are less studied than camera-based | Fewer pre-trained models available | Copilot4D and AD-L-JEPA are LiDAR-native, OccWorld works from LiDAR BEV |
| Jet blast prediction from learned models is unexplored | Novel research needed | Start with CFD lookup table, learn residuals |
| Multi-agent coordination between autonomous GSE | No airside-specific solutions | Start with single vehicle, add fleet coordination later |
7.2 Research Directions Worth Tracking
- Dreamer V4 — Block-causal transformer world model, learns from unlabeled video. Could learn airside dynamics from your bags without labels.
- V-JEPA 2-AC — ~15x faster planning than video generation baselines (16s vs. ~4min per action). If it scales to driving, it's a game-changer for real-time world model planning.
- WorldRFT — RL fine-tuning of world models (AAAI 2026, 83% collision reduction). Direct path to safe learned planning.
- Waymo World Model (Genie 3) — Generates camera AND LiDAR data. When open-sourced, could bootstrap your neural sim.
- ISO/PAS 8800 — Just published Dec 2024. First standard addressing AI safety lifecycle for AV. Foundation for your safety case.
8. Phased Roadmap
Phase 0: Foundation (Months 0-3)
- [ ] Data engine Stages 1-3: index, extract, auto-label existing bags
- [ ] Pre-train BEV encoder on nuScenes/Waymo LiDAR
- [ ] Pre-train OccWorld on nuScenes occupancy
- [ ] Set up training infrastructure (GPU server or cloud)
- [ ] Define sensor abstraction layer interfaces
- Deliverable: Organized airside dataset + pre-trained world model baseline
Phase 1: Airside World Model (Months 3-9)
- [ ] Fine-tune OccWorld on airside LiDAR data (self-supervised)
- [ ] Build world-model-augmented Frenet planner
- [ ] Implement safety monitor (OOD detection + RSS envelope)
- [ ] Integrate Simplex arbitration with current stack
- [ ] Shadow mode deployment on one vehicle
- Deliverable: World model running in shadow mode, predicting airside scenes
Phase 2: Perception + Simulation Upgrade (Months 6-15)
- [ ] Add cameras to vehicle hardware
- [ ] Integrate DINOv2 + BEVFusion perception backbone
- [ ] Open-vocabulary detection for all airside object types
- [ ] Build 3DGS digital twin of primary airport
- [ ] Level 2 neural sim (closed-loop in reconstructed airport)
- [ ] Upgrade to Drive-OccWorld (action-conditioned)
- Deliverable: Multi-modal perception + rich simulation environment
Phase 3: Learned Planning + VLA (Months 12-24)
- [ ] Train DiffusionDrive or TD-MPC2 planning head on airside data
- [ ] Integrate Alpamayo VLA (language reasoning + explainability)
- [ ] Airport integration layer (ADS-B, A-CDM, NOTAM)
- [ ] Jet blast hazard prediction system
- [ ] Ground control instruction following (VLA)
- [ ] RL post-training with SafeDreamer constraints
- Deliverable: End-to-end learned planning with explainability
Phase 4: Fleet + Generalization (Months 18-30+)
- [ ] Multi-vehicle fleet coordination via shared world model
- [ ] Few-shot adaptation to new airports
- [ ] Continual learning data engine from fleet
- [ ] Safety case for certification (AMLAS + UL 4600)
- [ ] Turnaround sequence prediction and optimization
- Deliverable: Multi-airport, multi-vehicle autonomous airside operations
9. Success Criteria
| Metric | Phase 1 Target | Phase 3 Target | Phase 4 Target |
|---|---|---|---|
| World model occupancy prediction IoU (higher is better) | >0.5 @ 2s | >0.7 @ 4s | >0.8 @ 4s |
| Shadow mode agreement with current stack | >80% | >95% | >99% |
| Novel object detection (unseen GSE types) | N/A (LiDAR only) | >70% recall | >90% recall |
| Collision rate (simulation) | <1% | <0.1% | <0.01% |
| New airport adaptation time | N/A | <1 week | <1 day |
| Decision explainability | None | VLA explanations | Logged for every decision |
| Adverse weather operation | Baseline (rain filter) | Camera+radar fusion | World model robustness validated |
This design specification synthesizes findings from 20 deep-dive research reports (13,210 lines of analysis across 200+ papers) with the concrete reality of the reference airside ADS ROS Noetic stack — 22 packages, LiDAR-centric perception, GTSAM localization, Frenet planning, multi-platform vehicle support. The architecture is designed to preserve every strength of the current stack while progressively unlocking capabilities that only learned world models can provide.