World Models for Autonomous Vehicles: A Comprehensive Technical Report
Date: March 2025
First-principles companion pages: World Models, JEPA Latent Predictive Learning, and Attention and Transformers. Focused additions: Self-Supervised Occupancy Flow and UniScene Occupancy-Centric Generation.
Table of Contents
- What Are World Models in the AV Context?
- Key Architectures and Papers (2023--2025)
- How World Models Improve AV Stacks
- Domain Transfer: Airside Reference ODD
- State of the Art as of Early 2025 (Deployable vs. Research-Only)
1. What Are World Models in the AV Context?
1.1 Core Concept
A world model is a learned internal representation of the environment that enables an autonomous system to predict how the world will evolve over time in response to actions. Rather than merely perceiving the current state and mapping it to control outputs, a world model allows the system to imagine future states -- simulating what will happen before committing to a plan.
The concept originates from Ha and Schmidhuber's 2018 "World Models" paper and LeCun's articulation of the Joint Embedding Predictive Architecture (JEPA), but it has been dramatically accelerated in the AV domain by the convergence of large-scale generative models (diffusion models, autoregressive transformers, and video foundation models) with driving-specific datasets.
In the autonomous driving context, a world model is formally a function:
s_{t+1} = f(s_t, a_t, c_t)

where s_t is the current world state (which may be represented as video frames, point clouds, occupancy grids, or learned latent vectors), a_t is the ego-vehicle action, and c_t is contextual conditioning (e.g., map data, text prompts, weather, other agents' behaviors).
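To make the interface concrete, here is a minimal sketch of this transition function as a learned module. The `WorldModel` class, its dimensions, and the MLP dynamics are illustrative assumptions, not the architecture of any system surveyed below.

```python
import torch
import torch.nn as nn

class WorldModel(nn.Module):
    """Illustrative latent world model: s_{t+1} = f(s_t, a_t, c_t)."""

    def __init__(self, state_dim: int, action_dim: int,
                 context_dim: int, hidden_dim: int = 512):
        super().__init__()
        # A simple MLP stands in for whatever dynamics backbone a real
        # system would use (transformer, diffusion model, SSM, ...).
        self.dynamics = nn.Sequential(
            nn.Linear(state_dim + action_dim + context_dim, hidden_dim),
            nn.GELU(),
            nn.Linear(hidden_dim, state_dim),
        )

    def forward(self, s_t: torch.Tensor, a_t: torch.Tensor,
                c_t: torch.Tensor) -> torch.Tensor:
        # Predict the next latent state from state, ego action, and conditioning.
        return self.dynamics(torch.cat([s_t, a_t, c_t], dim=-1))
```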
1.2 How They Differ from Traditional Perception-Planning Pipelines
Traditional AV stacks follow a modular pipeline:
Sensors --> Perception --> Prediction --> Planning --> Control

Each module is designed, trained, and evaluated independently. Perception detects objects and lane markings; prediction forecasts their trajectories; planning generates a safe ego-trajectory; control executes it. This design is interpretable and allows component-level testing, but it suffers from several fundamental limitations:
| Limitation | Traditional Pipeline | World Model Approach |
|---|---|---|
| Error propagation | Errors cascade through modules (missed detection kills planning) | End-to-end learning can absorb imperfections |
| Information bottleneck | Intermediate representations discard information (e.g., bounding boxes lose shape/texture) | Richer latent states preserve more scene context |
| Closed-world assumption | Relies on pre-defined object taxonomies | Learns general scene dynamics, can handle novel objects |
| Planning horizon | Typically short-horizon reactive planning | Can "imagine" long-horizon futures (seconds to minutes) |
| Data efficiency | Requires massive labeled datasets per module | Self-supervised pre-training from raw video |
| Rare event handling | Limited to observed training data | Can generate and train on counterfactual scenarios |
1.3 The "Imagination" Paradigm
The key conceptual shift is from reactive to imaginative autonomy. A world model enables:
Mental simulation: Before executing an action, the system rolls out multiple candidate futures in its learned latent space, evaluating which action sequence leads to the best outcome. This is analogous to model-based reinforcement learning (MBRL) and model predictive control (MPC) but in a learned representation rather than a hand-crafted physics simulator.
Counterfactual reasoning: "What would happen if I accelerated here?" or "What if that pedestrian steps off the curb?" These questions can be answered by conditioning the world model on hypothetical actions and observing the predicted outcomes.
Self-supervised learning at scale: Because predicting the future is an inherently self-supervised task (the ground truth is simply what happens next), world models can be trained on vast quantities of unlabeled driving data, following the same scaling paradigm that powered large language models.
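The training signal this paragraph describes can be written down in a few lines. The sketch below assumes a hypothetical `encoder` (mapping frames to latents) alongside the `WorldModel` sketched in Section 1.1; the target for time t is simply the encoded observation at t+1, so no human labels enter the loss.

```python
import torch.nn.functional as F

def training_step(encoder, world_model, frames, actions, context, optimizer):
    # frames: (B, T, ...) raw video; actions/context: (B, T, dims).
    states = encoder(frames)                          # (B, T, state_dim)
    pred_next = world_model(states[:, :-1],           # predict t+1 from t
                            actions[:, :-1],
                            context[:, :-1])
    target = states[:, 1:].detach()                   # "ground truth" = what happens next
    loss = F.mse_loss(pred_next, target)              # self-supervised: no labels
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```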
2. Key Architectures and Papers (2023--2025)
2.1 GAIA-1 and GAIA-2 (Wayve, 2023--2025)
GAIA-1 (arXiv:2309.17080)
Architecture: GAIA-1 is a 9.1-billion-parameter generative world model composed of three components (the token-prediction core is sketched in code after the list):
- Multimodal encoders for video, text, and action inputs, projecting them into a shared representation space with temporal alignment
- Autoregressive transformer (6.5B parameters) that predicts the next set of image tokens in a sequence, using vector-quantized (VQ) representations to cast future prediction as next-token prediction
- Video diffusion decoder (2.6B parameters) that converts predicted tokens back to pixel space
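As a hedged illustration of the second component, the sketch below casts future prediction as causal next-token prediction over vector-quantized tokens, mirroring the description above. Hyperparameters, module names, and the training comment are assumptions, not Wayve's implementation.

```python
import torch
import torch.nn as nn

class TokenWorldModel(nn.Module):
    """Illustrative autoregressive world model over VQ image tokens."""

    def __init__(self, vocab_size: int = 8192, d_model: int = 1024,
                 n_layers: int = 12):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # tokens: (B, L) interleaved video/text/action token ids.
        x = self.embed(tokens)
        causal = nn.Transformer.generate_square_subsequent_mask(
            tokens.size(1)).to(tokens.device)
        return self.head(self.transformer(x, mask=causal))

# Training then reduces to LLM-style cross-entropy on shifted sequences:
#   loss = F.cross_entropy(logits[:, :-1].flatten(0, 1), tokens[:, 1:].flatten())
```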
Training: 15 days on 64 NVIDIA A100 GPUs for the world model, 15 days on 32 A100s for the video decoder. Trained on 4,700 hours of proprietary urban driving data collected in London (2019--2023).
Key innovations:
- Frames world modeling as unsupervised sequence modeling analogous to LLM pre-training
- Demonstrates emergent understanding of driving concepts (road geometry, traffic rules, agent interactions) without explicit labels
- Supports multimodal conditioning: text prompts (e.g., "Going around a stopped bus"), action sequences (steering, speed), and video context
- Can generate diverse plausible futures from the same starting context
- Exhibits LLM-like scaling laws -- performance improves predictably with more data and compute
Limitations: Autoregressive frame-by-frame generation is computationally expensive; single-camera output only.
GAIA-2 (arXiv:2503.20523, March 2025)
Architecture: Fundamentally redesigned with two components:
- Video tokenizer that compresses raw pixel-space videos into a compact continuous latent space (not discrete tokens as in GAIA-1)
- Latent diffusion world model that operates on compressed representations to predict future states
Key advances over GAIA-1 (the diffusion objective is sketched after this list):
- Latent diffusion replaces autoregressive generation: Encodes entire sequences into continuous latent space, eliminating temporal discontinuities and improving motion smoothness
- Native multi-camera generation: Spatiotemporally consistent video across multiple viewpoints simultaneously
- Geographic diversity: Trained across UK, US, and Germany with region-specific adaptation
- Rich structured conditioning: Ego-vehicle dynamics (speed, steering curvature), 3D bounding boxes for dynamic agents, weather/time-of-day, road semantics (lanes, speed limits, crossings, intersections), and external embeddings (CLIP + proprietary driving model embeddings)
- Multiple generation modes: Forecast future from existing video, create entirely new scenes, modify existing sequences, generate action-conditioned scenarios
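The following sketch illustrates the latent-diffusion objective this design implies: noise the tokenizer's continuous latents and train a conditional denoiser to recover the noise. The `denoiser` callable, the linear noise schedule, and all names are illustrative assumptions; GAIA-2's actual parameterization is not public as code.

```python
import torch
import torch.nn.functional as F

def diffusion_step(denoiser, latents, conditioning, optimizer, n_timesteps=1000):
    # latents: continuous video-tokenizer outputs, (B, ...).
    b = latents.size(0)
    t = torch.randint(0, n_timesteps, (b,), device=latents.device)
    noise = torch.randn_like(latents)
    # Simple linear alpha schedule, for illustration only.
    alpha = (1.0 - t.float() / n_timesteps).view(b, *([1] * (latents.dim() - 1)))
    noisy = alpha.sqrt() * latents + (1 - alpha).sqrt() * noise
    # conditioning bundles ego dynamics, 3D boxes, weather, embeddings, etc.
    pred = denoiser(noisy, t, conditioning)
    loss = F.mse_loss(pred, noise)          # standard noise-prediction objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```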
2.2 UniSim (Waabi / University of Toronto, CVPR 2023 Highlight)
Paper: arXiv:2308.01898
Architecture: Neural rendering-based closed-loop sensor simulator that separates scene reconstruction into:
- Static background reconstruction using neural feature grids
- Dynamic actor reconstruction with learnable priors for dynamic objects
- Compositing network that merges static and dynamic elements
- Convolutional completion network for inpainting regions not visible in original recordings
Approach: Converts a single recorded sensor log into a fully interactive, closed-loop multi-sensor simulation. Given a real-world driving log, UniSim can:
- Re-render the scene from novel viewpoints
- Add, remove, or reposition actors
- Modify ego and actor trajectories arbitrarily
- Simulate both LiDAR and camera data simultaneously
Key contribution: Demonstrates that the sim-to-real gap can be small enough for meaningful closed-loop safety evaluation -- testing AV stacks on safety-critical scenarios "as if in the real world" without ever leaving simulation. This is particularly important for scenarios too dangerous to test on public roads.
2.3 MILE -- Model-Based Imitation Learning (arXiv:2210.07729)
Architecture: Learns a compact latent world model jointly with a driving policy from expert demonstrations, without any online environment interaction.
Key innovations:
- Uses 3D geometry as an inductive bias to ground latent representations
- First camera-only method to simultaneously represent static scene, dynamic scene, and ego-behavior in a unified world model
- Learned representations are interpretably decodable to bird's-eye-view (BEV) semantic segmentation
- Plans entirely "in imagination" -- executes complex driving maneuvers from trajectories predicted within the world model's latent space
Results: 31% improvement in driving score on CARLA over prior state-of-the-art when deployed in previously unseen towns and weather conditions. Open-source code and weights available.
Significance: MILE demonstrated that offline learning from demonstrations, combined with a world model for imaginative planning, could match or exceed methods requiring expensive online interaction.
2.4 DriveDreamer and DriveDreamer-2
DriveDreamer (arXiv:2309.09777)
Approach: The first world model built entirely from real-world driving data (prior work primarily used gaming/simulated environments). Uses diffusion models to represent complex driving environments.
Two-stage training:
- Learns structured traffic constraints from real-world scenarios
- Develops predictive capabilities for future state anticipation
Capabilities: Generates controllable driving videos that faithfully capture traffic scenario constraints. Evaluated on the nuScenes benchmark.
DriveDreamer-2 (arXiv:2403.06845)
Key advance: Integrates Large Language Models (LLMs) into the world model pipeline, becoming the first world model to generate customized driving videos from natural language descriptions.
Pipeline:
- LLM converts user text queries into agent trajectories
- HDMap generation module ensures trajectories adhere to traffic rules
- Unified multi-view video synthesis model generates spatiotemporally coherent multi-camera driving videos
Results: FID score of 11.2 (30% improvement), FVD score of 55.7 (50% improvement) over prior SOTA. Generated data demonstrably improves downstream 3D object detection and tracking.
2.5 ADriver-I (arXiv:2311.13549)
Architecture: Combines multimodal large language models (MLLMs) with diffusion models in a unified autonomous driving framework.
Key innovation -- Interleaved vision-action pairs: Unifies visual features and control signals into a single data format, enabling a single model to handle perception, prediction, and control.
Operational loop (autoregressive):
- Accept current vision-action pair
- Predict control signals for current frame
- Use generated control signals + history to predict future frames
- Repeat indefinitely
Significance: Demonstrates that the modular perception-prediction-planning pipeline can be collapsed into a single autoregressive loop, reducing redundancy inherent in traditional modular designs.
2.6 GenAD -- Generative End-to-End Autonomous Driving (CVPR 2024 Highlight)
Paper: arXiv:2403.09630
Architecture: The first large-scale video prediction model specifically for autonomous driving, using latent diffusion models with novel temporal reasoning blocks.
Training data: Over 2,000 hours of web-sourced driving videos with diverse geographic coverage, weather, and traffic conditions paired with text descriptions.
Key capabilities:
- Zero-shot generalization to unseen driving datasets, surpassing both general and driving-specific video prediction models
- Adaptable as either an action-conditioned prediction model or a motion planner
- Associated dataset available through OpenDriveLab/DriveAGI on GitHub
2.7 Copilot4D (arXiv:2311.01017)
Architecture: Two-stage pipeline for 4D (3D spatial + temporal) point cloud forecasting (parallel decoding is sketched after the list):
- VQVAE tokenizer converts raw LiDAR observations into discrete tokens
- Discrete diffusion model (adapted from Masked Generative Image Transformer) predicts future tokens with parallel decoding
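A hedged sketch of the parallel-decoding idea (MaskGIT-style iterative unmasking, which Copilot4D adapts for discrete diffusion over LiDAR tokens) follows; `model`, `mask_id`, and the cosine schedule are illustrative stand-ins rather than the paper's exact procedure.

```python
import math
import torch

@torch.no_grad()
def parallel_decode(model, tokens, mask_id, n_steps=8):
    # tokens: (B, L) future-frame token slots, all initialized to `mask_id`.
    for step in range(n_steps):
        logits = model(tokens)                          # (B, L, vocab)
        probs, preds = logits.softmax(-1).max(-1)       # per-slot confidence, argmax
        still_masked = tokens.eq(mask_id)
        # Cosine schedule: commit progressively more tokens per step.
        frac = 1.0 - math.cos((step + 1) / n_steps * math.pi / 2)
        k = max(1, int(frac * tokens.size(1)))
        # Rank only still-masked slots; never overwrite committed tokens.
        conf = torch.where(still_masked, probs, torch.full_like(probs, -1.0))
        preds = torch.where(still_masked, preds, tokens)
        commit = conf.topk(k, dim=-1).indices           # most confident masked slots
        tokens.scatter_(1, commit, preds.gather(1, commit))
    return tokens
```

Decoding all masked slots in a handful of passes, rather than one token at a time, is what makes this tractable for dense LiDAR token grids.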
Results:
- 65% reduction in Chamfer distance over prior SOTA at 1-second prediction
- 50% reduction at 3-second prediction
- Evaluated on nuScenes, KITTI Odometry, and Argoverse 2
Significance: Demonstrates that "discrete diffusion on tokenized agent experience can unlock the power of GPT-like unsupervised learning for robotics" -- establishing point cloud world modeling as a viable paradigm parallel to video-based approaches.
2.8 OccWorld (arXiv:2311.16038)
Architecture: A world model operating in 3D occupancy space rather than image or point cloud space.
- Scene tokenizer based on reconstruction: converts 3D occupancy grids into discrete scene tokens
- GPT-like spatial-temporal generative transformer predicts future scene tokens and ego-motion tokens simultaneously
Rationale for occupancy representation:
- More fine-grained 3D scene structure than bounding boxes
- More economical to obtain (e.g., from sparse LiDAR)
- Adapts to both vision-only and LiDAR setups
Results: Competitive planning results on nuScenes without requiring instance-level or map supervision. Code available on GitHub.
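For illustration, a scene tokenizer in the spirit of the architecture above might quantize an encoded occupancy grid against a learned codebook, yielding discrete tokens for a GPT-like forecaster. The sketch below is an assumption-laden simplification, not OccWorld's released code.

```python
import torch
import torch.nn as nn

class OccTokenizer(nn.Module):
    """Illustrative VQ tokenizer over 3D occupancy grids."""

    def __init__(self, n_classes: int = 18, codebook_size: int = 512, d: int = 128):
        super().__init__()
        self.enc = nn.Conv3d(n_classes, d, kernel_size=4, stride=4)  # downsample grid
        self.codebook = nn.Embedding(codebook_size, d)

    def forward(self, occ_onehot: torch.Tensor) -> torch.Tensor:
        # occ_onehot: (B, C, X, Y, Z) one-hot semantic occupancy.
        z = self.enc(occ_onehot)                  # (B, d, x, y, z)
        z = z.flatten(2).transpose(1, 2)          # (B, N, d) patch latents
        # Nearest-codebook lookup -> discrete scene tokens.
        book = self.codebook.weight[None].expand(z.size(0), -1, -1)
        return torch.cdist(z, book).argmin(-1)    # (B, N) token ids
```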
2.9 Vista (arXiv:2405.17398, OpenDriveLab)
Architecture: A generalizable driving world model addressing three limitations of prior work: poor generalization, low fidelity, and limited controllability.
Key innovations:
- Novel loss functions targeting moving instances and structural information for high-resolution real-world dynamics
- Latent replacement approach for injecting historical frames as priors, enabling temporally consistent long-horizon predictions
- Multi-level controllability: high-level (commands, goal points) and low-level (trajectories, angles, speeds) conditioning through a unified efficient learning strategy
Results:
- Outperforms general video generators in >70% of comparisons
- 55% improvement in FID and 27% in FVD over best prior driving world model
- First model to generate generalizable reward signals for action evaluation without ground-truth references
2.10 MUVO -- Multimodal World Model with Geometric Voxel Representations
Paper: arXiv:2311.11762 (accepted IV 2025)
Innovation: Addresses the fact that most driving world models rely on camera-only input by exploring sensor fusion strategies that combine camera and LiDAR.
Approach: Instead of predicting raw sensor data, outputs 3D occupancy predictions -- argued to be more actionable for downstream driving tasks. Analyzes weaknesses of current sensor fusion approaches.
2.11 Think2Drive (arXiv:2402.16720)
Architecture: First model-based reinforcement learning method for autonomous driving:
- Latent world model learns environment state transitions in low-dimensional latent space
- RL-trained planner uses the world model as a neural simulator
Key results:
- First reported 100% route completion rate in CARLA v2
- Handles 39 common driving events
- Achieves expert-level performance within 3 days on a single A6000 GPU
- Introduced CornerCase-Repository benchmark for scenario-based evaluation
Significance: Demonstrates that model-based RL with a learned world model can handle corner cases far more flexibly than rule-based planners.
2.12 WorldDreamer (arXiv:2401.09985)
Approach: Frames world modeling as unsupervised visual sequence modeling via masked token prediction (inspired by BERT/MAE rather than GPT-style autoregression).
Capabilities: Captures dynamic elements across diverse environments including natural scenes and driving contexts. Supports text-to-video, image-to-video, and video editing tasks.
Innovation: Moves beyond domain-specific driving models to develop general world physics understanding applicable across varied environments.
2.13 DriveWorld (CVPR 2024, arXiv:2405.04390)
Architecture: 4D pre-training framework built around a Memory State-Space Model with two dynamics components plus a task-adaptation mechanism:
- Dynamic Memory Bank: Learns temporal-aware latent dynamics for predicting future changes
- Static Scene Propagation: Learns spatial-aware latent statics for comprehensive scene context
- Task Prompt mechanism: Decouples task-specific features for flexible downstream adaptation
Results (pre-trained on OpenScene, evaluated on nuScenes):
- 3D Object Detection: +7.5% mAP
- Online Mapping: +3.0% IoU
- Multi-Object Tracking: +5.0% AMOTA
- Motion Forecasting: minADE reduced by 0.1 m
- Occupancy Prediction: +3.0% IoU
- Planning: average L2 error reduced by 0.34 m
2.14 NVIDIA Cosmos World Foundation Models (arXiv:2501.03575)
Overview: An open-weight platform for physical AI development, positioning world foundation models as the counterpart to language foundation models but for the physical world.
Architecture: Includes both diffusion and autoregressive transformer models trained on 9,000 trillion tokens from 20 million hours of real-world data spanning human interactions, industrial, robotics, and driving scenarios. Model sizes range from 4B to 14B parameters.
Three tiers:
- Nano: Optimized for real-time, low-latency edge deployment
- Super: High-performance baseline models
- Ultra: Maximum quality for distilling custom models
AV-specific extensions:
- Cosmos-Drive-Dreams (arXiv:2506.09042): Specialized adaptation for generating controllable, high-fidelity, multi-view, spatiotemporally consistent driving videos. Demonstrated utility for 3D lane detection, 3D object detection, and driving policy learning. Addresses long-tail distribution problems.
- Cosmos-Predict2.5 / Transfer2.5 (arXiv:2511.00062): Next generation with improved video quality and instruction alignment.
Availability: Open-weight with permissive licenses via GitHub and Hugging Face; fine-tunable via NVIDIA NeMo.
2.15 Waymo -- SceneDiffuser Family (2024--2025)
SceneDiffuser (NeurIPS 2024): Efficient and controllable driving simulation initialization and rollout using diffusion models.
SceneDiffuser++ (CVPR 2025): City-scale traffic simulation via a generative world model. Enables large-scale scene generation for comprehensive testing.
2.16 Doe-1 (arXiv:2412.09627, December 2024)
Architecture: Closed-loop autonomous driving as next-token generation with multi-modal tokens:
- Perception tokens: Free-form text for scene descriptions
- Prediction tokens: Image tokens for future state generation in RGB space
- Planning tokens: Position-aware tokenized actions
Autoregressively generates perception, prediction, and planning in a unified transformer. Evaluated on nuScenes for VQA, action-conditioned video generation, and motion planning.
2.17 Other Notable Models (2024--2025)
| Model | Key Innovation |
|---|---|
| DrivingWorld (arXiv:2412.19505) | Video GPT generating >40-second driving clips with spatial-temporal fusion |
| DFIT-OccWorld (arXiv:2412.13772) | Efficient 3D occupancy world model with decoupled dynamic flow |
| DrivePhysica (arXiv:2412.08410) | Physics-informed world model with coordinate alignment and flow guidance |
| InfinityDrive (arXiv:2412.01522) | Minute-scale video generation (1500+ frames) with memory injection |
| HoloDrive (arXiv:2412.01407) | First joint 2D-3D generation (camera + LiDAR) via BEV transforms |
| DrivingDiffusion (arXiv:2310.07771) | Multi-view driving video generation controlled by 3D layout |
| GEM (arXiv:2412.11198) | Generalizable ego-vision multimodal model, 4000+ hours multimodal data |
| UniDWM (arXiv:2602.01536) | Unified driving world model with structure-aware latent representation |
| World4Drive (arXiv:2507.00603) | Intention-aware physical latent world model for end-to-end driving |
3. How World Models Improve AV Stacks
3.1 Data Augmentation and Synthetic Data Generation
World models address the fundamental data bottleneck in AV development: rare but safety-critical scenarios are by definition underrepresented in real-world driving logs, yet these are precisely the scenarios that matter most.
Mechanisms:
Controllable scenario generation: Models like GAIA-2 and DriveDreamer-2 accept structured conditioning inputs (ego-actions, agent placements, weather, road configurations) to generate targeted scenarios. DriveDreamer-2 further allows natural language specification of scenarios (e.g., "a vehicle suddenly cuts in from the left lane").
Long-tail augmentation: Cosmos-Drive-Dreams specifically targets the long-tail distribution problem, generating rare edge cases (near-collisions, unusual road geometries, adverse weather) to augment training data. This demonstrably improves downstream 3D detection and lane detection performance.
Multi-view consistency: GAIA-2, DriveDreamer-2, and HoloDrive generate spatiotemporally consistent multi-camera outputs, critical for training surround-view perception systems.
Multi-modal synthesis: HoloDrive and MUVO generate aligned camera + LiDAR data, enabling joint sensor training without additional real-world collection.
Quantitative impact: DriveDreamer-2 achieves FID=11.2 and FVD=55.7 (30--50% improvements), and generated data demonstrably improves 3D detection and tracking performance.
3.2 Sim-to-Real Transfer
The persistent challenge in simulation-based AV development is the sim-to-real gap: models trained in simulation often fail when deployed on real sensor data due to differences in visual fidelity, sensor noise, physics, and scenario diversity.
World models attack this gap from multiple angles:
Learned neural rendering (UniSim): By reconstructing scenes from real sensor data rather than from artist-authored 3D assets, UniSim achieves sensor simulation with a "small domain gap on downstream tasks." This enables testing AV systems on modified real-world scenarios without the traditional simulation fidelity penalty.
Real-data-grounded generation: Unlike traditional simulators (CARLA, SUMO) that synthesize from scratch, models like DriveDreamer are trained on real-world data (nuScenes, proprietary fleets), inheriting realistic visual statistics, sensor noise patterns, and behavioral distributions.
Foundation model transfer: NVIDIA Cosmos provides a general-purpose pre-trained world model that can be fine-tuned for specific domains, leveraging 20 million hours of diverse real-world video to bootstrap sim-to-real transfer for specialized applications.
3.3 Planning via Imagination (Model Predictive Control in Latent Space)
This is arguably the most transformative application of world models. Rather than planning in hand-crafted state spaces, the AV can plan in a learned latent space:
MILE's approach (sketched in code after this list):
- Encode current observation into latent state
- Propose candidate action sequences
- Roll out each sequence through the world model in latent space
- Decode predicted latent states to evaluate outcomes (e.g., BEV segmentation for collision checking)
- Select the action sequence with the best predicted outcome
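A minimal sketch of these five steps, assuming stand-in `encoder`, `world_model`, and `cost_fn` modules; none of these names correspond to MILE's actual interfaces.

```python
import torch

@torch.no_grad()
def plan_in_imagination(encoder, world_model, cost_fn, obs, candidates, context):
    # candidates: (K, H, action_dim) -- K candidate action sequences, horizon H.
    K, H, _ = candidates.shape
    state = encoder(obs).expand(K, -1)              # replicate latent per candidate
    total_cost = torch.zeros(K, device=state.device)
    for h in range(H):
        state = world_model(state, candidates[:, h], context)
        total_cost += cost_fn(state)                # e.g., decoded BEV collision cost
    return candidates[total_cost.argmin()]          # best-scoring action sequence
```

Because every rollout is a batched tensor operation, all K candidates are evaluated in parallel in a single forward pass per horizon step.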
Think2Drive's approach (MBRL; sketched after this list):
- Learn a latent world model of environment transitions
- Use the world model as a "neural simulator" for RL policy training
- Train the planner entirely within the world model, avoiding expensive real-world or high-fidelity simulator interaction
- Achieve 100% route completion in CARLA v2 in 3 days on a single GPU
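A compact sketch of this "neural simulator" training loop, in the spirit of Dreamer-style MBRL: the policy is updated on imagined rollouts only, never on real environment steps. `policy` and `reward_head` are hypothetical names, and Think2Drive's actual actor-critic machinery is more involved.

```python
def imagine_and_update(world_model, policy, reward_head, start_states, context,
                       optimizer, horizon=15, gamma=0.99):
    state, total_return = start_states, 0.0
    for h in range(horizon):
        action = policy(state)                          # differentiable action
        state = world_model(state, action, context)     # latent "neural simulator"
        total_return = total_return + (gamma ** h) * reward_head(state).mean()
    loss = -total_return                                # maximize imagined return
    optimizer.zero_grad()
    loss.backward()                                     # gradients flow through rollout
    optimizer.step()
    return float(total_return)
```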
Key advantages of latent-space planning:
- Speed: Latent rollouts are orders of magnitude faster than physics simulation
- Parallelism: Tensor-based computation enables massive parallel rollout of candidate plans
- Gradient-based optimization: Differentiable world models allow direct gradient computation through the planning horizon
- Implicit constraint handling: The world model naturally captures physical constraints (road geometry, vehicle dynamics) without explicit modeling
UniDWM and World4Drive (2025) further develop this paradigm by constructing structure- and dynamic-aware latent representations that unify perception, prediction, and planning in a single learned space, enabling intention-driven future state prediction.
3.4 Safety Validation and Scenario Generation
Safety validation is the most commercially pressing application. Proving AV safety requires testing against an astronomical number of scenarios, many of which are too dangerous or too rare to encounter naturally.
World model contributions:
Targeted corner case generation: GAIA-2 can synthesize near-collisions, emergency maneuvers, and out-of-distribution conditions by manipulating its structured conditioning inputs. Think2Drive introduced a CornerCase-Repository specifically for evaluating AV responses to challenging situations.
Closed-loop evaluation: UniSim enables testing the complete AV stack (perception through control) in a neural simulation loop, where the AV's actions affect the simulated environment, which in turn affects subsequent observations. This is fundamentally different from open-loop replay testing.
Counterfactual analysis: Given a real-world driving log, world models can answer "what if" questions -- what would have happened if the ego vehicle had braked later, or if an oncoming vehicle had not yielded?
Waymo's SceneDiffuser++: Enables city-scale traffic simulation, allowing systematic coverage of rare multi-agent interaction patterns across entire urban networks.
3.5 Reducing Reliance on HD Maps
HD maps are expensive to create and maintain, representing a major scalability bottleneck for AV deployment. World models contribute to map-free driving in several ways:
OccWorld produces competitive planning results without instance-level or map supervision, suggesting that world models can internalize road structure from raw sensor data.
GAIA-2 conditions on road semantics (lanes, speed limits, intersections) but these can be predicted by the model rather than supplied from pre-built maps.
Vista generates generalizable reward signals for action evaluation without ground-truth reference data, enabling planning that does not depend on pre-mapped environments.
DriveWorld's pre-training on multi-camera video learns implicit map representations (+3.0% IoU on online mapping), suggesting world models can replace offline HD maps with learned online map prediction.
4. Domain Transfer: Airside Reference ODD
Airside airport operations serve as a detailed transfer case for world models because the domain combines low-speed autonomy, managed routes, unusual actors, infrastructure data, and safety-case pressure. The same transfer logic should be repeated for road AVs, warehouses, logistics yards, ports, mines, construction sites, agriculture, delivery robots, and outdoor campuses before claiming deployment relevance.
4.1 Environment Characteristics
Airport airside (tarmac/apron) environments differ fundamentally from public roads:
| Factor | Public Roads | Airport Airside |
|---|---|---|
| Speed | 0--130 km/h | Typically 5--25 km/h (strict limits) |
| Traffic mix | Relatively homogeneous (cars, trucks, bikes) | Highly heterogeneous: aircraft, baggage tugs, belt loaders, catering trucks, fuel tankers, pushback tugs, passenger stairs, de-icing vehicles, follow-me cars, personnel on foot |
| Layout | Standardized lanes, intersections | Taxiways, apron stands, service roads; layout changes with gate assignments |
| Marking system | Standard road markings and signs | Aerodrome markings (taxiway centerlines, stand guidance, FOD zones) per ICAO Annex 14 |
| Right-of-way | Traffic laws | Aircraft always have priority; complex hierarchy among ground vehicles |
| Weather impact | Reduced visibility/traction | Same, plus jet blast, de-icing chemicals, and FOD (foreign object debris) |
| Communication | None (except V2X research) | Mandatory radio communication with ATC ground control |
4.2 Unique Challenges for World Models
4.2.1 Domain-Specific Object Recognition
Standard driving world models are trained on datasets (nuScenes, Waymo Open Dataset, Argoverse) containing cars, pedestrians, cyclists, and standard road infrastructure. Airport airside environments introduce object categories that are entirely absent from these datasets:
- Aircraft of varying sizes (A380 vs. regional jets) with complex geometries
- Ground Support Equipment (GSE): Pushback tugs, belt loaders, container loaders, catering high-lifts, fuel bowsers, lavatory service vehicles, ground power units
- Personnel in high-visibility vests, often in close proximity to operating equipment
- Jet bridges and their articulation states
- Towbars and tow connections between aircraft and tugs
Implication for world models: A world model for airside operations would need to be trained on airside-specific data. Given the limited availability of such data (no public large-scale airport driving datasets exist comparable to nuScenes), foundation model fine-tuning becomes the most viable path. NVIDIA Cosmos or similar pre-trained models could be fine-tuned on airport-specific video collected from instrumented ground vehicles.
4.2.2 Behavioral Dynamics
Airside vehicle behavior follows different rules than road traffic:
- No traffic lights or stop signs: Priority is managed by marshalling, follow-me vehicles, and ATC ground control
- Stand-entry procedures: Vehicles must follow precise docking procedures around parked aircraft
- Convoy behavior: Baggage trains and service vehicle convoys operate differently from road platoons
- Pedestrian behavior: Ground crew may walk unpredictably around aircraft, often in blind spots
- Aircraft push-back: Large, slow-moving aircraft pushed by tugs create unique occlusion and clearance situations
A world model for this domain must learn these specific interaction patterns, which differ qualitatively from road traffic.
4.2.3 Structured but Dynamic Environment
While airport layouts are structured (taxiway designations, painted stand markings), the operational configuration changes dynamically:
- Gate assignments change per flight schedule
- Temporary closures for maintenance
- Seasonal variations (de-icing areas active only in winter)
- Construction zones
This combination of structure and dynamic change is well-suited to world models that can adapt online, but it also means that world model conditioning on map data must accommodate temporal variability.
4.3 Regulatory Framework
ICAO (International Civil Aviation Organization)
- Annex 14 (Aerodromes): Defines aerodrome design, markings, and lighting standards. Any autonomous vehicle operating airside must comply with surface movement guidance and marking systems defined here.
- Doc 9137 (Airport Services Manual): Covers ground vehicle operations, including vehicle/pedestrian control, right-of-way rules, and safety requirements.
- SMS (Safety Management System): ICAO requires SMS implementation for aerodrome operators. Autonomous vehicles would need to be integrated into the aerodrome's SMS, with hazard identification and risk mitigation processes.
FAA (United States)
- Advisory Circular 150/5210-20: Guidelines for driver training and vehicle operations on airports. Currently written for human drivers but establishes the safety framework any autonomous system must meet.
- 14 CFR Part 139: Certification of airports. Autonomous vehicles would need to demonstrate compliance with airfield safety standards.
- No specific autonomous vehicle regulations for airside: As of early 2025, neither ICAO nor the FAA has published standards specifically governing autonomous ground vehicles on airport surfaces. This regulatory gap means that any deployment would likely require special approvals and extensive safety cases on a per-airport basis.
IATA
- ISAGO (IATA Safety Audit for Ground Operations): Industry standard for ground handling safety. Autonomous vehicles in ground handling roles would need to meet ISAGO audit criteria.
- IATA Ground Operations Manual (IGOM): Defines standard procedures for ground handling. Autonomous system behavior would need to conform to IGOM procedures.
4.4 Opportunities for World Models in Airside Operations
Despite the challenges, airside environments have properties that make them particularly well-suited for world model-based autonomy:
Low speed: Operating at 5--25 km/h provides longer reaction times and reduces the consequence of prediction errors, making latent-space planning more tractable.
Structured environment: Despite dynamic elements, the overall environment is more structured and controlled than public roads, with fewer unpredictable external factors.
Controlled access: All vehicles and personnel on the apron are authorized, trained, and (in theory) following known procedures. This reduces the behavioral diversity the world model must capture.
High economic value: Ground handling delays are extremely costly to airlines (estimated $100+ per minute for aircraft on-ground delays). Efficiency improvements from autonomous GSE operations have clear ROI.
Safety-critical but lower liability complexity: Operations are on private property under airport operator control, potentially simplifying the liability framework compared to public roads.
Recommended approach for deploying world models in airside operations:
- Use a pre-trained foundation model (e.g., NVIDIA Cosmos) fine-tuned on airport-specific data
- Leverage the world model primarily for scenario generation and safety validation (generating rare airside incidents for testing)
- Use occupancy-based world models (OccWorld-style) for planning, as 3D occupancy naturally represents the diverse object geometries found on the apron
- Implement closed-loop testing (UniSim-style) with real airside sensor logs to validate perception and planning before physical deployment
5. State of the Art as of Early 2025
5.1 What Is Actually Deployable
| Readiness Level | Technology | Status |
|---|---|---|
| Production-deployed | Traditional modular pipelines (Waymo, Cruise, Zoox) | Operational in geo-fenced areas; world models not yet core to production stacks |
| Fleet-tested | Wayve's LINGO-2 (vision-language-action model) | First closed-loop VLAM tested on public roads in London |
| Pilot/integration phase | NVIDIA Cosmos for synthetic data augmentation | Used by Waabi and others for training data generation; not yet in vehicle real-time inference |
| Advanced research | GAIA-2, GenAD, Vista, DriveDreamer-2 | Generating high-quality synthetic training data; not running in-vehicle |
| Research prototype | MILE, Think2Drive, OccWorld (latent-space planning) | Demonstrated in simulation (CARLA, nuScenes); not tested on physical vehicles |
| Foundational research | Doe-1, WorldDreamer, DrivePhysica | Architectural innovations; early-stage evaluation |
5.2 Key Gaps Between Research and Deployment
Real-time inference: Most world models require seconds to generate a single future prediction. Real-time planning demands predictions within 50--100ms. Only the smallest models (NVIDIA Cosmos Nano tier) are designed for edge deployment, and even these are not yet validated for real-time driving.
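A back-of-envelope calculation makes the gap concrete; the sampler depth and per-step latency below are illustrative assumptions, not measured numbers.

```python
# Illustrative latency budget: a multi-step diffusion sampler overshoots a
# real-time planning cycle by a wide margin, which is why distilled or
# few-step variants matter for in-vehicle use.
DIFFUSION_STEPS = 50       # assumed sampler depth
MS_PER_STEP = 40           # assumed per-step denoiser latency on edge hardware
PLANNING_BUDGET_MS = 100   # upper end of the real-time requirement cited above

latency_ms = DIFFUSION_STEPS * MS_PER_STEP
print(f"{latency_ms} ms per prediction vs. a {PLANNING_BUDGET_MS} ms budget "
      f"({latency_ms / PLANNING_BUDGET_MS:.0f}x over)")   # 2000 ms -> 20x over
```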
Closed-loop validation at scale: While UniSim and SceneDiffuser++ enable closed-loop testing, the industry lacks standardized benchmarks for evaluating world model fidelity in closed-loop settings. Most published results use open-loop metrics (FID, FVD, Chamfer distance) that do not directly predict driving performance.
Safety guarantees: World models are fundamentally probabilistic and can hallucinate plausible but physically impossible futures. No current framework provides formal safety guarantees from world model predictions, which is a prerequisite for safety-critical deployment.
Sensor fidelity: Video-based world models generate visually convincing outputs but may not accurately represent the precise geometric and radiometric properties that perception systems depend on. LiDAR world models (Copilot4D) are more geometrically faithful but less mature.
Multi-agent interaction fidelity: While GAIA-2 conditions on agent bounding boxes, the reactive behavior of other agents in response to ego actions remains a significant modeling challenge. Oversimplified agent behavior in the world model can lead to overconfident planning.
Domain adaptation: All current world models are trained on road driving data. Adapting to non-road domains (airports, mines, construction sites, warehouses) requires new data collection and fine-tuning, with limited guidance on how much data is sufficient.
5.3 Trajectory of the Field
The field is converging on a consensus architecture:
Video/Sensor Tokenizer --> Latent Diffusion/Transformer World Model --> Task-Specific Heads
                                              ^
                                              |
                           Structured Conditioning Inputs
                           (actions, maps, weather, agents, text)

Key trends:
Diffusion over autoregression: GAIA-2's shift from autoregressive tokens (GAIA-1) to latent diffusion reflects the field's direction. Diffusion models produce higher-fidelity outputs with better temporal consistency.
Occupancy as the representation layer: OccWorld, DFIT-OccWorld, MUVO, and Delta-Triplane Transformers all converge on 3D occupancy as a representation that balances expressiveness, computational cost, and task utility.
Foundation model + fine-tuning: NVIDIA Cosmos establishes the template -- large-scale pre-training on diverse video, then domain-specific fine-tuning. This will likely reduce the barrier to deploying world models in specialized domains like airport operations.
Unification of generation and planning: Doe-1 and UniDWM represent the frontier where the world model is not just a data generator but the core decision-making module, collapsing the traditional pipeline into a single learned system.
Scaling laws hold: GAIA-1 demonstrated LLM-like scaling behavior, with Wayve noting "significant room for improvement" through additional data and compute. Cosmos trained on 9,000 trillion tokens. The field expects continued performance gains from scale.
5.4 Industry Assessment
Near-term (2025--2026): World models will be deployed primarily as offline tools for:
- Training data augmentation (synthetic rare scenarios)
- Simulation-based testing and validation
- Pre-training representations for downstream perception/prediction tasks
Medium-term (2026--2028): World models will begin to appear as online planning components in:
- Geo-fenced, low-speed applications (airport GSE, warehouse robots, mining vehicles)
- L4 autonomous vehicles as a planning module complementing traditional planners
- Continuous learning systems that update their world model from fleet data
Long-term (2028+): World models may become the core architecture for autonomous systems, replacing modular pipelines entirely with unified learned models that perceive, predict, plan, and act within a single latent space.
References
Core World Model Papers
- GAIA-1: Hu et al., "GAIA-1: A Generative World Model for Autonomous Driving," arXiv:2309.17080, 2023
- GAIA-2: Wayve, "GAIA-2: A Latent Diffusion World Model for Autonomous Driving," arXiv:2503.20523, 2025
- UniSim: Yang et al., "UniSim: A Neural Closed-Loop Sensor Simulator," CVPR 2023 Highlight, arXiv:2308.01898
- MILE: Hu et al., "Model-Based Imitation Learning for Urban Driving," NeurIPS 2022, arXiv:2210.07729
- DriveDreamer: Wang et al., "DriveDreamer: Towards Real-world-driven World Models for Autonomous Driving," arXiv:2309.09777, 2023
- DriveDreamer-2: Zhao et al., "DriveDreamer-2: LLM-Enhanced World Models for Diverse Driving Video Generation," arXiv:2403.06845, 2024
- ADriver-I: Li et al., "ADriver-I: A General World Model for Autonomous Driving," arXiv:2311.13549, 2023
- GenAD: Yang et al., "Generative End-to-End Autonomous Driving," CVPR 2024 Highlight, arXiv:2403.09630
- Copilot4D: Zhang et al., "Copilot4D: Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion," arXiv:2311.01017, 2023
- OccWorld: Zheng et al., "OccWorld: Learning a 3D Occupancy World Model for Autonomous Driving," arXiv:2311.16038, 2023
- Vista: Gao et al., "Vista: A Generalizable Driving World Model with High Fidelity and Versatile Controllability," arXiv:2405.17398, 2024
- MUVO: "Multimodal World Model with Geometric Voxel Representations," IV 2025, arXiv:2311.11762
- Think2Drive: Li et al., "Think2Drive: Efficient Reinforcement Learning by Thinking in Latent World Model for Autonomous Driving," arXiv:2402.16720, 2024
- WorldDreamer: Wang et al., "WorldDreamer: Towards General World Models for Video Generation via Predicting Masked Tokens," arXiv:2401.09985, 2024
- DriveWorld: Chen et al., "DriveWorld: 4D Pre-trained Scene Understanding via World Models for Autonomous Driving," CVPR 2024, arXiv:2405.04390
Foundation Models and Platforms
- NVIDIA Cosmos: "Cosmos World Foundation Model Platform for Physical AI," arXiv:2501.03575, 2025
- Cosmos-Drive-Dreams: arXiv:2506.09042, 2025
- Cosmos-Predict2.5: arXiv:2511.00062, 2025
Industry Systems
- Waymo SceneDiffuser: NeurIPS 2024
- Waymo SceneDiffuser++: CVPR 2025
- Wayve LINGO-2: Vision-Language-Action Model, 2024
Surveys
- "A Survey of World Models for Autonomous Driving," Feng et al., arXiv:2501.11260, 2025
- "The Role of World Models in Shaping Autonomous Driving," Tu et al., arXiv:2502.10498, 2025
- "Exploring the Interplay Between Video Generation and World Models in Autonomous Driving," Fu et al., arXiv:2411.02914, 2024
- "World Models for Autonomous Driving: An Initial Survey," Guan et al., arXiv:2403.02622, 2024
- "Understanding World or Predicting Future? A Comprehensive Survey of World Models," Ding et al., arXiv:2411.14499, 2024
Recent Additions (Late 2024--2025)
- Doe-1: arXiv:2412.09627
- DrivingWorld: arXiv:2412.19505
- DFIT-OccWorld: arXiv:2412.13772
- DrivePhysica: arXiv:2412.08410
- InfinityDrive: arXiv:2412.01522
- HoloDrive: arXiv:2412.01407
- GEM: arXiv:2412.11198
- UniDWM: arXiv:2602.01536
- World4Drive: arXiv:2507.00603