
Wayve: GAIA World Model Series & End-to-End Driving Platform

Last updated: 2026-03-22
Research scope: GAIA-1/2/3 architectures, LINGO-1/2, AV2.0 platform, Ghost Gym, scaling laws, commercial deployments, and lessons for airside autonomy.


Table of Contents

  1. Company Overview & Philosophy
  2. AV2.0: The End-to-End Approach
  3. GAIA-1: Foundations (2023)
  4. GAIA-2: Latent Diffusion at Scale (2025)
  5. GAIA-3: Safety Evaluation at 15B (2025)
  6. Scaling Law Results
  7. LINGO-1 & LINGO-2: Language-Conditioned Driving
  8. Ghost Gym & PRISM-1: Closed-Loop Simulation
  9. AI-500 Roadshow & Global Generalization
  10. US Expansion: 500 Hours of Incremental Data
  11. Commercial Deployments & Partnerships
  12. Funding & Valuation
  13. What Makes Wayve's Approach Unique
  14. Lessons for Airside Autonomy
  15. Sources

1. Company Overview & Philosophy

Wayve was founded in 2017 by Alex Kendall (CEO) and Amar Shah, both from the University of Cambridge, with a contrarian thesis: replace the hand-engineered autonomous vehicle stack with a single end-to-end neural network that learns to drive from data. Kendall's conviction is that one large foundation model, trained on diverse real-world driving data, will outperform modular, rule-based robotics stacks -- and that this approach scales globally in a way that hand-coded systems cannot.

Wayve's core concept is Embodied AI -- AI that learns from physical-world experience (both real and simulated) to handle complex or novel situations, rather than following pre-programmed instructions. The company positions this as part of a decade-long trend of "bringing AI into the physical world."

Rather than pursuing vertical integration (building its own fleet and ride-hailing platform), Wayve operates as an AI software provider, partnering with automakers (Nissan, Mercedes-Benz, Stellantis) and mobility platforms (Uber) to deploy its AI Driver across their vehicles and networks.


2. AV2.0: The End-to-End Approach

Wayve coined "AV2.0" to distinguish its approach from "AV1.0" (the traditional modular stack used by companies like Waymo and Cruise):

| Dimension | AV1.0 (Traditional) | AV2.0 (Wayve) |
| --- | --- | --- |
| Architecture | Modular: perception -> prediction -> planning -> control | End-to-end: cameras/radar -> single neural network -> driving outputs |
| Maps | Requires centimeter-accurate HD maps | Map-free; learns spatial understanding from data |
| Rules | Hand-coded driving rules, thousands of edge cases | Learned behavior from hundreds of millions of data samples |
| Scaling | Per-city engineering; new ODD = new rules | Single foundation model; adapts to new geographies with minimal incremental data |
| Sensors | Typically requires LiDAR + cameras + radar | Camera-primary (with radar); no LiDAR dependency |
| Vehicle integration | Custom robotaxi platforms | Adapts to mass-production vehicles from OEM partners |

The AI Driver converts camera and radar inputs into driving outputs (steering, acceleration, braking) through one neural network. When adjustments are needed, Wayve retrains on new data rather than reprogramming rules.

A key claim: the system can "instantly apply knowledge -- or generalise its driving intelligence -- to new situations without re-engineering," making it far more scalable than location-specific approaches.


3. GAIA-1: Foundations (2023)

3.1 Overview

GAIA-1 (Generative AI for Autonomy) was introduced on June 17, 2023 and the technical report was published on October 3, 2023 (arXiv:2309.17080). It is a generative world model that leverages video, text, and action inputs to generate realistic driving scenarios with fine-grained control.

The core innovation: casting world modeling as an unsupervised sequence modeling problem -- mapping multimodal inputs to discrete tokens and predicting the next token, analogous to how LLMs predict the next word.

3.2 Architecture (Three-Stage Pipeline)

Total parameters: ~9.3 billion (6.5B world model + 2.6B video decoder + encoders)

Stage 1: Image Tokenizer (VQ-VAE)

  • Architecture: Fully convolutional 2D U-Net encoder/decoder with vector-quantized nearest-neighbor lookup
  • Codebook size: K = 8,192
  • Spatial downsampling: 16x
  • Tokens per image: 576 (18 x 32 grid) at 288 x 512 input resolution
  • Bit compression ratio: ~470x
  • Training losses: Weighted L1, L2, perceptual, and GAN losses; quantization embedding + commitment loss; inductive bias loss encouraging quantized features to match DINO model features via cosine similarity
  • Training: 200k steps, 4 days, batch size 160 across 32 A100 GPUs
  • Linear projection with L2 normalization to boost vocabulary utilization
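As a quick sanity check on the figures above, the token grid and the bit-level compression ratio follow directly from the stated resolution, downsampling factor, and codebook size. A minimal back-of-the-envelope sketch (values taken from the list above, assuming 8-bit RGB input):

```python
import math

# GAIA-1 image tokenizer arithmetic (values from the notes above).
H, W, DOWNSAMPLE, CODEBOOK = 288, 512, 16, 8192

tokens_per_image = (H // DOWNSAMPLE) * (W // DOWNSAMPLE)   # 18 * 32 = 576
bits_per_token = math.log2(CODEBOOK)                       # 13 bits per discrete token
raw_bits = H * W * 3 * 8                                   # 8-bit RGB input image
compression = raw_bits / (tokens_per_image * bits_per_token)

print(tokens_per_image, round(compression))  # 576, ~473x (matches the quoted ~470x)
```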

Stage 2: Autoregressive World Model Transformer

  • Parameters: 6.5 billion
  • Context: T=26 frames at 6.25 Hz = ~4-second video windows
  • Sequence composition per timestep:
    • m = 32 text tokens (from text encoder)
    • n = 576 image tokens (from VQ-VAE)
    • l = 2 action tokens (speed, curvature)
    • Total sequence length: T x (m + n + l) = 26 x 610 = 15,860 tokens
  • Positional embeddings: Factorized spatio-temporal (T temporal + 610 spatial embeddings, dimension d=4096)
  • Optimizer: AdamW (lr=1e-4, weight decay=0.1, beta=(0.9, 0.95))
  • Training: 100k steps over 15 days, batch size 128, 64 A100 GPUs
  • Techniques: FlashAttention v2, DeepSpeed ZeRO-2, activation checkpointing
  • Conditioning dropout ratios: 20% unconditioned / 40% action-only / 40% text+action
  • Inference: Top-k sampling (k=50) with classifier-free guidance (scheduled guidance scale across tokens and frames)
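The sequence budget above can be reproduced with simple arithmetic; a minimal sketch of the per-frame token layout only, not of Wayve's actual interleaving code:

```python
# GAIA-1 world-model sequence budget (values from the notes above).
T_FRAMES = 26                     # ~4 s of context at 6.25 Hz
TEXT, IMAGE, ACTION = 32, 576, 2  # tokens per frame: text, image, action

tokens_per_frame = TEXT + IMAGE + ACTION   # 610
total_tokens = T_FRAMES * tokens_per_frame # 15,860
print(tokens_per_frame, total_tokens)
```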

Stage 3: Diffusion Video Decoder

  • Parameters: 2.6 billion
  • Architecture: 3D U-Net with factorized spatial and temporal attention layers
  • Operating window: T'=7 images at 288 x 512 resolution
  • Multi-task training: Image generation, video generation, autoregressive decoding, video interpolation (sampled equally)
  • Noise schedule: Cosine beta-schedule
  • Loss: Weighted L1 (lambda=0.1) and L2 (lambda=1.0)
  • Training: 300k steps over 15 days, batch size 64, 32 A100 GPUs
  • Inference: DDIM sampler, 50 diffusion steps
  • Temporal upsampling: 6.25 Hz -> 12.5 Hz -> 25 Hz (multi-stage)
  • EMA decay: 0.999

3.3 Training Data

  • 4,700 hours of proprietary driving footage from London, UK (2019-2023) at 25 Hz
  • ~420 million unique images
  • Balanced sampling over latitude, longitude, weather, steering/speed behaviors
  • Validation: 400 hours from geofenced unseen routes

3.4 Emergent Capabilities

  • Long-horizon generation: Minutes-long driving scenarios from seconds of context
  • Multi-future prediction: Diverse ego-behaviors and traffic patterns from identical initial conditions
  • Action conditioning: Fine-grained trajectory control via speed/curvature inputs
  • Text-based scene control: Weather, lighting, time of day, road conditions via natural language
  • Contextual awareness and geometry understanding: Learned without explicit supervision
  • Out-of-distribution extrapolation: Novel scenarios like off-road steering

4. GAIA-2: Latent Diffusion at Scale (2025)

4.1 Overview

GAIA-2 was announced in March 2025 (arXiv:2503.20523) and represents a fundamental architectural shift from GAIA-1: replacing the autoregressive transformer + VQ-VAE + diffusion decoder pipeline with a unified latent diffusion framework using flow matching.

4.2 Architecture

Total world model: 8.4 billion parameters

Video Tokenizer

  • Encoder: 85M parameters; downsampling convolutions + 24 spatial transformer blocks
  • Decoder: 200M parameters; 16 space-time factorized blocks + 8 additional blocks with full temporal context
  • Spatial compression: 32x downsampling
  • Temporal compression: 8x downsampling
  • Latent dimension: 64 channels
  • Total compression ratio: ~384x
  • Latent grid per camera: 14 x 30 tokens
  • From 24 frames at 448 x 960 -> 3 x 14 x 30 x 64 latent representation
  • Training: 300k steps, batch size 128, 128 H100 GPUs
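The latent shape and overall compression follow from the stated factors; a quick per-camera check using the numbers above (ignoring the multi-camera axis):

```python
import math

# GAIA-2 video tokenizer arithmetic (values from the notes above).
FRAMES, H, W, CH = 24, 448, 960, 3
T_DOWN, S_DOWN, LATENT_CH = 8, 32, 64

latent_shape = (FRAMES // T_DOWN, H // S_DOWN, W // S_DOWN, LATENT_CH)  # (3, 14, 30, 64)
compression = (FRAMES * H * W * CH) / math.prod(latent_shape)           # exactly 384x
print(latent_shape, compression)
```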

Diffusion Transformer (DiT) -- World Model

  • Design: Space-time factorized transformer with 22 blocks
  • Hidden dimension: 4,096
  • Attention heads: 32 per block
  • Layer composition per block: Spatial attention -> temporal attention -> cross-attention -> MLP
  • Positional encoding: Sinusoidal for spatial tokens and camera timestamps; learnable layers for camera geometry
  • Conditioning injection: Adaptive layer norm for action/timestep; cross-attention for other conditioning variables

Flow Matching Training

  • Bimodal noise distribution:
    • Primary mode: mu=0.5, sigma=1.4, p=0.8 (low-to-moderate noise)
    • Secondary mode: mu=-3.0, sigma=1.0, p=0.2 (near-pure noise)
  • Input normalization: Mean mu_x=0.0, std sigma_x=0.32
  • Loss function: L2 velocity matching between predicted and target latents
  • Training: 460k steps, batch size 256, 256 H100 GPUs
  • Inference: 50 denoising steps with linear-quadratic noise schedule; autoregressive rollouts via sliding window
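A minimal sketch of what a flow-matching training step with an L2 velocity objective could look like, assuming a rectified-flow style interpolation and a logit-normal mixture for the noise-level variable. The mixture parameters are taken from the list above; GAIA-2's exact parameterization, conditioning interface, and model signature are not public, so the names here are illustrative:

```python
import torch

def sample_noise_level(batch_size):
    # Two-component mixture in logit space, squashed to (0, 1):
    # (mu=0.5, sigma=1.4) with p=0.8 and (mu=-3.0, sigma=1.0) with p=0.2.
    primary = torch.rand(batch_size) < 0.8
    mu = torch.where(primary, torch.tensor(0.5), torch.tensor(-3.0))
    sigma = torch.where(primary, torch.tensor(1.4), torch.tensor(1.0))
    return torch.sigmoid(mu + sigma * torch.randn(batch_size))

def flow_matching_loss(model, latents, conditioning):
    # latents: clean video latents of shape (B, ...). Interpolate toward pure noise
    # and regress the velocity (noise - latents) with an L2 objective.
    t = sample_noise_level(latents.shape[0]).view(-1, *([1] * (latents.dim() - 1)))
    noise = torch.randn_like(latents)
    noisy = (1.0 - t) * latents + t * noise
    target_velocity = noise - latents
    predicted_velocity = model(noisy, t.flatten(), conditioning)  # hypothetical signature
    return torch.nn.functional.mse_loss(predicted_velocity, target_velocity)
```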

4.3 Seven Conditioning Pathways

This is one of GAIA-2's key innovations -- structured, disentangled control over scene generation:

| # | Pathway | Details |
| --- | --- | --- |
| 1 | Camera Parameters | Intrinsics, extrinsics, distortion coefficients |
| 2 | Video Frequency | Timestamp encoding via sinusoidal functions + MLP |
| 3 | Action (Ego-vehicle) | Speed and steering curvature via symlog normalization |
| 4 | Dynamic Agents | 3D bounding boxes projected to 2D; up to B agent instances per frame |
| 5 | Metadata | Country, weather, time of day, lane types, traffic signals, pedestrian crossings, intersections, speed limits |
| 6 | CLIP Embedding | Visual or text-based semantic control via CLIP latents |
| 7 | Scenario Embedding | Proprietary driving-specific latent representations from Wayve's internal models |
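The symlog normalization used by the action pathway (row 3) compresses large-magnitude speed and curvature values while staying near-linear around zero. A minimal sketch of the transform and its inverse; any scaling constants Wayve applies on top of this are not public:

```python
import torch

def symlog(x: torch.Tensor) -> torch.Tensor:
    # Signed log transform: near-linear for small |x|, logarithmic for large |x|.
    return torch.sign(x) * torch.log1p(torch.abs(x))

def symexp(x: torch.Tensor) -> torch.Tensor:
    # Inverse of symlog, mapping normalized values back to physical units.
    return torch.sign(x) * torch.expm1(torch.abs(x))

speeds = torch.tensor([-30.0, -1.0, 0.0, 1.0, 30.0])  # e.g. m/s
print(symlog(speeds))           # compressed representation fed to the model
print(symexp(symlog(speeds)))   # recovers the original values
```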

4.4 Multi-Camera Generation

  • Up to 5 simultaneous camera views with spatial and temporal coherence
  • Resolution: 448 x 960 pixels per camera
  • Total latent tokens: T x N x H x W = 6 x 5 x 14 x 30 = 12,600 tokens (for 48-frame inputs)
  • Accommodates various multi-camera rig configurations across vehicle platforms

4.5 Training Data

  • ~25 million video sequences, each spanning 2 seconds
  • Collected 2019-2024 across UK, US, and Germany
  • Vehicles: 3 car models + 2 van types
  • Camera configurations: 5-6 cameras per vehicle at 20/25/30 Hz
  • Joint probability balancing across features; geographically held-out validation splits

4.6 Evaluation

  • Visual fidelity: Frechet DINO Distance (FDD) and FID using DINOv2 ViT-L/14
  • Temporal consistency: Frechet Video Motion Distance (FVMD) measuring keypoint motion distributions
  • Agent conditioning: Class-based IoU via OneFormer segmentation masks
  • Validation loss found to "correlate well with human perceptual preferences"
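FDD and FID are both instances of the Fréchet distance between Gaussians fitted to feature embeddings (DINOv2 ViT-L/14 features in the FDD case). A generic sketch of that computation, independent of Wayve's internal tooling; `feats_real` and `feats_gen` would be embedding matrices (one row per frame):

```python
import numpy as np
from scipy import linalg

def frechet_distance(feats_real: np.ndarray, feats_gen: np.ndarray) -> float:
    # Fréchet distance between Gaussians fitted to two feature sets (rows = samples).
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean, _ = linalg.sqrtm(cov_r @ cov_g, disp=False)
    if np.iscomplexobj(covmean):
        covmean = covmean.real  # discard tiny imaginary parts from numerical error
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))
```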

5. GAIA-3: Safety Evaluation at 15B (2025)

5.1 Overview

GAIA-3 was launched on December 2, 2025 as Wayve's most capable world model, designed specifically for offline evaluation and validation of autonomous driving AI through simulation of safety-critical scenarios.

5.2 Scale

  • 15 billion parameters (nearly 2x GAIA-2's 8.4B)
  • Video tokenizer: 2x larger than GAIA-2's, for more faithful representation of real-world physics
  • Training compute: 5x more than GAIA-2
  • Training data: ~10x more than GAIA-2
  • Geographic coverage: 9 countries across 3 continents

5.3 Key Capabilities

World-on-Rails: Alter the ego vehicle's trajectory while maintaining perfect consistency of all other scene elements (other vehicles, pedestrians, lighting, weather). This enables counterfactual "what-if" analysis.

Safety-Critical Scenario Generation:

  • Counterfactual collision scenarios: side-on, head-on, drifting
  • Virtual NCAP-style tests (CCFTAP, CCRS) generated at scale
  • Consistency validated using LiDAR point cloud alignment

Embodiment Transfer: Re-render scenes from new sensor configurations using only unpaired samples from target rigs. This enables transferring evaluation suites across vehicle platforms without needing paired captures -- critical for OEM deployment.

Visual Robustness Control: Systematic modification of lighting, textures, and weather while preserving geometry and motion for robustness evaluation.

Offline Evaluation Suites: Repeatable, measurable testing through action conditioning and trajectory perturbations.

5.4 Improvements Over GAIA-2

  • Sharper visuals of static and dynamic elements
  • Enhanced world coherence over long trajectories and through occlusions
  • Improved rendering of pedestrians, signage, and fine-grained details (safety-critical elements)
  • More consistent lighting and richer texture detail
  • Better representation of real-world physics and causality

6. Scaling Law Results

A critical finding across the GAIA series is that world models for autonomous driving obey power-law scaling, analogous to scaling laws observed in large language models (Kaplan et al., Hoffmann et al.).

6.1 GAIA-1 Scaling Experiments

  • Models ranged from 0.65M to 6.5B parameters (from 1/10,000th of the full model's size up to the full model)
  • Validation metric: Cross-entropy on geofenced held-out test set
  • Power-law fit: f(x) = c + (x/a)^b
  • Compute estimation: ~6N FLOPs per training token, where N is the non-embedding parameter count (total compute C ~ 6ND for D training tokens)
  • Final performance of the 6.5B model was predicted with high accuracy from less than 1/20th the compute
  • The validation curve shows "significant room for improvement" through additional scaling
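A sketch of how such a power-law extrapolation can be fit and then used to predict large-run performance from small runs. The (compute, loss) pairs below are illustrative placeholders, not the actual GAIA-1 measurements:

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law(x, a, b, c):
    # f(x) = c + (x / a) ** b, with x = training compute and c an irreducible loss floor.
    return c + (x / a) ** b

# Hypothetical (compute, validation cross-entropy) pairs from small training runs,
# with compute expressed in units of 1e18 FLOPs for numerical stability.
compute = np.array([1.0, 3.0, 10.0, 30.0, 100.0])
val_ce = np.array([4.90, 4.60, 4.35, 4.15, 4.00])

(a, b, c), _ = curve_fit(
    power_law, compute, val_ce,
    p0=[1.0, -0.2, 3.5],
    bounds=([1e-6, -2.0, 0.0], [1e6, 0.0, 10.0]),
    maxfev=20000,
)
print(f"fit: a={a:.3g}, b={b:.3f}, c={c:.3f}")
print(f"extrapolated loss at 1000 (i.e. 1e21 FLOPs): {power_law(1000.0, a, b, c):.3f}")
```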

6.2 Key Scaling Insight

Optimal scaling requires increasing model size 1.5x as fast as dataset size as the compute budget grows. This is similar to the Chinchilla-optimal ratio observed in LLMs and suggests that current world models are likely data-constrained rather than parameter-constrained at the frontier.
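One back-of-the-envelope reading of that ratio (an interpretation, not a figure from the papers): if optimal model size N grows 1.5x as fast as dataset size D in log-space, and compute scales as C ~ N x D, the two exponents must sum to 1, giving N ~ C^0.6 and D ~ C^0.4.

```python
# Hypothetical compute-allocation rule implied by the 1.5:1 scaling statement above.
def optimal_allocation(compute_multiplier: float) -> dict:
    # N ~ C**0.6 and D ~ C**0.4: exponents sum to 1 and their ratio is 1.5.
    return {
        "model_size_multiplier": compute_multiplier ** 0.6,
        "dataset_size_multiplier": compute_multiplier ** 0.4,
    }

print(optimal_allocation(10.0))  # ~4.0x larger model, ~2.5x more data per 10x compute
```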

6.3 GAIA-2 -> GAIA-3 Scaling

The progression from GAIA-2 (8.4B) to GAIA-3 (15B) with 5x compute and 10x data demonstrates continued returns from scaling. The improvements in visual fidelity, physics coherence, and safety-critical detail rendering align with the power-law predictions from GAIA-1.

6.4 Implications

  • World model quality is predictable from compute budget before training
  • This enables efficient resource allocation for research (train small models to predict large-model performance)
  • The "room for improvement" on the validation curve suggests 15B is far from the ceiling
  • The shift from autoregressive (GAIA-1) to latent diffusion (GAIA-2/3) improved sample efficiency without breaking the scaling trend

7. LINGO-1 & LINGO-2: Language-Conditioned Driving

7.1 LINGO-1 (September 2023)

Type: Open-loop driving commentator (vision-language model)

Architecture: Vision-language-action model (VLAM) combining imagery, driving data, and natural language through visual question answering (VQA).

Training data: Scalable dataset of imagery, language commentary, and driving actions from expert drivers in Central London. Commentary follows roadcraft principles -- professional driving instructors narrating observations and justifying decisions aloud. Data collected without impacting the rate of standard expert driving data collection.

Capabilities:

  • Generates continuous driving commentary explaining reasoning behind actions
  • Answers contextual questions about scenes, weather impacts, road rules
  • Performs perception, counterfactual reasoning, planning, and scene understanding tasks
  • Leverages internet-scale LLM knowledge for broader conceptual understanding

Performance: ~60% of human-level accuracy on comprehensive VQA benchmarks (perception, reasoning, driving knowledge). Performance roughly doubled over the course of development as the architecture and training datasets improved.

Limitations: Trained primarily on Central London data; prone to LLM hallucinations; limited temporal context; open-loop only.

7.2 LINGO-2 (April 2024)

Type: Closed-loop vision-language-action model (VLAM) -- the first VLA tested on public roads

Architecture (two modules):

  1. Wayve Vision Model: Processes consecutive camera images into a sequence of tokens
  2. Auto-regressive Language Model: Receives vision tokens + conditioning variables (route, current speed, speed limit); trained to predict both a driving trajectory and commentary text

Key innovation: LINGO-2 outputs both driving action and language simultaneously, providing continuous commentary on its motion planning decisions while actually controlling the vehicle.

Three primary use cases:

  1. Behavior adaptation: Natural language prompts modify driving behavior (e.g., "pull over," "turn right at the intersection," "stop at the give way line")
  2. Real-time VQA: Answer questions about scenes and decisions while driving
  3. Live commentary: Continuous narration explaining actions

Significance: LINGO-2 is the first demonstration of a vision-language-action model controlling a vehicle on public roads in closed loop. This goes beyond open-loop commentary -- the language model's outputs directly influence the vehicle's trajectory.

Testing: Validated in Ghost Gym simulation environment before public road testing. Additional safety validation needed for language-based behavior control in real-world settings.


8. Ghost Gym & PRISM-1: Closed-Loop Simulation

8.1 Ghost Gym

Ghost Gym is Wayve's closed-loop data-driven neural simulator with three primary components:

  1. Neural Renderer: Uses learned neural rendering techniques (building on Neural Radiance Field research) to reconstruct scenes from actual driving data into photorealistic 4D worlds
  2. High-Fidelity Simulated Robot: Replicates internal vehicle systems by logging raw sensor data and processing asynchronous components offline for deterministic replay
  3. Vehicle Dynamics Model: Captures physics of vehicle movement and response to control inputs

Closed-loop advantage: Unlike open-loop replay (which simply plays back recorded data), Ghost Gym propagates the vehicle's simulated actions back into the environment, modeling how the world changes in response to revised driving behavior. This is critical for testing end-to-end models that combine perception and planning.

Applications:

  • Reproduce on-road intervention failures offline for debugging
  • Create "unit tests" to prevent model regressions
  • Generate synthetic scenarios at scale without road deployment
  • Align model behavior using human safety-operator feedback (re-simulate interventions to ensure future models don't repeat errors)

GAIA integration: Ghost Gym explores world modeling capabilities from the GAIA series to handle dynamic elements like pedestrians and cyclists -- an acknowledged open research challenge.

8.2 PRISM-1

PRISM-1 is a 4D scene reconstruction model (3D space + time) from video data that powers the next generation of Ghost Gym scenarios.

Key technical features:

  • Camera-only inputs: No LiDAR or 3D bounding box dependency
  • Self-supervised decomposition: Separates static and dynamic elements without explicit labels or predefined models
  • Dynamic element reconstruction: Cyclists, pedestrians, brake lights, opening car doors, road debris
  • Generalizes across camera setups without additional sensors or explicit 3D information

WayveScenes101 Dataset: Released with PRISM-1 -- 101 scenes from diverse driving environments in UK and US (urban, suburban, highway) under varied weather and lighting conditions.

Integration plan: PRISM-1 feeds into Ghost Gym to expand the variety of simulatable scenarios, accelerate development cycles, enable testing on different vehicle types/cameras, and improve coverage of under-represented scenarios.


9. AI-500 Roadshow & Global Generalization

In 2025, Wayve conducted its AI-500 Roadshow to validate whether a single foundation model could generalize globally.

Key Results

| Metric | Value |
| --- | --- |
| Cities tested | 506 (surpassing the 500 target) |
| Continents | 3 (Europe, North America, Asia) |
| Total distance | 1.45 million kilometers |
| Total testing hours | 46,347 hours (~5.3 years equivalent) |
| Zero-shot cities (no prior local data) | 219 cities (43%) |
| Sparse-data cities (<100 km of training data) | 67% of all cities |
| Model | Single foundation model, one set of weights |

Technical Approach

  • One global model with a single set of weights
  • Two lightweight embeddings provided at deployment:
    • Country embedding: Encodes country-specific driving norms
    • Driving-side embedding: Left-hand vs. right-hand traffic
  • No city-specific fine-tuning, HD mapping, or per-city engineering before deployment
  • Foundation model trained on globally diverse data spanning 70+ countries and multiple vehicle platforms

Significance

Wayve became the first and only AV company to drive zero-shot in 500+ cities within a single year. This validates the foundation-model hypothesis: a single AI driver can reliably operate across dense urban centers (Paris, New York, Tokyo), mountain regions, highways, and diverse traffic cultures without location-specific engineering.


10. US Expansion: 500 Hours of Incremental Data

Wayve's US expansion (late 2024) provides a case study in foundation-model adaptation efficiency.

Performance Progression

| Data Volume | Improvement | Notes |
| --- | --- | --- |
| 0 hours (zero-shot) | Baseline | Underperformance vs. UK benchmark |
| 100 hours | 5x improvement | Strong initial gains |
| 500 hours | 40x improvement | Achieved UK-equivalent performance |

Data was collected over 8 weeks. With just 500 hours of incremental US-specific data, the model matched its UK benchmark in both urban and highway environments.

Learned US-Specific Behaviors

  • Four-way stop intersections
  • Right turns on red
  • Unprotected left turns (opposite traffic flow from UK)
  • Freeway merging on short on-ramps
  • Implicit cultural norms (too nuanced to program manually)

Cross-Market Comparison

  • Germany (third market): Zero-shot performance was 3x better than initial US deployment -- likely because Germany's driving culture is closer to the UK training distribution
  • Cross-vehicle adaptation: 100 hours of vehicle-specific data yielded 8x improvement, comparable to geographic adaptation rates

Data Strategy

Wayve leverages a "data ocean" approach: combining internal fleet data with third-party datasets from partners and automakers, rather than relying solely on expensive sensor-rich autonomous vehicle footage. This enables petabyte-scale training with diverse coverage.


11. Commercial Deployments & Partnerships

11.1 Uber Partnership (L4 Robotaxis)

  • Announced: June 10, 2025
  • London L4 trials: Spring 2026 (UK's first robotaxi trial)
  • Model: Wayve deploys AI Driver in L4-capable vehicles; Uber owns/operates the fleet
  • Safety: Initial operation with trained safety drivers; transition to fully driverless planned
  • Expansion: Planned across 10+ markets globally

11.2 Nissan Partnership

  • Tokyo pilot: Wayve, Uber, and Nissan signed MoU (March 2026) for robotaxi deployment using Nissan LEAF by late 2026
  • Consumer vehicles: Nissan integrating Wayve AI into next-gen ProPILOT driver-assistance system; first consumer vehicles expected fiscal year 2027
  • Significance: First Uber autonomous vehicle partnership in Japan

11.3 OEM Integration Strategy

Wayve's technology spans L2+ through L4:

  • L2+ ("hands off"): Consumer ADAS integration (Nissan ProPILOT, others)
  • L3/L4 ("eyes off"): Robotaxi deployments via Uber platform
  • Vehicle-agnostic: Single AI Driver adapts across vehicle platforms from Mercedes-Benz, Nissan, Stellantis, and others

11.4 Microsoft Azure Partnership

Wayve uses Microsoft Azure for training compute infrastructure, scaling embodied AI models on large GPU clusters (A100s, H100s).


12. Funding & Valuation

Series D (February 2026)

| Metric | Value |
| --- | --- |
| Amount raised | $1.2 billion |
| Total capital secured | $1.5 billion (including Uber milestone-based commitments) |
| Post-money valuation | $8.6 billion |

Investors

Lead investors: Eclipse, Balderton, SoftBank Vision Fund 2

Financial investors: Ontario Teachers' Pension Plan, Baillie Gifford, British Business Bank, Icehouse Ventures, Schroders Capital

Strategic investors:

  • Microsoft -- compute infrastructure partner
  • NVIDIA -- GPU/DRIVE platform partner (robotaxi prototype on NVIDIA DRIVE Hyperion)
  • Uber -- fleet/platform partner + milestone-based deployment capital
  • Mercedes-Benz -- automotive OEM
  • Nissan -- automotive OEM (ProPILOT + Tokyo robotaxi)
  • Stellantis -- automotive OEM

Deployment Timeline

  • 2026: Commercial robotaxi trials (London spring 2026, Tokyo late 2026)
  • 2027: First consumer vehicles with Wayve AI (Nissan mass production)

13. What Makes Wayve's Approach Unique

13.1 vs. Waymo (AV1.0)

Waymo uses detailed pre-mapped routes, LiDAR + cameras + radar, and rule-based AI layers. Each new city requires extensive mapping and engineering. Wayve uses no HD maps, is camera-primary, and deploys a single model globally. However, Waymo has recently moved toward foundation models trained end-to-end, narrowing this gap architecturally while maintaining its sensor and mapping advantages.

13.2 vs. Tesla

Tesla also employs an end-to-end AI approach with vision-only sensing, making it the closest analog to Wayve. Key differences:

  • Wayve is vehicle-agnostic -- designed to work across OEM partners, not just one vehicle platform
  • Wayve is map-free -- Tesla's system still uses some map data for navigation
  • Wayve is a software supplier -- Tesla is vertically integrated (makes the car, the AI, and the fleet)
  • Wayve has published extensive research (GAIA papers, LINGO, PRISM) -- Tesla's technical details are less transparent

13.3 Distinctive Capabilities

  1. Single global model: One set of weights operates in 500+ cities across 3 continents -- no other AV company has demonstrated this
  2. World model research program: The GAIA series is the most systematic publicly documented effort to build driving-specific generative world models with scaling laws
  3. Language-action integration: LINGO-2 is the first closed-loop VLA on public roads
  4. Adaptation efficiency: 500 hours (8 weeks) to reach UK-equivalent performance in the US -- orders of magnitude less than traditional approaches
  5. OEM-agnostic platform: Designed from the ground up as a deployable AI stack for mass-production vehicles from multiple manufacturers
  6. Simulation-evaluation pipeline: GAIA-3 + Ghost Gym + PRISM-1 creates an integrated loop from data collection to world-model-powered evaluation

14. Lessons for Airside Autonomy

14.1 Foundation Model Transferability

Wayve's demonstration that 500 hours of incremental data can adapt a driving model to a new geography is directly relevant to airside operations. An airside autonomy system could be built on a foundation model pre-trained on large-scale driving data (including Wayve-style or similar datasets), then fine-tuned with airport-specific data. The key insight: general driving competence transfers, and domain-specific behaviors (taxiway rules, gate approach procedures, ground service equipment avoidance) can be learned from modest incremental data.

14.2 World Models for Safety Evaluation

GAIA-3's approach to safety-critical scenario generation is highly relevant:

  • Counterfactual collision scenarios can be generated for airside-specific hazards (runway incursions, FOD encounters, jet blast zones, ground crew proximity)
  • World-on-Rails enables testing "what-if" ego-vehicle trajectory changes while keeping the rest of the airport environment consistent
  • Embodiment transfer allows evaluation suite portability across different vehicle platforms (tugs, buses, baggage tractors) without paired data collection
  • NCAP-style virtual testing could be adapted for airside safety certification standards

14.3 Closed-Loop Simulation Strategy

Ghost Gym's approach -- neural rendering from real data + closed-loop replay -- is more feasible for airside than building traditional high-fidelity airport simulators:

  • Record airside driving data with cameras
  • Reconstruct scenes with PRISM-1-style 4D models
  • Run closed-loop evaluation of the autonomy stack
  • Re-simulate interventions to prevent regression
  • This is dramatically cheaper than building physics-based airport digital twins

14.4 Scaling Laws as Planning Tools

GAIA-1's power-law scaling results mean that world model quality is predictable from compute budget. For an airside program:

  • Train small models first to predict performance of larger models
  • Quantify the data/compute needed to reach target performance levels
  • Make investment decisions based on extrapolated scaling curves rather than trial-and-error

14.5 Language Conditioning for Operations

LINGO-2's language-action model opens possibilities for airside:

  • Verbal or text-based commands from ground control ("proceed to gate B7," "hold for crossing traffic")
  • Natural language explanations for safety audit trails
  • Operator override and guidance through language rather than manual control

14.6 Adaptation Efficiency for Airport-Specific Operations

The country/driving-side embedding approach could be adapted for airside:

  • Airport embedding: Encodes airport-specific layout patterns, traffic conventions, ground markings
  • Vehicle embedding: Adapts to different airside vehicle types (pushback tugs, passenger buses, cargo loaders)
  • With Wayve's demonstrated 100-hour / 8x-improvement curve for vehicle adaptation, deploying across multiple airports and vehicle types becomes tractable

14.7 Key Risks and Considerations

  • Domain gap: Airport airside is a fundamentally different operational domain from public roads. Transfer learning benefits may be limited for specialized behaviors (jet blast avoidance, wing-tip clearance, FOD detection)
  • Regulatory path: Aviation safety certification (DO-178C, EASA standards) is more stringent than automotive; world-model-based testing may need extensive validation for regulatory acceptance
  • Data scarcity: Airport driving data is far scarcer than public road data. Wayve's success relies on 70+ countries of road data -- an airside equivalent would need to be built from scratch or augmented with simulation
  • Failure modes: End-to-end neural networks can fail in unpredictable ways. Airside operations have extremely low tolerance for errors. Hybrid approaches (neural perception + rule-based safety layers) may be necessary

15. Sources

Public research notes collected from public sources.