
World Models & End-to-End Approaches in Autonomous Driving

Comprehensive Technical Report -- Leading AV Companies

Date: March 2026 Scope: Wayve, NVIDIA, Tesla, Waymo, Comma.ai, Waabi, Zoox, Aurora, Cruise, Hugging Face LeRobot, Applied Intuition


Table of Contents

  1. Wayve
  2. NVIDIA
  3. Tesla
  4. Waymo
  5. Comma.ai
  6. Waabi
  7. Zoox, Aurora, Cruise
  8. Hugging Face LeRobot & Open-Source Ecosystem
  9. Applied Intuition & Ghost Autonomy
  10. Comparative Analysis & Lessons for Airside AV

1. Wayve

1.1 Company Overview

Wayve is a London-based company pioneering an "AV2.0" approach to autonomous driving -- an AI-first, end-to-end paradigm that replaces traditional rule-based and HD-map-dependent stacks with foundation models trained on real-world driving data. Founded by Alex Kendall (Cambridge PhD), Wayve has raised approximately $2.3B total, including a $1.05B Series C (May 2024, led by SoftBank, with NVIDIA and Microsoft participating) and a $1.2B Series D (February 2026), bringing valuation to $8.6B.

1.2 GAIA World Model Family

Wayve's GAIA (Generative AI for Autonomy) is a lineage of generative world models that learn to predict and synthesize driving scenarios from data.

GAIA-1 (September 2023)

  • Architecture: Three-component pipeline (sketched after this list):
    1. Modality-specific encoders for video, text, and action inputs, projecting into a shared representation space
    2. Autoregressive transformer (the world model core) that performs next-token prediction in a unified sequence space
    3. Video diffusion decoder that translates predicted latent tokens back to pixel-space video
  • Video tokenizer: Fully convolutional 2D U-Net encoder with discrete quantization via a learnable embedding table. Uses DINO-guided semantic compression rather than high-frequency detail preservation
  • Scale: 9 billion parameters
  • Training data: 4,700 hours of proprietary London driving data (2019--2023)
  • Emergent capabilities: Learned high-level scene structure, contextual awareness, geometric understanding, and generalization to novel scenarios
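
The report does not describe Wayve's implementation beyond this outline; the minimal sketch below shows how the three stages compose. All module names, token counts, and tensor shapes are illustrative placeholders, not GAIA-1 internals.

```python
# Hypothetical sketch of a GAIA-1-style three-stage pipeline (all names/shapes illustrative).
import numpy as np

VOCAB = 1024          # size of the discrete token embedding table (assumed)
rng = np.random.default_rng(0)

def encode_modalities(video, text, action):
    """Stage 1: modality-specific encoders -> one shared discrete token sequence."""
    video_tokens = rng.integers(0, VOCAB, size=video.shape[0] * 16)   # stand-in tokenizer
    text_tokens = rng.integers(0, VOCAB, size=len(text.split()))
    action_tokens = rng.integers(0, VOCAB, size=action.shape[0])
    return np.concatenate([text_tokens, action_tokens, video_tokens])

def world_model_next_token(token_sequence):
    """Stage 2: autoregressive transformer predicting the next token (dummy here)."""
    return int(rng.integers(0, VOCAB))

def diffusion_decode(latent_tokens, frames=4):
    """Stage 3: video diffusion decoder mapping latent tokens back to pixels (dummy)."""
    return rng.random((frames, 288, 512, 3))

# Roll the world model forward: append predicted tokens, then decode them to video.
context = encode_modalities(video=np.zeros((8, 288, 512, 3)),
                            text="rainy London street, keep lane",
                            action=np.array([0.0, 0.1]))
for _ in range(64):                                   # predict 64 future tokens
    context = np.append(context, world_model_next_token(context))
future_video = diffusion_decode(context[-64:])
print(future_video.shape)                             # (4, 288, 512, 3)
```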

GAIA-2 (March 2025)

  • Architectural shift: Moved from GAIA-1's autoregressive frame-by-frame token generation to a latent diffusion world model operating in continuous latent space. This eliminates temporal discontinuities and enhances motion smoothness
  • Video tokenizer: Compresses raw pixel-space video into compact, semantically meaningful latent representations
  • Conditioning parameters: Extensive domain-specific controls (see the conditioning sketch after this list), including:
    • Ego-vehicle actions (speed, steering curvature)
    • Environmental factors (weather, time of day)
    • Road attributes (lane count, speed limits, intersections, traffic lights)
    • External embeddings (CLIP, proprietary driving-optimized models)
  • Multi-view generation: Native multi-camera video generation with spatial and temporal coherence across viewpoints (a key gap from GAIA-1)
  • Training data: Multi-country (UK, US, Germany), multi-vehicle-platform, multi-sensor-configuration
  • Operational modes: Forecasting, full scene synthesis, and modification of existing sequences
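
A hypothetical conditioning payload for such a latent diffusion world model is sketched below. The field names are illustrative groupings of the controls listed above, not Wayve's actual interface.

```python
# Hypothetical conditioning payload for a GAIA-2-style latent diffusion world model.
# Field names are illustrative; the report lists control categories, not an API.
from dataclasses import dataclass, field

@dataclass
class EgoAction:
    speed_mps: float          # ego speed
    curvature: float          # signed steering curvature (1/m)

@dataclass
class SceneConditioning:
    ego: EgoAction
    weather: str              # e.g. "rain", "fog", "clear"
    time_of_day: str          # e.g. "dusk"
    lane_count: int
    speed_limit_kph: int
    traffic_light_state: str  # e.g. "red"
    clip_embedding: list = field(default_factory=list)   # optional external embedding

cond = SceneConditioning(
    ego=EgoAction(speed_mps=8.0, curvature=0.02),
    weather="rain", time_of_day="night",
    lane_count=2, speed_limit_kph=30, traffic_light_state="green",
)
# A diffusion sampler would denoise multi-camera latents conditioned on `cond`,
# in forecasting, full-synthesis, or modification (editing) mode.
print(cond)
```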

GAIA-3 (December 2025)

  • Scale: 15 billion parameters (double GAIA-2); video tokenizer also doubled in size
  • Training: 5x compute and 10x data compared to GAIA-2, spanning 9 countries across 3 continents
  • Key innovations:
    • World-on-Rails: Can alter ego trajectories while maintaining perfect consistency in all other scene elements
    • Embodiment Transfer: Re-renders scenes from new sensor configurations using small unpaired target samples -- enables evaluation across vehicle programs without paired captures
    • Controlled Visual Diversity: Varies lighting, textures, weather while preserving scene geometry and motion
  • Evaluation capabilities: Safety-critical scenario generation, NCAP-style testing at scale, rare failure mode enrichment, LiDAR alignment validation
  • Key result: Simulated testing closely mirrors real-world results; 5x reduction in synthetic-test rejection rates

1.3 LINGO Models (Language-Conditioned Driving)

LINGO-1 (September 2023)

  • Open-loop driving commentator combining vision, language, and action
  • Can comment on driving scenes and explain decision factors
  • Enhances interpretability and training feedback

LINGO-2 (April 2024)

  • First closed-loop Vision-Language-Action Model (VLAM) tested on public roads
  • Architecture: Two modules (sketched after this list):
    1. Wayve vision model (visual encoding)
    2. Auto-regressive language model that receives visual tokens plus conditioning variables (route, speed, speed limit) and predicts driving trajectory + commentary text
  • Capabilities: Responds to constrained navigation commands ("pull over," "turn right") and adapts behavior in real-time
  • Represents closed-loop human-vehicle interaction via natural language
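
A toy closed-loop step under the two-module description above is shown below. Both functions are placeholders, since LINGO-2's internals are not public; only the interface (visual tokens plus route/speed conditioning in, trajectory plus commentary out) follows the text.

```python
# Hypothetical closed-loop step for a LINGO-2-style VLAM; both modules are dummies.
import numpy as np

def vision_model(camera_frames: np.ndarray) -> np.ndarray:
    """Wayve vision model -> visual tokens (dummy: mean-pool fixed-size patches)."""
    return camera_frames.reshape(-1, 64).mean(axis=1)

def language_model(visual_tokens, route_cmd, speed_mps, speed_limit_kph):
    """Autoregressive LM consuming visual tokens + conditioning, emitting a short
    trajectory and a commentary string (dummy outputs)."""
    trajectory = [(t * speed_mps, 0.0) for t in (0.5, 1.0, 1.5, 2.0)]   # (x, y) ahead, metres
    commentary = f"Following '{route_cmd}', holding lane below {speed_limit_kph} km/h."
    return trajectory, commentary

frames = np.zeros((4, 128, 256))                 # multi-camera stand-in
tokens = vision_model(frames)
traj, text = language_model(tokens, route_cmd="turn right", speed_mps=6.0, speed_limit_kph=30)
print(traj, text)   # closed loop: the trajectory is executed and the next frames re-enter
```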

1.4 AV2.0 End-to-End Approach

Wayve's AV2.0 philosophy replaces HD maps, hand-coded rules, and modular perception-prediction-planning pipelines with a single end-to-end deep learning architecture:

  • Foundation model backbone: Trained on petabyte-scale real-world driving datasets; encodes transferable driving behaviors as a "universal backbone"
  • Generalization approach:
    • Single global model navigated 500 cities across Europe, North America, and Asia with no fine-tuning
    • Adapted to US roads approaching UK performance with only 500 hours of incremental US data (8 weeks of collection)
    • Rapidly learned US-specific behaviors: right turns on red, freeway merging, unprotected left turns
    • Japan research hub opened April 2025 (SoftBank-backed) for local adaptation
  • Commercial deployment timeline:
    • Uber partnership announced for L4 robotaxi trials in London (starting 2026)
    • Supervised autonomy software for consumer vehicles from 2027

1.5 Microsoft Azure Partnership

  • Partnership since 2020; Wayve uses Azure Storage, Azure Databricks, Azure AI infrastructure with AKS to connect thousands of GPUs into a flexible supercomputer
  • October 2025: Expanded commitment via new Azure services deal and Strategic Framework Agreement covering joint marketing, sales, and technology expansion to other OEMs
  • Microsoft is also an investor (participated in Series C and D)

1.6 Key Takeaways

| Aspect | Detail |
| --- | --- |
| World model paradigm | Latent diffusion (GAIA-2/3) replacing autoregressive (GAIA-1) |
| Scale trajectory | 9B -> ~15B parameters; 4,700 hrs -> 10x more data |
| Generalization | Foundation model approach; single model, 500 cities, no fine-tuning |
| Language integration | Closed-loop VLAM (LINGO-2) on public roads |
| Deployment | London robotaxi trials 2026 (with Uber) |
| Compute infrastructure | Microsoft Azure (thousands of GPUs) |

2. NVIDIA

2.1 Cosmos World Foundation Models

Announced at CES on January 6, 2025, Cosmos is NVIDIA's platform of generative world foundation models for physical AI.

Architecture

  • Cosmos Tokenizer (compression arithmetic sketched after this list):

    • Lightweight, computationally efficient architecture with temporally causal mechanism (causal temporal convolution + causal temporal attention)
    • Encoder: Starts with 2-level Haar wavelet transform (4x spatial + 4x temporal downsampling); symmetrical encoder-decoder pair
    • Two tokenizer types:
      • Continuous (C): Vanilla autoencoder for latent diffusion models
      • Discrete (D): Finite-Scalar-Quantization (FSQ) for autoregressive transformers
    • Image (I) and Video (V) variants for each type
    • Compression: Up to 2048x total compression (8x/16x spatial, 4x/8x temporal); 8x more compression than SOTA while maintaining higher quality; runs up to 12x faster
    • 12 pre-trained tokenizer models on Hugging Face
  • Cosmos Predict: Generates virtual world states from multimodal inputs (text, images, video)

    • Cosmos Predict-2 (March 2025): Multi-frame generation, predicts intermediate actions/motion trajectories from start/end images
    • Cosmos Predict-2.5: Latest version specialized for simulating and predicting future world states
  • Cosmos Transfer: Conditional generation with spatial control inputs

  • Cosmos Reason: Reasoning model for physical AI (early access)
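
The compression figures quoted for the Cosmos Tokenizer above multiply out per axis. The small worked example below uses the factors from those bullets; the clip dimensions are chosen only for illustration.

```python
# Worked example of the Cosmos tokenizer compression ratios quoted above.
# Total compression = spatial_h * spatial_w * temporal (e.g. 16 * 16 * 8 = 2048x).
def latent_grid(frames, height, width, spatial=16, temporal=8):
    """Approximate shape of the latent token grid, given per-axis compression factors."""
    return (frames // temporal, height // spatial, width // spatial)

for spatial, temporal in [(8, 4), (8, 8), (16, 8)]:
    total = spatial * spatial * temporal
    print(f"{spatial}x spatial, {temporal}x temporal -> {total}x total compression")

# e.g. a ~1-second (32-frame) 1024x1024 clip at 16x/8x maps to roughly a 4x64x64 latent grid:
print(latent_grid(frames=32, height=1024, width=1024))   # (4, 64, 64)
```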

Licensing

  • Source code: Apache 2.0 License
  • Models: NVIDIA Open Model License (commercially usable, attribution required: "Built on NVIDIA Cosmos")
  • Pre-training and post-training scripts available via NVIDIA NeMo Framework

AV-Specific Adoption

  • Waabi: Data curation for AV software development and simulation
  • Wayve: Edge/corner case driving scenario search
  • Foretellix: High-fidelity testing scenario generation
  • Nexar, Oxa: Advancing autonomous driving systems

2.2 Alpamayo (January 2026)

NVIDIA's family of open AI models, simulation tools, and datasets for reasoning-based AV development.

Alpamayo 1

  • Architecture: 10B-parameter chain-of-thought reasoning Vision-Language-Action (VLA) model
  • Function: Processes multi-camera video input to generate trajectories alongside natural language reasoning traces
  • Key capability: Acts as "implicit world model operating in semantic space" -- reasons through novel edge cases step-by-step (e.g., traffic light outage at busy intersection)
  • Deployment model: Teacher model -- developers fine-tune and distill it into their production AV stacks (see the distillation sketch below)
  • Availability: Weights and inference scripts on Hugging Face; future versions with larger parameters, more I/O flexibility, and commercial licensing options
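
Because Alpamayo 1 is positioned as a teacher model, a typical use is distillation into a smaller production policy. The sketch below is a generic teacher-student loop under that assumption; `teacher_vla`, `student_policy`, the output structure, and the loss are illustrative, not NVIDIA's released code or interfaces.

```python
# Hypothetical teacher -> student distillation loop for an Alpamayo-style reasoning VLA.
from dataclasses import dataclass

@dataclass
class TeacherOutput:
    reasoning_trace: str          # chain-of-thought text justifying the manoeuvre
    trajectory: list              # list of (x, y, t) waypoints

def teacher_vla(multi_camera_clip) -> TeacherOutput:
    """Stand-in for the large reasoning VLA (would run offline over logged clips)."""
    return TeacherOutput(
        reasoning_trace="Traffic light unlit; yield, then proceed when the intersection is clear.",
        trajectory=[(0.0, 0.0, 0.0), (1.5, 0.0, 0.5), (3.5, 0.1, 1.0)],
    )

def student_policy(multi_camera_clip):
    """Stand-in for the compact production model being distilled."""
    return [(0.0, 0.0, 0.0), (1.4, 0.0, 0.5), (3.2, 0.2, 1.0)]

def distillation_loss(student_traj, teacher_traj):
    """Imitate the teacher's waypoints; the reasoning text could supervise an auxiliary head."""
    return sum((sx - tx) ** 2 + (sy - ty) ** 2
               for (sx, sy, _), (tx, ty, _) in zip(student_traj, teacher_traj))

clip = None                                   # stands in for a logged multi-camera video clip
loss = distillation_loss(student_policy(clip), teacher_vla(clip).trajectory)
print(f"distillation loss: {loss:.3f}")
```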

AlpaSim

  • Fully open-source simulation framework (GitHub)
  • Microservice architecture: Separate processes for Driver (inference), Renderer (perception), TrafficSim (traffic), Controller (vehicle control), Physics (dynamics)
  • Pipeline parallelism: While one scene renders, the driver runs inference for another, dramatically improving GPU utilization (sketched below)
  • gRPC-based interfaces; horizontal GPU scaling; ~900 reconstructed scenes
  • Supports both open-loop training and closed-loop refinement via RoaD algorithm (mitigates covariate shift)
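
The pipeline-parallelism idea referenced above can be sketched generically: rendering of the next scene overlaps with driver inference on the previous one. The two functions stand in for the Renderer and Driver services and are not AlpaSim's actual gRPC interfaces.

```python
# Generic sketch of AlpaSim-style pipeline parallelism between rendering and inference.
from concurrent.futures import ThreadPoolExecutor
import time

def renderer(scene_id):          # Renderer service: produce sensor frames (GPU-bound)
    time.sleep(0.05)
    return f"frames[{scene_id}]"

def driver(frames):              # Driver service: run policy inference on the frames
    time.sleep(0.05)
    return f"controls for {frames}"

def run_pipelined(num_scenes=4):
    with ThreadPoolExecutor(max_workers=2) as pool:
        pending_render = pool.submit(renderer, 0)
        for scene in range(num_scenes):
            frames = pending_render.result()
            if scene + 1 < num_scenes:                 # start rendering the next scene now...
                pending_render = pool.submit(renderer, scene + 1)
            print(driver(frames))                      # ...while the driver infers on this one

run_pipelined()
```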

Physical AI Open Datasets

  • 1,727 hours across 25 countries, 2,500+ cities
  • 310,895 clips (20s each), multi-camera + LiDAR + radar
  • Designed for robust VLA training and assessment

Industry Adoption

  • JLR, Lucid Motors, Uber, Berkeley DeepDrive

2.3 DRIVE Sim & Omniverse

  • DRIVE Sim on Omniverse: Transitioned AV simulation from game engine to simulation engine foundation
  • OpenUSD integration: Unified data framework enabling seamless interoperability of simulation assets; layer-stacking and composition arcs for asynchronous collaboration
  • Omniverse Blueprint for AV Simulation: Standardized API-driven workflow for digital twins, sensor data replay, ground-truth generation, closed-loop testing
  • Capabilities: Software-in-the-loop and hardware-in-the-loop configurations; complex sensor suite simulation across GPUs and nodes
  • Users: Foretellix, Mcity, Oxa, Parallel Domain, Plus AI, Uber

2.4 NVIDIA Halos Safety Platform

Comprehensive safety platform integrating full automotive hardware/software stack with AI safety research. Cosmos models deliver safety enhancements through extensive scenario generation and validation.

2.5 Partnership Ecosystem

NVIDIA's AV platform is adopted by:

  • OEMs: Mercedes-Benz, JLR, Lucid, and others
  • AV companies: Zoox, Aurora, Wayve, Waabi
  • Hardware: DRIVE AGX Orin (current), DRIVE AGX Thor (forthcoming)
  • Aurora partnership: Dual NVIDIA DRIVE Thor SoC configuration with DriveOS for the Aurora Driver

3. Tesla

3.1 FSD Architecture Evolution

Tesla's Full Self-Driving system has undergone a fundamental architectural transformation:

Phase 1: Modular Stack (FSD v1--v11)

  • HydraNet: Multi-task neural network for object detection, lane lines, signs
  • Occupancy Networks (2022): 3D volumetric representation of the environment; complemented HydraNet for perceiving occupied space
  • Explicit planning module: C++ rule-based planner consuming perception outputs
  • Result: ~300,000 lines of C++ code for planning and control

Phase 2: End-to-End (FSD v12, 2024)

  • Replaced 300,000 lines of C++ with a single neural network pipeline
  • Architecture: 48 distinct neural networks processing inputs from 8 cameras (360-degree coverage)
  • Transforms 2D camera images into 3D spatial understanding via BEV transformations and occupancy networks
  • Direct output: Steering, acceleration, braking commands from raw camera inputs
  • Training: ~70,000 GPU-hours per cycle; 1.5+ petabytes of driving data from 4M+ vehicle fleet

Phase 3: Unified E2E (FSD v13, Summer 2025)

  • Temporal-Voxel Transformer: Prioritizes long-term memory over instantaneous reaction
  • Temporal Transformers: Recursive buffer of the last 10 seconds of video; maintains a persistent 3D "Voxel Map" of the surroundings (see the sketch after this list)
  • Occupancy Network v3: Upgraded perception; remembers obscured agents' trajectories and velocities
  • AI4 hardware: Higher NPU throughput, native FP16 (16-bit floating point) execution
  • Training approach: Millions of video clips of exemplary human driving; learns nuanced behaviors (subtle steering near trucks, gentle deceleration at stale green lights)
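
Tesla has not published implementation details, so the sketch referenced in the list above is only a toy illustration of a rolling 10-second buffer feeding a persistent voxel memory; the frame rate, grid size, and decay rule are assumptions.

```python
# Toy illustration of a rolling video buffer plus a persistent voxel memory.
# Frame rate, grid size, and update rule are assumptions, not Tesla's implementation.
from collections import deque
import numpy as np

FPS = 36                                   # assumed camera rate
buffer = deque(maxlen=10 * FPS)            # "last 10 seconds of video"
voxel_map = np.zeros((200, 200, 16))       # persistent occupancy memory around the ego

def update(frame, detections):
    """Append the newest frame and decay/refresh the voxel memory."""
    global voxel_map
    buffer.append(frame)
    voxel_map *= 0.99                       # slow decay keeps recently seen, now-occluded agents
    for (i, j, k), occupied in detections:  # detections: [((i, j, k), bool), ...]
        voxel_map[i, j, k] = 1.0 if occupied else 0.0

update(frame=np.zeros((8, 128, 256)), detections=[((100, 105, 2), True)])
print(len(buffer), voxel_map.max())         # 1 1.0
```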

3.2 Tesla's Implicit World Model

Tesla does not deploy an explicit, standalone world model in the GAIA/Cosmos sense. Instead, FSD contains an implicit world model embedded within its end-to-end architecture:

  • The occupancy network + temporal transformer combination functions as an internal predictive model of the environment
  • The voxel map with memory effectively predicts where agents will be based on observed trajectories
  • Training on human driving videos implicitly teaches the network physics, agent behavior, and traffic dynamics

3.3 Autolabeling Pipeline & Data Engine

  • Shadow mode: Each vehicle runs two FSD systems -- one drives, the other evaluates in the background, flagging disagreements and inaccuracies (sketched after this list)
  • Fleet-sourced mining: When inaccuracies are confirmed, fleet is queried for similar examples
  • Human-in-the-loop: Mined examples are correctly labeled and fed back into training
  • Synthetic data: GPU-accelerated path-tracing engine generates synthetic frames matching real camera response curves (global illumination, accurate sensor simulation)
  • Scale: 3 billion FSD miles by January 2025; 500M km European data 2024--2025; 400% compute increase in 2024
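
A generic version of the shadow-mode disagreement loop referenced above is sketched below; the deviation metric and threshold are assumptions, not Tesla's criteria.

```python
# Generic sketch of a shadow-mode data engine: the onboard planner drives, a candidate
# model runs in the background, and large disagreements are flagged for fleet mining.
import numpy as np

DISAGREEMENT_THRESHOLD_M = 0.75     # assumed deviation that triggers a clip upload

def disagreement(prod_traj: np.ndarray, shadow_traj: np.ndarray) -> float:
    """Max pointwise distance between the executed and shadow trajectories."""
    return float(np.max(np.linalg.norm(prod_traj - shadow_traj, axis=1)))

def on_drive_segment(prod_traj, shadow_traj, clip_id, triage_queue):
    if disagreement(prod_traj, shadow_traj) > DISAGREEMENT_THRESHOLD_M:
        # Upload the clip; later stages mine the fleet for similar scenes,
        # auto-label them, verify with humans, and add them to the training set.
        triage_queue.append(clip_id)

queue = []
prod = np.array([[0, 0], [5, 0.0], [10, 0.2]], dtype=float)
shadow = np.array([[0, 0], [5, 0.4], [10, 1.5]], dtype=float)
on_drive_segment(prod, shadow, clip_id="clip_000123", triage_queue=queue)
print(queue)                        # ['clip_000123']
```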

3.4 Simulation Approach

  • Virtual proving ground generating perfectly labeled synthetic data
  • Simulates visual properties of cameras (camera response curves, lens distortion)
  • Edge case scenarios feed into training pipeline
  • Closed-loop simulation for safety validation

3.5 Dojo Supercomputer

  • Custom AI training supercomputer designed for video-based neural network training
  • Status: Dojo project was reportedly disbanded August 2025, then restarted January 2026
  • Dojo 2: Expected to operate "at scale" (~100k H100 equivalent) sometime in 2026
  • Designed to compress training cycles from days/weeks to hours

3.6 Current FSD Status (March 2026)

  • FSD v13 deployed fleet-wide (Summer 2025)
  • FSD v14.x series in active development/testing
  • Europe rollout ongoing with expanded data collection
  • Robotaxi ambitions continue alongside consumer FSD

4. Waymo

4.1 EMMA (End-to-End Multimodal Model for Autonomous Driving)

Published October 2024; represents Waymo's research direction toward unified end-to-end autonomy.

Architecture

  • Foundation: Built on Google's Gemini multimodal large language model
  • Input: Raw camera sensor data (images)
  • Output unification: All non-sensor inputs (navigation instructions, ego status) and outputs (trajectories, 3D locations) represented as natural language text
  • Task formulation: Jointly processes multiple driving tasks in a unified language space using task-specific prompts (see the sketch after this list):
    • Motion planning (planner trajectories)
    • Perception (object detection)
    • Road graph estimation
  • Co-training benefit: Single co-trained EMMA matches or surpasses individually trained task-specific models
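
The unified language-space formulation can be illustrated with toy task prompts and a text-decoded trajectory; the prompt wording and coordinate format below are assumptions, not Waymo's actual templates.

```python
# Toy illustration of EMMA-style task unification: every task is a text prompt over the
# same multimodal model, and outputs (waypoints, boxes) are decoded back out of text.
import re

def build_prompt(task: str, nav_instruction: str, ego_status: str) -> str:
    templates = {
        "planning":   "Task: motion planning. Ego: {ego}. Instruction: {nav}. "
                      "Answer with future waypoints as (x, y) in metres.",
        "detection":  "Task: 3D object detection. Ego: {ego}. "
                      "Answer with one object per line: class, x, y, z, l, w, h, heading.",
        "road_graph": "Task: road graph estimation. Ego: {ego}. "
                      "Answer with lane polylines as lists of (x, y).",
    }
    return templates[task].format(ego=ego_status, nav=nav_instruction)

def parse_waypoints(text: str):
    """Decode '(x, y)' pairs from the model's text answer."""
    return [(float(a), float(b)) for a, b in re.findall(r"\(([-\d.]+),\s*([-\d.]+)\)", text)]

prompt = build_prompt("planning", "continue straight, then turn right", "v=8.2 m/s, yaw=0.01")
fake_answer = "(4.1, 0.0) (8.3, 0.1) (12.4, 0.6)"      # what the language model would emit
print(prompt)
print(parse_waypoints(fake_answer))
```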

Performance

  • State-of-the-art motion planning on nuScenes
  • Competitive results on Waymo Open Motion Dataset (WOMD)
  • Competitive camera-primary 3D object detection on Waymo Open Dataset

Limitations

  • Processes only small number of image frames
  • No LiDAR/radar input (camera-only)
  • Computationally expensive
  • Future plans: 3D sensing encoders for LiDAR/radar integration

4.2 Waymo World Model (February 2026)

  • Foundation: Built on Google DeepMind's Genie 3 -- the most advanced general-purpose world model for photorealistic, interactive 3D environments
  • Adaptation: Genie's broad world knowledge is transferred from 2D video into 3D lidar outputs compatible with Waymo's hardware suite
  • Multi-sensor output: Generates high-fidelity camera AND lidar data simultaneously
  • Controllability (see the counterfactual replay sketch after this list):
    1. Driving actions: Counterfactual "what-if" scenario testing
    2. Scene layout: Road configurations, traffic signals, agent behavior
    3. Language control: Time-of-day, weather, fully synthetic scenes
  • Rare event simulation: Can simulate events never observed by the fleet -- tornadoes, animal encounters, etc.
  • Efficient variant: Reduced computational demands for longer simulations while maintaining fidelity
  • Context: Waymo has logged ~200 million fully autonomous miles; billions of virtual miles
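
The counterfactual "what-if" controllability can be sketched as a replay loop that perturbs logged ego actions before asking the world model to regenerate camera and lidar; `generate` and the data layout below are placeholder assumptions, not Waymo's API.

```python
# Illustrative counterfactual replay: take a logged segment, perturb the ego actions,
# and ask a world model to re-render camera + lidar for the altered rollout.
def counterfactual_rollout(world_model, logged_segment, brake_earlier_s=1.0):
    actions = list(logged_segment["ego_actions"])
    for t, a in enumerate(actions):
        if a["t"] >= logged_segment["event_time"] - brake_earlier_s:
            actions[t] = {**a, "accel": min(a["accel"], -3.0)}   # inject earlier, harder braking
    return world_model.generate(scene=logged_segment["scene"],
                                ego_actions=actions,
                                outputs=("camera", "lidar"))

class StubWorldModel:
    def generate(self, scene, ego_actions, outputs):
        return {o: f"{o} frames for {len(ego_actions)} steps" for o in outputs}

segment = {"scene": "downtown_scene_0042",
           "event_time": 6.0,
           "ego_actions": [{"t": t * 0.5, "accel": 0.0} for t in range(20)]}
print(counterfactual_rollout(StubWorldModel(), segment))
```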

4.3 MultiPath++ (Motion Forecasting)

Published at ICRA 2022; a key component of Waymo's prediction stack.

  • Input representation: Compact polylines for road features; raw agent state (position, velocity, acceleration) -- departed from dense image-based approaches
  • Multi-Context Gating (MCG) Fusion: Cross-attention-like mechanism for context-aware fusion between agents and road elements
  • Latent anchor embeddings: Learned end-to-end (replacing static pre-defined anchors)
  • Agent encoding: LSTM for sequential state history
  • Output: Gaussian Mixture Model priors for multi-modal trajectory prediction (see the sketch after this list)
  • Performance: SOTA on Argoverse Motion Forecasting and WOMD challenges
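
The Gaussian-mixture output referenced above can be made concrete with a tiny example of evaluating a two-mode mixture over an agent's future position; all numbers are illustrative, not MultiPath++ parameters.

```python
# Toy GMM trajectory head in the spirit of MultiPath++: each learned anchor ("mode")
# carries a probability and a per-step 2D Gaussian. Numbers are illustrative only.
import numpy as np

def gmm_log_likelihood(point, modes):
    """log p(point) under a mixture of isotropic 2D Gaussians, one per mode."""
    densities = []
    for prob, mean, sigma in modes:
        d2 = np.sum((point - mean) ** 2)
        densities.append(prob * np.exp(-d2 / (2 * sigma**2)) / (2 * np.pi * sigma**2))
    return float(np.log(sum(densities)))

# Two predicted modes for an agent's position 3 s ahead: "continue straight" vs "turn right".
modes_at_3s = [
    (0.7, np.array([24.0, 0.3]), 1.2),     # (mode probability, mean xy, std dev)
    (0.3, np.array([18.0, -6.0]), 1.8),
]
observed = np.array([23.1, 0.8])
print(gmm_log_likelihood(observed, modes_at_3s))   # training minimises the negative of this
```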

4.4 Waymo Open Sim Agents Challenge (WOSAC)

  • First public challenge for multi-agent traffic simulation
  • Goal: Design realistic simulators for evaluating and training AV behavior models
  • Waymax simulator: Made available to research community (2024)
  • 2025 challenge: Top methods include SMART-R1's alternating SFT-RFT-SFT pipeline achieving SOTA realism and safety

4.5 Waymo's Overall Architecture

Waymo's production system (the "Waymo Driver") remains a modular architecture with perception, prediction, planning, and control as distinct components, augmented by:

  • Extensive sensor suite (cameras, LiDAR, radar)
  • HD maps for operational domains
  • Massive simulation infrastructure

EMMA and the Waymo World Model represent research directions that may gradually integrate into the production stack. The production system benefits from the most real-world autonomous miles of any company (~200M+).


5. Comma.ai

5.1 Overview

Comma.ai's openpilot is the leading open-source advanced driver assistance system, supporting 325+ car models with 10,000+ active users and 100M+ miles driven.

5.2 End-to-End Architecture

openpilot uses a system-level end-to-end design -- a single neural network predicting car trajectory directly from camera images, trained on real-world driving data uploaded by users.

Vision Backbone Evolution

  • Previous: EfficientNet-based
  • Current: FastViT (Hybrid Vision Transformer) -- biggest model improvement since 0.9.0 redesign

Model Architecture (openpilot 0.10+)

  • MPC systems entirely removed (both lateral and longitudinal)
  • Model directly outputs executable trajectories
  • End-to-end training with world model supervision

5.3 World Model Approach

Comma.ai's world model paper ("Learning to Drive from a World Model," CVPR 2025) details their approach:

World Model Architecture

  • Core: Diffusion Transformer (DiT) operating on compressed video
  • Video encoding: Pretrained Stable Diffusion VAE (8x compression per spatial axis, 4 latent channels)
  • Model sizes tested: 250M, 500M (primary), 1B parameters
  • Conditioning: Vehicle poses, world timesteps, diffusion noise timesteps fused via Adaptive Layer Norm
  • Causal masking: Block-wise triangular attention prevents future frame access; enables KV caching (see the sketch after this list)
  • 3D extension: DiT adapted to 3D inputs by extending the patch embedding to 3D, then flattening before the transformer blocks
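
The block-wise causal mask referenced above is simple to construct: every token may attend to tokens in its own frame and in earlier frames, never later ones, which is what makes KV caching possible. A minimal version (frame and token counts are illustrative):

```python
# Minimal block-wise causal attention mask of the kind described for the DiT world model.
import numpy as np

def block_causal_mask(num_frames: int, tokens_per_frame: int) -> np.ndarray:
    """True where attention is allowed (query row -> key column)."""
    frame_idx = np.repeat(np.arange(num_frames), tokens_per_frame)
    return frame_idx[:, None] >= frame_idx[None, :]

mask = block_causal_mask(num_frames=3, tokens_per_frame=2)
print(mask.astype(int))
# [[1 1 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 0 0]
#  [1 1 1 1 1 1]
#  [1 1 1 1 1 1]]
```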

Training

  • Rectified Flow objective: Minimizes image reconstruction loss + multi-hypothesis planning loss (5 trajectory hypotheses with Laplace priors; see the loss sketch after this list)
  • Noise level augmentation: Addresses autoregressive drift (30% of samples, LogitNormal distribution)
  • Data: 100k--400k one-minute video segments at 5 Hz, 128x256 resolution
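
The multi-hypothesis planning loss referenced above is typically a winner-takes-all objective over K candidate plans with Laplace likelihoods. The sketch below is a generic formulation under those assumptions, not comma.ai's exact loss.

```python
# Simplified multi-hypothesis planning loss: K trajectory hypotheses with Laplace
# likelihoods; the best-matching hypothesis receives the regression gradient and a
# classification term rewards picking it.
import numpy as np

def laplace_nll(pred, target, b=0.5):
    """Negative log-likelihood of a Laplace distribution with scale b, summed over steps."""
    return float(np.sum(np.abs(pred - target) / b + np.log(2 * b)))

def multi_hypothesis_loss(hypotheses, hyp_logits, target):
    """hypotheses: (K, T, 2) candidate plans; hyp_logits: (K,) scores; target: (T, 2)."""
    nlls = np.array([laplace_nll(h, target) for h in hypotheses])
    best = int(np.argmin(nlls))                             # winner-takes-all regression
    log_probs = hyp_logits - np.log(np.sum(np.exp(hyp_logits)))
    return nlls[best] - log_probs[best]                     # regression + mode classification

K, T = 5, 10
hyps = np.random.default_rng(0).normal(size=(K, T, 2))
target = np.zeros((T, 2))
print(multi_hypothesis_loss(hyps, hyp_logits=np.zeros(K), target=target))
```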

Future-Anchored World Model

  • Non-causal variant conditioned on future observations
  • Creates "recovery pressure" -- trajectories converge to goal states without explicit recovery behavior
  • Model predicts both images and action trajectories; Plan Model generates supervision signals

Simulation Comparison

| Aspect | Reprojective Simulation | World Model Simulation |
| --- | --- | --- |
| Mechanism | 3D depth reprojection + inpainting | Learned video prediction |
| Dynamic scenes | Static scene assumption (fails) | Handles dynamic environments |
| Night driving | Severe lighting/reflection artifacts | Implicit handling |
| Range | Limited (<4 m translation) | No range limit |
| Shortcut learning | Pose-correlated artifacts exploitable | Generalizable |
| Scalability | Limited | Scales with compute and data |

Information Bottleneck

  • White Gaussian noise limits feature representation to ~700 bits
  • Prevents exploitation of simulator artifacts; enforces generalizable feature learning
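
One way to reason about such a limit is to treat the noisy feature vector as a Gaussian channel, whose capacity is D · ½ · log2(1 + SNR) bits. The dimensions and noise level below are assumptions chosen only to show the arithmetic; they are not taken from the paper.

```python
# Illustrative capacity calculation for a white-Gaussian-noise information bottleneck:
# adding noise of variance sigma^2 to a D-dimensional feature vector with signal power P
# caps the transmitted information at D * 0.5 * log2(1 + P / sigma^2) bits.
import numpy as np

def bottleneck_capacity_bits(dim: int, signal_power: float, noise_var: float) -> float:
    return dim * 0.5 * np.log2(1.0 + signal_power / noise_var)

def apply_bottleneck(features: np.ndarray, noise_var: float, rng=np.random.default_rng(0)):
    """Noisy features passed from the world-model simulator to the driving policy."""
    return features + rng.normal(scale=np.sqrt(noise_var), size=features.shape)

print(bottleneck_capacity_bits(dim=512, signal_power=1.0, noise_var=0.15))  # ~750 bits here
noisy = apply_bottleneck(np.ones(512), noise_var=0.15)
```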

5.4 Real-World Results

  • Closed-loop (MetaDrive): World model on-policy achieves 24/24 lane-centering, 19/20 lane changes
  • Real-world deployment (500 users, ~2 months):
    • World model: 29.92% engaged time, 52.49% distance engagement
    • Reprojective: 27.63% engaged time, 48.10% distance engagement

5.5 Recent Evolution

  • openpilot 0.10: World model ("Tomb Raider") supervises driving model during training
  • openpilot 0.11: First driving model trained using both videos AND plans generated from a world model
  • Fully open-source: all code on GitHub (commaai/openpilot)

6. Waabi

6.1 UniSim: Neural Closed-Loop Sensor Simulator

Waabi, founded by Raquel Urtasun (former Uber ATG chief scientist), takes an "AI-first" approach centered on simulation.

Architecture

  • Takes pre-recorded real-world sensor logs and converts them into modifiable digital twin simulations
  • Generates both camera and LiDAR point clouds with temporal and spatial consistency
  • Neural rendering approach (related to NeRF family) for photorealistic scene reconstruction

Key Capabilities

  • Counterfactual scenario generation: "What would happen if the car in front cut in more aggressively?"
  • Cross-platform transfer: Data from a car's sensors can simulate what a truck's sensors would see
  • Closed-loop testing: Runs full autonomy stack in reactive, immersive simulation
  • Multi-sensor consistency: Camera + LiDAR outputs are temporally and spatially coherent

6.2 Waabi World

  • Comprehensive closed-loop simulator for training and testing AV systems primarily in virtual environments
  • Uses generative AI to create millions of driving scenarios
  • Focused on autonomous trucking
  • Fully driverless truck operations planned by end of 2025
  • Raised $200M; named to CNBC Disruptor 50 (2025)

6.3 Integration with NVIDIA Ecosystem

  • Evaluating NVIDIA Cosmos for data curation in AV software development
  • Uses NVIDIA GPUs for training and simulation
  • Cosmos partnership for scenario generation

7. Zoox, Aurora, Cruise

7.1 Zoox (Amazon)

Architecture

  • Purpose-built vehicle: Symmetrical, bidirectional electric robotaxi designed from ground up (not retrofitted)
  • Sensor suite: Cameras, LiDAR, radar, and longwave-infrared cameras at four corners providing overlapping 360-degree FOV (100m+ range)
  • AI stack: Traditional three-stage pipeline: Perception -> Prediction -> Planning
    • Perception: Deep neural networks extract lanes, signs, lights, objects, pedestrians from fused sensor input
    • Prediction: CNN processes bird's-eye view with ~60 semantic layers; enhanced by Graph Neural Networks for explicit agent relationship encoding; outputs probability distribution of trajectories per agent
    • Planning: Consumes predictions to generate safe, efficient routes
  • Emerging direction: Integrating prediction with planning for conditional reasoning ("if I do X, how will agents react?"; sketched after this list)
  • Simulation: Adversarial simulations testing edge cases
  • Computing: NVIDIA GPU-based
  • Status: Public robotaxi service launched San Francisco, November 2025 ("Zoox Explorers" program)
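
The conditional-reasoning direction noted above can be sketched as a prediction query issued once per candidate ego plan; the reaction model below is a trivial heuristic standing in for the learned CNN/GNN predictor, and all names and numbers are illustrative.

```python
# Illustrative "conditional prediction" query: the planner asks how other agents would
# react to each candidate ego plan before choosing one.
def predict_agents(agents, ego_plan):
    """Return each agent's predicted speed given the ego's candidate plan (toy model)."""
    reactions = {}
    for name, agent in agents.items():
        gap = agent["gap_to_ego_m"] - ego_plan["advance_m"]
        reactions[name] = max(0.0, agent["speed_mps"] * min(1.0, gap / 10.0))  # slow if squeezed
    return reactions

agents = {"car_behind": {"gap_to_ego_m": 12.0, "speed_mps": 9.0},
          "cyclist":    {"gap_to_ego_m": 6.0,  "speed_mps": 4.0}}
for plan in ({"name": "merge_now", "advance_m": 8.0}, {"name": "wait", "advance_m": 1.0}):
    print(plan["name"], predict_agents(agents, plan))
```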

7.2 Aurora Innovation

Aurora Driver

  • Level: SAE L4 autonomous driving system
  • Focus: Autonomous freight trucking (Texas corridor)
  • Hardware: Dual NVIDIA DRIVE Thor SoC configuration with DriveOS
  • Key sensor: Proprietary FirstLight LiDAR -- detects objects in the dark at 450 m+ range and identifies pedestrians 11 seconds sooner than human drivers
  • Architecture: Common core design enabling integration across multiple truck platforms (modular, consolidated components)
  • Partnerships: Continental (mass manufacture 2027), Volvo Autonomous Solutions
  • Status: Surpassed 100,000 driverless miles by October 2025; expanding routes in Southwestern US

7.3 Cruise (GM)

Timeline

  • October 2023: San Francisco accident (vehicle dragged pedestrian) triggered regulatory and reputational crisis
  • December 2024: GM stopped funding Cruise robotaxi development; 1,000 employees dismissed (50% workforce reduction)
  • February 2025: Cruise merged into GM; pivot from robotaxis to personal vehicle autonomy
  • December 2025: GM attempted revival; hired former Tesla executive Ronalee Mann
  • Current direction: Developing hands-off/eyes-off features for consumer GM vehicles (personal autonomy, not robotaxi); input from former Tesla Autopilot chief Sterling Anderson

8. Hugging Face LeRobot & Open-Source Ecosystem

8.1 LeRobot

  • Founded: 2024, led by ex-Tesla lead Remi Cadene
  • Growth: 0 to 12,000+ GitHub stars in first year
  • Mission: Lowering barriers to robotics with open models, datasets, and tools in PyTorch

Key Capabilities (v0.4.0)

  • Models: PI0.5, GR00T N1.5 (VLA models), and other imitation learning / reinforcement learning approaches
  • Datasets v3.0: Scalable data infrastructure
  • Plugin system: Easier hardware integration
  • SmolVLA: Compact VLA model running on CPUs and single consumer GPUs -- lowers barrier to generalist robotics

Ecosystem Growth

  • Robotics datasets on Hugging Face: 1,145 (2024) -> 26,991 (2025) -- now the single largest dataset category on the Hub
  • Hugging Face acquired Pollen Robotics to sell open-source robots

8.2 Relevance to AV

While LeRobot is primarily focused on manipulation robotics, the VLA architectures (PI0.5, GR00T N1.5, SmolVLA) share architectural DNA with AV systems:

  • Vision-Language-Action paradigm applicable to driving
  • Open datasets and training infrastructure transferable
  • Imitation learning / on-policy RL approaches directly relevant
  • Community-driven development model

9. Applied Intuition & Ghost Autonomy

9.1 Applied Intuition

Platform

  • AI-powered ADAS/AD toolchain, vehicle platform, and autonomy stack
  • Customers: 18 of top 20 global automakers (Toyota, Nissan, Porsche, VW Group, Stellantis, Valeo)
  • Valuation: $6B+ (as of 2024)

SDS for Automotive (2025)

  • End-to-end ADAS stack: Unified neural architecture integrating perception, planning, control
  • White-box transparency: Full source code access for OEMs; customizable, validatable, traceable
  • Silicon-agnostic: Camera-heavy design supporting 1 to 11 cameras, optional radars
  • Feature range: L2++ with pathway to L3/L4
  • Regulatory compliance: ISO 26262, SOTIF, ISO/PAS 8800:2024
  • Scale: 50M+ simulations covering billions of driving miles

Simulation Infrastructure

  • Neural simulation, AI agents, full vehicle testing
  • Large-scale data and ML infrastructure
  • Enables 4x faster development cycles
  • NVIDIA collaboration: ADAS stack optimized for DRIVE AGX Orin and forthcoming DRIVE AGX Thor

9.2 Ghost Autonomy

  • Ghost Autonomy: Autonomous driving software developer that built a substantial AV patent portfolio
  • October 2024: Applied Intuition acquired Ghost Autonomy's patent portfolio
  • Strategic IP acquisition strengthening Applied Intuition's freedom to innovate
  • Demonstrates consolidation trend in AV industry

10. Comparative Analysis & Lessons for Airside AV

10.1 Architecture Taxonomy

| Company | World Model Type | E2E Approach | Primary Modality | Open/Proprietary |
| --- | --- | --- | --- | --- |
| Wayve | Explicit generative (GAIA, latent diffusion) | Full E2E (AV2.0) | Camera-primary | Proprietary |
| NVIDIA | Foundation model platform (Cosmos) + reasoning VLA (Alpamayo) | Provides tools, not a production stack | Multi-modal | Open (Cosmos: NVIDIA Open Model License; Alpamayo: Hugging Face) |
| Tesla | Implicit (embedded in E2E network) | Full E2E (FSD v12+) | Camera-only | Proprietary |
| Waymo | Explicit generative (Genie 3-based) | Research E2E (EMMA); production modular | Camera + LiDAR + radar | Research papers; proprietary production |
| Comma.ai | Explicit generative (DiT-based) | Full E2E (openpilot) | Camera-primary | Fully open source (MIT) |
| Waabi | Neural simulator (UniSim) | Simulation-first | Camera + LiDAR | Proprietary |
| Zoox | N/A (traditional simulation) | Modular (perception -> prediction -> planning) | Camera + LiDAR + radar + IR | Proprietary |
| Aurora | N/A | Modular with proprietary LiDAR | Camera + LiDAR + radar | Proprietary |
| Applied Intuition | Neural simulation | E2E ADAS stack (SDS) | Camera-primary (flexible) | White-box commercial |

10.2 Key Trends

  1. Latent diffusion is winning: Both Wayve (GAIA-2/3) and Comma.ai have moved from autoregressive to diffusion-based world models for superior temporal consistency and motion quality

  2. VLA models are the frontier: NVIDIA Alpamayo, Wayve LINGO-2, and Waymo EMMA all represent Vision-Language-Action architectures that unify perception, reasoning, and action in language space

  3. World models serve dual purpose: Both training (data augmentation, on-policy learning) AND evaluation (safety validation, scenario generation)

  4. Foundation model generalization works: Wayve's 500-city demo and rapid US adaptation with 500 hours of data validate the foundation model approach for geographic transfer

  5. Open-source ecosystem is maturing: NVIDIA Cosmos/Alpamayo, Comma.ai openpilot, HuggingFace LeRobot create a viable open stack

10.3 What Can Be Borrowed/Adapted for Airside AV

From Wayve

  • Foundation model approach to generalization is directly applicable: Train on diverse airport environments, then fine-tune per airport with minimal local data
  • GAIA-style world models for safety-critical scenario generation (rare events: FOD, aircraft pushback conflicts, emergency vehicles on tarmac)
  • Embodiment transfer (GAIA-3): Same model evaluates across different GSE vehicle types
  • LINGO-2 language conditioning: Adaptable to airside commands ("hold at runway crossing," "proceed to gate B12")

From NVIDIA

  • Cosmos as training data engine: Fine-tune Cosmos for airport environments; generate synthetic ramp scenarios, weather conditions, lighting variations
  • Alpamayo reasoning VLA: Chain-of-thought reasoning for complex airside decisions (right-of-way at taxiway intersections, responding to marshaller signals)
  • AlpaSim framework: Open-source simulation adaptable to airport operations
  • Physical AI datasets: While road-focused, demonstrate data curation practices transferable to airside data collection
  • DRIVE Sim + Omniverse: OpenUSD-based digital twins of airports for simulation; reconstruct specific ramp layouts

From Tesla

  • Data engine paradigm: Fleet-sourced learning from operational airside vehicles; shadow mode for identifying edge cases
  • Autolabeling pipeline: Reduce annotation costs for airside-specific objects (aircraft types, GSE, marshalling signals)
  • Implicit world model in E2E: Demonstrates that explicit world models aren't strictly necessary -- sufficiently large E2E networks can learn physics and dynamics implicitly
  • Temporal voxel approach: Particularly relevant for airside where remembering obscured agents (behind aircraft) is critical

From Waymo

  • EMMA-style multimodal model: Airport operations have rich text/procedure context (NOTAM, ATC instructions) that can be unified with visual input
  • Waymo World Model (Genie 3-based): Simulating rare airside events (engine blast, jet wash, fuel spill)
  • MultiPath++ forecasting: Aircraft and GSE trajectory prediction with multi-modal uncertainty
  • WOSAC-style benchmarks: Could establish standardized airside simulation challenges

From Comma.ai

  • Open-source E2E as starting point: openpilot's architecture (FastViT + temporal transformer) is a proven, deployable E2E backbone
  • World model training methodology: DiT-based world model with future anchoring is directly transferable
  • Information bottleneck technique: Prevents overfitting to simulation artifacts -- critical for sim-to-real transfer in airside environments
  • Real-world validation methodology: Engagement metrics and on-policy learning framework

From Waabi

  • UniSim for cross-platform transfer: Record data from one GSE vehicle type, simulate what another vehicle type's sensors would see
  • Simulation-first development: Particularly attractive for airside where real-world testing is expensive and restricted
  • Counterfactual safety testing: "What if aircraft pushed back earlier?" scenarios

From Applied Intuition

  • White-box stack for regulatory compliance: Airside operations face aviation safety standards (DO-178C, ARP4754A) requiring traceable, inspectable systems
  • Silicon-agnostic design: Important for GSE manufacturers with varying compute constraints
  • OEM integration model: Demonstrates how to deliver autonomy technology to vehicle manufacturers (analogous to GSE OEMs)

10.4 Open-Source vs. Proprietary Considerations

Strong Open-Source Options Available

| Component | Open-Source Option | License | Notes |
| --- | --- | --- | --- |
| World foundation models | NVIDIA Cosmos | NVIDIA Open Model License (commercial OK) | Best-in-class tokenizers + generation |
| Reasoning VLA | NVIDIA Alpamayo 1 | Hugging Face (research; commercial TBD) | 10B parameters, chain-of-thought |
| E2E driving stack | Comma.ai openpilot | MIT License | Production-proven, 100M+ miles |
| Simulation framework | NVIDIA AlpaSim | Open source (GitHub) | Microservice architecture, scalable |
| Robotics VLA models | Hugging Face LeRobot | Apache 2.0 / MIT | SmolVLA, PI0.5, GR00T N1.5 |
| Driving datasets | NVIDIA Physical AI AV | Open | 1,727 hours, 25 countries |

Proprietary Components Worth Licensing

| Component | Provider | Value Proposition |
| --- | --- | --- |
| Full simulation platform | Applied Intuition | 18 of top 20 OEMs trust it; regulatory compliance built in |
| World model for evaluation | Wayve GAIA-3 | Best-in-class generative driving scenarios |
| Foundation model compute | Microsoft Azure | Wayve-proven GPU orchestration at scale |
| Hardware platform | NVIDIA DRIVE AGX | Industry-standard AV compute; Thor forthcoming |

Based on this analysis, a practical development strategy would be:

  1. Start with open-source backbone: Adapt openpilot's E2E architecture (FastViT + temporal transformer) or NVIDIA Alpamayo for airside perception and planning

  2. Build airside world model: Fine-tune NVIDIA Cosmos on airside driving data to create an airport-specific world foundation model for training data generation and scenario simulation

  3. Adopt simulation-first development: Use AlpaSim (open-source) or Applied Intuition (commercial) as simulation backbone; integrate Cosmos-generated synthetic airside scenarios

  4. Implement data engine: Follow Tesla's shadow-mode paradigm -- deploy data collection vehicles on ramps to build proprietary airside datasets

  5. Language conditioning: Incorporate LINGO-2/EMMA-style language interfaces for airside procedure compliance and ATC/ground control integration

  6. Safety validation: Use GAIA-3-style world models (or Waymo World Model approach) for rare-event simulation specific to airside operations

  7. Regulatory pathway: Applied Intuition's white-box, traceable architecture provides a model for meeting aviation safety standards


Key Sources

Research notes compiled from publicly available sources.