Datasets, Benchmarks, and Data Engines for Autonomous Driving World Models

Deep Technical Research Report — March 2026


Table of Contents

  1. Major Driving Datasets
  2. Benchmarks for World Models and Generation
  3. Data Engines and Auto-Labeling
  4. The Data Gap for Airside Operations
  5. Data Curation and Quality

1. Major Driving Datasets

1.1 nuScenes (Motional, 2019)

The foundational multimodal dataset for autonomous driving research.

  • Scale: 1,000 curated scenes; 5.5 hours of driving selected from 83 logs (14.8h total)
  • Sensors: 6 cameras (360°), 1 LiDAR (Velodyne HDL-32E), 5 radars (the full AV sensor suite)
  • Annotations: 1.4M 3D bounding boxes across 23 object classes; the panoptic extension adds 1.1B point-level labels
  • Locations: Boston, Singapore
  • Key metrics: nuScenes Detection Score (NDS), AMOTA for tracking

Strengths for world models: Rich multimodal temporal sequences with semantic maps enable learning spatial-temporal dynamics. Over 16,000 downloads, 1,000+ academic citations.

Limitations (per "nuScenes Revisited," Dec 2025): Geographic overlap between train/val/test splits causes overfitting risk. Only 5.5h of data is small for modern training. Radar synchronization issues complicate multimodal fusion. The dataset's curation bias toward "interesting" moments limits continuous-driving distribution learning.

Key extensions:

  • nuImages: 93,000 annotated images for 2D detection
  • Panoptic nuScenes: Semantic + instance labels at point level
  • NuPlanQA (2025): 4.3M frames with multi-modal language QA built on nuPlan

1.2 nuPlan (Motional, 2021)

The world's first large-scale real-world planning dataset and benchmark.

  • Scale: 1,282 hours of driving, ~235x larger than nuScenes
  • Locations: Las Vegas, Boston, Pittsburgh, Singapore
  • Features: auto-labeled object tracks, traffic light data, HD maps
  • Use case: closed-loop planning evaluation

nuPlan is the base dataset from which OpenScene is derived (see section 1.10). Its massive scale makes it the de facto standard for end-to-end driving research in 2024-2025.

1.3 Waymo Open Dataset (Waymo, 2019-present)

The most actively maintained large-scale AV dataset, with continual expansions.

  • Perception Dataset: 1,150 scenes of 20s each; 12.6M 3D box labels, 12M 2D box labels
  • Motion Dataset (WOMD): 103,354 scenes; 574+ hours with raw LiDAR range images
  • Sensors: 5 LiDARs, 5 cameras (no radar)
  • Locations: Phoenix, San Francisco, Mountain View, and other US cities
  • Current version: v1.3.1 (Oct 2025), which added sdc_paths for future route planning

Recent extensions (2024-2025):

  • WOD-E2E: 4,021 high-difficulty long-tail segments (~12 hours) for end-to-end driving in safety-critical situations
  • WOMD-Reasoning: 3M QA pairs spanning map recognition, motion narratives, interaction reasoning, and intention prediction — enabling joint vision-language benchmarking
  • WOMD-LiDAR: Raw compressed LiDAR range images for the full 574h motion dataset, enabling sensor-to-prediction learning

1.4 Waymo Open Motion Dataset (WOMD)

Deserves separate treatment due to its importance for world model training:

  • Scenes: 103,354 (104,000 in some references)
  • Duration: 574+ hours
  • Prediction horizon: 8 seconds
  • Key tasks: marginal/joint motion forecasting, interaction prediction, sim agents
  • Metrics: minADE, minFDE, Miss Rate, Overlap Rate, forecasting mAP

The WOMD is particularly relevant for world models because it provides the long-horizon trajectory data needed to supervise future state prediction. The sim agents challenge specifically evaluates the ability to generate realistic multi-agent futures.

1.5 Argoverse 1 & 2 (Argo AI → now open-source, 2019/2023)

  • Sensor dataset: 113 scenes (Argoverse 1) vs. 1,000 3D-annotated scenarios (Argoverse 2)
  • LiDAR dataset: none (Argoverse 1) vs. 20,000 unannotated sequences (Argoverse 2)
  • Motion forecasting: 324,557 segments (Argoverse 1) vs. 250,000 scenarios (Argoverse 2)
  • Sensors: 2 LiDARs, 7 cameras (Argoverse 1) vs. 2 LiDARs (64 beams total at 10Hz), 7 ring cameras (20Hz), and 2 stereo cameras (Argoverse 2)
  • Object classes: 15 (Argoverse 1) vs. 26 (Argoverse 2)
  • Locations: Miami, Pittsburgh (Argoverse 1) vs. Austin, Detroit, Miami, Pittsburgh, Palo Alto, Washington D.C. (Argoverse 2)

Unique features: Real-valued ground height maps at 30cm resolution (critical for 3D scene reconstruction). Map change dataset (1,000 scenarios, 200 with real-world HD map changes). Rich HD maps with 3D lane-level geometry, lane markings, traffic direction, crosswalks, and drivable area.

1.6 KITTI and KITTI-360 (KIT/Toyota, 2012/2022)

KITTI (original):

  • Tasks: stereo, optical flow, visual odometry, 3D detection, 3D tracking
  • Sensors: 2 color cameras, 2 grayscale cameras, Velodyne HDL-64E, GPS/IMU
  • Location: Karlsruhe, Germany (urban, rural, highway)
  • Status: Still widely used as a baseline despite age; 15,000+ citations

KITTI-360 (successor):

  • 73.7km driving distance, 300K images, 80K laser scans
  • 150K+ images and 1B 3D points with coherent semantic instance annotations
  • Full 360-degree FOV via fisheye cameras and pushbroom scanner
  • Tasks: semantic scene understanding, novel view synthesis, semantic SLAM

KITTI remains the most-cited AV dataset, but its limited scale and single-city coverage make it insufficient for modern world model training. It retains value as a domain-transfer target (e.g., Waymo-to-KITTI zero-shot benchmarks).

1.7 Lyft Level 5 (Lyft/Woven Planet, 2019)

  • Scale: 55,000+ human-labeled 3D annotated frames; 1,000+ hours of total sensor data
  • Fleet: 20 self-driving cars
  • Duration: 4-month collection period
  • Location: Palo Alto, California (fixed route)
  • Maps: drivable surface map + HD spatial semantic map

The dataset is no longer actively maintained since Woven Planet (Toyota) absorbed Lyft Level 5. Its value lies primarily in the large fleet data collection paradigm it demonstrated.

1.8 PandaSet (Hesai/Scale AI, 2021)

  • Scenes: 100+ scenes, 8 seconds each
  • Sensors: 1× 360° mechanical spinning LiDAR, 1× forward-facing long-range LiDAR, 6 cameras
  • Annotations: 28 object classes, 37 semantic segmentation classes
  • License: first AV dataset with a no-cost commercial license

PandaSet is notable for its dual-LiDAR configuration and commercial-friendly licensing, but its small scale limits use for world model training.

1.9 ONCE (One millioN sCenEs, 2021)

  • Scale: 1 million LiDAR scenes, 7 million camera images
  • Duration: selected from 144 driving hours (20x larger than nuScenes/Waymo at time of release)
  • Sensors: 1 high-quality LiDAR (70K points/scene on average), 7 cameras (360°)
  • Annotations: small labeled subset + massive unlabeled pool
  • Focus: semi-supervised and self-supervised 3D detection

ONCE is particularly relevant for studying data scaling and self-supervised pretraining approaches, as its massive unlabeled pool enables controlled experiments on how much labeled vs. unlabeled data is needed.

1.10 OpenScene (OpenDriveLab/Shanghai AI Lab, 2024)

  • Scale: 120+ hours of driving data
  • Base: compact redistribution of nuPlan (>10x compression)
  • Sampling: 2Hz sensor data with relevant annotations retained
  • Annotations: occupancy labels from 20s of accumulated LiDAR, plus flow annotations (motion direction + velocity)
  • Cities: Boston, Pittsburgh, Las Vegas, Singapore

Critical role: OpenScene is the official dataset for:

  • CVPR 2024 Autonomous Grand Challenge: End-to-End Driving + Predictive World Model tracks
  • CVPR 2025 Autonomous Grand Challenge: NAVSIM-v2 End-to-End Driving track

At 120h it provides ~20x more data than competing occupancy datasets (Occ3D: 5.5h, OccNet: 5.7h), making it the primary benchmark dataset for world model research.

1.11 DriveLM (OpenDriveLab, 2024 — ECCV 2024 Oral)

  • Structure: graph-structured QA pairs (Perception, Prediction, Planning)
  • Base datasets: nuScenes + CARLA
  • Annotation type: frame-wise P3 QA with logical dependencies
  • Graph structure: QA pairs as nodes, object relationships as edges

DriveLM is the first language-driving dataset covering the full stack of driving tasks with graph-structured logical dependencies. It enables training VLMs that can reason about driving through structured question-answering chains from perception through prediction to planning.

1.12 Talk2Car (KU Leuven, 2019)

  • Scale: 11,959 natural language commands across 850 nuScenes videos
  • Task: grounding natural language commands to 3D bounding boxes
  • Modalities: leverages the full nuScenes sensor suite (maps, GPS, LiDAR, radar, 360° RGB)

Talk2Car is important for language-grounded driving where commands like "follow that blue car" must be resolved to specific detected objects.

1.13 Significant New Datasets (2024-2025)

CoVLA (Turing Inc., WACV 2025):

  • 10,000 driving scenes from Tokyo, 80+ hours of video
  • Automated frame-level language captions + future trajectory annotations
  • Bridges vision-language-action for VLA model training
  • CoVLA-Agent baseline achieves 0.814m ADE with ground truth captions

OpenDV-YouTube / OpenDV-2K (OpenDriveLab, 2024):

  • 1,700-2,000+ hours of YouTube driving videos — 300x-374x larger than nuScenes
  • 244 cities in 40 countries
  • Paired with LLM/VLM-generated text descriptions and driving instructions
  • Used to pretrain Vista, a generalizable driving world model
  • Key insight: YouTube data provides natural distribution of driving scenarios across geographies, weather, and traffic

V2X-Real (ECCV 2024):

  • Multi-vehicle + infrastructure cooperative perception
  • 33K LiDAR frames, 171K camera images, 1.2M annotated boxes (10 categories)
  • Data from 2 connected AVs + 2 smart infrastructure units

V2X-Radar (Tsinghua, 2024):

  • First V2X dataset with 4D radar
  • 20K LiDAR frames, 40K camera images, 20K 4D radar frames, 350K annotated boxes

WayveScenes101 (Wayve, 2024):

  • 101 driving sequences (20s each), 101,000 images
  • Designed specifically for novel view synthesis in driving
  • US and UK locations

SEVD (CARLA-based, 2024):

  • 58 hours of multi-modal synthetic data
  • 9 million bounding box annotations
  • Ego-vehicle + infrastructure viewpoints, 3 lighting / 5 weather conditions

WOD-E2E (Waymo, 2025):

  • 4,021 high-difficulty long-tail driving segments
  • Focused specifically on rare, safety-critical situations for E2E driving

UniOcc (ICCV 2025):

  • Unified benchmark merging nuScenes, Waymo, CARLA, OpenCOOD
  • 14.2 hours across 2,152 sequences
  • Standardizes occupancy forecasting and prediction evaluation

2. Benchmarks for World Models and Generation

2.1 NAVSIM (NeurIPS 2024 / CoRL 2025)

The most important recent benchmark for end-to-end driving evaluation.

Core concept: A middle ground between open-loop and closed-loop evaluation. Uses large real-world datasets with a non-reactive simulator, gathering simulation-based metrics by unrolling BEV abstractions of test scenes.

  • Dataset: OpenScene (derived from nuPlan)
  • Metrics: progress, time-to-collision, drivable area compliance, comfort
  • Correlation: pseudo-simulation achieves strong correlation with traditional closed-loop simulation at 6x less compute
  • Competition: CVPR 2024 (143 teams, 463 submissions); CVPR 2025 (NAVSIM v2 challenge)
  • Version: v2 (2025), with a private_test_hard split for the HuggingFace leaderboard

Key finding: On large sets of challenging scenarios, simple methods with moderate compute (e.g., TransFuser) can match large-scale E2E architectures (e.g., UniAD), suggesting architecture innovations alone are insufficient — data and evaluation rigor matter more.

2.2 nuScenes Prediction Benchmarks

Trajectory prediction:

  • Up to 25 proposed future trajectories per agent
  • Metrics: ADE (Average Displacement Error) and FDE (Final Displacement Error) over the k most likely predictions (see the sketch after this list)
  • Focus: Multi-modal trajectory forecasting quality
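
To pin down the metric definitions, here is a minimal NumPy sketch of minADE/minFDE over k hypotheses; the array shapes and function name are illustrative, not taken from the nuScenes devkit.

```python
import numpy as np

def min_ade_fde(preds: np.ndarray, gt: np.ndarray) -> tuple[float, float]:
    """Return (minADE, minFDE) over k predicted trajectories.

    preds: (k, T, 2) candidate trajectories; gt: (T, 2) ground truth, meters.
    """
    disp = np.linalg.norm(preds - gt[None], axis=-1)  # (k, T) per-step error
    ade = disp.mean(axis=1)                           # average error per hypothesis
    fde = disp[:, -1]                                 # final-step error per hypothesis
    return float(ade.min()), float(fde.min())

# Toy example: 3 hypotheses over an 8 s horizon at 2 Hz (16 steps)
rng = np.random.default_rng(0)
gt = np.cumsum(rng.normal(size=(16, 2)), axis=0)
preds = gt[None] + rng.normal(scale=0.5, size=(3, 16, 2))
print(min_ade_fde(preds, gt))
```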

3D Occupancy prediction:

  • Voxelized 3D space representation
  • Joint estimation of occupancy state and semantic class per voxel
  • Metrics: IoU and mIoU per class (see the sketch after this list)
  • Standard benchmarks: Occ3D-nuScenes, OpenOccupancy
  • Occluded/out-of-FOV voxels are ignored during evaluation
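
A hedged sketch of the per-class IoU computation, assuming integer class grids `pred` and `gt` of equal shape and a boolean `visible` mask that excludes the ignored occluded/out-of-FOV voxels; this is illustrative, not the benchmarks' reference code.

```python
import numpy as np

def occupancy_miou(pred: np.ndarray, gt: np.ndarray,
                   visible: np.ndarray, num_classes: int) -> float:
    """mIoU over semantic classes, evaluated only on visible voxels."""
    p, g = pred[visible], gt[visible]   # drop occluded / out-of-FOV voxels
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(p == c, g == c).sum()
        union = np.logical_or(p == c, g == c).sum()
        if union > 0:                   # skip classes absent from both grids
            ious.append(inter / union)
    return float(np.mean(ious))
```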

Relevance for world models: Occupancy prediction is a direct proxy for world modeling — predicting future 3D occupancy is essentially predicting future world states. The transition from trajectory-only to dense occupancy forecasting marks a paradigm shift toward richer world representations.

2.3 Waymo Sim Agents Challenge

  • Task: simulate 32 realistic joint futures for all agents in a scene
  • Input: 1 second of past agent tracks + map
  • Goal: realistic, interactive agent simulation
  • Status: running annually since 2023; the 2025 leaderboard remains open

The Sim Agents Challenge is the closest public benchmark to directly evaluating world models in the behavioral/trajectory domain. Generating 32 diverse, physically plausible joint futures tests both diversity and realism of learned dynamics.

2.4 CARLA Leaderboard

  • Simulator: CARLA (Unreal Engine 4 based)
  • Leaderboard v2: 39 challenging scenarios for robustness evaluation
  • Status: no official challenge in 2025 (organizers cited unforeseen circumstances); local evaluation remains available
  • Best entry: SimLingo (topped both Leaderboard 2.0 and Bench2Drive; won the CARLA Challenge 2024)

CARLA remains the primary closed-loop simulation benchmark but faces questions about sim-to-real transferability.

2.5 Bench2Drive (NeurIPS 2024 Datasets & Benchmarks)

A granular closed-loop E2E-AD benchmark enhanced by world model RL expert policies.

  • Scenarios: 44 scenario types × 5 routes each (different weather/towns) = 220 routes
  • Route length: ~150 meters per route
  • Evaluation: multi-ability assessment across all 44 scenarios
  • Enhancement: a world model RL expert provides better demonstrations than rule-based experts

Bench2Drive is significant because it demonstrates how world models can improve benchmark quality itself — world model RL experts generate better training demonstrations than hand-coded rule-based policies.

2.6 WorldModelBench (NeurIPS 2025 Datasets & Benchmarks)

The first dedicated benchmark for evaluating video generation models as world models.

  • Domains: 7 application domains, 56 subdomains
  • Evaluation: instruction following + physics adherence
  • Human alignment: 67K crowd-sourced human labels across 14 frontier models
  • Judger: a fine-tuned 2B multimodal model achieving 8.6% higher accuracy than GPT-4o at detecting violations
  • Violation detection: nuanced physics violations (e.g., mass conservation, object permanence)

Autonomous driving is one of the 7 primary domains evaluated, making this relevant for assessing whether driving world models actually learn physical laws vs. just appearance.

2.7 WorldLens (CVPR 2026)

The most comprehensive driving-specific world model evaluation framework.

  • Evaluation axes: 5 (Generation, Reconstruction, Action-Following, Downstream Task, Human Preference)
  • Dimensions: 24 evaluation dimensions across the 5 axes
  • Scope: visual realism, geometric consistency, functional reliability, perceptual alignment
  • Key finding: no single model dominates across all axes

WorldLens addresses a critical gap: existing benchmarks either test generation quality OR downstream task utility, but not both. A world model that generates beautiful videos but fails geometrically or behaviorally is useless for planning.


3. Data Engines and Auto-Labeling

3.1 Tesla's Data Engine

The canonical example of a fleet-scale data engine, operating in a continuous closed loop:

Pipeline architecture:

  1. Shadow mode: Models run passively on fleet vehicles, detecting candidate failures without affecting vehicle behavior
  2. Data upload: When owners opt in, clips are uploaded to Tesla's supercomputer infrastructure
  3. Auto-labeling: Fleet data is used to construct precise 3D environment models, which serve as the basis for automatic labeling of new data
  4. Human review: Auto-labeled data is escalated for human verification before training
  5. Retraining: Models are retrained on the curated + auto-labeled data
  6. Deployment: Updated models are pushed to the fleet, restarting the cycle

Active learning approach:

  • Uncertainty sampling + query-by-committee to identify maximally informative samples
  • Selective upload of uncommon scenarios (the "trigger" system, sketched after this list)
  • Fleet of millions of vehicles provides unmatched geographic and scenario diversity
  • Simulated data augmentation using synthetic environments for FSD training
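
No public reference implementation of Tesla's trigger system exists; the following is a schematic sketch of what such an upload predicate could look like, with all names and thresholds hypothetical. It combines query-by-committee disagreement with a shadow-mode mismatch between the planner's output and the human driver's action.

```python
import numpy as np

def should_upload(committee_actions: np.ndarray,
                  driver_action: np.ndarray,
                  model_action: np.ndarray,
                  disagreement_thresh: float = 0.3,
                  mismatch_thresh_m: float = 1.0) -> bool:
    """Flag a clip for upload when the ensemble disagrees or the planner
    deviates from what the human driver actually did (shadow mode)."""
    # Query-by-committee: variance of ensemble action proposals
    disagreement = float(np.var(committee_actions, axis=0).mean())
    # Shadow-mode mismatch: planned vs. executed trajectory endpoint (meters)
    mismatch = float(np.linalg.norm(driver_action - model_action))
    return disagreement > disagreement_thresh or mismatch > mismatch_thresh_m
```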

Scale advantage: No other company operates at Tesla's fleet scale for data collection. This creates a compounding advantage where more data enables better models, which identify more interesting data, which enables even better models.

3.2 Auto-Labeling with Foundation Models

Grounding DINO + SAM pipeline: The dominant auto-labeling paradigm in 2024-2025 chains open-vocabulary detection with segmentation:

  1. Grounding DINO provides open-set 2D object detection from text prompts (52.5 AP on COCO zero-shot)
  2. SAM/SAM2 generates precise segmentation masks from detected boxes
  3. Masks are projected onto 3D point clouds via geometric calibration to produce 3D labels (see the sketch after this list)
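
The chain can be sketched structurally as follows. This is not any project's actual API: `grounding_dino_detect` and `sam_segment` are hypothetical stand-ins for the respective model wrappers (whose real interfaces vary across releases), and the 3D step fits a crude axis-aligned box rather than a production cuboid estimator.

```python
import numpy as np

def autolabel_frame(image, lidar_xyz, K, T_lc,
                    prompt="car. pedestrian. cyclist."):
    """K: 3x3 camera intrinsics; T_lc: 4x4 LiDAR-to-camera extrinsic."""
    boxes, phrases = grounding_dino_detect(image, prompt)  # open-vocab 2D boxes
    masks = sam_segment(image, boxes)                      # one mask per box
    # Project LiDAR points into the image (pinhole model)
    pts_h = np.c_[lidar_xyz, np.ones(len(lidar_xyz))]
    pts_cam = (T_lc @ pts_h.T).T[:, :3]
    front = pts_cam[:, 2] > 0                              # points ahead of camera
    uv = (K @ pts_cam[front].T).T
    uv = (uv[:, :2] / uv[:, 2:3]).astype(int)
    labels = []
    for mask, phrase in zip(masks, phrases):
        h, w = mask.shape
        inside = (uv[:, 0] >= 0) & (uv[:, 0] < w) & (uv[:, 1] >= 0) & (uv[:, 1] < h)
        hit = inside & mask[uv[:, 1].clip(0, h - 1), uv[:, 0].clip(0, w - 1)]
        cluster = lidar_xyz[front][hit]                    # 3D points under the mask
        if len(cluster) > 10:                              # crude axis-aligned 3D box
            labels.append((phrase, cluster.min(axis=0), cluster.max(axis=0)))
    return labels
```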

ZOPP Framework (NeurIPS 2024): Combines Grounding DINO features with SAM for zero-shot offboard panoptic perception, specifically designed for autonomous driving. Integrates detection, segmentation, and multi-object tracking in an offboard pipeline.

UP-VL approach: Uses VLM textual outputs + spatiotemporal clustering to automatically generate 3D bounding boxes and object tracklets — producing high-quality pseudo labels without any human annotations.

SAL method: Employs SAM to generate 2D instance masks, then projects onto 3D point clouds via geometric calibration. These projected masks serve as supervisory signals for training 3D instance segmentation models.

Grounded SAM 2: Includes a cascaded auto-label pipeline with caption and phrase grounding capabilities, enabling fully automated annotation workflows at scale.

Key insight: Foundation model auto-labeling has shifted the bottleneck from "can we label this data?" to "can we verify the labels are correct?" — quality assurance, not label generation, is now the constraint.

3.3 Active Learning for Driving

ActiveAD (2024): Planning-oriented active learning that introduces task-specific diversity and uncertainty metrics:

  • Three key metrics: Displacement Error, Soft Collision, Agent Uncertainty
  • Critical result: Using only 30% of training data, ActiveAD matches or exceeds methods trained on 100% of data in both nuScenes (open-loop) and CARLA (closed-loop) evaluation
  • Demonstrates that intelligent data selection can substitute for 3x more random data

Uncertainty sampling strategies:

  • Least Confident sampling: selects frames where the model is most uncertain
  • Entropy-based sampling: selects frames with the highest prediction entropy (both strategies are sketched after this list)
  • Practical results: 80% accuracy achievable in 10 iterations with appropriate strategies
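
Both strategies reduce to a one-line ranking over per-frame class probabilities, as in this illustrative sketch (`probs` has shape frames × classes; the helper names are hypothetical):

```python
import numpy as np

def select_least_confident(probs: np.ndarray, budget: int) -> np.ndarray:
    """Indices of the `budget` frames with the lowest top-1 confidence."""
    return np.argsort(probs.max(axis=1))[:budget]

def select_highest_entropy(probs: np.ndarray, budget: int) -> np.ndarray:
    """Indices of the `budget` frames with the highest prediction entropy."""
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(-entropy)[:budget]
```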

Industrial practice (NVIDIA):

  • Scalable active learning implementation with A/B testing validation
  • Continuous improvement loops integrated into production data pipelines

3.4 Self-Supervised Pretraining on Driving Data

OpenDV-YouTube / Vista (OpenDriveLab):

  • 1,700-2,000+ hours of YouTube driving videos used for pretraining
  • Self-supervised video prediction as pretraining objective
  • Vista model: predicts high-fidelity, long-horizon futures and can serve as a generalizable reward function
  • Covers 244 cities across 40 countries — natural distribution of driving world

DriveVLA-W0 (2025):

  • Uses world modeling (future image prediction) as a dense self-supervised signal for VLA training
  • Addresses the "supervision deficit" where sparse action labels underutilize model capacity
  • Two instantiations: autoregressive world model (discrete tokens) + diffusion world model (continuous features)
  • Key result: With data scaling from 70K to 70M frames, world model supervision shows sustained improvement, while action-only supervision saturates
  • State-of-the-art on NAVSIM benchmark, surpassing BEV-based and VLA baselines

Implication for world models: Self-supervised pretraining on raw driving video, followed by fine-tuning on labeled data, is emerging as the dominant paradigm. World model training objectives naturally provide dense supervision signals that scale better than action-only supervision.

3.5 Scaling Laws for Driving Data

Three landmark studies establish the empirical scaling laws:

Waymo Scaling Laws (June 2025):

  • Studied 500,000 hours of driving data — largest AV scaling study to date
  • Motion forecasting quality follows a power law as a function of training compute (the fitting procedure is sketched after this list)
  • As training compute grows, optimal scaling requires increasing model size 1.5x faster than dataset size
  • Unlike LLMs, optimal AV models tend to be relatively smaller but require significantly more data
  • Closed-loop metrics also improve with scaling — first demonstration that real-world AV performance predictably improves with more data/compute
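
The reported relationship has the form loss ≈ a * C^(-b) in training compute C. The sketch below shows how such an exponent is typically fit in log-log space; the compute/loss values are synthetic, chosen only to illustrate the procedure, not the study's actual coefficients.

```python
import numpy as np

# Synthetic compute/loss points, purely to illustrate the fitting procedure
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])   # training FLOPs
loss = 3.2 * compute ** -0.057                        # loss = a * C^(-b)

# A power law is a straight line in log-log space: log L = log a - b * log C
slope, log_a = np.polyfit(np.log(compute), np.log(loss), 1)
print(f"exponent b = {-slope:.3f}, prefactor a = {np.exp(log_a):.2f}")
```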

Data Scaling Laws for E2E Driving (Dec 2024):

  • ONE-Drive dataset: 4 million demonstrations, 30,000+ hours, 23 scenario types
  • Open-loop evaluation exhibits power-law relationship with data quantity
  • Closed-loop evaluation does NOT follow a simple power law — suggests qualitative phase transitions
  • Real-world deployment achieved an average of 24.41 km per intervention

DriveVLA-W0 Scaling (2025):

  • World model objectives amplify data scaling — performance gains accelerate with more data (vs. saturating with action-only supervision)
  • Tested across 70K to 70M frames
  • World model pretraining fundamentally changes the scaling curve shape

Practical implications for data collection:

  1. More data reliably improves performance (no plateau observed at current scales)
  2. Data quality and diversity matter as much as raw quantity
  3. Optimal model sizes are smaller than LLMs, but data requirements are proportionally larger
  4. Closed-loop improvement requires data diversity, not just volume

4. The Data Gap for Airside Operations

4.1 Current State: No Public Airside Driving Datasets

The aerospace field has a notable absence of large-scale public datasets and easily accessible simulation tools specifically tailored for autonomous airport navigation. This stands in stark contrast to the road driving domain, which benefits from dozens of large-scale public datasets.

What exists today:

  • Synth_Airport_Taxii: synthetic data (MS Flight Simulator 2020), small scale. Classes: taxiway lane, signs, persons, airplanes, ground vehicles, runway limits. Subject to a sim-to-real gap.
  • AssistTaxi: real aircraft-taxiway images, 300,000+ frames. Limitations: only 2 small airports (Melbourne, FL and Grant-Valkaria); aircraft perspective, not ground vehicle.
  • AeroVect proprietary: real fleet data at undisclosed scale, collected at major US airports. Private; not publicly available.

Why the gap exists:

  1. Security restrictions: Airport ramps are secured areas with strict access control
  2. Liability concerns: Sensor data from active ramps captures aircraft, personnel, and proprietary operations
  3. Small addressable market: Fewer companies working on airside autonomy vs. road driving
  4. Equipment diversity: GSE vehicles (baggage tugs, pushback tractors, belt loaders) vary dramatically
  5. Regulatory vacuum: No standards or regulations exist for autonomous GSE data collection

4.2 Transfer Learning from Road Driving to Airside

What transfers well:

  • Low-level perception: object detection, segmentation, depth estimation
  • Multi-sensor fusion pipelines (LiDAR + camera + radar)
  • Occupancy prediction architectures
  • Motion forecasting for agent interactions
  • Self-supervised visual representations (DINOv2, CLIP features)

What does NOT transfer:

  • Semantic understanding: road lanes vs. taxiway markings vs. apron boundaries
  • Agent behavior models: aircraft follow-me vehicles, pushback procedures, jet blast zones
  • Speed regimes: airside operations are typically limited to 8-25 kph
  • Right-of-way rules: aircraft always have priority; FOD (Foreign Object Debris) creates unique constraints
  • Environment geometry: wide-open aprons vs. structured road networks

Domain adaptation strategies:

ReSimAD (ICLR 2024) provides a promising template for zero-shot 3D domain transfer:

  1. Reconstruct 3D scene-level meshes from source domain (domain-invariant representation)
  2. Simulate target-domain-like point clouds conditioned on reconstructed meshes
  3. Train perception models on simulated target data

This approach outperforms unsupervised domain adaptation methods that require access to real target domain data — critical for airside where real data is scarce. It has been validated across Waymo-to-KITTI, Waymo-to-nuScenes, and Waymo-to-ONCE transfers.

Practical strategy for airside transfer:

  1. Pretrain backbone on large-scale road driving data (nuPlan, Waymo, OpenDV-YouTube)
  2. Use ReSimAD-style reconstruction-simulation to generate airport-like environments
  3. Fine-tune on small real airport dataset (even a few hours of data)
  4. Apply continual learning as more real airport data is collected

4.3 Synthetic Data Generation for Airside

Existing platforms and approaches:

Microsoft Flight Simulator 2020 (validated for airport use):

  • Synth_Airport_Taxii demonstrates airport-specific synthetic data
  • Classes: taxiway lanes, vertical signs, persons, airplanes, horizontal signs, runway limits, ground vehicles
  • Key finding: Models trained on a mix of real + synthetic images outperform models trained on either alone
  • Cost-effective alternative to real-world data collection

CARLA-style approach adapted for airports:

  • CARLA2Real tool (2024) reduces sim-to-real appearance gap using G-Buffer data from the rendering pipeline
  • Enhancing Photorealism Enhancement (EPE) method targets characteristics of real-world datasets
  • Domain randomization strategies demonstrate strong generalization when transferring to real data
  • Could be extended to airport environments with custom Unreal Engine assets

Proposed synthetic data pipeline for airside:

  1. Environment construction: Build airport ramp environments in Unreal Engine 5 or Unity, using real airport GIS data and satellite imagery for layout accuracy
  2. Asset library: Model GSE vehicles (tugs, tractors, belt loaders, fuel trucks), aircraft types, personnel, signage, and markings
  3. Behavior simulation: Implement realistic GSE traffic patterns, aircraft pushback procedures, and pedestrian behavior models
  4. Domain randomization: Vary lighting (day/night/dusk), weather (rain, fog, snow), surface conditions (wet, icy), and equipment configurations (a parameter sweep is sketched after this list)
  5. Sensor simulation: Generate synthetic LiDAR, camera, and radar data with realistic noise models
  6. Automatic annotation: Leverage rendering engine metadata for perfect ground-truth labels
  7. Real-data validation: Continuously calibrate sim-to-real gap using small real-world datasets
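
As a concrete illustration of the domain-randomization step (step 4), the sketch below draws randomized ramp-scene parameters. All names and ranges are hypothetical; a real setup would map each sampled config onto the chosen engine's scene parameters.

```python
import random

LIGHTING = ["day", "night", "dusk"]
WEATHER = ["clear", "rain", "fog", "snow"]
SURFACE = ["dry", "wet", "icy"]

def sample_scene_config(rng: random.Random) -> dict:
    """Draw one randomized ramp-scene configuration."""
    return {
        "lighting": rng.choice(LIGHTING),
        "weather": rng.choice(WEATHER),
        "surface": rng.choice(SURFACE),
        "sun_elevation_deg": rng.uniform(-5, 60),
        "fog_density": rng.uniform(0.0, 0.8),
        "n_gse_vehicles": rng.randint(0, 12),   # tugs, loaders, fuel trucks
        "n_aircraft": rng.randint(0, 3),
        "n_personnel": rng.randint(0, 8),
    }

rng = random.Random(42)
configs = [sample_scene_config(rng) for _ in range(1000)]  # one per episode
```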

4.4 Bootstrapping a Data Engine for Airside

Based on Tesla's data engine paradigm, adapted for a niche domain with limited initial data:

Phase 1: Foundation (0-6 months)

  • Deploy instrumented vehicles (LiDAR + cameras + GPS/RTK) on 2-3 partner airports
  • Collect 100-500 hours of raw sensor data during normal operations
  • Use foundation model auto-labeling (Grounding DINO + SAM) for initial annotations
  • Human-in-the-loop review for safety-critical labels (aircraft proximity, personnel detection)
  • Build initial HD maps using AeroVect-style rapid mapping (~2 hours per airport)

Phase 2: Active Learning Loop (6-18 months)

  • Deploy initial perception models on fleet vehicles in shadow mode
  • Use ActiveAD-style uncertainty metrics to identify informative scenarios
  • Trigger-based upload: automatically flag near-miss events, unusual objects, degraded visibility
  • Maintain scenario database with tagged attributes (weather, time-of-day, traffic density, aircraft type)
  • Target: 1,000-5,000 hours of curated data

Phase 3: World Model Training (12-24 months)

  • Pretrain world model on road driving data (OpenDV-YouTube, nuPlan)
  • Fine-tune on airside data with domain-specific objectives
  • Use world model for scenario generation to augment real data
  • Implement closed-loop evaluation in simulation
  • Target: competitive performance on airside-specific benchmarks

Phase 4: Scaling (18+ months)

  • Expand to 10+ airports for geographic diversity
  • Continuous mining for edge cases and rare scenarios
  • Implement data balancing strategies for long-tail events
  • Build airside-specific foundation model
  • Target: 10,000+ hours of diverse airside data

4.5 Domain-Specific Pretraining Strategies

Synthetic Continued Pretraining (ICLR 2025): This approach is directly relevant for airside domains where real data is scarce:

  1. Start with a small corpus of domain-specific data (e.g., 100h of airport driving)
  2. Synthesize a larger corpus using generative methods to make the knowledge more "learnable"
  3. Continue pretraining the foundation model on the augmented corpus
  4. The synthetic expansion helps the model integrate niche knowledge more effectively

Recommended pretraining curriculum for airside world models:

  • Stage 1 (ImageNet / COCO / general vision): visual feature learning; standard pretrained backbone
  • Stage 2 (OpenDV-YouTube, 1,700h): driving dynamics, multi-agent prediction; 50-100 GPU-days
  • Stage 3 (nuPlan / Waymo, up to 500K hours available): structured driving with HD maps; 100-200 GPU-days
  • Stage 4 (synthetic airport data, rendered): airport-specific semantics; 20-50 GPU-days
  • Stage 5 (real airport data, 100-1,000h): domain-specific fine-tuning; 10-30 GPU-days
  • Stage 6 (continual learning on fleet data): continuous improvement; ongoing

Key principle: Each successive stage becomes more compute-efficient because earlier stages have already encoded transferable representations. Stages 2-3 provide the dynamics understanding; stages 4-5 specialize the semantics.


5. Data Curation and Quality

5.1 Mining Interesting/Rare Scenarios from Fleet Data

The challenge: AV fleets generate terabytes of data daily, but the vast majority captures trivial driving (straight roads, no interactions). Safety-critical rare events — the "long tail" — are where model failures concentrate.

Applied Intuition's Data Explorer:

  • Foundation model trained on 5B+ text-image pairs via contrastive learning (CLIP-based)
  • Two-tower retrieval architecture:
    • Item tower: Fleet camera data embedded once during upload (batch GPU processing). For a 20-minute log with 4 cameras at 4Hz, ~20,000 images are embedded.
    • Query tower: Natural language queries embedded at query time (real-time CPU processing)
  • Engineers search using plain language: "cyclists at night," "construction zones," "jaywalking pedestrians" (a minimal analogue is sketched after this list)
  • Apache Spark handles scalable nearest-neighbor search across thousands of hours
  • Hybrid queries combine natural language + structured data filters simultaneously
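
Applied Intuition's foundation model is proprietary, but a minimal analogue of the two-tower loop can be sketched with open-source CLIP via the open_clip package, as below. The function names are illustrative, and a production system would replace the in-memory dot product with an approximate nearest-neighbor index (e.g., the Spark-based search described above).

```python
import torch
import open_clip                      # pip install open_clip_torch
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="laion2b_s34b_b79k")
tokenizer = open_clip.get_tokenizer("ViT-B-32")

@torch.no_grad()
def embed_images(paths):              # "item tower": run once at ingest time
    batch = torch.stack([preprocess(Image.open(p)) for p in paths])
    feats = model.encode_image(batch)
    return feats / feats.norm(dim=-1, keepdim=True)

@torch.no_grad()
def search(query: str, image_feats: torch.Tensor, k: int = 5):
    txt = model.encode_text(tokenizer([query]))   # "query tower": per query
    txt = txt / txt.norm(dim=-1, keepdim=True)
    scores = image_feats @ txt.squeeze(0)         # cosine similarity
    return scores.topk(min(k, len(scores))).indices.tolist()

# e.g. idx = search("cyclists at night", embed_images(frame_paths))
```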

Motional's scenario mining approach:

  • Mining pipelines ingest sensor outputs (LiDAR, images, radar) plus AV software intermediate results (detections, predictions)
  • Attribute computation for each scenario enables structured querying
  • Scenario-specific data mining targets model weaknesses

RefAV (2025): Planning-centric scenario mining that focuses on finding scenarios relevant to specific planning failure modes, not just perceptually interesting ones.

SMc2f (2025): Robust scenario mining from coarse to fine — progressively refines scenario search from broad categories to specific safety-critical instances.

5.2 Scenario Tagging and Retrieval

Tagging taxonomy dimensions (a schema sketch follows this list):

  • Environmental: Weather (clear, rain, fog, snow), lighting (day, night, dusk, dawn), road surface (dry, wet, icy)
  • Traffic: Density (free flow, congested, stopped), agent types (vehicles, pedestrians, cyclists, construction equipment)
  • Behavioral: Lane changes, turns, merges, U-turns, emergency maneuvers, near-misses
  • Infrastructure: Intersections, roundabouts, construction zones, school zones, parking lots
  • Temporal: Time of day, day of week, season, holiday periods
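
One way to make the taxonomy queryable is a flat tag record per scenario that can be filtered alongside embedding search. The field names in this sketch are hypothetical, chosen to mirror the dimensions above:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ScenarioTags:
    weather: str            # "clear" | "rain" | "fog" | "snow"
    lighting: str           # "day" | "night" | "dusk" | "dawn"
    surface: str            # "dry" | "wet" | "icy"
    traffic_density: str    # "free_flow" | "congested" | "stopped"
    behaviors: tuple        # e.g. ("lane_change", "near_miss")
    infrastructure: tuple   # e.g. ("intersection", "construction_zone")
    hour_of_day: int        # 0-23

def matches(tags: ScenarioTags, **filters) -> bool:
    """Structured filter; combined with embedding search for hybrid queries."""
    return all(getattr(tags, k) == v for k, v in filters.items())

# e.g. matches(t, weather="rain", lighting="night") to narrow a semantic query
```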

Retrieval-driven development (AV 3.0 paradigm):

  • Customers continuously mine, assemble, and refresh datasets targeting observed model weaknesses
  • Each training iteration starts with a query: "what scenarios does the model handle worst?"
  • Foundation models enable natural-language retrieval without predefined tag schemas
  • Vector similarity search handles semantic queries that exact tags cannot express

LLM-based scenario analysis (2025): Recent work uses LLMs to analyze driving scenarios in natural language, enabling queries like "Why was the vehicle braking?" and extracting structured scenario descriptions from raw driving logs.

5.3 Data Balancing Strategies

The curse of rarity (Nature Communications, 2024): The rarity of safety-critical events in high-dimensional variable spaces presents fundamental challenges. Simply collecting more data does not proportionally increase coverage of the long tail — the number of possible rare combinations grows exponentially.

Strategies at the data level:

  1. Oversampling rare events: Duplicate or augment underrepresented scenarios (a weighted-sampler sketch follows this list). Risk: overfitting to specific rare instances
  2. Undersampling common events: Remove redundant "boring" driving. Risk: losing contextual diversity
  3. Importance sampling: Weight training examples by their information value. ActiveAD achieves 100% performance with 30% of data using this approach
  4. Synthetic augmentation: Generate rare scenarios using GANs, VAEs, diffusion models, or simulators. Risk: synthetic artifacts, domain gap
  5. Curriculum learning: Train on progressively harder scenarios, spending more compute on difficult examples
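
Strategies 1-3 are often implemented with a single weighted sampler over scenario tags. A minimal PyTorch sketch under that assumption (the tag names are toy examples); with inverse-frequency weights, each tag class is drawn equally often in expectation:

```python
from collections import Counter
from torch.utils.data import WeightedRandomSampler

# Toy tag list standing in for per-clip scenario labels
tags = ["nominal"] * 950 + ["cut_in"] * 40 + ["near_miss"] * 10
freq = Counter(tags)
weights = [1.0 / freq[t] for t in tags]    # rare tags get higher weight
sampler = WeightedRandomSampler(weights, num_samples=len(tags), replacement=True)
# Pass `sampler=sampler` to a DataLoader to oversample the long tail.
```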

Strategies at the model level:

  1. Loss re-weighting: Higher loss weight for rare classes/scenarios
  2. Meta-learning for long-tail: Differentiable semantic meta-learning specifically designed for long-tail motion forecasting (2025)
  3. Contrastive learning: Learn representations that distinguish rare events from common ones
  4. Model uncertainty as a signal: Let the model identify its own areas of confusion for targeted data collection

Industrial best practice: Combine real-world data mining + simulation-based augmentation + active learning, ensuring a balance between reliability (real data) and coverage (synthetic data) for rare scenarios.

5.4 Privacy and Anonymization for Airport Data

Regulatory context: Airport ramp data presents heightened privacy concerns beyond typical road driving:

  • Employee identification (ground handlers, maintenance personnel)
  • Aircraft registration numbers and airline-specific operations
  • Security procedures and restricted area layouts
  • Temporal patterns of operations (flight schedules, staffing patterns)
  • Location-based privacy risks from GPS/RTK data

Anonymization techniques:

Visual anonymization:

  • Face blurring/replacement: PP4AV provides a benchmark dataset and methods for face/license plate anonymization in driving contexts (a minimal blur sketch follows this list)
  • Deep Natural Anonymization: Replaces identifiable features with synthetic alternatives while preserving semantic content
  • Badge/uniform anonymization: Airport-specific requirement to obscure employee identification
  • Aircraft registration masking: Remove or blur tail numbers and airline-specific livery details
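
The detection-driven blurring these techniques rely on reduces to a small routine once an upstream detector supplies the sensitive regions. A minimal OpenCV sketch, with the detector itself assumed:

```python
import cv2  # pip install opencv-python

def blur_regions(frame, regions, kernel=(51, 51)):
    """Gaussian-blur each sensitive region (x, y, w, h) found by an
    upstream face / badge / registration detector."""
    out = frame.copy()
    for (x, y, w, h) in regions:
        roi = out[y:y + h, x:x + w]
        out[y:y + h, x:x + w] = cv2.GaussianBlur(roi, kernel, 0)
    return out
```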

Spatial anonymization:

  • Trajectory perturbation: Add noise to GPS tracks to prevent re-identification of specific routes
  • Temporal obfuscation: Shift timestamps to prevent correlation with flight schedules
  • Map generalization: Remove airport-specific details from HD maps before public release

Data-minimization approaches:

  • Process sensor data onboard, upload only embeddings/features rather than raw images
  • Federated learning: train models locally at each airport without centralizing raw data
  • Synthetic data generation as a privacy-preserving alternative to real data sharing
  • Differential privacy for aggregate statistics about airport operations

Recommended pipeline for airport data:

  1. Onboard processing: Run face/badge/registration detection immediately on capture
  2. Automatic anonymization: Apply Deep Natural Anonymization before data leaves the vehicle
  3. Access control: Role-based access — only safety team sees raw data; ML team works with anonymized data
  4. Audit trail: Log all data access for regulatory compliance
  5. Retention policy: Auto-delete raw data after anonymized version is verified; retain only anonymized data long-term
  6. Security classification: Tag each data element with sensitivity level (public / internal / restricted / confidential)

Summary: Key Takeaways for World Model Development

Dataset landscape

  • For pretraining: OpenDV-YouTube (1,700h+ diverse video) and nuPlan/Waymo (massive scale with annotations) provide the best foundation
  • For benchmarking: OpenScene + NAVSIM is the current standard for world model evaluation; WorldLens (CVPR 2026) provides the most comprehensive evaluation framework
  • For language grounding: DriveLM and CoVLA enable vision-language-action world model training

Scaling laws

  • Performance improves as a power-law with compute (similar to LLMs)
  • Optimal AV models are relatively small but require disproportionately more data than LLMs
  • World model pretraining objectives amplify scaling — DriveVLA-W0 shows sustained improvement to 70M frames where action-only supervision saturates

For airside operations

  • No public datasets exist; this is simultaneously a challenge and an opportunity
  • Transfer learning from road driving provides a strong foundation for low-level perception
  • Synthetic data from Unreal Engine / Flight Simulator is validated as a viable supplement
  • A phased data engine approach (instrument -> collect -> auto-label -> active-learn -> scale) can bootstrap from near-zero data
  • Privacy/security requirements demand onboard anonymization and federated learning approaches

Data engine essentials

  • Foundation model auto-labeling (Grounding DINO + SAM) has shifted the bottleneck from label generation to label verification
  • Active learning can achieve full performance with 30% of data (ActiveAD)
  • Scenario mining with embedding-based retrieval (Applied Intuition's approach) enables natural-language data discovery at scale
  • The data balancing challenge requires combining real mining + synthetic generation + importance sampling
