
LiDAR-Native World Models for Autonomous Driving

Point Cloud Prediction, 4D Occupancy Forecasting, and Deployment for Airport Airside Operations

Last updated: 2026-04-11


Key Takeaway: Camera-centric world models generate images or video -- representations that a LiDAR-primary stack cannot directly consume. LiDAR-native world models predict future point clouds, voxel occupancy, or range images in the same metric coordinate system the planner already operates in. For a LiDAR-primary airside AV like the reference airside AV stack, this eliminates the modality translation problem entirely. Copilot4D proved LiDAR future prediction is viable (65-75% Chamfer distance reduction). UnO demonstrated that self-supervised LiDAR occupancy forecasting outperforms supervised baselines. The field is converging on voxel-tokenized discrete diffusion and continuous occupancy fields as the two dominant paradigms. Both are deployable on Orin within 50-100ms for 3-step prediction, and both train self-supervised -- critical when no public airside LiDAR datasets exist.


Table of Contents

  1. Why LiDAR-Native World Models
  2. Copilot4D: Discrete Diffusion on LiDAR Tokens
  3. UnO: Unsupervised Occupancy Fields
  4. LidarDM: Map-Conditioned LiDAR Generation
  5. LiDARCrafter: Language-Guided 4D LiDAR Generation
  6. 4D Occupancy Forecasting with LiDAR Input
  7. Point Cloud Prediction Networks
  8. Multi-Sensor Fusion World Models
  9. AD-L-JEPA: Embedding-Space LiDAR Prediction
  10. Training LiDAR World Models on Airside Data
  11. Deployment on NVIDIA Orin
  12. Applications for Airside Safety
  13. Comparison Table
  14. Recommended Roadmap
  15. References

1. Why LiDAR-Native World Models

1.1 The Modality Mismatch Problem

The dominant world model research (GAIA-1/2/3, Sora, Vista, DrivingWorld) generates video frames -- pixel-space predictions of what the camera will see. This is natural for camera-primary stacks (Wayve, Tesla, comma.ai) but creates a fundamental mismatch for LiDAR-primary stacks:

| Camera World Model Output | LiDAR Stack Needs |
|---|---|
| RGB pixel predictions | 3D point coordinates (x, y, z) |
| BEV from image lifting | Native BEV from point cloud |
| Relative depth estimates | Absolute metric distances |
| 2D flow vectors | 3D velocity vectors |
| Appearance-based prediction | Geometry-based prediction |
| Resolution: ~1280x720 pixels | Resolution: ~200K points/scan |

The reference airside AV stack operates entirely in metric 3D coordinates: GTSAM localization produces metric poses, Frenet planning uses metric distances, and the safety controller checks metric clearances. A world model that predicts in pixel space requires an additional depth estimation stage to convert back to 3D -- introducing error, latency, and architectural complexity that is entirely avoidable.

1.2 What LiDAR-Native World Models Predict

LiDAR-native world models operate directly on point cloud or voxel representations. They predict one (or more) of:

Input:  LiDAR scans at t-T, t-T+1, ..., t (past T frames)
        Optional: ego action a_t, HD map M

Output options:
  1. Future point clouds:    P_{t+1}, P_{t+2}, ..., P_{t+K}
  2. Future voxel occupancy: O_{t+1}, O_{t+2}, ..., O_{t+K}
  3. Future range images:    R_{t+1}, R_{t+2}, ..., R_{t+K}
  4. Future occupancy flow:  {O_k, F_k} for k = t+1 ... t+K
  5. Continuous 4D field:    f(x, y, z, t) → {occupied, flow}

All of these outputs are in the same metric coordinate frame the planner uses. No conversion, no lifting, no depth estimation.

1.3 Metric Accuracy: Why It Matters for Safety

Camera world models inherently operate in a projective space where metric distances are ambiguous. A predicted image cannot distinguish between a small object nearby and a large object far away. LiDAR world models preserve absolute metric distances because they operate on range measurements:

| Safety Requirement | Camera World Model | LiDAR World Model |
|---|---|---|
| "Is the aircraft 15m away?" | Requires depth estimation (10-20% error at range) | Direct: range measurement in prediction |
| "Will the GSE be within 2m in 3s?" | Requires 3D trajectory extraction from video | Direct: predicted point cloud positions |
| "Is there a 0.5m clearance?" | Sub-meter accuracy unreliable beyond 20m | Sub-meter accuracy maintained to 100m+ |
| "Is FOD stationary or moving?" | Requires optical flow + depth → 3D flow | Direct: 3D flow in voxel space |

For airside operations where clearances are tight (wing tips within 2-3m of obstacles, personnel within arm's reach of equipment), metric prediction accuracy is not optional -- it is a safety requirement.

1.4 Advantages for Airport Airside

Aircraft trajectory prediction: Aircraft during pushback follow predictable geometric arcs defined by the nose gear steering angle and the pushback tractor's path. A LiDAR world model that has observed pushback sequences can predict the 3D swept volume of an aircraft 5-10 seconds ahead -- directly providing the forbidden zone for ego path planning.

GSE motion prediction: Baggage tractors, belt loaders, catering trucks, and fuel bowsers follow semi-structured patterns during turnaround. A LiDAR world model trained on turnaround sequences learns these patterns as 3D occupancy evolution, providing trajectory predictions without requiring per-vehicle-type detection and tracking.

FOD persistence modeling: LiDAR world models naturally separate static and dynamic scene elements. A detected object that persists in the same voxels across predicted future frames is confirmed stationary (debris, tool left on apron). One that disappears or moves is dynamic (person walking, vehicle in motion). This static/dynamic decomposition is critical for FOD assessment.

Metric safety envelopes: The planner can directly query predicted future occupancy: "At t+3s, is any voxel within my planned swept volume occupied?" This is a native operation in voxel space. The equivalent query in pixel space requires projecting the swept volume into image coordinates, checking predicted images, and converting back -- lossy and slow.
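As a minimal illustration of how native this query is, the sketch below checks a planned swept volume against one predicted occupancy grid. The grid layout, voxel indexing, and threshold are assumptions for illustration, not an interface from any of the models discussed here.

python
import numpy as np

def swept_volume_occupied(pred_occ, swept_voxels, threshold=0.5):
    """Check whether any voxel of a planned swept volume is predicted occupied.

    Args:
        pred_occ: predicted occupancy grid at one future timestep, [X, Y, Z],
                  values in [0, 1] (assumed layout for this sketch)
        swept_voxels: integer voxel indices [M, 3] covered by the ego swept volume
        threshold: occupancy probability treated as "occupied"
    Returns:
        True if the planned swept volume intersects predicted occupancy.
    """
    x, y, z = swept_voxels[:, 0], swept_voxels[:, 1], swept_voxels[:, 2]
    return bool((pred_occ[x, y, z] > threshold).any())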

1.5 Relationship to Camera-Centric Work in This Repository

This document complements the existing camera-centric world model research:

| Document | Focus | Modality |
|---|---|---|
| overview.md | GAIA-1/2, Cosmos, DriveWorld, Think2Drive | Camera (video generation) |
| diffusion-world-models.md | Sora, DriveDreamer, Drive-WM | Camera (diffusion video) |
| tokenized-and-jepa.md | DrivingGPT, GAIA-1, VQ-VAE | Camera (token prediction) |
| occupancy-world-models.md | OccWorld, Drive-OccWorld, OccSora | Camera BEV (occupancy) |
| occupancy-deployment-orin.md | FlashOcc, nvblox | Camera/LiDAR (occupancy) |
| This document | Copilot4D, UnO, LidarDM, LiDARCrafter | LiDAR-native |

The key architectural point: occupancy world models (OccWorld, Drive-OccWorld) operate on BEV/voxel representations that are modality-agnostic at the world model stage. They can be driven by either camera-based BEV encoders or LiDAR-based voxelization. The LiDAR-native world models covered here either operate directly on raw point clouds/range images or use LiDAR-specific tokenization.


2. Copilot4D: Discrete Diffusion on LiDAR Tokens

2.1 Overview

Paper: "Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion" Authors: Zhang et al. (Waabi) Venue: ICLR 2024 arXiv: 2311.01017

Copilot4D is the first LiDAR-native world model that demonstrates high-quality future point cloud prediction. It frames point cloud forecasting as discrete token prediction using a VQ-VAE tokenizer and a discrete diffusion process.

2.2 Architecture

LiDAR Point Cloud Sequence

├─ Past frames: P_{t-T}, ..., P_t


┌──────────────────────────────────────────────────────┐
│  VQ-VAE Tokenizer                                     │
│                                                        │
│  Encoder:                                              │
│    Point cloud → Voxelize (BEV pillars)               │
│    → PointNet per-pillar feature extraction            │
│    → Swin Transformer backbone (multi-scale features)  │
│    → Bottleneck projection                             │
│                                                        │
│  Vector Quantization:                                  │
│    → Lookup nearest codebook entry (K=8192 entries)    │
│    → Straight-through estimator for gradients          │
│    → Exponential moving average codebook updates       │
│                                                        │
│  Decoder (dual-branch):                                │
│    Branch 1: Neural feature grid → continuous features │
│    Branch 2: Binary occupancy map → occupied/free      │
│    → Combined via differentiable rendering → P_recon   │
└───────────────────┬──────────────────────────────────┘
                    │ Discrete tokens z_t per frame

┌──────────────────────────────────────────────────────┐
│  Discrete Diffusion World Model                       │
│                                                        │
│  Input: z_{t-T}, ..., z_t (past token sequences)      │
│                                                        │
│  Forward process:                                      │
│    Randomly mask tokens with probability β (absorbing) │
│    Controlled noise injection rate η = 20%             │
│                                                        │
│  Reverse process (MaskGIT-based):                      │
│    Predict all masked tokens in parallel               │
│    Iteratively unmask by confidence score              │
│    10-20 steps for high-quality generation             │
│                                                        │
│  Architecture: Transformer with spatiotemporal attn    │
│  Temporal causal masking: can only attend to past      │
│                                                        │
│  Output: z_{t+1}, ..., z_{t+K} (future token seqs)    │
└───────────────────┬──────────────────────────────────┘


┌──────────────────────────────────────────────────────┐
│  VQ-VAE Decoder                                       │
│                                                        │
│  Tokens → codebook lookup → features                   │
│  → Dual-branch decoding → point cloud reconstruction  │
│                                                        │
│  Output: P_{t+1}, ..., P_{t+K} (predicted future      │
│          point clouds in metric coordinates)           │
└──────────────────────────────────────────────────────┘

2.3 Key Technical Innovations

Discrete tokenization of LiDAR occupancy: Rather than modeling continuous point distributions (which is intractable for 200K+ points/scan), Copilot4D converts point clouds to a discrete BEV grid and applies vector quantization. This reduces the prediction problem from generating 200K 3D coordinates to predicting ~2000 discrete tokens -- a 100x dimensionality reduction that makes transformer-based modeling feasible.

Modified MaskGIT for temporal prediction: Standard MaskGIT generates images by iteratively unmasking tokens. Copilot4D extends this to temporal sequences: given past frames as context (unmasked), predict all future tokens simultaneously, then iteratively refine by confidence. The controlled noise injection (η = 20%) during training prevents the model from collapsing to copying the last observed frame.
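A minimal sketch of that confidence-based iterative unmasking loop is shown below. The predictor interface, token shapes, and mask id are assumptions for illustration, not Copilot4D's released code (none exists).

python
import torch

def maskgit_decode(predictor, past_tokens, num_future, mask_id, steps=10):
    """Iteratively unmask future tokens by confidence (MaskGIT-style sketch)."""
    future = torch.full((num_future,), mask_id, dtype=torch.long)
    for s in range(steps):
        logits = predictor(past_tokens, future)          # [num_future, vocab] (assumed)
        probs = logits.softmax(-1)
        conf, pred = probs.max(-1)                        # per-token confidence and argmax
        still_masked = future == mask_id
        # Unmask a growing fraction of the most confident remaining tokens each step
        k = max(1, int(still_masked.sum() * (s + 1) / steps))
        conf = conf.masked_fill(~still_masked, -1.0)      # ignore already-fixed tokens
        unmask = conf.topk(k).indices
        future[unmask] = pred[unmask]
    return future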

Dual-branch decoder: The decoder produces both a neural feature grid (for continuous features like intensity) and a binary occupancy map (for structure). Point clouds are reconstructed via differentiable ray-casting through the occupancy map, producing physically plausible returns (respecting occlusion, ray termination).

2.4 Results

| Metric | Copilot4D | Previous SOTA | Improvement |
|---|---|---|---|
| Chamfer Distance @ 1s (nuScenes) | 0.36 | 1.41 | 74% reduction |
| Chamfer Distance @ 3s (nuScenes) | 0.72 | ~1.5 | >50% reduction |
| Perceptual Quality (FPD) | Best | -- | First to evaluate |
| Datasets evaluated | nuScenes, KITTI, Argoverse2 | Single dataset | 3x generality |

Qualitative observations:

  • Generates realistic future point clouds that preserve object shapes and scene geometry
  • Handles dynamic objects: vehicles accelerating, turning, stopping
  • Preserves static background structure: buildings, fences, vegetation
  • Struggles with sudden appearance of new objects (occlusion-to-visibility transitions)

2.5 Limitations

  1. Not open-source: Code and weights are not publicly released. Waabi (founded by Raquel Urtasun, previously of Uber ATG) treats this as proprietary technology.
  2. BEV-only representation: Copilot4D uses BEV pillars, collapsing height information. This limits prediction of objects at different vertical levels (e.g., aircraft wing tips above vehicle roof).
  3. Single-sensor training: Trained on individual LiDAR sweeps from nuScenes (32-beam) and Waymo (64-beam). Multi-LiDAR setups like the reference airside AV stack's 4-8 sensors are not directly supported -- would need to merge point clouds before tokenization.
  4. No action conditioning: Copilot4D predicts unconditional futures -- it does not condition on the ego vehicle's planned action. This means it cannot answer "what will happen IF I take this action?"
  5. Inference speed: The iterative MaskGIT decoding requires 10-20 steps per future frame. Estimated 80-150ms per 3-step prediction on A100. On Orin, this would need significant optimization.

2.6 Airside Relevance

Despite its limitations, Copilot4D proved a critical thesis: LiDAR future prediction is viable and dramatically better than prior approaches. The 65-75% Chamfer distance reduction demonstrates that discrete tokenization of LiDAR data is a productive representation for world modeling.

For airside application, Copilot4D's approach would need adaptation:

  • Merge 4-8 RoboSense scans into unified cloud before tokenization
  • Extend BEV to full 3D voxelization (aircraft height variation requires it)
  • Add action conditioning for planning integration
  • Fine-tune on airside LiDAR data (see Section 10)

The architecture is reproducible from the paper even without official code. The VQ-VAE tokenizer uses standard components (PointNet, Swin Transformer, codebook quantization), and MaskGIT-based diffusion is well-documented.


3. UnO: Unsupervised Occupancy Fields

3.1 Overview

Paper: "Unsupervised Occupancy Fields for Perception and Forecasting" Authors: Agro et al. (Waabi) Venue: CVPR 2024 Oral arXiv: 2406.08691

UnO learns a continuous 4D spatio-temporal occupancy field from raw LiDAR data with zero semantic labels. It is arguably the most relevant LiDAR-native world model for the reference airside AV stack because it is fully self-supervised, operates directly on LiDAR, and produces both occupancy and flow predictions.

3.2 Architecture

LiDAR Point Cloud Sequence: P_{t-T}, ..., P_t


┌─────────────────────────────────────────────────────────┐
│  Voxelized LiDAR Encoder                                │
│                                                          │
│  Each scan → voxel grid (e.g., 0.2m resolution)         │
│  → Sparse 3D convolution backbone (VoxelNet-style)      │
│  → BEV feature extraction F_t                            │
│                                                          │
│  HD Map Rasterization (optional):                        │
│  → Rasterize lane lines, road boundaries → BEV map      │
│  → Concatenate with LiDAR BEV features                   │
└───────────────────┬─────────────────────────────────────┘
                    │ BEV features: F_{t-T}, ..., F_t

┌─────────────────────────────────────────────────────────┐
│  Temporal Fusion + Implicit Field Decoder                │
│                                                          │
│  Query: (x, y, z, t') for any future spacetime point    │
│                                                          │
│  Decoder:                                                │
│    1. Bilinear interpolate BEV features at (x, y)       │
│    2. Cross-attend to temporal sequence (past context)   │
│    3. MLP head → {P(occupied), flow_vector(vx, vy, vz)} │
│                                                          │
│  Continuous: queries at arbitrary resolution/timestep    │
│  No grid commitment at inference time                    │
└───────────────────┬─────────────────────────────────────┘


┌─────────────────────────────────────────────────────────┐
│  Self-Supervised Training                                │
│                                                          │
│  1. Occupancy loss: Does predicted occupancy explain     │
│     observed LiDAR returns? (ray termination likelihood) │
│                                                          │
│  2. Free-space loss: Rays that pass through a region     │
│     without returning → that region should be empty      │
│                                                          │
│  3. Flow consistency loss: Predicted flow should         │
│     transport occupancy from t to t+1 consistently      │
│                                                          │
│  4. Point cloud rendering loss (optional):               │
│     Render predicted occupancy to point cloud,           │
│     compare with actual future LiDAR scan                │
│                                                          │
│  NO SEMANTIC LABELS REQUIRED                             │
└─────────────────────────────────────────────────────────┘

3.3 Key Technical Aspects

Continuous implicit representation: Unlike grid-based methods (Copilot4D, OccWorld) that commit to a fixed resolution, UnO represents the environment as a continuous function. The planner can query occupancy at any 3D point at any future time -- enabling adaptive resolution (fine near the vehicle, coarse at distance) without retraining.

Self-supervised training from LiDAR only: The training signal comes entirely from future LiDAR observations. If the model predicts a region is occupied at t+1 and the actual LiDAR scan at t+1 has returns in that region, the prediction was correct. Free-space supervision comes from rays that traverse a region without hitting anything. This eliminates the need for any human annotation.

Flow prediction: In addition to occupancy, UnO predicts 3D flow vectors for each occupied point. Flow indicates direction and speed of movement, enabling distinction between static obstacles (zero flow) and dynamic agents (nonzero flow). This is directly usable for trajectory prediction.
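A sketch of how a planner-facing query against such an implicit field might look is given below, assuming a trained model that exposes fused temporal BEV features and a decoder MLP. The names, shapes, and normalization convention are illustrative; UnO has no released API.

python
import torch

@torch.no_grad()
def query_occupancy_flow(bev_features, decoder, query_points):
    """Query a continuous occupancy-flow field at arbitrary (x, y, z, t) points.

    Args:
        bev_features: fused temporal BEV features [1, C, H, W]
        decoder: MLP mapping (interpolated feature, z, t) → (occ logit, vx, vy, vz)
        query_points: [N, 4] (x, y, z, t) queries, with (x, y) already normalized
                      to [-1, 1] grid coordinates (assumed convention)
    Returns:
        occ_prob: [N] occupancy probabilities
        flow: [N, 3] predicted 3D flow vectors
    """
    # Bilinearly interpolate BEV features at the (x, y) query locations
    grid = query_points[:, :2].view(1, -1, 1, 2)                    # [1, N, 1, 2]
    feats = torch.nn.functional.grid_sample(
        bev_features, grid, align_corners=True)                     # [1, C, N, 1]
    feats = feats[0, :, :, 0].t()                                   # [N, C]
    # Condition the decoder on the height and time of each query
    out = decoder(torch.cat([feats, query_points[:, 2:]], dim=-1))  # [N, 4]
    return out[:, 0].sigmoid(), out[:, 1:]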

3.4 Results

| Benchmark | Metric | UnO | Previous SOTA | Notes |
|---|---|---|---|---|
| Argoverse 2 | Point Cloud Forecasting | SOTA | -- | Best across all horizons |
| nuScenes | Point Cloud Forecasting | SOTA | -- | Best across all horizons |
| KITTI | Point Cloud Forecasting | SOTA | -- | Best across all horizons |
| nuScenes | BEV Semantic Occupancy (fine-tuned) | SOTA | -- | Superior to supervised baselines |
| nuScenes | Low-data regime (10% labels) | SOTA | -- | Largest advantage when labels scarce |

Critical finding: UnO pre-trained with self-supervision outperforms fully supervised baselines when fine-tuned on limited labeled data. This is the scalability breakthrough -- the model learns useful 3D world dynamics from raw LiDAR without any annotation.

3.5 Airside Relevance

UnO is arguably the highest-priority LiDAR world model for the reference airside AV stack for several reasons:

  1. Zero labels needed for pre-training. Just drive the vehicle around the airside recording LiDAR sequences. UnO learns world dynamics from these raw recordings.
  2. Continuous representation matches planner queries. The Frenet planner can query arbitrary (x, y, z, t) points along candidate trajectories to check predicted occupancy and flow.
  3. Flow prediction enables agent classification. Static objects (FOD, parked aircraft) vs. dynamic agents (GSE, personnel) are distinguished by flow magnitude -- no detection or tracking required.
  4. Generalizes to any object. Class-agnostic occupancy handles the 30+ GSE types, 100+ aircraft variants, and novel objects (construction barriers, temporary signage) found on airside.
  5. CVPR Oral quality. This is peer-reviewed work from Waabi (Raquel Urtasun's company, focused on LiDAR-first autonomy).

Limitation: Like Copilot4D, UnO is not open-source. However, the architecture is fully described in the paper, and the components (sparse 3D CNN, implicit field decoder, self-supervised losses) are all implementable with standard libraries (spconv, PyTorch).


4. LidarDM: Map-Conditioned LiDAR Generation

4.1 Overview

Paper: "Generative LiDAR Simulation in a Generated World" Authors: Zyrianov et al. Venue: ICRA 2025 arXiv: 2404.02903GitHub: vzyrianov/LidarDM

LidarDM is the first map-conditioned LiDAR generation model -- given an HD map layout and a driving scenario (trajectories of actors), it generates realistic LiDAR point cloud sequences. This is not a world model in the prediction sense (it does not forecast from past observations), but rather a generative simulation model that creates synthetic LiDAR data for training and testing.

4.2 Architecture

Inputs:
  1. HD Map layout (lanes, road boundaries, crosswalks)
  2. Actor trajectories (position, heading, velocity over time)
  3. Ego trajectory (position, heading over time)
  4. Optional: LiDAR sensor specification (beam pattern, range)

Pipeline:
┌─────────────────────────────────────────────┐
│  Stage 1: Layout-Aware Scene Composition     │
│                                              │
│  HD map → rasterized BEV layout              │
│  Actor trajectories → 3D bounding boxes      │
│  Compose background + foreground layout      │
└──────────────────┬──────────────────────────┘


┌─────────────────────────────────────────────┐
│  Stage 2: Continuous Diffusion on            │
│           Range Images                       │
│                                              │
│  Layout → condition features                 │
│  Diffusion model generates range images      │
│  (H x W where H = beam count, W = azimuth)  │
│  Denoising: 50-100 DDPM steps               │
│  Produces: range + intensity per pixel       │
└──────────────────┬──────────────────────────┘


┌─────────────────────────────────────────────┐
│  Stage 3: Range Image → Point Cloud          │
│                                              │
│  Convert range pixels to 3D (x, y, z) using │
│  known beam angles and ego pose              │
│  Apply raydrop modeling (physically-based    │
│  model of which rays return vs miss)         │
│  Output: realistic point cloud P_t           │
└─────────────────────────────────────────────┘

4.3 Key Innovations

Range image as diffusion target: Rather than generating 3D points directly (high-dimensional, unordered), LidarDM converts the problem to 2D: generate a range image where each pixel encodes the distance measured by one LiDAR beam at one azimuth angle. This 2D representation allows efficient use of standard image diffusion architectures (U-Net with cross-attention for conditioning).

Map conditioning: The HD map provides structural priors (road geometry, building outlines) that constrain the generated scene. This is critical for generating novel but plausible scenes -- the LiDAR returns must be consistent with the underlying road layout.

Raydrop modeling: Real LiDAR sensors do not receive returns from all emitted beams -- some miss due to surface properties, range limits, or multi-path effects. LidarDM models this raydrop pattern, producing point clouds with realistic density variations.
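A sketch of the Stage 3 conversion from a range image back to a point cloud, assuming a known per-beam elevation table and a uniform azimuth sweep (a simplification of LidarDM's sensor model):

python
import numpy as np

def range_image_to_points(range_img, beam_elevations, raydrop_mask=None):
    """Convert a range image [H, W] to 3D points using known beam geometry.

    Args:
        range_img: measured range per (beam, azimuth) pixel, in meters
        beam_elevations: [H] per-beam elevation angles in radians
        raydrop_mask: optional [H, W] boolean mask of beams that returned
    Returns:
        points: [M, 3] (x, y, z) in the sensor frame
    """
    num_beams, num_azimuths = range_img.shape
    azimuths = np.linspace(-np.pi, np.pi, num_azimuths, endpoint=False)
    elev = beam_elevations[:, None]                      # [H, 1]
    az = azimuths[None, :]                               # [1, W]
    x = range_img * np.cos(elev) * np.cos(az)
    y = range_img * np.cos(elev) * np.sin(az)
    z = range_img * np.sin(elev)
    valid = range_img > 0 if raydrop_mask is None else raydrop_mask
    return np.stack([x[valid], y[valid], z[valid]], axis=-1)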

4.4 Results

  • Generates physically plausible LiDAR sequences from map + trajectory specification
  • Temporal coherence across frames (smooth scene evolution)
  • Evaluated by training 3D detection models on synthetic data and testing on real data
  • Sim-to-real gap: models trained on LidarDM data achieve within 5-8% of real-data performance

4.5 Airside Application: Generating Training Data from AMDB Maps

This is where LidarDM becomes uniquely valuable for the reference airside AV stack use case. The existing repository documents note that no public airside LiDAR datasets exist and that AMDB (Aerodrome Mapping Database) data is available free from the FAA for 500+ US airports.

The pipeline:

AMDB (airport layout) → Convert to HD map format
                       → Define turnaround scenarios (actor trajectories)
                       → LidarDM generates synthetic LiDAR sequences
                       → Train/validate perception models on synthetic data
                       → Fine-tune on small real airside dataset (500-1000 frames)

Specific scenarios for generation:

  1. Aircraft pushback from gate (define aircraft trajectory from gate to taxiway)
  2. Baggage tractor convoy crossing apron (define multi-vehicle trajectories)
  3. Catering truck approaching aircraft door (define approach trajectory)
  4. Personnel walking near active taxiway (define pedestrian trajectories)
  5. FOD on apron surface (place static objects at specified locations)

Cost estimate: Per the sim-to-real document, a high-fidelity digital twin can outperform real-data training by 4.8%. LidarDM + 500 real scans achieves within 3-5% of fully real-data-trained models using 10x less real annotation. Development cost: $50-75K for first airport, $25-50K for each additional airport.

4.6 Limitations

  • Requires HD map as input (AMDB provides this for airports, but conversion to HD map format is needed -- see the AMXM/Lanelet2 discussion in 30-autonomy-stack/localization-mapping/maps/hd-map-standards-airside.md)
  • Range image representation limits to single-LiDAR viewpoint per generation step
  • 50-100 diffusion steps per frame: not real-time, only for offline data generation
  • Beam pattern specific to training data (nuScenes 32-beam vs. RoboSense 32-beam RSHELIOS -- close but not identical)

5. LiDARCrafter: Language-Guided 4D LiDAR Generation

5.1 Overview

Paper: "Dynamic 4D World Modeling from LiDAR Sequences" Authors: WorldBench Venue: AAAI 2026 Oral GitHub: worldbench/LiDARCrafter (193 stars)

LiDARCrafter extends LiDAR generation to language-guided 4D scene creation. Rather than specifying exact trajectories (as LidarDM requires), you describe the scene in natural language and the model generates corresponding 4D LiDAR sequences.

5.2 Architecture

Input: Language prompt + optional seed LiDAR frame
       "A truck merges from the right lane while a pedestrian
        crosses ahead"

Pipeline:
┌─────────────────────────────────────────────────┐
│  Component 1: 4D Layout Generation               │
│                                                   │
│  Language prompt → CLIP text encoder              │
│  → Layout prediction network                      │
│  → 3D bounding box trajectories over time         │
│  → Scene graph with object types and motions      │
└──────────────────┬────────────────────────────────┘


┌─────────────────────────────────────────────────┐
│  Component 2: Single-Frame LiDAR Synthesis       │
│                                                   │
│  Layout → condition features                      │
│  Seed frame (if provided) → structural prior      │
│  Diffusion model generates high-fidelity frame    │
│  Scene-level and object-level generation          │
└──────────────────┬────────────────────────────────┘


┌─────────────────────────────────────────────────┐
│  Component 3: Temporal Consistency Enforcement    │
│                                                   │
│  Inter-frame coherence via temporal attention     │
│  Static background persistence                    │
│  Dynamic object motion smoothness                 │
│  Output: 4D LiDAR sequence (T frames x N points) │
└─────────────────────────────────────────────────┘

5.3 Results

| Evaluation | LiDARCrafter | Previous SOTA | Metric |
|---|---|---|---|
| Single-frame quality (nuScenes) | Best | LidarDM | Scene-level FPD |
| Foreground object quality | Best | -- | Object-level Chamfer |
| Temporal stability | Best | -- | Frame-to-frame consistency |
| Controllability | Language-guided | Trajectory-only (LidarDM) | Flexibility |

5.4 Airside Application

LiDARCrafter's language interface is powerful for generating diverse airside scenarios:

Prompt examples for airside data generation:

"Aircraft pushback from stand with tractor attached to nose gear,
 turning right onto taxiway alpha"

"Three baggage carts in convoy crossing apron from left to right,
 with a loader parked at the aircraft door"

"Ground crew member walking from the terminal building toward
 the aircraft, passing behind a fuel bowser"

"Small debris on the taxiway centerline, with a taxiing aircraft
 approaching from 100 meters"

"De-icing truck spraying aircraft wing, with fog reducing
 visibility to 200 meters"

Advantage over LidarDM: No need to specify exact trajectories -- the language model handles motion planning of the described agents. This dramatically reduces the cost of scenario authoring, enabling generation of hundreds of diverse airside scenarios from text descriptions.

Limitation: Language-conditioned generation may not produce sufficiently precise geometric accuracy for safety-critical scenarios. The aircraft swept area during pushback must be geometrically exact (determined by nose gear angle and aircraft dimensions), which a language-conditioned model may approximate imprecisely.


6. 4D Occupancy Forecasting with LiDAR Input

6.1 The Occupancy Forecasting Paradigm

4D occupancy forecasting predicts future 3D occupancy grids from past observations. While most published methods use camera-based BEV encoders (because nuScenes benchmarks are camera-focused), the forecasting stage itself is modality-agnostic -- it operates on BEV/voxel features regardless of how they were produced. This means any camera-based occupancy forecasting method can be driven by LiDAR-based BEV features instead.

Camera-based pipeline (published):
  Multi-view images → BEV encoder (LSS/BEVFormer) → BEV features
  → Occupancy forecasting model → Future occupancy grids

LiDAR-based pipeline (for reference airside AV stack):
  Multi-LiDAR point clouds → Voxelization + 3D sparse CNN → BEV features
  → SAME occupancy forecasting model → Future occupancy grids

The forecasting model does not care how the BEV features were produced.

6.2 OccWorld with LiDAR Input

Paper: OccWorld (ECCV 2024)
GitHub: wzzheng/OccWorld

OccWorld's architecture has two stages: (1) scene tokenizer (VQ-VAE) and (2) GPT-like temporal predictor. The scene tokenizer operates on 3D semantic occupancy grids -- which can come from either camera-based or LiDAR-based occupancy estimation.

LiDAR adaptation:

python
# Original OccWorld: camera BEV → occupancy → tokenize → predict
# Adapted for LiDAR: LiDAR voxels → occupancy → tokenize → predict

# Step 1: Replace BEV encoder with LiDAR voxelization
# Original:
#   bev_features = camera_bev_encoder(multi_view_images)
# Replace with:
#   bev_features = lidar_voxel_encoder(merged_point_cloud)

import torch
import spconv.pytorch as spconv
from spconv.pytorch.utils import PointToVoxel


class LiDARBEVEncoder(torch.nn.Module):
    """Replace the camera BEV encoder with a LiDAR-based encoder (sketch).

    MeanVoxelEncoder is a placeholder helper (mean over the points in each
    voxel followed by a linear projection to 16 channels); it is not part of
    spconv and would need to be implemented.
    """

    def __init__(self, voxel_size=(0.2, 0.2, 0.2),
                 point_cloud_range=(-51.2, -51.2, -5.0, 51.2, 51.2, 3.0),
                 max_points_per_voxel=20,
                 max_voxels=40000):
        super().__init__()
        self.voxel_size = voxel_size
        self.point_cloud_range = point_cloud_range
        # Spatial shape of the voxel grid in (z, y, x) order, as spconv expects
        self.grid_shape = [
            int(round((point_cloud_range[i + 3] - point_cloud_range[i]) / voxel_size[i]))
            for i in (2, 1, 0)
        ]

        # Hard voxelization (spconv 2.x)
        self.voxel_layer = PointToVoxel(
            vsize_xyz=list(voxel_size),
            coors_range_xyz=list(point_cloud_range),
            num_point_features=4,                      # x, y, z, intensity
            max_num_voxels=max_voxels,
            max_num_points_per_voxel=max_points_per_voxel,
        )

        # Simple mean encoder (or PointNet for richer features) -- placeholder
        self.voxel_encoder = MeanVoxelEncoder(in_channels=4, out_channels=16)

        # Sparse 3D backbone (VoxelBackBone8x-style)
        self.backbone_3d = spconv.SparseSequential(
            spconv.SubMConv3d(16, 32, 3, padding=1),
            torch.nn.BatchNorm1d(32),
            torch.nn.ReLU(),
            spconv.SparseConv3d(32, 64, 3, stride=2, padding=1),
            torch.nn.BatchNorm1d(64),
            torch.nn.ReLU(),
            # ... additional strided layers until 8 height slices of 64 channels remain
        )

        # Height compression → BEV (assumes 8 remaining height slices x 64 channels)
        self.height_compression = torch.nn.Conv2d(64 * 8, 256, 1)

    def forward(self, points):
        """
        Args:
            points: merged multi-LiDAR point cloud [N, 4] (x, y, z, intensity)
        Returns:
            bev_features: [1, 256, H, W] matching OccWorld's expected input
        """
        voxels, coords, num_points = self.voxel_layer(points)
        voxel_features = self.voxel_encoder(voxels, num_points)              # [M, 16]
        # Prepend a batch index (single-sample sketch) for SparseConvTensor
        batch_coords = torch.nn.functional.pad(coords, (1, 0), value=0).int()
        sp_tensor = spconv.SparseConvTensor(
            voxel_features, batch_coords, self.grid_shape, batch_size=1)
        dense = self.backbone_3d(sp_tensor).dense()                          # [1, C, D, H, W]
        b, c, d, h, w = dense.shape
        bev = self.height_compression(dense.view(b, c * d, h, w))            # [1, 256, H, W]
        return bev

6.3 Drive-OccWorld with LiDAR Input

Paper: Drive-OccWorld (AAAI 2025)
GitHub: yuyang-cloud/Drive-OccWorld

Drive-OccWorld extends OccWorld with action conditioning -- it predicts what the world will look like IF the ego vehicle takes a specific action. This is critical for planning: evaluate multiple candidate trajectories by predicting their consequences.

Action conditioning formats supported:

  • Velocity: (vx, vy) in m/s
  • Steering angle: converted to curvature
  • Trajectory: sequence of (delta_x, delta_y) in meters
  • High-level commands: "go forward," "turn left," "turn right"

LiDAR adaptation is identical to OccWorld -- replace the camera BEV encoder with a LiDAR voxel encoder. The action conditioning and temporal prediction stages are unchanged.
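A minimal sketch of how one of the formats listed above (a planned trajectory) could be embedded and broadcast onto BEV features is shown below. The embedding sizes and fusion-by-concatenation choice are assumptions, not Drive-OccWorld's exact implementation. One forward pass per candidate trajectory then yields one predicted future for the planner to score.

python
import torch

class ActionConditioner(torch.nn.Module):
    """Embed a candidate ego trajectory and fuse it with BEV features (sketch)."""

    def __init__(self, bev_channels=256, action_dim=64, horizon=6):
        super().__init__()
        # (delta_x, delta_y) per future step → flat action embedding
        self.embed = torch.nn.Sequential(
            torch.nn.Linear(2 * horizon, 128), torch.nn.ReLU(),
            torch.nn.Linear(128, action_dim))
        self.fuse = torch.nn.Conv2d(bev_channels + action_dim, bev_channels, 1)

    def forward(self, bev, trajectory):
        """
        Args:
            bev: [B, C, H, W] temporal BEV features
            trajectory: [B, horizon, 2] planned (delta_x, delta_y) in meters
        Returns:
            [B, C, H, W] action-conditioned BEV features
        """
        a = self.embed(trajectory.flatten(1))                              # [B, action_dim]
        a = a[:, :, None, None].expand(-1, -1, bev.shape[2], bev.shape[3])
        return self.fuse(torch.cat([bev, a], dim=1))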

Results on nuScenes (camera-based, for reference):

  • 33% improvement on L2@1s vs. UniAD
  • 0.85m average L2 error
  • 0.29% collision rate

6.4 DIO: Decomposable Implicit 4D Occupancy-Flow

Paper: "Decomposable Implicit 4D Occupancy-Flow World Model" Venue: CVPR 2025 Link: DIO

DIO combines the implicit continuous representation of UnO with the action-conditioned prediction of Drive-OccWorld. Key innovation: decomposing the scene into static background and per-object dynamic fields, with each object's future state predicted by a decomposed implicit function conditioned on its class, velocity, and the ego action.

Relevance for airside: The decomposition into static background (terminal buildings, taxiway surface) and dynamic objects (aircraft, GSE, personnel) naturally maps to airside scene structure. Each dynamic object's trajectory is predicted independently, enabling per-agent risk assessment.

6.5 FlashOcc Extension for Temporal Prediction

FlashOcc (see occupancy-deployment-orin.md) achieves 197.6 FPS on Orin for single-frame occupancy prediction. Extending it to temporal prediction requires adding a recurrent or attention-based temporal module:

FlashOcc (single frame):
  BEV features → 2D CNN → channel-to-height → 3D occupancy
  Speed: 197.6 FPS on Orin (TensorRT INT8)

FlashOcc + Temporal (proposed):
  BEV features at t-2, t-1, t → Temporal fusion (ConvGRU or temporal attention)
  → 2D CNN → channel-to-height → future 3D occupancy at t+1, t+2, t+3
  Estimated speed: 40-80 FPS on Orin (3-step prediction adds ~3x compute)

This lightweight temporal extension preserves FlashOcc's efficiency while adding prediction capability. The temporal fusion can be as simple as a ConvGRU operating on BEV features:

python
class TemporalFlashOcc(torch.nn.Module):
    """FlashOcc with lightweight temporal prediction."""
    
    def __init__(self, flashocc_backbone, hidden_dim=128, num_future_steps=3):
        super().__init__()
        self.backbone = flashocc_backbone  # Pre-trained FlashOcc
        self.num_future_steps = num_future_steps
        
        # ConvGRU for temporal fusion
        self.temporal_gru = ConvGRU(
            input_dim=hidden_dim,
            hidden_dim=hidden_dim,
            kernel_size=3,
            num_layers=2
        )
        
        # Future prediction heads (one per timestep)
        self.future_heads = torch.nn.ModuleList([
            torch.nn.Conv2d(hidden_dim, hidden_dim, 3, padding=1)
            for _ in range(num_future_steps)
        ])
    
    def forward(self, bev_features_sequence):
        """
        Args:
            bev_features_sequence: [B, T, C, H, W] past BEV features
        Returns:
            future_occupancy: [B, K, X, Y, Z] predicted future occupancy
        """
        # Encode temporal context
        hidden = self.temporal_gru(bev_features_sequence)  # [B, C, H, W]
        
        # Predict future steps autoregressively
        future_occ = []
        h = hidden
        for k in range(self.num_future_steps):
            h = self.future_heads[k](h) + h  # residual
            occ_k = self.backbone.occ_head(h)  # channel-to-height → 3D
            future_occ.append(occ_k)
        
        return torch.stack(future_occ, dim=1)
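The ConvGRU referenced above is not defined in the snippet; a minimal single-layer convolutional GRU that could fill that role (a standard formulation, not tied to any particular repository) looks like this:

python
import torch

class ConvGRU(torch.nn.Module):
    """Minimal convolutional GRU over a sequence of BEV feature maps."""

    def __init__(self, input_dim, hidden_dim, kernel_size=3, num_layers=1):
        super().__init__()
        # num_layers is accepted for interface compatibility; this sketch is single-layer
        padding = kernel_size // 2
        self.hidden_dim = hidden_dim
        # Reset and update gates plus candidate state, all via 2D convolutions
        self.gates = torch.nn.Conv2d(input_dim + hidden_dim, 2 * hidden_dim,
                                     kernel_size, padding=padding)
        self.candidate = torch.nn.Conv2d(input_dim + hidden_dim, hidden_dim,
                                         kernel_size, padding=padding)

    def forward(self, x_seq):
        """x_seq: [B, T, C, H, W] → final hidden state [B, hidden_dim, H, W]."""
        b, t, _, h, w = x_seq.shape
        state = x_seq.new_zeros(b, self.hidden_dim, h, w)
        for i in range(t):
            x = x_seq[:, i]
            zr = torch.sigmoid(self.gates(torch.cat([x, state], dim=1)))
            z, r = zr.chunk(2, dim=1)                        # update and reset gates
            cand = torch.tanh(self.candidate(torch.cat([x, r * state], dim=1)))
            state = (1 - z) * state + z * cand
        return state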

6.6 Comparison of 4D Occupancy Forecasting Methods

| Method | Venue | Input | Action Conditioned | Self-Supervised | LiDAR-Native | Open Source |
|---|---|---|---|---|---|---|
| OccWorld | ECCV 2024 | Camera BEV | No | No | Adaptable | Yes |
| Drive-OccWorld | AAAI 2025 | Camera BEV | Yes | No | Adaptable | Yes |
| OccSora | 2024 | Camera BEV | Yes (trajectory) | No | Adaptable | Partial |
| UnO | CVPR 2024 | LiDAR | No | Yes | Yes | No |
| DIO | CVPR 2025 | Multi-modal | Yes | Partial | Yes | No |
| Cam4DOcc | CVPR 2024 | Camera | No | No | No | Yes |
| OccLLaMA | 2024 | Camera BEV | Yes (language) | No | Adaptable | Partial |

Recommendation: Start with OccWorld (open-source, well-documented) with a LiDAR BEV encoder. Upgrade to Drive-OccWorld when action conditioning is needed. Aim toward UnO-style self-supervised training long-term to eliminate labeling requirements.


7. Point Cloud Prediction Networks

7.1 Direct Point Cloud Sequence Prediction

Before the world model era, several methods tackled point cloud prediction as a sequence modeling problem. These are simpler, faster, and more directly deployable than the methods in Sections 2-6, though less capable.

7.2 PointRNN / PointGRU / PointLSTM

Papers:

  • PointRNN (ECCV 2020): "PointRNN: Point Recurrent Neural Network for Moving Point Cloud Processing"
  • PointLSTM (2020): Extension with LSTM gating for longer-term dependencies

Architecture:

PointRNN / PointLSTM:

Input: Point cloud sequence P_{t-T}, ..., P_t
       Each P_i = [N_i x 3] (variable number of points)

For each timestep:
  1. Group nearby points into local neighborhoods (ball query)
  2. Extract per-neighborhood features (PointNet)
  3. Update recurrent state:
     PointRNN:  h_t = tanh(W_h * h_{t-1} + W_x * x_t + b)
     PointLSTM: f_t, i_t, o_t, c_t = LSTM(h_{t-1}, x_t)
  4. Decode next point positions from hidden state

Output: Predicted P_{t+1}, ..., P_{t+K}

Pros:

  • Simple architecture, easy to implement
  • Operates directly on point clouds (no voxelization)
  • Lightweight: 1-5M parameters

Cons:

  • Limited receptive field (local neighborhoods only)
  • Poor long-range prediction (>2 seconds)
  • No scene context (no map, no ego action)
  • Fixed point count assumptions

7.3 Pillar-Based Prediction (PillarFlow / PillarMotion)

Approach: Convert point clouds to pillar representation (same as PointPillars), then apply 2D temporal prediction in BEV space.

Multi-LiDAR point cloud → Pillar encoder (PointPillars)
→ BEV features [C, H, W]
→ Stack past T BEV features: [T, C, H, W]
→ 3D convolution or ConvLSTM for temporal modeling
→ Predicted future BEV features [K, C, H, W]
→ BEV → occupancy or pillar → point cloud

Advantage: Leverages existing PointPillars deployment (6.84ms on Orin with TensorRT). The temporal module adds only 2D operations on top.

Estimated performance on Orin:

  • PointPillars BEV encoding: ~7ms per frame
  • Past 5 frames encoding: ~35ms (or amortized ~7ms if pipelined)
  • Temporal ConvLSTM: ~5-10ms
  • Future 3-step prediction: ~15ms
  • Total: ~30-50ms for 3-step BEV occupancy prediction

This is the fastest path to LiDAR-based occupancy forecasting on Orin -- it reuses the existing PointPillars deployment and adds minimal overhead.

7.4 SPFNet: Scene Flow + Prediction

Approach: Decompose the prediction problem into (1) estimate scene flow (per-point 3D motion vectors) between consecutive frames, and (2) extrapolate flow forward to predict future positions.

P_t, P_{t-1} → Scene flow estimation F_t (per-point velocity)
Static points (|F| < threshold) → persist unchanged
Dynamic points → extrapolate position: P_{t+k} = P_t + k * F_t
Collision handling: merge overlapping predictions

Advantage: Physically interpretable -- each point's future position is predicted based on its observed velocity. Simple constant-velocity assumption works well for 1-2 second horizons.
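A sketch of the constant-velocity extrapolation step under the static/dynamic decomposition above (the flow threshold is an assumed value):

python
import numpy as np

def extrapolate_points(points, flow, horizon_steps, static_thresh=0.1):
    """Predict future point positions by constant-velocity flow extrapolation.

    Args:
        points: [N, 3] current point positions
        flow: [N, 3] estimated per-point displacement over one frame (meters/frame)
        horizon_steps: number of future frames to predict
        static_thresh: flow magnitude (m/frame) below which a point is treated as static
    Returns:
        list of [N, 3] predicted point clouds for t+1 ... t+horizon_steps
    """
    speed = np.linalg.norm(flow, axis=1)
    dynamic = speed >= static_thresh
    futures = []
    for k in range(1, horizon_steps + 1):
        pred = points.copy()
        pred[dynamic] += k * flow[dynamic]      # dynamic points move with their flow
        futures.append(pred)                    # static points persist unchanged
    return futures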

Limitation: Cannot predict behavior changes (a vehicle that is moving straight but will turn). This is where learned world models outperform flow extrapolation.

7.5 Range Image Prediction

Approach: Convert LiDAR point clouds to range images (H=beams, W=azimuth, value=range), then apply standard 2D video prediction architectures.

LiDAR scan → Range image R_t [H x W] (H=32 beams, W=2048 azimuths)
Stack past T range images: R_{t-T}, ..., R_t
→ 2D video prediction model (ConvLSTM, PredRNN, SimVP)
→ Predicted future range images R_{t+1}, ..., R_{t+K}
→ Convert range images back to 3D point clouds

Advantage: Reduces 3D prediction to 2D video prediction -- a much more mature field with well-optimized architectures and TensorRT support. Range images preserve full 3D information (losslessly recoverable given beam angles).

LidarDM uses this representation for its generative model. SimVP (Simple Video Prediction, NeurIPS 2022) achieves competitive prediction quality with only 2D convolutions, making it TensorRT-friendly.
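A sketch of the spherical projection into a range image (the inverse of the conversion described for LidarDM in Section 4). The beam count and azimuth resolution follow the values quoted above; the nearest-beam elevation binning is an assumption.

python
import numpy as np

def points_to_range_image(points, beam_elevations, width=2048, max_range=150.0):
    """Project a point cloud into a range image [H, W] (H = beams, W = azimuth bins).

    Args:
        points: [N, 3] (x, y, z) in the sensor frame
        beam_elevations: [H] per-beam elevation angles in radians
        width: number of azimuth bins per revolution
        max_range: ranges beyond this are dropped
    Returns:
        range_img: [H, W] range in meters, 0 where no return fell in a bin
    """
    r = np.linalg.norm(points, axis=1)
    keep = (r > 0) & (r < max_range)
    p, r = points[keep], r[keep]
    azimuth = np.arctan2(p[:, 1], p[:, 0])                         # [-pi, pi)
    elevation = np.arcsin(p[:, 2] / r)
    col = ((azimuth + np.pi) / (2 * np.pi) * width).astype(int) % width
    row = np.abs(elevation[:, None] - beam_elevations[None, :]).argmin(axis=1)
    range_img = np.zeros((len(beam_elevations), width), dtype=np.float32)
    range_img[row, col] = r                                        # last write wins per bin
    return range_img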

7.6 Comparison of Point Cloud Prediction Methods

| Method | Representation | Parameters | Inference (A100) | Inference (Orin est.) | Accuracy (CD@1s) | Scene Context | Action Cond. |
|---|---|---|---|---|---|---|---|
| PointRNN | Raw points | ~2M | ~20ms | ~60ms | 1.2-1.5 | None | No |
| PointLSTM | Raw points | ~3M | ~25ms | ~75ms | 1.0-1.3 | None | No |
| Pillar-based | BEV pillars | ~5M | ~15ms | ~45ms | 0.8-1.0 | BEV map | Possible |
| SPFNet (flow) | Raw points | ~4M | ~30ms | ~90ms | 0.9-1.1 | None | No |
| Range image | Range image | ~8M | ~20ms | ~60ms | 0.7-0.9 | None | Possible |
| Copilot4D | BEV tokens | ~50M | ~100ms | ~300ms | 0.36 | Temporal | No |
| UnO | Implicit field | ~30M | ~60ms | ~180ms | ~0.4 | Map + temporal | No |

Observations:

  • Simple methods (PointRNN, flow) are fast but inaccurate beyond 1 second
  • Copilot4D and UnO are dramatically more accurate but ~5-10x slower
  • Pillar-based and range image methods offer the best accuracy/speed tradeoff for Orin
  • For safety applications (is this space safe in 3 seconds?), the binary occupancy accuracy matters more than exact point positions -- even simple methods can answer this

8. Multi-Sensor Fusion World Models

8.1 MUVO: Camera + LiDAR Joint Future Prediction

Paper: "Multimodal World Model with Geometric Voxel Representations" Venue: IV 2025 arXiv: 2311.11762

MUVO is the first world model that jointly predicts future camera images AND LiDAR point clouds from multi-modal input. Rather than predicting raw sensor data, it outputs 3D occupancy predictions as the shared representation.

Architecture:

Input: Camera images + LiDAR point clouds (current + past)

Camera branch:
  Multi-view images → Image backbone → BEV features (via LSS or BEVFormer)

LiDAR branch:
  Point cloud → Voxelization → Sparse 3D CNN → BEV features

Fusion:
  Camera BEV + LiDAR BEV → Concatenation or attention fusion → Fused BEV

Temporal prediction:
  Fused BEV sequence → Temporal transformer → Future fused BEV

Decoding:
  Future BEV → 3D occupancy prediction
  Future BEV → Camera image decoder (for camera-based validation)
  Future BEV → Point cloud renderer (for LiDAR-based validation)

8.2 Waymo World Model

Waymo's internal world model (referenced in the cutting-edge-2026.md document) generates both camera AND LiDAR data, built on top of DeepMind's Genie 3. Key details:

  • Multi-modal consistency: predicted camera views and LiDAR scans are geometrically consistent
  • Used for closed-loop simulation testing
  • Not publicly available (Waymo internal)

8.3 Challenge: Aligning Camera and LiDAR Feature Spaces

The fundamental challenge in multi-modal world models is aligning two fundamentally different representations:

| Dimension | Camera Features | LiDAR Features |
|---|---|---|
| Spatial structure | Dense 2D grid | Sparse 3D points |
| Information content | Texture, color, semantics | Geometry, range, reflectance |
| Distance encoding | Implicit (depth ambiguity) | Explicit (range measurement) |
| Density | Uniform (every pixel has a value) | Non-uniform (density falls with range) |
| Failure modes | Lighting, weather, occlusion | Range limits, reflectance, rain |

Alignment strategies:

  1. Late fusion in BEV space: Each modality produces BEV features independently, then concatenate. Simple but may miss cross-modal correlations (see the sketch after this list).
  2. Cross-attention fusion: Attend camera features to LiDAR features and vice versa. More expensive but captures cross-modal dependencies.
  3. Shared latent space: Project both modalities into a shared latent representation. Requires careful training to avoid one modality dominating.
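A sketch of the first (late fusion) strategy, assuming both encoders already produce BEV maps of the same spatial size in the same metric frame:

python
import torch

class LateBEVFusion(torch.nn.Module):
    """Concatenate camera and LiDAR BEV features, then mix with a 1x1 convolution."""

    def __init__(self, cam_channels=256, lidar_channels=256, out_channels=256):
        super().__init__()
        self.mix = torch.nn.Sequential(
            torch.nn.Conv2d(cam_channels + lidar_channels, out_channels, 1),
            torch.nn.BatchNorm2d(out_channels),
            torch.nn.ReLU())

    def forward(self, cam_bev, lidar_bev):
        # cam_bev, lidar_bev: [B, C, H, W] in the same metric BEV frame
        return self.mix(torch.cat([cam_bev, lidar_bev], dim=1))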

8.4 Airside Relevance: LiDAR Primary + Camera Secondary

For the reference airside AV stack evolution:

Phase 1 (Current): LiDAR-only world model. Use methods from Sections 2-7 with the existing 4-8 RoboSense LiDAR.

Phase 2 (When cameras are added): LiDAR-primary fusion world model. LiDAR provides the geometric backbone; cameras add texture, color, and semantic information. The world model prediction is still in 3D occupancy/point cloud space (the planner's native representation), but training benefits from camera supervision.

Phase 3 (Thor era): Full multi-modal world model with both LiDAR and camera prediction, enabling sensor redundancy verification. If the LiDAR prediction and camera prediction disagree, flag as anomaly.


9. AD-L-JEPA: Embedding-Space LiDAR Prediction

9.1 Overview

Paper: "AD-L-JEPA" (AAAI 2026) arXiv: 2501.04969GitHub: haoranzhuexplorer/ad-l-jepa-release

AD-L-JEPA is the first JEPA (Joint Embedding Predictive Architecture) for LiDAR-based driving perception. Rather than predicting future point clouds or occupancy (pixel/voxel space prediction), it predicts future embeddings -- compressed representations in a learned feature space.

9.2 Key Innovation: Predicting Embeddings, Not Pixels

The JEPA philosophy (LeCun, 2022) argues that predicting in pixel/voxel space is wasteful because most of the prediction capacity is spent on irrelevant details (exact surface texture, shadow patterns, background clutter). Predicting in embedding space focuses the model on semantically meaningful structure.

MAE-style (Occupancy-MAE, GD-MAE):
  Mask voxels → Predict exact voxel contents → Reconstruct input
  Problem: spends capacity on exact geometry of irrelevant regions

JEPA-style (AD-L-JEPA):
  Mask BEV regions → Predict EMBEDDINGS of masked regions
  → Match embedding of predicted vs. actual
  Advantage: learns semantic structure, not pixel-level reconstruction

9.3 Architecture and Results

  • Predicts Bird's-Eye-View embeddings rather than generating masked regions
  • Neither generative nor contrastive -- uses explicit variance regularization to avoid representation collapse
  • 1.9-2.7x reduction in GPU hours vs. Occupancy-MAE for comparable accuracy
  • 2.8-4x reduction in GPU memory during pre-training
  • Consistent improvements on 3D detection across KITTI3D, Waymo, and ONCE datasets

9.4 From Pre-training to World Modeling

AD-L-JEPA is currently a pre-training method (learning representations from LiDAR data for downstream detection). However, the JEPA framework naturally extends to temporal prediction:

Current AD-L-JEPA (pre-training):
  Mask spatial regions of single LiDAR frame
  Predict embeddings of masked regions from unmasked regions
  → Learns spatial understanding

Future AD-L-JEPA (world model):
  Given past LiDAR embeddings E_{t-T}, ..., E_t
  Predict future embeddings E_{t+1}, ..., E_{t+K}
  → Learns temporal dynamics in embedding space

Planning with JEPA world model:
  Compute goal embedding E_goal (desired future state)
  For each candidate action a:
    Predict E_{future}(a) = world_model(E_past, a)
    Score = similarity(E_{future}(a), E_goal)
  Select action that produces embedding closest to goal

This is analogous to V-JEPA 2-AC (Meta), which achieves 15x speedup over video-generation baselines for robotic control by planning in embedding space rather than pixel space. Extending this to LiDAR-based driving is an open research direction.
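A sketch of that embedding-space planning loop, in the spirit of V-JEPA 2-AC; the world model, goal embedding, and action interfaces are assumptions, since the temporal extension of AD-L-JEPA is not yet published.

python
import torch

@torch.no_grad()
def select_action_in_embedding_space(world_model, past_embeddings, goal_embedding,
                                     candidate_actions):
    """Score candidate actions by how close their predicted future embedding is to a goal.

    Args:
        world_model: callable (past_embeddings, action) → predicted future embedding [D]
        past_embeddings: [T, D] embeddings of past LiDAR frames
        goal_embedding: [D] embedding of the desired future state
        candidate_actions: list of action tensors (e.g., trajectory offsets)
    Returns:
        the action whose predicted future embedding is most similar to the goal
    """
    best_action, best_score = None, -float("inf")
    for action in candidate_actions:
        pred = world_model(past_embeddings, action)               # [D]
        score = torch.nn.functional.cosine_similarity(
            pred.unsqueeze(0), goal_embedding.unsqueeze(0)).item()
        if score > best_score:
            best_action, best_score = action, score
    return best_action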

9.5 Airside Relevance

AD-L-JEPA is the most compute-efficient path to LiDAR representation learning -- critical for training on limited airside data with limited compute budget. The 2-4x efficiency gains over MAE-based methods mean:

  • Pre-training on Waymo + nuScenes in ~12 hours on 4x A100 (vs. ~30 hours for Occupancy-MAE)
  • Fine-tuning on airside data in ~2 hours (vs. ~6 hours)
  • Smaller memory footprint enables larger batch sizes on available hardware

Current status: Open-source code available. Pre-training works today. Temporal extension (world model) is not yet published but architecturally straightforward.


10. Training LiDAR World Models on Airside Data

10.1 The Data Gap

No public airside point cloud sequences exist. This is simultaneously the biggest challenge and the biggest opportunity -- whoever creates the first airside LiDAR benchmark owns the evaluation standard for the field.

Available data sources:

| Source | Type | Size | Airside-Specific? | Cost |
|---|---|---|---|---|
| nuScenes | LiDAR + camera | ~300GB (trainval) | No (urban road) | Free |
| Waymo Open | LiDAR + camera | ~1.5TB | No (urban road) | Free |
| Argoverse 2 | LiDAR + camera | ~1TB | No (urban road) | Free |
| KITTI | LiDAR + camera | ~100GB | No (suburban road) | Free |
| WOMD-LiDAR | LiDAR range images | 574h | No (road, motion) | Free |
| Reference airside fleet | LiDAR (4-8 sensors) | Growing | Yes | Collection cost |
| LidarDM synthetic | LiDAR (generated) | Unlimited | Configurable | Compute cost |
| LiDARCrafter synthetic | LiDAR (generated) | Unlimited | Configurable | Compute cost |

10.2 Pre-training Strategy: Road Data First, Airside Fine-tuning

The recommended approach follows the transfer learning paradigm validated across the perception research:

Phase 1: Self-supervised pre-training on road data
  Dataset: Waymo Open + nuScenes + Argoverse 2 (combined)
  Method: AD-L-JEPA or GD-MAE (self-supervised, no labels needed)
  Objective: Learn general 3D point cloud structure and dynamics
  Compute: ~24-48 hours on 4x A100
  Output: Pre-trained LiDAR encoder weights

Phase 2: Self-supervised temporal pre-training on airside data
  Dataset: 10-50 hours of reference airside AV stack airside LiDAR recordings
  Method: UnO-style self-supervised occupancy prediction
  Objective: Learn airside-specific dynamics (pushback patterns,
             GSE trajectories, personnel movement)
  Compute: ~12-24 hours on 4x A100
  Output: Pre-trained LiDAR world model

Phase 3: Supervised fine-tuning (optional, for semantic prediction)
  Dataset: 500-1000 labeled airside LiDAR frames
  Method: PointLoRA (parameter-efficient fine-tuning)
  Objective: Add semantic class prediction to occupancy forecasting
  Compute: ~4-8 hours on 4x A100
  Output: Semantic 4D occupancy world model

10.3 Data Collection Requirements

For Phase 2 (self-supervised temporal pre-training):

| Requirement | Minimum | Recommended | Notes |
|---|---|---|---|
| Recording hours | 10h | 50h | More data = better dynamics learning |
| Airports covered | 1 | 3+ | Domain diversity improves generalization |
| Turnaround sequences | 50 | 200+ | Critical for learning turnaround dynamics |
| Pushback sequences | 20 | 100+ | Aircraft trajectory prediction |
| Weather conditions | Dry | Dry + rain + fog | Robustness |
| Time of day | Day | Day + night + twilight | Lighting variation |
| LiDAR frame rate | 10 Hz | 10 Hz | Standard for RoboSense |
| Storage per hour | ~20 GB | ~20 GB | 4-8 LiDAR at 10 Hz |

Total storage for 50 hours: ~1 TB raw, ~300 GB compressed. Per the fleet data pipeline document, this fits comfortably in the hot storage tier (NVMe, 30-day retention).

10.4 Self-Supervised Pre-training Methods for LiDAR

| Method | Pretext Task | GPU Hours | Label Savings | Best For |
|---|---|---|---|---|
| AD-L-JEPA | Embedding prediction | Lowest (1.9x faster) | 50-80% | Fast pre-training, limited compute |
| GD-MAE | Masked voxel reconstruction | Moderate | 80% | Detection backbone initialization |
| Occupancy-MAE | Masked occupancy reconstruction | Moderate | 50% | Occupancy prediction |
| TREND | Temporal forecasting | Moderate | Significant | Learning dynamics from sequences |
| UnO | Self-supervised occupancy + flow | Higher | N/A (full SSL) | World model pre-training |
| GPC | LiDAR colorization (needs camera) | Moderate | 80-95% | When camera + LiDAR available |

Recommendation: Start with AD-L-JEPA for perception backbone pre-training (cheapest, open-source). Progress to UnO-style self-supervised training for the world model itself.

10.5 Evaluation Metrics

| Metric | What It Measures | Range | Used By |
|---|---|---|---|
| Chamfer Distance (CD) | Average nearest-neighbor distance between predicted and actual point clouds | Lower = better (0.3-1.5 typical) | Copilot4D, PointRNN |
| Occupancy IoU | Intersection over union of predicted vs. actual occupied voxels | 0-100% (30-50% typical) | OccWorld, Drive-OccWorld |
| mIoU (semantic) | Mean IoU across semantic classes | 0-100% (30-60% typical) | Semantic occupancy methods |
| Ray-Drop Accuracy | Correctly predicting which LiDAR beams return vs. miss | 0-100% (85-95% typical) | LidarDM, range image methods |
| Flow EPE | End-point error of predicted flow vectors | Lower = better (0.1-0.5m typical) | UnO, SPFNet |
| L2 Planning Error | Downstream planning accuracy when using predicted world | Lower = better (0.5-2.0m typical) | Drive-OccWorld |
| Collision Rate | Fraction of planned trajectories that collide in predicted future | Lower = better (0.1-1.0% typical) | Drive-OccWorld, DIO |
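For reference, a symmetric Chamfer distance between a predicted and an actual point cloud can be computed as below. This is one common convention (sum of mean nearest-neighbor distances in both directions); some papers use squared distances or average the two terms.

python
import numpy as np
from scipy.spatial import cKDTree

def chamfer_distance(pred_points, gt_points):
    """Symmetric Chamfer distance between point clouds [N, 3] and [M, 3]."""
    d_pred_to_gt, _ = cKDTree(gt_points).query(pred_points)   # nearest GT point per prediction
    d_gt_to_pred, _ = cKDTree(pred_points).query(gt_points)   # nearest prediction per GT point
    return d_pred_to_gt.mean() + d_gt_to_pred.mean()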

10.6 Augmentation via LiDAR Simulation

Supplement real airside data with synthetic data:

1. AMDB map → Lanelet2 conversion → LidarDM input
   → Generate 1000+ synthetic airside LiDAR sequences
   → Cover scenarios too rare to collect naturally:
     - Near-miss events
     - Multi-vehicle intersection conflicts
     - FOD in various positions
     - Extreme weather conditions

2. LiDARCrafter language-guided generation
   → "Aircraft pushback colliding with baggage tractor"
   → Generate safety-critical scenarios for world model training
   → Evaluate world model's ability to predict dangerous futures

3. Domain randomization on synthetic data
   → Vary LiDAR beam pattern (match RoboSense RSHELIOS/RSBP specs)
   → Vary object sizes (different aircraft types, GSE models)
   → Vary surface reflectance (wet apron, de-icing fluid)

Per the sim-to-real transfer research: combined synthetic + 500 real scans achieves within 3-5% of fully real-data-trained models. This makes the data gap surmountable.


11. Deployment on NVIDIA Orin

11.1 Compute Budget

From occupancy-deployment-orin.md, the Orin AGX 64GB has:

| Resource | Total | Available for World Model |
|---|---|---|
| GPU | 2048 CUDA cores, 64 Tensor Cores | ~30% (shared with perception, planning) |
| Memory | 64 GB unified | 4-8 GB for world model |
| INT8 TOPS | 275 | ~80 TOPS |
| Power | 60W max | ~15W |

11.2 Inference Time Budget

At 30 km/h (maximum airside speed), 10 Hz control rate, and 3-second prediction horizon:

| Component | Budget | Notes |
|---|---|---|
| LiDAR preprocessing | 5ms | Point cloud merging, ego-motion compensation |
| Voxelization | 3ms | Hard voxelization on GPU |
| BEV encoding | 7-15ms | PointPillars (7ms) or CenterPoint (15ms) |
| Temporal fusion | 5-10ms | ConvGRU on BEV features |
| Future prediction (3 steps) | 15-30ms | Autoregressive or parallel |
| Occupancy decoding | 5ms | Channel-to-height (FlashOcc-style) |
| Total | 40-70ms | Within 100ms budget |

11.3 Model Architecture for Orin

The recommended architecture for deployable LiDAR world model on Orin:

┌─────────────────────────────────────────────────────────────┐
│  Orin-Deployable LiDAR World Model                          │
│                                                              │
│  Input: Merged multi-LiDAR point cloud (10 Hz)              │
│                                                              │
│  Stage 1: PointPillars BEV Encoding (TensorRT INT8)         │
│    Point cloud → Pillar features → BEV [C=64, H=200, W=200]│
│    Latency: ~7ms | Memory: ~0.5 GB                          │
│                                                              │
│  Stage 2: Temporal ConvGRU (TensorRT FP16)                  │
│    Past 5 BEV frames → Hidden state [C=128, H=200, W=200]  │
│    Latency: ~8ms | Memory: ~0.5 GB                          │
│                                                              │
│  Stage 3: Future BEV Prediction (TensorRT FP16)             │
│    Hidden → 3 future BEV frames (autoregressive)            │
│    Latency: ~15ms | Memory: ~1.0 GB                         │
│                                                              │
│  Stage 4: BEV → 3D Occupancy (TensorRT INT8)               │
│    FlashOcc-style channel-to-height expansion               │
│    Latency: ~5ms | Memory: ~0.5 GB                          │
│                                                              │
│  Output: Future 3D occupancy grids at t+0.3, t+0.6, t+0.9s │
│    Resolution: 0.4m voxels, 80m x 80m x 6m range           │
│    Classes: occupied / free / unknown                        │
│                                                              │
│  Total: ~35ms | ~2.5 GB VRAM                                │
└─────────────────────────────────────────────────────────────┘

11.4 TensorRT Optimization

Key optimizations for Orin deployment:

python
# TensorRT optimization recipe for LiDAR world model

# 1. Export BEV encoder to ONNX
import torch.onnx
torch.onnx.export(
    bev_encoder,
    dummy_input,
    "bev_encoder.onnx",
    opset_version=17,
    dynamic_axes={"points": {0: "num_points"}}
)

# 2. Build TensorRT engine with INT8 calibration
# Use pillar-based encoding (fully TensorRT-compatible)
# trtexec --onnx=bev_encoder.onnx \
#         --saveEngine=bev_encoder_int8.engine \
#         --int8 \
#         --calib=calibration_data.bin \
#         --workspace=4096

# 3. ConvGRU: Use FP16 (recurrent ops less suitable for INT8)
# trtexec --onnx=temporal_gru.onnx \
#         --saveEngine=temporal_gru_fp16.engine \
#         --fp16 \
#         --workspace=2048

# 4. Future prediction heads: FP16
# trtexec --onnx=future_heads.onnx \
#         --saveEngine=future_heads_fp16.engine \
#         --fp16

# 5. Occupancy decoder: INT8 (simple conv ops)
# trtexec --onnx=occ_decoder.onnx \
#         --saveEngine=occ_decoder_int8.engine \
#         --int8 \
#         --calib=calibration_data.bin

# 6. Multi-model orchestration with CUDA streams
# Run BEV encoding and temporal fusion on separate CUDA streams
# for overlap with other perception tasks (detection, segmentation)
```

11.5 Use Cases on the Reference Airside AV Stack

1. Safety verification (check planned trajectory against predicted future):

```python
def verify_trajectory_safety(planned_trajectory, predicted_occupancy,
                             clearance=1.0, voxel_size=0.4):
    """
    Check if the planned ego trajectory is safe against predicted future occupancy.

    Args:
        planned_trajectory: [(x, y, theta, t)] ego poses over time
        predicted_occupancy: [K, X, Y, Z] future occupancy grids
        clearance: minimum required clearance in meters
        voxel_size: occupancy grid resolution in meters

    Returns:
        is_safe: bool
        first_conflict_time: float or None
        conflict_voxels: occupied voxels within clearance
    """
    # get_vehicle_footprint / transform_footprint / dilate_voxels are assumed
    # occupancy-grid utilities (footprint rasterization, SE(2) transform, dilation)
    # that return (N, 3) integer voxel index arrays.
    ego_footprint = get_vehicle_footprint()  # third-generation tug dimensions + clearance buffer

    for k, (x, y, theta, t) in enumerate(planned_trajectory):
        if k >= len(predicted_occupancy):
            break

        # Transform ego footprint to the world frame at this pose
        footprint_voxels = transform_footprint(ego_footprint, x, y, theta)

        # Inflate by the required clearance, converted to voxels
        inflated_voxels = dilate_voxels(footprint_voxels, int(round(clearance / voxel_size)))

        # Check for collision against the predicted grid at step k
        occ_grid = predicted_occupancy[k]
        occ_values = occ_grid[inflated_voxels[:, 0], inflated_voxels[:, 1], inflated_voxels[:, 2]]
        conflicts = occ_values > 0.5  # occupied threshold

        if conflicts.any():
            return False, t, inflated_voxels[conflicts]

    return True, None, None
```

2. Gap detection (find safe windows in dynamic environment):

```python
def find_safe_crossing_window(crossing_point, predicted_occupancy,
                              ego_crossing_time=2.0, dt=0.3):
    """
    Find temporal windows where a crossing point is predicted to be clear.

    Args:
        crossing_point: (x, y) point where ego path crosses another agent's path
        predicted_occupancy: [K, X, Y, Z] future occupancy grids
        ego_crossing_time: time needed to traverse the crossing zone (s)
        dt: time between predicted occupancy frames (s)

    Returns:
        windows: list of (start_time, end_time) safe crossing windows
    """
    # get_voxels_around is an assumed helper returning an (N, 3) voxel index array
    crossing_voxels = get_voxels_around(crossing_point, radius=3.0)
    vx, vy, vz = crossing_voxels[:, 0], crossing_voxels[:, 1], crossing_voxels[:, 2]

    # Check occupancy at the crossing point for each future timestep
    clear_at_t = []
    for k in range(len(predicted_occupancy)):
        occupied = predicted_occupancy[k][vx, vy, vz].max() > 0.5
        clear_at_t.append(not occupied)

    # Find windows of consecutive clear timesteps
    windows = []
    start = None
    for k, is_clear in enumerate(clear_at_t):
        if is_clear and start is None:
            start = k * dt  # timestep index to seconds
        elif not is_clear and start is not None:
            end = k * dt
            if (end - start) >= ego_crossing_time:
                windows.append((start, end))
            start = None

    # Close a window still open at the end of the prediction horizon
    if start is not None:
        end = len(clear_at_t) * dt
        if (end - start) >= ego_crossing_time:
            windows.append((start, end))

    return windows
```

3. Behavior prediction (will this GSE cross my path?):

```python
def predict_agent_crossing(agent_voxels_current, predicted_occupancy,
                           ego_path_voxels, dt=0.3):
    """
    Predict if a detected agent will cross the ego's planned path.

    Args:
        agent_voxels_current: current voxels occupied by the agent (used to seed
            per-agent association; this simplified version checks scene-level
            occupancy only)
        predicted_occupancy: [K, X, Y, Z] future occupancy grids
        ego_path_voxels: (N, 3) voxel indices along ego's planned route
        dt: time between predicted occupancy frames (s)

    Returns:
        will_cross: bool
        crossing_time: estimated time of crossing (s)
        crossing_confidence: probability
    """
    px, py, pz = ego_path_voxels[:, 0], ego_path_voxels[:, 1], ego_path_voxels[:, 2]

    for k in range(len(predicted_occupancy)):
        future_occ = predicted_occupancy[k]

        # Check overlap between predicted occupancy and the ego path
        path_occ = future_occ[px, py, pz]
        overlap = path_occ > 0.3  # lower threshold for prediction

        if overlap.any():
            confidence = float(path_occ[overlap].mean())
            return True, k * dt, confidence

    return False, None, 0.0
```

11.6 Integration with Simplex Architecture

The design spec describes a Simplex architecture where a high-performance new stack (AC) operates alongside a verified fallback stack (BC). The LiDAR world model integrates as a confidence signal for the AC:

┌────────────────────────────────────────────────────┐
│  Assured Controller (AC) — New Stack                │
│                                                      │
│  LiDAR World Model → Predicted future occupancy     │
│  Neural Planner → Candidate trajectory              │
│  Safety Check: verify_trajectory_safety()            │
│  If safe → publish trajectory to actuators           │
│  If unsafe → flag to Decision Module                 │
└──────────────────┬─────────────────────────────────┘


┌────────────────────────────────────────────────────┐
│  Decision Module                                     │
│                                                      │
│  AC confidence = f(world_model_certainty,            │
│                    prediction_horizon,                │
│                    OOD_score)                         │
│  If AC_confidence > threshold → use AC trajectory    │
│  If AC_confidence < threshold → switch to BC          │
└──────────────────┬─────────────────────────────────┘


┌────────────────────────────────────────────────────┐
│  Baseline Controller (BC) — Current Stack            │
│                                                      │
│  RANSAC perception → known obstacles only            │
│  Frenet planner → conservative trajectory            │
│  Always available, always verified                    │
│  No world model dependency                           │
└────────────────────────────────────────────────────┘

The world model's prediction uncertainty serves as a natural confidence metric. If the predicted future occupancy has high entropy (uncertain about whether a space is occupied), the AC should reduce speed or defer to the BC. This maps directly to the Simplex safety architecture.
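A minimal sketch of that confidence signal, assuming the world model outputs per-voxel occupancy probabilities and an OOD score is available from the encoder; the weights, normalization, and threshold are illustrative placeholders that would need tuning against recorded disengagement data:

```python
import numpy as np

def ac_confidence(predicted_occupancy, ood_score, horizon_s,
                  w_entropy=0.5, w_ood=0.3, w_horizon=0.2):
    """Map world-model uncertainty to an AC confidence value in [0, 1].

    predicted_occupancy: [K, X, Y, Z] occupancy probabilities in [0, 1].
    ood_score: out-of-distribution score in [0, 1] (1 = far from training data).
    horizon_s: prediction horizon actually covered, in seconds.
    """
    p = np.clip(predicted_occupancy, 1e-6, 1 - 1e-6)
    entropy = -(p * np.log(p) + (1 - p) * np.log(1 - p))   # per-voxel binary entropy
    mean_entropy = float(entropy.mean()) / np.log(2)        # normalize to [0, 1]

    horizon_term = min(horizon_s / 3.0, 1.0)                # 3 s target horizon
    return (w_entropy * (1.0 - mean_entropy)
            + w_ood * (1.0 - ood_score)
            + w_horizon * horizon_term)

# Decision Module: fall back to the Baseline Controller below a threshold
# use_ac = ac_confidence(pred_occ, ood, horizon_s=0.9) > 0.6
```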


12. Applications for Airside Safety

12.1 Pushback Prediction: Aircraft Swept Area 5-10s Ahead

Aircraft pushback is the highest-risk dynamic event on the apron. During pushback, a tow tractor reverses the aircraft from the stand onto the taxiway. The aircraft's tail sweeps a wide arc determined by the nose gear steering angle.

World model application:

Observation: Aircraft attached to tow tractor, pushback initiated
             LiDAR tracks aircraft position over past 3-5 seconds

World model input:
  Past occupancy grids showing aircraft and tractor positions
  Optional: inferred nose gear angle from observed curvature

World model output:
  Predicted occupancy at t+1s, t+3s, t+5s, t+10s
  → Aircraft swept area over next 10 seconds
  → Forbidden zone for ego vehicle path planning

Safety action:
  If ego planned path intersects predicted swept area within horizon
  → Stop and wait
  → Or reroute around predicted swept area

Why this is hard without a world model: Classical approaches estimate aircraft trajectory from kinematic models (bicycle model with known wheelbase). But the pushback tractor operator may vary speed, change steering angle, or stop/restart -- behaviors that a kinematic model cannot predict but a learned world model (trained on pushback sequences) can anticipate.
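A minimal sketch of turning predicted occupancy into a pushback forbidden zone, assuming scipy is available and that a tracker supplies `aircraft_mask`, the voxels currently associated with the aircraft/tractor cluster (a hypothetical input); thresholds and dilation radii are illustrative:

```python
import numpy as np
from scipy import ndimage

def pushback_swept_area(predicted_occupancy, aircraft_mask,
                        occ_threshold=0.3, clearance_voxels=8):
    """Union of predicted aircraft occupancy over the horizon, dilated by clearance.

    predicted_occupancy: [K, X, Y, Z] occupancy probabilities.
    aircraft_mask: [X, Y, Z] bool mask of voxels currently on the aircraft/tractor.
    clearance_voxels: dilation radius (e.g. 8 voxels * 0.4 m = 3.2 m buffer).
    """
    swept = np.zeros(predicted_occupancy.shape[1:], dtype=bool)
    region = aircraft_mask.copy()
    for k in range(predicted_occupancy.shape[0]):
        occ_k = predicted_occupancy[k] > occ_threshold
        # Grow the tracked region into adjacent predicted occupancy so the swept
        # area follows the aircraft rather than every object in the scene
        # (dilation radius is illustrative and must exceed per-step motion).
        region = ndimage.binary_dilation(region, iterations=5) & occ_k
        swept |= region
    # Inflate by the required clearance to obtain the forbidden zone
    return ndimage.binary_dilation(swept, iterations=clearance_voxels)
```

The resulting mask can be handed to the planner as a keep-out region, or checked against the planned path with verify_trajectory_safety().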

12.2 GSE Trajectory Prediction

Baggage tractors, belt loaders, catering trucks, and fuel bowsers follow semi-structured but variable patterns during turnaround. A world model trained on turnaround sequences learns these patterns.

Key scenarios:

ScenarioPrediction NeedTime HorizonAccuracy Required
Baggage train crossing apronWill it cross my path?3-5s2m positional
Belt loader approaching aircraftWill it block my route?5-10s3m positional
Fuel bowser departingWhen will it clear my path?5-15s5m positional
Catering truck maneuveringIs it reversing toward me?1-3s1m positional

Advantage over per-agent tracking: Classical tracking + prediction requires detecting each GSE, classifying its type, estimating its velocity, and running a per-agent motion model. The world model predicts the entire scene's future occupancy in a single forward pass -- inherently handling multi-agent interactions (e.g., one GSE yields to another).

12.3 Personnel Prediction

Ground crew prediction is the most safety-critical application. An estimated 27,000 ramp accidents occur annually, and ground crew movements are unpredictable -- people dart between vehicles, crouch under aircraft, and walk behind equipment.

World model capabilities:

  • Intent prediction: A person walking toward the ego vehicle's route will be predicted as occupying that route in future frames -- triggering early deceleration even before the person enters the path.
  • Crouching detection: Occupancy world models detect occupied voxels at any height. A crouching person (partially occluded behind a wheel chock) occupies voxels at 0.3-0.8m height. As long as the LiDAR observes sufficient returns, the world model tracks and predicts this occupancy.
  • Disocclusion prediction: If a person is fully occluded behind a vehicle, classical detection fails. Occupancy flow methods (UnO, ImplicitO) can predict that a moving agent will emerge from behind occlusion based on observed flow vectors -- predicting currently invisible agents.

12.4 FOD Persistence Modeling

Foreign Object Debris (FOD) on the apron ranges from small tools to luggage to vehicle parts. FOD detection requires distinguishing truly stationary debris (which must be avoided or reported) from transiently occupied voxels (sensor noise, raindrops).

World model approach:

FOD classification via temporal occupancy:

Frame t:   Voxel (x, y, z) occupied, flow = 0
Frame t+1: Voxel (x, y, z) occupied, flow = 0
Frame t+2: Voxel (x, y, z) occupied, flow = 0
→ Persistent static occupancy → high confidence FOD

vs.

Frame t:   Voxel (x, y, z) occupied
Frame t+1: Voxel (x, y, z) empty
Frame t+2: Voxel (x, y, z) empty
→ Transient → likely sensor noise or rain

vs.

Frame t:   Voxel (x, y, z) occupied, flow = (1.2, 0.3, 0) m/s
Frame t+1: Voxel (x+0.12, y+0.03, z) occupied, flow = (1.2, 0.3, 0)
→ Moving object → not FOD, dynamic agent

The world model provides temporal smoothing automatically -- persistent occupancy predictions inherently filter out transient noise.
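A minimal sketch of this persistence rule for a single voxel history, assuming per-frame occupancy probabilities and flow magnitudes are available; the thresholds are illustrative:

```python
import numpy as np

def classify_voxel_history(occupancy_history, flow_history,
                           occ_threshold=0.5, static_flow=0.2,
                           min_persistent_frames=3):
    """Classify a voxel history as 'fod_candidate', 'dynamic_agent', or 'transient'.

    occupancy_history: [T] occupancy probabilities for one voxel, oldest first.
    flow_history: [T] flow magnitudes (m/s) for the same voxel.
    """
    occupied = np.asarray(occupancy_history) > occ_threshold
    moving = np.asarray(flow_history) > static_flow

    if occupied.all() and not moving.any() and len(occupied) >= min_persistent_frames:
        return "fod_candidate"       # persistent, static -> report / avoid
    if occupied.any() and moving[occupied].any():
        return "dynamic_agent"       # occupancy that carries flow -> track, not FOD
    return "transient"               # brief returns -> likely noise or precipitation
```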

12.5 Jet Blast Prediction

Jet blast is the only airside hazard that is geometrically invisible -- there is no physical object to detect. However, jet exhaust interacts with airborne particles (dust, de-icing fluid mist, rain), creating a density change that LiDAR can partially observe.

World model approach to jet blast:

  • LiDAR may observe scattered returns in the jet exhaust plume (from entrained particles)
  • These returns appear as diffuse, low-density occupancy in voxel space
  • A world model trained on sequences near running engines can learn to predict plume expansion
  • The predicted "occupied" region expands behind the engine over time, even if LiDAR returns are sparse

Limitation: Pure LiDAR detection of jet blast is unreliable without particles in the exhaust. Thermal cameras (see 20-av-platform/sensors/thermal-ir-cameras.md) are the primary sensor for jet blast detection. However, a world model that incorporates historical jet blast observations can provide prediction even when current LiDAR returns are sparse.

12.6 Combining Predictions for Integrated Safety

Integrated airside safety prediction pipeline:

LiDAR World Model Output:
  Future occupancy grids: O_{t+1}, O_{t+2}, O_{t+3} (0.3s, 0.6s, 0.9s)
  Future flow fields:     F_{t+1}, F_{t+2}, F_{t+3}

Safety checks (run in parallel):
  1. Trajectory safety:   planned_path ∩ O_{t+k} = ∅ for all k?
  2. Clearance check:     min_dist(planned_path, O_{t+k}) > clearance?
  3. Crossing prediction: any O_{t+k} intersects ego path with |F| > 0?
  4. FOD detection:       persistent static O with no F across all k?
  5. Pushback zone:       large connected O moving with rotational F?

Decision:
  All safe → proceed at planned speed
  Any warning → reduce speed + alert operator
  Any critical → emergency stop + switch to BC
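A minimal sketch of the decision step at the bottom of this pipeline, assuming each parallel check reports a severity level; the check names and action strings are placeholders:

```python
from enum import IntEnum

class Severity(IntEnum):
    SAFE = 0
    WARNING = 1
    CRITICAL = 2

def integrate_safety_checks(check_results):
    """Combine parallel safety checks into a single action.

    check_results: dict mapping check name -> Severity, e.g.
        {"trajectory": Severity.SAFE, "clearance": Severity.WARNING, ...}
    """
    worst = max(check_results.values(), default=Severity.SAFE)
    if worst == Severity.CRITICAL:
        return "emergency_stop_and_switch_to_BC"
    if worst == Severity.WARNING:
        return "reduce_speed_and_alert_operator"
    return "proceed_at_planned_speed"
```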

13. Comparison Table

13.1 LiDAR-Native World Models: Complete Comparison

| Model | Venue | Year | Representation | Self-Supervised | Action Cond. | Open Source | Orin Deployable | Airside Fit |
|---|---|---|---|---|---|---|---|---|
| Copilot4D | ICLR | 2024 | BEV tokens (discrete) | Yes | No | No | Needs optimization | HIGH |
| UnO | CVPR Oral | 2024 | Continuous implicit field | Yes | No | No | Needs optimization | VERY HIGH |
| LidarDM | ICRA | 2025 | Range images (diffusion) | N/A (generative) | N/A | Yes | Offline only | HIGH (data gen) |
| LiDARCrafter | AAAI Oral | 2026 | 4D point cloud | N/A (generative) | N/A (language) | Yes | Offline only | HIGH (data gen) |
| DIO | CVPR | 2025 | Decomposed implicit | Partial | Yes | No | Needs optimization | HIGH |
| Cosmos-LidarGen | NVIDIA | 2025 | Range images (tokenized) | N/A (generative) | N/A | Yes | Offline only | HIGH (data gen) |
| OccWorld (LiDAR-adapted) | ECCV | 2024 | VQ-VAE voxels | No | No | Yes | Feasible | HIGH |
| Drive-OccWorld (LiDAR-adapted) | AAAI | 2025 | VQ-VAE voxels | No | Yes | Yes | Feasible | VERY HIGH |
| AD-L-JEPA | AAAI | 2026 | BEV embeddings | Yes | No (yet) | Yes | Yes (lightweight) | HIGH |

13.2 Performance and Compute Comparison

| Model | Chamfer Dist @1s | Occ IoU @1s | Params | A100 Inference | Orin Inference (est.) | VRAM |
|---|---|---|---|---|---|---|
| Copilot4D | 0.36 | -- | ~50M | ~100 ms | ~300 ms | ~4 GB |
| UnO | ~0.40 | SOTA | ~30M | ~60 ms | ~180 ms | ~3 GB |
| OccWorld (LiDAR) | -- | ~35% | ~25M | ~50 ms | ~150 ms | ~2 GB |
| Drive-OccWorld (LiDAR) | -- | ~37% | ~30M | ~60 ms | ~180 ms | ~2.5 GB |
| Pillar-based prediction | ~0.8 | ~25% | ~5M | ~15 ms | ~45 ms | ~1 GB |
| Range image prediction | ~0.7 | ~28% | ~8M | ~20 ms | ~60 ms | ~1 GB |
| PointPillars + ConvGRU | -- | ~22% | ~7M | ~20 ms | ~50 ms | ~1 GB |

13.3 Suitability Matrix for Airside

| Application | Best Model | Why | Fallback |
|---|---|---|---|
| Pushback swept area | Drive-OccWorld | Action-conditioned, predicts conditional futures | Kinematic model |
| GSE crossing prediction | UnO / OccWorld | Scene-level occupancy + flow | Constant velocity |
| Personnel safety | UnO | Continuous field, fine resolution queries | RANSAC + buffer |
| FOD detection | Any temporal occ | Persistence filtering inherent | Multi-frame voting |
| Jet blast prediction | Specialized model needed | Standard methods lack thermal input | Lookup table |
| Data generation | LidarDM + LiDARCrafter | Synthetic airside data from maps/language | Manual collection |
| Pre-training | AD-L-JEPA | Most compute-efficient SSL for LiDAR | GD-MAE |

14. Recommended Roadmap

Phase 1: Foundation (Months 1-3)

| Step | Action | Cost | Output |
|---|---|---|---|
| 1.1 | Collect 10h unlabeled airside LiDAR sequences | $0 (existing vehicles) | Raw training data |
| 1.2 | Pre-train LiDAR backbone with AD-L-JEPA on Waymo + nuScenes | $500 compute | Pre-trained encoder |
| 1.3 | Implement pillar-based temporal prediction (PointPillars + ConvGRU) | 2 weeks engineering | Baseline world model |
| 1.4 | Deploy baseline on Orin via TensorRT | 1 week engineering | ~50 ms 3-step prediction |
| 1.5 | Evaluate on airside data: does prediction help planning? | 1 week testing | Quantitative evidence |

Phase 2: Capability (Months 3-9)

| Step | Action | Cost | Output |
|---|---|---|---|
| 2.1 | Collect 50h airside LiDAR with diverse scenarios | $0 (fleet operations) | Comprehensive training set |
| 2.2 | Implement OccWorld with LiDAR BEV encoder | 3 weeks engineering | VQ-VAE occupancy world model |
| 2.3 | Self-supervised pre-train on airside sequences | $1,000 compute | Airside-adapted world model |
| 2.4 | Generate synthetic data with LidarDM + AMDB maps | $2,000 compute | 1000+ synthetic scenarios |
| 2.5 | Upgrade to Drive-OccWorld for action conditioning | 2 weeks engineering | Planning-integrated world model |
| 2.6 | Integrate with Simplex AC confidence signal | 2 weeks engineering | Safety-aware planning |

Phase 3: Production (Months 9-18)

| Step | Action | Cost | Output |
|---|---|---|---|
| 3.1 | Implement UnO-style continuous occupancy field | 4 weeks engineering | High-accuracy self-supervised model |
| 3.2 | Add multi-resolution occupancy (0.2m/0.4m/0.8m) | 2 weeks engineering | Optimized memory + accuracy |
| 3.3 | Safety certification evidence gathering | Ongoing | World model safety case |
| 3.4 | Multi-airport transfer validation | 2 weeks per airport | Generalization evidence |
| 3.5 | Thor deployment (when hardware available) | 2 weeks porting | Full-scale world model |

Total Estimated Investment

| Category | Cost |
|---|---|
| Compute (Phases 1-3) | $5,000-10,000 |
| Engineering time | 6-12 person-months |
| Data collection | $0 (uses fleet vehicles) |
| Labeling (Phase 3 only) | $5,000-10,000 (500-1000 frames) |
| Total | $10,000-20,000 + engineering time |

15. References

LiDAR-Native World Models

| Paper | Venue | Year | Link |
|---|---|---|---|
| Copilot4D | ICLR | 2024 | arXiv, Website |
| UnO | CVPR Oral | 2024 | arXiv |
| LidarDM | ICRA | 2025 | arXiv, GitHub |
| LiDARCrafter | AAAI Oral | 2026 | GitHub |
| DIO | CVPR | 2025 | Paper |
| Cosmos-LidarGen | NVIDIA | 2025 | GitHub |

4D Occupancy Forecasting

| Paper | Venue | Year | Link |
|---|---|---|---|
| OccWorld | ECCV | 2024 | arXiv, GitHub |
| Drive-OccWorld | AAAI | 2025 | arXiv, GitHub |
| OccSora | arXiv | 2024 | arXiv |
| Cam4DOcc | CVPR | 2024 | arXiv |
| OccLLaMA | arXiv | 2024 | arXiv |
| ImplicitO | CVPR | 2023 | Paper |

LiDAR Pre-training

| Paper | Venue | Year | Link |
|---|---|---|---|
| AD-L-JEPA | AAAI | 2026 | arXiv, GitHub |
| GD-MAE | CVPR | 2023 | GitHub |
| Occupancy-MAE | IEEE TIV | 2023 | GitHub |
| TREND | NeurIPS | 2025 | GitHub |
| GPC | ICLR | 2024 | GitHub |
| PointLoRA | CVPR | 2025 | Paper |

Point Cloud Prediction

| Paper | Venue | Year | Link |
|---|---|---|---|
| PointRNN | ECCV | 2020 | arXiv |
| ViDAR | CVPR Highlight | 2024 | arXiv |
| Occupancy Flow Fields | Waymo | 2022 | arXiv |
Related Documents

| Document | Relevance |
|---|---|
| 30-autonomy-stack/world-models/overview.md | Camera-centric world models overview |
| 30-autonomy-stack/world-models/occupancy-world-models.md | Detailed occupancy forecasting methods |
| 30-autonomy-stack/world-models/occupancy-deployment-orin.md | Orin deployment specifics |
| 30-autonomy-stack/world-models/occworld-implementation.md | OccWorld hands-on setup guide |
| 30-autonomy-stack/world-models/tokenized-and-jepa.md | VQ-VAE tokenization and JEPA theory |
| 30-autonomy-stack/perception/overview/lidar-foundation-models.md | LiDAR pre-training and backbone models |
| 30-autonomy-stack/world-models/cutting-edge-2026.md | Latest 2025-2026 developments |
| 90-synthesis/decisions/design-spec.md | Simplex architecture and world model integration |
| 90-synthesis/poc-roadmaps/poc-proposals.md | POC cost estimates |
| 30-autonomy-stack/localization-mapping/overview/lidar-slam-algorithms.md | LiDAR SLAM for map building |

Public research notes collected from public sources.