End-to-End Autonomous Driving Architectures & Foundation Models for Airside Autonomous Vehicles
Comprehensive Technical Report
Date: March 2026
Scope: End-to-end driving models, foundation models, occupancy networks, sensor fusion, and deployment considerations for airside autonomous vehicles.
Table of Contents
- End-to-End Autonomous Driving Models
- Foundation Models for Driving
- Occupancy-Based Approaches
- Multi-Modal Sensor Fusion
- Deployment Considerations for Airside
- Synthesis: Applicability to Airside Autonomous Vehicles
- Sources
1. End-to-End Autonomous Driving Models
1.1 UniAD (Unified Autonomous Driving) — CVPR 2023 Best Paper
Authors: OpenDriveLab, Wuhan University, SenseTime Research
UniAD is a planning-oriented framework that unifies perception, prediction, and planning into a single differentiable network. Rather than treating each task as a standalone module, UniAD casts them hierarchically with planning as the ultimate objective.
Architecture:
- Input: Multi-view camera images (no LiDAR)
- BEV Feature Extraction: Camera images are projected into a Bird's Eye View feature space
- TrackFormer: Generates and tracks 3D agent bounding boxes
- MapFormer: Online HD map construction (lane dividers, road boundaries, pedestrian crossings)
- MotionFormer: Predicts future trajectories for each tracked agent
- OccFormer: Voxel-based occupancy prediction for unstructured obstacles
- Planner: Generates ego-vehicle trajectory conditioned on all upstream outputs
Training: Two-stage approach — perception modules (tracking + mapping) trained jointly for 6 epochs, then full end-to-end training across all modules for 20 epochs.
Key Contribution: Demonstrated that jointly optimizing all tasks with a planning objective yields significant improvements over individually optimized modules. Established a new paradigm for end-to-end driving research.
nuScenes Performance: L2 error 0.73m, collision rate 0.61%.
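The hierarchical, planning-oriented flow above can be sketched as a function composition. All module bodies here are toy stand-ins (the real system uses transformer query interactions), but the data dependencies mirror the list above: every upstream output conditions the planner.

```python
# Hypothetical sketch of UniAD's hierarchical data flow, with planning as the
# final objective. Module names mirror the architecture list; bodies are toy
# stand-ins, not the real transformer implementations.

def extract_bev(images):
    return {"features": images}                  # multi-view -> BEV features

def trackformer(bev):
    return [{"id": 0, "x": 5.0, "vx": 1.0}]      # tracked 3D agents

def mapformer(bev):
    return {"lane_center": 0.0}                  # online map elements

def motionformer(agents, road):
    # toy constant-velocity forecast, 3 future steps per agent
    return [[a["x"] + a["vx"] * t for t in (1, 2, 3)] for a in agents]

def occformer(motions):
    return {round(x) for traj in motions for x in traj}   # occupied cells

def planner(road, occupied):
    # follow the lane center unless a forecast occupies it (toy rule)
    goal = road["lane_center"]
    return goal + 1.0 if round(goal) in occupied else goal

def uniad_forward(images):
    bev = extract_bev(images)
    agents = trackformer(bev)
    road = mapformer(bev)
    motions = motionformer(agents, road)
    return planner(road, occformer(motions))
```

The point of the sketch is the signature of `uniad_forward`: intermediate outputs exist only to serve the final planning call, which is the paradigm shift UniAD introduced.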
1.2 VAD / VADv2 (Vectorized Autonomous Driving) — ICCV 2023 / ICLR 2026
Authors: Huazhong University of Science and Technology (HUST)
VAD (v1): Models the driving scene as a fully vectorized representation. Instead of dense rasterized BEV maps, VAD uses vectorized agent motion and map elements as explicit instance-level planning constraints. This eliminates computation-intensive rasterized representations, achieving faster inference than prior end-to-end methods while improving planning safety.
VADv2 (2024): Extends VAD with probabilistic planning to handle uncertainty. Key innovations:
- Input: Multi-view image sequences processed in a streaming manner
- Processing: Sensor data transformed into environmental token embeddings
- Output: Probabilistic distribution over actions (not deterministic waypoints)
- Action space: A single action is sampled from the distribution to control the vehicle
- No rule-based wrappers: Operates in a fully end-to-end fashion
Performance: VADv2 achieves state-of-the-art closed-loop performance on CARLA Town05 benchmark, significantly outperforming all existing methods. VAD v1 achieves L2 error 0.72m, collision rate 0.21% on nuScenes.
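VADv2's probabilistic planning can be illustrated with a minimal sketch: score a discretized action vocabulary, normalize the scores into a distribution, and sample a single action. The quadratic scorer below is a stand-in for the learned network over environment tokens; the vocabulary values are illustrative.

```python
import math
import random

# Toy sketch of VADv2-style probabilistic planning. The scorer is a
# hypothetical stand-in for the learned model over environment tokens.

ACTION_VOCAB = [-0.2, -0.1, 0.0, 0.1, 0.2]      # e.g. steering offsets (rad)

def action_logits(env_tokens):
    # hypothetical: prefer actions near a context-dependent optimum
    return [-(a - env_tokens["optimum"]) ** 2 for a in ACTION_VOCAB]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_action(env_tokens, rng):
    probs = softmax(action_logits(env_tokens))
    return rng.choices(ACTION_VOCAB, weights=probs, k=1)[0]

probs = softmax(action_logits({"optimum": 0.1}))
action = sample_action({"optimum": 0.1}, random.Random(0))
```

Sampling from the full distribution, rather than regressing one deterministic waypoint set, is what lets the model represent multiple plausible futures.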
1.3 SparseDrive — ECCV 2024 (Best Paper Candidate)
Authors: Wenchao Sun et al.
SparseDrive introduces a sparse-centric paradigm for end-to-end driving, replacing dense BEV representations with instance-level sparse features.
Architecture:
- Sparse Perception Module: Symmetric architecture unifying detection, tracking, and online mapping. Agent instances represented by features F_d (N_d x C) and anchor boxes B_d (N_d x 11) containing location, dimension, yaw, and velocity. Map elements represented as anchor polylines (N_m x N_p x 2). Uses 900 detection anchors with 6 decoder layers and 100 map polyline anchors with 20 points each.
- Parallel Motion Planner: Leverages similarity between motion prediction and planning. Produces 6 multi-modal trajectory proposals for both tasks (12 future timesteps for prediction, 6 for planning).
- Hierarchical Planning Selection: (1) Filter by high-level driving command (turn left/right/straight), (2) Apply collision-aware rescore module (sets collision trajectory scores to zero).
- Instance Memory Queue: Temporal modeling via (N_d+1) x 3 memory queue. Three interaction types: agent-temporal cross-attention, agent-agent self-attention, agent-map cross-attention.
- Ego Initialization: Ego features derived from front camera feature map via average pooling (avoids status leakage from ground truth velocity).
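The hierarchical planning selection can be sketched directly from the two steps above: filter proposals by the high-level command, zero the scores of colliding trajectories, then keep the highest-scoring survivor. The data layout below is illustrative, not the paper's exact tensors.

```python
# Toy sketch of SparseDrive's hierarchical planning selection:
# (1) command filter, (2) collision-aware rescore, then argmax.

def select_trajectory(proposals, command):
    candidates = [p for p in proposals if p["command"] == command]   # step (1)
    for p in candidates:                                             # step (2)
        if p["collides"]:
            p["score"] = 0.0                     # colliding proposals zeroed
    return max(candidates, key=lambda p: p["score"])

proposals = [
    {"command": "straight", "score": 0.9, "collides": True,  "traj": [(0, 1)]},
    {"command": "straight", "score": 0.6, "collides": False, "traj": [(0, 2)]},
    {"command": "left",     "score": 0.8, "collides": False, "traj": [(-1, 1)]},
]
best = select_trajectory(proposals, "straight")
```

Note how a high-confidence but colliding proposal loses to a lower-scoring safe one; this is the mechanism behind SparseDrive's low collision rate.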
Performance vs. Baselines (nuScenes):
| Method | L2 Error (m) | Collision Rate (%) |
|---|---|---|
| UniAD | 0.73 | 0.61 |
| VAD | 0.72 | 0.21 |
| SparseDrive-B | 0.58 | 0.06 |
Computational Efficiency:
| Model | Training Time | Inference FPS | GPU Memory |
|---|---|---|---|
| UniAD | 144h (8xA100) | 1.8 FPS | 2451 MB |
| SparseDrive-S | 20h (7.2x faster) | 9.0 FPS | 1294 MB |
| SparseDrive-B | 30h (4.8x faster) | 7.3 FPS | 1437 MB |
SparseDrive represents the current state-of-the-art in efficiency-performance trade-off for end-to-end driving on nuScenes.
1.4 FusionAD — NeurIPS 2023 Workshop
The first unified framework fusing camera and LiDAR beyond perception into prediction and planning.
Architecture:
- Camera Encoder: BEVFormer-based image encoder maps camera images to BEV space
- LiDAR Encoder: Processes LiDAR point clouds into BEV features
- Transformer Fusion Network: Fuses multi-modality information into unified BEV features
- FMSPnP Modules: Fusion-aided Modality-aware prediction and Status-aware Planning with progressive interaction and refinement
- Fusion-based Collision Loss: Models collision risk using fused multi-modal features
Results vs. UniAD (camera-only):
- 37% error reduction for trajectory prediction
- 29% enhancement for occupancy prediction
- 14% decrease in collision rates for planning
1.5 BEV-Based Approaches
BEVFormer — ECCV 2022
Camera-only framework learning unified BEV representations via spatiotemporal transformers.
- Spatial Cross-Attention: BEV queries extract features from regions of interest across multiple camera views using deformable attention
- Temporal Self-Attention: Recurrently fuses historical BEV information, enabling temporal reasoning without explicit tracking
- Performance: 56.9% NDS on nuScenes test (on par with LiDAR baselines)
- Significance: Proved camera-only approaches could match LiDAR-based methods in 3D perception, spawning numerous follow-up works
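The geometric core of spatial cross-attention is projecting a BEV query's 3D reference point into each camera image to decide where to sample features. A minimal pinhole sketch (toy intrinsics, camera at origin, z-forward; the real model uses nuScenes calibration plus learned deformable offsets):

```python
import numpy as np

# Project BEV reference points into a camera image (pinhole model).
# Intrinsics and pose here are illustrative, not nuScenes calibration.

def project_to_image(X_world, K, R, t):
    """Project (N,3) world points to (N,2) pixels; also return validity."""
    X_cam = X_world @ R.T + t                 # world -> camera frame
    z = X_cam[:, 2]
    uv = (X_cam @ K.T)[:, :2] / z[:, None]    # intrinsics + perspective divide
    return uv, z > 0                          # points behind camera invalid

K = np.array([[500.0, 0.0, 320.0],
              [0.0, 500.0, 240.0],
              [0.0, 0.0, 1.0]])
R, t = np.eye(3), np.zeros(3)
bev_ref = np.array([[0.0, 0.0, 10.0]])        # a BEV pillar point 10 m ahead
uv, valid = project_to_image(bev_ref, K, R, t)
```

An on-axis point projects to the principal point (320, 240); BEV queries whose reference points fall outside a camera's frustum simply skip that view.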
BEVFusion — ICRA 2023
Multi-task, multi-sensor fusion framework from MIT Han Lab.
- Unified BEV Space: Projects both camera and LiDAR features into a shared BEV representation, preserving both geometric (from LiDAR) and semantic (from cameras) information
- Optimized BEV Pooling: Reduces latency by 40x compared to naive implementations
- Task-Agnostic Design: Seamlessly supports 3D detection, segmentation, and other tasks with minimal architectural changes
- Results: +1.3% mAP/NDS for detection, +13.6% mIoU for BEV segmentation, at 1.9x lower computation cost
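The BEV pooling step that BEVFusion optimizes can be understood from its naive baseline: scatter per-point camera features into a BEV grid and average everything landing in the same cell. A sketch of that slow baseline (the reported ~40x speedup comes from replacing exactly this loop with a precomputed, parallelized kernel):

```python
import numpy as np

# Naive BEV pooling: average per-point features that fall in the same cell.
# Grid size and cell resolution are illustrative.

def bev_pool(points_xy, feats, grid=(4, 4), cell=1.0):
    H, W = grid
    out = np.zeros((H, W, feats.shape[1]))
    cnt = np.zeros((H, W, 1))
    for (x, y), f in zip(points_xy, feats):
        i, j = int(y // cell), int(x // cell)
        if 0 <= i < H and 0 <= j < W:
            out[i, j] += f
            cnt[i, j] += 1
    return out / np.maximum(cnt, 1)           # mean per occupied cell

grid_out = bev_pool([(0.5, 0.5), (0.7, 0.3)], np.array([[1.0], [3.0]]))
```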
1.6 TransFuser — PAMI 2023
Multi-modal fusion transformer integrating image and LiDAR representations using attention mechanisms.
- Architecture: Transformer-based fusion of camera images and LiDAR point clouds at multiple spatial scales
- Sensor Fusion: Global attention between image and LiDAR feature maps at each resolution level
- CARLA Performance: At submission, TransFuser outperformed all prior work on the CARLA leaderboard driving score by a large margin
- TransFuser++: Achieved second place at CVPR 2024 CARLA Autonomous Driving Challenge
1.7 InterFuser — CoRL 2022
Safety-enhanced autonomous driving using interpretable sensor fusion transformer.
- Architecture: Unified transformer fusing synchronized RGB and LiDAR sensor data for comprehensive scene understanding
- Interpretable Outputs: Produces intermediate representations (attention maps, object detections) that explain downstream control decisions
- Safety Focus: Explicitly optimized for safety through infraction-minimizing objectives
- CARLA Performance: Highest driving score (76.18) on public CARLA Leaderboard with superior route completion and infraction metrics vs. TransFuser, LBC, NEAT
1.8 TCP (Trajectory-guided Control Prediction) — NeurIPS 2022
Architecture:
- Dual-Branch Design: Trajectory planning branch + direct control branch
- Trajectory Branch: Predicts future waypoints
- Control Branch: Multi-step prediction scheme reasoning about current actions and future states, guided by trajectory branch via attention at each time step
- Fusion: Situation-based scheme merges outputs from both branches
- Input: Monocular camera only
Performance: Driving score 75.137, ranked 1st on public CARLA Leaderboard, surpassing methods using multiple cameras + LiDAR by 13.291 points.
1.9 NEAT (Neural Attention Fields) — ICCV 2021
Represents driving scenes as continuous neural attention fields, enabling dense spatial reasoning from sparse observations. Pioneered the use of neural implicit representations for end-to-end driving in CARLA.
1.10 ReasonNet — CVPR 2023
Three-Module Architecture:
- Perception Module: Fuses multi-sensor data to generate BEV features, traffic sign features, and waypoints
- Temporal Reasoning Module: Processes current and historic features via a memory bank for temporal context
- Global Reasoning Module: Models interactions and relationships among objects and environment to detect adverse events (occlusion, sudden appearances)
Key Innovation: Explicit temporal and global reasoning modules that improve robustness to occlusions and rare events.
1.11 GameFormer — ICCV 2023 (Oral)
Architecture:
- Game-Theoretic Formulation: Models multi-agent interaction prediction as a hierarchical game
- Transformer Encoder: Models relationships between all scene elements (agents, lanes, traffic signals)
- Hierarchical Transformer Decoder: At each level, agents respond to other agents' behaviors from the preceding level, iteratively refining the interaction process
- Applications: Interactive prediction and planning for dense traffic scenarios
1.12 Significant 2024-2025 Models
DriveTransformer (2025)
A simplified E2E framework for scaling, featuring:
- Task Parallelism: Agent, map, and planning queries directly interact at each transformer block (vs. sequential perception-prediction-planning)
- Sparse Representation: Efficient scene encoding
- Benefit: Eliminates cumulative errors from manual task ordering
PARA-Drive — CVPR 2024
Fully parallel end-to-end architecture achieving state-of-the-art in perception, prediction, and planning while significantly enhancing runtime speed. Demonstrates that parallelizing traditionally sequential tasks can improve both accuracy and latency.
GenAD — ECCV 2024 (CVPR 2024 Highlight)
Casts autonomous driving as a generative modeling problem. First large-scale video world model for driving, trained on 2000+ hours of diverse driving videos from the web. Can be adapted into an action-conditioned prediction model or a motion planner.
Hydra-MDP — CVPR 2024 Challenge Winner
Multi-teacher, student-teacher knowledge distillation framework. Integrates knowledge from both human and rule-based planners. Won first place and innovation award at CVPR 2024 E2E Driving at Scale Challenge on the nuPlan benchmark.
DriveVLM — CoRL 2024
Integrates Vision-Language Models (VLMs) for enhanced scene understanding. Three reasoning modules: scene description, scene analysis, and hierarchical planning. Bridges VLMs with traditional driving pipelines for improved interpretability.
Vision-Language-Action (VLA) Models (2025-2026)
Emerging paradigm integrating perception, language reasoning, and action generation:
- End-to-End VLA: Single model for perception, reasoning, and planning
- Dual-System VLA: Slow deliberation (VLM) + fast execution (planner)
- DeepRoute.ai (GTC 2026): 40B parameter VLA model deployed across 250,000+ production vehicles
- OpenDriveVLA (AAAI 2026): Open-source end-to-end driving with large VLA models
2. Foundation Models for Driving
2.1 NVIDIA Foundation Models
Alpamayo 1 (December 2025)
- Architecture: 10-billion parameter chain-of-thought reasoning Vision-Language-Action (VLA) model
- Input: Video
- Output: Trajectories + reasoning traces explaining each decision
- Design Philosophy: Teacher model for distillation — not deployed directly in vehicles. Developers fine-tune and distill into compact AV stack backbones
- Target: Long-tail edge cases — rare, complex driving conditions
- Open Source: Model weights on Hugging Face, inference scripts on GitHub, AlpaSim simulation framework
- Dataset: Physical AI Open Datasets with 1,700+ hours of diverse driving data
Cosmos World Foundation Models (January 2025)
- Purpose: Generative world foundation models for physical AI (AVs + robots)
- Capabilities: Generate physics-based videos from text, image, video, or sensor/motion data. Model physical interactions, object permanence, diverse road conditions
- Application: Generate synthetic training data; amplify variations of physically-based sensor data for AV simulation
- Adopters: 1X, Agility Robotics, Figure AI, Foretellix, Skild AI, Uber
Cosmos-Reason (2025)
Reasoning model enabling vehicles to "think through" decisions — e.g., navigating construction zones, interpreting ambiguous traffic signals. Forms the foundation for Alpamayo.
DRIVE Platform Architecture
- Three-Computer Solution: DGX (cloud training) -> DRIVE Sim (Omniverse testing) -> DRIVE AGX (in-vehicle inference)
- DRIVE AGX Orin: 254 TOPS, powering DRIVE Concierge + DRIVE Chauffeur
- DRIVE AGX Thor (next-gen): 8x GPU performance, 2.6x CPU improvement vs. Orin
- Halos (March 2025): Unified system combining NVIDIA automotive hardware, software, and AV safety research
2.2 Waymo — EMMA (End-to-End Multimodal Model for Autonomous Driving) — October 2024
Architecture:
- Foundation: Built on Google Gemini multimodal LLM (fine-tuned Gemini 1.0 Nano-1)
- Formulation: Driving tasks reformulated as visual question-answering: O = G(T, V), where the Gemini model G produces language outputs O from a text prompt T and camera video V
- Input Representation: Surround-view camera videos stitched into images/sequences; all non-sensor inputs (navigation commands, ego status) represented as plain text
- Output Representation: Trajectories, 3D detections, road graphs all represented as natural language text (x,y coordinates as floating-point strings)
- Task-Specific Prompts: Different prompts select which task to perform
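EMMA's all-text interface means trajectories must survive a round trip through strings: waypoints are emitted as floating-point text in the model's answer and parsed back for the controller. A sketch of such an encoding (the exact serialization format is an assumption; the paper does not publish a grammar):

```python
# Illustrative EMMA-style text I/O for waypoints. The ";"/"," delimiters and
# 2-decimal precision are assumptions, not the paper's exact format.

def encode_waypoints(wps):
    return ";".join(f"{x:.2f},{y:.2f}" for x, y in wps)

def decode_waypoints(text):
    return [tuple(float(v) for v in pt.split(",")) for pt in text.split(";")]

answer = encode_waypoints([(1.0, 0.5), (2.0, 1.0)])
```

The appeal of this design is that one tokenizer and one loss cover detections, road graphs, and planning alike; the cost is that numeric precision is bounded by the string format.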
Chain-of-Thought Reasoning (R1-R4):
- R1 — Scene description (weather, time, traffic, road characteristics)
- R2 — Critical objects with 3D/BEV coordinate predictions
- R3 — Behavior descriptions (status and intent of critical agents)
- R4 — Meta driving decisions across 12 categories
- Automated label generation (no human annotation needed)
- 6.7% improvement over standard E2E planning
Multi-Task Co-Training Results:
- Co-training all three tasks: up to 5.5% improvement over single-task specialists
- Tasks are complementary — co-training planning + detection improves detection by 2.4%
Performance:
- nuScenes motion planning: L2 0.32m (EMMA), 0.29m (EMMA+)
- Waymo WOMD: State-of-the-art on ADE at 1s, 3s, and 5s horizons
- WOD 3D detection: 16.3% relative improvement in vehicle precision
Acknowledged Limitations:
- Limited temporal context (max 4 frames)
- No LiDAR/radar integration — camera only
- No consistency guarantee between outputs and perception
- Extremely high computational cost for sensor simulation
- Not deployable in real-time on current hardware (requires distillation)
2.3 Tesla FSD Architecture Evolution
Current State (FSD V13, Summer 2025):
- 48 Neural Networks: Process inputs from 8 cameras (360-degree coverage)
- End-to-End Architecture: FSD V12+ replaced 300,000+ lines of C++ rules with a single neural network pipeline mapping raw camera inputs directly to steering/acceleration/braking
- Occupancy Networks (2nd Generation): Transform 2D camera images into 3D spatial understanding via BEV transformations. Predict "Occupancy Volume" and "Occupancy Flow" — what is free vs. occupied, and where occupied space will move
- Training Scale: 70,000 GPU hours per cycle, 1.5+ petabytes of data from 4+ million vehicles, target 100 exaFLOPS by end of 2025
- V13 Shift: Moves toward more unified end-to-end AI, away from siloed neural networks stitched with human-written code
- Camera-Only: No LiDAR, radar, or ultrasonic sensors
2.4 Comma.ai — openpilot
Architecture (v0.11, 2025):
openpilot represents the most advanced open-source end-to-end driving system deployed at scale (325+ car models, 10,000+ users, 100M+ miles).
World Model Architecture:
- Frame Compressor: ViT encoder (50M params) / decoder (100M params). Compresses dual camera feeds (narrow + wide, 3x128x256 each) into a 32x16x32 latent space, trained as a masked autoencoder with LPIPS, adversarial, and LSE losses
- Diffusion Transformer: 2 billion parameter transformer (48 layers, 25 heads, 1600 embedding dimension). Processes 2s past context + 1s future conditioning + 0-7s simulation window. Block-causal attention masking. Trained on 2.5M minutes of driving video using Rectified Flow
- Inference: 15 Euler steps, 12.2 frames/second/GPU throughput
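The "15 Euler steps" inference recipe is plain ODE integration of a learned velocity field from noise (t=0) to data (t=1). A toy sketch, with a closed-form velocity standing in for the 2B-parameter diffusion transformer:

```python
# Toy rectified-flow sampler: Euler-integrate a velocity field over [0, 1].

def euler_sample(x0, velocity, steps=15):
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x = x + dt * velocity(x, i * dt)
    return x

# For a straight-line (rectified) path x_t = (1-t)*x0 + t*x1, the velocity
# is the constant x1 - x0, so Euler integration recovers x1 (up to rounding):
x0, x1 = 0.0, 3.0
x_hat = euler_sample(x0, lambda x, t: x1 - x0)
```

Rectified Flow's appeal is exactly this: straightened trajectories make coarse Euler integration accurate, keeping the step count (and latency) low.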
Training Paradigm (Breakthrough):
- Large world model (simulator) trains on unlabeled fleet data
- Smaller policy network trains on world model rollouts using latest policy checkpoint
- On-policy training with lateral and longitudinal noise injection
- First real-world robotics agent shipped to users that was fully trained in learned simulation
Evolution:
- Pre-v0.10: Hand-coded MPC planners
- v0.10: World model for planning, reprojective simulator for videos
- v0.11: Fully learned simulation + world model planning
2.5 Mobileye
Architecture:
- Dual-System Design: Independent camera system + independent radar-lidar system (redundancy for safety)
- SuperVision Platform: 11 cameras + optional radar, AI-powered surround computer vision
- REM (Road Experience Management): Crowdsourced HD mapping from millions of production vehicles
- RSS (Responsibility-Sensitive Safety): Formal mathematical model for driving policy ensuring provably safe decisions
- Hardware: EyeQ6H chipset; modular ECU series supporting SuperVision (ADAS), Chauffeur (L3), and Drive (L4+)
- Scale: 240,000+ Zeekr vehicles shipped with SuperVision; 19M+ EyeQ6H-based systems projected
Key Differentiator: True Redundancy — two independent perception subsystems must agree before action, providing safety guarantees absent from single-stack approaches.
2.6 Zoox (Amazon)
- Purpose-Built Vehicle: No steering wheel, no pedals; bidirectional design with opposing seat rows
- Sensor Suite: Multiple LiDARs, cameras, and radars for 360-degree perception
- Geofenced Operation: Operates in defined urban areas (SoMa, Mission District in SF as of November 2025)
- Architecture: Proprietary full-stack autonomous system designed from scratch for driverless operation
- Status: "Zoox Explorers" program launched November 2025 in San Francisco
2.7 Cruise (GM — Shut Down 2024)
GM shut down Cruise's robotaxi business in 2024 following safety incidents and regulatory issues. The technology and some teams were absorbed into GM's ADAS/autonomy efforts. Represents a cautionary case for the industry regarding safety validation and public trust.
2.8 Aurora Innovation
- Aurora Driver: SAE Level 4 system for trucks and ride-hailing vehicles
- FirstLight LiDAR: Custom FMCW long-range LiDAR measuring position AND velocity of every point. Current gen: 500m range; next gen (mid-2026): 1,000m range, 50% hardware cost reduction
- Common Core Architecture: Single software stack deployable across multiple truck platforms
- Partnership: Continental for production-ready hardware (2027 target)
- Status: 100,000+ driverless miles on public roads without safety incidents (Q3 2025), commercial trucking service operating
3. Occupancy-Based Approaches
3.1 Occupancy Networks: Core Concept
Occupancy networks discretize the 3D driving environment into voxel grids, assigning each cell a probability of being occupied and optionally a semantic label. This representation offers several advantages over bounding boxes:
- General Object Representation: Can represent arbitrary shapes (construction debris, fallen cargo, unusual obstacles) without requiring pre-defined categories
- Free-Space Reasoning: Explicitly models traversable vs. non-traversable space
- Fine-Grained Geometry: Sub-meter resolution volumetric understanding
- Dynamic Prediction: Occupancy flow predicts where occupied space will move
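In code, the representation is simply a probability per voxel plus a traversability query. A minimal sketch (grid size and resolution are illustrative):

```python
import numpy as np

# Minimal voxel-occupancy grid with a free-space query.

RES = 0.5                                    # metres per voxel
grid = np.zeros((20, 20, 8))                 # 10 m x 10 m x 4 m volume
grid[10, 12, 0] = 0.9                        # an occupied voxel at ground level

def is_free(x, y, z, thresh=0.5):
    i, j, k = (int(v / RES) for v in (x, y, z))
    return bool(grid[i, j, k] < thresh)
```

Because the query never asks *what* occupies a cell, arbitrary obstacles (debris, cargo, equipment) are handled identically to known classes, which is the advantage listed above.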
3.2 Tesla's Occupancy Network
Tesla pioneered practical deployment of occupancy networks at scale:
- 1st Generation (2022): 3D occupancy volume prediction from multi-camera input via BEV transformations
- 2nd Generation (2023+): Adds Occupancy Flow — predicts future positions of occupied voxels. Handles amorphous/unusual obstacles (debris, cargo, animal groups). Functions as a real-time high-fidelity 3D map of immediate surroundings
- Integration: Core component of FSD V12+ pipeline, replacing explicit object detection for obstacle avoidance
3.3 TPVFormer — CVPR 2023
Tri-Perspective View Representation:
- Accompanies BEV with two additional perpendicular planes (front view + side view)
- Models each 3D point by summing its projected features on all three planes
- Transformer-based TPV encoder processes multi-camera images
- Key Result: Achieves semantic occupancy prediction for all voxels with only sparse supervision (no dense 3D labels needed)
- Described as "an academic alternative to Tesla's occupancy network"
3.4 SurroundOcc — ICCV 2023
- Predicts volumetric occupancy of surrounding 3D scenes from multi-camera images
- Uses 2D-to-3D feature lifting with spatial cross-attention
- Dense occupancy prediction with multi-scale supervision
- Provides a pipeline for generating dense occupancy ground truth from sparse LiDAR
3.5 OpenOccupancy — ICCV 2023
- First surrounding semantic occupancy perception benchmark
- Extends nuScenes with dense semantic occupancy annotations
- Defines evaluation metrics for the occupancy prediction task
- Baseline models and benchmarking framework for fair comparison
3.6 Occ3D — NeurIPS 2023
- Large-scale 3D occupancy prediction benchmark
- Label generation pipeline producing dense, visibility-aware labels
- Accounts for sensor occlusion in ground truth generation
- Supports both nuScenes and Waymo datasets
3.7 Occupancy Prediction and World Models: The Connection
Occupancy prediction is foundational to world models for autonomous driving:
OccWorld — ECCV 2024:
- 3D occupancy-based world model that jointly predicts ego-car movement and surrounding scene evolution
- Given past 3D occupancy observations, forecasts future scenes and ego movements
- Compatible with self-supervised (SelfOcc), LiDAR-collected (TPVFormer), or machine-annotated (SurroundOcc) occupancy data
- Scalable to large-scale training
Drive-OccWorld — AAAI 2025:
- Vision-centric world model for 4D occupancy and flow forecasting
- Integrated with end-to-end planning
- Plans trajectories by forecasting future occupancy state and selecting optimal trajectory via occupancy-based cost function
OccSora (2024):
- Uses 4D occupancy generation models as world simulators
- Generates future occupancy sequences for training and evaluation
Key Insight: Occupancy representations bridge the gap between perception and world modeling. They provide a natural intermediate representation that is:
- General enough to capture arbitrary geometry
- Structured enough for physics-based reasoning
- Compatible with generative models for future prediction
- Directly usable for planning via free-space cost functions
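The last point, planning via a free-space cost function, reduces to scoring candidate trajectories against forecast occupancy and keeping the cheapest, as in Drive-OccWorld's trajectory selection. A toy sketch (grids and candidates are illustrative):

```python
# Occupancy-based trajectory selection: sum forecast occupancy along each
# candidate and take the minimum-cost one.

def traj_cost(traj, occ_seq):
    # occ_seq[t] maps BEV cell (i, j) -> forecast occupancy probability at t
    return sum(occ_seq[t].get(cell, 0.0) for t, cell in enumerate(traj))

occ_seq = [{(1, 0): 0.9}, {(2, 0): 0.8}, {}]
candidates = [
    [(1, 0), (2, 0), (3, 0)],   # drives through forecast occupancy
    [(1, 1), (2, 1), (3, 1)],   # swerves around it
]
best = min(candidates, key=lambda t: traj_cost(t, occ_seq))
```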
4. Multi-Modal Sensor Fusion
4.1 Camera-Only vs. Camera + LiDAR
| Aspect | Camera-Only | Camera + LiDAR |
|---|---|---|
| Cost | Low ($50-500 per camera) | High (LiDAR $500-$10,000+) |
| Depth Accuracy | Estimated (errors at range) | Direct measurement (mm accuracy) |
| Semantic Richness | High (color, texture, signs) | Limited (geometry only) |
| Weather Robustness | Degraded in rain/fog/snow | LiDAR robust; rain can cause noise |
| Lighting | Requires illumination | LiDAR works in total darkness |
| Data Volume | Moderate | High (point clouds are large) |
| Scaling | Easy (cameras are ubiquitous) | Difficult (cost, calibration) |
| Proponents | Tesla, Comma.ai | Waymo, Zoox, Aurora, Mobileye |
Current Research Consensus (2024-2025): For safety-critical applications, multi-sensor fusion remains the gold standard. BEVFusion demonstrated that unified BEV fusion preserves both geometric accuracy (LiDAR) and semantic density (cameras). However, camera-only approaches (BEVFormer, UniAD, SparseDrive) have narrowed the gap significantly, with some now matching LiDAR-based perception accuracy.
Relevance for Airside: Low-speed, structured environments may favor multi-sensor approaches due to:
- Safety-critical proximity to aircraft and personnel
- Need for precise positioning (centimeter-level)
- Operation in all weather conditions
- Lower cost sensitivity (fewer vehicles, higher per-unit safety investment)
4.2 Radar Integration
Traditional automotive radar provides velocity measurements and operates in all weather, but has low angular resolution. Integration approaches:
- Late Fusion: Radar detections fused with camera/LiDAR objects at the object level
- Mid-Level Fusion: Radar features fused in BEV space with camera/LiDAR BEV features
- Early Fusion: Raw radar signals fused with other sensor modalities at the feature level
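Late fusion is the simplest of the three to sketch: associate radar detections with camera objects by nearest-neighbour gating on position, then attach the radar's Doppler velocity. The 2 m gate and the detection format below are illustrative.

```python
import math

# Toy object-level (late) fusion of radar velocity onto camera detections.

def late_fuse(cam_objs, radar_dets, gate=2.0):
    fused = []
    for obj in cam_objs:
        best, best_d = None, gate
        for det in radar_dets:
            d = math.dist(obj["xy"], det["xy"])
            if d < best_d:                      # nearest detection inside gate
                best, best_d = det, d
        fused.append({**obj, "v": best["v"] if best else None})
    return fused

cams = [{"xy": (10.0, 0.0), "cls": "GSE"}]
radar = [{"xy": (10.4, 0.2), "v": 3.1}, {"xy": (40.0, 5.0), "v": 12.0}]
out = late_fuse(cams, radar)
```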
4.3 4D Radar Developments
4D imaging radar represents a significant advancement, providing elevation information (azimuth + elevation + range + Doppler velocity):
- Market Growth: USD 677M (2024) projected to USD 1,043M (2032)
- Key Datasets:
- V2X-R (Xiamen University, 2024): 12,079 scenes, LiDAR-camera-4D radar fusion with adverse weather simulation. Multi-modal Denoising Diffusion uses radar features to clean noisy LiDAR data
- MAN TruckScenes: 747 scenes, 360-degree 4D radar data, largest annotated 3D bounding box radar dataset
- Advantages for Airside:
- All-weather operation (critical for 24/7 airport operations)
- Direct velocity measurement (useful for tracking ground support equipment)
- Lower cost than LiDAR
- Robust to rain, snow, fog, dust — common airport conditions
- Penetrates jet exhaust plumes and airborne FOD (Foreign Object Debris) that may scatter LiDAR returns
4.4 Sensor Fusion Architectures for Low-Speed Environments
For constrained, low-speed environments like airports, ports, and warehouses:
Typical Sensor Suite:
- 3D LiDAR (1-4 units for 360-degree coverage)
- Cameras (6-12 for surround view)
- 4D Radar (for weather robustness and velocity)
- RTK-GNSS (centimeter-level positioning)
- IMU (inertial measurement)
- Ultrasonic (close-range obstacle detection during docking)
Architecture Patterns:
- FPGA-Based Fusion: Ultra-low latency, high reliability for safety-critical applications
- Adaptive EKF Fusion: Selectively integrates multi-sensor data while minimizing energy consumption
- BEV Fusion with LIO-SAM: High-precision dynamic environment maps for logistics operations
- Infrastructure-Augmented: V2X communication with airport infrastructure for enhanced situational awareness
5. Deployment Considerations for Airside
5.1 Real-Time Inference Requirements
For airside autonomous vehicles operating at 5-25 km/h:
| Requirement | Value | Rationale |
|---|---|---|
| Perception Latency | < 100 ms | Obstacle detection at low speed |
| Planning Cycle | 50-100 ms (10-20 Hz) | Smooth trajectory generation |
| Control Loop | 20-50 ms (20-50 Hz) | Vehicle dynamics response |
| End-to-End Latency | < 200 ms | Sensor-to-actuator total |
| 3D Detection Range | 50-100 m | Adequate for low-speed ODD |
| Occupancy Resolution | 0.1-0.5 m | Precise docking with aircraft |
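Whether a candidate stack fits the budget in the table is a simple sum over stage latencies. A quick check against the <200 ms end-to-end figure (stage values are example numbers, not measurements):

```python
# Latency-budget check against the <200 ms sensor-to-actuator target.

BUDGET_MS = 200

def fits_budget(stage_ms):
    total = sum(stage_ms.values())
    return total <= BUDGET_MS, total

ok, total = fits_budget({"sensing": 30, "perception": 90,
                         "planning": 50, "control": 20})
```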
Model Inference Benchmarks (for reference):
- SparseDrive: 7.3-9.0 FPS on A100 — marginal for real-time, requires optimization
- UniAD: 1.8 FPS — not real-time capable without significant optimization
- openpilot policy model: Runs on Snapdragon 8 Gen 3 (mobile SoC)
- EMMA: Not deployable on-edge without distillation
5.2 Edge Computing Platforms
NVIDIA Jetson Family
| Platform | AI Performance | Power | Form Factor | Availability |
|---|---|---|---|---|
| Jetson Orin Nano | 67 TOPS | 7-25 W | Smallest | Available |
| Jetson Orin NX | 157 TOPS | 10-40 W | Small | Available |
| Jetson AGX Orin | 275 TOPS | 15-60 W | Module | Available |
| Jetson AGX Thor | 2000+ TOPS | TBD | Module | 2025-2026 |
Jetson AGX Orin is the recommended platform for airside autonomous vehicles:
- 275 TOPS sufficient for optimized perception + planning models
- Power configurable for energy-constrained electric GSE
- Mature ecosystem (JetPack SDK, TensorRT, DeepStream)
- Proven in robotics and autonomous machine deployments
NVIDIA DRIVE Platform
| Platform | Target | Performance |
|---|---|---|
| DRIVE AGX Orin | Production ADAS/AV | 254 TOPS |
| DRIVE AGX Thor | Next-gen AV | 2000+ TOPS |
DRIVE AGX is designed for automotive (ISO 26262 certified silicon), making it directly applicable to airside vehicles requiring functional safety certification.
Model Optimization Pipeline
Training (DGX/Cloud) -> ONNX Export -> TensorRT Optimization -> Edge Deployment
TensorRT Optimization Techniques:
- Quantization: FP32 -> FP16/INT8/FP8 (2-4x speedup)
- Layer Fusion: Merge adjacent operations (conv + BN + ReLU)
- Kernel Auto-Tuning: Hardware-specific kernel selection
- Dynamic Batching: Optimize throughput for multi-model pipelines
- Sparsity: Structured pruning for 2x additional speedup (Ampere+)
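The FP32 -> INT8 step above boils down to symmetric scale quantization: q = clip(round(x / s), -127, 127), dequantized as q * s, with the scale s calibrated from observed activation ranges. A worked example (the scale value is illustrative):

```python
# Symmetric INT8 quantization: the arithmetic behind TensorRT's FP32->INT8
# step. The scale is an illustrative calibration result, not a real one.

def quantize(x, scale):
    return max(-127, min(127, round(x / scale)))

scale = 0.05
q = quantize(1.23, scale)          # integer code
x_hat = q * scale                  # dequantized approximation
```

The reconstruction error is bounded by half the scale, which is why calibration (choosing s to match the activation range) determines INT8 accuracy.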
TensorRT Edge-LLM (2025): Open-source C++ framework for efficient LLM/VLM inference on embedded platforms (DRIVE AGX Thor, Jetson Thor). Enables reasoning-capable models on edge hardware.
5.3 Safety Certification
ISO 26262 (Road Vehicles — Functional Safety)
- Applicability: Directly applicable to airside autonomous vehicles as they are ground vehicles
- ASIL Levels: A (lowest) through D (highest). Airside vehicles likely require ASIL B-D depending on proximity to aircraft and personnel
- Key Requirements:
- Hazard analysis and risk assessment (HARA)
- Safety goals derivation
- Functional safety concept
- Hardware/software design verification
- Systematic capability and random hardware failure metrics
- AI/ML Challenge: ISO 26262 assumes deterministic software. Neural networks are probabilistic, creating certification gaps
- Industry Response: EasyMile achieved ISO 26262 certification for its autonomous driving platform (the first autonomous shuttle company to do so), demonstrating feasibility
DO-178C (Software Considerations in Airborne Systems)
- Applicability: Not directly required for ground vehicles, but relevant for airside operations near aircraft
- Design Assurance Levels (DAL): A (catastrophic) through E (no effect)
- Relevance: Airside vehicles operating in proximity to aircraft may need to satisfy aviation safety authorities (FAA, EASA) who reference DO-178C standards. The FAA's guidance on AGVS indicates increasing regulatory interest
- Hybrid Approach: Some operators may adopt ISO 26262 for the vehicle platform and elements of DO-178C or aviation safety management practices for airside-specific operations
Practical Certification Path for Airside
- ISO 26262 ASIL B-C for the autonomous driving platform
- Airport-specific safety management per FAA CertAlert 24-02 / Emerging Entrants Bulletin 25-02
- EASA/FAA coordination for operations in aircraft movement areas
- Operational safety case demonstrating risk mitigation for the specific ODD
- Redundancy architecture (dual-channel perception per Mobileye's approach)
5.4 Operational Design Domain (ODD) for Airside
The ODD defines the specific conditions under which an autonomous vehicle is designed to operate safely. For airside operations:
Proposed Airside ODD Parameters
| Parameter | Specification |
|---|---|
| Speed | 0-25 km/h (ramp), 0-40 km/h (service roads) |
| Operating Area | Defined apron, taxiways, service roads, baggage areas |
| Weather | All weather with degraded-mode for extreme conditions (heavy fog <50m visibility, ice storms) |
| Time of Day | 24/7 operation |
| Traffic | Mixed traffic with aircraft, GSE, personnel, FOD |
| Infrastructure | Paved surfaces, marked lanes (where available), geo-fenced boundaries |
| Connectivity | V2X with airport operations center, A-SMGCS integration |
| Exclusion Zones | Active runways (without explicit clearance), passenger boarding areas |
| Fallback | Remote operator takeover, safe stop in designated areas |
| Cargo/Payload | ULD containers, baggage dollies, up to specified weight |
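A machine-readable ODD lets the runtime check, at every cycle, whether current conditions remain in-domain and trigger the fallback otherwise. A minimal sketch of the table above (the class name, field names, and thresholds are illustrative, not from any standard):

```python
from dataclasses import dataclass

@dataclass
class AirsideODD:
    """Illustrative encoding of the proposed ODD table; thresholds are examples."""
    max_speed_ramp_kph: float = 25.0     # ramp/apron speed limit
    max_speed_service_kph: float = 40.0  # service-road speed limit
    min_visibility_m: float = 50.0       # below this -> degraded mode / safe stop

    def in_domain(self, speed_kph: float, on_ramp: bool, visibility_m: float) -> bool:
        """True if the current state is within the nominal ODD."""
        limit = self.max_speed_ramp_kph if on_ramp else self.max_speed_service_kph
        return speed_kph <= limit and visibility_m >= self.min_visibility_m

odd = AirsideODD()
assert odd.in_domain(speed_kph=20.0, on_ramp=True, visibility_m=200.0)
assert not odd.in_domain(speed_kph=30.0, on_ramp=True, visibility_m=200.0)  # over ramp limit
assert not odd.in_domain(speed_kph=20.0, on_ramp=True, visibility_m=30.0)   # heavy fog
```

In practice such a check would cover every row of the table (geofence, exclusion zones, payload limits) and feed the fallback logic (remote takeover, safe stop).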
Key ODD Challenges for Airside
- Jet Blast: High-velocity exhaust from taxiing/departing aircraft can affect sensor performance and vehicle stability
- FOD (Foreign Object Debris): Small objects on apron surfaces that must be detected and avoided
- Aircraft Priority: All ground vehicles must yield to aircraft unconditionally
- Dynamic Geometry: Aircraft pushback paths, variable stand assignments, temporary construction
- Electromagnetic Interference (EMI): Dense radar, radio-communication, and navigation-aid emissions near the apron can degrade vehicle sensors and V2X links
- Multi-stakeholder Environment: Airlines, ground handlers, airport authority, ATC — complex operational coordination
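The unconditional aircraft-priority rule above can be encoded as a static precedence table consulted by the planner. The ordering below (aircraft over personnel over GSE) is illustrative only; a real deployment would derive it from the airport's operating procedures:

```python
from enum import Enum, auto

class AgentType(Enum):
    AIRCRAFT = auto()
    PERSONNEL = auto()
    GSE = auto()

# Illustrative precedence: lower rank = higher priority.
# Aircraft always outrank ground traffic; personnel-vs-GSE ordering is an assumption.
PRIORITY = {AgentType.AIRCRAFT: 0, AgentType.PERSONNEL: 1, AgentType.GSE: 2}

def must_yield(ego: AgentType, other: AgentType) -> bool:
    """Ego yields to any agent with strictly higher priority."""
    return PRIORITY[other] < PRIORITY[ego]

assert must_yield(AgentType.GSE, AgentType.AIRCRAFT)   # unconditional yield to aircraft
assert not must_yield(AgentType.GSE, AgentType.GSE)    # peer GSE resolved by other rules
```

Peer-to-peer conflicts (GSE vs GSE) fall through to the interaction-aware planner rather than a fixed rule.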
6. Synthesis: Applicability to Airside Autonomous Vehicles
6.1 Most Applicable Architectural Patterns
For airside autonomous GSE (baggage tugs, cargo dollies, service vehicles), the following architectural elements are most relevant:
Perception
- BEV Fusion approach (camera + LiDAR + 4D radar): Robust multi-modal perception suited for safety-critical, all-weather operations
- Occupancy networks (TPVFormer / SurroundOcc style): Ideal for detecting arbitrary obstacles (FOD, loose cargo, ground equipment) without requiring category-specific training
- Sparse instance representations (SparseDrive style): Efficient for tracking known agent types (aircraft, other GSE, personnel)
Planning
- Probabilistic planning (VADv2 style): Handles uncertainty inherent in shared airside environments
- Collision-aware trajectory selection (SparseDrive): Critical for safe operation near aircraft
- Game-theoretic interaction modeling (GameFormer): Useful for multi-agent coordination on shared aprons
World Models
- Occupancy-based world models (OccWorld/Drive-OccWorld): Natural fit for predicting scene evolution in structured environments
- Learned simulation (openpilot v0.11): Enables training from fleet data without expensive manual annotation
Safety & Verification
- Dual-system redundancy (Mobileye): Independent perception channels for safety-critical decisions
- Chain-of-thought reasoning (EMMA): Explainability for safety case arguments
- RSS formal safety model (Mobileye): Mathematical guarantees on safe distances and right-of-way
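The RSS model's core longitudinal rule is a closed-form minimum following distance: the rear vehicle may accelerate for a response time rho before braking at its minimum rate, while the front vehicle brakes at its maximum rate. A sketch of that formula (the parameter values are illustrative, not calibrated for any vehicle):

```python
def rss_safe_longitudinal_distance(v_rear: float, v_front: float,
                                   rho: float = 0.5,
                                   a_max_accel: float = 2.0,
                                   b_min_brake: float = 4.0,
                                   b_max_brake: float = 8.0) -> float:
    """RSS minimum safe following distance (Shalev-Shwartz et al.), in metres.

    v_rear, v_front: speeds in m/s; rho: response time in s;
    accel/brake limits in m/s^2 (values here are placeholders).
    """
    v_after_rho = v_rear + rho * a_max_accel   # worst-case speed after response time
    d = (v_rear * rho
         + 0.5 * a_max_accel * rho ** 2        # distance gained while still accelerating
         + v_after_rho ** 2 / (2.0 * b_min_brake)   # rear stopping distance, gentle brake
         - v_front ** 2 / (2.0 * b_max_brake))      # front stopping distance, hard brake
    return max(0.0, d)
```

For airside use the front "vehicle" is often a stationary aircraft or parked GSE (`v_front = 0`), which maximizes the required margin; the monitor vetoes any plan that closes below this distance.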
6.2 Recommended Architecture for Airside AV
Sensors:
6-8 cameras (surround view)
2-3 LiDAR (360-degree, solid-state preferred for reliability)
4D radar (all-weather backup)
RTK-GNSS + IMU (precise positioning)
Ultrasonic (docking)
V2X radio (airport integration)
Perception:
BEV fusion backbone (camera + LiDAR + radar -> unified BEV)
3D occupancy prediction (free-space + obstacle)
Object detection + tracking (aircraft, GSE, personnel)
Online map element detection (lane markings, stand boundaries)
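In its simplest form, the occupancy stage above can be approximated by rasterizing fused LiDAR returns into a binary BEV grid; learned occupancy networks replace this heuristic but share the same output representation. A minimal numpy sketch (grid extent, cell resolution, and height band are illustrative):

```python
import numpy as np

def points_to_bev_occupancy(points: np.ndarray, grid_size: float = 50.0,
                            cell: float = 0.5, z_min: float = 0.1,
                            z_max: float = 2.5) -> np.ndarray:
    """Rasterize an ego-centred point cloud (N, 3) into a binary BEV occupancy grid.

    Points with height in [z_min, z_max] mark cells occupied; the band
    filters out ground returns and overhead structures (jet bridges, wings).
    """
    n = int(grid_size / cell)
    grid = np.zeros((n, n), dtype=bool)
    mask = (points[:, 2] >= z_min) & (points[:, 2] <= z_max)
    xy = points[mask, :2]
    idx = np.floor((xy + grid_size / 2) / cell).astype(int)
    valid = (idx >= 0).all(axis=1) & (idx < n).all(axis=1)
    grid[idx[valid, 1], idx[valid, 0]] = True   # row = y, col = x
    return grid

# A single return 5 m ahead at chest height occupies exactly one cell.
pts = np.array([[5.0, 0.0, 1.0]])
assert points_to_bev_occupancy(pts).sum() == 1
```

The class-agnostic nature of this representation is what makes occupancy approaches attractive for FOD and loose cargo: anything that returns points in the height band is an obstacle, trained category or not.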
Prediction:
Occupancy flow (future obstacle positions)
Agent trajectory prediction (multi-modal)
Planning:
Probabilistic multi-modal trajectory planning
Collision-aware selection with safety margins
Integration with airport traffic management
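Probabilistic, collision-aware selection in the planning stage can be sketched as scoring a discrete set of candidate trajectories: each candidate carries a learned prior probability (VADv2-style), and waypoints landing in occupied BEV cells incur a heavy penalty. This is a simplified illustration, not the published VADv2 scoring function:

```python
import numpy as np

def select_trajectory(candidates, occupancy, probs, cell=0.5, grid_size=50.0,
                      collision_weight=1000.0):
    """Return the best-scoring candidate trajectory.

    candidates: list of (T, 2) arrays of ego waypoints (ego frame, metres)
    occupancy:  boolean BEV grid (H, W); probs: prior probability per candidate
    Score = log prior minus a large penalty per waypoint in an occupied cell.
    """
    n = occupancy.shape[0]
    best, best_score = None, -np.inf
    for traj, p in zip(candidates, probs):
        idx = np.floor((traj + grid_size / 2) / cell).astype(int)
        idx = np.clip(idx, 0, n - 1)
        collisions = occupancy[idx[:, 1], idx[:, 0]].sum()
        score = np.log(p) - collision_weight * collisions
        if score > best_score:
            best, best_score = traj, score
    return best
```

With the collision weight dominating the log prior, a low-probability but collision-free lane change beats a high-probability path through an obstacle, which is the behavior the safety margins above require near aircraft.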
Control:
Low-level trajectory following
Emergency stop capability
Remote operator fallback
Compute:
NVIDIA Jetson AGX Orin (275 TOPS) or DRIVE AGX
TensorRT-optimized models (INT8/FP16)
Redundant compute for safety-critical paths
Safety:
Dual-channel perception (camera-based + LiDAR-based, independent)
Formal safety model (RSS-style)
Watchdog / system monitor
ISO 26262 ASIL B-C certified platform
6.3 Current Industry Deployments
| Company | Product | Airports | Technology |
|---|---|---|---|
| EasyMile | EZTow, EZDolly | Dubai, Changi, Narita, DFW, Schiphol | LiDAR + camera, 400+ deployments globally |
| AeroVect | AeroVect Driver | Multiple US airports | LiDAR + camera + RTK-GNSS, platform-agnostic |
| reference airside AV stack | autonomous baggage/cargo tug | 6 airports (incl. Zurich, Changi, Schiphol) | LiDAR + 360-degree cameras, all-weather |
| dnata | EZTow (EasyMile) | Dubai World Central | L3 autonomy, upgrading to L4 in 2026 |
6.4 Regulatory Landscape
- FAA CertAlert 24-02 (2024): First formal guidance for AGVS at Part 139 airports
- FAA Emerging Entrants Bulletin 25-02 (May 2025): Testing and demonstration guidance
- Key FAA Positions:
- Supports AGVS testing in controlled, low-congestion airport areas
- Permits operational testing in closed movement and safety areas with proper risk mitigation
- Airports should not close areas exclusively for AGVS testing
- Regional FAA Airport Certification and Safety Inspectors must be engaged early
6.5 Key Research Gaps for Airside
- Airside-specific datasets: No large-scale public dataset exists for airside driving scenarios (unlike nuScenes, Waymo for roads)
- Aircraft interaction modeling: No published game-theoretic models for GSE-aircraft interactions
- Jet blast and FOD perception: Limited research on perception robustness to jet engine exhaust and small debris detection
- Cross-domain transfer: Limited study on transferring road-driving models to airside environments
- Regulatory framework maturity: FAA/EASA guidance still emerging; no established certification pathway combining ISO 26262 + aviation safety
- V2X integration: Standards for AGVS-airport infrastructure communication are not yet established
7. Sources
End-to-End Driving Models
- UniAD: Planning-oriented Autonomous Driving — CVPR 2023 Best Paper
- UniAD GitHub Repository
- VAD: Vectorized Scene Representation for Efficient Autonomous Driving — ICCV 2023
- VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning
- SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation
- FusionAD: Multi-modality Fusion for Prediction and Planning Tasks
- BEVFormer: Learning Bird's-Eye-View Representation
- BEVFusion: Multi-Task Multi-Sensor Fusion
- TransFuser: Imitation with Transformer-Based Sensor Fusion
- InterFuser: Safety-Enhanced Autonomous Driving — CoRL 2022
- TCP: Trajectory-guided Control Prediction — NeurIPS 2022
- NEAT: Neural Attention Fields — ICCV 2021
- ReasonNet: End-to-End Driving with Temporal and Global Reasoning — CVPR 2023
- GameFormer: Game-theoretic Interactive Prediction and Planning — ICCV 2023
- DriveTransformer: Unified Transformer for Scalable E2E Driving
- PARA-Drive: Parallelized Architecture for Real-time Autonomous Driving — CVPR 2024
- GenAD: Generative End-to-End Autonomous Driving — ECCV 2024
- Hydra-MDP: End-to-End Driving at Scale — NVIDIA
- DriveVLM: Convergence of Autonomous Driving and Large Vision-Language Models
- End-to-end Autonomous Driving Survey — IEEE T-PAMI 2024
- VLA Models Survey for Autonomous Driving
Foundation Models
- NVIDIA Alpamayo: Open-Source AV Models and Tools
- NVIDIA Cosmos World Foundation Models
- NVIDIA Cosmos Release Announcement
- Waymo EMMA: End-to-End Multimodal Model for Autonomous Driving
- Waymo EMMA Blog Post
- Tesla FSD Architecture Evolution
- Tesla Occupancy Networks
- comma.ai openpilot 0.11 Release
- comma.ai openpilot GitHub
- Mobileye Technologies
- Aurora Driver Technology
- GAIA-1: Generative World Model for Autonomous Driving — Wayve
- DriveDreamer: Real-World-Drive World Models — ECCV 2024
Occupancy Prediction
- TPVFormer — CVPR 2023
- SurroundOcc — ICCV 2023
- OpenOccupancy — ICCV 2023
- Occ3D — NeurIPS 2023
- OccWorld: 3D World Model for Autonomous Driving — ECCV 2024
- Drive-OccWorld: 4D Occupancy Forecasting and Planning — AAAI 2025
- Survey on Occupancy Perception — Information Fusion 2025
Sensor Fusion
- 4D Radar Market Outlook 2026-2032
- Multi-Sensor Fusion Review in Autonomous Driving
- Camera vs LiDAR Comparison
Airside Operations & Safety
- FAA: Autonomous Ground Vehicle Systems on Airports
- EasyMile Airport Solutions
- AeroVect — AI-Powered Airside Automation
- ISO 26262 Certification for EasyMile
- DO-178C Standards Overview
- Operational Design Domain Definition
- dnata Autonomous Vehicles in Airport Operations
- DeepRoute.ai 40B VLA Foundation Model — GTC 2026