LLM-Based Reasoning, Planning, and Decision-Making for Autonomous Vehicles
Comprehensive Technical Report
Table of Contents
- LLMs as AV Planners and Reasoners
- Chain-of-Thought and Reasoning for Driving
- Multimodal LLMs for Driving
- LLM-Based Traffic Simulation
- Applicability to Airside Operations
- Taxonomy and Landscape Summary
- Key Challenges and Open Problems
1. LLMs as AV Planners and Reasoners
1.1 GPT-Driver (NeurIPS Workshop 2023)
Core idea: Reformulates motion planning as a language modeling problem. Driving trajectory waypoints are tokenized as language and GPT-3.5 is used to generate them.
- Input/Output: Scene descriptions and ego-vehicle state are encoded as text prompts; the LLM outputs a sequence of coordinate waypoints as natural-language tokens.
- Prompting-Reasoning-Finetuning strategy: A three-stage pipeline that (1) constructs structured prompts from the driving scene, (2) uses chain-of-thought to stimulate numerical reasoning, and (3) fine-tunes the LLM on trajectory data to improve coordinate precision (see the sketch after this list).
- Results on nuScenes: Outperforms state-of-the-art motion planners in effectiveness, generalization, and interpretability.
- Significance: The first work to demonstrate that a general-purpose LLM can serve as a competitive motion planner when the planning problem is reformulated as next-token prediction.
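The reformulation is easy to picture in code. Below is a minimal, illustrative sketch (not GPT-Driver's released implementation): the scene is serialized into a text prompt, and the model's natural-language reply is parsed back into numeric waypoints. The prompt wording and reply format are assumptions for demonstration.

```python
# A minimal sketch (not GPT-Driver's actual code) of "trajectory as
# language": serialize the scene into a text prompt, then parse the
# LLM's reply back into numeric (x, y) waypoints.
import re
from typing import List, Tuple

def build_prompt(ego: dict, objects: List[dict]) -> str:
    """Serialize ego state and detected objects into a text prompt."""
    lines = [f"Ego: v={ego['v']:.1f} m/s, a={ego['a']:.1f} m/s^2"]
    lines += [f"Object {o['name']} at ({o['x']:.1f}, {o['y']:.1f})" for o in objects]
    lines.append("Plan a 3 s trajectory as (x, y) waypoints, one per 0.5 s:")
    return "\n".join(lines)

def parse_waypoints(reply: str) -> List[Tuple[float, float]]:
    """Recover numeric waypoints from the reply's natural-language tokens."""
    pairs = re.findall(r"\((-?\d+\.?\d*),\s*(-?\d+\.?\d*)\)", reply)
    return [(float(x), float(y)) for x, y in pairs]

print(build_prompt({"v": 5.0, "a": 0.2}, [{"name": "car_1", "x": 3.0, "y": 18.0}]))
# A hand-written reply stands in for the actual LLM call here.
print(parse_waypoints("Trajectory: (0.0, 2.4), (0.1, 4.9), (0.1, 7.5)"))
```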
Reference: GPT-Driver: Learning to Drive with GPT
1.2 LanguageMPC (October 2023)
Core idea: Uses LLMs as high-level decision makers whose outputs parameterize a Model Predictive Controller (MPC).
- Architecture: The LLM receives a structured textual description of the driving scenario (road layout, surrounding vehicles, traffic signals) and produces a high-level decision (e.g., "slow down and yield to oncoming traffic"). This decision is translated into MPC parameter adjustments through guided parameter matrix adaptation (sketched after the bullets below).
- Key capabilities:
- Traffic rule reasoning: For an unsignalized intersection left turn, the LLM correctly decides to yield to oncoming traffic.
- User preference adaptation: Switches between conservative (follow the car ahead) and aggressive (overtake) driving styles based on natural language instructions.
- Multi-vehicle coordination: Handles complex multi-agent scenarios using commonsense reasoning.
- Results: Consistently surpasses baseline approaches in single-vehicle tasks and successfully handles multi-vehicle coordination.
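To make the control hand-off concrete, here is a minimal sketch of the pattern. The decision labels and weight multipliers are invented for illustration; LanguageMPC's actual guided parameter matrix adaptation is structured differently.

```python
# A minimal sketch of mapping an LLM decision onto MPC cost weights.
# Labels and multipliers are illustrative, not LanguageMPC's values.
import numpy as np

# Diagonal cost weights over [speed tracking, lateral deviation, jerk].
DEFAULT_Q = np.diag([1.0, 1.0, 0.5])

ADAPTATIONS = {
    "yield_to_oncoming": np.diag([0.2, 1.5, 2.0]),  # deprioritize speed, smooth stop
    "overtake":          np.diag([2.0, 0.8, 0.5]),  # prioritize progress
    "follow_lead":       np.diag([1.0, 1.2, 1.0]),
}

def adapt_mpc_weights(llm_decision: str) -> np.ndarray:
    """Scale the default cost matrix according to the LLM's high-level choice."""
    return DEFAULT_Q @ ADAPTATIONS.get(llm_decision, np.eye(3))

print(adapt_mpc_weights("yield_to_oncoming"))
```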
Reference: LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
1.3 LLM-Assist (2024)
Core idea: A hybrid planner that combines a conventional rule-based planner with an LLM-based planner for closed-loop driving.
- Problem addressed: Rule-based planners generalize well but fail on complex maneuvers; learning-based planners overfit and have poor long-tail performance. LLM-Assist uses LLMs (GPT-3.5 and GPT-4) to augment rule-based planning when the base planner fails.
- Mechanism: The LLM defines planner parameters (not raw trajectories) to safely navigate scenarios, acting as a "consultant" that adjusts the rule-based planner's behavior (see the sketch below).
- Key finding: GPT-4 shows a marginal edge in unconstrained settings, while GPT-3.5 performs slightly better in constrained scenarios.
- Significance: Demonstrates a practical integration pattern where the LLM provides high-level reasoning while the rule-based system handles low-level execution, reducing latency and safety concerns.
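A sketch of this consultant pattern follows. All planner and LLM interfaces here are stand-ins invented for illustration, not LLM-Assist's API; the point is the control flow: the LLM is only queried when the base planner fails, and it returns parameters rather than waypoints.

```python
# A minimal sketch of the hybrid fallback pattern (assumed interfaces).
from typing import List, Optional, Tuple

Trajectory = List[Tuple[float, float]]

def rule_based_plan(scene: dict) -> Optional[Trajectory]:
    """Stand-in base planner; returns None when it cannot find a plan."""
    return None if scene.get("blocked") else [(0.0, float(t)) for t in range(6)]

def llm_propose_parameters(scene: dict) -> dict:
    """Stand-in for an LLM call that returns *planner parameters*,
    not raw trajectories (the key LLM-Assist design choice)."""
    return {"lateral_offset": 1.5, "target_speed": 3.0}

def hybrid_plan(scene: dict) -> Trajectory:
    plan = rule_based_plan(scene)
    if plan is not None:
        return plan                      # fast path: no LLM latency
    params = llm_propose_parameters(scene)
    # Re-run planning with LLM-adjusted parameters.
    return [(params["lateral_offset"], params["target_speed"] * t * 0.5)
            for t in range(6)]

print(hybrid_plan({"blocked": True}))
```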
Reference: LLM-Assist: Enhancing Closed-Loop Planning with Language-Based Reasoning
1.4 DriveGPT4 (2023) and DriveGPT4-V2 (CVPR 2025)
Core idea: An interpretable end-to-end autonomous driving system built on a multimodal LLM.
- Architecture: A multimodal language model with a dedicated video tokenizer for processing multi-frame video sequences, combined with a shared text/control signal de-tokenizer.
- Training: Fine-tuned on 56K video-text instruction-following pairs for driving plus 223K general instruction-following data.
- Capabilities:
- Processes multi-frame video inputs and textual queries
- Interprets vehicle actions with natural-language reasoning
- Predicts low-level vehicle control signals end-to-end
- Answers diverse user questions about driving scenarios
- DriveGPT4-V2 (CVPR 2025): Extends to closed-loop autonomous driving with enhanced LLM capabilities.
- Results: Superior performance on the BDD-X dataset; comparable to GPT-4V on driving grounding tasks when domain-fine-tuned.
Reference: DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
1.5 DiLu -- A Knowledge-Driven Approach to Driving (ICLR 2024)
Core idea: A knowledge-driven framework with four modules that enables LLMs to accumulate driving experience and continuously improve, inspired by how humans learn to drive.
- Four modules:
- Environment: Highway-env simulation providing real-time vehicle positions, speeds, and accelerations.
- Reasoning Module: Encodes driving scenes as text, retrieves similar past experiences from a vector database, constructs few-shot prompts, and queries GPT-3.5 for driving decisions using chain-of-thought (retrieval sketched after the key results below).
- Reflection Module: Post-episode evaluation using GPT-4 -- for collisions, it identifies the error cause, generates corrected reasoning, and proposes preventive strategies.
- Memory Module: A vector database storing scene descriptions paired with reasoning processes. Initialized with just 5 human-crafted seed scenarios.
- Key results:
- With only 40 memory items, DiLu matches the performance of RL methods trained over 600,000 episodes.
- Superior generalization: when transferred to more complex scenarios (5 lanes, higher density), DiLu maintains a 35% success rate where RL suffers an 85% performance drop.
- Real-world transferability: experiences from the CitySim trajectory dataset improved simulation performance.
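A minimal sketch of the memory loop follows, with toy embeddings standing in for a real vector database; the data layout is assumed, not DiLu's released schema.

```python
# A minimal sketch of DiLu-style experience memory: store scenes with
# their reasoning traces, retrieve the nearest ones as few-shot examples.
import numpy as np

class ExperienceMemory:
    def __init__(self):
        self.vecs, self.records = [], []

    def add(self, vec: np.ndarray, scene: str, reasoning: str) -> None:
        self.vecs.append(vec / np.linalg.norm(vec))
        self.records.append((scene, reasoning))

    def retrieve(self, vec: np.ndarray, k: int = 3):
        """Cosine-similarity KNN over stored scene embeddings."""
        q = vec / np.linalg.norm(vec)
        sims = np.array([v @ q for v in self.vecs])
        return [self.records[i] for i in np.argsort(-sims)[:k]]

def few_shot_prompt(memory: ExperienceMemory, scene_vec, scene_text: str) -> str:
    shots = "\n---\n".join(f"Scene: {s}\nReasoning: {r}"
                           for s, r in memory.retrieve(scene_vec))
    return f"{shots}\n---\nScene: {scene_text}\nReasoning:"

mem = ExperienceMemory()
mem.add(np.array([1.0, 0.0]), "slow lead vehicle", "decelerate, keep lane")
mem.add(np.array([0.0, 1.0]), "open left lane", "change lane left")
print(few_shot_prompt(mem, np.array([0.9, 0.1]), "lead vehicle braking"))
```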
Reference: DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models
1.6 Agent-Driver (2024)
Core idea: Treats the LLM as a cognitive agent with a tool library, cognitive memory, and a multi-step reasoning engine, mimicking an anthropomorphic driving process.
- Tool Library: 20+ functions abstracting neural module outputs into text, spanning detection (get_leading_object), prediction (retrieve predicted trajectories), occupancy (check location occupancy probabilities), and mapping (lane/road shoulder info). The LLM selectively invokes relevant functions via dynamic calls (dispatch sketched after the results below).
- Cognitive Memory:
- Commonsense Memory: traffic regulations and safe driving knowledge.
- Experience Memory: past driving scenarios with decisions, retrieved via a two-stage search (KNN in embedding space + LLM-based fuzzy ranking).
- Reasoning Engine (4 steps):
- Chain-of-Thought Reasoning: Identifies key objects and their effects on driving.
- Task Planning: High-level plans combining discrete behaviors (move_forward, change_lane, turn, stop) with velocity estimates.
- Motion Planning: Generates 6 waypoints over 3 seconds through LLM fine-tuning on human driving data.
- Self-Reflection: Collision-checks trajectories against occupancy maps; refines colliding trajectories.
- Results on nuScenes:
- Collision rate: 0.09% average (35.7% reduction vs. second-best).
- L2 error: 0.37m average.
- Few-shot: with just 1% of training data, exceeds UniAD trained on full datasets.
- Zero invalid outputs even with minimal training data.
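The tool-library pattern reduces to ordinary function dispatch. In the minimal sketch below, the function names echo the paper's descriptions, but the bodies are stand-ins rather than the real neural modules, and the hard-coded call list stands in for the LLM's dynamic choices.

```python
# A minimal sketch of tool dispatch: the LLM names tools, the agent
# executes them and abstracts their outputs into text for the context.
from typing import Callable, Dict

def get_leading_object() -> str:
    return "leading vehicle 18 m ahead, decelerating"   # stand-in detector

def get_occupancy(x: float, y: float) -> str:
    return f"occupancy at ({x}, {y}): 0.05"             # stand-in occupancy map

TOOLS: Dict[str, Callable[..., str]] = {
    "get_leading_object": get_leading_object,
    "get_occupancy": get_occupancy,
}

def run_tool_calls(requested: list) -> str:
    """Execute the tools the LLM asked for; return textual results."""
    return "\n".join(TOOLS[name](*args) for name, args in requested)

# The LLM would choose these dynamically; hard-coded here for the demo.
print(run_tool_calls([("get_leading_object", ()),
                      ("get_occupancy", (0.0, 12.0))]))
```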
Reference: A Language Agent for Autonomous Driving
1.7 DriveMLM (2023, published Springer 2025)
Core idea: Aligns multimodal LLM decision outputs with the behavioral planning module of a modular AD system (Apollo).
- Architecture:
- Multi-modal tokenizer: transforms multi-view images, LiDAR, traffic rules, and user requirements into unified tokens.
- MLLM decoder: makes decisions based on unified tokens and generates natural-language explanations.
- Key innovation: Behavioral Planning States Alignment -- the LLM's linguistic decisions map directly to Apollo's behavioral planning states, enabling seamless conversion to vehicle control signals.
- Data engine: Custom dataset with decision state annotations and corresponding explanations.
- Results on CARLA Town05 Long: Driving Score of 76.1 (outperforming Apollo by 4.7 points); highest miles-per-intervention (0.96) among all compared systems.
Reference: DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States
1.8 LimSim++ (2024)
Core idea: The first open-source closed-loop simulation platform for deploying and evaluating MLLMs in autonomous driving.
- Dual-component architecture: Simulation System (SUMO + CARLA) and an MLLM-powered Driver Agent interacting in a closed-loop fashion.
- Capabilities: Prompt engineering, model evaluation, framework enhancement, continuous learning through reflection and memory.
- Results: LLM agents achieved over 90% route completion in complex scenarios; GPT-3.5's success rate improved from 50% to 100% through continuous learning.
Reference: LimSim++: A Closed-Loop Platform for Deploying Multimodal LLMs in Autonomous Driving
1.9 SurrealDriver (2023)
Core idea: Builds human-like generative driving agents using post-driving self-report "driving-thinking" data from real human drivers.
- Data collection: Post-driving self-reports from 24 real-world drivers captured their verbalized considerations and decision-making processes during urban driving.
- Framework design: Basic driving pipeline, safety and memory mechanism, and human-aligned long-term driving guidelines informed by human thought demonstrations.
- Results: Incorporating expert demonstration data reduced collision rates by 81.04% and increased human likeness by 50% compared to baseline LLM-based agents.
- Significance: First application of LLMs to construct an agent capable of human-like urban driving behavior.
Reference: SurrealDriver: Designing LLM-powered Generative Driver Agent Framework
1.10 DriveGPT -- Waymo (ICML 2025)
Core idea: A large autoregressive behavior model that applies LLM-style scaling laws to driving.
- Scale: 1B+ parameters, trained on 100M+ high-quality human driving demonstrations in dense urban settings -- approximately 3x larger and 50x more data than prior published behavior models.
- Architecture: Transformer encoder-decoder where the encoder summarizes scene context and the decoder predicts future agent states autoregressively as tokens.
- Key finding: Scaling experiments reveal LLM-like scaling laws for driving -- performance improves predictably with more data and compute, and autoregressive decoders show better scalability than one-shot decoders.
- Real-world deployment: Deployed as a real-time planner in complex urban scenarios on actual Waymo vehicles.
Reference: DriveGPT: Scaling Autoregressive Behavior Models for Driving
1.11 Additional Notable Systems
| System | Year | Key Contribution |
|---|---|---|
| Drive Like a Human | 2023 | Identifies three key LLM abilities for driving: reasoning, interpretation, memorization. Zero-shot pass rate > 60% in HighwayEnv. |
| Senna | 2024 | Bridges LVLM (Senna-VLM) with end-to-end planner (Senna-E2E). Reduces planning error by 27.12% and collision rate by 33.33%. |
| AutoVLA | NeurIPS 2025 | Unifies reasoning and action generation. Introduces dual thinking modes (fast/slow) and GRPO-based reinforcement fine-tuning. Top score on Waymo Challenge Spotlight metric. |
| KnowVal | 2025 | Integrates driving knowledge graph (traffic laws, ethics, defensive driving) with VLM reasoning. Achieves lowest collision rate on nuScenes. |
| Wayve LINGO-2 | 2024 | First closed-loop vision-language-action driving model tested on public roads. Language serves as both input prompt and output explanation. |
| DriveLLM-V | 2025 | Explicit natural-language "Vehicle Intention-Based Control Signals" embedded in LLM reasoning for behavioral explainability. |
| LLaDA (NVIDIA) | 2024 | Large Language Driving Assistant enabling policy adaptation to traffic rules in new locations via zero-shot LLM generalization. |
2. Chain-of-Thought and Reasoning for Driving
2.1 Decomposing Driving Tasks with LLMs
LLMs decompose the complex, monolithic driving task into structured sub-problems that mirror human cognitive processes:
```
Perception         ->  Prediction           ->  Planning            ->  Action
"What do I see?"       "What will they do?"     "What should I do?"     "Execute trajectory"
```
Agent-Driver's four-step decomposition is the canonical example:
- Chain-of-Thought Reasoning: "The vehicle ahead is braking. A pedestrian is crossing 30m ahead. The traffic light is yellow."
- Task Planning: "Decelerate and prepare to stop. Do not change lanes -- pedestrian in adjacent lane."
- Motion Planning: Generate 6 waypoints over 3 seconds with decreasing velocity.
- Self-Reflection: Check waypoints against occupancy map; refine if collision detected.
DriveLM's Graph VQA structures reasoning as a directed graph:
- Nodes: QA pairs about perception ("What objects are nearby?"), prediction ("Will the pedestrian cross?"), and planning ("Should I yield?")
- Edges: Logical progressions between reasoning steps
- This graph structure enforces causal reasoning chains and prevents the LLM from "jumping to conclusions."
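A minimal sketch of such a QA graph follows; the field names are illustrative, not DriveLM's schema. Answering nodes in topological order is what enforces the causal chain, since each step can condition on its parents' answers.

```python
# A minimal sketch of a Graph-VQA-style structure with dependency-ordered
# traversal (illustrative fields, not DriveLM's actual schema).
from dataclasses import dataclass, field
from graphlib import TopologicalSorter

@dataclass
class QANode:
    question: str
    stage: str                       # "perception" | "prediction" | "planning"
    parents: list = field(default_factory=list)

nodes = {
    "q1": QANode("What objects are nearby?", "perception"),
    "q2": QANode("Will the pedestrian cross?", "prediction", ["q1"]),
    "q3": QANode("Should I yield?", "planning", ["q1", "q2"]),
}

# Answer nodes in topological order so each step conditions on its
# parents' answers -- this is what blocks "jumping to conclusions".
order = TopologicalSorter({k: set(v.parents) for k, v in nodes.items()})
for key in order.static_order():
    n = nodes[key]
    print(f"[{n.stage}] {n.question} (depends on: {n.parents or 'none'})")
```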
2.2 Spatial Reasoning with LLMs
Spatial reasoning remains a fundamental challenge. Key approaches:
Talk2BEV (ICRA 2024): Constructs language-enhanced Bird's-Eye View maps by aligning vision-language features for each detected object. The BEV representation can be queried for spatial reasoning ("Is there a vehicle in my blind spot?"), distance estimation, and future scenario planning.
BEVLM (2025): Distills semantic knowledge from LLMs into BEV representations, improving LLM reasoning accuracy by 46% in cross-view driving scenes and safety-critical scenario performance by 29%.
Spatial-aware VLMs: Contemporary VLMs struggle with metric spatial understanding because they primarily consume 2D imagery. BEV representations bridge this gap by providing explicit spatial encoding that LLMs can reason over.
Visualization-of-Thought (NeurIPS 2024): Augments chain-of-thought with interleaved "visualizations" that mimic internal sketching, improving multi-hop spatial planning and navigation success rates.
2.3 Traffic Rule Understanding and Compliance
DriveReg / Driving with Regulation (2024):
- Traffic Regulation Retrieval (TRR) Agent: Uses RAG with text-embedding-ada-002 embeddings and FAISS similarity search to retrieve relevant regulations from state laws, DMV manuals, city ordinances, court cases, and driving norms.
- Cascading retrieval: Paragraph-level selection followed by sentence-level re-embedding for precision (sketched after this list).
- Dual-level evaluation: The reasoning agent (GPT-4o) performs both compliance checking (mandatory rules) and safety assessment (guidelines).
- Results: Without RAG: 76% accuracy; with RAG: 100% accuracy on 30 hypothetical scenarios. 15/17 correct on real-world nuScenes samples.
- Latency: ~2 seconds per decision with detailed reasoning; ~1 second for shorter outputs.
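The cascading retrieval step can be sketched compactly. In the sketch below, a toy character-bigram hash stands in for text-embedding-ada-002 and brute-force search stands in for FAISS; the rule texts are illustrative.

```python
# A minimal sketch of cascading retrieval: pick the best paragraph,
# then re-rank its sentences (toy embeddings, brute-force search).
import numpy as np

def embed(text: str) -> np.ndarray:
    """Toy stand-in embedding via character-bigram hashing."""
    v = np.zeros(64)
    for a, b in zip(text, text[1:]):
        v[(ord(a) * 31 + ord(b)) % 64] += 1.0
    return v / (np.linalg.norm(v) + 1e-9)

def best(query: str, candidates: list) -> str:
    q = embed(query)
    return max(candidates, key=lambda c: embed(c) @ q)

paragraphs = [
    "Vehicles must yield to pedestrians in crosswalks. Yielding means "
    "slowing or stopping until the pedestrian has passed.",
    "Parking is prohibited within 15 feet of a fire hydrant.",
]
query = "pedestrian is crossing ahead, must I stop?"
para = best(query, paragraphs)            # stage 1: paragraph-level selection
sentence = best(query, para.split(". "))  # stage 2: sentence-level re-ranking
print(sentence)
```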
LLaDA (NVIDIA, CVPR 2024): Enables AVs to adapt to traffic rules in new geographic locations through zero-shot LLM generalization, interpreting and applying unfamiliar local regulations without retraining.
KnowVal (2025): Encodes traffic laws, defensive driving principles, and ethical norms in a comprehensive driving knowledge graph, with an LLM-based retrieval mechanism tailored for driving scenarios.
2.4 Common-Sense Reasoning for Edge Cases
LLMs are uniquely valuable for long-tail scenarios where rule-based logic and statistical models fail:
- Traffic cone on a truck: Human drivers use common sense to recognize traffic cones as cargo rather than construction markers. Most perception systems fail on this distinction, but LLMs can reason: "The cones are on the truck bed, not on the road surface, so they are being transported."
- Drive Like a Human: Demonstrates that LLMs achieve > 60% zero-shot pass rate on highway scenarios that require common-sense reasoning, outperforming RL and search-based methods without any training.
- DiLu's Reflection Module: When the LLM makes an error (e.g., collision), GPT-4 analyzes the failure, identifies the root cause, and generates corrective strategies stored for future use -- mimicking how human drivers learn from mistakes.
- AutoScenario (2025): A multimodal LLM framework that converts real-world safety-critical data into textual representations, enabling LLMs to generalize risk factors and generate realistic corner cases for testing.
2.5 Multi-Step Planning with Language
Sequential reasoning patterns used in driving:
| Pattern | Example | Systems Using It |
|---|---|---|
| Chain-of-Thought | "I see X -> This means Y -> Therefore I should Z" | Agent-Driver, DiLu, DriveLM |
| Tree-of-Thought | Explore multiple plan candidates, evaluate each | AutoScenario, LLM4ADSTest |
| Graph-of-Thought | Non-linear reasoning with cross-connections | DriveLM Graph VQA |
| Reflection/Self-Critique | Generate plan -> Evaluate -> Refine | Agent-Driver, DiLu, LimSim++ |
| RAG-Augmented Reasoning | Retrieve rules/experience -> Reason with context | DriveReg, KnowVal, DiLu |
3. Multimodal LLMs for Driving
3.1 GPT-4V/4o Applied to Driving Analysis
"On the Road with GPT-4V(ision)" (ICLR 2024):
- Strengths: Outperforms existing systems in scene understanding and causal reasoning. Successfully recognizes weather conditions, traffic lights/signs across countries, and positions/actions of traffic participants.
- Limitations: Cannot predict numerical control signals; fails to correctly understand dynamic vehicle actions (turning, accelerating). Insufficient for real-time control but valuable for high-level scene analysis.
GPT-4V Explorations: Mining Autonomous Driving (2024):
- Extended evaluation of GPT-4V for mining driving-relevant insights from camera data.
GPT-4o in DriveReg: Used as the reasoning agent for traffic regulation compliance, achieving 100% accuracy with RAG-retrieved regulations.
3.2 Waymo's Driving VLM (Gemini-based)
Waymo uses a Driving VLM trained on top of Gemini for complex semantic reasoning:
- Fine-tuned on Waymo's driving data and tasks
- Leverages Gemini's world knowledge for rare and novel scenarios
- Example: When encountering a vehicle on fire, the VLM provides semantic signals prompting the system to reroute
- Represents a production-grade deployment of multimodal LLM reasoning in autonomous driving
3.3 DriveLM (ECCV 2024 Oral)
Core innovation: Graph Visual Question Answering (Graph VQA)
- QA pairs structured as a directed graph mimicking human reasoning: Perception -> Prediction -> Planning
- Each QA node connects to others via logical progression edges
- Instantiated on nuScenes and CARLA datasets
DriveLM-Agent baseline:
- Performs end-to-end driving competitively with driving-specific architectures
- Strong zero-shot generalization to unseen objects and sensor configurations
- Served as a main track in the CVPR 2024 Autonomous Driving Challenge
3.4 NuScenes-QA (AAAI 2024)
The first VQA benchmark for autonomous driving:
- 34K visual scenes, 460K question-answer pairs
- Multi-modal input: 6-view RGB cameras + 5D LiDAR point clouds
- Programmatically generated QA pairs from 3D detection annotations and scene graphs
- Covers Boston and Singapore driving scenes with diverse conditions
- Establishes standardized evaluation for VQA capabilities in driving contexts
Reference: NuScenes-QA: A Multi-modal Visual Question Answering Benchmark
3.5 How MLLMs Understand Driving Scenes
The emerging MLLM pipeline for driving scene understanding:
```
Multi-view Cameras + LiDAR
          |
Visual Encoder (e.g., CLIP, SigLIP, InternViT)
          |
BEV / Spatial Feature Extraction
          |
Visual-Language Alignment (projection/adapter)
          |
Large Language Model (reasoning backbone)
          |
Outputs: Scene description, decisions, trajectories, explanations
```
Key challenges identified:
- Metric spatial understanding: 2D image encoders lose 3D depth; BEV integration helps
- Temporal reasoning: Multi-frame processing is computationally expensive but necessary
- Hallucination: LLMs may confidently describe objects/scenarios that don't exist -- critical for safety
- Latency: Most MLLMs require seconds per inference, too slow for reactive control
4. LLM-Based Traffic Simulation
4.1 ChatSim (CVPR 2024 Highlight)
The first system for editable photo-realistic 3D driving scene simulation via natural language.
- Multi-agent LLM collaboration: Multiple LLM agents with specialized roles decompose simulation demands into specific editing tasks, mirroring human company workflows.
- Rendering: Novel multi-camera neural radiance field for photo-realistic outcomes with scene-consistent lighting estimation.
- Evaluation: Demonstrated on Waymo Open Dataset handling complex language commands for scene editing.
Reference: ChatSim: Editable Scene Simulation for Autonomous Driving via LLM-Agent Collaboration
4.2 Language Conditioned Traffic Generation -- LCTGen (CoRL 2023)
Natural language to traffic scenario generation:
- Interpreter Module: LLM converts natural language descriptions ("A busy intersection with a jaywalking pedestrian") into structured representations and retrieves matching maps from a real-world map library.
- Generator Module: Query-based Transformer generates the full traffic scenario (initial states + motions) in a single pass.
- Applications: Instructional traffic editing, controllable policy evaluation.
- Results: Outperforms prior work in both unconditional and conditional scene generation for realism and fidelity.
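A minimal sketch of the Interpreter idea follows, with an invented schema (LCTGen's actual structured representation is richer) and keyword rules standing in for the LLM call.

```python
# A minimal sketch of NL -> structured scenario spec (illustrative
# schema; keyword rules stand in for the LLM interpreter).
from dataclasses import dataclass

@dataclass
class ScenarioSpec:
    map_type: str          # e.g., "intersection", "highway"
    n_vehicles: int
    pedestrian_event: str  # e.g., "jaywalking", "none"

def interpret(description: str) -> ScenarioSpec:
    """Stand-in for the LLM interpreter."""
    return ScenarioSpec(
        map_type="intersection" if "intersection" in description else "highway",
        n_vehicles=8 if "busy" in description else 3,
        pedestrian_event="jaywalking" if "jaywalk" in description else "none",
    )

print(interpret("A busy intersection with a jaywalking pedestrian"))
```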
Reference: Language Conditioned Traffic Generation
4.3 LLMs for Realistic Agent Behavior Modeling
Cognitive Agents with LLM Reasoning:
- Multi-horizon memory-driven planning with reflection and adaptation
- Modal decisions emerge from LLM-driven reflection and situational feedback rather than predefined rules
- Enables investigation of emergent, profile-sensitive behaviors across stable and perturbed conditions
Promptable Closed-Loop Traffic Simulation (2024):
- LLM comprehends natural language prompts to generate language-conditioned policy queries for each traffic agent
- Enables controllable simulation where individual agent behaviors can be specified via text
AnchorDrive (2025):
- Two-stage framework: LLM generates semantically controllable scenarios under natural language constraints; diffusion model regenerates realistic trajectories while preserving intent
- Bridges the gap between linguistic controllability and physical realism
4.4 LLM-Driven Scenario Generation for Testing
| System | Approach | Key Innovation |
|---|---|---|
| AutoScenario | Multimodal LLM for corner case generation | Converts real-world safety-critical data to text for LLM reasoning |
| OmniTester | MLLM-driven scenario testing | Comprehensive multimodal testing framework |
| LLM4ADSTest | Tree-of-Thoughts strategy | Structured prompts + red-teaming refinement |
| AnchorDrive | LLM + diffusion model | Semantically controllable + physically realistic |
4.5 GAIA-1 -- Wayve (2023)
Generative world model for driving, architecturally paralleling LLMs:
- 9-billion parameter model trained on ~4,700 hours of UK driving data
- Uses vector-quantized representations to reframe video prediction as next-token prediction (analogous to LLM pretraining)
- Generates semantically meaningful, temporally consistent driving videos
- Can predict several minutes into the future from seconds of input
- Exhibits LLM-like scaling laws: performance improves with model size and data
5. Applicability to Airside Operations
5.1 Current State of Airport Autonomous Ground Vehicles
FAA regulatory status (CertAlert 24-02):
- The FAA has not authorized Autonomous Ground Vehicle Systems (AGVS) for operational use at Part 139 certified airports.
- Testing is supported in controlled environments: remote airport areas, landside locations, and movement areas closed to aircraft operations.
- Applications under consideration: maintenance vehicles (mowers, snow removal, sweepers, FOD detection), perimeter security, self-driving aircraft tugs, baggage carts, employee buses, and passenger shuttles.
- Existing safety requirements and standards were not developed with autonomous vehicles in mind; the FAA is developing new standards and guidance.
ICAO-level discussion: EASA recommends international regulatory coordination through ICAO for autonomous airport vehicles.
5.2 LLMs Understanding Airport Ground Operations
Direct relevance of LLM-for-driving research to airside:
The airside environment shares key characteristics with road driving that make LLM-based reasoning applicable:
| Road Driving Challenge | Airside Equivalent | LLM Capability |
|---|---|---|
| Traffic rules (state laws, signals) | ICAO Annex 14, FAA AC 150/5210-20, local airport rules | RAG-based rule retrieval and compliance checking |
| Multi-agent coordination | Aircraft, GSE, personnel on ramp/taxiway | Multi-agent reasoning (LanguageMPC, Agent-Driver) |
| Edge cases / corner cases | FOD, weather events, emergency vehicles | Common-sense reasoning (DiLu, Drive Like a Human) |
| Scene understanding | Gate areas, taxiways, apron markings | Multimodal scene analysis (GPT-4V, DriveLM) |
| Explainable decisions | Safety-case requirements for certification | Natural language explanations (DriveGPT4, Senna) |
5.3 Encoding ICAO/FAA Rules as Language Prompts
The DriveReg / RAG-based compliance framework provides a directly transferable architecture for airside regulation compliance:
Proposed adaptation:
- Regulation Corpus: Ingest ICAO Annex 2 (Rules of the Air), Annex 14 (Aerodromes), FAA AC 150/5210-20A (Ground Vehicle Operations), FAA AC 150/5340-1M (Marking Standards), local airport SOPs, and NOTAMs.
- Retrieval Agent: Embed regulations with a domain-specific model, use FAISS or similar for real-time retrieval based on the current operational context (e.g., "vehicle approaching active runway crossing").
- Reasoning Agent: LLM evaluates proposed actions against retrieved regulations:
- Mandatory compliance: "All vehicles must obtain clearance before entering a movement area" (FAA AC 150/5210-20)
- Safety guidelines: "Maintain safe distance from aircraft engine intake zones"
- Context-dependent rules: "During low-visibility operations, additional restrictions apply"
Key advantage: Unlike hard-coded rule engines, an LLM-based system can handle:
- Ambiguous or conflicting regulations
- Novel scenarios not explicitly covered by rules
- Integration of NOTAMs and temporary operational changes via natural language
- Regional variations between airports and jurisdictions
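A hedged sketch of how such a compliance gate might look is below. Every rule text, identifier, and tag is illustrative rather than actual ICAO/FAA language, and tag overlap stands in for embedding similarity search.

```python
# An illustrative sketch of a retrieval-gated compliance check for
# airside actions (rule texts and tags invented for demonstration).
RULES = [
    {"id": "AC-150/5210-20 (illustrative)", "level": "mandatory",
     "text": "vehicles must obtain clearance before entering a movement area",
     "tags": {"runway", "taxiway", "movement_area"}},
    {"id": "local-SOP-12 (illustrative)", "level": "guideline",
     "text": "maintain safe distance from aircraft engine intake zones",
     "tags": {"ramp", "apron"}},
]

def retrieve(context_tags: set) -> list:
    """Tag overlap stands in for embedding-based similarity retrieval."""
    return [r for r in RULES if r["tags"] & context_tags]

def check_action(context_tags: set, has_clearance: bool):
    """Gate the action on any applicable mandatory rule."""
    for rule in retrieve(context_tags):
        if rule["level"] == "mandatory" and not has_clearance:
            return False, f"Blocked per {rule['id']}: {rule['text']}"
    return True, "No mandatory rule violated"

print(check_action({"runway"}, has_clearance=False))
```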
5.4 Natural Language Interfaces for AV Operators
Existing work directly applicable to airside:
NASA Research on Digital Taxi Instructions (AIAA 2024):
- Uses Natural Language Understanding (intent classification + slot filling) to automatically generate digital taxi instructions from ATC speech.
- Motivation: Reduces errors from voice-only communication and enables data-link integration.
- Challenge: LLMs alone can make unsafe assignments (e.g., clearing multiple aircraft onto the same runway); deterministic guardrails are required.
Moonware HALO Platform:
- World's first AI-powered Ground Traffic Control system for airports.
- Consolidates operational inputs from across the airside into centralized coordination.
- On-demand task allocation replacing paper-based scheduling.
- Active at US hub airports with carriers including British Airways and Aerocharter, and in testing at Tokyo International Airport with Japan Airlines.
- Results: 20% reduction in delays, 5-minute average decrease in turnaround time.
- Future roadmap: Integration with autonomous GSE for pushback, baggage handling, and cargo.
LLM-based ATC Agent (Delft University, 2024):
- Language model agent with function-calling resolves air traffic conflicts autonomously.
- Three-agent architecture: planner, executor, verifier.
- Experience Library (vector database) stores and retrieves past conflict resolutions.
- Best configuration resolved 119/120 imminent conflict scenarios, including 4-aircraft simultaneous conflicts.
- Provides human-level text explanations of reasoning -- directly addresses the transparency requirement for safety certification.
Proposed airside operator interface capabilities:
- Voice/text commands: "Clear GSE Unit 7 to proceed to Gate B12 via Taxilane Alpha"
- Status queries: "What is the current status of the pushback at Gate C4?"
- Exception handling: "Divert all ground traffic from Taxiway Bravo -- FOD reported"
- The LLM interprets intent, validates against current operational state, and either executes or explains why the action cannot be performed
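A minimal sketch of this interpret-validate-execute loop follows. The command grammar and operational-state checks are invented for illustration; a production system would use an LLM for intent classification and slot filling rather than a regex.

```python
# An illustrative interpret -> validate -> execute loop for operator
# commands (regex stands in for LLM intent classification).
import re

STATE = {"Taxilane Alpha": "open", "Taxiway Bravo": "closed_FOD"}  # assumed state

def parse(command: str):
    m = re.match(r"Clear (?P<unit>.+?) to proceed to (?P<dest>Gate \w+)"
                 r" via (?P<route>\w+ \w+)$", command)
    return m.groupdict() if m else None

def handle(command: str) -> str:
    slots = parse(command)
    if slots is None:
        return "Could not interpret command; please rephrase."
    status = STATE.get(slots["route"], "unknown")
    if status != "open":
        # Explain why the action cannot be performed, per the design above.
        return f"Cannot execute: {slots['route']} is {status}."
    return f"Executing: {slots['unit']} -> {slots['dest']} via {slots['route']}"

print(handle("Clear GSE Unit 7 to proceed to Gate B12 via Taxilane Alpha"))
print(handle("Clear Tug 2 to proceed to Gate C4 via Taxiway Bravo"))
```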
5.5 Explainable Decisions for Safety Cases
LLM-based explainability is a critical enabler for airside AV certification:
Why it matters for airside:
- Airside operations require safety cases compliant with ICAO and national regulations
- Regulators need to understand why an autonomous vehicle made each decision
- Post-incident investigation requires traceable decision logs
- Operators need real-time visibility into AV reasoning
Transferable approaches from driving research:
| Approach | Source | Airside Application |
|---|---|---|
| Natural-language decision rationales | DriveGPT4, Senna-VLM, DriveLLM-V | "I stopped because aircraft N12345 is performing pushback across my planned path" |
| Chain-of-thought reasoning logs | Agent-Driver, DiLu | Complete traceable reasoning chain for each decision |
| Regulation-linked justification | DriveReg (RAG) | "Per FAA AC 150/5210-20, Section 7: vehicles must yield to aircraft at all times" |
| Reflection and error analysis | DiLu Reflection Module | Post-incident: "The collision occurred because the system did not detect the baggage cart behind the aircraft nose" |
DriveLLM-V's VICS (Vehicle Intention-Based Control Signals) are particularly relevant: they embed acceleration/deceleration and steering commands together with the reason for each action in plain language, enabling human-readable diagnostics.
5.6 Integration with Ground Control Instructions
Architecture for LLM-integrated airside AV operations:
```
              Ground Control / HALO
                       |
          [Natural Language Instructions]
                       |
               LLM Reasoning Layer
              /        |         \
   RAG:            Scene             Memory:
   ICAO/FAA        Understanding     Past Operations
   Regulations     (Camera+LiDAR)    & Experiences
              \        |         /
                Decision Engine
                       |
             [Validated Action Plan]
                       |
               MPC / Path Planner
                       |
             Autonomous GSE Vehicle
```
Key integration points:
- ATC/Ground Control communication: LLM parses natural language or data-link instructions, extracts intent (taxi route, hold short, give way), and validates against current operational state.
- Conflict detection: Similar to the Delft ATC agent, the LLM detects potential conflicts between multiple ground vehicles and aircraft, proposing deconfliction strategies.
- Dynamic replanning: When operational conditions change (gate reassignment, weather, emergency), the LLM re-reasons about the plan using updated context from RAG-retrieved procedures.
- Multi-agent coordination: Drawing from LanguageMPC and Agent-Driver patterns, multiple autonomous GSE vehicles coordinate through shared LLM reasoning to avoid conflicts and optimize turnaround operations.
5.7 Active Industry Players in Autonomous Airside Operations
| Company | Technology | Status |
|---|---|---|
| Moonware | AI-powered Ground Traffic Control (HALO) | Operational at US hubs, testing at Tokyo |
| EVIE Autonomous | Autonomous aircraft tugs and luggage pods | Development/testing phase |
| Royal Schiphol Group airside AV stack | Autonomous baggage/cargo tug and airside autonomy simulator | Approved for wider use across 60+ airports |
| UVU / Academic | Autonomous electric aircraft tug | Prototype demonstrated at Provo Airport |
| TaxiBot / WheelTug | Semi-autonomous aircraft taxiing | Various stages of deployment |
6. Taxonomy and Landscape Summary
6.1 How LLMs Are Used in AV Systems
Based on the LLM4AD survey taxonomy:
```
LLM Applications in Autonomous Driving
|
+-- Perception Enhancement
|   +-- Scene description and narration
|   +-- Visual question answering (NuScenes-QA, DriveLM)
|   +-- 3D spatial reasoning (Talk2BEV, BEVLM)
|
+-- Planning & Decision-Making
|   +-- High-level decision making (LanguageMPC, DiLu)
|   +-- Trajectory generation (GPT-Driver, Agent-Driver)
|   +-- End-to-end driving (DriveGPT4, Senna, AutoVLA)
|   +-- Rule compliance (DriveReg, KnowVal)
|
+-- Control Integration
|   +-- MPC parameterization (LanguageMPC)
|   +-- Behavioral planning alignment (DriveMLM)
|   +-- Vision-Language-Action models (LINGO-2, AutoVLA)
|
+-- Simulation & Testing
|   +-- Scene generation (ChatSim)
|   +-- Traffic scenario generation (LCTGen)
|   +-- Corner case generation (AutoScenario)
|   +-- Agent behavior modeling (SurrealDriver, LimSim++)
|   +-- World models (GAIA-1)
|
+-- Explainability & Safety
    +-- Decision rationales (DriveGPT4, Senna-VLM)
    +-- Regulation compliance (DriveReg)
    +-- Safety diagnostics (DriveLLM-V)
```
6.2 Key Learning Methods
| Method | Examples | Pros | Cons |
|---|---|---|---|
| Zero/Few-shot Prompting | DiLu, Drive Like a Human | No training needed; immediate deployment | Lower precision; inconsistent |
| Supervised Fine-Tuning | DriveGPT4, DriveMLM | Domain-specific precision | Requires labeled data; may overfit |
| LoRA/QLoRA | Senna, various | Parameter-efficient adaptation | Limited capacity vs. full fine-tuning |
| RAG | DriveReg, KnowVal | Dynamic knowledge; no retraining | Retrieval latency; corpus maintenance |
| Reinforcement Learning (GRPO) | AutoVLA | Optimizes for driving metrics directly | Complex training; reward design |
| Reflection/Self-Improvement | DiLu, LimSim++ | Continuous improvement; error correction | Requires evaluation oracle |
7. Key Challenges and Open Problems
7.1 Latency
- Most LLM-based planners require 1-2+ seconds per decision.
- Driving demands reactive control at 10-20 Hz.
- Mitigation strategies: Dual fast/slow thinking (AutoVLA), LLM for high-level decisions only (LanguageMPC, LLM-Assist), on-device inference with smaller models, asynchronous reasoning.
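The dual-rate pattern behind several of these mitigations can be sketched in a few lines; the timings and shared-state interface below are illustrative only.

```python
# An illustrative dual-rate loop: a slow (~1 Hz) LLM thread refreshes a
# cached high-level decision while the 20 Hz control loop never blocks.
import threading
import time

decision = {"behavior": "keep_lane"}           # shared state, LLM-written

def slow_llm_loop():
    """Stand-in for slow LLM reasoning refreshing the decision."""
    for behavior in ["keep_lane", "prepare_stop", "stop"]:
        time.sleep(1.0)                        # simulated LLM latency
        decision["behavior"] = behavior

def fast_control_loop(steps: int = 60):
    """20 Hz control that always reuses the latest cached decision."""
    for _ in range(steps):
        time.sleep(0.05)                       # 50 ms control tick
    print("behavior used by controller:", decision["behavior"])

threading.Thread(target=slow_llm_loop, daemon=True).start()
fast_control_loop()
```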
7.2 Hallucination
- LLMs may confidently describe objects or scenarios that do not exist.
- In safety-critical driving, a hallucinated "clear road" when a pedestrian is present could be fatal.
- Mitigation: Grounding in perception outputs (Agent-Driver tool library), self-reflection against occupancy maps, confidence calibration, multi-model verification.
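At its simplest, grounding reduces to a membership check against perception outputs before a plan is trusted. A minimal sketch, with assumed message formats:

```python
# An illustrative grounding check: every object the LLM references must
# match a detector output, else the plan is flagged for review.
DETECTIONS = {"vehicle_12", "pedestrian_3"}    # from the perception stack

def grounded(llm_referenced_objects: set):
    """Return (accepted, unverified references)."""
    hallucinated = llm_referenced_objects - DETECTIONS
    return (not hallucinated, hallucinated)

ok, bad = grounded({"vehicle_12", "traffic_cone_9"})
print("plan accepted" if ok else f"plan rejected, unverified refs: {bad}")
```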
7.3 Safety Verification
- No established framework for formally verifying LLM-based decisions.
- Stochastic outputs make traditional safety analysis (FMEA, SOTIF) challenging.
- Opportunity for airside: Lower speeds, more controlled environment, and smaller operational design domain may make formal verification more tractable than road driving.
7.4 Deployment Constraints
- Cloud-based LLMs (GPT-4) have connectivity and latency issues.
- Edge deployment of large models requires significant compute.
- Trend: Smaller, distilled models (7B-13B parameters) fine-tuned for driving achieve competitive performance while being deployable on-vehicle.
7.5 Evaluation Standardization
- Open-loop vs. closed-loop evaluation produces very different conclusions.
- L2 displacement error doesn't capture safety-critical behavior.
- Emerging benchmarks: Bench2Drive, LaMPilot-Bench, CARLA Leaderboard, Waymo Challenge.
7.6 Airside-Specific Challenges
- No existing datasets for airside autonomous driving.
- Airport ground markings, signage, and operational procedures differ significantly from road driving.
- Multi-agent interactions (aircraft, vehicles, personnel) are more heterogeneous than road traffic.
- Communication modality (radio/data-link) differs from road driving assumptions.
- Regulatory frameworks (ICAO, FAA) are evolving and do not yet accommodate autonomous systems.
Sources
LLMs as AV Planners/Reasoners
- GPT-Driver: Learning to Drive with GPT
- LanguageMPC: Large Language Models as Decision Makers for Autonomous Driving
- LLM-Assist: Enhancing Closed-Loop Planning with Language-Based Reasoning
- DriveGPT4: Interpretable End-to-end Autonomous Driving via Large Language Model
- DiLu: A Knowledge-Driven Approach to Autonomous Driving with Large Language Models
- Agent-Driver: A Language Agent for Autonomous Driving
- DriveMLM: Aligning Multi-Modal Large Language Models with Behavioral Planning States
- LimSim++: A Closed-Loop Platform for Deploying Multimodal LLMs
- SurrealDriver: Designing LLM-powered Generative Driver Agent Framework
- DriveGPT: Scaling Autoregressive Behavior Models for Driving (Waymo)
- Drive Like a Human: Rethinking Autonomous Driving with Large Language Models
- Senna: Bridging Large Vision-Language Models and End-to-End Autonomous Driving
- AutoVLA: Vision-Language-Action Model for End-to-End Autonomous Driving
- KnowVal: A Knowledge-Augmented and Value-Guided Autonomous Driving System
- DriveGPT4-V2: Enhanced Closed-Loop Autonomous Driving (CVPR 2025)
Chain-of-Thought and Reasoning
- Driving with Regulation: Retrieval-Augmented Reasoning via LLM
- Driving Everywhere with Large Language Model Policy Adaptation (CVPR 2024)
- Visualization-of-Thought Elicits Spatial Reasoning (NeurIPS 2024)
- Talk2BEV: Language-Enhanced Bird's Eye View Maps
- BEVLM: Distilling Semantic Knowledge from LLMs into BEV
Multimodal LLMs for Driving
- On the Road with GPT-4V(ision) (ICLR 2024)
- GPT-4V Explorations: Mining Autonomous Driving
- DriveLM: Driving with Graph Visual Question Answering (ECCV 2024)
- NuScenes-QA: Multi-modal VQA Benchmark (AAAI 2024)
- Wayve LINGO-1
- Wayve LINGO-2: Driving with Natural Language
- DriveLLM-V: Explainable End-to-End Autonomous Driving
- Waymo Demonstrably Safe AI for Autonomous Driving
Traffic Simulation
- ChatSim: Editable Scene Simulation via LLM-Agent Collaboration (CVPR 2024)
- Language Conditioned Traffic Generation (CoRL 2023)
- GAIA-1: A Generative World Model for Autonomous Driving
- AutoScenario: Realistic Corner Case Generation with Multimodal LLM
- AnchorDrive: LLM Scenario Rollout with Anchor-Guided Diffusion
- Promptable Closed-loop Traffic Simulation
Airside Operations and ATC
- FAA: Autonomous Ground Vehicle Systems on Airports
- FAA CertAlert 24-02: AGVS Technology on Airports
- NASA: NLU Approach for Digitizing Aircraft Ground Taxi Instructions (AIAA 2024)
- Automatic Control With Human-Like Reasoning: LLM Embodied Air Traffic Agents
- Moonware: AI-Powered Ground Traffic Control
- EVIE Autonomous Air Side Systems