Motion Prediction, Trajectory Forecasting, and World Models for Autonomous Driving
Comprehensive Technical Report
1. State-of-the-Art Motion Prediction Models
1.1 MotionLM (Waymo, ICCV 2023)
MotionLM reformulates multi-agent motion prediction as a language modeling task by representing continuous trajectories as sequences of discrete motion tokens. This reconceptualization enables the direct application of autoregressive sequence modeling techniques to trajectory forecasting.
Key Technical Innovations:
- Discrete Tokenization of Trajectories: Continuous motion data is converted into a vocabulary of discrete motion tokens, creating a "language" of motion
- Single Unified Objective: Uses a standard language modeling loss (maximizing average log probability over sequence tokens), eliminating the need for anchor-based methods or explicit latent variable optimization
- Autoregressive Joint Decoding: Produces joint distributions over interactive agent futures in a single autoregressive decoding process, removing the need for post-hoc interaction heuristics
- Temporal Causality: Sequential factorization naturally enables temporally causal conditional rollouts, meaning predictions respect the arrow of time
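The tokenization idea can be sketched with a uniform quantizer over per-step displacements. The bin count, delta range, and round-trip helpers below are illustrative assumptions, not MotionLM's actual vocabulary:

```python
import numpy as np

# Illustrative uniform quantizer for per-step (dx, dy) displacements, in the
# spirit of MotionLM's motion tokens. Bin count and delta range are assumed.
NUM_BINS = 13        # bins per axis -> vocabulary of 13 * 13 = 169 tokens
DELTA_RANGE = 4.0    # assumed max |dx|, |dy| per step, in meters

def tokenize_deltas(traj):
    """Map a (T, 2) trajectory to T-1 discrete motion tokens."""
    deltas = np.diff(np.asarray(traj, dtype=float), axis=0)
    clipped = np.clip(deltas, -DELTA_RANGE, DELTA_RANGE)
    bins = np.round((clipped + DELTA_RANGE) / (2 * DELTA_RANGE)
                    * (NUM_BINS - 1)).astype(int)
    return bins[:, 0] * NUM_BINS + bins[:, 1]    # flatten 2D bins to one vocab

def detokenize(tokens, start):
    """Invert tokenize_deltas, up to quantization error."""
    bx, by = np.divmod(np.asarray(tokens), NUM_BINS)
    deltas = (np.stack([bx, by], axis=1) / (NUM_BINS - 1)
              * (2 * DELTA_RANGE) - DELTA_RANGE)
    return np.vstack([start, start + np.cumsum(deltas, axis=0)])

traj = np.array([[0.0, 0.0], [1.0, 0.5], [2.1, 1.0], [3.0, 1.6]])
tokens = tokenize_deltas(traj)          # discrete "words" describing the motion
recon = detokenize(tokens, traj[0])     # decoded trajectory, close to the input
```

An autoregressive transformer then predicts these token sequences exactly as a language model predicts words.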
Results: Ranked 1st on the Waymo Open Motion Dataset interactive challenge leaderboard. The language modeling framing is significant because it connects motion prediction to the massive body of work on large language models, suggesting that scaling laws and architectural innovations from NLP may transfer to motion prediction.
1.2 MTR / MTR++ (Motion Transformer, NeurIPS 2022 Oral / TPAMI 2024)
MTR (Motion Transformer) reformulates motion prediction as a two-stage optimization: global intention localization followed by local movement refinement.
Key Technical Innovations:
- Learnable Motion Query Pairs: Instead of dense goal candidate grids, MTR uses a small set of learnable motion query pairs where each pair is responsible for trajectory prediction and refinement for a specific motion mode. This stabilizes training and improves multimodal coverage
- Global Intention Localization: First identifies the high-level "intention" or destination region of each agent
- Local Movement Refinement: Then refines the detailed trajectory within that intention region
- Spatial Intention Priors: The query mechanism provides spatial priors that guide the model toward diverse but realistic motion modes
MTR++ extends MTR to simultaneous multi-agent prediction with two key additions:
- Symmetric Context Modeling: Processes scene information in a balanced way across all agents, ensuring fair representation regardless of agent position
- Mutually-Guided Intention Querying: Agents' predicted future behaviors inform each other, creating scene-compliant trajectories that respect interactive dynamics
Results: MTR ranked 1st on both marginal and joint Waymo Open Motion Dataset leaderboards (NeurIPS 2022). MTR++ was the winning approach for the Waymo Motion Prediction Challenge in 2022 and 2023.
1.3 QCNet / QCNeXt (CVPR 2023 Workshop Winner)
QCNet adopts a query-centric encoding paradigm for joint multi-agent trajectory forecasting, endowing the scene encoder with three critical mathematical properties:
- Permutation equivariance across agent sets
- Roto-translation invariance in spatial dimensions
- Translation invariance in temporal dimensions
These properties enable accurate multi-agent forecasting and support streaming processing -- the ability to incrementally update predictions as new observations arrive without re-encoding the entire scene.
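The roto-translation invariance property can be illustrated with a minimal local-frame transform; the helper and the check below are a sketch, not QCNet's actual encoder:

```python
import numpy as np

def to_local_frame(query_pos, query_heading, points):
    """Express `points` (N, 2) in the frame of an agent at `query_pos` with
    heading `query_heading` (radians). Features built from these coordinates
    are unchanged by any global rotation + translation of the scene."""
    c, s = np.cos(-query_heading), np.sin(-query_heading)
    R = np.array([[c, -s], [s, c]])
    return (np.asarray(points) - np.asarray(query_pos)) @ R.T

q, heading = np.array([1.0, 2.0]), 0.3
pts = np.array([[3.0, 1.0], [0.5, 4.0]])
local = to_local_frame(q, heading, pts)

# Invariance check: rotate and translate the whole scene; local coords match
theta, t = 1.1, np.array([5.0, -2.0])
Rg = np.array([[np.cos(theta), -np.sin(theta)], [np.sin(theta), np.cos(theta)]])
local_after = to_local_frame(Rg @ q + t, heading + theta, pts @ Rg.T + t)
```

Because every scene element is encoded relative to its own local frame, cached encodings remain valid as the ego vehicle moves, which is what makes streaming updates possible.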
Decoder Design: Employs a multi-agent DETR-like decoder that models agents' interactions at future time steps, facilitating joint prediction. The authors demonstrated that joint prediction models outperform marginal prediction models even when evaluated on marginal metrics -- an important finding that suggests interaction modeling is universally beneficial.
Results: 1st place on the Argoverse 2 multi-agent motion forecasting benchmark (CVPR 2023 Workshop on Autonomous Driving).
1.4 Wayformer (Waymo)
Wayformer explores a surprisingly simple yet effective approach: using homogeneous attention mechanisms across all input modalities rather than specialized modules for different input types.
Architecture:
- Unified attention-based scene encoder and decoder
- Processes road geometry, lane connectivity, traffic light states, and agent motion histories through the same attention mechanism
Three Fusion Strategies Investigated:
- Early Fusion: Combines all inputs at the initial stage -- simple and modality-agnostic
- Late Fusion: Processes modalities separately before integration
- Hierarchical Fusion: Structures fusion across multiple levels
Efficiency Strategies:
- Factorized attention (decomposing attention operations)
- Latent query attention (using learned queries for reduced computation)
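Latent query attention can be sketched as single-head cross-attention from a small set of learned queries to the scene tokens; the sizes and this NumPy implementation are illustrative, not Wayformer's:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def latent_query_attention(scene_tokens, latent_queries):
    """Single-head cross-attention: M learned queries attend over N scene
    tokens, compressing the scene to M vectors at O(M*N) cost instead of
    the O(N^2) cost of full self-attention."""
    d = scene_tokens.shape[-1]
    scores = latent_queries @ scene_tokens.T / np.sqrt(d)   # (M, N)
    return softmax(scores, axis=-1) @ scene_tokens          # (M, d)

rng = np.random.default_rng(0)
tokens = rng.normal(size=(500, 32))    # N=500 scene elements (assumed)
queries = rng.normal(size=(16, 32))    # M=16 learned latent queries (assumed)
compressed = latent_query_attention(tokens, queries)
```

Downstream layers then operate on the 16 compressed vectors rather than all 500 scene tokens.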
Key Finding: Early fusion proved surprisingly effective despite its simplicity, achieving state-of-the-art results on both WOMD and Argoverse leaderboards. This challenges the assumption that modality-specific processing is necessary.
1.5 HiVT (Hierarchical Vector Transformer, CVPR 2022)
HiVT decomposes motion prediction into two hierarchical layers:
- Local Context Extraction: Captures fine-grained interactions in an agent's immediate neighborhood
- Global Interaction Modeling: Captures long-range dependencies across the entire scene
Invariance Properties:
- Translation-Invariant Scene Representation: Maintains consistent predictions regardless of absolute agent position
- Rotation-Invariant Spatial Learning Modules: Extracts features robust to geometric transformations
These properties enable HiVT to make fast, accurate multi-agent predictions in a single forward pass with a remarkably small model size.
Results: State-of-the-art on Argoverse motion forecasting benchmark with high efficiency, making it a popular choice as a baseline and backbone in subsequent work.
1.6 LaneGCN (ECCV 2020 Oral)
LaneGCN pioneered the use of graph neural networks on lane-level map representations for motion prediction, replacing rasterized map inputs.
Key Design Choices:
- Lane Graph Construction: Builds a lane graph directly from vectorized map data, explicitly preserving road topology
- Multiple Adjacency Matrices: Captures diverse relationships within the lane network (predecessor, successor, left neighbor, right neighbor)
- Along-Lane Dilation: Addresses long-range dependencies across connected road segments
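The multi-adjacency design can be sketched on a toy lane graph; the node ids and connectivity below are invented for illustration:

```python
import numpy as np

# Toy lane graph: 4 lane nodes; ids and connectivity invented for illustration
N = 4
connections = {
    "successor":   [(0, 1), (1, 2)],   # driving direction: 0 -> 1 -> 2
    "predecessor": [(1, 0), (2, 1)],
    "left":        [(1, 3)],           # node 3 is node 1's left-neighbor lane
    "right":       [(3, 1)],
}

# One adjacency matrix per relation type, as in LaneGCN
adj = {rel: np.zeros((N, N)) for rel in connections}
for rel, edges in connections.items():
    for src, dst in edges:
        adj[rel][src, dst] = 1.0

# Along-lane dilation: powers of the successor matrix reach k hops ahead,
# capturing long-range dependencies along connected segments
dilated_2hop = np.linalg.matrix_power(adj["successor"], 2)
```

Graph convolutions then use a separate weight matrix per relation type, so the network distinguishes "the lane ahead" from "the lane to my left".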
Four-Way Interaction Fusion:
- Actor-to-Lane: How agents relate to nearby road segments
- Lane-to-Lane: How road segments relate to each other
- Lane-to-Actor: How road context influences agent behavior
- Actor-to-Actor: How agents influence each other
Legacy: LaneGCN established the paradigm of vectorized, graph-based map encoding that nearly all subsequent methods build upon. Its four-way interaction design remains influential.
1.7 SceneTransformer (ICLR 2022)
SceneTransformer introduced joint prediction of all agents simultaneously using a unified attention-based architecture.
Key Innovations:
- Masked Sequence Modeling: Inspired by language modeling, uses masking as a query mechanism. A single model can predict agent behavior in many ways -- marginally, jointly, or conditioned on specific agents' goals or full trajectories
- Three-Dimensional Attention: Combines attention across road elements, agent-to-agent interactions, and temporal sequences
- Scene-Centric Prediction: Rather than per-agent prediction, reasons about the entire scene holistically
- Agent Permutation Equivariance: Permuting the input agents permutes the output predictions correspondingly, so results do not depend on an arbitrary agent ordering
The masked approach elegantly unifies marginal prediction, joint prediction, and conditional prediction within a single framework. Conditioning on one agent's future trajectory to predict another's is achieved simply by unmasking the conditioning agent.
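The masking mechanism can be sketched as a visibility mask over an (agents x timesteps) grid; the helper below is an illustrative sketch, not SceneTransformer's actual interface:

```python
import numpy as np

def make_mask(num_agents, hist_steps, total_steps, condition_on=()):
    """1 = visible to the model, 0 = to be predicted. Every agent's history
    is visible; futures are hidden except for agents we condition on."""
    mask = np.zeros((num_agents, total_steps), dtype=int)
    mask[:, :hist_steps] = 1
    for a in condition_on:
        mask[a, :] = 1               # unmask this agent's full trajectory
    return mask

joint = make_mask(3, hist_steps=10, total_steps=30)        # predict all futures
conditional = make_mask(3, 10, 30, condition_on=[0])       # given agent 0's plan
```

The same trained model serves both queries; only the mask changes between joint and conditional prediction.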
1.8 MultiPath++ (Waymo)
MultiPath++ advances the original MultiPath architecture with several key innovations:
- Sparse Scene Encoding: Replaces dense image-based encoding with compact polylines for road features and raw agent state vectors (position, velocity, acceleration)
- Multi-Context Gating Fusion: A context-aware fusion component that effectively combines heterogeneous scene elements
- Learned Latent Anchor Embeddings: End-to-end learned anchors replace static pre-defined trajectory clusters, enabling more flexible trajectory representation
- Ensembling and Aggregation: Explores ensemble techniques for probabilistic multimodal output
Results: State-of-the-art on both Argoverse and Waymo Open Dataset.
1.9 GoRela (Goal-Relative Prediction)
GoRela addresses a fundamental efficiency vs. accuracy tradeoff in multi-agent prediction:
- Encoding each agent in its own reference frame is computationally expensive
- Using a shared coordinate system is sample-inefficient
Solution: Leverages pair-wise relative positional encodings to represent geometric relationships between agents and map elements within a heterogeneous spatial graph. This achieves viewpoint invariance while allowing pre-computed map embeddings to be reused offline.
Goal Prediction: Predicts agent goals on the lane graph to enable diverse and context-aware multimodal prediction. Separates the problem into offline map processing and online agent reasoning.
1.10 MotionDiffuser (CVPR 2023 Highlight)
MotionDiffuser applies diffusion models to multi-agent motion prediction, representing a fundamentally different approach from discriminative methods.
Key Innovations:
- Diffusion-Based Generation: Learns multimodal distributions of future trajectories through iterative denoising
- Single L2 Loss Objective: Trains with a simple L2 denoising loss and does not require trajectory anchors
- PCA-Based Trajectory Compression: Enhances performance and enables efficient exact log probability calculations
- Constrained Sampling Framework: Enables controllable trajectory generation using differentiable cost functions -- enforcing physical constraints, rule compliance, or creating tailored simulation scenarios
- Permutation-Invariant Joint Prediction: Naturally models joint distributions across multiple agents
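The PCA compression idea can be sketched on synthetic trajectories; the dataset, sizes, and the 10-component budget below are assumptions for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
# Synthetic set of 200 smooth 16-step (x, y) trajectories; sizes and the
# 10-component budget are illustrative, not MotionDiffuser's settings
t = np.linspace(0.0, 1.0, 16)
trajs = np.stack([np.stack([a * t + b * t**2, c * t], axis=1)
                  for a, b, c in rng.normal(size=(200, 3))])   # (200, 16, 2)
X = trajs.reshape(200, -1)                                     # flatten to 32-d

mean = X.mean(axis=0)
_, _, Vt = np.linalg.svd(X - mean, full_matrices=False)
basis = Vt[:10]                                                # top 10 components

codes = (X - mean) @ basis.T                                   # compressed representation
recon = codes @ basis + mean                                   # decompressed trajectories
err = np.abs(recon - X).max()                                  # near-zero for smooth data
```

Because real trajectories are smooth and highly correlated across time, a few principal components capture nearly all the variance, and diffusion can operate in the much smaller code space.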
Results: State-of-the-art on Waymo Open Motion Dataset for multi-agent prediction. The constrained sampling capability is particularly valuable for generating safety-critical test scenarios.
1.11 Notable 2024-2025 Models
GenAD (Generalized Predictive Model, CVPR 2024 Highlight):
- First large-scale video prediction model for autonomous driving
- Trained on 2000+ hours of diverse web-collected driving videos with text descriptions
- Built on latent diffusion with novel temporal reasoning blocks
- Zero-shot generalization to unseen driving datasets
- Adaptable as action-conditioned predictor or motion planner
ViDAR (Visual Point Cloud Forecasting, 2024):
- Pre-training approach that predicts future LiDAR point clouds from historical visual input
- Latent Rendering operator transforms visual embeddings into 3D geometric space
- Improvements: 3.1% NDS in 3D detection, ~10% error reduction in motion forecasting, ~15% collision rate reduction in planning
Vista (NeurIPS 2024):
- Generalizable driving world model with high-fidelity future prediction
- Novel losses for learning moving instances and structural information
- Latent replacement approach for coherent long-horizon rollouts
- Supports high-level (goal points, intentions) and low-level (trajectory, angle, speed) control simultaneously
- 55% FID and 27% FVD improvements over prior driving world models
2. World Models as Prediction Engines
2.1 How World Models Naturally Solve Prediction
A world model learns a compressed representation of environment dynamics and uses it to simulate possible futures. Unlike discriminative predictors that directly map observations to predicted trajectories, world models learn the underlying generative process of how scenes evolve.
The Prediction-Generation Connection:
- Discriminative models answer: "Given the past, what trajectories are likely?"
- World models answer: "Given the past and possible actions, what will the world look like?"
This shift is profound because:
- Prediction becomes a byproduct of understanding. A model that truly understands how the world works can naturally predict what will happen next
- Action-conditioned prediction is native. World models naturally answer "what if?" questions -- what happens if I accelerate, brake, or change lanes?
- Multimodal futures emerge naturally from the stochastic generation process
2.2 Key World Models for Autonomous Driving
GAIA-1 (Wayve, 2023):
- Generative world model leveraging video, text, and action inputs
- Casts world modeling as unsupervised sequence modeling by mapping inputs to discrete tokens and predicting next tokens
- Demonstrates emergent understanding of geometry, scene dynamics, and contextual awareness
- The learned representation captures expectations of future events while generating realistic samples
Copilot4D (ICLR 2024):
- Developed by Waabi (Urtasun et al.); first tokenizes sensor observations with a VQVAE, then predicts future observations via discrete diffusion
- Adapts an enhanced Masked Generative Image Transformer to act as the discrete diffusion model
- 65% reduction in Chamfer distance for 1-second point cloud prediction, and >50% for 3-second prediction
- Addresses two key bottlenecks: managing complex unstructured observation spaces and scalable generative modeling
OccWorld (2023):
- World model built on 3D occupancy representation rather than bounding boxes
- Two-stage: reconstruction-based tokenizer converts 3D occupancy to discrete tokens, then GPT-like transformer generates future tokens
- Simultaneously predicts ego car movement and surrounding scene evolution
- Captures fine-grained 3D structure more expressively than object boxes
- Works with both vision and LiDAR sensors, no instance-level or map supervision required
DriveDreamer (2023):
- First world model built entirely from real-world driving data (nuScenes)
- Uses diffusion models to handle complex driving scene modeling
- Two-stage training: first learns traffic structure constraints, then develops future state anticipation
- Enables generation of realistic driving policies beyond just video synthesis
Drive-WM (2023):
- First driving world model compatible with existing end-to-end planning models
- Generates multiview video sequences through joint spatial-temporal modeling with view factorization
- Supports "driving into multiple futures based on distinct driving maneuvers"
- Demonstrates safe planning via image-based reward computation
UniSim (CVPR 2023 Highlight):
- Neural closed-loop sensor simulator converting single recorded logs into realistic multi-sensor simulations
- Neural feature grids reconstruct static backgrounds and dynamic actors
- Composition mechanisms synthesize LiDAR and camera data at novel viewpoints
- Enables closed-loop evaluation on safety-critical scenarios "as if it were in the real world"
2.3 Joint Prediction of All Scene Elements vs. Per-Agent Prediction
Traditional motion prediction operates per-agent: for each agent of interest, encode the scene from that agent's perspective and decode its future trajectory. This has several limitations:
| Aspect | Per-Agent Prediction | Joint Scene Prediction |
|---|---|---|
| Consistency | No guarantee that predicted trajectories are mutually consistent | All predictions are generated together, ensuring consistency |
| Interactions | Interactions must be modeled explicitly or post-hoc | Interactions emerge naturally from joint generation |
| Scalability | Linear in number of agents (but repeated computation) | Amortized: all agents predicted simultaneously |
| Collision avoidance | Requires post-processing to avoid predicted collisions | Joint generation naturally avoids implausible configurations |
| Compute | Redundant scene encoding per agent | Single encoding shared across all agents |
World models inherently perform joint prediction -- they simulate the entire scene forward in time, producing consistent futures for all elements simultaneously.
2.4 Occupancy Flow Prediction vs. Trajectory Prediction
The Waymo Occupancy and Flow Prediction Challenge introduced an alternative representation for prediction:
Trajectory Prediction:
- Predicts a set of discrete future waypoints for each detected agent
- Outputs: (x, y) coordinates at future timestamps
- Requires object detection and tracking as prerequisites
- Cannot represent undetected or occluded agents
- Crisp but limited to tracked objects
Occupancy Flow Prediction:
- Predicts dense 256x256 bird's-eye-view grids showing where space will be occupied
- Occupancy grids: probability of each cell being occupied at each future timestep
- Flow fields: 2D displacement vectors showing how occupied regions move between timesteps
- Three sub-tasks: currently observed vehicles, currently occluded vehicles, and future flow
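A minimal occupancy-plus-flow sketch on a toy grid; the grid size and the simple forward-warp convention here are illustrative simplifications of the challenge format:

```python
import numpy as np

# Toy occupancy + flow on an 8x8 grid (all values invented)
H = W = 8
occ = np.zeros((H, W))
occ[3, 2] = 1.0                        # one occupied cell

flow = np.zeros((H, W, 2))
flow[3, 2] = [0, 3]                    # this region moves 3 columns per step

def warp(occ, flow):
    """Advance occupancy one step by applying each cell's displacement."""
    out = np.zeros_like(occ)
    for r, c in zip(*np.nonzero(occ)):
        dr, dc = flow[r, c].astype(int)
        nr, nc = r + dr, c + dc
        if 0 <= nr < occ.shape[0] and 0 <= nc < occ.shape[1]:
            out[nr, nc] += occ[r, c]
    return out

next_occ = warp(occ, flow)             # occupancy at the next timestep
```

Note that no agent identity appears anywhere: the representation tracks where space is occupied and how it moves, not which tracked object occupies it.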
Advantages of Occupancy Flow:
- No tracking required: Can predict space occupancy without maintaining identity-consistent tracks
- Handles occlusions natively: Can predict where currently unseen agents will appear
- Represents uncertainty spatially: Soft occupancy naturally encodes spatial uncertainty
- Aligns with planning: Ego vehicle planning often uses occupancy grids; prediction in the same representation eliminates format conversion
- Captures non-agent dynamics: Can represent space occupied by debris, construction zones, or other non-standard obstacles
World models that predict future 3D occupancy (like OccWorld) naturally produce this representation, bridging the gap between prediction and world modeling.
2.5 Scene-Level vs. Agent-Level Prediction
Agent-Level Prediction (Traditional):
- Input: one target agent + scene context
- Output: K possible trajectories for that agent
- Must be run N times for N agents
- Each prediction is independent unless explicitly coupled
Scene-Level Prediction (World Models):
- Input: entire current scene state
- Output: possible future scenes (all agents, all objects, road surface, etc.)
- One forward pass predicts everything
- Consistency is inherent
Scene-level prediction is more aligned with how planning systems consume predictions: a planner needs to understand the complete future scene, not individual agent trajectories in isolation.
2.6 Advantages of Generative Prediction
Multimodal Futures: Generative models (diffusion, autoregressive, VAE) naturally produce diverse samples from learned distributions. Each generated future is a complete, coherent scenario rather than an isolated trajectory.
Compositionality: Generative world models can combine learned dynamics with novel constraints at inference time (e.g., MotionDiffuser's constrained sampling).
Scalable Data Utilization: World models can learn from unlabeled video (no annotation of trajectories needed), enabling training on orders of magnitude more data.
Simulation and Planning: Generated futures can be used directly for planning -- evaluating candidate actions by "imagining" their consequences through the world model.
3. Interaction-Aware Prediction
3.1 Modeling Interactions Between Agents
Interaction modeling is perhaps the most critical and challenging aspect of motion prediction. Agents do not move independently -- their behaviors are coupled through physical constraints, social conventions, and strategic reasoning.
Approaches to Interaction Modeling:
Graph Neural Networks (GNN-Based):
- Represent agents as nodes and interactions as edges
- LaneGCN's actor-to-actor attention
- EvolveGraph: dynamic relational reasoning where interaction graphs evolve over time
- Strengths: explicit relational structure, interpretable interactions
- Weaknesses: requires defining graph connectivity heuristics
Attention-Based Interaction:
- Self-attention across agents captures pairwise relationships
- HiVT's local-global hierarchy: local attention for nearby agents, global attention for long-range interactions
- Wayformer: homogeneous attention across all scene elements including agents
- SceneTransformer: three-dimensional attention (spatial, agent, temporal)
- Strengths: learnable, flexible, handles varying numbers of agents
- Weaknesses: quadratic complexity in number of agents
Social Force Models (Classical):
- Model agents as particles subject to attractive/repulsive forces
- Forces include: goal attraction, collision avoidance repulsion, lane-keeping forces
- Social Force Model (Helbing & Molnar, 1995) remains foundational for pedestrian modeling
- Modernized versions learn force parameters from data
- Strengths: physically interpretable, computationally efficient
- Weaknesses: limited expressiveness, struggles with complex strategic behavior
3.2 Game-Theoretic Prediction
Game-theoretic approaches model agents as rational (or boundedly rational) players who strategically optimize their behavior while accounting for others' decisions.
GameFormer (ICCV 2023):
- Models prediction and planning as a multi-agent game
- Hierarchical Transformer Decoder: Implements level-k reasoning through iterative refinement
- Level 0: Agents predict independently (non-strategic baseline)
- Level 1: Each agent responds to others' Level-0 predictions
- Level k: Each agent responds to others' Level-(k-1) predictions
- At each level, the decoder uses prediction outcomes from the previous level plus shared environmental context to refine the interaction process
- An agent's behavior at the current level is regulated to respond to other agents' behaviors from the preceding level
- State-of-the-art on Waymo interaction prediction task and nuPlan planning benchmark
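The level-k idea can be illustrated with a toy two-agent merge game (payoffs entirely invented): the naive level-0 plans conflict, and one round of best responses to the level-0 plans de-conflicts them.

```python
# Toy two-agent merge game with invented payoffs. Speeds are abstract
# action choices; matching the other agent's speed at the merge point
# is treated as a collision and heavily penalized.
speeds = [0, 1, 2]

def payoff_through(s, other):
    # Through-lane agent: faster is strictly better, barring a collision
    return s - (10.0 if s == other else 0.0)

def payoff_merge(s, other):
    # Merging agent: prefers 2, then 0 (stop short), then 1 (invented)
    base = {2: 1.0, 0: 0.5, 1: 0.0}[s]
    return base - (10.0 if s == other else 0.0)

def best_response(payoff, other):
    return max(speeds, key=lambda s: payoff(s, other))

# Level 0: non-strategic plans (other=None, so the collision term never fires)
a0 = best_response(payoff_through, None)   # 2
b0 = best_response(payoff_merge, None)     # 2 -- the naive plans collide
# Level 1: each agent best-responds to the other's level-0 plan
a1 = best_response(payoff_through, b0)     # 1: slows behind the merger
b1 = best_response(payoff_merge, a0)       # 0: stops short instead
```

GameFormer replaces these hand-written best responses with learned decoder layers, one per reasoning level, but the refinement structure is the same.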
Game-Theoretic Advantages:
- Captures strategic behavior: Agents in traffic are strategic -- they signal, yield, or assert right-of-way based on expectations of others' responses
- Handles negotiation: Merging, lane changing, and intersection crossing involve implicit negotiation
- Avoids "frozen robot" problem: By modeling others as rational responders, the ego vehicle avoids overly conservative predictions
3.3 Conditional Prediction
Conditional prediction answers: "If Agent A does X, how will Agent B respond?"
M2I (CVPR 2022):
- Decomposes interactive prediction into influencer-reactor pairs
- Influencers' trajectories are predicted independently (marginal prediction)
- Reactors' trajectories are predicted conditioned on influencer predictions (conditional prediction)
- Combined via joint likelihood scoring
- Insight: many interactions have asymmetric causal structure (e.g., a vehicle entering a roundabout reacts to existing traffic)
SceneTransformer's Masking Approach:
- By selectively masking/unmasking agent futures, can condition on any subset of agents
- Provides a unified framework for marginal, joint, and conditional prediction
- Particularly useful for planning: "What will others do if I follow this plan?"
3.4 Joint Prediction of Interacting Agents
FJMP (CVPR 2023):
- Represents agent interactions as a sparse directed interaction graph
- Prunes the interaction graph into a Directed Acyclic Graph (DAG), establishing partial ordering among agents based on causal relationships
- Decomposes joint prediction into a sequence of marginal and conditional predictions following the DAG's topological order
- Uses a Directed Acyclic Graph Neural Network (DAGNN) for decoding
- Ranked 1st on INTERACTION dataset multi-agent leaderboard
- Produces more scene-consistent predictions than non-factorized approaches
MotionLM's Autoregressive Joint Decoding:
- Generates joint agent futures through a single autoregressive process
- Each token generation step is conditioned on all previously generated tokens (including other agents)
- Natural temporal causality ensures realistic interaction dynamics
MTR++'s Mutually-Guided Intention Querying:
- Agents' intention queries inform each other through cross-attention
- Creates bidirectional information flow between interacting agents
- Avoids the assumption of fixed influencer-reactor roles
4. Long-Horizon Prediction
4.1 The Horizon Challenge
Motion prediction quality degrades dramatically with prediction horizon:
- 1-3 seconds: Kinematics-dominated; trajectory is largely determined by current velocity, acceleration, and road geometry
- 3-8 seconds: Intent-dominated; where the agent is going (lane change, turn, stop) matters more than kinematics
- 8-30 seconds: Goal-dominated; the agent's high-level destination and route drive the trajectory
- >30 seconds: Route/schedule-dominated; requires knowledge of the agent's plan, not just current observations
Most state-of-the-art models predict 6-8 seconds ahead (the Waymo challenge uses 8-second horizons). Beyond this, uncertainty grows rapidly and traditional discriminative models degrade sharply.
4.2 Goal-Conditioned Prediction
Goal-conditioned prediction explicitly separates where an agent is going from how it gets there:
- Goal Prediction: Estimate the distribution over possible destinations (typically on the lane graph)
- Path Prediction: For each candidate goal, predict the trajectory to reach it
- Scoring: Rank goal-trajectory pairs by likelihood
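The three steps above can be sketched end to end; the goal set, the heading-based scorer, and the straight-line completion are illustrative stand-ins for learned components:

```python
import numpy as np

agent_pos = np.array([0.0, 0.0])
agent_vel = np.array([10.0, 0.0])     # heading along +x

# Candidate goals, e.g. sampled from a lane graph (positions invented)
goals = np.array([[40.0, 0.0], [35.0, 15.0], [-20.0, 5.0]])

def goal_logits(goals, pos, vel):
    """Toy scorer: favor goals roughly ahead of the agent's heading."""
    to_goal = goals - pos
    return to_goal @ vel / (np.linalg.norm(to_goal, axis=1) + 1e-6)

def complete_trajectory(pos, goal, steps=8):
    """Stand-in decoder: straight-line path to the goal."""
    alphas = np.linspace(0.0, 1.0, steps + 1)[1:, None]
    return pos + alphas * (goal - pos)

probs = np.exp(goal_logits(goals, agent_pos, agent_vel))
probs /= probs.sum()                  # step 1: distribution over goals
ranked = np.argsort(-probs)           # step 3: rank goal-trajectory pairs
best_traj = complete_trajectory(agent_pos, goals[ranked[0]])  # step 2
```

Restricting goals to the lane graph is what keeps long-horizon predictions realistic: however far ahead the horizon, the endpoint stays on a drivable route.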
MTR's Intention Localization is a form of goal-conditioned prediction: learnable queries capture distinct destination modes, then local refinement generates the trajectory.
GoRela predicts goals on the lane graph, naturally decomposing prediction into goal selection and trajectory completion.
TNT (Target-driven Trajectory Prediction): Samples target endpoints from the lane graph, generates trajectory conditioned on each target, and scores the candidates.
4.3 Intent Prediction
Beyond geometric goals, intent prediction estimates what the agent is trying to do semantically:
- Is this vehicle going to change lanes?
- Is this pedestrian going to cross the street?
- Is this vehicle yielding or asserting right-of-way?
Intent prediction typically operates as a classification task upstream of trajectory prediction:
- Classify agent intent (lane change left, lane change right, keep lane, turn, stop, etc.)
- Condition trajectory prediction on the inferred intent
- Weight predictions by intent probability
This two-stage approach naturally supports long-horizon prediction because intent captures the high-level plan that determines long-term trajectory.
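The three-step pipeline can be sketched with invented intent probabilities and one representative mode per intent:

```python
import numpy as np

# Step 1: hypothetical classifier output over semantic intents (invented)
intents = ["keep_lane", "lane_change_left", "stop"]
intent_probs = np.array([0.7, 0.2, 0.1])

# Step 2: one 3-step (x, y) trajectory conditioned on each intent (invented)
modes = np.array([
    [[10, 0], [20, 0], [30, 0]],      # keep_lane
    [[10, 1], [20, 3], [30, 4]],      # lane_change_left
    [[ 8, 0], [12, 0], [13, 0]],      # stop
], dtype=float)

# Step 3: output the trajectory set weighted and ordered by intent probability
order = np.argsort(-intent_probs)
weighted = [(intents[i], float(intent_probs[i]), modes[i]) for i in order]
top_intent, top_prob, top_traj = weighted[0]
```
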
4.4 Route/Destination Prediction
For even longer horizons:
- Map matching: Identify which roads/lanes the agent is likely following
- Turn-by-turn route prediction: Estimate probable routes through the road network
- Destination prediction: Estimate where the agent is ultimately headed (using priors from time of day, agent type, historical patterns)
These are primarily used in fleet prediction, traffic simulation, and urban planning rather than real-time autonomous driving.
4.5 How World Models Enable Longer Horizons
World models have a structural advantage for long-horizon prediction:
Autoregressive Rollout:
- Discriminative models predict the full trajectory in a single shot, requiring the model to learn the mapping from observations to distant future states directly
- World models predict one step at a time and feed predictions back as input, allowing indefinite rollout
- Each step only needs to model short-term dynamics, which is easier to learn
- Vista's latent replacement approach explicitly addresses coherent long-horizon rollouts
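The rollout loop itself is simple; in the sketch below a toy linear one-step dynamics function stands in for a learned world model:

```python
import numpy as np

def one_step(state, action):
    """Toy linear dynamics: state = (position, velocity); action = acceleration."""
    x, v = state
    dt = 0.1
    return np.array([x + v * dt, v + action * dt])

def rollout(world_model, state, actions):
    """Feed each prediction back as the next input -- arbitrary horizon."""
    states = [state]
    for a in actions:
        states.append(world_model(states[-1], a))
    return np.array(states)

# 5 seconds of constant-velocity rollout at 5 m/s
traj = rollout(one_step, np.array([0.0, 5.0]), actions=[0.0] * 50)
```

The horizon is chosen at inference time by the length of the action sequence, not baked into the model, which is the structural advantage over single-shot prediction.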
Compounding Error vs. Compositional Understanding:
- Single-shot discriminative models avoid compounding errors but must learn increasingly complex mappings for longer horizons
- Autoregressive world models may compound errors but can compose simple short-horizon dynamics into complex long-horizon behaviors
- Modern world models (GAIA-1, Copilot4D) mitigate compounding error through learned priors and diffusion-based generation
Implicit Intent Modeling:
- World models that process video or occupancy sequences implicitly capture agent intent through the patterns they learn
- GAIA-1 demonstrates "contextual awareness" -- understanding that a car at a red light will wait and then proceed
Conditioning on Plans:
- World models naturally support conditioning on ego plans: "what happens if I follow this trajectory?"
- This enables model-predictive control with the world model as the forward dynamics model
- Vista demonstrates this with multi-level control signals (goal points, trajectory, speed)
5. Airport-Specific Prediction Challenges
Airport environments present a unique and underexplored domain for motion prediction. While sharing fundamental challenges with on-road autonomous driving, airport operations introduce domain-specific complexities that require adapted approaches.
5.1 Aircraft Movement Prediction
Pushback Prediction:
- Aircraft pushback from gates is a highly constrained, low-speed maneuver
- Prediction inputs: flight schedule data, gate assignment, pushback clearance timing
- Challenges: variable pushback durations, tug vehicle coordination, adjacent gate conflicts
- Unlike road vehicles, aircraft during pushback are externally controlled (by tow tractors) with very limited maneuverability
Taxi Trajectory Prediction:
- Aircraft taxi along designated taxiways following ATC clearances
- The taxiway network forms a graph analogous to the road network in autonomous driving
- Key differences from road prediction:
- Routes are largely pre-determined by ATC clearances
- Speed profiles are more constrained (typical taxi speeds: 10-20 knots)
- Aircraft have much larger turning radii than road vehicles
- Wingspan creates complex collision boundaries (not point-like agents)
- Prediction is primarily about timing rather than path: the path is usually known, but when the aircraft will reach each point is uncertain
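The timing-centric framing suggests a simple sketch: given a fixed waypoint route and an assumed per-segment speed profile, estimate the arrival time at each waypoint (all numbers illustrative; 1 knot is about 0.514 m/s):

```python
import numpy as np

KNOT = 0.514444  # meters per second

# Fixed taxi route as waypoints (meters) and an assumed per-segment
# speed profile in knots -- all values invented for illustration
waypoints = np.array([[0, 0], [200, 0], [200, 300], [500, 300]], dtype=float)
speed_kts = np.array([15.0, 10.0, 15.0])

seg_len = np.linalg.norm(np.diff(waypoints, axis=0), axis=1)   # segment lengths (m)
seg_time = seg_len / (speed_kts * KNOT)                        # traversal times (s)
eta = np.concatenate([[0.0], np.cumsum(seg_time)])             # arrival time per waypoint
```

A learned model would replace the fixed speed profile with a distribution conditioned on aircraft type, traffic, and clearances, but the path itself stays given.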
Relevant Data Sources:
- ASDE-X (Airport Surface Detection Equipment, Model X): radar-based surface surveillance providing position updates every ~1 second
- ADS-B surface messages: GPS-based position reports from equipped aircraft
- SWIM (System Wide Information Management): FAA data sharing platform
- Airport CDMS (Collaborative Decision Making System): flight schedule and operational data
5.2 Ground Vehicle Movement Prediction
Airport aprons host a diverse fleet of ground service equipment (GSE), each with distinct movement patterns:
Tow Tractors / Tugs:
- Move between fixed points (gates, hangars, maintenance areas)
- Follow designated vehicle lanes on the apron
- Speed: 5-15 mph with frequent stops
- Prediction challenge: which aircraft they are servicing next (scheduling dependency)
Belt Loaders:
- Position at aircraft cargo doors during ground handling
- Movement limited to immediate vicinity of aircraft
- Prediction tied to baggage/cargo operations schedule
Fuel Trucks:
- Route between fuel depot and aircraft gates
- Service sequence depends on fueling schedule
- Large vehicles with limited maneuverability
Catering Trucks, Lavatory Trucks, Ground Power Units:
- Each has characteristic approach angles and positions relative to aircraft
- Movement is highly scheduled and procedural
Unique Ground Vehicle Prediction Challenges:
- Vehicles operate in a semi-structured environment (painted lanes but less enforcement than roads)
- Movement patterns are tied to flight schedules and airline procedures
- Right-of-way rules differ from road traffic (aircraft always have priority)
- High density of different vehicle types in small areas during turnaround operations
- No standardized "rules of the road" across airports
5.3 Pedestrian Prediction on Apron
Ground crew pedestrian behavior on airport aprons is fundamentally different from urban pedestrian prediction:
- Goal-directed with known goals: Crew members move between specific equipment and aircraft doors/positions
- Task-dependent trajectories: A marshaller walks to a specific position, wing walkers go to wingtips, ground crew approaches cargo doors
- PPE and safety constraints: Must stay within designated walking paths, avoid jet blast zones, propeller arcs, and active vehicle lanes
- Team coordination: Ground handling involves coordinated teams whose movements are interdependent
- Low-visibility challenges: Operations continue in rain, fog, darkness with limited visibility
Prediction Approaches:
- Task-sequence models: predict the crew member's current task and associated movement pattern
- Role-based prediction: different roles (marshaller, chock handler, headset operator) have distinct and predictable movement patterns
- Team-coordination models: predict crew movements as a coordinated group, not independently
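As a minimal illustration of role-based prediction, the sketch below maps a detected crew role to a characteristic goal position and rolls out a constant-speed, goal-directed trajectory. The role names, goal coordinates, and walking speed are illustrative assumptions, not values from any operational standard.

```python
import numpy as np

# Hypothetical role-to-goal mapping: each ground-crew role has a
# characteristic target position relative to the aircraft (metres,
# aircraft body frame; coordinates are illustrative).
ROLE_GOALS = {
    "marshaller": np.array([25.0, 0.0]),      # ahead of the nose
    "wing_walker_left": np.array([0.0, 18.0]),
    "wing_walker_right": np.array([0.0, -18.0]),
    "chock_handler": np.array([2.0, 3.0]),    # near the nose gear
}

def predict_role_trajectory(position, role, speed=1.4, dt=0.5, horizon=20):
    """Constant-speed, goal-directed rollout toward the role's goal.

    position : (2,) current position in the aircraft frame
    speed    : walking speed in m/s (1.4 m/s is a typical adult pace)
    Returns an array of shape (horizon, 2) of future positions.
    """
    goal = ROLE_GOALS[role]
    pos = np.asarray(position, dtype=float)
    traj = []
    for _ in range(horizon):
        to_goal = goal - pos
        dist = np.linalg.norm(to_goal)
        step = min(speed * dt, dist)  # stop at the goal, do not overshoot
        if dist > 1e-6:
            pos = pos + to_goal / dist * step
        traj.append(pos.copy())
    return np.stack(traj)

traj = predict_role_trajectory([10.0, -5.0], "marshaller")
```

A learned system would replace the straight-line rollout with a trajectory decoder conditioned on the inferred role, but the structure — role classification first, role-specific motion prior second — is the same.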
5.4 Highly Structured but Complex Movement Patterns
Airport operations exhibit a paradox: individual movements are highly procedural and predictable, but the overall system is complex due to:
- Temporal coupling: Many operations must happen in sequence (arrive gate -> chocks -> bridge -> doors -> services)
- Resource constraints: Limited gates, taxiways, runways create dependencies
- Cascading delays: A delay in one operation propagates through the system
- Multi-stakeholder coordination: Airlines, ground handlers, ATC, airport authority all influence operations
This structure suggests that graph-based world models (similar to LaneGCN's lane graph approach) could be particularly effective, with the taxiway/apron layout as the spatial graph and the operational sequence as the temporal graph.
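To make the spatial-graph idea concrete, the sketch below encodes a toy taxiway/apron layout as a weighted graph and uses Dijkstra's algorithm as a route prior for an aircraft taxiing to its assigned gate. The node names, topology, and edge lengths are invented for illustration.

```python
from heapq import heappush, heappop

# Toy taxiway/apron graph: node -> {neighbour: edge length in metres}.
# The layout and distances are illustrative, not a real airport.
TAXIWAY_GRAPH = {
    "runway_exit_A": {"taxiway_B": 300},
    "taxiway_B": {"runway_exit_A": 300, "apron_entry": 450, "taxiway_C": 200},
    "taxiway_C": {"taxiway_B": 200, "apron_entry": 350},
    "apron_entry": {"taxiway_B": 450, "taxiway_C": 350,
                    "gate_12": 120, "gate_14": 160},
    "gate_12": {"apron_entry": 120},
    "gate_14": {"apron_entry": 160},
}

def predict_taxi_route(graph, start, goal):
    """Dijkstra shortest path as a route prior over the taxiway graph."""
    dist = {start: 0.0}
    prev = {}
    heap = [(0.0, start)]
    while heap:
        d, node = heappop(heap)
        if node == goal:
            break
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, w in graph[node].items():
            nd = d + w
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                prev[nbr] = node
                heappush(heap, (nd, nbr))
    # Reconstruct the path by walking predecessors back to the start.
    path, node = [goal], goal
    while node != start:
        node = prev[node]
        path.append(node)
    return path[::-1], dist[goal]

route, length = predict_taxi_route(TAXIWAY_GRAPH, "runway_exit_A", "gate_12")
# route: runway_exit_A -> taxiway_B -> apron_entry -> gate_12 (870 m)
```

A LaneGCN-style model would learn node and edge embeddings over this same graph rather than hand-coding shortest paths, but the graph itself — taxiway segments as nodes, connectivity as edges — is the shared substrate.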
5.5 Integration with Flight Schedule Data
Unlike road traffic, airport movements are largely scheduled:
- Flight arrival/departure times are known hours in advance
- Gate assignments are planned (though subject to change)
- Ground handling procedures have standard durations
- Crew and equipment are allocated to specific flights
This creates a unique opportunity for prediction:
- Prior knowledge of upcoming events (e.g., a flight arriving in 10 minutes triggers predictable ground handling mobilization)
- Schedule adherence as a prediction feature
- Deviation from schedule as an anomaly signal
- Integration of ACARS (Aircraft Communications Addressing and Reporting System) messages for real-time schedule updates
Recommended Approach: Hybrid Prediction
- Schedule-based prior: Start with the planned sequence and timing of operations
- Observation-based refinement: Use sensor data (cameras, radar, ADS-B) to detect actual positions and update predictions
- World model for simulation: A learned world model can generate the likely sequence of events given current state and schedule, producing multi-step predictions of how the apron scene will evolve
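The first two steps above — a schedule-based prior refined by observations — can be sketched as a single Gaussian fusion (Kalman-style) update on an arrival ETA. The variances and times below are illustrative assumptions.

```python
def refine_eta(scheduled_eta, prior_var, observed_eta, obs_var):
    """One Gaussian (Kalman-style) fusion step: combine the schedule-based
    prior ETA with a sensor-derived ETA estimate (e.g. extrapolated from an
    ADS-B track). Times in minutes; variances in minutes^2.
    All numbers used here are illustrative.
    """
    gain = prior_var / (prior_var + obs_var)          # weight on the observation
    fused_eta = scheduled_eta + gain * (observed_eta - scheduled_eta)
    fused_var = (1.0 - gain) * prior_var              # fused estimate is tighter
    return fused_eta, fused_var

# Schedule says 10 min out (+/- 3 min std); ADS-B extrapolation says
# 13 min (+/- 1 min std). The more certain observation dominates.
eta, var = refine_eta(10.0, prior_var=9.0, observed_eta=13.0, obs_var=1.0)
# eta = 12.7 minutes, var = 0.9
```

The same pattern generalizes: every scheduled event supplies a prior distribution over its timing, and sensor evidence shifts and sharpens that distribution as the event approaches.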
5.6 Predicting Gate Operations Sequences
Gate turnaround is a complex, multi-step process with interdependencies:
Typical Turnaround Sequence:
1. Aircraft arrives and parks at gate
2. Chocks placed, engines shut down
3. Jet bridge connects
4. Passenger doors open
5. Ground power connected
6. Baggage unloading begins (belt loader approaches)
7. Fueling begins (fuel truck approaches)
8. Catering service (catering truck approaches)
9. Cabin cleaning
10. Water/lavatory service
11. New cargo/baggage loading
12. Fueling complete (fuel truck departs)
13. New passengers board
14. Jet bridge disconnects
15. Pushback clearance obtained
16. Tug connects and pushback begins
Prediction as Sequence Modeling:
- Each step has a characteristic duration distribution
- Some steps can occur in parallel (fueling, catering, baggage)
- Some steps have strict ordering constraints (chocks before bridge, bridge before doors)
- The entire sequence can be modeled as a temporal graph with nodes (tasks) and edges (dependencies)
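The temporal-graph idea above can be sketched as a task DAG with a forward pass computing earliest finish times, which respects ordering constraints while letting independent tasks (fueling, catering, baggage) run in parallel. The task set and durations are illustrative assumptions, not standard turnaround timings.

```python
# Turnaround tasks as a dependency DAG:
# task -> (expected duration in minutes, list of prerequisite tasks).
# Durations and dependencies are illustrative.
TASKS = {
    "park": (2, []),
    "chocks": (1, ["park"]),
    "bridge": (3, ["chocks"]),
    "doors_open": (1, ["bridge"]),
    "unload_bags": (15, ["doors_open"]),
    "fuel": (20, ["chocks"]),            # parallel with baggage handling
    "catering": (12, ["doors_open"]),    # parallel with baggage handling
    "load_bags": (15, ["unload_bags"]),
    "board": (18, ["catering"]),
    "pushback": (2, ["board", "load_bags", "fuel"]),
}

def earliest_finish_times(tasks):
    """Forward pass over the task DAG: each task's earliest finish is its
    duration plus the latest finish among its prerequisites."""
    finish = {}
    def visit(name):
        if name in finish:
            return finish[name]
        duration, deps = tasks[name]
        start = max((visit(d) for d in deps), default=0)
        finish[name] = start + duration
        return finish[name]
    for name in tasks:
        visit(name)
    return finish

finish = earliest_finish_times(TASKS)
# finish["pushback"] is the predicted minimum turnaround time (here, 39 min)
```

Replacing the point durations with per-task duration distributions and resampling as observations arrive turns this same structure into a probabilistic delay predictor.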
World Model Approach: A world model trained on turnaround sequences could:
- Predict the timing of each step given current progress
- Identify delays early (if baggage unloading is slow, predict delayed departure)
- Generate realistic turnaround scenarios for planning and optimization
- Handle the compositional nature of the process (different aircraft types, different airlines have different procedures)
This is structurally similar to MotionLM's language modeling approach -- the turnaround sequence is a "language" of operations that can be modeled autoregressively.
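As a minimal illustration of that autoregressive framing, a first-order next-event model can be fit to historical turnaround logs. The sequences below are toy data; a real system would use a transformer over much longer event histories, but the conditional-probability structure is the same.

```python
from collections import Counter, defaultdict

# A minimal "language of operations": learn next-event probabilities from
# historical turnaround logs (illustrative toy sequences, not real data).
histories = [
    ["park", "chocks", "bridge", "doors", "fuel", "board", "pushback"],
    ["park", "chocks", "bridge", "doors", "catering", "board", "pushback"],
    ["park", "chocks", "bridge", "doors", "fuel", "board", "pushback"],
]

counts = defaultdict(Counter)
for seq in histories:
    for cur, nxt in zip(seq, seq[1:]):
        counts[cur][nxt] += 1

def next_event_probs(event):
    """Empirical P(next | current): a first-order stand-in for the
    autoregressive model a MotionLM-style system would learn."""
    total = sum(counts[event].values())
    return {e: c / total for e, c in counts[event].items()}

probs = next_event_probs("doors")
# {'fuel': 0.667, 'catering': 0.333} on this toy data
```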
5.7 Applying Autonomous Driving Prediction to Airports
Several autonomous driving prediction paradigms transfer well to airport environments:
| AD Paradigm | Airport Application |
|---|---|
| Lane graph (LaneGCN) | Taxiway/apron layout graph |
| Goal-conditioned prediction (GoRela, MTR) | Gate/runway assignment as goals |
| Joint multi-agent prediction (SceneTransformer) | Coordinated ground handling team prediction |
| Game-theoretic (GameFormer) | Aircraft-vehicle-pedestrian right-of-way negotiation |
| Occupancy flow prediction | Apron area occupancy prediction for safety zones |
| World model rollout (GAIA-1, OccWorld) | Turnaround sequence simulation and what-if analysis |
| Conditional prediction (M2I, FJMP) | "If this aircraft pushes back, how do adjacent vehicles react?" |
| Diffusion-based prediction (MotionDiffuser) | Generating diverse gate operation scenarios for robustness |
Key Adaptations Needed:
- Flight schedule integration: No analog in road driving; requires fusing structured database information with sensor observations
- Role-based agent modeling: Different vehicle/person types have fundamentally different behavior models
- Sequence-level prediction: Predicting multi-step operational sequences (turnaround), not just trajectories
- Safety zone modeling: Jet blast zones, propeller arcs, and wing sweep areas create dynamic no-go zones
- Longer horizons: Airport prediction may need 30-60 minute horizons vs. the ~8-second horizons typical of driving benchmarks
- Lower speeds, higher precision: Airport movements are slower but require higher positional precision due to tight clearances
6. Summary and Key Insights
Evolution of Motion Prediction
The field has evolved through several paradigms:
- Physics-based (pre-2018): Constant velocity/acceleration models
- CNN + Rasterization (2018-2020): Encode the scene as a bird's-eye-view image, use CNNs to predict
- Graph + Vectorization (2020-2022): LaneGCN, VectorNet -- represent scenes as graphs, use GNNs
- Transformer-based (2021-2023): HiVT, Wayformer, SceneTransformer -- attention mechanisms replace hand-designed interaction modules
- Language/Diffusion-based (2023-present): MotionLM, MotionDiffuser -- connect prediction to generative modeling
- World Model-based (2023-present): GAIA-1, Copilot4D, OccWorld, Vista -- learn the world dynamics and derive prediction as a consequence
Key Takeaways
Joint prediction outperforms marginal prediction -- even on marginal metrics (QCNet finding). Modeling interactions is consistently beneficial.
Language modeling for trajectories (MotionLM) is a surprisingly effective paradigm that may benefit from scaling laws and architectural innovations in NLP.
World models as prediction engines represent a paradigm shift: instead of training specialized prediction modules, learn the dynamics of the world and derive predictions as rollouts.
Occupancy flow prediction is a complementary representation to trajectory prediction that handles occlusions, untracked objects, and spatial uncertainty more naturally.
Game-theoretic approaches (GameFormer) are essential for modeling the strategic aspects of multi-agent interaction that purely data-driven approaches may miss.
Airport environments are a promising but underexplored application domain where the structured nature of operations, availability of schedule data, and safety criticality make world model-based prediction particularly valuable.
The convergence of prediction and planning is accelerating: models like GenAD, Vista, and Drive-WM blur the boundary between predicting what will happen and deciding what to do.
Report compiled March 2026. Based on published research through early 2025.