
Airside Closed-Loop Planning Benchmark and Metrics

Purpose: Define a planning benchmark for autonomous vehicles that operate in airport airside environments, with metrics that capture safety, rule compliance, progress, comfort, throughput, and deployability. The focus is the planning stack from behavior decision through trajectory generation, not only perception accuracy or open-loop imitation.

Key Takeaway: Airside planning cannot be validated with L2 trajectory error alone. A useful benchmark must replay realistic stand, apron, service-road, and movement-area scenarios in closed loop, score binary safety gates before weighted performance metrics, and expose failure modes that matter to airport operators: aircraft contact, personnel proximity, hold-short violations, jet-blast/no-go-zone entry, blocked turnaround flow, and uncomfortable or untrackable trajectories.

Research current as of: 2026-05-09


Problem Framing

Road-driving benchmarks such as nuPlan, NAVSIM, and Bench2Drive moved planning evaluation away from static open-loop imitation toward simulation-based scoring. That shift is directly relevant to airside autonomy: a planner can match human trajectories in logs while still deadlocking at a baggage-road pinch point, crossing a hold-short line without clearance, or producing a trajectory that the controller cannot track around an aircraft stand.

The airside gap is domain-specific, not just data-volume-specific. Airport vehicles operate at lower speeds than road AVs, but the environment contains large aircraft footprints, tight stand clearances, task-sequencing constraints, ground crew, GSE (ground support equipment) with unusual kinematics, ramp-control instructions, FOD (foreign object debris), jet blast, de-icing zones, and movement-area authority rules. A closed-loop benchmark therefore needs to model operational state and rule authority alongside geometry.

The target benchmark should answer four questions:

  1. Can the planner complete the mission without violating hard safety or authority constraints?
  2. Does it make progress without causing avoidable stand or service-road blockage?
  3. Are trajectories smooth and trackable by the vehicle controller under delay and saturation?
  4. Does performance degrade predictably when perception, map, V2X, weather, or actor behavior is imperfect?

Method and Architecture Taxonomy

Evaluation Modes

| Mode | What It Measures | Strength | Limit |
| --- | --- | --- | --- |
| Open-loop log replay | Distance from expert trajectory, rule labels, comfort from planned path | Fast, cheap, useful for imitation-training regression tests | Does not reveal compounding errors or agent reactions |
| Pseudo closed loop | Ego plan is unrolled against logged or non-reactive actors, NAVSIM-style | Scalable over real logs, lower sim gap than synthetic-only | Other actors do not react to ego mistakes |
| Reactive closed-loop simulation | Ego and actors interact in a simulator or digital twin | Reveals deadlock, yielding, blocking, and recovery behavior | Requires credible actor models and calibrated maps |
| Scenario / fault injection | Hand-authored or generated rare cases such as FOD, emergency-vehicle priority, V2X dropout | Covers the safety-critical long tail | Easy to overfit if the scenario set is small |
| HIL / test-track replay | Controller, drive-by-wire, timing, and compute stack run against simulated or controlled physical scenarios | Catches actuator and latency problems | Lower throughput and higher operating cost |
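
These modes can share one harness contract, so the same planner and the same event-log schema run in every mode. A minimal interface sketch follows; all names are hypothetical, not an existing API:

```python
from enum import Enum, auto
from typing import Protocol


class EvalMode(Enum):
    OPEN_LOOP_REPLAY = auto()
    PSEUDO_CLOSED_LOOP = auto()
    REACTIVE_CLOSED_LOOP = auto()
    FAULT_INJECTION = auto()
    HIL = auto()


class ScenarioRunner(Protocol):
    """One runner per evaluation mode; all runners emit the same event-log
    schema so the metric stack below applies uniformly across modes."""

    mode: EvalMode

    def run(self, scenario_id: str, planner: object, seed: int) -> dict:
        """Execute one episode and return its event log."""
        ...
```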

Airside Scenario Suite

The benchmark should use short, scenario-focused routes rather than only long aggregate missions. Bench2Drive's short-route design is a better pattern than a single long "airport lap" because it separates abilities and reduces score variance.

| Scenario Family | Examples | Primary Stress |
| --- | --- | --- |
| Stand approach and docking | Belt loader or tug approaches the aircraft service zone | Precision, clearance envelopes, low-speed control |
| Stand exit and clear-out | GSE clears before TOBT (target off-block time) / pushback | Task progress, schedule constraints, right-of-way |
| Pushback interaction | Autonomous vehicle yields to tug and aircraft tail sweep | Large dynamic-obstacle geometry, priority rules |
| Service-road routing | Narrow bidirectional service roads and one-vehicle pinch points | Deadlock prevention, reservation compliance |
| Hold-short and movement-area access | Vehicle approaches a taxiway/runway hold line | Default-deny authority, clearance expiry |
| Occluded pedestrian/GSE | Worker or tug emerges from behind an aircraft or loader | Prediction, cautious progress, V2X/infrastructure value |
| FOD / spill / jet-blast zone | Planner must avoid or stop before invisible or semantic hazards | Hazard-map and V2X integration |
| Emergency priority | Rescue vehicle or ramp-control override interrupts the mission | Arbitration, replanning, fail-safe behavior |
| Adverse conditions | Rain, night, glare, low visibility, wet apron, GNSS multipath | ODD (operational design domain) boundaries and degradation policy |
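
A scenario file for these families can stay small. The sketch below is a hypothetical header, anticipating the stable-identifier and versioned-truth requirements discussed under deployment notes; all field names are illustrative:

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ScenarioSpec:
    """Hypothetical scenario header; field names are illustrative."""
    scenario_id: str                # stable id, e.g. "stand-dock-0042"
    family: str                     # one of the families in the table above
    map_version: str                # versioned ground truth for this scenario
    route: tuple[str, ...]          # waypoint or lane identifiers
    seeds: tuple[int, ...] = (0,)   # randomized actor/weather variants
    faults: dict[str, float] = field(default_factory=dict)  # e.g. {"v2x_dropout": 0.2}
```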

Metric Stack

Use hard safety gates first, then weighted performance. A high progress score should never compensate for aircraft contact or unauthorized movement-area entry.

| Metric Layer | Example Metrics | Notes |
| --- | --- | --- |
| Safety gates | No aircraft contact, no personnel collision, no GSE collision, no runway/taxiway incursion, no jet-blast/no-go-zone entry | Binary or severity-gated multipliers |
| Authority and rule compliance | Hold-short compliance, clearance TTL (time-to-live), speed zones, stand-access permissions, A-CDM (Airport Collaborative Decision Making)/ramp-control state | Must be evaluated against time-stamped operational truth |
| Progress and mission success | Route completion, task completion, missed deadlines, turnaround blocking time | Score per scenario family and in aggregate |
| Interaction quality | Unnecessary stops, deadlock/livelock, courtesy/yield correctness, predicted TTC (time-to-collision) margins | Critical for dense ramp operations |
| Comfort and trackability | Longitudinal/lateral acceleration, jerk, yaw rate, curvature continuity, controller tracking error | Both the planned path and the executed path should be scored |
| Robustness | Score under sensor delay, map drift, V2X dropout, actor-model mismatch, weather, localization covariance | Report as degradation curves, not a single number |
| Operations | Intervention rate, remote-assistance calls, safe-stop rate, recovery time, blocked-zone occupancy | Bridges benchmark results to deployment readiness |
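
A minimal sketch of this gate-then-weight aggregation, assuming metrics are normalized to [0, 1]; names and weights are illustrative:

```python
def episode_score(gates: dict[str, bool],
                  metrics: dict[str, float],
                  weights: dict[str, float]) -> float:
    """Any failed hard gate zeroes the episode, so progress can never
    offset aircraft contact or an unauthorized movement-area entry."""
    if not all(gates.values()):   # e.g. {"no_aircraft_contact": True, ...}
        return 0.0
    total = sum(weights.values())
    return sum(w * metrics[name] for name, w in weights.items()) / total
```

Severity-gated variants replace the booleans with multipliers in [0, 1] and multiply them into the weighted term, NAVSIM-PDMS-style.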

Reference Architecture

```text
Scenario definition
  -> map and operational truth
  -> actors, aircraft state, GSE tasks, V2X messages, weather, faults
  -> planner API
  -> closed-loop simulator / pseudo simulator / HIL runner
  -> executed trajectory and event log
  -> safety gates
  -> weighted planning metrics
  -> root-cause tags and replay bundle
```

The planner API should accept perception/tracking outputs, localization state, map context, mission route, authority state, and V2X/infrastructure messages. It should output a trajectory plus semantic intent: maneuver class, reason for stops, drivable area, expected yielding agents, and fallback state. Those annotations make failures diagnosable and align with Autoware's planning-factor concept.
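
A sketch of that contract in code, assuming a Python harness; the types and field names are hypothetical, not Autoware's actual interfaces:

```python
from dataclasses import dataclass
from typing import Protocol


@dataclass
class PlanOutput:
    trajectory: list[tuple[float, float, float, float]]  # (t, x, y, v) samples
    maneuver: str                 # e.g. "dock_approach", "hold_short_wait"
    stop_reason: str | None       # populated whenever the plan holds position
    yielding_to: list[str]        # track ids the planner expects to yield to
    fallback_state: str           # "nominal" | "degraded" | "safe_stop"


class Planner(Protocol):
    def plan(self, tracks: list, localization: dict, map_ctx: dict,
             route: list, authority_state: dict, v2x_msgs: list) -> PlanOutput:
        ...
```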


Evaluation and Deployment Notes

The first benchmark release should avoid claiming that airside autonomy is solved; it should be a deployment gate and a regression harness. A practical staged-gate approach (a configuration sketch follows the list):

  1. Developer gate: thousands of pseudo-closed-loop log snippets, quick enough for nightly CI.
  2. Release gate: curated reactive simulation suite with fixed seeds plus randomized variants.
  3. Site gate: airport-specific map, routes, stand layouts, procedures, and weather/lighting profiles.
  4. Operational gate: supervised dry runs, shadow-mode comparison to human or production stack, then constrained autonomous missions.
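
Wiring the four gates into CI can be as simple as a static table. The sketch below is hypothetical; episode counts and score thresholds are placeholders to be tuned per program, not recommendations:

```python
# Hypothetical gate configuration; safety gates must pass at 100% everywhere,
# while the performance floor tightens as a release moves toward operation.
GATES = {
    "developer":   {"mode": "pseudo_closed_loop", "episodes": 5000, "min_score": 0.80},
    "release":     {"mode": "reactive_sim",       "episodes": 800,  "min_score": 0.90},
    "site":        {"mode": "reactive_sim",       "episodes": 300,  "min_score": 0.95},
    "operational": {"mode": "shadow_and_dry_run", "episodes": None, "min_score": None},
}
```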

Scenario files need stable identifiers and versioned ground truth. Each failure should produce a replay bundle containing inputs, planned trajectory, executed trajectory, controller commands, map version, V2X messages, random seeds, and metric breakdown. Without this, benchmark scores become unactionable.
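
A bundle schema sketch matching that list; field names are hypothetical:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReplayBundle:
    """One bundle per failure; paths point into the benchmark's artifact store."""
    scenario_id: str
    map_version: str
    seeds: tuple[int, ...]
    inputs_log: str               # recorded perception/localization/V2X inputs
    planned_trajectory: str
    executed_trajectory: str
    controller_commands: str
    v2x_messages: str
    metric_breakdown: dict[str, float]
    failed_gates: tuple[str, ...]
```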

Recommended reporting:

  • Report score by scenario family, not just one aggregate number.
  • Publish safety-gate pass rates separately from performance scores.
  • Include confidence intervals across seeds and randomized actor policies (a bootstrap sketch follows this list).
  • Track "planner caused stop" vs. "safety monitor caused stop" vs. "controller could not track".
  • Keep a hidden site-specific set to prevent overfitting.
  • Add new scenarios from every real incident, near miss, remote-assistance call, and blocked-mission replay.
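
For the confidence-interval bullet, a standard percentile bootstrap over per-seed episode scores is sufficient; this is generic statistics, not a benchmark-specific method:

```python
import random
import statistics


def bootstrap_ci(scores: list[float], n_boot: int = 2000,
                 alpha: float = 0.05) -> tuple[float, tuple[float, float]]:
    """Mean score with a (1 - alpha) percentile-bootstrap confidence interval."""
    means = sorted(
        statistics.fmean(random.choices(scores, k=len(scores)))
        for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return statistics.fmean(scores), (lo, hi)
```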

The metric design should reuse proven road-driving ideas where possible: NAVSIM's PDMS/EPDMS style of hard multipliers plus progress/TTC/comfort terms, Bench2Drive's short-route success and driving-score protocol, and nuPlan's reactive closed-loop simulation philosophy. The airside additions are aircraft geometry, airport authority state, apron task state, and operations-level blockage metrics.
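
In formula form, the borrowed aggregation is a product of hard multipliers scaling a weighted average of soft terms; a generic sketch, where the airside contribution is the contents of the gate set rather than the shape of the formula:

$$
S = \Big(\prod_{p \in \mathcal{P}} m_p\Big) \cdot \frac{\sum_{i \in \mathcal{W}} w_i\, s_i}{\sum_{i \in \mathcal{W}} w_i},
\qquad m_p \in \{0,1\} \text{ or } [0,1], \quad s_i \in [0,1]
$$

Here $\mathcal{P}$ would hold the safety and authority gates (contact, incursion, no-go-zone entry) and $\mathcal{W}$ the progress, TTC-margin, comfort, and blockage terms.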


Indoor / Outdoor / Airside Fit

| Domain | Fit | Adaptation |
| --- | --- | --- |
| Indoor warehouse / factory | High for low-speed AMRs and forklifts | Replace aircraft and ramp rules with aisle, dock-door, pedestrian-zone, WMS, and load-stability constraints |
| Outdoor yards / depots | High | Reuse service-road, blocked-route, trailer/container, GNSS, weather, and teleoperation metrics |
| Public-road AV | Medium | Road benchmarks already exist; this framework is useful mainly for industrial/private-road ODDs |
| Airside apron and stand | Very high | Primary target: aircraft separation, pushback priority, stand sequencing, A-CDM/A-SMGCS/ramp-control context |
| Movement area / runway-adjacent | High but stricter | Requires default-deny authority, tower/ramp clearance truth, runway-incursion severity scoring, and regulator-reviewed test cases |

Airside is unusually benchmark-friendly because the airport is a bounded, mapped, single-operator environment. It is also benchmark-risky because rare failures have severe consequences and local procedures differ across airports. The benchmark therefore needs site adapters rather than a single universal score.


Failure Modes

| Failure Mode | Why It Matters | Mitigation in Benchmark |
| --- | --- | --- |
| Open-loop overconfidence | Low L2 error hides closed-loop collapse | Require closed-loop and pseudo-closed-loop scores |
| Non-reactive actor optimism | Logged actors do not respond to ego blocking or creeping | Include reactive actor models and adversarial yield/non-yield variants |
| Aircraft geometry simplification | Bounding boxes miss wingtip, tail sweep, engine intake, and jet-blast zones | Use aircraft-specific swept volumes and semantic hazard polygons (see the sketch after this table) |
| Rule-truth mismatch | Planner appears wrong because the benchmark does not encode the actual clearance state | Version A-CDM, ramp-control, NOTAM, and hold-short authority data |
| Controller-blind scoring | Planner emits a path that is mathematically safe but physically untrackable | Score the executed trajectory and controller tracking error |
| Map and zone drift | Construction, stand reconfiguration, or temporary closures invalidate truth | Include map versioning and dynamic restrictions |
| V2X dependence without fallback | Planner succeeds only when cooperative messages are perfect | Inject packet loss, latency, stale messages, bad actors, and total dropout |
| Scenario overfitting | Teams tune to a fixed scenario list | Use hidden scenarios, random seeds, procedural variants, and incident-derived additions |
| Missing operations metrics | Planner is safe but blocks turnaround flow | Score blocked-zone time, mission lateness, deadlock, and remote-assistance rate |
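
For the swept-volume row above, a minimal sketch assuming the shapely library is available; the pivot and arc values are illustrative parameters, not aircraft-type data:

```python
# Minimal tail-sweep check: union the aircraft footprint over the pushback
# arc and intersect it with clearance / no-go polygons.
from shapely.affinity import rotate
from shapely.geometry import Polygon
from shapely.ops import unary_union


def tail_sweep_volume(footprint: Polygon, pivot: tuple[float, float],
                      arc_deg: float, steps: int = 24) -> Polygon:
    """Swept area of the footprint rotated about the main-gear pivot."""
    angles = (arc_deg * i / steps for i in range(steps + 1))
    return unary_union([rotate(footprint, a, origin=pivot) for a in angles])


def violates(footprint: Polygon, pivot: tuple[float, float],
             arc_deg: float, no_go: Polygon) -> bool:
    """True if any pose along the pushback arc enters the no-go polygon."""
    return tail_sweep_volume(footprint, pivot, arc_deg).intersects(no_go)
```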


Sources

Compiled from public documentation and papers for nuPlan, NAVSIM, Bench2Drive, and Autoware, plus publicly available airside-operations material.