
Airside Autonomy Benchmark and Dataset Specification

Key Takeaway: Airside autonomy needs its own benchmark. Road datasets and indoor AMR benchmarks do not cover aircraft priority, apron geometry, ground support equipment, stand operations, jet blast, FOD, marshalling, A-SMGCS context, or airport sponsor approval constraints. The practical benchmark should combine logged multi-sensor data, NAVSIM-style pseudo-simulation, closed-loop digital-twin routes, and controlled real-world evidence.


Scope and Goal

This specification defines a benchmark for autonomous ground vehicle systems operating on airport airside surfaces: aprons, stands, service roads, cargo areas, baggage routes, de-icing zones, and controlled crossings near taxiways. It is intended for generic autonomy stacks, not only one vehicle type.

Target vehicles:

  • Baggage tractors and autonomous dollies.
  • Cargo tugs and ULD movers.
  • Follow-me, inspection, FOD retrieval, and perimeter vehicles.
  • Autonomous service vehicles operating around aircraft stands.
  • Low-speed passenger or crew shuttles in restricted airside zones.

The benchmark should measure full autonomy behavior, not only perception. It must support modular stacks, end-to-end driving policies, VLA planners, cooperative V2X policies, and world-model-based planners through a common route/task interface.
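To make the common route/task interface concrete, one option is a thin policy protocol that every submission implements regardless of internal architecture. A minimal sketch; all names (RouteTask, Observation, TrajectoryPlan, DrivingPolicy) are illustrative, not an existing API:

```python
# Hypothetical route/task interface; every name here is illustrative.
from dataclasses import dataclass, field
from typing import Protocol, Sequence


@dataclass
class RouteTask:
    """One benchmark task: a route through airside zones plus its goal."""
    scenario_id: str
    route_waypoints: Sequence[tuple[float, float]]  # map-frame (x, y) in meters
    goal_zone_id: str                               # e.g. a stand or staging zone
    speed_limit_mps: float


@dataclass
class Observation:
    """Per-tick sensor and message bundle handed to the policy."""
    timestamp_ns: int
    camera_frames: dict                             # view name -> image array
    lidar_points: object                            # point cloud in ego frame
    v2x_messages: list = field(default_factory=list)


@dataclass
class TrajectoryPlan:
    """Planned ego motion: (x, y, heading, speed) samples at fixed dt."""
    states: Sequence[tuple[float, float, float, float]]
    dt_s: float


class DrivingPolicy(Protocol):
    """Interface shared by modular, E2E, VLA, and world-model stacks."""
    def reset(self, task: RouteTask) -> None: ...
    def step(self, obs: Observation) -> TrajectoryPlan: ...
```

Because the evaluator only calls reset and step, a modular stack and an end-to-end policy are scored through the identical interface.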


Benchmark Taxonomy

Evaluation Tiers

| Tier | Name | Purpose | Execution mode |
| --- | --- | --- | --- |
| T0 | Dataset quality and coverage audit | Check sensor sync, labels, scenario balance, map validity | Offline |
| T1 | Component evaluation | Perception, tracking, occupancy, prediction, VLM reasoning | Offline |
| T2 | Logged trajectory evaluation | NAVSIM-style path scoring on real missions | Offline / non-reactive |
| T3 | Pseudo-simulation | Perturb ego endpoint and render/score nearby observations | Offline / pseudo-closed-loop |
| T4 | Closed-loop digital twin | Interactive route and scenario simulation | Simulator |
| T5 | Shadow-mode replay | Run model on real vehicle logs without control authority | On-vehicle / offline replay |
| T6 | Controlled-site operational test | Safety-driver or monitor-backed airport test | Real-world evidence |

Task Families

| Task family | Examples | Required metrics |
| --- | --- | --- |
| Route transit | Depot to stand, stand to cargo, service-road loop | Progress, speed limits, route adherence, smoothness |
| Stand approach | Approach aircraft stand with active GSE and personnel | Aircraft clearance, stand zone compliance, personnel safety |
| Pushback interaction | Yield to pushback, cross after clearance, follow tow path | Aircraft priority, swept-volume avoidance, clearance state |
| GSE sequencing | Merge with baggage trains, pass belt loader, convoy with tugs | Deadlock, right-of-way, gap acceptance, V2X intent use |
| FOD encounter | Detect, avoid, report, or retrieve debris | FOD recall, false-alarm rate, stop/reroute correctness |
| Jet blast and engine hazards | Avoid active blast zones and engine intake zones | Hazard-zone compliance, fallback if no engine-state message |
| Communication-dependent operation | A-SMGCS/ramp-control instruction, V2X task update, link loss | Default-deny behavior, stale-message rejection, fallback (sketch below) |
| Adverse conditions | Rain, glare, night, fog, wet apron, de-icing residue | Robustness deltas and ODD state transitions |
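For the communication-dependent task family, default-deny and stale-message rejection are the core behaviors under test. A minimal sketch of the gating check; the message fields and the freshness threshold are assumptions, not a protocol standard:

```python
# Illustrative clearance-gating logic; field names and thresholds are assumed.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClearanceMessage:
    clearance_id: str
    granted: bool
    issued_at_ns: int      # publisher timestamp
    valid_for_ns: int      # validity window declared by the sender


def may_proceed(msg: Optional[ClearanceMessage], now_ns: int,
                max_age_ns: int = 2_000_000_000) -> bool:
    """Default-deny: move only on a fresh, unexpired, positive clearance."""
    if msg is None:                           # link loss or no instruction yet
        return False
    age_ns = now_ns - msg.issued_at_ns
    if age_ns < 0 or age_ns > max_age_ns:     # clock skew or stale message
        return False
    if age_ns > msg.valid_for_ns:             # clearance has expired
        return False
    return msg.granted                        # explicit grant required
```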

Data Modalities

| Modality | Required? | Notes |
| --- | --- | --- |
| Surround cameras | Required | At least 6 views for E2E/VLA and inspection tasks |
| 3D LiDAR | Required | Metric obstacle and aircraft geometry |
| 4D radar | Strongly recommended | Weather/night fallback and Doppler velocity |
| IMU/GNSS/RTK | Required | Pose, speed, and localization provenance |
| Wheel odometry / steering / actuator state | Required | Control and tracking diagnostics |
| V2X messages | Required for cooperative tracks | CAM-like awareness, task assignment, clearance, stand status, jet blast, FOD |
| Airport operational feeds | Recommended | AODB/A-CDM/A-SMGCS, stand assignment, aircraft movement status |
| Maps | Required | Lanelet2 or airport graph plus AMDB-style surfaces, stands, zones, speed limits |

Dataset Schema

Scene Record

Each scenario clip should include the following (a minimal schema sketch follows the list):

  • scenario_id, airport_id, stand_or_zone_id, and route_id.
  • ODD metadata: weather, visibility, lighting, surface condition, construction/de-icing state.
  • Sensor packets with original timestamps and clock-domain provenance.
  • Ego state and control state at sensor and control rates.
  • Map version and geofence/zone version.
  • V2X and airport-system messages with publish, receive, and consume timestamps.
  • Ground-truth annotations or derived labels.
  • Safety monitor events and human intervention markers.
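A minimal scene-record sketch mirroring the list above; the field names are illustrative and would be serialized to the benchmark's on-disk format, whatever that turns out to be:

```python
# Illustrative scene-record schema; names follow the list above, not a fixed format.
from dataclasses import dataclass, field


@dataclass
class SceneRecord:
    scenario_id: str
    airport_id: str
    stand_or_zone_id: str
    route_id: str
    odd: dict                  # weather, visibility, lighting, surface, construction/de-icing
    map_version: str
    geofence_version: str
    sensor_packets: list = field(default_factory=list)   # original timestamps + clock domain
    ego_states: list = field(default_factory=list)       # sensor- and control-rate states
    v2x_messages: list = field(default_factory=list)     # publish/receive/consume timestamps
    labels: list = field(default_factory=list)           # ground truth or derived labels
    safety_events: list = field(default_factory=list)    # monitor events, interventions
```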

Required Labels

| Label type | Contents |
| --- | --- |
| 3D object boxes/tracks | Aircraft, GSE, road vehicles, personnel, FOD, cones/barriers, jetbridge, static equipment |
| Semantic occupancy | Drivable apron, stand safety envelope, no-go zone, aircraft swept area, jet blast zone, unknown/occluded |
| Map elements | Service roads, hold lines, stand markings, taxiway boundaries, crossings, parking/staging zones |
| Agent intent/state | Aircraft parked/taxi/pushback, GSE loading/unloading/reversing, personnel walking/marshalling |
| Event labels | Clearance granted/revoked, pushback start, FOD detected, intervention, e-stop, network loss |
| VLM/VLA QA labels | Safety-relevant scene questions, rule questions, instruction-following labels |

Splits

| Split | Purpose | Anti-leakage rule |
| --- | --- | --- |
| train_normal | Model training on routine operations | Exclude hard safety events |
| train_augmented | Synthetic and replay perturbations | Mark synthetic provenance explicitly |
| val_site_seen | Routine validation at a known airport/site | No route or time-clip overlap with training |
| val_site_unseen | Generalization to new stands or airport areas | Hold out whole stands/zones (sketch below) |
| test_public | Public reproducibility | Limited edge cases, fixed evaluator |
| test_private | Leaderboard overfitting guard | Hidden routes and event mix |
| test_safety_hard | Release gating | Withheld high-severity events and rare weather |
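The anti-leakage rules can be enforced mechanically, for example by holding out whole stands/zones and hashing route IDs so that every clip of a route lands in the same split. A sketch reusing the SceneRecord fields from the schema above; the split fractions are placeholders, and safety-hard events would be routed to test_safety_hard before this assignment runs:

```python
# Illustrative leakage-safe split assignment; fractions and keys are assumptions.
import hashlib


def split_for(scene: "SceneRecord", heldout_zones: set) -> str:
    """Assign a clip to a split without route/zone leakage across splits."""
    if scene.stand_or_zone_id in heldout_zones:
        return "val_site_unseen"     # whole zones held out, never trained on
    # Deterministic hash of the route keeps all clips of a route together.
    digest = hashlib.sha256(scene.route_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    if bucket < 80:
        return "train_normal"
    if bucket < 90:
        return "val_site_seen"
    return "test_public"
```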

Metric Taxonomy

Composite Airside PDM Score

An airside PDM-style score should multiply hard safety gates by weighted quality scores (a computational sketch follows the two lists below):

AirsideScore = SafetyGatesProduct * WeightedQualityScore

Hard safety gates:

  • No aircraft contact.
  • No personnel collision.
  • No GSE/static-object collision.
  • No entry into active jet blast or engine intake zone.
  • No unauthorized hold-short/taxiway/runway-area crossing.
  • No stand safety-envelope violation.
  • No unsafe response to communication loss or stale clearance.

Weighted quality scores:

  • Task progress and route completion.
  • Aircraft/personnel/GSE clearance margins.
  • Comfort and payload stability.
  • Speed compliance by zone.
  • Rule compliance and right-of-way.
  • Operational efficiency, including avoidable delay and deadlock.
  • V2X usage correctness when available.
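A minimal sketch of the composite, assuming boolean gate outcomes and quality subscores normalized to [0, 1]; the weights below are placeholders, not calibrated values:

```python
# Illustrative AirsideScore computation; weights are placeholders.
def airside_score(gates: dict, quality: dict, weights: dict) -> float:
    """Composite score: any failed hard gate zeroes the run."""
    gates_product = 1.0
    for passed in gates.values():
        gates_product *= 1.0 if passed else 0.0
    total_w = sum(weights.values())
    weighted_quality = sum(weights[k] * quality[k] for k in weights) / total_w
    return gates_product * weighted_quality


score = airside_score(
    gates={"no_aircraft_contact": True, "no_personnel_collision": True},
    quality={"progress": 0.92, "clearance_margin": 0.85},
    weights={"progress": 0.6, "clearance_margin": 0.4},
)
```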

Component Metrics

| Component | Metrics |
| --- | --- |
| Perception | mAP/NDS-style object detection, per-class recall, FOD recall, personnel recall under occlusion |
| Occupancy | Semantic IoU, free-space precision, unknown-space calibration, future occupancy IoU |
| Tracking | MOTA/HOTA, ID switches, velocity error, track latency |
| Prediction | minADE/minFDE (sketch below), Brier-minFDE, occupancy flow error, swept-volume miss rate |
| Planning | Collision rate, progress, comfort, rule compliance, intervention rate |
| V2X | Deadline miss rate, stale-message rejection, pose/time alignment error, fallback correctness |
| VLM/VLA | Visual grounding, hallucination rate, text-only leakage, corruption robustness, action validity |
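For the prediction row, minADE/minFDE over K hypotheses are the standard displacement metrics. A NumPy sketch, assuming predictions shaped (K, T, 2) and ground truth shaped (T, 2), both in meters:

```python
# minADE / minFDE over K predicted trajectories.
import numpy as np


def min_ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple:
    """Return (minADE, minFDE) in meters across K hypotheses."""
    # Per-hypothesis, per-timestep Euclidean error: shape (K, T).
    err = np.linalg.norm(pred - gt[None], axis=-1)
    min_ade = float(err.mean(axis=1).min())   # best average displacement
    min_fde = float(err[:, -1].min())         # best final displacement
    return min_ade, min_fde
```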

Relevance by Domain

Generic AV

The benchmark creates a reusable pattern for non-road autonomy: scenario-driven evaluation with operational rules, not only lane-following. It can inform robotaxi depots, industrial campuses, ports, and private roads.

Indoor Autonomy

Indoor AMR and forklift benchmarks can borrow the task-progress, near-field personnel, communication-loss, and facility-map governance patterns. Replace aircraft/stand concepts with dock doors, aisles, pallets, and WMS tasks.

Outdoor Industrial Autonomy

Logistics yards, ports, mining roads, construction sites, and campuses share low-to-medium speed operation, mixed manual/autonomous traffic, private maps, and central dispatch. Airside evaluation patterns transfer well to their route/task scoring.

Airside Autonomy

This is the primary domain. Airside operations need explicit modeling of aircraft, turnaround state, right-of-way, ramp-control instructions, FOD, jet blast, and no-go zones. These are not optional edge cases; they are core operating constraints.


Implementation Notes

Minimum Viable Benchmark

Start with:

  • 50 to 100 hours of synchronized camera, LiDAR, radar, GNSS/IMU, odometry, and V2X logs from one controlled airside area.
  • 20 scenario types, each with at least 10 real examples or high-fidelity digital-twin variants.
  • A baseline modular stack and one E2E/VLA baseline.
  • NAVSIM-style offline trajectory scoring.
  • A closed-loop digital-twin evaluator for 5 to 10 interaction-heavy scenarios.

A minimal processing pipeline (a sync-audit sketch follows the diagram):

Raw logs + maps + V2X messages
  -> data validation and synchronization audit
  -> scenario mining and tagging
  -> component labels and derived occupancy
  -> logged replay evaluator
  -> pseudo-simulation renderer
  -> closed-loop digital twin
  -> benchmark report and evidence package
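The synchronization-audit stage (and protocol step 1 below) can start from two cheap checks: per-stream timestamp monotonicity and cross-stream start offsets. A sketch with an assumed tolerance; a non-empty failure list would reject the run before any model score is accepted:

```python
# Illustrative sync audit; the tolerance value is an assumption, not a standard.
import numpy as np


def sync_audit(stamps_ns: dict, max_offset_ns: int = 10_000_000) -> list:
    """Return a list of sync failures across sensor timestamp streams."""
    failures = []
    for name, ts in stamps_ns.items():
        if np.any(np.diff(ts) <= 0):
            failures.append(f"{name}: non-monotonic timestamps")
    # Compare stream start times against one reference stream.
    ref_name, ref_ts = next(iter(stamps_ns.items()))
    for name, ts in stamps_ns.items():
        offset = abs(int(ts[0]) - int(ref_ts[0]))
        if offset > max_offset_ns:
            failures.append(f"{name}: {offset} ns start offset vs {ref_name}")
    return failures
```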

Evaluation Protocol

  1. Validate clock sync and transforms before any model score is accepted.
  2. Run vehicle-only baseline first.
  3. Run modular autonomy baseline second.
  4. Run E2E/VLA/world-model submissions through the same route/task interface.
  5. Report aggregate score and per-scenario breakdown.
  6. Keep all safety-gate failures visible even when the composite score is high.
  7. Publish model input assumptions: sensors, maps, V2X, privileged labels, and external data (a declaration sketch follows this list).
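The input assumptions in step 7 can be a machine-checkable declaration that the evaluator audits against each track's allowed inputs. A sketch with assumed field names:

```python
# Illustrative input declaration and audit; field names are assumptions.
from dataclasses import dataclass, field


@dataclass
class InputDeclaration:
    sensors: set = field(default_factory=set)      # e.g. {"camera", "lidar"}
    uses_hd_map: bool = False
    uses_v2x: bool = False
    privileged_labels: set = field(default_factory=set)
    external_data: list = field(default_factory=list)


def audit_inputs(decl: InputDeclaration, allowed_sensors: set,
                 track_allows_v2x: bool) -> list:
    """Return audit violations; any violation invalidates the run."""
    violations = []
    if not decl.sensors <= allowed_sensors:
        violations.append(f"undeclared sensors: {decl.sensors - allowed_sensors}")
    if decl.uses_v2x and not track_allows_v2x:
        violations.append("V2X used on a non-cooperative track")
    if decl.privileged_labels:
        violations.append(f"privileged labels at runtime: {decl.privileged_labels}")
    return violations
```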

Evidence Artifacts

Each evaluation run should emit:

  • Scenario manifest and map version.
  • Model version, config, weights, and input modalities.
  • Metric JSON and human-readable summary.
  • Failure clips and event timeline.
  • Sensor/V2X latency histogram.
  • Safety monitor state and intervention records.
  • ODD assumptions and exclusions.

Failure Modes

| Failure mode | Benchmark risk | Control |
| --- | --- | --- |
| Road benchmark transfer overclaims | Strong NAVSIM/Bench2Drive result is mistaken for airside readiness | Require airside-specific scenario gates |
| Label sparsity for FOD/personnel | Small objects and occluded workers are underrepresented | Oversample safety-hard scenarios and use targeted annotation |
| Synthetic-domain bias | Digital twin makes policies overfit clean textures or scripted agents | Mix real logs, neural reconstructions, and domain-randomized variants |
| Overreliance on V2X | Model behaves unsafely when messages are delayed or missing | Include V2X-loss and stale-message tests |
| Stop-to-win | Model avoids infractions by excessive stopping | Add task-progress and operational-delay metrics |
| Hidden map leakage | Model sees future route, clearance, or labels unavailable at runtime | Enforce input-modality declarations and evaluator audits |
| Poor clock/transform provenance | Cooperative and multi-sensor scores are invalid | Reject runs with sync/transform QA failures |
| Aggregate metric masking | Aircraft or personnel failures are hidden by high route completion | Report safety-gate failures separately |

Related Documents

| Document | Relevance |
| --- | --- |
| Evaluation Benchmarks: NAVSIM and Bench2Drive | Method sources for pseudo-sim and closed-loop scoring |
| Airside Scenario Taxonomy | Scenario library seed |
| Evaluation Methods and Metrics | General metrics background |
| Simulators for Airside | Candidate simulator stack |
| Neural Simulation Platforms | Neural reconstruction and generative sim |
| Airport Digital Twins | Airside digital-twin context |
| V2X Protocols Airside | Cooperative message and fallback requirements |
| Cooperative V2X E2E Driving | Cooperative benchmark track |
| VLM/VLA Reliability Benchmarks | Language reasoning and hallucination tests |
| Failure Modes Analysis | Safety evidence and failure taxonomy context |

Sources

Compiled from publicly available research notes and sources.