
Airside Autonomy Benchmark and Dataset Specification

Key Takeaway: Airside autonomy needs its own benchmark. Road datasets and indoor AMR benchmarks do not cover aircraft priority, apron geometry, ground support equipment, stand operations, jet blast, FOD, marshalling, A-SMGCS context, or airport sponsor approval constraints. The practical benchmark should combine logged multi-sensor data, NAVSIM-style pseudo-simulation, closed-loop digital-twin routes, and controlled real-world evidence.


Scope and Goal

This specification defines a benchmark for autonomous ground vehicle systems operating on airport airside surfaces: aprons, stands, service roads, cargo areas, baggage routes, de-icing zones, and controlled crossings near taxiways. It is intended for generic autonomy stacks, not only one vehicle type.

Target vehicles:

  • Baggage tractors and autonomous dollies.
  • Cargo tugs and ULD movers.
  • Follow-me, inspection, FOD retrieval, and perimeter vehicles.
  • Autonomous service vehicles operating around aircraft stands.
  • Low-speed passenger or crew shuttles in restricted airside zones.

The benchmark should measure full autonomy behavior, not only perception. It must support modular stacks, end-to-end driving policies, VLA planners, cooperative V2X policies, and world-model-based planners through a common route/task interface.
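To make the common route/task interface concrete, one option is a thin policy protocol that every submission implements regardless of internal architecture. A minimal sketch; all names (RouteTask, Observation, TrajectoryPlan, DrivingPolicy) are illustrative, not an existing API:

```python
# Hypothetical route/task interface; every name here is illustrative.
from dataclasses import dataclass, field
from typing import Protocol, Sequence


@dataclass
class RouteTask:
    """One benchmark task: a route through airside zones plus its goal."""
    scenario_id: str
    route_waypoints: Sequence[tuple[float, float]]  # map-frame (x, y) in meters
    goal_zone_id: str                               # e.g. a stand or staging zone
    speed_limit_mps: float


@dataclass
class Observation:
    """Per-tick sensor and message bundle handed to the policy."""
    timestamp_ns: int
    camera_frames: dict                             # view name -> image array
    lidar_points: object                            # point cloud in ego frame
    v2x_messages: list = field(default_factory=list)


@dataclass
class TrajectoryPlan:
    """Planned ego motion: (x, y, heading, speed) samples at fixed dt."""
    states: Sequence[tuple[float, float, float, float]]
    dt_s: float


class DrivingPolicy(Protocol):
    """Interface shared by modular, E2E, VLA, and world-model stacks."""
    def reset(self, task: RouteTask) -> None: ...
    def step(self, obs: Observation) -> TrajectoryPlan: ...
```

Because the evaluator only calls reset and step, a modular stack and an end-to-end policy are scored through the identical interface.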


Benchmark Taxonomy

Evaluation Tiers

| Tier | Name | Purpose | Execution mode |
| --- | --- | --- | --- |
| T0 | Dataset quality and coverage audit | Check sensor sync, labels, scenario balance, map validity | Offline |
| T1 | Component evaluation | Perception, tracking, occupancy, prediction, VLM reasoning | Offline |
| T2 | Logged trajectory evaluation | NAVSIM-style path scoring on real missions | Offline / non-reactive |
| T3 | Pseudo-simulation | Perturb ego endpoint and render/score nearby observations | Offline / pseudo-closed-loop |
| T4 | Closed-loop digital twin | Interactive route and scenario simulation | Simulator |
| T5 | Shadow-mode replay | Run model on real vehicle logs without control authority | On-vehicle / offline replay |
| T6 | Controlled-site operational test | Safety-driver or monitor-backed airport test | Real-world evidence |

Task Families

| Task family | Examples | Required metrics |
| --- | --- | --- |
| Route transit | Depot to stand, stand to cargo, service-road loop | Progress, speed limits, route adherence, smoothness |
| Stand approach | Approach aircraft stand with active GSE and personnel | Aircraft clearance, stand zone compliance, personnel safety |
| Pushback interaction | Yield to pushback, cross after clearance, follow tow path | Aircraft priority, swept-volume avoidance, clearance state |
| GSE sequencing | Merge with baggage trains, pass belt loader, convoy with tugs | Deadlock, right-of-way, gap acceptance, V2X intent use |
| FOD encounter | Detect, avoid, report, or retrieve debris | FOD recall, false-alarm rate, stop/reroute correctness |
| Jet blast and engine hazards | Avoid active blast zones and engine intake zones | Hazard-zone compliance, fallback if no engine-state message |
| Communication-dependent operation | A-SMGCS/ramp-control instruction, V2X task update, link loss | Default-deny behavior, stale-message rejection, fallback (sketch below) |
| Adverse conditions | Rain, glare, night, fog, wet apron, de-icing residue | Robustness deltas and ODD state transitions |
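For the communication-dependent task family, default-deny and stale-message rejection are the core behaviors under test. A minimal sketch of the gating check; the message fields and the freshness threshold are assumptions, not a protocol standard:

```python
# Illustrative clearance-gating logic; field names and thresholds are assumed.
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClearanceMessage:
    clearance_id: str
    granted: bool
    issued_at_ns: int      # publisher timestamp
    valid_for_ns: int      # validity window declared by the sender


def may_proceed(msg: Optional[ClearanceMessage], now_ns: int,
                max_age_ns: int = 2_000_000_000) -> bool:
    """Default-deny: move only on a fresh, unexpired, positive clearance."""
    if msg is None:                           # link loss or no instruction yet
        return False
    age_ns = now_ns - msg.issued_at_ns
    if age_ns < 0 or age_ns > max_age_ns:     # clock skew or stale message
        return False
    if age_ns > msg.valid_for_ns:             # clearance has expired
        return False
    return msg.granted                        # explicit grant required
```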

Data Modalities

| Modality | Required? | Notes |
| --- | --- | --- |
| Surround cameras | Required | At least 6 views for E2E/VLA and inspection tasks |
| 3D LiDAR | Required | Metric obstacle and aircraft geometry |
| 4D radar | Strongly recommended | Weather/night fallback and Doppler velocity |
| IMU/GNSS/RTK | Required | Pose, speed, and localization provenance |
| Wheel odometry / steering / actuator state | Required | Control and tracking diagnostics |
| V2X messages | Required for cooperative tracks | CAM-like awareness, task assignment, clearance, stand status, jet blast, FOD |
| Airport operational feeds | Recommended | AODB/A-CDM/A-SMGCS, stand assignment, aircraft movement status |
| Maps | Required | Lanelet2 or airport graph plus AMDB-style surfaces, stands, zones, speed limits |

Dataset Schema

Scene Record

Each scenario clip should include the following (a minimal schema sketch follows the list):

  • scenario_id, airport_id, stand_or_zone_id, and route_id.
  • ODD metadata: weather, visibility, lighting, surface condition, construction/de-icing state.
  • Sensor packets with original timestamps and clock-domain provenance.
  • Ego state and control state at sensor and control rates.
  • Map version and geofence/zone version.
  • V2X and airport-system messages with publish, receive, and consume timestamps.
  • Ground-truth annotations or derived labels.
  • Safety monitor events and human intervention markers.
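A minimal scene-record sketch mirroring the list above; the field names are illustrative and would be serialized to the benchmark's on-disk format, whatever that turns out to be:

```python
# Illustrative scene-record schema; names follow the list above, not a fixed format.
from dataclasses import dataclass, field


@dataclass
class SceneRecord:
    scenario_id: str
    airport_id: str
    stand_or_zone_id: str
    route_id: str
    odd: dict                  # weather, visibility, lighting, surface, construction/de-icing
    map_version: str
    geofence_version: str
    sensor_packets: list = field(default_factory=list)   # original timestamps + clock domain
    ego_states: list = field(default_factory=list)       # sensor- and control-rate states
    v2x_messages: list = field(default_factory=list)     # publish/receive/consume timestamps
    labels: list = field(default_factory=list)           # ground truth or derived labels
    safety_events: list = field(default_factory=list)    # monitor events, interventions
```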

Required Labels

| Label type | Contents |
| --- | --- |
| 3D object boxes/tracks | Aircraft, GSE, road vehicles, personnel, FOD, cones/barriers, jetbridge, static equipment |
| Semantic occupancy | Drivable apron, stand safety envelope, no-go zone, aircraft swept area, jet blast zone, unknown/occluded |
| Map elements | Service roads, hold lines, stand markings, taxiway boundaries, crossings, parking/staging zones |
| Agent intent/state | Aircraft parked/taxi/pushback, GSE loading/unloading/reversing, personnel walking/marshalling |
| Event labels | Clearance granted/revoked, pushback start, FOD detected, intervention, e-stop, network loss |
| VLM/VLA QA labels | Safety-relevant scene questions, rule questions, instruction-following labels |

Splits

| Split | Purpose | Anti-leakage rule |
| --- | --- | --- |
| train_normal | Model training on routine operations | Exclude hard safety events |
| train_augmented | Synthetic and replay perturbations | Mark synthetic provenance explicitly |
| val_site_seen | Routine validation at a known airport/site | No route or time-clip overlap with training |
| val_site_unseen | Generalization to new stands or airport areas | Hold out whole stands/zones (sketch below) |
| test_public | Public reproducibility | Limited edge cases, fixed evaluator |
| test_private | Leaderboard overfitting guard | Hidden routes and event mix |
| test_safety_hard | Release gating | Withheld high-severity events and rare weather |
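The anti-leakage rules can be enforced mechanically, for example by holding out whole stands/zones and hashing route IDs so that every clip of a route lands in the same split. A sketch reusing the SceneRecord fields from the schema above; the split fractions are placeholders, and safety-hard events would be routed to test_safety_hard before this assignment runs:

```python
# Illustrative leakage-safe split assignment; fractions and keys are assumptions.
import hashlib


def split_for(scene: "SceneRecord", heldout_zones: set) -> str:
    """Assign a clip to a split without route/zone leakage across splits."""
    if scene.stand_or_zone_id in heldout_zones:
        return "val_site_unseen"     # whole zones held out, never trained on
    # Deterministic hash of the route keeps all clips of a route together.
    digest = hashlib.sha256(scene.route_id.encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100
    if bucket < 80:
        return "train_normal"
    if bucket < 90:
        return "val_site_seen"
    return "test_public"
```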

Metric Taxonomy

Composite Airside PDM Score

An airside PDM-style score should multiply hard safety gates by weighted quality scores (a computational sketch follows the two lists below):

AirsideScore = SafetyGatesProduct * WeightedQualityScore

Hard safety gates:

  • No aircraft contact.
  • No personnel collision.
  • No GSE/static-object collision.
  • No entry into active jet blast or engine intake zone.
  • No unauthorized hold-short/taxiway/runway-area crossing.
  • No stand safety-envelope violation.
  • No unsafe response to communication loss or stale clearance.

Weighted quality scores:

  • Task progress and route completion.
  • Aircraft/personnel/GSE clearance margins.
  • Comfort and payload stability.
  • Speed compliance by zone.
  • Rule compliance and right-of-way.
  • Operational efficiency, including avoidable delay and deadlock.
  • V2X usage correctness when available.
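A minimal sketch of the composite, assuming boolean gate outcomes and quality subscores normalized to [0, 1]; the weights below are placeholders, not calibrated values:

```python
# Illustrative AirsideScore computation; weights are placeholders.
def airside_score(gates: dict, quality: dict, weights: dict) -> float:
    """Composite score: any failed hard gate zeroes the run."""
    gates_product = 1.0
    for passed in gates.values():
        gates_product *= 1.0 if passed else 0.0
    total_w = sum(weights.values())
    weighted_quality = sum(weights[k] * quality[k] for k in weights) / total_w
    return gates_product * weighted_quality


score = airside_score(
    gates={"no_aircraft_contact": True, "no_personnel_collision": True},
    quality={"progress": 0.92, "clearance_margin": 0.85},
    weights={"progress": 0.6, "clearance_margin": 0.4},
)
```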

Component Metrics

| Component | Metrics |
| --- | --- |
| Perception | mAP/NDS-style object detection, per-class recall, FOD recall, personnel recall under occlusion |
| Occupancy | Semantic IoU, free-space precision, unknown-space calibration, future occupancy IoU |
| Tracking | MOTA/HOTA, ID switches, velocity error, track latency |
| Prediction | minADE/minFDE (sketch below), Brier-minFDE, occupancy flow error, swept-volume miss rate |
| Planning | Collision rate, progress, comfort, rule compliance, intervention rate |
| V2X | Deadline miss rate, stale-message rejection, pose/time alignment error, fallback correctness |
| VLM/VLA | Visual grounding, hallucination rate, text-only leakage, corruption robustness, action validity |
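For the prediction row, minADE/minFDE over K hypotheses are the standard displacement metrics. A NumPy sketch, assuming predictions shaped (K, T, 2) and ground truth shaped (T, 2), both in meters:

```python
# minADE / minFDE over K predicted trajectories.
import numpy as np


def min_ade_fde(pred: np.ndarray, gt: np.ndarray) -> tuple:
    """Return (minADE, minFDE) in meters across K hypotheses."""
    # Per-hypothesis, per-timestep Euclidean error: shape (K, T).
    err = np.linalg.norm(pred - gt[None], axis=-1)
    min_ade = float(err.mean(axis=1).min())   # best average displacement
    min_fde = float(err[:, -1].min())         # best final displacement
    return min_ade, min_fde
```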

Relevance by Domain

Generic AV

The benchmark creates a reusable pattern for non-road autonomy: scenario-driven evaluation with operational rules, not only lane-following. It can inform robotaxi depots, industrial campuses, ports, and private roads.

Indoor Autonomy

Indoor AMR and forklift benchmarks can borrow the task-progress, near-field personnel, communication-loss, and facility-map governance patterns. Replace aircraft/stand concepts with dock doors, aisles, pallets, and WMS tasks.

Outdoor Industrial Autonomy

Logistics yards, ports, mining roads, construction sites, and campuses share low-to-medium speed operation, mixed manual/autonomous traffic, private maps, and central dispatch. Airside evaluation patterns transfer well to their route/task scoring.

Airside Autonomy

This is the primary domain. Airside operations need explicit modeling of aircraft, turnaround state, right-of-way, ramp-control instructions, FOD, jet blast, and no-go zones. These are not optional edge cases; they are core operating constraints.


Implementation Notes

Minimum Viable Benchmark

Start with:

  • 50 to 100 hours of synchronized camera, LiDAR, radar, GNSS/IMU, odometry, and V2X logs from one controlled airside area.
  • 20 scenario types, each with at least 10 real examples or high-fidelity digital-twin variants.
  • A baseline modular stack and one E2E/VLA baseline.
  • NAVSIM-style offline trajectory scoring.
  • A closed-loop digital-twin evaluator for 5 to 10 interaction-heavy scenarios.

A minimal processing pipeline (a sync-audit sketch follows the diagram):

Raw logs + maps + V2X messages
  -> data validation and synchronization audit
  -> scenario mining and tagging
  -> component labels and derived occupancy
  -> logged replay evaluator
  -> pseudo-simulation renderer
  -> closed-loop digital twin
  -> benchmark report and evidence package
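The synchronization-audit stage (and protocol step 1 below) can start from two cheap checks: per-stream timestamp monotonicity and cross-stream start offsets. A sketch with an assumed tolerance; a non-empty failure list would reject the run before any model score is accepted:

```python
# Illustrative sync audit; the tolerance value is an assumption, not a standard.
import numpy as np


def sync_audit(stamps_ns: dict, max_offset_ns: int = 10_000_000) -> list:
    """Return a list of sync failures across sensor timestamp streams."""
    failures = []
    for name, ts in stamps_ns.items():
        if np.any(np.diff(ts) <= 0):
            failures.append(f"{name}: non-monotonic timestamps")
    # Compare stream start times against one reference stream.
    ref_name, ref_ts = next(iter(stamps_ns.items()))
    for name, ts in stamps_ns.items():
        offset = abs(int(ts[0]) - int(ref_ts[0]))
        if offset > max_offset_ns:
            failures.append(f"{name}: {offset} ns start offset vs {ref_name}")
    return failures
```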

Evaluation Protocol

  1. Validate clock sync and transforms before any model score is accepted.
  2. Run vehicle-only baseline first.
  3. Run modular autonomy baseline second.
  4. Run E2E/VLA/world-model submissions through the same route/task interface.
  5. Report aggregate score and per-scenario breakdown.
  6. Keep all safety-gate failures visible even when the composite score is high.
  7. Publish model input assumptions: sensors, maps, V2X, privileged labels, and external data (a declaration sketch follows this list).
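The input assumptions in step 7 can be a machine-checkable declaration that the evaluator audits against each track's allowed inputs. A sketch with assumed field names:

```python
# Illustrative input declaration and audit; field names are assumptions.
from dataclasses import dataclass, field


@dataclass
class InputDeclaration:
    sensors: set = field(default_factory=set)      # e.g. {"camera", "lidar"}
    uses_hd_map: bool = False
    uses_v2x: bool = False
    privileged_labels: set = field(default_factory=set)
    external_data: list = field(default_factory=list)


def audit_inputs(decl: InputDeclaration, allowed_sensors: set,
                 track_allows_v2x: bool) -> list:
    """Return audit violations; any violation invalidates the run."""
    violations = []
    if not decl.sensors <= allowed_sensors:
        violations.append(f"undeclared sensors: {decl.sensors - allowed_sensors}")
    if decl.uses_v2x and not track_allows_v2x:
        violations.append("V2X used on a non-cooperative track")
    if decl.privileged_labels:
        violations.append(f"privileged labels at runtime: {decl.privileged_labels}")
    return violations
```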

Evidence Artifacts

Each evaluation run should emit:

  • Scenario manifest and map version.
  • Model version, config, weights, and input modalities.
  • Metric JSON and human-readable summary.
  • Failure clips and event timeline.
  • Sensor/V2X latency histogram.
  • Safety monitor state and intervention records.
  • ODD assumptions and exclusions.

Failure Modes

| Failure mode | Benchmark risk | Control |
| --- | --- | --- |
| Road benchmark transfer overclaims | Strong NAVSIM/Bench2Drive result is mistaken for airside readiness | Require airside-specific scenario gates |
| Label sparsity for FOD/personnel | Small objects and occluded workers are underrepresented | Oversample safety-hard scenarios and use targeted annotation |
| Synthetic-domain bias | Digital twin makes policies overfit clean textures or scripted agents | Mix real logs, neural reconstructions, and domain-randomized variants |
| Overreliance on V2X | Model behaves unsafely when messages are delayed or missing | Include V2X-loss and stale-message tests |
| Stop-to-win | Model avoids infractions by excessive stopping | Add task-progress and operational-delay metrics |
| Hidden map leakage | Model sees future route, clearance, or labels unavailable at runtime | Enforce input-modality declarations and evaluator audits |
| Poor clock/transform provenance | Cooperative and multi-sensor scores are invalid | Reject runs with sync/transform QA failures |
| Aggregate metric masking | Aircraft or personnel failures are hidden by high route completion | Report safety-gate failures separately |

Related Documents

| Document | Relevance |
| --- | --- |
| Evaluation Benchmarks: NAVSIM and Bench2Drive | Method sources for pseudo-sim and closed-loop scoring |
| Airside Scenario Taxonomy | Scenario library seed |
| Evaluation Methods and Metrics | General metrics background |
| Simulators for Airside | Candidate simulator stack |
| Neural Simulation Platforms | Neural reconstruction and generative sim |
| Airport Digital Twins | Airside digital-twin context |
| V2X Protocols Airside | Cooperative message and fallback requirements |
| Cooperative V2X E2E Driving | Cooperative benchmark track |
| VLM/VLA Reliability Benchmarks | Language reasoning and hallucination tests |
| Failure Modes Analysis | Safety evidence and failure taxonomy context |

Sources

Compiled from publicly available research notes and sources.