Evaluation Benchmarks for End-to-End Driving: NAVSIM, Bench2Drive, and Airside Transfer

Key Takeaway: End-to-end driving systems should not be judged only by open-loop trajectory imitation. NAVSIM and Bench2Drive represent two complementary directions: NAVSIM scales real-data evaluation with non-reactive and pseudo-simulation metrics, while Bench2Drive stresses policies in closed-loop CARLA routes with multi-ability scenarios. For airside autonomy, the right pattern is a three-tier suite: logged-data pseudo-simulation for fast iteration, closed-loop digital-twin routes for interaction safety, and controlled-site tests for regulatory evidence.


Why This Matters

Open-loop metrics such as L2 trajectory error and collision checks against logged actors are useful regression tests, but they do not measure distribution-shift recovery, error compounding, deadlock, comfort, or interaction with agents that respond to the ego vehicle. End-to-end driving models, VLA planners, and world-model planners need evaluation that answers four questions:

  1. Does the predicted trajectory remain legal and safe when it deviates from the human log?
  2. Does the model make progress without exploiting metric loopholes such as stopping forever?
  3. Does the model recover from slightly off-expert states?
  4. Does the benchmark expose which capability failed: perception, mapping, interaction, rule compliance, comfort, or control?

NAVSIM and Bench2Drive are the most useful open references for building that evaluation discipline today.
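For contrast, the open-loop baseline these questions critique is usually just displacement error between predicted and logged waypoints. A minimal sketch in Python (array shapes and function names are illustrative, not from either benchmark's API):

```python
import numpy as np

def average_displacement_error(pred: np.ndarray, logged: np.ndarray) -> float:
    """Mean L2 distance between predicted and logged waypoints, shape (T, 2)."""
    return float(np.linalg.norm(pred - logged, axis=-1).mean())

def final_displacement_error(pred: np.ndarray, logged: np.ndarray) -> float:
    """L2 distance at the final waypoint of the horizon."""
    return float(np.linalg.norm(pred[-1] - logged[-1]))
```

A policy can minimize both numbers and still fail questions 1-3, because nothing in the computation penalizes what happens once the ego state leaves the logged distribution.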


Benchmark Taxonomy

| Evaluation class | Typical examples | Strength | Main gap |
| --- | --- | --- | --- |
| Open-loop imitation | nuScenes planning, Waymo motion planning, logged ego trajectory L2 | Cheap, reproducible, works with real data | Weak correlation with closed-loop safety and progress |
| Non-reactive simulation | NAVSIM v1 PDM-style scoring | Tests a predicted path in a real scene without full simulator cost | Background agents do not respond to ego behavior |
| Pseudo-simulation | NAVSIM v2 | Adds synthetic observations near candidate trajectory endpoints; better approximates closed-loop correlation | Still not fully interactive; depends on reconstruction quality |
| Closed-loop simulation | Bench2Drive, CARLA Leaderboard, AlpaSim | Measures route completion, infractions, interaction, recovery | Sim-to-real and behavior-agent realism gaps |
| Scenario-level safety validation | NeuroNCAP-style, ISO 3450x, airside digital twins | Targets rare, high-severity hazards | Needs domain-specific scenario libraries and evidence governance |
| Real-world shadow/controlled testing | Fleet replay, safety-driver trials, AGVS airport tests | Highest fidelity | Expensive, hard to repeat, limited rare-event statistics |

NAVSIM

What It Is

NAVSIM is a data-driven AV planning benchmark intended to sit between open-loop replay and full closed-loop simulation. NAVSIM v1 introduced non-reactive simulation on real scenes. NAVSIM v2 adds pseudo-simulation, in which synthetic observations near planned trajectory endpoints support scalable metric computation without sequentially rolling out an interactive simulator.

The main branch of the official NAVSIM repository is now NAVSIM v2, used for the 2025 NAVSIM challenge. The v1 branch is still available for the original NeurIPS 2024 benchmark.

Method Taxonomy

| NAVSIM component | Role | Notes |
| --- | --- | --- |
| Real-scene input | Uses logged sensor and annotation data | Keeps the domain closer to real driving than pure CARLA |
| Planner submission | Model outputs a trajectory | Easier to evaluate than full-stack execution |
| PDM / EPDMS scoring | Scores no-collision, drivable-area compliance, progress, comfort, and related penalties | NAVSIM v2 extends the original PDM score |
| Pseudo-simulation | Uses synthetic observations near the planned trajectory | Reported to correlate better with closed-loop simulation than conventional open-loop metrics |
| Human filter / false-positive filtering | Reduces unfair penalties where the human driver also violates a rule or scene data is ambiguous | Important for fairness on real-log benchmarks |
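The aggregation structure behind PDM-style scoring is simple: hard gates multiply a weighted average of soft terms, so any gate violation zeroes the score. A minimal sketch using the v1 PDM weights reported in the NAVSIM paper (v2 EPDMS adds further terms and penalties, so treat the weights as illustrative):

```python
def pdm_style_score(
    no_collision: float,       # NC gate: 1.0 pass, 0.0 at-fault collision
    drivable_area: float,      # DAC gate: 1.0 pass, 0.0 off drivable area
    ego_progress: float,       # EP in [0, 1]
    time_to_collision: float,  # TTC term in [0, 1]
    comfort: float,            # C in [0, 1]
) -> float:
    """Gate terms multiply a weighted average of soft terms (v1-style weights)."""
    gates = no_collision * drivable_area
    weighted = (5 * ego_progress + 5 * time_to_collision + 2 * comfort) / 12
    return gates * weighted
```

The multiplicative gates are what prevent a fast, comfortable, but colliding trajectory from scoring well.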

Strengths

  • Uses real-world observations rather than fully synthetic roads.
  • Scales faster than full closed-loop simulation.
  • Rewards progress and comfort rather than only matching the expert path.
  • Exposes common E2E failure modes: off-road drift, static collisions, poor progress, and comfort violations.
  • Provides an Apache-2.0 codebase and public challenge infrastructure.

Limitations

  • It is not a full interactive simulator; other agents are only approximated.
  • It evaluates a trajectory interface, not every possible runtime integration issue.
  • Synthetic observations inherit reconstruction, rendering, and sensor-domain errors.
  • Road-driving maps and rules do not directly cover warehouses, yards, ports, or airport aprons.
  • A high NAVSIM score is useful evidence, not deployment proof.

Bench2Drive

What It Is

Bench2Drive is a closed-loop CARLA-based benchmark for multi-ability evaluation of end-to-end autonomous driving. Its project page describes 2 million fully annotated frames collected from 13,638 clips by the Think2Drive world-model RL expert, distributed across 44 interactive scenarios, 23 weather settings, and 12 towns. The evaluation protocol uses 220 routes to disentangle driving capabilities under different situations.

Method Taxonomy

| Bench2Drive component | Role | Notes |
| --- | --- | --- |
| World-model RL expert | Generates training clips and expert behavior | Makes the dataset richer than hand-authored fixed routes |
| CARLA Leaderboard v2 environment | Runs closed-loop routes | Lets ego actions affect future observations |
| Multi-sensor observations | Cameras, LiDAR, radar, IMU/GNSS, maps, and annotations | Useful for camera-only, fusion, VLA, and privileged baselines |
| Interactive scenarios | Cut-in, overtaking, detour, construction, door opening, parking, etc. | Tests multiple driving abilities rather than a single route score |
| Driving Score / success / ability metrics | Summarize performance and per-skill breakdown | Better diagnostic value than a single L2 number |
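The headline Driving Score follows the CARLA Leaderboard convention: route completion scaled by multiplicative infraction penalties, reported alongside success rate and per-ability breakdowns. A sketch of that convention (penalty coefficients below follow common CARLA Leaderboard values and are illustrative here):

```python
from collections import Counter

# Illustrative infraction penalty coefficients in the CARLA Leaderboard style.
PENALTIES = {
    "collision_pedestrian": 0.50,
    "collision_vehicle": 0.60,
    "collision_static": 0.65,
    "red_light": 0.70,
    "stop_sign": 0.80,
}

def driving_score(route_completion: float, infractions: Counter) -> float:
    """Route completion in [0, 1], scaled by multiplicative infraction penalties."""
    penalty = 1.0
    for name, count in infractions.items():
        penalty *= PENALTIES.get(name, 1.0) ** count
    return 100.0 * route_completion * penalty
```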

Strengths

  • Measures closed-loop behavior directly.
  • Covers many interactive scenarios and weather conditions.
  • Provides rich annotations for perception, mapping, and planning analysis.
  • Includes radar in the sensor suite, which is valuable for adverse-condition and fusion research.
  • Lets methods fail through runtime decisions, not only through logged trajectory mismatch.

Limitations

  • CARLA behavior and sensor models still differ from real domains.
  • Route completion can hide narrow safety-margin failures unless scenario metrics are inspected.
  • Simulator crashes and runtime variability can affect evaluation cost.
  • Road scenario taxonomy does not include stands, aircraft, GSE, jet blast, FOD, marshalling, or airport operations.
  • Dataset size and license constraints make full use of the benchmark heavier than NAVSIM-style evaluation.

Benchmark Comparison

| Dimension | NAVSIM v2 | Bench2Drive | Airside benchmark implication |
| --- | --- | --- | --- |
| Core paradigm | Real-data pseudo-simulation | Closed-loop simulation in CARLA | Use both: logged-airside pseudo-sim for iteration, digital-twin closed loop for safety cases |
| Interaction | Approximate, non-sequential | Interactive simulation | Airside pushback, stand entry, and GSE conflicts require closed-loop coverage |
| Cost | Lower | Higher | Use a NAVSIM-like stage in CI; reserve closed-loop for release gates |
| Main metric | EPDMS/PDM-style trajectory score | Driving Score, success rate, multi-ability metrics | Build an airside EPDMS with no-aircraft-contact and personnel-safety gates |
| Domain | Public-road driving | Public-road driving | Needs airside object classes, rules, maps, and hazards |
| Failure visibility | Good for trajectory quality | Good for runtime interaction | Combine logs, metric traces, and scenario labels |

Relevance by Domain

Generic Road AV

NAVSIM is useful as a scalable planning benchmark on real data; Bench2Drive is useful as a closed-loop stress test for interactive behavior. A production road AV program should maintain both types and report them separately.

Indoor Autonomy

Indoor AMRs and forklifts do not need road traffic-light metrics, but the pattern transfers well. Replace drivable-area compliance with aisle, dock, and safety-zone compliance; replace route progress with task progress; add near-field personnel and pallet occlusion scenarios.

Outdoor Industrial Autonomy

Yards, ports, mining sites, and campuses need mixed reactive simulation and operational-rule scoring. NAVSIM-like pseudo-simulation can score logged yard missions quickly, while Bench2Drive-like closed-loop routes should cover blind corners, trailer moves, loading-zone conflicts, and degraded GNSS.

Airside Autonomy

Airside vehicles need an airside-specific evaluation suite. Road benchmarks do not model aircraft priority, pushback swept volumes, jet blast, FOD, stand markings, marshaller signals, A-SMGCS context, or airport sponsor approval constraints. NAVSIM and Bench2Drive provide methodology, not sufficient domain coverage.


Airside Evaluation Notes

Suggested Metric Stack

| Layer | Metric family | Airside adaptation |
| --- | --- | --- |
| Safety gates | Binary or ternary multipliers | No aircraft contact, no personnel collision, no GSE collision, no runway/taxiway clearance violation, no jet-blast-zone entry |
| Route/task progress | Weighted score | Complete baggage route, stand approach, service-road crossing, tow route, or FOD retrieval task |
| Rule compliance | Penalty or gate | Aircraft right-of-way, hold-short/default-deny clearance, stand exclusion zones, speed-by-zone, marshaller/ramp-control instructions |
| Comfort/control | Weighted score | Acceleration, jerk, steering rate, load shift, dolly stability, towbar stress |
| Perception dependency | Diagnostic score | Was the failure caused by a missed aircraft/GSE/personnel/FOD detection, stale V2X, a map error, or a planner choice? |
| Robustness | Scenario variants | Rain, glare, night, fog, wet apron, de-icing residue, GPS multipath, terminal shadow, network dropout |
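Tied together, the stack above behaves like an airside EPDMS: the safety gates are binary multipliers, and progress, rule compliance, and comfort form the weighted remainder. A minimal sketch with hypothetical field names and illustrative weights:

```python
from dataclasses import dataclass

@dataclass
class AirsideEpisodeResult:
    # Hard gates: any violation zeroes the episode score.
    aircraft_contact: bool
    personnel_collision: bool
    gse_collision: bool
    clearance_violation: bool
    jet_blast_entry: bool
    # Soft terms, each normalized to [0, 1].
    task_progress: float
    rule_compliance: float
    comfort: float

def airside_score(r: AirsideEpisodeResult, weights=(6, 4, 2)) -> float:
    """EPDMS-style aggregation: safety gates multiply a weighted soft average."""
    if any([r.aircraft_contact, r.personnel_collision, r.gse_collision,
            r.clearance_violation, r.jet_blast_entry]):
        return 0.0
    wp, wr, wc = weights
    return (wp * r.task_progress + wr * r.rule_compliance + wc * r.comfort) / (wp + wr + wc)
```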

Minimum Airside Benchmark Split

| Split | Purpose | Example content |
| --- | --- | --- |
| airside_train | Model development | Normal stand transit, service-road driving, baggage tug routes |
| airside_val | Tuning and regression | Same scenario families with unseen stands and weather |
| airside_test_public | Public leaderboard and reproducibility | Sanitized routes and scenarios |
| airside_test_private | Overfitting guard | Withheld stands, aircraft types, route assignments, and edge cases |
| airside_safety_hard | Release gate | Pushback, aircraft crossing, FOD, personnel occlusion, jet blast, clearance loss |
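The split boundaries are worth encoding as data rather than convention, so leakage of withheld stands into training can be checked mechanically. A sketch with hypothetical stand IDs and scenario tags:

```python
# Hypothetical split manifest; split names mirror the table above.
SPLITS = {
    "airside_train":        {"stands": ["A1", "A2"], "scenarios": ["stand_transit", "service_road", "baggage_tug"]},
    "airside_val":          {"stands": ["B1"],       "scenarios": ["stand_transit", "service_road", "baggage_tug"]},
    "airside_test_public":  {"stands": ["C1"],       "scenarios": ["sanitized_routes"]},
    "airside_test_private": {"stands": ["D1"],       "scenarios": ["withheld"]},
    "airside_safety_hard":  {"stands": ["E1"],       "scenarios": ["pushback", "aircraft_crossing", "fod",
                                                                   "personnel_occlusion", "jet_blast", "clearance_loss"]},
}

def assert_no_stand_leakage(splits: dict) -> None:
    """Withheld stands must never appear in the training split."""
    train = set(splits["airside_train"]["stands"])
    for name in ("airside_test_private", "airside_safety_hard"):
        overlap = train & set(splits[name]["stands"])
        assert not overlap, f"stand leakage into {name}: {overlap}"
```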

Implementation Pattern

  1. Start with a NAVSIM-style logged replay evaluator for recorded airside missions.
  2. Add pseudo-sim perturbations around ego trajectory endpoints using 3DGS or neural reconstruction where available.
  3. Build closed-loop routes in CARLA/AWSIM/Isaac/airport digital twin for interaction-heavy scenarios.
  4. Score E2E policies through the same route/task interface used by modular planners (see the interface sketch after this list).
  5. Keep all metric traces, scenario tags, V2X messages, and safety-monitor events as evidence artifacts.
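Step 4 is what keeps comparisons honest: modular stacks and E2E policies must be scored through one entry point. A minimal sketch of such an interface (names and signatures are hypothetical):

```python
from typing import Callable, Protocol, Sequence

Waypoint = tuple[float, float]

class AirsidePlanner(Protocol):
    """Shared interface so E2E policies and modular stacks are scored identically."""

    def plan(self, observation: dict, route: Sequence[Waypoint]) -> list[Waypoint]:
        """Return (x, y) waypoints for the next planning horizon."""
        ...

def evaluate(planner: AirsidePlanner,
             episodes: Sequence[tuple[dict, list[Waypoint]]],
             scorer: Callable[[list[Waypoint], dict], float]) -> list[float]:
    """Run every episode through the same entry point, whatever the planner family."""
    return [scorer(planner.plan(obs, route), obs) for obs, route in episodes]
```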

Failure Modes

| Failure mode | How it appears in benchmark results | Mitigation |
| --- | --- | --- |
| Open-loop overfitting | Low L2 error but poor route completion or collision avoidance | Require NAVSIM/Bench2Drive-style progress and safety metrics |
| Stop-to-win behavior | No collisions but low progress | Use progress floors and task-completion gates |
| Simulator exploitation | Policy learns CARLA-specific artifacts | Evaluate on real logs, pseudo-sim, multiple simulators, and controlled real tests |
| Metric masking | Strong aggregate score hides an aircraft-proximity hazard | Report per-scenario and per-safety-gate breakdowns |
| Non-reactive optimism | Ego trajectory passes through space an agent would reactively occupy | Use a closed-loop digital twin for interaction-heavy cases |
| False-positive penalties | Benchmark punishes necessary deviations from unsafe human logs | Use human-filter logic and scenario review |
| Domain mismatch | Road benchmarks miss airside failure classes | Build airside classes, maps, rules, and hazard metrics |
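Two of these mitigations are mechanical enough to sketch: a progress floor that removes the stop-to-win incentive, and per-scenario aggregation that surfaces metric masking. Both functions below are illustrative:

```python
def gated_episode_score(progress: float, soft_score: float,
                        progress_floor: float = 0.7) -> float:
    """Zero the episode if progress falls below the floor, so stopping cannot win."""
    return soft_score if progress >= progress_floor else 0.0

def report_breakdown(episodes: list[dict]) -> dict:
    """Aggregate per scenario tag so a strong mean cannot mask a hazard class."""
    by_tag: dict[str, list[float]] = {}
    for ep in episodes:
        by_tag.setdefault(ep["scenario_tag"], []).append(ep["score"])
    return {tag: sum(scores) / len(scores) for tag, scores in by_tag.items()}
```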

Related Documents

| Document | Relevance |
| --- | --- |
| Evaluation Methods, Benchmarks, and Metrics | Broader benchmark and metric taxonomy |
| Airside Scenario Taxonomy | Candidate airside scenario families |
| Open-Source Simulators for Airside | Simulator choices for closed-loop airside routes |
| Neural Simulation Platforms | Neural reconstruction and world-model simulation options |
| End-to-End Architectures | E2E model families that need these benchmarks |
| End-to-End World Model Pipeline | World-model planner interfaces and metrics |
| VLM/VLA Reliability Benchmarks | Language/reasoning benchmark layer |
| Airside Autonomy Benchmark Spec | Proposed domain-specific benchmark design |
