
Perception-SLAM Leaderboard Interpretation

Last updated: 2026-05-09

Purpose

Public perception and SLAM leaderboards are useful for comparability, regression detection, and method selection. They are not release evidence by themselves for airside autonomy. This guide explains how to interpret KITTI, nuScenes, Waymo Open Dataset, OpenAD/OOD, Hilti, and internal airside benchmark metrics when deciding whether a perception-SLAM stack is ready for deployment.

Interpretation Rules

  1. Treat public leaderboard scores as B1 evidence: useful, reproducible, and comparable, but not ODD-specific.
  2. Require internal airside B2/B3/B4 evidence before release.
  3. Compare the candidate against the current production baseline under the same code, inputs, runtime, and post-processing where possible.
  4. Report metrics by ODD slice, class, range, weather, lighting, map age, sensor kit, and route zone (see the per-slice sketch after this list).
  5. Include runtime latency, memory, dropped frames, calibration sensitivity, and monitor actions. Accuracy without deployability is not release-ready.
  6. Never convert a single headline score into a safety claim.
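
Rules 3–5 are easiest to enforce when the comparison is computed per slice rather than as one aggregate. A minimal sketch, assuming hypothetical field names (`SliceResult`, `min_acceptable`) rather than an existing internal schema:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SliceResult:
    slice_name: str          # e.g. "night|wet|stand-area|small-object"
    candidate_score: float   # metric for the candidate build
    baseline_score: float    # same metric, same inputs, production baseline
    min_acceptable: float    # slice-specific floor agreed in safety review


def per_slice_verdicts(results: list[SliceResult]) -> dict[str, str]:
    """Return a verdict per slice; never collapse slices into one aggregate."""
    verdicts = {}
    for r in results:
        if r.candidate_score < r.min_acceptable:
            verdicts[r.slice_name] = "block"              # floor violated
        elif r.candidate_score < r.baseline_score:
            verdicts[r.slice_name] = "regression-review"  # worse than baseline
        else:
            verdicts[r.slice_name] = "pass"
    return verdicts
```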

Public Metrics and Pitfalls

| Source | Metric anchor | Useful for | Pitfall |
| --- | --- | --- | --- |
| KITTI object | AP/AOS, class IoU thresholds, easy/moderate/hard difficulty | Basic detection regression and historical comparability | Road domain, limited airside objects, small test set by modern standards |
| KITTI odometry | Translational error percent and rotational error over subsequences | Odometry drift comparability | Does not cover airside weak-feature aprons or map-change operations |
| nuScenes detection | mAP plus NDS with translation/scale/orientation/velocity/attribute errors | Multi-sensor 3D detection quality beyond AP | NDS weighting may not match the safety cost of false-free-space or aircraft clearance (see the NDS note below) |
| nuScenes tracking | AMOTA/AMOTP plus MOTA/MOTP/IDS/FP/FN | Tracking stability and identity behavior | Confidence-threshold optimization can hide safety-specific low-recall issues |
| Waymo Open Dataset | 2D/3D detection, tracking, segmentation, motion/e2e tasks | Large-scale, diverse perception comparison | Waymo states the dataset is for research, not for evaluating real-life vehicle performance |
| OpenAD | Open-world 3D object detection and corner cases | Open-set and cross-dataset capability | Still road-centric; the benchmark ontology may not include airside hazards |
| ProOOD/OOD occupancy | OOD voxel/object scoring, occupancy mIoU, AUPR/AUROC-style OOD metrics | Reasoning about unknown/novel occupancy risk | Research method evidence, not a certified monitor |
| Hilti SLAM | Multi-session SLAM across sensor constellations | Robust SLAM and calibration stress outside the road domain | Construction-site geometry differs from apron operations |
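
On the nuScenes detection pitfall: NDS combines mAP with five true-positive error terms under fixed weights (half the score comes from mAP, the remainder splits evenly across the error terms), so a build can keep a strong NDS while carrying an error that matters operationally. A minimal sketch of that weighting; the example numbers are invented:

```python
def nuscenes_nds(mAP: float, tp_errors: dict[str, float]) -> float:
    """nuScenes Detection Score: 5 parts mAP plus one part per TP error term.

    tp_errors holds mATE, mASE, mAOE, mAVE, mAAE (lower is better); each
    term contributes 1 - min(1, error) to the score.
    """
    tp_terms = sum(1.0 - min(1.0, err) for err in tp_errors.values())
    return (5.0 * mAP + tp_terms) / 10.0


# A large velocity error barely moves the headline score, even though
# velocity quality can matter for clearance and downstream tracking.
print(nuscenes_nds(0.60, {"mATE": 0.3, "mASE": 0.2, "mAOE": 0.3,
                          "mAVE": 0.9, "mAAE": 0.2}))  # ≈ 0.61
```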

Release Translation

| Leaderboard observation | Release interpretation |
| --- | --- |
| Candidate improves public AP/NDS but regresses internal FOD recall | Block or restrict; airside critical slice wins |
| Candidate improves ATE but increases relocalization failures near stands | Block affected route/zone |
| Candidate improves OOD AUROC but planner does not consume unknown state | Diagnostic only; not safety evidence |
| Candidate wins latency but drops small-object recall | Require safety review; do not trade away protected-zone recall silently |
| Candidate passes aggregate metrics but fails wet/night/personnel-zone slice | Exclude slice or block release |
| Candidate score improves by less than confidence interval | Treat as inconclusive; do not claim improvement |
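
For the last row above: one way to decide whether a delta clears its uncertainty is a paired bootstrap over evaluation units. A minimal sketch, assuming per-frame (or per-scene) scores are logged for both builds on identical inputs:

```python
import random


def paired_bootstrap_delta_ci(candidate: list[float], baseline: list[float],
                              n_resamples: int = 2000,
                              alpha: float = 0.05) -> tuple[float, float]:
    """Confidence interval on mean(candidate - baseline), resampling with replacement."""
    assert len(candidate) == len(baseline), "scores must be paired on identical inputs"
    diffs = [c - b for c, b in zip(candidate, baseline)]
    means = sorted(
        sum(random.choices(diffs, k=len(diffs))) / len(diffs)
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# If the interval contains zero, report the comparison as inconclusive.
lo, hi = paired_bootstrap_delta_ci([0.81, 0.78, 0.90, 0.85], [0.80, 0.79, 0.88, 0.83])
print("improvement supported" if lo > 0 else "inconclusive")
```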

Metric Pack for Release Reviews

| Area | Required metrics |
| --- | --- |
| Object detection | AP/recall/precision by class, range, occlusion, size, route zone; critical false negatives |
| Tracking | Track fragmentation, ID switches, missed tracks, time-to-detect, persistence under occlusion |
| Free-space/occupancy | False-free-space, unknown conservatism, occupied/free/unknown confusion, protected-zone failures |
| OOD/unknown | AUROC/AUPR/FPR at target recall, unknown-object action rate, false suppression review |
| SLAM/localization | ATE, RPE, drift rate, relocalization success, integrity coverage, map tile residual |
| Map quality | Static preservation, dynamic rejection, FOD retention, map-perception disagreement |
| Robustness | Corruption/fault-injection deltas, timing skew, dropout, calibration drift, adverse weather |
| Runtime | p50/p95/p99/p99.9 latency, memory, GPU/CPU, thermal throttling, dropped frames |
| Operations | Interventions, remote assists, alert precision, operator workload, incident joins |
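
For the Runtime row, a minimal sketch of computing tail-latency percentiles and dropped-frame rate from per-frame timing logs; the field names and the nearest-rank percentile choice are illustrative, not a prescribed method:

```python
def latency_percentiles(latencies_ms: list[float]) -> dict[str, float]:
    """Nearest-rank p50/p95/p99/p99.9 over per-frame end-to-end latency."""
    ordered = sorted(latencies_ms)
    n = len(ordered)

    def pct(p: float) -> float:
        rank = min(n - 1, max(0, round(p / 100.0 * n) - 1))
        return ordered[rank]

    return {"p50": pct(50), "p95": pct(95), "p99": pct(99), "p99.9": pct(99.9)}


def dropped_frame_rate(frames_scheduled: int, frames_processed: int) -> float:
    """Fraction of scheduled frames the stack failed to process on time."""
    return 1.0 - frames_processed / frames_scheduled
```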

Scorecard Template

| Section | Required answer |
| --- | --- |
| Benchmark scope | Dataset versions, splits, routes, ODD slices, exclusions |
| Candidate | Build, model, map, calibration, config, runtime, hardware |
| Baseline | Production-compatible comparator and manifest |
| Metric deltas | Score delta, confidence interval, pass/fail by slice |
| Safety-critical failures | Event-level review, not only averages |
| Runtime impact | Latency/resource deltas and deployment feasibility |
| Monitor action | Whether uncertainty/OOD/free-space signals changed behavior |
| Recommendation | Pass, pass with ODD restriction, inconclusive, block |
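
A minimal sketch of the scorecard captured as a structured record so reviews can be aggregated and machine-checked; the field names mirror the template above and are illustrative, not an existing schema:

```python
from dataclasses import dataclass, field
from enum import Enum


class Recommendation(Enum):
    PASS = "pass"
    PASS_WITH_ODD_RESTRICTION = "pass-with-odd-restriction"
    INCONCLUSIVE = "inconclusive"
    BLOCK = "block"


@dataclass
class ReleaseScorecard:
    benchmark_scope: str                 # dataset versions, splits, routes, ODD slices, exclusions
    candidate_manifest: str              # build, model, map, calibration, config, runtime, hardware
    baseline_manifest: str               # production-compatible comparator and manifest
    metric_deltas: dict[str, float]      # score delta per slice
    slice_pass: dict[str, bool]          # pass/fail by slice, confidence-interval aware
    critical_failures: list[str]         # event-level findings, not only averages
    runtime_delta_ms: float              # latency impact at the review percentile
    monitor_action_changed: bool         # did uncertainty/OOD/free-space signals change behavior?
    recommendation: Recommendation = Recommendation.INCONCLUSIVE
    notes: list[str] = field(default_factory=list)
```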

Anti-Patterns

  • Reporting only mAP/NDS while ignoring false-free-space and critical false negatives.
  • Comparing a public leaderboard result to an internal model with different sensors, runtime, or post-processing.
  • Treating open-world/OOD benchmark performance as proof that all unknown airside objects are safe.
  • Averaging across airports or routes when commissioning a new site.
  • Hiding a runtime regression behind an accuracy improvement.
  • Tuning thresholds on the locked test set or incident replay without a new validation split (see the sketch below).
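
On that last anti-pattern, a minimal illustration of the intended split discipline: thresholds are selected on a validation split and the locked test split is scored exactly once. The helper names and score convention are hypothetical:

```python
def recall_at_threshold(scores: list[float], labels: list[int], thr: float) -> float:
    """Recall over positive labels when detections with score >= thr are kept."""
    positives = sum(labels)
    hits = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= thr)
    return hits / positives if positives else 0.0


def choose_and_lock_threshold(val: tuple[list[float], list[int]],
                              test: tuple[list[float], list[int]],
                              candidates: list[float]) -> tuple[float, float]:
    """Tune on the validation split only, then report once on the locked test split."""
    best_thr = max(candidates, key=lambda t: recall_at_threshold(*val, t))
    return best_thr, recall_at_threshold(*test, best_thr)
```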

Related Documents

  • 60-safety-validation/verification-validation/slam-map-benchmark-protocol.md
  • 60-safety-validation/verification-validation/perception-slam-statistical-validity-protocol.md
  • 60-safety-validation/verification-validation/airside-dynamic-map-cleaning-benchmark.md
  • 60-safety-validation/verification-validation/multi-sensor-calibration-release-benchmark.md
  • 30-autonomy-stack/perception/datasets-benchmarks/nuscenes-waymo-practical-guide.md
  • 30-autonomy-stack/localization-mapping/slam-methods/benchmarking-metrics-datasets.md

Sources

Compiled from publicly available research notes.