
SLAM Map Benchmark Protocol

Last updated: 2026-05-09

Purpose

This protocol defines a repeatable benchmark for SLAM, localization, map construction, and map-update pipelines used by airside autonomous ground vehicles. It combines public SLAM benchmarks for comparability with internal airside datasets for release evidence.

The benchmark supports the perception-SLAM evidence case, statistical validity protocol, uncertainty calibration release gates, corruption and fault injection protocol, and airside dynamic map cleaning benchmark.

Benchmark Tiers

| Tier | Dataset type | Purpose | Release use |
| --- | --- | --- | --- |
| B0 smoke | Short internal routes and synthetic checks | Detect pipeline/config regressions quickly | Required for every build |
| B1 public comparability | KITTI, TUM RGB-D, Oxford RobotCar, Boreas, SLAMBench-compatible data | Compare against known methods and stress basic generalization | Supporting evidence only |
| B2 airside replay | Logged airport routes with ground truth and labels | Measure ODD-relevant performance | Required for release |
| B3 closed-course | Instrumented test track with fixtures/FOD/GSE/aircraft mockups | Measure safety-critical geometry and edge cases | Required for new ODD or major change |
| B4 shadow mode | Real operational route exposure under supervision | Confirm operational distribution and long-tail alerts | Required before autonomous expansion |
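
The tier-to-release mapping above can be encoded as a small gating table. This is a minimal sketch, assuming hypothetical change-type keys; only the tier names come from the table.

```python
# Hypothetical mapping from change type to required benchmark tiers.
# Tier names follow the benchmark-tier table; the change-type keys
# are illustrative assumptions, not part of the protocol.
REQUIRED_TIERS = {
    "every_build": ["B0"],
    "release": ["B0", "B2"],
    "new_odd_or_major_change": ["B0", "B2", "B3"],
    "autonomous_expansion": ["B0", "B2", "B3", "B4"],
}

def tiers_for(change_type: str) -> list[str]:
    """Return the benchmark tiers that must pass for a change type."""
    return REQUIRED_TIERS[change_type]
```

B1 is intentionally absent from the gates: per the table, public comparability runs provide supporting evidence only and never substitute for a required tier.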

Public Benchmark Anchors

| Benchmark | What it contributes | Limitation for airside release |
| --- | --- | --- |
| KITTI odometry | Standard visual/LiDAR odometry metrics across urban driving sequences | Road domain; limited airport-specific actors/weather |
| TUM RGB-D | Ground-truth visual/RGB-D SLAM and ATE/RPE evaluation tooling | Indoor/small-scale; not representative of outdoor apron geometry |
| Oxford RobotCar | Repeated route over long time, appearance change, urban dynamics | Road domain; useful for long-term localization and map aging |
| Boreas | Repeated route with seasonal/weather changes, LiDAR/radar/camera and ground truth | Road domain; strong proxy for adverse weather and multi-season drift |
| SLAMBench | Reproducible SLAM benchmarking with accuracy/performance/energy focus | Primarily research harness; adapt carefully to production stack |
| MapBench | Robustness of HD map construction under sensor corruptions | Road HD map domain; useful for corruption thinking |

Public datasets cannot prove airside safety. They are used to catch generic regressions, maintain reproducibility, and compare algorithms before internal airside release testing.

Internal Airside Dataset Requirements

| Dataset slice | Minimum content |
| --- | --- |
| Route repeats | Same route across day/night, dry/wet, quiet/busy operations |
| Stand pairs | Aircraft absent/present, GSE staged/removed, chocks/cones/FOD present/absent |
| Depot changes | Frequent temporary-object changes and parked-fleet clutter |
| Taxiway crossing support | Weak-feature open areas, geofence boundaries, clearance-state context |
| Weather | Heavy rain/wet surface if in ODD; fog/snow/ice/de-icing if in ODD |
| Ground truth | Survey, RTK/INS, total station, overhead tracking, or human-adjudicated labels |
| Map lifecycle | Source traversals, map build date, tile hashes, reviewer decisions, publication state |

Metrics

Localization and SLAM

| Metric | Definition | Report by |
| --- | --- | --- |
| ATE | Absolute trajectory error after alignment appropriate to use case | Route, zone, weather, map age |
| RPE | Relative pose error over fixed distance/time windows | Speed, turn class, surface |
| Drift rate | Translation/yaw error per 100 m or per minute | Feature density and GNSS status |
| Loop-closure error | Residual before/after loop closure and wrong-loop incidence | Route repeat and map tile |
| Relocalization success | Recovery after deliberate localization loss or start from unknown pose | Zone and initial uncertainty |
| Localization availability | Time pose remains valid inside error envelope | Mission and ODD slice |
| Runtime | CPU/GPU/memory/latency and dropped-frame rate | Hardware config and logging tier |
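
Two of the metrics above can be sketched concretely. The snippet below is a minimal illustration, assuming 2D (x, y) trajectories that are already time-synchronized and rigidly aligned (e.g. via a Umeyama fit); it is not the production evaluator.

```python
import math

def ate_rmse(est, ref):
    """Root-mean-square absolute trajectory error between an estimated
    trajectory and an already-aligned reference. Points are (x, y) tuples."""
    sq = [(ex - rx) ** 2 + (ey - ry) ** 2
          for (ex, ey), (rx, ry) in zip(est, ref)]
    return math.sqrt(sum(sq) / len(sq))

def drift_per_100m(est, ref):
    """Translation drift rate: final-pose error normalized per 100 m of
    reference path length, one simple form of the drift-rate metric."""
    (ex, ey), (rx, ry) = est[-1], ref[-1]
    end_err = math.hypot(ex - rx, ey - ry)
    length = sum(math.hypot(bx - ax, by - ay)
                 for (ax, ay), (bx, by) in zip(ref, ref[1:]))
    return 100.0 * end_err / length
```

In practice both numbers would be sliced by the "Report by" columns above (route, zone, weather, feature density) rather than reported as a single aggregate.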

Map Quality

| Metric | Definition | Safety use |
| --- | --- | --- |
| Map alignment error | Difference between map features and surveyed/reference geometry | Protects geofence and path alignment |
| Static preservation | Valid permanent features retained | Prevents loss of localization anchors |
| Dynamic rejection | Dynamic actor points excluded from permanent static layer | Prevents ghosts |
| False-free-space rate | Occupied/hazardous space marked traversable | Critical release blocker |
| Unknown-space conservatism | Correctly marks insufficiently observed areas unknown | Prevents optimistic maps |
| Movable-static routing | Temporary objects sent to review/quarantine | Prevents unsafe map publication |
| Tile consistency | Seam continuity, frame consistency, no duplicate/stale tile | Protects runtime map lookup |
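
The false-free-space rate, the critical release blocker in the table above, reduces to a cell-wise comparison of the candidate map against labeled ground truth. A minimal sketch, assuming a hypothetical string encoding of cell states:

```python
def false_free_space_rate(map_grid, truth_grid):
    """Fraction of ground-truth occupied/hazardous cells that the
    candidate map marks traversable. Grids are equal-size 2D lists of
    'free', 'occupied', or 'unknown'; this cell encoding is an
    assumption for illustration."""
    hazard = false_free = 0
    for m_row, t_row in zip(map_grid, truth_grid):
        for m, t in zip(m_row, t_row):
            if t == "occupied":
                hazard += 1
                if m == "free":
                    false_free += 1
    return false_free / hazard if hazard else 0.0
```

Note that cells the map marks "unknown" do not count as false free space; unknown-space conservatism is scored separately.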

Semantic and Safety Layers

| Layer | Required checks |
| --- | --- |
| Permanent static | Buildings, poles, curbs, terminal edges, fixed markings, fixtures |
| Movable-static | Cones, barriers, parked carts, parked aircraft/GSE, chocks |
| Dynamic | People, moving GSE, aircraft movement, service vehicles |
| FOD/hazard | Small objects preserved as current hazards, not cleaned away |
| Geofence/route | No mismatch between map, route graph, and restricted zones |
| Unknown/review | Ambiguous regions are not promoted to free space |

Benchmark Procedure

  1. Freeze candidate build, map package, calibration, and benchmark manifest.
  2. Run B0 smoke checks on every build.
  3. Run B1 public benchmark suite for algorithm comparability and regression detection.
  4. Run B2 airside replay using locked partitions and pre-defined ODD slices.
  5. Run B3 closed-course tests for critical geometry, FOD, temporary objects, and sensor degradation.
  6. Run B4 shadow-mode route exposure for operational confirmation.
  7. Produce metric report, failure packets, and release recommendation.
  8. Quarantine any map tile with unresolved critical defects.
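
The eight steps above can be sketched as a gated pipeline. This is an illustrative skeleton, assuming hypothetical per-tier runner callables and defect records; it shows the control flow, not the real harness.

```python
def run_benchmark(manifest, runners):
    """Run tiers in order, collect critical defects as failure packets,
    quarantine tiles with unresolved critical defects, and emit a
    release recommendation. `runners` maps tier name to a callable
    returning {"passed": bool, "defects": [...]} (an assumed shape)."""
    results, critical = {}, []
    for tier in ("B0", "B1", "B2", "B3", "B4"):
        result = runners[tier](manifest)
        results[tier] = result
        critical += [d for d in result["defects"] if d["critical"]]
    quarantined = {d["tile"] for d in critical if "tile" in d}
    recommendation = "block" if critical else "pass"
    return {"results": results,
            "quarantined_tiles": quarantined,
            "recommendation": recommendation}
```

A real driver would also honor the frozen manifest (step 1) and stop early when B0 fails, since later tiers are meaningless on a broken pipeline.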

Statistical Decision Rules

Use the perception-SLAM statistical validity protocol for confidence intervals and sample independence. Benchmark-specific rules:

| Decision | Rule |
| --- | --- |
| Public benchmark regression | Candidate must not regress beyond pre-set tolerance against production baseline |
| Airside release | Each critical ODD slice must pass; aggregate pass is insufficient |
| Map tile publication | Tile passes only if source traversals, geometry, semantic layers, and review status pass |
| New airport | Treat as new B2/B3/B4 campaign; do not rely on another airport's sample counts |
| Inconclusive slice | Release excludes that slice or campaign continues |
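
The per-slice airside rule above is easy to get wrong if results are aggregated first. A minimal sketch of the intended logic, assuming a hypothetical encoding of slice outcomes as 'pass', 'fail', or 'inconclusive':

```python
def airside_release_decision(slice_results):
    """Gate release slice by slice: any failing critical ODD slice
    blocks, and inconclusive slices are excluded via ODD restriction
    rather than averaged away. Input maps slice name to 'pass',
    'fail', or 'inconclusive' (an assumed encoding)."""
    if any(v == "fail" for v in slice_results.values()):
        return "block"
    inconclusive = sorted(s for s, v in slice_results.items()
                          if v == "inconclusive")
    if inconclusive:
        return "pass with ODD restriction: exclude " + ", ".join(inconclusive)
    return "pass"
```

Note there is deliberately no averaging across slices: a strong day/dry result can never compensate for a failing night/wet slice.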

Failure Modes and Diagnostics

| Failure mode | Diagnostic |
| --- | --- |
| Wrong global alignment | ATE spike, geofence mismatch, survey residual |
| Local drift in weak-feature area | RPE/drift rate by feature density |
| Wrong loop closure | Topological inconsistency, residual jump, route discontinuity |
| Dynamic object ghost | Aircraft/GSE/person points in permanent layer |
| Map changed after survey | Scan-to-map residual trend and map-change detector |
| False-free-space | Occupancy/semantic layer comparison against labels/fixtures |
| Over-cleaning small hazards | FOD/chock/cone missing from hazard/current-world layer |
| Runtime overload | Latency, dropped frames, stale pose consumed by planner |

Evidence Artifacts

| Artifact | Contents |
| --- | --- |
| Benchmark manifest | Build/map/calibration IDs, dataset partitions, route/tile list |
| Ground-truth package | Survey files, RTK/INS logs, label files, uncertainty model |
| Metric report | Tables, plots, confidence intervals, public and internal benchmark results |
| Map QA report | Tile status, semantic-layer checks, reviewer decisions |
| Failure packet | Reproducible log slice, seed/config, expected vs actual, defect ID |
| Runtime report | Latency, memory, CPU/GPU, dropped frames, watchdog events |
| Release recommendation | Pass, pass with ODD restriction, inconclusive, or block |
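
For concreteness, a benchmark manifest matching the first artifact row might serialize as below. Every field name and value here is a hypothetical example, not a real identifier from the protocol.

```python
import json

# Illustrative manifest only; IDs, partition names, routes, and tiles
# are invented placeholders.
manifest = {
    "build_id": "slam-stack-candidate",        # frozen candidate build
    "map_package": "apron-map-tiles",          # frozen map package
    "calibration_id": "sensor-calib-set",      # locked calibration bundle
    "dataset_partitions": {"B2": "airside-replay-locked"},
    "routes": ["stand-loop", "taxiway-crossing"],
    "tiles": ["tile-07", "tile-12"],
}
print(json.dumps(manifest, indent=2, sort_keys=True))
```

Freezing this file in step 1 of the procedure is what makes later B2 replays and failure packets reproducible against the same build/map/calibration triple.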

Owner Handoffs

| Owner | Responsibility |
| --- | --- |
| Benchmark owner | Manifest, harness, reproducibility, metric report |
| Mapping owner | Map build, tile QA, source traversals, publication readiness |
| Perception/SLAM owner | Candidate stack, metrics, root cause analysis |
| Data platform owner | Dataset curation, partition locks, storage, metadata |
| Safety lead | Critical thresholds and release interpretation |
| Fleet operations | Shadow-mode execution and route/airport exposure |

Sources

Public research notes collected from public sources.