SLAM Map Benchmark Protocol
Last updated: 2026-05-09
Purpose
This protocol defines a repeatable benchmark for SLAM, localization, map construction, and map-update pipelines used by airside autonomous ground vehicles. It combines public SLAM benchmarks for comparability with internal airside datasets for release evidence.
The benchmark supports the perception-SLAM evidence case, statistical validity protocol, uncertainty calibration release gates, corruption and fault injection protocol, and airside dynamic map cleaning benchmark.
Benchmark Tiers
| Tier | Dataset type | Purpose | Release use |
|---|---|---|---|
| B0 smoke | Short internal routes and synthetic checks | Detect pipeline/config regressions quickly | Required for every build |
| B1 public comparability | KITTI, TUM RGB-D, Oxford RobotCar, Boreas, SLAMBench-compatible data | Compare against known methods and stress basic generalization | Supporting evidence only |
| B2 airside replay | Logged airport routes with ground truth and labels | Measure ODD-relevant performance | Required for release |
| B3 closed-course | Instrumented test track with fixtures/FOD/GSE/aircraft mockups | Measure safety-critical geometry and edge cases | Required for new ODD or major change |
| B4 shadow mode | Real operational route exposure under supervision | Confirm operational distribution and long-tail alerts | Required before autonomous expansion |
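The tier table implies a gate per change category. A minimal sketch of that mapping, assuming the category names below (the helper and its labels are illustrative, not part of the protocol):

```python
# Which tiers a candidate must pass, per the tier table.
# B1 is "supporting evidence only", so it is deliberately absent
# from the required sets even though it runs on release candidates.
REQUIRED_TIERS = {
    "every_build": {"B0"},
    "release": {"B0", "B2"},
    "new_odd_or_major_change": {"B0", "B2", "B3"},
    "autonomous_expansion": {"B0", "B2", "B3", "B4"},
}

def required_tiers(change_type: str) -> set[str]:
    """Return the benchmark tiers that are release-blocking for this change."""
    if change_type not in REQUIRED_TIERS:
        raise ValueError(f"unknown change type: {change_type}")
    return REQUIRED_TIERS[change_type]
```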
Public Benchmark Anchors
| Benchmark | What it contributes | Limitation for airside release |
|---|---|---|
| KITTI odometry | Standard visual/LiDAR odometry metrics across urban driving sequences | Road domain; limited airport-specific actors/weather |
| TUM RGB-D | Ground-truth visual/RGB-D SLAM and ATE/RPE evaluation tooling | Indoor/small-scale; not representative of outdoor apron geometry |
| Oxford RobotCar | Repeated route over long time, appearance change, urban dynamics | Road domain; useful for long-term localization and map aging |
| Boreas | Repeated route with seasonal/weather changes, LiDAR/radar/camera and ground truth | Road domain; strong proxy for adverse weather and multi-season drift |
| SLAMBench | Reproducible SLAM benchmarking with accuracy/performance/energy focus | Primarily research harness; adapt carefully to production stack |
| MapBench | Robustness of HD map construction under sensor corruptions | Road HD map domain; useful for corruption thinking |
Public datasets cannot prove airside safety. They are used to catch generic regressions, maintain reproducibility, and compare algorithms before internal airside release testing.
Internal Airside Dataset Requirements
| Dataset slice | Minimum content |
|---|---|
| Route repeats | Same route across day/night, dry/wet, quiet/busy operations |
| Stand pairs | Aircraft absent/present, GSE staged/removed, chocks/cones/FOD present/absent |
| Depot changes | Frequent temporary-object changes and parked-fleet clutter |
| Taxiway crossing support | Weak-feature open areas, geofence boundaries, clearance-state context |
| Weather | Heavy rain/wet surface if in ODD; fog/snow/ice/de-icing if in ODD |
| Ground truth | Survey, RTK/INS, total station, overhead tracking, or human-adjudicated labels |
| Map lifecycle | Source traversals, map build date, tile hashes, reviewer decisions, publication state |
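A coverage check over the slice requirements above can be sketched as a predicate table against logged-run metadata. Attribute names (`tod`, `surface`) and the predicate encoding are assumptions for illustration:

```python
def missing_slices(runs, required):
    """Return dataset slices with no covering run.

    runs: list of dicts of run attributes (schema is illustrative).
    required: dict mapping slice name -> predicate over a run dict.
    A slice is covered if at least one logged run satisfies its predicate;
    uncovered slices tell the campaign where to collect more data.
    """
    return [name for name, predicate in required.items()
            if not any(predicate(run) for run in runs)]
```

In practice each row of the requirements table would expand into several predicates (e.g. route repeats need both day and night coverage on the same route), but the check pattern is the same.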
Metrics
Localization and SLAM
| Metric | Definition | Report by |
|---|---|---|
| ATE | Absolute trajectory error after alignment appropriate to use case | Route, zone, weather, map age |
| RPE | Relative pose error over fixed distance/time windows | Speed, turn class, surface |
| Drift rate | Translation/yaw error per 100 m or per minute | Feature density and GNSS status |
| Loop-closure error | Residual before/after loop closure and wrong-loop incidence | Route repeat and map tile |
| Relocalization success | Recovery after deliberate localization loss or start from unknown pose | Zone and initial uncertainty |
| Localization availability | Fraction of mission time the pose remains valid inside the error envelope | Mission and ODD slice |
| Runtime | CPU/GPU/memory/latency and dropped-frame rate | Hardware config and logging tier |
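The ATE and drift-rate rows can be made concrete with a small sketch. This assumes 2-D translational poses that are already time-associated and aligned (the alignment choice depends on the use case, as the table notes); a production harness would handle SE(3) poses and alignment explicitly:

```python
import math

def ate_rmse(gt, est):
    """Absolute trajectory error: RMSE of translational residuals between
    ground-truth and estimated poses, assuming prior time association
    and alignment."""
    assert len(gt) == len(est) and gt, "trajectories must be paired"
    sq = [(gx - ex) ** 2 + (gy - ey) ** 2
          for (gx, gy), (ex, ey) in zip(gt, est)]
    return math.sqrt(sum(sq) / len(sq))

def drift_per_100m(gt, est):
    """Translation drift per 100 m: end-pose error normalized by the
    ground-truth path length (the per-minute variant substitutes time)."""
    length = sum(math.dist(a, b) for a, b in zip(gt, gt[1:]))
    end_error = math.dist(gt[-1], est[-1])
    return 100.0 * end_error / length if length > 0 else 0.0
```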
Map Quality
| Metric | Definition | Safety use |
|---|---|---|
| Map alignment error | Difference between map features and surveyed/reference geometry | Protects geofence and path alignment |
| Static preservation | Valid permanent features retained | Prevents loss of localization anchors |
| Dynamic rejection | Dynamic actor points excluded from permanent static layer | Prevents ghosts |
| False-free-space rate | Occupied/hazardous space marked traversable | Critical release blocker |
| Unknown-space conservatism | Correctly marks insufficiently observed areas unknown | Prevents optimistic maps |
| Movable-static routing | Temporary objects sent to review/quarantine | Prevents unsafe map publication |
| Tile consistency | Seam continuity, frame consistency, no duplicate/stale tile | Protects runtime map lookup |
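Because false-free-space is a critical release blocker, its computation should be unambiguous. A minimal sketch over labeled occupancy cells, assuming a three-state cell encoding (the dict-of-cells layout is illustrative):

```python
def false_free_space_rate(predicted, reference):
    """Fraction of reference-occupied cells the candidate map marks free.

    predicted/reference: dicts mapping cell coordinates to a state in
    {"free", "occupied", "unknown"}. A cell missing from the candidate
    map is treated as unknown, which is the conservative reading.
    """
    occupied = [cell for cell, state in reference.items()
                if state == "occupied"]
    if not occupied:
        return 0.0
    false_free = sum(1 for cell in occupied
                     if predicted.get(cell, "unknown") == "free")
    return false_free / len(occupied)
```

Unknown-space conservatism is the mirror check: reference-unobserved cells must not appear as `"free"` in the candidate map.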
Semantic and Safety Layers
| Layer | Required checks |
|---|---|
| Permanent static | Buildings, poles, curbs, terminal edges, fixed markings, fixtures |
| Movable-static | Cones, barriers, parked carts, parked aircraft/GSE, chocks |
| Dynamic | People, moving GSE, aircraft movement, service vehicles |
| FOD/hazard | Small objects preserved as current hazards, not cleaned away |
| Geofence/route | No mismatch between map, route graph, and restricted zones |
| Unknown/review | Ambiguous regions are not promoted to free space |
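The layer rules above amount to a routing decision per classified object. A sketch, assuming the layer names from the table and an illustrative confidence threshold (the return labels and threshold are assumptions, not protocol values):

```python
def route_detection(layer: str, confidence: float,
                    publish_threshold: float = 0.9) -> str:
    """Route a classified map object per the semantic-layer rules.

    Dynamic actors never enter the static map; movable-static objects
    go to review rather than auto-publication; FOD stays in the current
    hazard layer; only high-confidence permanent statics publish, and
    anything ambiguous is marked unknown, never free space.
    """
    if layer == "dynamic":
        return "exclude_from_static_map"
    if layer == "movable_static":
        return "quarantine_for_review"
    if layer == "fod_hazard":
        return "keep_in_current_hazard_layer"
    if layer == "permanent_static" and confidence >= publish_threshold:
        return "publish_to_static_layer"
    return "mark_unknown"
```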
Benchmark Procedure
- Freeze candidate build, map package, calibration, and benchmark manifest.
- Run B0 smoke checks on every build.
- Run B1 public benchmark suite for algorithm comparability and regression detection.
- Run B2 airside replay using locked partitions and pre-defined ODD slices.
- Run B3 closed-course tests for critical geometry, FOD, temporary objects, and sensor degradation.
- Run B4 shadow-mode route exposure for operational confirmation.
- Produce metric report, failure packets, and release recommendation.
- Quarantine any map tile with unresolved critical defects.
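The procedure above can be sketched as a driver that runs tiers in order, stops escalating past a failed tier, and quarantines defective tiles along the way. The callable signatures (`run_tier` returning a pass flag plus tiles with unresolved critical defects) are illustrative assumptions:

```python
def run_campaign(tiers, run_tier, quarantine_tile):
    """Execute benchmark tiers in escalation order.

    tiers: ordered tier names, e.g. ["B0", "B1", "B2", "B3", "B4"].
    run_tier: callable(tier) -> (passed, critical_defect_tiles).
    quarantine_tile: callable invoked for every tile that must not
    be published until its critical defect is resolved.
    Returns {tier: passed} for the tiers actually run.
    """
    report = {}
    for tier in tiers:
        passed, critical_tiles = run_tier(tier)
        report[tier] = passed
        for tile in critical_tiles:
            quarantine_tile(tile)
        if not passed:
            break  # do not escalate past a failed tier
    return report
```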
Statistical Decision Rules
Use the perception-SLAM statistical validity protocol for confidence intervals and sample independence. Benchmark-specific rules:
| Decision | Rule |
|---|---|
| Public benchmark regression | Candidate must not regress beyond pre-set tolerance against production baseline |
| Airside release | Each critical ODD slice must pass; aggregate pass is insufficient |
| Map tile publication | Tile passes only if source traversals, geometry, semantic layers, and review status pass |
| New airport | Treat as new B2/B3/B4 campaign; do not rely on another airport's sample counts |
| Inconclusive slice | Release excludes that slice or campaign continues |
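The airside-release and inconclusive-slice rules compose into one decision function: every critical ODD slice must pass on its own, and inconclusive slices either restrict the ODD or block until the campaign resolves them. The outcome labels mirror the release-recommendation values later in this document; the encoding itself is a sketch:

```python
def airside_release_decision(slices):
    """Per-slice release gating; an aggregate pass is never sufficient.

    slices: dict mapping critical ODD slice name to one of
    "pass" | "fail" | "inconclusive".
    Returns (recommendation, excluded_slices).
    """
    if any(state == "fail" for state in slices.values()):
        return ("block", [])
    excluded = [name for name, state in slices.items()
                if state == "inconclusive"]
    if excluded:
        # Release may proceed only with these slices outside the ODD.
        return ("pass_with_odd_restriction", excluded)
    return ("pass", [])
```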
Failure Modes and Diagnostics
| Failure mode | Diagnostic |
|---|---|
| Wrong global alignment | ATE spike, geofence mismatch, survey residual |
| Local drift in weak-feature area | RPE/drift rate by feature density |
| Wrong loop closure | Topological inconsistency, residual jump, route discontinuity |
| Dynamic object ghost | Aircraft/GSE/person points in permanent layer |
| Map changed after survey | Scan-to-map residual trend and map-change detector |
| False-free-space | Occupancy/semantic layer comparison against labels/fixtures |
| Over-cleaning small hazards | FOD/chock/cone missing from hazard/current-world layer |
| Runtime overload | Latency, dropped frames, stale pose consumed by planner |
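The runtime-overload row hinges on detecting a stale pose before the planner consumes it. A sliding-window monitor sketch, with illustrative window size and latency budget (these are assumptions, not protocol thresholds):

```python
from collections import deque

class RuntimeMonitor:
    """Track pose latency and dropped-frame rate over a sliding window."""

    def __init__(self, window: int = 100, max_latency_s: float = 0.1):
        self.latencies = deque(maxlen=window)
        self.dropped = deque(maxlen=window)
        self.max_latency_s = max_latency_s

    def record(self, latency_s: float, dropped: bool) -> None:
        self.latencies.append(latency_s)
        self.dropped.append(1 if dropped else 0)

    def dropped_frame_rate(self) -> float:
        return sum(self.dropped) / len(self.dropped) if self.dropped else 0.0

    def stale_pose(self) -> bool:
        """True if the latest pose exceeded the latency budget, i.e. the
        planner must not consume it as current."""
        return bool(self.latencies) and self.latencies[-1] > self.max_latency_s
```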
Evidence Artifacts
| Artifact | Contents |
|---|---|
| Benchmark manifest | Build/map/calibration IDs, dataset partitions, route/tile list |
| Ground-truth package | Survey files, RTK/INS logs, label files, uncertainty model |
| Metric report | Tables, plots, confidence intervals, public and internal benchmark results |
| Map QA report | Tile status, semantic-layer checks, reviewer decisions |
| Failure packet | Reproducible log slice, seed/config, expected vs actual, defect ID |
| Runtime report | Latency, memory, CPU/GPU, dropped frames, watchdog events |
| Release recommendation | Pass, pass with ODD restriction, inconclusive, or block |
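Freezing the benchmark manifest is only useful if any change to it is detectable. A minimal sketch of a canonical digest over the manifest fields, so a modified build ID, map package, or partition list invalidates prior evidence (field names are illustrative):

```python
import hashlib
import json

def manifest_digest(manifest: dict) -> str:
    """SHA-256 over a canonical JSON encoding of the benchmark manifest.

    sort_keys makes the digest independent of dict insertion order, so
    two manifests with identical content always hash identically.
    """
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

The digest would be recorded in the metric report and failure packets so every artifact traces back to one frozen manifest.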
Owner Handoffs
| Owner | Responsibility |
|---|---|
| Benchmark owner | Manifest, harness, reproducibility, metric report |
| Mapping owner | Map build, tile QA, source traversals, publication readiness |
| Perception/SLAM owner | Candidate stack, metrics, root cause analysis |
| Data platform owner | Dataset curation, partition locks, storage, metadata |
| Safety lead | Critical thresholds and release interpretation |
| Fleet operations | Shadow-mode execution and route/airport exposure |
Sources
- KITTI odometry benchmark: https://www.cvlibs.net/datasets/kitti/eval_odometry.php
- TUM RGB-D SLAM Dataset and Benchmark: https://cvg.cit.tum.de/data/datasets/rgbd-dataset
- Oxford RobotCar Dataset: https://robotcar-dataset.robots.ox.ac.uk/
- Oxford RobotCar IJRR paper: https://robotcar-dataset.robots.ox.ac.uk/images/robotcar_ijrr.pdf
- Boreas multi-season autonomous driving dataset: https://www.boreas.utias.utoronto.ca/
- Boreas paper: https://arxiv.org/abs/2203.10168
- SLAMBench repository: https://github.com/pamela-project/slambench
- SLAMBench paper: https://arxiv.org/abs/1410.2167
- SLAMBench2 paper: https://arxiv.org/abs/1808.06820
- MapBench project: https://mapbench.github.io/
- Waymo Open Dataset: https://waymo.com/open/
- ISO 34502:2022, scenario-based safety evaluation framework: https://www.iso.org/standard/78951.html