SLAM Benchmarking Metrics and Datasets
SLAM benchmarking is easy to do badly. A single ATE number can hide scale-alignment tricks, loop-closure jumps, dishonest covariance, relocalization failures, compute spikes, and map artifacts that make a method unusable in a production AV or indoor robot. This guide defines the metrics and dataset choices that should be used across the method-level SLAM library.
Repo Cross-Links
| Related area | Link | Benchmark relevance |
|---|---|---|
| LiDAR method details | LiDAR SLAM Algorithms | Provides method-specific performance notes for KISS-ICP, LIO-SAM, FAST-LIO2, Faster-LIO-style voxel LIO, CT-ICP, and Point-LIO. |
| Production localization metrics | Production LiDAR Map Localization | Adds scan-to-map fitness, degeneracy, covariance, and runtime acceptance gates. |
| Loop/relocalization metrics | LiDAR Place Recognition and Re-Localization | Defines retrieval recall, precision, top-K verification, and kidnapped-robot recovery success. |
| Survey map QA | Map Construction Pipeline | Connects SLAM trajectory metrics to final HD map QA, GCP alignment, and packaging. |
| Estimator consistency | Robust State Estimation Multi-Sensor | Covers NEES/NIS, innovation gating, sensor dropout, and fallback validation. |
| Factor graph residuals | GTSAM Factor Graphs | Explains how factors, covariances, robust kernels, and iSAM2 updates should be inspected. |
| Dense/neural map evaluation | Gaussian Splatting for Driving | Adds reconstruction, rendering, semantic, and simulation-quality metrics for Gaussian/neural SLAM. |
| Coverage audit | SLAM Coverage Audit and Backlog | Tracks benchmark/dataset gaps such as LaMAria, HeLiPR, FusionPortableV2, Oxford Spires, Hilti x Trimble 2026, ETH3D SLAM, VBR, M2DGR, NTU VIRAL, and UrbanLoco. |
Metric Taxonomy
| Metric | Measures | Best for | Formula/implementation notes | Report as | Common misuse |
|---|---|---|---|---|---|
| Absolute Trajectory Error (ATE) | Global trajectory consistency after alignment | SLAM with ground truth | Associate poses by timestamp, align per the Alignment Rules below, then compute per-pose translation error (see the sketch after this table) | RMSE/mean/median/max in m; optionally yaw/rot error | Hiding drift by using Sim(3) scale alignment for stereo/LiDAR methods that should have metric scale |
| Relative Pose Error (RPE) | Local drift over a fixed time/distance segment | Odometry and loop-free front ends | Compare relative motion over windows such as 1s, 10m, 100m | Translation %, rotation deg/m or deg/100m | Reporting only ATE after loop closure, which can hide poor local odometry |
| KITTI odometry t_rel/r_rel | Average drift on subsequences | Outdoor vehicle odometry | KITTI averages translational and rotational errors over subsequences of 100-800m in 100m steps | t_rel %, r_rel deg/m | Comparing KITTI numbers to short indoor datasets without normalization |
| Segment drift curve | Error versus path length | AV and long-route mapping | Compute RPE over multiple segment lengths | Table/plot at 10, 50, 100, 200, 400, 800m | Reporting only one length and missing long-range drift |
| Loop-closure precision/recall | Candidate retrieval quality | Place recognition and graph SLAM | Precision after geometric verification; recall at top-K | Precision@K, Recall@K, F1, false positives/km | High recall without verifying false positives that destroy maps |
| Relocalization success | Recovery from unknown pose | Startup/kidnapped robot | Candidate found, verified, and accepted within pose threshold | Success %, time-to-localize, false accepts | Measuring only descriptor recall, not full pose recovery |
| Map consistency | Agreement between overlapping submaps or sessions | Survey mapping | Cloud-to-cloud distance, wall thickness, double-surface rate, GCP residual | cm RMSE, P95, max, visual QA flags | Relying on trajectory ATE when final map has double walls |
| GCP/RTK residual | Geodetic map accuracy | Airside/road HD maps | Compare optimized map landmarks/trajectory to surveyed anchors | East/north/up RMSE and P95 | Treating local SLAM frame as geodetically valid without anchors |
| Scan-matching health | Current registration quality | Runtime localization | Fitness/inlier ratio, residual distribution, Hessian eigenvalues | Time series and thresholds | Using a single scalar fitness that ignores degeneracy direction |
| Estimator consistency | Whether covariance is honest | Sensor fusion and safety | NEES/NIS compared to chi-square bounds | NEES/NIS time series and violation rate | Publishing small covariance because pose looked smooth |
| Robustness score | Recovery under degradation | Production testing | Count tracking loss, reinitializations, skipped factors, fallback duration | Failures/hour, safe-stop events, recovery time | Removing hard sequences from benchmark averages |
| Runtime latency | Real-time viability | Embedded deployment | End-to-end and stage timings; include P95/P99 | ms mean/P95/P99, deadline misses | Reporting desktop average only, not target hardware P99 |
| Resource use | Deployability | Embedded/ROS systems | CPU, GPU, memory, map size, bandwidth | Peak/steady values | Ignoring map growth over long missions |
| Determinism | Predictable behavior | Safety-critical runtime | Re-run same bag and compare outputs/timing | Pose diff, timing jitter, nondeterministic failures | Accepting stochastic variation without bounds |
| Dense reconstruction | Surface quality | RGB-D/Gaussian/neural SLAM | Accuracy, completeness, Chamfer/F-score, render PSNR/SSIM/LPIPS | Metric per scene plus failure cases | Treating nice rendering as reliable metric pose |
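The trajectory-metric rows above reduce to a few dozen lines of numpy. Below is a minimal sketch, assuming estimated and ground-truth poses are already timestamp-associated (positions as (N, 3) arrays, full poses as (N, 4, 4) homogeneous matrices); function names are illustrative, and production evaluation should use an audited tool such as evo or the official KITTI devkit (see Sources).

```python
import numpy as np

def umeyama_alignment(est, gt, with_scale=False):
    """Least-squares alignment of est onto gt (Umeyama 1991).

    est, gt: (N, 3) timestamp-associated positions.
    Returns R, t, s such that gt_i ~= s * R @ est_i + t.
    with_scale=False fits SE(3); with_scale=True fits Sim(3).
    """
    mu_e, mu_g = est.mean(axis=0), gt.mean(axis=0)
    E, G = est - mu_e, gt - mu_g
    U, D, Vt = np.linalg.svd(G.T @ E / len(est))   # cross-covariance SVD
    S = np.eye(3)
    if np.linalg.det(U) * np.linalg.det(Vt) < 0:
        S[2, 2] = -1.0                              # guard against reflections
    R = U @ S @ Vt
    s = (D * np.diag(S)).sum() / E.var(axis=0).sum() if with_scale else 1.0
    t = mu_g - s * R @ mu_e
    return R, t, s

def ate_rmse(est, gt, with_scale=False):
    """ATE translation RMSE after alignment (SE(3) for metric-scale methods)."""
    R, t, s = umeyama_alignment(est, gt, with_scale)
    residual = gt - (s * (R @ est.T).T + t)
    return float(np.sqrt((residual ** 2).sum(axis=1).mean()))

def kitti_translational_drift(est_T, gt_T, lengths=(100.0, 200.0, 400.0, 800.0)):
    """KITTI-style t_rel: average relative-pose translation error per segment
    length, in percent.  est_T, gt_T: (N, 4, 4) homogeneous poses.
    Simplified: evaluates every start frame and skips rotation error, where
    the official devkit strides start frames and also reports r_rel.
    """
    gt_xyz = gt_T[:, :3, 3]
    dist = np.concatenate(
        ([0.0], np.cumsum(np.linalg.norm(np.diff(gt_xyz, axis=0), axis=1)))
    )
    drift = {}
    for L in lengths:
        errs = []
        for i in range(len(gt_T)):
            j = int(np.searchsorted(dist, dist[i] + L))
            if j >= len(gt_T):
                break
            # relative-pose error is invariant to the global frame choice
            rel_err = np.linalg.inv(np.linalg.inv(gt_T[i]) @ gt_T[j]) \
                      @ (np.linalg.inv(est_T[i]) @ est_T[j])
            errs.append(np.linalg.norm(rel_err[:3, 3]) / L)
        if errs:
            drift[L] = 100.0 * float(np.mean(errs))
    return drift
```

Calling ate_rmse with with_scale=False keeps SE(3) alignment, which the Alignment Rules below require for any method with metric scale.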
Alignment Rules
| Method type | Allowed alignment | Why | Report separately |
|---|---|---|---|
| Monocular visual odometry without scale source | Sim(3) may be fair for research comparison | Scale is unobservable from monocular geometry alone | Also report scale drift if method estimates scale later |
| Stereo, RGB-D, LiDAR, LiDAR-inertial | SE(3), not Sim(3) | Metric scale should be observable from sensors | Any scale correction indicates calibration or estimator problem |
| GNSS/RTK/georeferenced maps | Fixed geodetic frame or SE(3) with known datum transform | Absolute position matters operationally | East/north/up residuals and map datum residual |
| Multi-session SLAM | Per-session and joint-frame metrics | A good per-session trajectory can still merge badly | Cross-session overlap error and loop factor residuals |
| Runtime localization | No post-hoc global alignment for pass/fail | The vehicle must localize online in the correct map frame | Initial convergence time and false accept rate |
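One practical consequence of this table: score metric-scale methods with SE(3), but fit Sim(3) once as a diagnostic, because a scale factor far from 1.0 points at a calibration or estimator problem. A short usage sketch building on the hypothetical umeyama_alignment helper above; est_xyz/gt_xyz are the associated (N, 3) arrays, and the 1% tolerance is an illustrative assumption, not a standard.

```python
# Diagnostic Sim(3) fit on a method that should be metric-scale.
R, t, s = umeyama_alignment(est_xyz, gt_xyz, with_scale=True)
if abs(s - 1.0) > 0.01:   # illustrative 1% tolerance
    print(f"metric-scale method needed {100.0 * (s - 1.0):+.2f}% scale "
          "correction: suspect calibration or estimator scale drift")
print("report:    ", ate_rmse(est_xyz, gt_xyz, with_scale=False))  # SE(3)
print("diagnostic:", ate_rmse(est_xyz, gt_xyz, with_scale=True))   # Sim(3)
```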
Public Dataset Matrix
| Dataset/benchmark | Domain | Sensors | Ground truth | Best for | Weakness for airside/AV production |
|---|---|---|---|---|---|
| KITTI Odometry | Urban/suburban driving | Stereo, Velodyne LiDAR, GPS/IMU ground truth | Training sequences 00-10 with GT; test 11-21 hidden | Outdoor vehicle odometry, KITTI t_rel/r_rel, LiDAR/stereo baselines | Old 64-beam setup, limited adverse weather, not a full HD-map localization benchmark |
| KITTI-360 | Urban driving and scene understanding | Cameras, Velodyne, GPS/IMU, semantic annotations | Accurate localization and annotations | Semantic SLAM, long sequence mapping, dense/novel-view work | Not airport-like; benchmark tasks are broader than odometry |
| MulRan | Urban place recognition and odometry | LiDAR, radar | 6D baseline trajectories | LiDAR/radar place recognition, reverse revisits, long-term gaps | Place-recognition oriented; not all sequences have survey-grade truth |
| NCLT | Long-term campus indoor/outdoor | Omnidirectional cameras, 3D LiDAR, planar LiDAR, GPS/RTK, IMU | Ground-truth pose in one frame | Long-term localization, seasonal change, mixed indoor/outdoor | Segway/campus dynamics differ from AVs; huge data volume |
| Oxford RobotCar | Long-term road autonomy | Cameras, LiDAR, radar extension, GPS/INS | Route repetitions; RTK reference release | Long-term road localization, weather/time changes | Sensor suite and route are urban road, not airside; exact SLAM GT varies by subset |
| Boreas | Multi-season driving | 128-beam LiDAR, Navtech radar, camera, GNSS/INS | Centimeter post-processed poses | All-weather LiDAR/radar odometry and metric localization | Newer ecosystem; not indoor or airport-specific |
| Newer College | Handheld indoor/outdoor campus | Stereo-inertial and LiDAR | Precise 3D map/trajectory from survey pipeline | Handheld LiDAR/VIO, loop closure, mixed open/vegetated areas | Walking speeds and handheld motion differ from vehicle dynamics |
| Hilti SLAM Challenge | Construction, underground, multi-session | Multi-camera rigs, LiDAR, IMU; handheld and robot platforms | Challenge truth, single/multi-session scoring | Robust indoor/construction SLAM and cross-session mapping | License restrictions; construction geometry differs from airport apron |
| EuRoC MAV | Indoor drone | Stereo global-shutter cameras, IMU | Motion-capture/laser tracker GT | Visual-inertial odometry, initialization, fast camera motion | Small indoor MAV scale; no LiDAR and no AV dynamics |
| TUM VI | Indoor/outdoor visual-inertial | Wide-FOV stereo cameras, IMU | Motion capture at start/end sections | VIO robustness, long indoor/outdoor walks | Partial GT for long sequences; camera-first benchmark |
| TUM RGB-D | Indoor RGB-D | RGB-D camera | Motion-capture GT | RGB-D SLAM, ATE/RPE tools, dense mapping | Short-range indoor only; not useful for LiDAR AV stack selection |
| Argoverse 2 Sensor/LiDAR | Road AV data and maps | LiDAR, cameras, maps, ego pose | Map-aligned poses | Learning, map automation, sensor-domain research | Not a standard SLAM odometry leaderboard; use carefully for custom tests |
| nuScenes | Urban AV perception | Cameras, LiDAR, radar, maps, ego pose | Offline localized ego poses | Multi-modal perception and localization research | Short snippets; localization GT not designed as pure SLAM benchmark |
Benchmark Suite by Method Family
| Method family | Minimum public tests | Airside/private tests to add | Must report |
|---|---|---|---|
| KISS-ICP-style LiDAR odometry | KITTI, MulRan, Newer College | Open apron loops, repeated stands, wet/night scans, multi-LiDAR merged clouds | KITTI drift, ATE/RPE, degeneracy stats, runtime P99 |
| FAST-LIO2/Point-LIO-style LIO | KITTI/NCLT/Newer College/Hilti depending on platform | IMU vibration, sync offsets, multi-LiDAR extrinsic stress, GPS-denied loops | ATE/RPE, IMU residuals, bias behavior, failure/recovery count |
| LIO-SAM-style factor-graph SLAM | KITTI, NCLT, MulRan, Hilti | GCP-anchored airport survey, loop closure false positives near similar gates | Pre/post-loop ATE, loop precision/recall, graph residuals, map overlap |
| Cartographer/2D SLAM | MIT/Intel-style 2D sets, office/warehouse bags | Warehouse aisles, docking lanes, pallet changes | Occupancy map consistency, localization success, CPU, map update behavior |
| ORB-SLAM3/visual SLAM | EuRoC, TUM VI, TUM RGB-D, KITTI stereo | Night/glare, rolling shutter, low texture, rain-on-lens | Tracking loss, ATE/RPE, feature count, initialization failures |
| OpenVINS/VINS-Fusion | EuRoC, TUM VI, UZH-FPV, KITTI | Injected camera-IMU time offsets, vehicle vibration, low texture | ATE/RPE, NEES, bias, initialization time |
| Runtime scan-to-map localization | KITTI-derived map split, Boreas localization, custom map | Airport HD map, degraded LiDAR, wrong initial pose, changed stands | Convergence basin, false accept rate, covariance, matching score P99 |
| Gaussian/neural SLAM | Replica/TUM RGB-D/ScanNet if supported | Airside map QA captures, static/dynamic split, simulation replay | ATE/RPE plus reconstruction/rendering metrics and compute |
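Several "must report" cells above, together with the loop-closure row in the metric taxonomy, call for loop precision/recall after geometric verification. A minimal sketch of that computation, assuming candidate lists have already passed verification and that ground-truth positions decide what counts as a true revisit; the 5m threshold, top-K default, and temporal-exclusion window are illustrative assumptions.

```python
import numpy as np

def loop_pr_at_k(candidates, gt_positions, dist_thresh=5.0, top_k=1, min_gap=100):
    """Precision/recall at K for loop retrieval after geometric verification.

    candidates:   list of (query_idx, ranked_db_indices), best-first, already
                  pruned by geometric verification.
    gt_positions: (N, 3) ground-truth positions defining true revisits.
    min_gap:      frames excluded around the query so temporal neighbors
                  do not count as loops (illustrative default).
    """
    tp = fp = 0
    n_true_loops = 0
    for q, ranked in candidates:
        past = gt_positions[: max(q - min_gap, 0)]
        has_loop = past.size > 0 and bool(
            (np.linalg.norm(past - gt_positions[q], axis=1) < dist_thresh).any()
        )
        n_true_loops += has_loop
        top = [m for m in ranked[:top_k] if m <= q - min_gap]
        if top:
            hit = any(
                np.linalg.norm(gt_positions[m] - gt_positions[q]) < dist_thresh
                for m in top
            )
            tp += hit
            fp += not hit   # accepted false positives are what destroy maps
    precision = tp / max(tp + fp, 1)
    recall = tp / max(n_true_loops, 1)
    return precision, recall
```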
Airside Private Benchmark Design
| Test class | Required sequences | Pass/fail signals | Why it is needed |
|---|---|---|---|
| Open apron degeneracy | Straight and curved traversals across low-feature tarmac | Covariance inflation in weak axes; no overconfident lateral/yaw jumps | Public datasets underrepresent airport-scale open flat spaces. |
| Repeated stand aliasing | Adjacent gates/stands with similar geometry | No false loop closure; relocalization top-K ambiguity handled | Airports have deliberate repetition that defeats naive place recognition. |
| Dynamic aircraft/GSE | Same stand with aircraft present/absent, buses/carts/fuel trucks | Dynamic objects not fused into permanent map; scan-to-map residual localized | Static maps must survive large moving objects. |
| Weather and lighting | Dry/wet tarmac, rain/fog/de-icing spray, day/night/glare | Tracking loss rate, fallback duration, sensor health flags | Airside operations run across weather and shifts. |
| GPS/RTK degradation | Terminal overhang, aircraft shadowing, multipath zones | Estimator rejects bad GNSS and does not corrupt map/localization | State fusion must handle false absolute measurements. |
| Multi-LiDAR health | Disable/misalign one LiDAR, timestamp offsets, partial blockage (see the fault-injection sketch after this table) | Fault isolated to one sensor; pose degrades gracefully | Multi-sensor AVs need per-sensor diagnostics. |
| Map staleness | Changed barriers/markings, construction, temporary closures | Change is flagged rather than absorbed silently | Fleet maps must be versioned and maintained. |
| Relocalization | Startup at unknown pose, vehicle towed, pose intentionally perturbed | Correct global pose accepted; wrong hypotheses rejected | Startup/recovery is as important as steady-state tracking. |
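Most of these test classes can be generated from nominal recordings by fault injection rather than new data collection. A minimal sketch for the multi-LiDAR health row, using the ROS 1 rosbag Python API; topic names, file names, and the 50ms offset are illustrative assumptions, and a ROS 2 stack would use rosbag2_py instead.

```python
import rosbag   # ROS 1 Python API
import rospy

# Hypothetical fault injection: kill one LiDAR topic and de-synchronize
# another, then replay the faulted bag against the stack under test.
DROP_TOPIC = "/lidar_rear/points"          # simulate a dead sensor
SHIFT_TOPIC = "/lidar_front/points"        # simulate a sync fault
OFFSET = rospy.Duration.from_sec(0.050)    # illustrative 50 ms offset

with rosbag.Bag("nominal.bag") as inbag, \
     rosbag.Bag("faulted.bag", "w") as outbag:
    for topic, msg, t in inbag.read_messages():
        if topic == DROP_TOPIC:
            continue                       # dead sensor: no data at all
        if topic == SHIFT_TOPIC:
            t += OFFSET                    # shift bag-record time
            if msg._has_header:
                msg.header.stamp += OFFSET # shift sensor timestamp too
        outbag.write(topic, msg, t)
```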
Acceptance Gates for Production-Oriented SLAM Evaluation
| Gate | Target for airside map construction | Target for runtime localization | Notes |
|---|---|---|---|
| Local odometry drift | Less than 0.5-1.0% before loop closure on airport-like loops | Fallback only; must be time-limited | Exact threshold depends on safe-stop policy and map coverage. |
| Global map accuracy | GCP/RTK residual P95 less than 10cm for navigation layers; tighter for docking zones | N/A | Docking/aircraft proximity may need local survey refinement. |
| Runtime pose accuracy | N/A | Typical steady-state less than 5-10cm lateral and less than 0.2deg yaw in validated map zones | Must be validated against vehicle-level safety margins. |
| False loop closures | Zero accepted false positives in safety benchmark | Zero accepted false relocalizations | A single false closure can invalidate the map or pose. |
| Tracking loss | Documented and recoverable | Safe fallback or safe stop within safety budget | "No output" can be safer than wrong output. |
| Deadline misses | Offline acceptable if bounded | P99 under localization cycle budget | Report on target hardware, not workstation only. |
| Covariance consistency | NEES/NIS within expected bounds on instrumented tests | Same, with fault injection | Overconfidence is a safety bug. |
| Map artifact rate | No double walls/ghost aircraft in operational layers | N/A | Visual inspection plus automated cloud distance checks. |
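The covariance-consistency gate is mechanical to check once ground-truth errors are available. A minimal sketch of the average-NEES test against two-sided chi-square bounds, assuming per-timestep state errors and estimator-reported covariances have already been extracted; the helper name and the 5% significance level are illustrative assumptions.

```python
import numpy as np
from scipy.stats import chi2

def nees_gate(errors, covariances, alpha=0.05):
    """Average-NEES consistency test against two-sided chi-square bounds.

    errors:      (N, d) state errors vs. ground truth (e.g. d=3 position).
    covariances: (N, d, d) covariances reported by the estimator.
    A consistent estimator has mean NEES near d; above the upper bound the
    filter is overconfident, which the gate above treats as a safety bug.
    """
    n, d = errors.shape
    e = errors[:, :, None]                                # (N, d, 1)
    nees = (e.transpose(0, 2, 1) @ np.linalg.inv(covariances) @ e).ravel()
    lo = chi2.ppf(alpha / 2.0, df=n * d) / n              # two-sided bounds
    hi = chi2.ppf(1.0 - alpha / 2.0, df=n * d) / n        # on the average
    return {
        "mean_nees": float(nees.mean()),
        "bounds": (float(lo), float(hi)),
        "consistent": bool(lo <= nees.mean() <= hi),
        "per_step_violation_rate":
            float((nees > chi2.ppf(1.0 - alpha, df=d)).mean()),
    }
```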
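The map-artifact gate's automated cloud distance check can be sketched with Open3D nearest-neighbor distances. This assumes the two overlapping submaps have already been cropped to their common region (the distance is one-directional and would otherwise be dominated by non-overlap); file paths and the warning threshold are illustrative.

```python
import numpy as np
import open3d as o3d

def overlap_consistency(submap_a_path, submap_b_path, warn_cm=5.0):
    """Cloud-to-cloud distance stats between overlapping submaps.

    Double surfaces from inconsistent trajectories show up as a fat tail
    in the nearest-neighbor distance distribution, not in the mean, so
    the P95/max matter more than the RMSE.
    """
    a = o3d.io.read_point_cloud(submap_a_path)
    b = o3d.io.read_point_cloud(submap_b_path)
    # nearest-neighbor distance from each point of a to cloud b, in cm
    d = np.asarray(a.compute_point_cloud_distance(b)) * 100.0
    return {
        "rmse_cm": float(np.sqrt(np.mean(d ** 2))),
        "p95_cm": float(np.percentile(d, 95)),
        "max_cm": float(d.max()),
        "flag": bool(np.percentile(d, 95) > warn_cm),  # double-wall suspect
    }
```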
Reporting Template for Method Pages
| Section | Required content |
|---|---|
| Dataset table | Public datasets used, sequence IDs, sensor subset, preprocessing, alignment mode |
| Metrics table | ATE, RPE/segment drift, loop metrics if applicable, runtime, memory, failure counts |
| Hardware table | CPU/GPU, ROS version, compiler/build mode, thread count, target embedded result if available |
| Calibration assumptions | Intrinsics/extrinsics/time sync source; whether online calibration is enabled |
| Failure analysis | At least three failure modes with detection and mitigation |
| Production fit | License, ROS 1/2 support, API stability, diagnostics, hot-path determinism |
| Airside extrapolation | What public datasets do not test and which private sequences are required |
Metric Pitfalls
| Pitfall | Why it misleads | Better practice |
|---|---|---|
| Averaging across easy and hard sequences without stratification | A method can look strong by dominating easy sequences and failing rare critical cases | Report per-sequence and by condition bucket. |
| Using final ATE only after loop closure | A loop can hide poor odometry until recovery is impossible | Report pre-loop odometry drift and post-loop global consistency. |
| Ignoring false positive loops | Recall improvements can destroy maps | Report precision after geometric verification and robust kernel behavior. |
| Reporting mean latency only | Embedded systems fail at P99/P99.9 | Report stage timing distributions and deadline misses. |
| Comparing methods with different alignment freedoms | Sim(3) can forgive metric-scale errors | State SE(3)/Sim(3)/fixed-frame alignment explicitly. |
| Treating map and trajectory as interchangeable | Good trajectory can produce a poor dense map | Measure map overlap, surface thickness, and dynamic artifacts. |
| Ignoring estimator consistency | Smooth pose with false covariance can corrupt fusion | Report NEES/NIS and innovation gating outcomes. |
| Benchmarking only public datasets | Public data rarely matches the target ODD | Add domain-specific private tests and fault injection. |
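The latency pitfall in particular is cheap to fix. A minimal sketch of the reporting the table asks for, assuming per-cycle stage timings have already been logged on the target hardware; the 100ms cycle budget is an illustrative assumption.

```python
import numpy as np

def latency_report(stage_ms, budget_ms=100.0):
    """Stage-timing report: distributions and deadline misses, not just means.

    stage_ms:  dict of stage name -> per-cycle milliseconds (equal lengths).
    budget_ms: localization cycle budget; 100 ms here is illustrative.
    """
    stages = {k: np.asarray(v, dtype=float) for k, v in stage_ms.items()}
    total = np.sum(list(stages.values()), axis=0)   # end-to-end per cycle
    report = {
        name: {"mean": float(v.mean()),
               "p95": float(np.percentile(v, 95)),
               "p99": float(np.percentile(v, 99))}
        for name, v in stages.items()
    }
    report["end_to_end"] = {
        "p99": float(np.percentile(total, 99)),
        "deadline_miss_rate": float((total > budget_ms).mean()),
    }
    return report
```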
Sources
- KITTI odometry benchmark official page and evaluation protocol: https://www.cvlibs.net/datasets/kitti/eval_odometry.php
- KITTI dataset paper, Geiger et al., "Vision meets Robotics: The KITTI Dataset", IJRR 2013: https://www.cvlibs.net/publications/Geiger2013IJRR.pdf
- KITTI-360 paper and project: https://arxiv.org/abs/2109.13410 and https://www.cvlibs.net/datasets/kitti-360/
- TUM RGB-D benchmark tools and ATE/RPE scripts: https://cvg.cit.tum.de/data/datasets/rgbd-dataset/tools
- Sturm et al., "A Benchmark for the Evaluation of RGB-D SLAM Systems", IROS 2012: https://cvg.cit.tum.de/_media/spezial/bib/sturm12iros.pdf
- OpenVINS evaluation metrics documentation for ATE, RPE, RMSE, and NEES: https://docs.openvins.com/eval-metrics.html
- EuRoC MAV dataset, Burri et al., "The EuRoC micro aerial vehicle datasets", IJRR 2016; dataset page: https://projects.asl.ethz.ch/datasets/doku.php?id=kmavvisualinertialdatasets
- TUM VI benchmark official page and paper: https://cvg.cit.tum.de/data/datasets/visual-inertial-dataset and https://arxiv.org/abs/1804.06120
- Newer College Dataset official page and paper: https://ori-drs.github.io/newer-college-dataset/ and https://arxiv.org/abs/2003.05691
- MulRan Dataset official page and paper: https://sites.google.com/view/mulran-pr/dataset and https://gisbi-kim.github.io/publications/gkim-2020-icra.pdf
- University of Michigan NCLT dataset official repository: https://deepblue.lib.umich.edu/data/concern/data_sets/h128nf37h
- Oxford RobotCar Dataset official site: https://robotcar.org.uk/
- Boreas dataset official site and AWS registry: https://www.boreas.utias.utoronto.ca/ and https://registry.opendata.aws/boreas/
- Hilti SLAM Challenge 2023 official dataset page: https://www.hilti-challenge.com/dataset-2023
- Argoverse 2 official dataset page: https://www.argoverse.org/av2.html
- nuScenes official dataset and paper: https://www.nuscenes.org/ and https://arxiv.org/abs/1903.11027
- evo trajectory evaluation tool metrics documentation: https://github.com/MichaelGrupp/evo/wiki/Metrics