
SLAM Benchmarking Metrics and Datasets

SLAM benchmarking is easy to do badly. A single ATE number can hide scale alignment, loop-closure jumps, bad covariance, relocalization failures, compute spikes, and map artifacts that make a method unusable in a production AV or indoor robot. This guide defines the metrics and dataset choices that should be used across the method-level SLAM library.

| Related area | Link | Benchmark relevance |
| --- | --- | --- |
| LiDAR method details | LiDAR SLAM Algorithms | Provides method-specific performance notes for KISS-ICP, LIO-SAM, FAST-LIO2, Faster-LIO-style voxel LIO, CT-ICP, and Point-LIO. |
| Production localization metrics | Production LiDAR Map Localization | Adds scan-to-map fitness, degeneracy, covariance, and runtime acceptance gates. |
| Loop/relocalization metrics | LiDAR Place Recognition and Re-Localization | Defines retrieval recall, precision, top-K verification, and kidnapped-robot recovery success. |
| Survey map QA | Map Construction Pipeline | Connects SLAM trajectory metrics to final HD map QA, GCP alignment, and packaging. |
| Estimator consistency | Robust State Estimation Multi-Sensor | Covers NEES/NIS, innovation gating, sensor dropout, and fallback validation. |
| Factor graph residuals | GTSAM Factor Graphs | Explains how factors, covariances, robust kernels, and iSAM2 updates should be inspected. |
| Dense/neural map evaluation | Gaussian Splatting for Driving | Adds reconstruction, rendering, semantic, and simulation-quality metrics for Gaussian/neural SLAM. |
| Coverage audit | SLAM Coverage Audit and Backlog | Tracks benchmark/dataset gaps such as LaMAria, HeLiPR, FusionPortableV2, Oxford Spires, Hilti x Trimble 2026, ETH3D SLAM, VBR, M2DGR, NTU VIRAL, and UrbanLoco. |

Metric Taxonomy

| Metric | Measures | Best for | Formula/implementation notes | Report as | Common misuse |
| --- | --- | --- | --- | --- | --- |
| Absolute Trajectory Error (ATE) | Global trajectory consistency after alignment | SLAM with ground truth | Align estimated and ground-truth trajectories, then compute pose error over associated timestamps (see the sketch after this table) | RMSE/mean/median/max in m; optionally yaw/rotation error | Hiding drift by using Sim(3) scale alignment for stereo/LiDAR methods that should have metric scale |
| Relative Pose Error (RPE) | Local drift over a fixed time/distance segment | Odometry and loop-free front ends | Compare relative motion over windows such as 1 s, 10 m, or 100 m | Translation %, rotation deg/m or deg/100 m | Reporting only ATE after loop closure, which can hide poor local odometry |
| KITTI odometry t_rel/r_rel | Average drift on subsequences | Outdoor vehicle odometry | KITTI averages translational and rotational errors over subsequences of 100-800 m | t_rel in %, r_rel in deg/m | Comparing KITTI numbers to short indoor datasets without normalization |
| Segment drift curve | Error versus path length | AV and long-route mapping | Compute RPE over multiple segment lengths | Table/plot at 10, 50, 100, 200, 400, 800 m | Reporting only one length and missing long-range drift |
| Loop-closure precision/recall | Candidate retrieval quality | Place recognition and graph SLAM | Precision after geometric verification; recall at top-K | Precision@K, Recall@K, F1, false positives/km | High recall without verifying false positives that destroy maps |
| Relocalization success | Recovery from unknown pose | Startup/kidnapped robot | Candidate found, verified, and accepted within a pose threshold | Success %, time-to-localize, false accepts | Measuring only descriptor recall, not full pose recovery |
| Map consistency | Agreement between overlapping submaps or sessions | Survey mapping | Cloud-to-cloud distance, wall thickness, double-surface rate, GCP residual | cm RMSE, P95, max, visual QA flags | Relying on trajectory ATE when the final map has double walls |
| GCP/RTK residual | Geodetic map accuracy | Airside/road HD maps | Compare optimized map landmarks/trajectory to surveyed anchors | East/north/up RMSE and P95 | Treating the local SLAM frame as geodetically valid without anchors |
| Scan-matching health | Current registration quality | Runtime localization | Fitness/inlier ratio, residual distribution, Hessian eigenvalues | Time series and thresholds | Using a single scalar fitness that ignores degeneracy direction |
| Estimator consistency | Whether covariance is honest | Sensor fusion and safety | NEES/NIS compared to chi-square bounds | NEES/NIS time series and violation rate | Publishing small covariance because the pose looked smooth |
| Robustness score | Recovery under degradation | Production testing | Count tracking losses, reinitializations, skipped factors, fallback duration | Failures/hour, safe-stop events, recovery time | Removing hard sequences from benchmark averages |
| Runtime latency | Real-time viability | Embedded deployment | End-to-end and stage timings; include P95/P99 | ms mean/P95/P99, deadline misses | Reporting desktop averages only, not target-hardware P99 |
| Resource use | Deployability | Embedded/ROS systems | CPU, GPU, memory, map size, bandwidth | Peak/steady values | Ignoring map growth over long missions |
| Determinism | Predictable behavior | Safety-critical runtime | Re-run the same bag and compare outputs/timing | Pose diff, timing jitter, nondeterministic failures | Accepting stochastic variation without bounds |
| Dense reconstruction | Surface quality | RGB-D/Gaussian/neural SLAM | Accuracy, completeness, Chamfer/F-score, render PSNR/SSIM/LPIPS | Metric per scene plus failure cases | Treating a pleasing rendering as evidence of reliable metric pose |
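Both trajectory metrics reduce to a few lines of linear algebra. Below is a minimal sketch of ATE with SE(3) (Kabsch/Umeyama, no scale) alignment and fixed-frame-gap RPE, assuming the two trajectories have already been timestamp-associated into equal-length lists of 4x4 pose matrices; the function names are illustrative, and for published numbers a maintained tool such as evo is the safer choice.

```python
import numpy as np

def align_se3(src, dst):
    """Least-squares SE(3) alignment (Kabsch/Umeyama without scale) of
    two Nx3 point sets; returns R, t such that R @ src_i + t ~ dst_i."""
    mu_s, mu_d = src.mean(axis=0), dst.mean(axis=0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    S = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])  # guard against reflections
    R = Vt.T @ S @ U.T
    return R, mu_d - R @ mu_s

def ate_rmse(est, gt):
    """ATE RMSE in metres over timestamp-associated lists of 4x4 poses."""
    p_est = np.array([T[:3, 3] for T in est])
    p_gt = np.array([T[:3, 3] for T in gt])
    R, t = align_se3(p_est, p_gt)
    err = (p_est @ R.T + t) - p_gt
    return float(np.sqrt((err ** 2).sum(axis=1).mean()))

def rpe_trans_rmse(est, gt, delta=10):
    """Translational RPE RMSE over a fixed frame gap `delta`."""
    errs = []
    for i in range(len(est) - delta):
        d_est = np.linalg.inv(est[i]) @ est[i + delta]  # estimated relative motion
        d_gt = np.linalg.inv(gt[i]) @ gt[i + delta]     # true relative motion
        e = np.linalg.inv(d_gt) @ d_est                 # relative-pose error
        errs.append(np.linalg.norm(e[:3, 3]))
    return float(np.sqrt(np.mean(np.square(errs))))
```

Distance-based windows (10 m, 100 m, KITTI's 100-800 m subsequences) follow the same pattern, with the frame gap chosen from cumulative path length instead of a fixed index offset.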

Alignment Rules

| Method type | Allowed alignment | Why | Report separately |
| --- | --- | --- | --- |
| Monocular visual odometry without a scale source | Sim(3) may be fair for research comparison | Scale is unobservable from monocular geometry alone | Also report scale drift if the method estimates scale later |
| Stereo, RGB-D, LiDAR, LiDAR-inertial | SE(3), not Sim(3) | Metric scale should be observable from the sensors | Any scale correction indicates a calibration or estimator problem (see the scale check after this table) |
| GNSS/RTK/georeferenced maps | Fixed geodetic frame or SE(3) with a known datum transform | Absolute position matters operationally | East/north/up residuals and map datum residual |
| Multi-session SLAM | Per-session and joint-frame metrics | A good per-session trajectory can still merge badly | Cross-session overlap error and loop factor residuals |
| Runtime localization | No post-hoc global alignment for pass/fail | The vehicle must localize online in the correct map frame | Initial convergence time and false accept rate |
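One practical way to police the SE(3) rule is to fit the similarity transform anyway and treat its scale factor as a diagnostic: for a stereo, RGB-D, or LiDAR method it should stay very close to 1. A minimal sketch, reusing Nx3 translation tracks as in the ATE example above; the 1% tolerance is an illustrative assumption, not a standard value.

```python
import numpy as np

def umeyama_scale(src, dst):
    """Scale factor of the least-squares Sim(3) alignment of Nx3 src to dst."""
    src_c, dst_c = src - src.mean(axis=0), dst - dst.mean(axis=0)
    n = len(src)
    cov = src_c.T @ dst_c / n                              # 3x3 cross-covariance
    D = np.linalg.svd(cov, compute_uv=False)               # singular values
    d = np.sign(np.linalg.det(cov))                        # reflection sign
    var_src = (src_c ** 2).sum() / n
    return (D[0] + D[1] + d * D[2]) / var_src              # Umeyama: s = tr(DS)/var

scale = umeyama_scale(p_est, p_gt)   # translation tracks as in the ATE sketch
if abs(scale - 1.0) > 0.01:          # illustrative 1% tolerance
    print(f"metric-scale violation: Sim(3) scale = {scale:.4f}; "
          "investigate calibration instead of hiding it with Sim(3) alignment")
```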

Public Dataset Matrix

| Dataset/benchmark | Domain | Sensors | Ground truth | Best for | Weakness for airside/AV production |
| --- | --- | --- | --- | --- | --- |
| KITTI Odometry | Urban/suburban driving | Stereo cameras, Velodyne LiDAR, GPS/IMU | Training sequences 00-10 with GT; test 11-21 hidden | Outdoor vehicle odometry, KITTI t_rel/r_rel, LiDAR/stereo baselines | Old 64-beam setup, limited adverse weather, not a full HD-map localization benchmark |
| KITTI-360 | Urban driving and scene understanding | Cameras, Velodyne, GPS/IMU, semantic annotations | Accurate localization and annotations | Semantic SLAM, long-sequence mapping, dense/novel-view work | Not airport-like; benchmark tasks are broader than odometry |
| MulRan | Urban place recognition and odometry | LiDAR, radar | 6-DoF baseline trajectories | LiDAR/radar place recognition, reverse revisits, long-term gaps | Place-recognition oriented; not all sequences have survey-grade truth |
| NCLT | Long-term campus indoor/outdoor | Omnidirectional cameras, 3D LiDAR, planar LiDAR, GPS/RTK, IMU | Ground-truth pose in one consistent frame | Long-term localization, seasonal change, mixed indoor/outdoor | Segway/campus dynamics differ from AVs; huge data volume |
| Oxford RobotCar | Long-term road autonomy | Cameras, LiDAR, radar extension, GPS/INS | Route repetitions; RTK reference release | Long-term road localization, weather/time changes | Sensor suite and route are urban road, not airside; exact SLAM GT varies by subset |
| Boreas | Multi-season driving | 128-beam LiDAR, Navtech radar, camera, GNSS/INS | Centimeter-level post-processed poses | All-weather LiDAR/radar odometry and metric localization | Newer ecosystem; not indoor or airport-specific |
| Newer College | Handheld indoor/outdoor campus | Stereo-inertial camera and LiDAR | Precise 3D map/trajectory from a survey pipeline | Handheld LiDAR/VIO, loop closure, mixed open/vegetated areas | Walking speeds and handheld motion differ from vehicle dynamics |
| Hilti SLAM Challenge | Construction, underground, multi-session | Multi-camera rigs, LiDAR, IMU; handheld and robot platforms | Challenge truth, single/multi-session scoring | Robust indoor/construction SLAM and cross-session mapping | License restrictions; construction geometry differs from airport aprons |
| EuRoC MAV | Indoor drone | Stereo global-shutter cameras, IMU | Motion-capture/laser-tracker GT | Visual-inertial odometry, initialization, fast camera motion | Small indoor MAV scale; no LiDAR and no AV dynamics |
| TUM VI | Indoor/outdoor visual-inertial | Wide-FOV stereo cameras, IMU | Motion capture at start/end sections | VIO robustness, long indoor/outdoor walks | Partial GT on long sequences; camera-first benchmark |
| TUM RGB-D | Indoor RGB-D | RGB-D camera | Motion-capture GT | RGB-D SLAM, ATE/RPE tools, dense mapping | Short-range indoor only; not useful for LiDAR AV stack selection |
| Argoverse 2 Sensor/LiDAR | Road AV data and maps | LiDAR, cameras, maps, ego pose | Map-aligned poses | Learning, map automation, sensor-domain research | Not a standard SLAM odometry leaderboard; use carefully for custom tests |
| nuScenes | Urban AV perception | Cameras, LiDAR, radar, maps, ego pose | Offline-localized ego poses | Multi-modal perception and localization research | Short snippets; localization GT not designed as a pure SLAM benchmark |

Benchmark Suite by Method Family

| Method family | Minimum public tests | Airside/private tests to add | Must report |
| --- | --- | --- | --- |
| KISS-ICP-style LiDAR odometry | KITTI, MulRan, Newer College | Open apron loops, repeated stands, wet/night scans, multi-LiDAR merged clouds | KITTI drift, ATE/RPE, degeneracy stats, runtime P99 |
| FAST-LIO2/Point-LIO-style LIO | KITTI/NCLT/Newer College/Hilti, depending on platform | IMU vibration, sync offsets, multi-LiDAR extrinsic stress, GPS-denied loops | ATE/RPE, IMU residuals, bias behavior, failure/recovery count |
| LIO-SAM-style factor-graph SLAM | KITTI, NCLT, MulRan, Hilti | GCP-anchored airport survey, loop-closure false positives near similar gates | Pre/post-loop ATE, loop precision/recall (see the sketch after this table), graph residuals, map overlap |
| Cartographer/2D SLAM | MIT/Intel-style 2D sets, office/warehouse bags | Warehouse aisles, docking lanes, pallet changes | Occupancy-map consistency, localization success, CPU, map update behavior |
| ORB-SLAM3/visual SLAM | EuRoC, TUM VI, TUM RGB-D, KITTI stereo | Night/glare, rolling shutter, low texture, rain on lens | Tracking loss, ATE/RPE, feature count, initialization failures |
| OpenVINS/VINS-Fusion | EuRoC, TUM VI, UZH-FPV, KITTI | Camera-IMU temporal error, vehicle vibration, low texture | ATE/RPE, NEES, bias, initialization time |
| Runtime scan-to-map localization | KITTI-derived map split, Boreas localization, custom map | Airport HD map, degraded LiDAR, wrong initial pose, changed stands | Convergence basin, false accept rate, covariance, matching-score P99 |
| Gaussian/neural SLAM | Replica/TUM RGB-D/ScanNet where supported | Airside map-QA captures, static/dynamic split, simulation replay | ATE/RPE plus reconstruction/rendering metrics and compute |
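Loop-closure precision/recall is easy to compute inconsistently, so it helps to fix the counting rules in code. The sketch below counts at the query level, after geometric verification; `retrieve_topk`, `verify_geometry`, `is_true_revisit`, and `has_revisit` are placeholders for whichever retrieval front end, registration check, and ground-truth revisit test a given stack uses.

```python
def loop_pr_at_k(queries, retrieve_topk, verify_geometry,
                 is_true_revisit, has_revisit, k=5):
    """Query-level precision/recall@K for loop closure after verification."""
    tp = fp = fn = 0
    for q in queries:
        accepted = [c for c in retrieve_topk(q, k) if verify_geometry(q, c)]
        correct = any(is_true_revisit(q, c) for c in accepted)
        if correct:
            tp += 1
        if accepted and not correct:
            fp += 1   # accepted only wrong closures: the map-destroying case
        if has_revisit(q) and not correct:
            fn += 1   # a real revisit was missed, rejected, or mismatched
    precision = tp / (tp + fp) if (tp + fp) else 1.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return precision, recall
```

Dividing the false-positive count by traversed distance yields the false-positives/km figure from the metric taxonomy table.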

Airside Private Benchmark Design

| Test class | Required sequences | Pass/fail signals | Why it is needed |
| --- | --- | --- | --- |
| Open apron degeneracy | Straight and curved traversals across low-feature tarmac | Covariance inflation in weak axes; no overconfident lateral/yaw jumps (see the sketch after this table) | Public datasets underrepresent airport-scale open flat spaces. |
| Repeated stand aliasing | Adjacent gates/stands with similar geometry | No false loop closures; relocalization top-K ambiguity handled | Airports have deliberate repetition that defeats naive place recognition. |
| Dynamic aircraft/GSE | Same stand with aircraft present/absent, buses/carts/fuel trucks | Dynamic objects not fused into the permanent map; scan-to-map residual localized | Static maps must survive large moving objects. |
| Weather and lighting | Dry/wet tarmac, rain/fog/de-icing spray, day/night/glare | Tracking-loss rate, fallback duration, sensor health flags | Airside operations run across weather and shifts. |
| GPS/RTK degradation | Terminal overhang, aircraft shadowing, multipath zones | Estimator rejects bad GNSS and does not corrupt map/localization | State fusion must handle false absolute measurements. |
| Multi-LiDAR health | Disable/misalign one LiDAR, timestamp offsets, partial blockage | Fault isolated to one sensor; pose degrades gracefully | Multi-sensor AVs need per-sensor diagnostics. |
| Map staleness | Changed barriers/markings, construction, temporary closures | Change is flagged rather than absorbed silently | Fleet maps must be versioned and maintained. |
| Relocalization | Startup at unknown pose, vehicle towed, pose intentionally perturbed | Correct global pose accepted; wrong hypotheses rejected | Startup/recovery is as important as steady-state tracking. |
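The open-apron row depends on detecting degeneracy before it corrupts the pose. One common approach, sketched here under stated assumptions, is to eigen-decompose the approximate Gauss-Newton Hessian of the scan matcher (J^T J from stacked point-to-plane residual Jacobians) and treat small eigenvalues as under-constrained directions; the threshold is an illustrative assumption that real systems tune per sensor and map resolution.

```python
import numpy as np

def degenerate_directions(J, eig_threshold=100.0):
    """J: Nx6 stacked residual Jacobians over the se(3) perturbation.
    Returns eigenvalues and unit directions flagged as under-constrained."""
    H = J.T @ J                 # 6x6 Gauss-Newton Hessian approximation
    w, V = np.linalg.eigh(H)    # ascending eigenvalues, orthonormal columns
    weak = w < eig_threshold    # illustrative threshold
    return w[weak], V[:, weak]  # columns are weak directions (e.g. apron-longitudinal)

# A localizer can then inflate covariance along the weak directions, or freeze
# the update in those directions, instead of publishing an overconfident pose.
```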

Acceptance Gates for Production-Oriented SLAM Evaluation

| Gate | Target for airside map construction | Target for runtime localization | Notes |
| --- | --- | --- | --- |
| Local odometry drift | Under 0.5-1.0% before loop closure on airport-like loops | Fallback only; must be time-limited | Exact threshold depends on safe-stop policy and map coverage. |
| Global map accuracy | GCP/RTK residual P95 under 10 cm for navigation layers; tighter for docking zones | N/A | Docking/aircraft proximity may need local survey refinement. |
| Runtime pose accuracy | N/A | Typical steady state under 5-10 cm lateral and under 0.2 deg yaw in validated map zones | Must be validated against vehicle-level safety margins. |
| False loop closures | Zero accepted false positives in the safety benchmark | Zero accepted false relocalizations | A single false closure can invalidate the map or pose. |
| Tracking loss | Documented and recoverable | Safe fallback or safe stop within the safety budget | "No output" can be safer than wrong output. |
| Deadline misses | Offline acceptable if bounded | P99 under the localization cycle budget | Report on target hardware, not workstation only. |
| Covariance consistency | NEES/NIS within expected bounds on instrumented tests | Same, with fault injection (see the sketch after this table) | Overconfidence is a safety bug. |
| Map artifact rate | No double walls/ghost aircraft in operational layers | N/A | Visual inspection plus automated cloud-distance checks. |
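The covariance-consistency gate is mechanical once ground truth is available on an instrumented run. A minimal sketch, assuming state-error vectors (estimate minus truth, in a consistent frame) and matching estimator covariances; scipy provides the chi-square quantiles, and the two-sided 95% interval is an illustrative choice.

```python
import numpy as np
from scipy.stats import chi2

def nees_report(errors, covariances, alpha=0.05):
    """NEES per step plus the fraction falling outside two-sided chi-square
    bounds; a consistent estimator averages near the state dimension."""
    dim = len(errors[0])
    lo, hi = chi2.ppf(alpha / 2, dim), chi2.ppf(1 - alpha / 2, dim)
    nees = np.array([e @ np.linalg.solve(P, e)            # e^T P^-1 e
                     for e, P in zip(errors, covariances)])
    return {"mean_nees": float(nees.mean()),              # expect ~dim
            "violation_rate": float(((nees < lo) | (nees > hi)).mean())}
```

NIS applies the same bounds to innovation vectors and innovation covariances and needs no ground truth, which suits the fault-injection variant of the gate.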

Reporting Template for Method Pages

| Section | Required content |
| --- | --- |
| Dataset table | Public datasets used, sequence IDs, sensor subset, preprocessing, alignment mode |
| Metrics table | ATE, RPE/segment drift, loop metrics if applicable, runtime, memory, failure counts |
| Hardware table | CPU/GPU, ROS version, compiler/build mode, thread count, target embedded result if available |
| Calibration assumptions | Intrinsics/extrinsics/time-sync source; whether online calibration is enabled |
| Failure analysis | At least three failure modes with detection and mitigation |
| Production fit | License, ROS 1/2 support, API stability, diagnostics, hot-path determinism |
| Airside extrapolation | What public datasets do not test and which private sequences are required |
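Where the report feeds automation, the same template can be captured as a machine-readable record. A minimal sketch, assuming a Python toolchain; every field name below is illustrative rather than a fixed schema.

```python
from dataclasses import dataclass, field

@dataclass
class MethodBenchmarkReport:
    """One row of the method-page report; field names are illustrative."""
    method: str                 # e.g. "FAST-LIO2"
    dataset: str                # e.g. "MulRan"
    sequences: list             # sequence IDs actually run
    alignment: str              # "SE3", "Sim3", or "fixed-frame"
    ate_rmse_m: float
    rpe_trans_pct: float
    runtime_p99_ms: float       # measured on target hardware
    tracking_losses: int
    hardware: str               # CPU/GPU, build mode, thread count
    failure_modes: list = field(default_factory=list)  # at least three expected
```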

Metric Pitfalls

| Pitfall | Why it misleads | Better practice |
| --- | --- | --- |
| Averaging across easy and hard sequences without stratification | A method can look strong by dominating easy sequences and failing rare critical cases | Report per-sequence and by condition bucket. |
| Using final ATE only after loop closure | A loop can hide poor odometry until recovery is impossible | Report pre-loop odometry drift and post-loop global consistency. |
| Ignoring false-positive loops | Recall improvements can destroy maps | Report precision after geometric verification and robust-kernel behavior. |
| Reporting mean latency only | Embedded systems fail at P99/P999 | Report stage-timing distributions and deadline misses (see the sketch after this table). |
| Comparing methods with different alignment freedoms | Sim(3) can forgive metric-scale errors | State SE(3)/Sim(3)/fixed-frame alignment explicitly. |
| Treating map and trajectory as interchangeable | A good trajectory can still produce a poor dense map | Measure map overlap, surface thickness, and dynamic artifacts. |
| Ignoring estimator consistency | A smooth pose with false covariance can corrupt fusion | Report NEES/NIS and innovation-gating outcomes. |
| Benchmarking only public datasets | Public data rarely matches the target ODD | Add domain-specific private tests and fault injection. |
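The mean-latency pitfall is cheap to avoid once per-stage timings are recorded. An illustrative sketch, assuming equal-length per-cycle timing lists per stage; the 100 ms cycle budget is an assumption to replace with the target platform's actual localization budget.

```python
import numpy as np

def latency_report(stage_timings_ms, cycle_budget_ms=100.0):
    """stage_timings_ms: dict of stage name -> per-cycle milliseconds,
    one sample per cycle so stages can be summed into end-to-end latency."""
    report = {}
    for stage, samples in stage_timings_ms.items():
        s = np.asarray(samples, dtype=float)
        report[stage] = {"mean": float(s.mean()),
                         "p95": float(np.percentile(s, 95)),
                         "p99": float(np.percentile(s, 99))}
    total = np.sum([np.asarray(v, dtype=float)
                    for v in stage_timings_ms.values()], axis=0)
    report["deadline_miss_rate"] = float((total > cycle_budget_ms).mean())
    return report
```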

Sources

Compiled from publicly available research notes and dataset documentation.