Uncertainty Calibration for Perception-SLAM Release Gates
Last updated: 2026-05-09
Purpose
This protocol defines the release gates for uncertainty estimates produced by perception, localization, SLAM, and map-change systems. Airside autonomy cannot rely on point estimates alone. The stack must know when it is uncertain, surface that uncertainty to runtime monitors, and avoid confident wrong answers near aircraft, people, geofence boundaries, foreign object debris (FOD), temporary ground support equipment (GSE), and stale map regions.
This file is used by the perception-SLAM evidence case, the statistical validity protocol, the corruption and fault injection protocol, and online perception monitoring and ODD enforcement.
Calibration Scope
| Output | Uncertainty signal | Ground truth |
|---|---|---|
| Ego pose | Covariance, particle dispersion, factor-graph marginal, scan-match score | Surveyed trajectory, RTK/INS reference, motion-capture/test-track reference |
| Relative motion | Odometry covariance, IMU preintegration residual, wheel-slip indicator | Ground-truth relative transform over fixed windows |
| Object detection | Class probability, bounding-box covariance, track existence probability | Human labels, fused multi-sensor labels, adjudicated critical labels |
| Free-space/occupancy | Occupancy probability, unknown probability, map confidence | Surveyed occupancy, labeled obstacles/FOD, closed-course fixtures |
| Map change | Change probability, persistence score, reviewer priority | Cross-session map labels and human map QA |
| Sensor health | Degradation score, dropout probability, contamination score | Fault injection label, environmental logs, sensor diagnostics |
Release Gate Philosophy
Calibration is a safety gate only when it changes behavior. A probability estimate that is not consumed by the planner, monitor, map publisher, or fleet triage system is diagnostic, not safety evidence.
For each uncertainty output, the release package must show:
- The consumer of the uncertainty signal.
- The action triggered by high uncertainty.
- The calibration partition used to set thresholds.
- The locked test partition used to measure calibration.
- The ODD slices where calibration is valid.
- The fallback behavior when calibration is out of scope.
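The six required items above can be captured as one record per uncertainty output in the release package. The following is an illustrative sketch; the class and field names are hypothetical, not a fixed schema.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class UncertaintyReleaseRecord:
    """One record per uncertainty output in the release package (illustrative)."""
    output: str                   # e.g. "ego_pose_covariance"
    consumer: str                 # who acts on the signal, e.g. "runtime_monitor"
    high_uncertainty_action: str  # e.g. "controlled_stop"
    calibration_partition: str    # versioned partition used to set thresholds
    locked_test_partition: str    # versioned partition used to measure calibration
    valid_odd_slices: tuple       # ODD slices where calibration is claimed valid
    out_of_scope_fallback: str    # e.g. "enforce_odd_restriction"

def is_actionable(rec: UncertaintyReleaseRecord) -> bool:
    """Gate U3 precheck: the signal must have a consumer and a defined action."""
    return bool(rec.consumer) and bool(rec.high_uncertainty_action)
```

A record that fails `is_actionable` is diagnostic output, not safety evidence, and should not be counted toward the release case.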
Metrics
| Metric | Use | Release interpretation |
|---|---|---|
| Expected calibration error (ECE) | Binned confidence vs empirical correctness | Detects broad miscalibration; report by ODD slice |
| Maximum calibration error (MCE) | Worst-bin confidence gap | Blocks release when a high-risk bin is overconfident |
| Negative log likelihood (NLL) | Proper scoring rule for probabilistic outputs | Penalizes confident wrong predictions |
| Brier score | Proper scoring rule for binary/multiclass probabilities | Useful for event probabilities such as map-change or occupancy |
| Reliability diagram | Visual audit of confidence bins | Required for safety board review |
| Coverage | Fraction of true values inside predicted set/interval | Core metric for pose/object/free-space uncertainty |
| Sharpness | Size of confidence set/interval | Prevents trivially wide but useless uncertainty |
| Risk-coverage curve | Error rate as uncertain samples are rejected | Shows whether abstention/degraded mode is meaningful |
| Calibration under corruption | Metric delta under fault injection | Detects silent overconfidence under sensor/weather faults |
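ECE, MCE, and the reliability diagram all derive from the same binned statistics, so one pass over the locked test partition can produce all three. A minimal sketch with equal-width bins (binning scheme and bin count are assumptions; programs may prefer equal-mass bins):

```python
import numpy as np

def calibration_report(conf, correct, n_bins=10):
    """Equal-width reliability bins over [0, 1]; returns ECE, MCE, per-bin stats.

    conf:    predicted confidence per sample
    correct: 1 if the prediction was right, else 0
    """
    conf = np.asarray(conf, float)
    correct = np.asarray(correct, float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Assign each sample to a bin; clip so conf == 0.0 lands in the first bin.
    idx = np.clip(np.digitize(conf, edges, right=True) - 1, 0, n_bins - 1)
    ece, mce, bins = 0.0, 0.0, []
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue
        avg_conf = conf[mask].mean()
        accuracy = correct[mask].mean()
        gap = abs(avg_conf - accuracy)
        ece += mask.mean() * gap   # ECE: population-weighted bin gap
        mce = max(mce, gap)        # MCE: worst-bin gap
        bins.append((b, int(mask.sum()), avg_conf, accuracy))
    return ece, mce, bins
```

The per-bin tuples are exactly the data plotted in the reliability diagram required for safety board review; report them per ODD slice, not only in aggregate.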
Calibration Methods
| Method | Use | Constraints |
|---|---|---|
| Temperature scaling | Neural classifier and detector class probabilities | Use independent calibration data; do not tune on locked test |
| Vector/matrix scaling | Multiclass outputs with class-specific bias | Requires more calibration data than temperature scaling |
| Isotonic or histogram calibration | Event probabilities with enough samples per bin | Avoid when bins are sparse |
| Gaussian covariance scaling | Pose and bounding-box covariance | Validate with normalized estimation error squared or coverage |
| Ensemble/dropout uncertainty | Model epistemic uncertainty | Must be validated against held-out route/airport slices |
| Conformal prediction | Distribution-free coverage sets under exchangeability | Use slice-aware calibration; do not claim conditional coverage unless tested |
| Mondrian/sliced conformal | Coverage by risk group or ODD bin | Use when high-risk airside slices differ materially |
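Temperature scaling is the simplest method in the table: a single scalar T divides the logits, fit by minimizing NLL on the calibration split only. A sketch using a grid search (the grid bounds are an assumption; a 1-D optimizer works equally well):

```python
import numpy as np

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def nll(logits, labels, T):
    """Mean negative log likelihood of the true labels at temperature T."""
    p = softmax(logits / T)
    return -np.mean(np.log(p[np.arange(len(labels)), labels] + 1e-12))

def fit_temperature(logits, labels, grid=np.geomspace(0.1, 10.0, 200)):
    """Pick the scalar T minimizing NLL on the *calibration* split.

    Per gate U0, never fit T on the locked test partition.
    """
    return min(grid, key=lambda T: nll(logits, labels, T))
```

Because T does not change the argmax, accuracy is untouched; only confidence is rescaled, which is why it needs less calibration data than vector or matrix scaling.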
Gates
| Gate | Pass condition | Block condition |
|---|---|---|
| U0 provenance | Calibration, validation, and test partitions are versioned and leakage-checked | Any tuning on locked test data |
| U1 nominal calibration | ECE/NLL/Brier and coverage pass thresholds on nominal ODD slices | Overconfidence in any critical class or zone |
| U2 high-risk slice calibration | People, aircraft, FOD, geofence, wet/night, and stale-map slices pass | Aggregate pass hides high-risk slice failure |
| U3 uncertainty actionability | High uncertainty triggers a defined runtime or fleet action | Uncertainty is logged but unused for safety behavior |
| U4 corruption calibration | Under credible corruptions, uncertainty increases before or with error rate | Silent overconfidence under rain, beam loss, time skew, or extrinsic drift |
| U5 conformal coverage | Empirical coverage meets target within tolerance by approved slice | Coverage fails in an ODD slice intended for release |
| U6 operational watch | Post-release confidence/error distributions match validation envelope | Drift alert unresolved beyond watch window |
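Gate U5 can be checked with split conformal prediction: compute the finite-sample-corrected quantile of nonconformity scores on the calibration split, then measure empirical coverage on the locked test split. A minimal sketch; for the Mondrian variant in the methods table, run `conformal_quantile` once per ODD slice.

```python
import numpy as np

def conformal_quantile(cal_scores, alpha=0.05):
    """Split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest
    nonconformity score from the calibration split."""
    scores = np.sort(np.asarray(cal_scores, float))
    n = len(scores)
    k = int(np.ceil((n + 1) * (1 - alpha)))
    return scores[min(k, n) - 1]

def empirical_coverage(test_scores, qhat):
    """Gate U5 check: fraction of locked-test scores inside the set."""
    return float(np.mean(np.asarray(test_scores, float) <= qhat))
```

Exchangeability only guarantees marginal coverage; conditional coverage by slice must be measured, not assumed, which is why U5 requires the check per approved ODD slice.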
Suggested Threshold Pattern
Exact thresholds are program-specific and must be approved in the release plan. A defensible default pattern:
| Output | Example target |
|---|---|
| Pose 95 percent confidence region | At least 93 percent empirical coverage by approved ODD slice |
| Pose covariance consistency | Normalized estimation error squared (NEES) not persistently above the chi-square envelope |
| Object class confidence | ECE below agreed threshold; no high-confidence false negative for people/aircraft/FOD in locked test |
| Occupancy/free-space | Zero high-confidence free-space false positives in protected zones |
| Map-change probability | High-risk changes prioritize review with high recall, accepting moderate false positives |
| Sensor degradation score | Severe injected faults detected before planner consumes stale/confident output |
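The pose covariance consistency row is typically checked with the normalized estimation error squared (NEES): if the covariance is well scaled, NEES follows a chi-square distribution with dimension-of-error degrees of freedom, so only about 5 percent of samples should exceed the 95th-percentile bound. A sketch with standard table values; the exceedance tolerance is an assumption to be set in the release plan.

```python
import numpy as np

# 95th-percentile chi-square bounds by degrees of freedom (standard table values).
CHI2_95 = {1: 3.841, 2: 5.991, 3: 7.815, 6: 12.592}

def nees(err, cov):
    """Normalized estimation error squared for one pose sample."""
    e = np.asarray(err, float)
    return float(e @ np.linalg.solve(np.asarray(cov, float), e))

def covariance_consistent(errors, covs, dof, max_exceed_frac=0.05):
    """Flag persistent overconfidence: NEES should exceed the 95 percent
    chi-square envelope in roughly 5 percent of samples when well scaled."""
    vals = [nees(e, P) for e, P in zip(errors, covs)]
    exceed = float(np.mean([v > CHI2_95[dof] for v in vals]))
    return exceed <= max_exceed_frac, exceed
```

An overconfident filter (covariance too small) inflates NEES and fails this check long before the pose error itself becomes visible in aggregate metrics.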
Runtime Actions
| Trigger | Required action |
|---|---|
| Pose uncertainty above route threshold | Reduce speed, increase following/clearance margins, prepare controlled stop |
| Pose uncertainty above hard threshold | Controlled stop or remote-assist handoff |
| Object class uncertainty near aircraft/person | Treat as obstacle or request review; do not suppress as low confidence |
| Free-space uncertainty in protected zone | Mark unknown/blocked, not free |
| Map-change uncertainty high | Quarantine tile or create temporary override pending review |
| Calibration out-of-scope ODD detected | Enforce ODD restriction or degraded mode |
| Sensor uncertainty rises under corruption | Switch modality, reduce speed, and log event for fleet triage |
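The two pose rows above form a two-threshold ladder, which runtime monitors commonly implement as a simple dispatch. A sketch; the threshold names and action strings are illustrative, with actual values owned by the approved threshold file.

```python
def pose_uncertainty_action(sigma: float,
                            route_threshold: float,
                            hard_threshold: float) -> str:
    """Map a scalar pose-uncertainty measure to the required runtime action.

    Assumes hard_threshold > route_threshold; both come from the versioned
    threshold file, not from code.
    """
    if sigma >= hard_threshold:
        return "controlled_stop_or_remote_assist"
    if sigma >= route_threshold:
        return "reduce_speed_increase_margins"
    return "nominal"
```

Checking the hard threshold first makes the ladder monotone: raising uncertainty can only escalate the action, never relax it.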
Airside Failure Modes
| Failure mode | Calibration symptom | Required evidence |
|---|---|---|
| Wet apron reflection marked as free space | High confidence free-space error | Wet-ground reliability diagram and false-free-space table |
| Aircraft surface produces poor scan match | Pose covariance remains small while residual rises | Residual-to-error calibration and aircraft-present slice |
| GNSS multipath near terminal | Localization reports stable pose with wrong global alignment | GNSS degraded slice and cross-sensor consistency check |
| Temporary GSE added to permanent map | Map-change score underestimates persistence uncertainty | Cross-session map-change calibration |
| FOD suppressed as noise | Low object confidence despite safety relevance | Critical-object false-negative review |
| Beam dropout in rain | Detector confidence unchanged as point density falls | Corruption calibration campaign |
| Camera/LiDAR extrinsic drift | Fusion confidence high despite cross-modal misalignment | Miscalibration fault injection |
Evidence Artifacts
| Artifact | Contents |
|---|---|
| Calibration manifest | Data partitions, map versions, vehicle configs, sensor configs, ODD slices |
| Metric report | ECE, NLL, Brier, coverage, sharpness, risk-coverage by output and slice |
| Reliability diagrams | Confidence bins for nominal, high-risk, and corrupted slices |
| Threshold file | Runtime thresholds with owners, review date, and linked evidence |
| Runtime integration proof | Tests showing monitor/planner/map publisher consumes uncertainty correctly |
| Drift dashboard | Post-release confidence distributions and alert thresholds |
| Defect log | Overconfidence incidents, root cause, mitigation, re-test evidence |
Owner Handoffs
| Owner | Responsibility |
|---|---|
| Perception/SLAM owner | Produce uncertainty signals and calibration models |
| Runtime assurance owner | Consume uncertainty and enforce degraded-mode actions |
| V&V owner | Lock partitions, run calibration gates, publish report |
| Data platform owner | Maintain calibration/test data lineage and fleet monitoring fields |
| Safety lead | Approve high-risk thresholds and residual risk |
| Fleet operations | Monitor drift and trigger post-release rollback/quarantine |
Sources
- Guo et al., "On Calibration of Modern Neural Networks": https://arxiv.org/abs/1706.04599
- PMLR paper page for Guo et al.: https://proceedings.mlr.press/v70/guo17a.html
- Angelopoulos and Bates, "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification": https://arxiv.org/abs/2107.07511
- Lei et al., "Distribution-Free Predictive Inference for Regression": https://arxiv.org/abs/1604.04173
- Vovk et al., "Nonparametric predictive distributions based on conformal prediction": https://link.springer.com/article/10.1007/s10994-018-5755-8
- ISO 21448:2022, Road vehicles - Safety of the intended functionality: https://www.iso.org/standard/77490.html
- UL 4600 Ed. 3, Evaluation of Autonomous Products: https://webstore.ansi.org/standards/ul/ul4600ed2023
- Waymo Safety Methodologies and Safety Readiness Determinations: https://arxiv.org/abs/2011.00054