Uncertainty Calibration for Perception-SLAM Release Gates

Last updated: 2026-05-09

Purpose

This protocol defines the release gates for uncertainty estimates produced by perception, localization, SLAM, and map-change systems. Airside autonomy cannot rely only on point estimates. The stack must know when it is uncertain, surface that uncertainty to runtime monitors, and avoid confident wrong answers near aircraft, people, geofence boundaries, FOD, temporary GSE, and stale map regions.

This file is used by the perception-SLAM evidence case, the statistical validity protocol, the corruption and fault injection protocol, and online perception monitoring and ODD enforcement.

Calibration Scope

| Output | Uncertainty signal | Ground truth |
| --- | --- | --- |
| Ego pose | Covariance, particle dispersion, factor-graph marginal, scan-match score | Surveyed trajectory, RTK/INS reference, motion-capture/test-track reference |
| Relative motion | Odometry covariance, IMU preintegration residual, wheel-slip indicator | Ground-truth relative transform over fixed windows |
| Object detection | Class probability, bounding-box covariance, track existence probability | Human labels, fused multi-sensor labels, adjudicated critical labels |
| Free-space/occupancy | Occupancy probability, unknown probability, map confidence | Surveyed occupancy, labeled obstacles/FOD, closed-course fixtures |
| Map change | Change probability, persistence score, reviewer priority | Cross-session map labels and human map QA |
| Sensor health | Degradation score, dropout probability, contamination score | Fault injection label, environmental logs, sensor diagnostics |

Release Gate Philosophy

Calibration is a safety gate only when it changes behavior. A probability estimate that is not consumed by the planner, monitor, map publisher, or fleet triage system is diagnostic, not safety evidence.

For each uncertainty output, the release package must show:

  1. The consumer of the uncertainty signal.
  2. The action triggered by high uncertainty.
  3. The calibration partition used to set thresholds.
  4. The locked test partition used to measure calibration.
  5. The ODD slices where calibration is valid.
  6. The fallback behavior when calibration is out of scope.
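As a minimal sketch, the six items above could be captured as one record per uncertainty output and checked for completeness before review. The class name, field names, and helper functions are hypothetical and are not part of any existing tooling.

```python
from dataclasses import dataclass, fields


@dataclass
class UncertaintyReleaseRecord:
    """One record per uncertainty output. All field names are hypothetical."""
    output_name: str
    consumer: str                  # 1. who consumes the signal
    high_uncertainty_action: str   # 2. action triggered by high uncertainty
    calibration_partition: str     # 3. partition used to set thresholds
    locked_test_partition: str     # 4. partition used to measure calibration
    valid_odd_slices: list         # 5. ODD slices where calibration is valid
    out_of_scope_fallback: str     # 6. behavior when calibration is out of scope


def missing_items(record):
    """Return the names of fields that are empty (empty string or empty list)."""
    return [f.name for f in fields(record) if not getattr(record, f.name)]


def partitions_are_distinct(record):
    """Thresholds must not be set on the same partition that measures calibration."""
    return record.calibration_partition != record.locked_test_partition
```

A record with an empty `out_of_scope_fallback` or with identical calibration and test partition identifiers would be returned to the owner before any metric is reviewed.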

Metrics

| Metric | Use | Release interpretation |
| --- | --- | --- |
| Expected calibration error (ECE) | Binned confidence vs empirical correctness | Detects broad miscalibration; report by ODD slice |
| Maximum calibration error (MCE) | Worst-bin confidence gap | Blocks release when a high-risk bin is overconfident |
| Negative log likelihood (NLL) | Proper scoring rule for probabilistic outputs | Penalizes confident wrong predictions |
| Brier score | Proper scoring rule for binary/multiclass probabilities | Useful for event probabilities such as map-change or occupancy |
| Reliability diagram | Visual audit of confidence bins | Required for safety board review |
| Coverage | Fraction of true values inside predicted set/interval | Core metric for pose/object/free-space uncertainty |
| Sharpness | Size of confidence set/interval | Prevents trivially wide but useless uncertainty |
| Risk-coverage curve | Error rate as uncertain samples are rejected | Shows whether abstention/degraded mode is meaningful |
| Calibration under corruption | Metric delta under fault injection | Detects silent overconfidence under sensor/weather faults |
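The binned and scoring-rule metrics in the table can be illustrated with a small standard-library sketch. It assumes equal-width confidence bins and binary outcomes; the function names and the default of 10 bins are illustrative choices, not values set by this protocol.

```python
import math


def calibration_errors(confidences, correct, n_bins=10):
    """Return (ECE, MCE) using equal-width confidence bins on [0, 1]."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, y in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, y))
    ece, mce = 0.0, 0.0
    for b in bins:
        if not b:
            continue  # empty bins contribute nothing
        mean_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(y for _, y in b) / len(b)
        gap = abs(mean_conf - accuracy)
        ece += (len(b) / n) * gap   # weighted by bin population
        mce = max(mce, gap)         # worst single bin
    return ece, mce


def brier_score(probs, outcomes):
    """Mean squared difference between predicted probability and 0/1 outcome."""
    return sum((p - y) ** 2 for p, y in zip(probs, outcomes)) / len(probs)


def binary_nll(probs, outcomes, eps=1e-12):
    """Mean negative log likelihood for binary events; probabilities are clipped."""
    total = 0.0
    for p, y in zip(probs, outcomes):
        p = min(max(p, eps), 1.0 - eps)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total / len(probs)
```

Running these per ODD slice rather than once on the pooled data is what lets a high-risk slice failure show up instead of being averaged away.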

Calibration Methods

| Method | Use | Constraints |
| --- | --- | --- |
| Temperature scaling | Neural classifier and detector class probabilities | Use independent calibration data; do not tune on locked test |
| Vector/matrix scaling | Multiclass outputs with class-specific bias | Requires more calibration data than temperature scaling |
| Isotonic or histogram calibration | Event probabilities with enough samples per bin | Avoid when bins are sparse |
| Gaussian covariance scaling | Pose and bounding-box covariance | Validate with normalized estimation error squared or coverage |
| Ensemble/dropout uncertainty | Model epistemic uncertainty | Must be validated against held-out route/airport slices |
| Conformal prediction | Distribution-free coverage sets under exchangeability | Use slice-aware calibration; do not claim conditional coverage unless tested |
| Mondrian/sliced conformal | Coverage by risk group or ODD bin | Use when high-risk airside slices differ materially |
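The sliced conformal row can be illustrated with a split-conformal sketch in which each ODD slice gets its own threshold. It assumes a scalar nonconformity score (for example, absolute position error in meters) and exchangeability within each slice; the function names and slice labels are hypothetical.

```python
import math
from collections import defaultdict


def conformal_quantile(scores, alpha):
    """Finite-sample split-conformal threshold: the ceil((n+1)(1-alpha))-th smallest score.

    Returns infinity when the slice has too few calibration samples for the level.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def mondrian_thresholds(cal_scores, cal_slices, alpha):
    """Compute one conformal threshold per slice from calibration data only."""
    by_slice = defaultdict(list)
    for score, slice_name in zip(cal_scores, cal_slices):
        by_slice[slice_name].append(score)
    return {name: conformal_quantile(vals, alpha) for name, vals in by_slice.items()}


def coverage_by_slice(test_scores, test_slices, thresholds):
    """Fraction of locked-test scores at or below the threshold of their own slice."""
    hits, totals = defaultdict(int), defaultdict(int)
    for score, slice_name in zip(test_scores, test_slices):
        totals[slice_name] += 1
        # A slice with no calibration data gets an infinite threshold here, which
        # shows up as trivially wide sets and should be caught by a sharpness check.
        hits[slice_name] += int(score <= thresholds.get(slice_name, math.inf))
    return {name: hits[name] / totals[name] for name in totals}
```

Per-slice coverage measured this way is still marginal within the slice; it does not establish coverage conditional on any finer grouping that was not calibrated and tested separately.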

Gates

| Gate | Pass condition | Block condition |
| --- | --- | --- |
| U0 provenance | Calibration, validation, and test partitions are versioned and leakage-checked | Any tuning on locked test data |
| U1 nominal calibration | ECE/NLL/Brier and coverage pass thresholds on nominal ODD slices | Overconfidence in any critical class or zone |
| U2 high-risk slice calibration | People, aircraft, FOD, geofence, wet/night, and stale-map slices pass | Aggregate pass hides high-risk slice failure |
| U3 uncertainty actionability | High uncertainty triggers a defined runtime or fleet action | Uncertainty is logged but unused for safety behavior |
| U4 corruption calibration | Under credible corruptions, uncertainty increases before or with error rate | Silent overconfidence under rain, beam loss, time skew, or extrinsic drift |
| U5 conformal coverage | Empirical coverage meets target within tolerance by approved slice | Coverage fails in an ODD slice intended for release |
| U6 operational watch | Post-release confidence/error distributions match validation envelope | Drift alert unresolved beyond watch window |
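Two of the gates lend themselves to a short sketch. The first function shows one way U2 could be evaluated so that a missing or failing high-risk slice blocks release regardless of the aggregate result. The second shows one possible reading of U4, comparing each corruption against the nominal run. Function names, the tuple layout, and the tolerance parameter are assumptions made for illustration.

```python
def evaluate_slice_gate(slice_metrics, thresholds, required_slices):
    """U2-style check: every required slice must have evidence and meet its threshold.

    slice_metrics and thresholds map slice name -> metric value (lower is better).
    Returns (passed, failures), where failures is a list of (slice, reason).
    """
    failures = []
    for name in required_slices:
        if name not in slice_metrics:
            failures.append((name, "no evidence"))
        elif slice_metrics[name] > thresholds[name]:
            failures.append((name, "metric above threshold"))
    return (not failures, failures)


def silent_overconfidence(nominal, corrupted, tol=0.0):
    """U4-style check: flag corruptions where error rose but uncertainty did not.

    nominal is (error_rate, mean_uncertainty); corrupted maps a corruption name
    to the same pair measured under that corruption.
    """
    base_err, base_unc = nominal
    return [
        name
        for name, (err, unc) in corrupted.items()
        if err - base_err > tol and unc <= base_unc
    ]
```

Treating a slice with no evidence as a failure, rather than skipping it, keeps the gate from passing simply because a high-risk slice was never collected.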

Suggested Threshold Pattern

Exact thresholds are program-specific and must be approved in the release plan. A defensible default pattern:

| Output | Example target |
| --- | --- |
| Pose 95 percent confidence region | At least 93 percent empirical coverage by approved ODD slice |
| Pose covariance consistency | Normalized error not persistently above chi-square envelope |
| Object class confidence | ECE below agreed threshold; no high-confidence false negative for people/aircraft/FOD in locked test |
| Occupancy/free-space | High-confidence free-space false positive is zero in protected zones |
| Map-change probability | High-risk changes prioritize review with high recall, accepting moderate false positives |
| Sensor degradation score | Severe injected faults detected before planner consumes stale/confident output |
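The two pose rows can be illustrated with a normalized estimation error squared (NEES) check. This sketch is restricted to 2D position error, where the chi-square quantile with 2 degrees of freedom has the closed form -2 ln(1 - p); a full pose would need more degrees of freedom and a general chi-square quantile. Function names are hypothetical.

```python
import math


def nees_2d(err, cov):
    """Normalized estimation error squared e^T P^-1 e for a 2D error and 2x2 covariance."""
    ex, ey = err
    (a, b), (c, d) = cov
    det = a * d - b * c
    if det <= 0:
        raise ValueError("covariance must be positive definite")
    # The inverse of [[a, b], [c, d]] is (1/det) * [[d, -b], [-c, a]].
    return (ex * (d * ex - b * ey) + ey * (-c * ex + a * ey)) / det


def chi2_quantile_2dof(p):
    """Chi-square quantile for 2 degrees of freedom (CDF is 1 - exp(-x/2))."""
    return -2.0 * math.log(1.0 - p)


def pose_region_coverage(errors, covs, p=0.95):
    """Fraction of samples whose true position lies inside the predicted p-region."""
    bound = chi2_quantile_2dof(p)
    inside = sum(1 for e, P in zip(errors, covs) if nees_2d(e, P) <= bound)
    return inside / len(errors)
```

With `p=0.95` the bound is about 5.99. Empirical coverage well below the target suggests the reported covariance is too small; coverage near 100 percent with large regions suggests it is too wide, which the sharpness metric is meant to catch.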

Runtime Actions

| Trigger | Required action |
| --- | --- |
| Pose uncertainty above route threshold | Reduce speed, increase following/clearance margins, prepare controlled stop |
| Pose uncertainty above hard threshold | Controlled stop or remote-assist handoff |
| Object class uncertainty near aircraft/person | Treat as obstacle or request review; do not suppress as low confidence |
| Free-space uncertainty in protected zone | Mark unknown/blocked, not free |
| Map-change uncertainty high | Quarantine tile or create temporary override pending review |
| Calibration out-of-scope ODD detected | Enforce ODD restriction or degraded mode |
| Sensor uncertainty rises under corruption | Switch modality, reduce speed, and log event for fleet triage |
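A minimal sketch of how the pose and free-space rows might map onto monitor logic follows. The enum, function names, and threshold parameters are hypothetical. Treating a non-finite uncertainty as a stop condition, and labeling low-confidence cells outside protected zones as unknown, are assumptions of this sketch rather than requirements stated in the table.

```python
import math
from enum import Enum


class PoseAction(Enum):
    NOMINAL = "nominal"
    DEGRADED = "reduce speed, widen margins, prepare controlled stop"
    STOP = "controlled stop or remote-assist handoff"


def pose_action(uncertainty, route_threshold, hard_threshold):
    """Map a scalar pose uncertainty to the tiered actions in the table."""
    if hard_threshold < route_threshold:
        raise ValueError("hard threshold must not be below route threshold")
    # Assumption: a NaN or infinite uncertainty is treated as worst case.
    if not math.isfinite(uncertainty) or uncertainty > hard_threshold:
        return PoseAction.STOP
    if uncertainty > route_threshold:
        return PoseAction.DEGRADED
    return PoseAction.NOMINAL


def occupancy_label(p_free, in_protected_zone, min_free_confidence):
    """Never report 'free' below the required confidence; block inside protected zones."""
    if p_free >= min_free_confidence:
        return "free"
    return "blocked" if in_protected_zone else "unknown"
```

The thresholds passed to these functions would come from the approved threshold file, not from constants in code.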

Airside Failure Modes

| Failure mode | Calibration symptom | Required evidence |
| --- | --- | --- |
| Wet apron reflection marked as free space | High-confidence free-space error | Wet-ground reliability diagram and false-free-space table |
| Aircraft surface produces poor scan match | Pose covariance remains small while residual rises | Residual-to-error calibration and aircraft-present slice |
| GNSS multipath near terminal | Localization reports stable pose with wrong global alignment | GNSS degraded slice and cross-sensor consistency check |
| Temporary GSE added to permanent map | Map-change score underestimates persistence uncertainty | Cross-session map-change calibration |
| FOD suppressed as noise | Low object confidence despite safety relevance | Critical-object false-negative review |
| Beam dropout in rain | Detector confidence unchanged as point density falls | Corruption calibration campaign |
| Camera/LiDAR extrinsic drift | Fusion confidence high despite cross-modal misalignment | Miscalibration fault injection |

Evidence Artifacts

| Artifact | Contents |
| --- | --- |
| Calibration manifest | Data partitions, map versions, vehicle configs, sensor configs, ODD slices |
| Metric report | ECE, NLL, Brier, coverage, sharpness, risk-coverage by output and slice |
| Reliability diagrams | Confidence bins for nominal, high-risk, and corrupted slices |
| Threshold file | Runtime thresholds with owners, review date, and linked evidence |
| Runtime integration proof | Tests showing monitor/planner/map publisher consumes uncertainty correctly |
| Drift dashboard | Post-release confidence distributions and alert thresholds |
| Defect log | Overconfidence incidents, root cause, mitigation, re-test evidence |

Owner Handoffs

| Owner | Responsibility |
| --- | --- |
| Perception/SLAM owner | Produce uncertainty signals and calibration models |
| Runtime assurance owner | Consume uncertainty and enforce degraded-mode actions |
| V&V owner | Lock partitions, run calibration gates, publish report |
| Data platform owner | Maintain calibration/test data lineage and fleet monitoring fields |
| Safety lead | Approve high-risk thresholds and residual risk |
| Fleet operations | Monitor drift and trigger post-release rollback/quarantine |

Sources

Public research notes collected from public sources.