Perception-SLAM Statistical Validity Protocol
Last updated: 2026-05-09
Purpose
This protocol defines how to make defensible statistical claims about perception-SLAM and map reliability for airside autonomous ground vehicles. It guards against common validation failures: aggregate-only pass rates, reused test sets, silent cherry-picking, correlated traversals treated as independent samples, and release decisions based on mileage alone.
It supports the top-level perception-SLAM map reliability evidence case, the SLAM map benchmark protocol, the uncertainty calibration release gates, and the perception-SLAM fleet data contract.
Statistical Claims
Every release claim must be written down and frozen before the validation campaign starts:
| Claim type | Example claim | Required statistical form |
|---|---|---|
| Reliability | "False-free-space defects are below the release threshold in protected apron zones" | One-sided upper confidence bound or Bayesian credible upper bound |
| Performance | "ATE p95 is below 0.20 m on service-road routes" | Quantile estimate with confidence interval by ODD slice |
| Robustness | "Severe rain corruption causes no more than 15 percent degradation in localization availability" | Paired comparison against clean baseline |
| Calibration | "90 percent pose uncertainty sets contain the true pose at least 88 percent of the time" | Coverage interval and binomial/hierarchical test |
| Regression | "Release candidate is not worse than production by more than delta" | Non-inferiority test with pre-set margin |
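A minimal sketch of one way to record such a pre-registered claim before any campaign data is inspected. The schema, field names, and values below are illustrative assumptions, not part of this protocol:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PreRegisteredClaim:
    """One release claim, frozen before the campaign starts (illustrative schema)."""
    claim_type: str        # e.g. "reliability", "performance", "regression"
    metric: str            # e.g. "false_free_space_defect_rate"
    unit: str              # independent unit of analysis, e.g. "segment"
    odd_slice: str         # stratification slice the claim applies to
    statistical_form: str  # e.g. "one_sided_upper_bound"
    threshold: float       # pre-set pass threshold in the metric's units
    confidence: float      # e.g. 0.95 for a 95 percent one-sided bound


# Example: the reliability claim from the table above, written down in advance.
claim = PreRegisteredClaim(
    claim_type="reliability",
    metric="false_free_space_defect_rate",
    unit="protected_zone_segment",
    odd_slice="zone=apron_service_road",
    statistical_form="one_sided_upper_bound",
    threshold=1e-3,
    confidence=0.95,
)
```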
Units of Analysis
Do not count every frame as an independent sample. Perception and SLAM errors are temporally and spatially correlated.
| Unit | Use for | Independence rule |
|---|---|---|
| Frame | Object-level perception, calibration bins, point-cloud quality | Use cluster-robust intervals grouped by scenario/session |
| Segment | 10-60 second slice around an event or route feature | Preferred unit for detection, relocalization, corruption tests |
| Traverse | Full route, stand approach, or depot mission | Preferred unit for map/localization reliability |
| Map tile | Map publication and quarantine decisions | Tile is independent only if source traversals and route context differ |
| Scenario family | ISO 34502-style functional/logical scenario | Used for coverage and safety argument completeness |
| Airport-day | Fleet post-release monitoring | Used for drift, weather, and operational exposure |
Stratification Plan
At minimum, all headline metrics are reported by:
| Dimension | Required bins |
|---|---|
| Zone | stand, apron service road, depot, taxiway crossing support, maintenance area |
| Lighting | day, dusk/dawn, night, glare |
| Weather/surface | dry, wet, heavy rain, fog/low visibility if in ODD, snow/ice if in ODD |
| Traffic density | quiet, normal turnaround, congested |
| Aircraft state | absent, parked, servicing, pushback/taxi adjacency |
| Sensor health | nominal, reduced point density, camera degraded, GNSS degraded, time-sync warning |
| Map age | newly surveyed, less than 7 days, 7-30 days, more than 30 days, changed since last survey |
| Release phase | offline benchmark, closed-course, shadow mode, limited autonomous, production watch |
Aggregate release decisions are invalid if any safety-critical slice fails or lacks enough exposure for the intended ODD.
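A sketch of this gating rule, assuming a per-slice summary record; the function name and dictionary keys are hypothetical, not a mandated interface:

```python
def gate_release(slices):
    """Slice-gating rule: an aggregate pass is overridden by any safety-critical
    slice that fails or lacks its pre-registered minimum exposure.

    Each slice is a dict with keys "critical" (bool), "passed" (bool),
    "n_units" (int), and "min_units" (int). Block takes precedence.
    """
    if any(s["critical"] and not s["passed"] for s in slices):
        return "block"
    if any(s["critical"] and s["n_units"] < s["min_units"] for s in slices):
        return "inconclusive"
    return "pass"
```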
Metrics and Decision Rules
| Metric | Primary decision rule | Minimum reporting |
|---|---|---|
| False-free-space critical defects | Zero observed in release-critical protected zones; if any observed, release blocked until fixed and re-tested | Count, exposure denominator, root cause, residual risk |
| ATE / RPE | p95 and p99 upper confidence bounds below route-specific thresholds | Median, p95, p99, CI, worst segment |
| Localization availability | Lower confidence bound above threshold by ODD slice | Availability, dropouts/hour, longest outage |
| Relocalization latency | p95 upper confidence bound below allowed time/distance budget | Latency distribution and failed recoveries |
| Scan-to-map residual | Candidate distribution non-inferior to production | Paired delta, drift by map age |
| Ghost rate | Upper bound below map publication threshold | Ghosts per 100 m or per stand/tile |
| Static preservation | Lower bound above threshold | Lost-feature categories and impact |
| Calibration coverage | Empirical coverage interval includes nominal target within tolerance | Reliability diagram by risk bin |
| Robustness degradation | Paired performance drop below corruption-specific limit | Clean vs corrupt paired table |
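A sketch of the calibration-coverage decision rule, assuming scipy is available: compute a one-sided lower Clopper-Pearson bound on the empirical coverage of the nominal 90 percent uncertainty sets and compare it against the 88 percent floor from the claims table. Counts are illustrative:

```python
from scipy.stats import beta


def coverage_lower_bound(n_covered: int, n_total: int, conf: float = 0.95) -> float:
    """One-sided lower Clopper-Pearson bound on the true coverage probability."""
    if n_covered == 0:
        return 0.0
    return beta.ppf(1 - conf, n_covered, n_total - n_covered + 1)


# Example: 915 of 1,000 independent pose checks fell inside their nominal
# 90 percent uncertainty sets; pass if the lower bound clears the 0.88 floor.
lb = coverage_lower_bound(915, 1000)
print(f"coverage lower bound: {lb:.3f}, pass: {lb >= 0.88}")
```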
Sample Size Rules
Zero-Failure Claims
When claiming a rare-event upper bound with zero observed failures, use the rule:
N >= ln(alpha) / ln(1 - p*)

where p* is the maximum acceptable event probability per independent unit and 1 - alpha is the confidence level. Example: to show fewer than 1 critical false-free-space defect per 1,000 independent protected-zone segments at 95 percent confidence with zero observed failures, use N >= ln(0.05) / ln(0.999) ≈ 2,995 independent segments after rounding up.
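A minimal sketch of this rule; the numbers reproduce the example above:

```python
import math


def zero_failure_sample_size(p_star: float, alpha: float = 0.05) -> int:
    """Independent units needed to claim rate < p_star at confidence 1 - alpha
    when zero failures are observed: N >= ln(alpha) / ln(1 - p_star)."""
    return math.ceil(math.log(alpha) / math.log(1.0 - p_star))


# Example from the text: fewer than 1 defect per 1,000 segments at 95 percent confidence.
assert zero_failure_sample_size(p_star=0.001, alpha=0.05) == 2995
```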
This does not prove system safety by itself. RAND's "Driving to Safety" shows why road-mile accumulation alone becomes impractical for rare fatality-level claims. Use scenario-based, accelerated, simulation, closed-course, and fleet evidence together.
Non-Zero Defect Claims
For defect rates with observed failures, report an exact binomial confidence interval or a beta-binomial hierarchical model when data are clustered by airport, route, or weather. Use one-sided upper bounds for safety release decisions.
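A sketch of the one-sided exact (Clopper-Pearson) upper bound, assuming scipy; a beta-binomial hierarchical model would replace this when clustering by airport, route, or weather matters. The example counts are illustrative:

```python
from scipy.stats import beta


def defect_rate_upper_bound(failures: int, n_units: int, conf: float = 0.95) -> float:
    """One-sided upper Clopper-Pearson bound on a defect rate."""
    if failures == n_units:
        return 1.0
    return beta.ppf(conf, failures + 1, n_units - failures)


# Example: 2 ghost defects observed across 1,500 independent map tiles.
ub = defect_rate_upper_bound(2, 1500)
print(f"95% upper bound on defect rate: {ub:.5f}")
```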
Quantile Claims
For p95/p99 ATE, RPE, relocalization time, and residual distributions:
- Use bootstrap confidence intervals over independent segments or traverses, not frames; a cluster-bootstrap sketch follows this list.
- Preserve scenario grouping during bootstrap resampling.
- Report the worst ODD slice even when aggregate metrics pass.
- Use non-parametric intervals unless a distributional model is justified and checked.
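A sketch of the cluster bootstrap referenced above, assuming NumPy: scenario families (or sessions) are resampled whole so within-cluster correlation is preserved. This uses a simple percentile upper bound; a BCa or studentized interval may be preferable in practice. The data are illustrative:

```python
import numpy as np


def cluster_bootstrap_p95_ucb(values_by_cluster, conf=0.95, n_boot=10_000, seed=0):
    """Upper confidence bound on the p95 of per-segment errors.

    values_by_cluster: list of 1-D arrays, one array of segment-level values
    (e.g. ATE per segment) per scenario family or session. Whole clusters are
    resampled with replacement to respect correlation within them.
    """
    rng = np.random.default_rng(seed)
    clusters = [np.asarray(c, dtype=float) for c in values_by_cluster]
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, len(clusters), size=len(clusters))
        sample = np.concatenate([clusters[i] for i in idx])
        stats[b] = np.percentile(sample, 95)
    return float(np.quantile(stats, conf))


# Example: three scenario families with per-segment ATE in metres.
ate = [np.array([0.05, 0.08, 0.07]), np.array([0.12, 0.15]),
       np.array([0.06, 0.09, 0.18])]
print(f"p95 ATE upper bound: {cluster_bootstrap_p95_ucb(ate):.3f} m")
```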
Regression Claims
For candidate vs production:
- Use paired comparisons when both stacks run on the same logs.
- Define a non-inferiority margin before testing; a paired-test sketch follows this list.
- Block release if the candidate improves aggregate performance while degrading any critical ODD slice beyond the margin.
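A sketch of a paired non-inferiority check, assuming NumPy and scipy and a pre-set margin. This instantiation uses a one-sided paired t-style bound on the mean degradation; a cluster bootstrap is a reasonable substitute when normality of the paired deltas is doubtful. The data and margin are illustrative:

```python
import numpy as np
from scipy.stats import t


def non_inferior(candidate, production, margin, conf=0.95):
    """Paired non-inferiority test on per-traverse metrics where lower is
    better (e.g. ATE); delta > 0 means the candidate is worse. Passes only
    if the one-sided upper bound on the mean delta stays below the margin."""
    delta = np.asarray(candidate, float) - np.asarray(production, float)
    n = delta.size
    mean, se = delta.mean(), delta.std(ddof=1) / np.sqrt(n)
    upper = mean + t.ppf(conf, df=n - 1) * se
    return upper < margin, upper


# Example: per-traverse ATE (m) for both stacks on the same logs, margin 0.02 m.
ok, ub = non_inferior([0.11, 0.09, 0.12, 0.10, 0.13],
                      [0.10, 0.09, 0.11, 0.10, 0.12], margin=0.02)
print(f"upper bound on degradation: {ub:.4f} m, non-inferior: {ok}")
```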
Data Partitioning
| Partition | Purpose | Rules |
|---|---|---|
| Development | Tuning, debugging, ablation | May be reused; never used for release claims |
| Calibration | Temperature scaling, uncertainty thresholds, conformal quantiles | Frozen before final evaluation |
| Validation | Model and map selection | Can choose between candidates; cannot serve as final release evidence |
| Locked test | Release claim | Read-only; access logged; no tuning after inspection |
| Shadow-mode watch | Operational confirmation | Used for post-release monitoring and future test-set design |
No sequence may leak into more than one partition through near-duplicate routes, overlapping time windows, or shared map tiles. For repeated airport routes, split by day, session, or map version where possible, not by random frames.
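A sketch of deterministic, session-level partition assignment, assuming each log is keyed by an airport/day/map-version tuple: hashing the whole key keeps every frame from one session inside a single partition. The partition fractions and key fields are illustrative assumptions:

```python
import hashlib

# Illustrative partition fractions; must sum to 1.0.
PARTITIONS = [("development", 0.50), ("calibration", 0.10),
              ("validation", 0.15), ("locked_test", 0.25)]


def assign_partition(airport: str, day: str, map_version: str) -> str:
    """Deterministically map a whole session to one partition so no frames
    from the same day/route/map version straddle a split boundary."""
    key = f"{airport}|{day}|{map_version}".encode()
    u = int.from_bytes(hashlib.sha256(key).digest()[:8], "big") / 2**64
    cum = 0.0
    for name, frac in PARTITIONS:
        cum += frac
        if u < cum:
            return name
    return PARTITIONS[-1][0]


print(assign_partition("EDDF", "2026-04-12", "map_v31"))
```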
Sequential Testing and Stopping
Validation teams may use sequential release testing only if the stopping rule is pre-registered:
| Situation | Allowed stopping rule |
|---|---|
| Critical defect observed | Stop immediately, block release, open safety defect |
| Sufficient evidence accumulated | Stop only when pre-defined confidence/coverage criteria are met |
| Time or budget exhausted | Report inconclusive; do not convert to pass |
| Candidate underperforms production | Stop for futility if pre-defined non-inferiority cannot be achieved |
Repeated looks at the data require alpha spending, Bayesian monitoring with pre-specified priors, or a fixed holdout untouched until final analysis.
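A sketch of the Bayesian monitoring option, assuming scipy: after each batch, the posterior probability that the defect rate is below p* is checked against a pre-registered decision threshold. The Jeffreys Beta(0.5, 0.5) prior and the batch counts are illustrative choices; the prior must be frozen before the campaign starts:

```python
from scipy.stats import beta


def posterior_prob_rate_below(p_star, failures, n_units, prior_a=0.5, prior_b=0.5):
    """P(rate < p_star | data) under a pre-specified Beta(prior_a, prior_b)
    prior with a binomial likelihood over independent units."""
    return beta.cdf(p_star, prior_a + failures, prior_b + n_units - failures)


# Sequential look after a batch of segments; stopping threshold pre-registered.
prob = posterior_prob_rate_below(p_star=0.001, failures=0, n_units=1800)
decision = "stop: evidence sufficient" if prob >= 0.95 else "continue"
print(f"P(rate < 0.001) = {prob:.3f} -> {decision}")
```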
Scenario Coverage
Use ISO 34502-style functional, logical, and concrete scenario decomposition. For this airside domain, the statistical report must include:
- Coverage matrix from airside scenario taxonomy to benchmark logs.
- Route and map-tile coverage from SLAM map benchmark protocol.
- Corruption severity coverage from perception-SLAM corruption and fault injection protocol.
- Fleet exposure coverage from perception-SLAM fleet data contract.
Bias and Validity Controls
| Risk | Control |
|---|---|
| Test-set contamination | Locked manifests, hash-based duplicate detection, access logging |
| Correlated samples inflated as independent | Cluster by segment/traverse/tile/airport-day |
| Weather/lighting under-sampled | Stratified minimums and inconclusive status for missing ODD slices |
| Overfitting to public benchmarks | Public benchmarks used only for external comparability, never as release evidence |
| Simulation over-trusted | Simulation results discounted unless sim-to-real gap is measured |
| Human labels inconsistent | Inter-annotator agreement and adjudication for critical labels |
| Map survey error | Survey uncertainty propagated into ATE/map-layer thresholds |
| Survivorship bias in fleet data | Include failures, aborted missions, upload failures, and quarantined bags |
Release Decision Template
Each statistical report must include:
- Candidate build, map, calibration, and data manifest IDs.
- Pre-registered claims and thresholds.
- Partitions and leakage checks.
- Sample counts by independent unit and ODD slice.
- Metric tables with confidence intervals.
- Failed, inconclusive, or waived slices.
- Comparison to production baseline.
- Safety defect references and disposition.
- Final recommendation: pass, pass with ODD restriction, inconclusive, or block.
Owner Handoffs
| Owner | Responsibility |
|---|---|
| V&V statistician | Pre-registration, sample size, intervals, final statistical decision |
| Perception/SLAM owner | Metric implementation, baseline comparison, failure triage |
| Mapping owner | Map tile sample frame, survey uncertainty, map-change status |
| Data platform owner | Dataset manifests, partition enforcement, duplicate detection |
| Safety lead | Criticality thresholds, release interpretation, waiver control |
| Fleet operations | Shadow-mode exposure and event completeness |
Sources
- ISO 34502:2022, scenario-based safety evaluation framework: https://www.iso.org/standard/78951.html
- ISO 21448:2022, Road vehicles - Safety of the intended functionality: https://www.iso.org/standard/77490.html
- RAND, "Driving to Safety": https://www.rand.org/content/dam/rand/pubs/research_reports/RR1400/RR1478/RAND_RR1478.pdf
- Waymo Safety Methodologies and Safety Readiness Determinations: https://arxiv.org/abs/2011.00054
- Waymo Safety Impact Hub methodology context: https://waymo.com/safety/impact/
- NHTSA Standing General Order on Crash Reporting: https://www.nhtsa.gov/laws-regulations/standing-general-order-crash-reporting
- Guo et al., "On Calibration of Modern Neural Networks": https://arxiv.org/abs/1706.04599
- Angelopoulos and Bates, "A Gentle Introduction to Conformal Prediction and Distribution-Free Uncertainty Quantification": https://arxiv.org/abs/2107.07511
- Lei et al., "Distribution-Free Predictive Inference for Regression": https://arxiv.org/abs/1604.04173