Perception-SLAM Statistical Validity Protocol

Last updated: 2026-05-09

Purpose

This protocol defines how to make defensible statistical claims about perception-SLAM and map reliability for airside autonomous ground vehicles. It prevents common validation failures: aggregate-only pass rates, reused test sets, silent cherry-picking, correlated traversals treated as independent samples, and release decisions based on mileage alone.

It supports the top-level perception-SLAM map reliability evidence case, the SLAM map benchmark protocol, the uncertainty calibration release gates, and the perception-SLAM fleet data contract.

Statistical Claims

Every release claim must be written before the campaign starts:

| Claim type | Example claim | Required statistical form |
| --- | --- | --- |
| Reliability | "False-free-space defects are below the release threshold in protected apron zones" | One-sided upper confidence bound or Bayesian credible upper bound |
| Performance | "ATE p95 is below 0.20 m on service-road routes" | Quantile estimate with confidence interval by ODD slice |
| Robustness | "Severe rain corruption causes no more than 15 percent degradation in localization availability" | Paired comparison against clean baseline |
| Calibration | "90 percent pose uncertainty sets contain the true pose at least 88 percent of the time" | Coverage interval and binomial/hierarchical test |
| Regression | "Release candidate is not worse than production by more than delta" | Non-inferiority test with pre-set margin |

Units of Analysis

Do not count every frame as an independent sample. Perception and SLAM errors are temporally and spatially correlated.

| Unit | Use for | Independence rule |
| --- | --- | --- |
| Frame | Object-level perception, calibration bins, point-cloud quality | Use cluster-robust intervals grouped by scenario/session |
| Segment | 10-60 second slice around an event or route feature | Preferred unit for detection, relocalization, corruption tests |
| Traverse | Full route, stand approach, or depot mission | Preferred unit for map/localization reliability |
| Map tile | Map publication and quarantine decisions | Tile is independent only if source traversals and route context differ |
| Scenario family | ISO 34502-style functional/logical scenario | Used for coverage and safety argument completeness |
| Airport-day | Fleet post-release monitoring | Used for drift, weather, and operational exposure |

Stratification Plan

At minimum, all headline metrics are reported by:

| Dimension | Required bins |
| --- | --- |
| Zone | stand, apron service road, depot, taxiway crossing support, maintenance area |
| Lighting | day, dusk/dawn, night, glare |
| Weather/surface | dry, wet, heavy rain, fog/low visibility if in ODD, snow/ice if in ODD |
| Traffic density | quiet, normal turnaround, congested |
| Aircraft state | absent, parked, servicing, pushback/taxi adjacency |
| Sensor health | nominal, reduced point density, camera degraded, GNSS degraded, time-sync warning |
| Map age | newly surveyed, less than 7 days, 7-30 days, more than 30 days, changed since last survey |
| Release phase | offline benchmark, closed-course, shadow mode, limited autonomous, production watch |

Aggregate release decisions are invalid if any safety-critical slice fails or lacks enough exposure for the intended ODD.
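The slice-gating rule above can be expressed as a small decision helper. This is a sketch: the slice fields, names, and exposure numbers are illustrative, not part of the protocol.

```python
def release_status(slices, min_exposure):
    """Gate an aggregate release decision on per-slice results: any failed
    safety-critical slice blocks the release, and any under-exposed slice
    makes the campaign inconclusive rather than a pass."""
    for s in slices:
        if s["critical"] and not s["passed"]:
            return "block"
    if any(s["exposure"] < min_exposure for s in slices):
        return "inconclusive"
    return "pass"

# Illustrative slice table; exposure is in independent segments.
slices = [
    {"name": "stand/night/wet", "critical": True, "passed": True, "exposure": 420},
    {"name": "apron/day/dry", "critical": False, "passed": True, "exposure": 1800},
    {"name": "taxiway-crossing/glare", "critical": True, "passed": True, "exposure": 35},
]
print(release_status(slices, min_exposure=100))  # "inconclusive": glare slice under-exposed
```

Note that an aggregate pass rate over these slices would look healthy; the gate still refuses to convert missing glare exposure into a pass.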

Metrics and Decision Rules

| Metric | Primary decision rule | Minimum reporting |
| --- | --- | --- |
| False-free-space critical defects | Zero observed in release-critical protected zones; if any observed, release blocked until fixed and re-tested | Count, exposure denominator, root cause, residual risk |
| ATE / RPE | p95 and p99 upper confidence bounds below route-specific thresholds | Median, p95, p99, CI, worst segment |
| Localization availability | Lower confidence bound above threshold by ODD slice | Availability, dropouts/hour, longest outage |
| Relocalization latency | p95 upper confidence bound below allowed time/distance budget | Latency distribution and failed recoveries |
| Scan-to-map residual | Candidate distribution non-inferior to production | Paired delta, drift by map age |
| Ghost rate | Upper bound below map publication threshold | Ghosts per 100 m or per stand/tile |
| Static preservation | Lower bound above threshold | Lost-feature categories and impact |
| Calibration coverage | Empirical coverage interval includes nominal target within tolerance | Reliability diagram by risk bin |
| Robustness degradation | Paired performance drop below corruption-specific limit | Clean vs corrupt paired table |
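One way to operationalize the calibration-coverage gate is to compare a one-sided Wilson score lower bound on the empirical coverage against the tolerance floor (88 percent for a nominal 90 percent set, per the claims table). A stdlib sketch; the hit and segment counts are illustrative:

```python
import math

def wilson_lower_bound(hits: int, n: int, z: float = 1.645) -> float:
    """One-sided Wilson score lower bound on a coverage proportion
    at ~95 percent confidence (z = 1.645)."""
    p = hits / n
    denom = 1 + z * z / n
    centre = p + z * z / (2 * n)
    spread = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n))
    return (centre - spread) / denom

# Illustrative: 90 percent pose sets contained the true pose in 1,832 of 2,000 segments.
lb = wilson_lower_bound(1832, 2000)
print(round(lb, 4), lb >= 0.88)  # gate: lower bound must clear the 0.88 floor
```

The hierarchical variant (segments clustered by airport or weather) needs a mixed model and is out of scope for this sketch.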

Sample Size Rules

Zero-Failure Claims

When claiming a rare-event upper bound with zero observed failures, use the rule:

N >= ln(alpha) / ln(1 - p*)

Where p* is the maximum acceptable event probability per independent unit and 1 - alpha is the confidence level. Example: to show fewer than 1 critical false-free-space defect per 1,000 independent protected-zone segments at 95 percent confidence with zero observed failures, N >= ln(0.05) / ln(0.999) ≈ 2,994.2, so at least 2,995 independent segments are required.
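The rule can be computed directly; a minimal sketch (function name is ours):

```python
import math

def zero_failure_sample_size(p_star: float, alpha: float) -> int:
    """Minimum number of independent units, with zero observed failures,
    needed to claim the event rate is below p_star at 1 - alpha confidence:
    N >= ln(alpha) / ln(1 - p_star), rounded up."""
    return math.ceil(math.log(alpha) / math.log(1.0 - p_star))

# Example from the protocol: p* = 1/1000, 95 percent confidence.
print(zero_failure_sample_size(0.001, 0.05))  # 2995
```

The denominator makes the count explode as p* shrinks: at p* = 1/100,000 the same claim needs about 300,000 independent segments, which is why the protocol leans on scenario-based and accelerated evidence below.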

This does not prove system safety by itself. RAND's "Driving to Safety" shows why road-mile accumulation alone becomes impractical for rare fatality-level claims. Use scenario-based, accelerated, simulation, closed-course, and fleet evidence together.

Non-Zero Defect Claims

For defect rates with observed failures, report an exact binomial confidence interval or a beta-binomial hierarchical model when data are clustered by airport, route, or weather. Use one-sided upper bounds for safety release decisions.
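A stdlib sketch of the exact (Clopper-Pearson) one-sided upper bound, found by bisection on the binomial CDF; the beta-binomial hierarchical variant for clustered data needs more machinery than fits here. The failure and segment counts are illustrative.

```python
import math

def binom_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k + 1))

def exact_upper_bound(k: int, n: int, alpha: float = 0.05) -> float:
    """One-sided Clopper-Pearson upper bound on a defect rate given
    k failures in n independent units: the smallest p whose binomial
    CDF at k drops to alpha, located by bisection."""
    lo, hi = k / n, 1.0
    for _ in range(60):  # interval width shrinks to ~1e-18
        mid = (lo + hi) / 2
        if binom_cdf(k, n, mid) > alpha:
            lo = mid
        else:
            hi = mid
    return hi

# Illustrative: 3 defects in 2,000 independent segments at 95 percent confidence.
print(exact_upper_bound(3, 2000))
```

With k = 0 this reproduces the zero-failure rule: the bound falls just below 1/1,000 at n = 2,995.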

Quantile Claims

For p95/p99 ATE, RPE, relocalization time, and residual distributions:

  • Use bootstrap confidence intervals over independent segments or traverses, not frames.
  • Preserve scenario grouping during bootstrap resampling.
  • Report the worst ODD slice even when aggregate metrics pass.
  • Use non-parametric intervals unless a distributional model is justified and checked.
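The clustered bootstrap above can be sketched as follows. This minimal version resamples whole segments only; preserving scenario strata during resampling is a straightforward extension. Segment values are toy per-frame ATE numbers, not real data.

```python
import random

def percentile(values, q):
    """Nearest-rank percentile (q in [0, 100]) of a list."""
    s = sorted(values)
    return s[min(len(s) - 1, max(0, round(q / 100 * (len(s) - 1))))]

def cluster_bootstrap_p95(segments, n_boot=2000, alpha=0.05, seed=7):
    """Bootstrap CI for p95: resample whole segments (the independent
    unit), not frames, so within-segment correlation survives intact."""
    rng = random.Random(seed)
    stats = []
    for _ in range(n_boot):
        resampled = [rng.choice(segments) for _ in segments]
        pooled = [v for seg in resampled for v in seg]
        stats.append(percentile(pooled, 95))
    return (percentile(stats, 100 * alpha / 2),
            percentile(stats, 100 * (1 - alpha / 2)))

# Toy example: 8 segments of per-frame ATE values in metres.
segments = [[0.05, 0.06, 0.07], [0.04, 0.05], [0.12, 0.15, 0.11],
            [0.06, 0.07], [0.05, 0.04, 0.06], [0.09, 0.10],
            [0.07, 0.08, 0.06], [0.05, 0.06]]
lo, hi = cluster_bootstrap_p95(segments)
print(f"p95 ATE 95% CI: [{lo:.3f}, {hi:.3f}]")
```

Resampling frames instead of segments would shrink this interval artificially, which is exactly the correlated-sample failure the protocol forbids.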

Regression Claims

For candidate vs production:

  • Use paired comparisons when both stacks run on the same logs.
  • Define a non-inferiority margin before testing.
  • Block release if the candidate improves aggregate performance while degrading any critical ODD slice beyond the margin.
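The paired non-inferiority check can be sketched under a normal approximation; the metric values are toy numbers, and for small samples a t critical value should replace the 1.645 constant.

```python
import math

def non_inferiority_check(prod, cand, margin):
    """One-sided non-inferiority test on paired per-segment metrics where
    higher is worse (e.g. ATE). The candidate passes if the one-sided 95
    percent upper bound on mean(cand - prod) stays below the margin."""
    deltas = [c - p for p, c in zip(prod, cand)]
    n = len(deltas)
    mean = sum(deltas) / n
    var = sum((d - mean) ** 2 for d in deltas) / (n - 1)
    upper = mean + 1.645 * math.sqrt(var / n)  # normal approx; use t for small n
    return upper < margin, mean, upper

# Same logs replayed through production and candidate stacks (toy numbers).
prod = [0.08, 0.09, 0.07, 0.10, 0.08, 0.09, 0.11, 0.08]
cand = [0.08, 0.08, 0.08, 0.09, 0.09, 0.08, 0.10, 0.08]
ok, mean_delta, upper = non_inferiority_check(prod, cand, margin=0.02)
print(ok, round(mean_delta, 4), round(upper, 4))
```

In practice this check runs once per critical ODD slice, not just on the aggregate, so an aggregate win cannot mask a slice-level regression.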

Data Partitioning

| Partition | Purpose | Rules |
| --- | --- | --- |
| Development | Tuning, debugging, ablation | May be reused; never used for release claims |
| Calibration | Temperature scaling, uncertainty thresholds, conformal quantiles | Frozen before final evaluation |
| Validation | Model and map selection | Can choose between candidates; cannot serve as final release evidence |
| Locked test | Release claim | Read-only; access logged; no tuning after inspection |
| Shadow-mode watch | Operational confirmation | Used for post-release monitoring and future test-set design |

No sequence may appear in more than one partition through near-duplicate route, time, or map-tile leakage. For repeated airport routes, split by day/session/map version where possible, not random frames.
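One way to enforce session-level splitting is to assign each recording session to a partition deterministically by hashing its ID, so frames from one session (and their near-duplicates) can never straddle a split. A sketch; the session-ID format and ratios are illustrative.

```python
import hashlib

def assign_partition(session_id: str,
                     ratios=(("dev", 0.6), ("val", 0.2), ("test", 0.2))):
    """Deterministically map a whole recording session to one partition
    by hashing its ID into a pseudo-uniform value in [0, 1)."""
    h = int(hashlib.sha256(session_id.encode()).hexdigest(), 16)
    u = (h % 10**8) / 10**8
    cum = 0.0
    for name, ratio in ratios:
        cum += ratio
        if u < cum:
            return name
    return ratios[-1][0]

# Illustrative session IDs: route / day / run.
sessions = [f"apron-route-3/2026-04-{d:02d}/run-{r}"
            for d in range(1, 11) for r in range(3)]
counts = {}
for s in sessions:
    part = assign_partition(s)
    counts[part] = counts.get(part, 0) + 1
print(counts)
```

Because assignment depends only on the ID, re-running the pipeline never reshuffles sessions between partitions, which keeps the locked test set stable across releases.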

Sequential Testing and Stopping

Validation teams may use sequential release testing only if the stopping rule is pre-registered:

| Situation | Allowed stopping rule |
| --- | --- |
| Critical defect observed | Stop immediately, block release, open safety defect |
| Sufficient evidence accumulated | Stop only when pre-defined confidence/coverage criteria are met |
| Time or budget exhausted | Report inconclusive; do not convert to pass |
| Candidate underperforms production | Stop for futility if pre-defined non-inferiority cannot be achieved |

Repeated looks at the data require alpha spending, Bayesian monitoring with pre-specified priors, or a fixed holdout untouched until final analysis.
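The simplest alpha-spending scheme that can be pre-registered is Bonferroni-style equal spending, shown below as a sketch; Pocock or O'Brien-Fleming boundaries spend the same budget less conservatively and are usually preferred when many interim looks are planned.

```python
def equal_alpha_spending(total_alpha: float, n_looks: int):
    """Conservative (Bonferroni-style) alpha spending: divide the overall
    type-I error budget equally across the pre-registered interim looks.
    Each interim test is then run at the per-look level."""
    return [total_alpha / n_looks] * n_looks

looks = equal_alpha_spending(0.05, 5)
print(looks)  # each of the 5 pre-registered looks runs at alpha = 0.01
```

The point is not the arithmetic but the discipline: the number of looks and the per-look level are fixed before the first interim analysis, so repeated peeks cannot silently inflate the pass rate.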

Scenario Coverage

Use ISO 34502-style functional, logical, and concrete scenario decomposition. For this airside domain, the statistical report must include:

Bias and Validity Controls

| Risk | Control |
| --- | --- |
| Test-set contamination | Locked manifests, hash-based duplicate detection, access logging |
| Correlated samples inflated as independent | Cluster by segment/traverse/tile/airport-day |
| Weather/lighting under-sampled | Stratified minimums and inconclusive status for missing ODD slices |
| Overfitting to public benchmarks | Public benchmark used only as external comparability, not release proof |
| Simulation over-trusted | Simulation results discounted unless sim-to-real gap is measured |
| Human labels inconsistent | Inter-annotator agreement and adjudication for critical labels |
| Map survey error | Survey uncertainty propagated into ATE/map-layer thresholds |
| Survivorship bias in fleet data | Include failures, aborted missions, upload failures, and quarantined bags |
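The hash-based duplicate check can be sketched as a manifest scan for byte-identical content appearing in more than one partition; detecting near-duplicates (e.g. perceptual hashing of re-recorded routes) needs more machinery than this. File names and contents are illustrative.

```python
import hashlib
from collections import defaultdict

def find_duplicates(manifests):
    """Flag content hashes that appear in more than one partition.
    `manifests` maps partition name -> list of (path, file_bytes) pairs."""
    seen = defaultdict(set)  # content hash -> partitions containing it
    for partition, files in manifests.items():
        for path, blob in files:
            digest = hashlib.sha256(blob).hexdigest()
            seen[digest].add(partition)
    return [d for d, parts in seen.items() if len(parts) > 1]

manifests = {
    "dev":  [("a.bag", b"scan-001"), ("b.bag", b"scan-002")],
    "test": [("c.bag", b"scan-002")],  # same bytes as b.bag -> leakage
}
print(len(find_duplicates(manifests)))  # 1 contaminated content hash
```

Run over the full dataset manifests, a non-empty result blocks the release claim until the contaminated sequences are re-partitioned.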

Release Decision Template

Each statistical report must include:

  1. Candidate build, map, calibration, and data manifest IDs.
  2. Pre-registered claims and thresholds.
  3. Partitions and leakage checks.
  4. Sample counts by independent unit and ODD slice.
  5. Metric tables with confidence intervals.
  6. Failed, inconclusive, or waived slices.
  7. Comparison to production baseline.
  8. Safety defect references and disposition.
  9. Final recommendation: pass, pass with ODD restriction, inconclusive, or block.

Owner Handoffs

| Owner | Responsibility |
| --- | --- |
| V&V statistician | Pre-registration, sample size, intervals, final statistical decision |
| Perception/SLAM owner | Metric implementation, baseline comparison, failure triage |
| Mapping owner | Map tile sample frame, survey uncertainty, map-change status |
| Data platform owner | Dataset manifests, partition enforcement, duplicate detection |
| Safety lead | Criticality thresholds, release interpretation, waiver control |
| Fleet operations | Shadow-mode exposure and event completeness |

Sources

Research notes collected from public sources.