Perception-SLAM Alert Runbooks

Last updated: 2026-05-09

Purpose

This page defines fleet runbooks for perception, SLAM/localization, map, calibration, timing, model-runtime, and compatibility alerts. Alerts are operational controls only when they route to an owner, trigger a vehicle or fleet action, preserve evidence, and close with a disposition.

Severity Model

| Severity | Meaning | Response target | Examples |
| --- | --- | --- | --- |
| P0 | Immediate safety risk or vehicle action required | Vehicle action immediately; fleet incident within operational SLA | Free-space red, pose integrity red, wrong map active, calibration red while moving |
| P1 | Safety margin degraded or release/canary at risk | Triage during shift; pause promotion | Timing yellow/red, repeated TF failures, unknown-object cluster |
| P2 | Reliability issue with owner action | Ticket within normal operations | Intermittent diagnostics stale, map disagreement watch |
| P3 | Evidence or dashboard hygiene | Backlog with expiry | Missing optional field, dashboard panel regression |
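
The severity levels can be encoded so that tooling compares them consistently. The sketch below is illustrative only: the `Severity` enum, the `RESPONSE_TARGET` mapping, and the `promotion_allowed` gate are hypothetical names, and the rule that any open P0 or P1 holds promotion is an assumption drawn from the "pause promotion" and "block release/canary" wording on this page.

```python
from enum import IntEnum
from typing import Iterable


class Severity(IntEnum):
    """Lower value means more urgent, matching the P0..P3 labels."""
    P0 = 0
    P1 = 1
    P2 = 2
    P3 = 3


# Response targets copied from the severity table.
RESPONSE_TARGET = {
    Severity.P0: "Vehicle action immediately; fleet incident within operational SLA",
    Severity.P1: "Triage during shift; pause promotion",
    Severity.P2: "Ticket within normal operations",
    Severity.P3: "Backlog with expiry",
}


def promotion_allowed(open_alerts: Iterable[Severity]) -> bool:
    """Assumed gate: any open P0 or P1 alert holds release/canary promotion."""
    return all(sev > Severity.P1 for sev in open_alerts)
```

With this encoding, `promotion_allowed([Severity.P2, Severity.P3])` is true, while adding a single `Severity.P1` to the list makes it false.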

Alert Schema

| Field | Requirement |
| --- | --- |
| alert.id | Stable ID for deduplication and evidence |
| alert.severity | P0/P1/P2/P3 plus safety/event class |
| vehicle.id, site.id, route.id | Required for all vehicle alerts |
| manifest.* | Build, model, map, calibration, config, diagnostics graph |
| time.window | Start, detect, acknowledge, vehicle-action, clear |
| state.before_after | Green/yellow/red/unknown transitions |
| operator.action | Stop, route hold, remote assist, suppress with owner, no action |
| evidence.links | Bag/MCAP, trace, dashboard, incident, ticket |
| schema.url | Telemetry schema used to interpret custom fields |
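
As a minimal sketch of how the schema might be enforced at ingest, the following models an alert record and reports missing or invalid fields. The attribute names are Python-friendly renderings of the fields above, and the manifest keys, time-window keys, and the rule that only `detect` is mandatory on an open alert are assumptions, not part of the schema itself.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

SEVERITIES = {"P0", "P1", "P2", "P3"}
STATES = {"green", "yellow", "red", "unknown"}
# Hypothetical key names for manifest.* and time.window entries.
MANIFEST_KEYS = ("build", "model", "map", "calibration", "config", "diagnostics_graph")
TIME_KEYS = ("start", "detect", "acknowledge", "vehicle_action", "clear")


@dataclass
class Alert:
    alert_id: str
    severity: str
    event_class: str
    schema_url: str
    vehicle_id: Optional[str] = None
    site_id: Optional[str] = None
    route_id: Optional[str] = None
    manifest: Dict[str, str] = field(default_factory=dict)
    time_window: Dict[str, float] = field(default_factory=dict)
    state_before: str = "unknown"
    state_after: str = "unknown"
    operator_action: str = "no action"
    evidence_links: List[str] = field(default_factory=list)


def validate(alert: Alert, vehicle_alert: bool = True) -> List[str]:
    """Return a list of schema problems; an empty list means the record passes."""
    problems: List[str] = []
    if not alert.alert_id:
        problems.append("alert.id missing")
    if alert.severity not in SEVERITIES:
        problems.append("alert.severity invalid")
    if vehicle_alert:
        for attr, name in (("vehicle_id", "vehicle.id"),
                           ("site_id", "site.id"),
                           ("route_id", "route.id")):
            if not getattr(alert, attr):
                problems.append(f"{name} missing")
    for key in MANIFEST_KEYS:
        if key not in alert.manifest:
            problems.append(f"manifest.{key} missing")
    for key in alert.time_window:
        if key not in TIME_KEYS:
            problems.append(f"time.window.{key} unknown")
    if "detect" not in alert.time_window:  # assumed minimum for an open alert
        problems.append("time.window.detect missing")
    for state in (alert.state_before, alert.state_after):
        if state not in STATES:
            problems.append(f"state {state!r} invalid")
    if not alert.schema_url:
        problems.append("schema.url missing")
    return problems
```

Returning a list of problems rather than raising on the first one lets a single report show everything an alert producer needs to fix.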

Runbook Matrix

| Alert | Trigger pattern | Vehicle action | Fleet action | Owner |
| --- | --- | --- | --- | --- |
| Free-space red | Traversable cell lacks current observation or conflicts with protected-zone obstacle | Controlled stop unless validated crawl mode applies | Open P0, preserve logs, block release/canary | Runtime assurance |
| Unknown/OOD object cluster | Unknown object/OOD occupancy overlaps route corridor or protected zone repeatedly | Treat as obstacle, slow/stop/remote assist | Label review, ontology/model review, route watch | Perception owner |
| Pose integrity red | Protection level/covariance/residual crosses hard threshold | Controlled stop or remote-assist handoff | Block map/release evidence, inspect map/calibration/timing | Localization owner |
| Timing red | PTP unlock, stamp age, skew, TF future/past failure beyond threshold | Reject stale data; degrade or stop | Timing incident, exclude logs from map/release evidence | Runtime platform |
| Calibration red | Sensor pair residual/time offset exceeds hard threshold | Remove affected modality or stop | Maintenance ticket, recalibration required | Calibration owner |
| Map mismatch | Active map differs from dispatch expectation or overlay expired | Stop dispatch or force approved reload | Quarantine route/map bundle | Mapping owner |
| Model runtime red | Engine mismatch, deserialization failure, p99 latency, GPU OOM | Keep previous artifact or stop affected perception path | Abort rollout; rebuild or rollback | ML runtime owner |
| Diagnostics graph unknown | Critical diagnostic node missing/stale/unlatched | Treat dependent function as unknown | Repair producer/config; evidence invalid until fixed | Fleet SRE |
| OTA compatibility failure | Candidate artifact set fails matrix | Do not activate | Stop rollout; update eligibility or manifest | Release manager |
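
One way to make the matrix machine-readable is a lookup table keyed by alert name. The sketch below is illustrative: the snake_case keys, the `Runbook` tuple, and the `route` helper are hypothetical, and raising on an unknown alert is an assumed design choice reflecting the rule that an alert without an owner is not an operational control.

```python
from typing import Dict, NamedTuple


class Runbook(NamedTuple):
    vehicle_action: str
    fleet_action: str
    owner: str


# Rows copied from the runbook matrix; the dictionary keys are hypothetical.
RUNBOOKS: Dict[str, Runbook] = {
    "free_space_red": Runbook(
        "Controlled stop unless validated crawl mode applies",
        "Open P0, preserve logs, block release/canary", "Runtime assurance"),
    "unknown_ood_cluster": Runbook(
        "Treat as obstacle, slow/stop/remote assist",
        "Label review, ontology/model review, route watch", "Perception owner"),
    "pose_integrity_red": Runbook(
        "Controlled stop or remote-assist handoff",
        "Block map/release evidence, inspect map/calibration/timing",
        "Localization owner"),
    "timing_red": Runbook(
        "Reject stale data; degrade or stop",
        "Timing incident, exclude logs from map/release evidence",
        "Runtime platform"),
    "calibration_red": Runbook(
        "Remove affected modality or stop",
        "Maintenance ticket, recalibration required", "Calibration owner"),
    "map_mismatch": Runbook(
        "Stop dispatch or force approved reload",
        "Quarantine route/map bundle", "Mapping owner"),
    "model_runtime_red": Runbook(
        "Keep previous artifact or stop affected perception path",
        "Abort rollout; rebuild or rollback", "ML runtime owner"),
    "diagnostics_graph_unknown": Runbook(
        "Treat dependent function as unknown",
        "Repair producer/config; evidence invalid until fixed", "Fleet SRE"),
    "ota_compatibility_failure": Runbook(
        "Do not activate",
        "Stop rollout; update eligibility or manifest", "Release manager"),
}


def route(alert_key: str) -> Runbook:
    """Return the runbook row for an alert, failing loudly if none is defined."""
    try:
        return RUNBOOKS[alert_key]
    except KeyError:
        raise LookupError(f"no runbook or owner defined for alert {alert_key!r}") from None
```

For example, `route("timing_red").owner` yields `"Runtime platform"`, and an alert name missing from the table surfaces as a `LookupError` rather than being silently dropped.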

P0 Free-Space Red

  1. Confirm vehicle state: autonomous, speed, route segment, active map/calibration/model IDs.
  2. Verify vehicle executed the expected stop/degrade action.
  3. Preserve raw sensor logs, occupancy/free-space output, pose, map tile, calibration state, diagnostics graph, and operator event.
  4. Check for timing red, calibration red, pose red, map mismatch, and ODD/adverse-condition alert in the same window.
  5. Block canary or route expansion until root cause and replay evidence are attached.
  6. Close only with a safety-lead disposition: defect fixed, ODD restricted, threshold changed through release process, or false alarm accepted with rationale.
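
Step 4 above amounts to a time-window join across alert streams. A minimal sketch follows, assuming events carry a kind, a vehicle ID, and a timestamp on a shared clock; the `Event` type, the kind names, and the 30-second default window are hypothetical.

```python
from typing import Iterable, List, NamedTuple

# Alert kinds that step 4 asks the responder to look for (names are hypothetical).
RELATED_KINDS = {
    "timing_red", "calibration_red", "pose_red",
    "map_mismatch", "odd_adverse_condition",
}


class Event(NamedTuple):
    kind: str
    vehicle_id: str
    t: float  # seconds on a clock shared by all events


def related_in_window(anchor: Event, events: Iterable[Event],
                      window_s: float = 30.0) -> List[Event]:
    """Return related alerts on the same vehicle within +/- window_s of the anchor."""
    hits = [
        e for e in events
        if e.vehicle_id == anchor.vehicle_id
        and e.kind in RELATED_KINDS
        and abs(e.t - anchor.t) <= window_s
    ]
    return sorted(hits, key=lambda e: e.t)
```

The window is symmetric because a contributing timing or calibration fault may be reported either before or after the free-space alert itself.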

P0 Pose Integrity Red

  1. Confirm pose output was not consumed after red state.
  2. Compare scan-match residual, covariance/protection level, relocalization state, map tile, and timing health.
  3. Preserve route segment and map tile evidence.
  4. Hold affected route if alert clusters by tile or feature-poor zone.
  5. Require replay plus closed-course or site validation before clearing a systemic issue.
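
Steps 1 and 4 above lend themselves to simple checks over logged data. The sketch below assumes that pose consumption timestamps and a map tile ID per alert are available; the function names and the `min_alerts` threshold of 3 are hypothetical and would need to be set through the release process.

```python
from collections import Counter
from typing import Iterable, List, Set


def consumed_after_red(red_at: float, consume_times: Iterable[float]) -> List[float]:
    """Step 1: timestamps at which pose was consumed at or after the red transition.

    Treating the boundary timestamp as a violation is a conservative assumption.
    """
    return sorted(t for t in consume_times if t >= red_at)


def tiles_to_hold(alert_tiles: Iterable[str], min_alerts: int = 3) -> Set[str]:
    """Step 4: map tiles where pose-integrity alerts cluster."""
    counts = Counter(alert_tiles)
    return {tile for tile, n in counts.items() if n >= min_alerts}
```

An empty result from `consumed_after_red` supports the claim that downstream consumers stopped using the pose; any returned timestamp is evidence to preserve with the incident.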

P1 Unknown/OOD Cluster

  1. Pull event frames, object/occupancy outputs, raw sensor data, and map context.
  2. Label whether the object is aircraft/GSE/person/FOD/infrastructure/artifact/novel.
  3. Check whether planner treated it conservatively.
  4. If conservative action occurred and cluster is operationally acceptable, keep watch and add data to training/evaluation queue.
  5. If suppression or false-free-space occurred, escalate to P0.
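
The decision in steps 3 to 5 can be sketched as a small function. The `OodReview` record and the returned action strings are hypothetical, and the final fallback (keep the P1 open when the planner was not conservative or the cluster is not operationally acceptable) is an assumption, since the steps above do not name that case.

```python
from dataclasses import dataclass

# Labels from step 2.
LABELS = {"aircraft", "GSE", "person", "FOD", "infrastructure", "artifact", "novel"}


@dataclass(frozen=True)
class OodReview:
    label: str
    planner_conservative: bool
    suppressed: bool
    false_free_space: bool
    operationally_acceptable: bool


def triage(review: OodReview) -> str:
    """Map a completed review to the next action."""
    if review.label not in LABELS:
        raise ValueError(f"unknown label: {review.label!r}")
    # Step 5: suppression or false free space always escalates.
    if review.suppressed or review.false_free_space:
        return "escalate_to_p0"
    # Step 4: conservative handling of an acceptable cluster.
    if review.planner_conservative and review.operationally_acceptable:
        return "watch_and_queue_for_training"
    # Not covered by the runbook steps; assumed default.
    return "keep_p1_open"
```

The escalation check runs first so that an otherwise acceptable cluster can never mask a suppression or false-free-space event.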

P1 Calibration Drift

  1. Confirm sensor serials, calibration package, TF tree hash, and recent maintenance.
  2. Inspect projection/overlap residual preview and prerequisites such as route features and weather.
  3. If drift is physical, stop autonomous use until maintenance and recalibration.
  4. If the monitor raised a false alarm, retain the event and update the monitor qualification evidence before tuning the threshold.

Suppression Rules

| Rule | Requirement |
| --- | --- |
| Time-limited | Every suppression has expiry and owner |
| Scoped | Vehicle/site/route/artifact specific; no fleetwide wildcard for safety alerts |
| Evidence-preserving | Raw events are still stored |
| Release-aware | Suppressed alerts are visible in release reviews |
| Safety-reviewed | P0 suppressions require safety lead approval and compensating control |
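
Three of these rules (time-limited, scoped, safety-reviewed) can be checked mechanically when a suppression is requested. The sketch below is a hypothetical validator: the `Suppression` fields, the scope keys, and the `"*"` wildcard convention are assumptions. The evidence-preserving and release-aware rules concern storage and reporting and are not covered here.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Dict, List, Optional

SCOPE_KEYS = {"vehicle", "site", "route", "artifact"}


@dataclass
class Suppression:
    alert_kind: str
    severity: str
    owner: str
    expires_at: Optional[datetime]  # timezone-aware
    scope: Dict[str, str] = field(default_factory=dict)
    safety_alert: bool = True
    safety_lead_approval: Optional[str] = None
    compensating_control: Optional[str] = None


def check(s: Suppression, now: datetime) -> List[str]:
    """Return rule violations; an empty list means the suppression may be applied."""
    problems: List[str] = []
    # Time-limited: every suppression has expiry and owner.
    if not s.owner:
        problems.append("owner missing")
    if s.expires_at is None:
        problems.append("expiry missing")
    elif s.expires_at <= now:
        problems.append("already expired")
    # Scoped: at least one concrete, non-wildcard scope for safety alerts.
    concrete = {k: v for k, v in s.scope.items()
                if k in SCOPE_KEYS and v and v != "*"}
    if s.safety_alert and not concrete:
        problems.append("safety alert suppression must be scoped")
    # Safety-reviewed: P0 needs approval and a compensating control.
    if s.severity == "P0" and not (s.safety_lead_approval and s.compensating_control):
        problems.append("P0 needs safety lead approval and compensating control")
    return problems
```

Passing `now` explicitly, for example `datetime.now(timezone.utc)`, keeps the check deterministic and easy to test.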

Closure Package

| Artifact | Required for P0/P1 closure |
| --- | --- |
| Timeline | Detect, acknowledge, vehicle action, operator action, clear |
| Manifest | Active build/model/map/calibration/config/schema |
| Root cause | Confirmed, probable, or unknown with further action |
| Evidence | Logs, trace, dashboards, replay, labels, screenshots if relevant |
| Impact | Vehicles, routes, missions, exposure denominator |
| Corrective action | Fix, rollback, ODD restriction, maintenance, threshold release |
| Safety-case update | Claim/assumption/hazard link or rationale for no change |
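
A closure gate can be sketched as a completeness check over the package. The dictionary layout, key names, and the `status`/`further_action` fields below are hypothetical; only the list of required artifacts and the root-cause categories come from the table above.

```python
from typing import Any, Dict, List

REQUIRED = (
    "timeline", "manifest", "root_cause", "evidence",
    "impact", "corrective_action", "safety_case_update",
)
TIMELINE_KEYS = ("detect", "acknowledge", "vehicle_action", "operator_action", "clear")
ROOT_CAUSE_STATUS = {"confirmed", "probable", "unknown"}


def missing_for_closure(package: Dict[str, Any]) -> List[str]:
    """Return the artifacts or sub-fields still missing for P0/P1 closure."""
    gaps = [name for name in REQUIRED if not package.get(name)]
    timeline = package.get("timeline")
    if timeline:
        gaps += [f"timeline.{k}" for k in TIMELINE_KEYS if k not in timeline]
    root = package.get("root_cause")
    if root:
        status = root.get("status")
        if status not in ROOT_CAUSE_STATUS:
            gaps.append("root_cause.status")
        # An unknown root cause is acceptable only with a further action.
        elif status == "unknown" and not root.get("further_action"):
            gaps.append("root_cause.further_action")
    return gaps
```

An incident would be eligible for closure only when the function returns an empty list; the returned names can be posted back to the ticket as the remaining work.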

Related Documents

  • 50-cloud-fleet/observability/slam-timing-health-dashboard.md
  • 50-cloud-fleet/observability/map-hygiene-operational-monitoring.md
  • 50-cloud-fleet/operations/fleet-sre-incident-response.md
  • 60-safety-validation/runtime-assurance/monitor-qualification-evidence.md
  • 60-safety-validation/safety-case/incident-reporting-post-market-monitoring.md
  • 40-runtime-systems/ml-deployment/perception-slam-runtime-interface-contract.md

Sources

Public research notes collected from public sources.