
Fleet SRE and Incident Response for Autonomous Vehicle Fleets

Last updated: 2026-05-09

Fleet SRE is the operating discipline that keeps an autonomous vehicle fleet safe, observable, and recoverable after deployment. It sits at the intersection of vehicle runtime monitoring, fleet operations, safety assurance, cybersecurity response, and customer/site operations. For AVs, the unit of reliability is not only a cloud service; it is the combined system of vehicles, operators, maps, models, networks, charging, depots, and local site procedures.

Practical Evidence and Artifact Model

Every production fleet should be able to reconstruct who knew what, when, what authority was exercised, and what evidence supports the final root-cause conclusion. The minimum incident artifact set is:

| Artifact | Contents | Owner | Retention |
|---|---|---|---|
| Incident record | Incident ID, severity, site, affected vehicles, first alert, commander, safety officer, communications lead, current status | Incident commander | Permanent |
| Impact statement | Safety impact, service impact, vehicles stopped, missions aborted, ODD restrictions, affected customers/site stakeholders | Operations lead | Permanent |
| Timeline | Alert, acknowledgement, mitigations, fleet-stop decisions, rollbacks, regulator/customer notifications, recovery | Scribe | Permanent |
| Fleet state snapshot | Vehicle IDs, software/model/map/config/calibration versions, battery state, mission state, ODD state, network state | Fleet SRE | Permanent for reportable events |
| Evidence manifest | Links to rosbags/MCAPs, telemetry windows, logs, traces, operator actions, video, map diffs, OTA records, feature flags | Fleet SRE | Same as source data or legal hold |
| Decision log | Fleet stop, site stop, ODD reduction, rollback, teleoperation disablement, dispatch suspension, evidence freeze | Incident commander and safety officer | Permanent |
| Regulator/customer log | Reportability assessment, deadlines, submission IDs, external communications, follow-up questions | Safety or regulatory owner | Permanent |
| Corrective action plan | Root cause, contributing factors, containment, long-term fixes, verification evidence, assigned owners and due dates | Incident owner | Permanent |
| Safety-case delta | Claims impacted, assumptions invalidated, evidence superseded, residual risk decision | Safety case owner | Permanent |

Incident records should use stable IDs that link to telemetry, deployment manifests, map releases, model registry entries, and safety-case evidence IDs. Do not rely on chat history as the system of record.
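
One way to make "stable IDs that link outward" concrete is a record shape like the following Python sketch; the class, field names, and ID formats are illustrative assumptions, not a prescribed schema.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

@dataclass
class EvidenceRef:
    """Stable pointer into another system of record (hypothetical ID formats)."""
    kind: str     # e.g. "mcap", "telemetry_window", "ota_manifest", "map_release"
    ref_id: str   # e.g. "mcap://veh-0042/2026-05-09T10:12:00Z"

@dataclass
class IncidentRecord:
    incident_id: str                      # e.g. "INC-2026-0512", never reused
    severity: str                         # "SEV-0" .. "SEV-3"
    site: str
    affected_vehicles: list[str]
    commander: str
    safety_officer: str
    opened_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
    evidence: list[EvidenceRef] = field(default_factory=list)
    deployment_manifest_id: str | None = None    # link into the OTA/release system
    map_release_id: str | None = None             # link into the map release pipeline
    model_registry_id: str | None = None          # link into the model registry
    safety_case_evidence_ids: list[str] = field(default_factory=list)
```

The point is that every outward link is an ID into another authoritative store, so the record can be reconstructed without relying on chat history.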

Severity Taxonomy

Severity should be based on safety risk and operational blast radius, not only service uptime.

| Severity | AV fleet trigger | Required response |
|---|---|---|
| SEV-0 Safety critical | Injury, collision with aircraft/person/critical asset, uncontrolled motion, safety monitor defeated, credible cyber control compromise, or regulator-notifiable crash/incident | Immediate fleet or site stop authority available; incident commander, safety officer, security lead, executive, and site authority engaged |
| SEV-1 Major operational safety | Repeated near misses, loss of localization/map validity in an active zone, unsafe OTA regression, systemic remote assistance failure, telemetry loss that prevents supervision | Coordinated incident response; stop or restrict affected cohort; preserve data; start reportability clock assessment |
| SEV-2 Degraded fleet | Mission success, intervention rate, or availability outside SLO; one site or cohort degraded but safety envelope intact | On-call response; rollback or restrict if trend worsens; post-incident review required |
| SEV-3 Component issue | Single vehicle, sensor, charger, depot gateway, or data pipeline problem with bounded impact | Service-owner response; ticket and trend tracking |

When severity is uncertain, classify high, stabilize, and downgrade only after evidence review. PagerDuty's public incident response guidance makes the same operational point: an incident is not the right time to litigate severity.
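
As a sketch of that rule, a hypothetical triage helper can default to the most severe candidate and flag the downgrade decision for later evidence review; the severity labels follow the table above, everything else is assumed.

```python
# Hypothetical triage helper: when assessors disagree or evidence is incomplete,
# declare the most severe candidate now and record that any downgrade needs review.
SEV_ORDER = ["SEV-3", "SEV-2", "SEV-1", "SEV-0"]  # least to most severe

def initial_severity(candidates: list[str], evidence_complete: bool) -> tuple[str, bool]:
    """Return (severity to declare now, downgrade_review_required)."""
    most_severe = max(candidates, key=SEV_ORDER.index)
    # Downgrading is a later, evidence-backed decision, never part of the first page.
    return most_severe, (not evidence_complete) or len(set(candidates)) > 1
```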

Deployment Operations

1. Prepare before launch

  • Define explicit authority for fleet stop, site stop, ODD restriction, software rollback, map rollback, model disablement, and return-to-service.
  • Maintain on-call rotations for fleet SRE, vehicle runtime, maps, OTA, data platform, safety, cybersecurity, and site operations.
  • Store runbooks next to dashboards and alerts. Each alert should name the owner, first triage query, likely mitigations, and escalation path, as in the sketch after this list.
  • Drill SEV-0 and SEV-1 scenarios quarterly: loss of telemetry, bad OTA, stale map, cyber key compromise, unsafe behavior spike, and reportable collision.
  • Treat cloud observability and robotics evidence as separate but linked streams: traces/metrics/logs for services, plus rosbags/MCAP/video/diagnostics for vehicle behavior.
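
A hypothetical alert definition that carries its runbook context might look like the following; the keys, thresholds, and query are illustrative and not tied to any specific alerting or query system.

```python
# Hypothetical alert definition: the alert carries enough context to start triage
# without hunting for the runbook. Names, thresholds, and the query are illustrative.
HARD_BRAKE_RATE_ALERT = {
    "name": "fleet.hard_brake_rate_above_baseline",
    "owner": "fleet-sre-oncall",
    "symptom": "Hard-brake events per 100 km above site baseline for 15 minutes",
    "first_triage_query": (
        "SELECT vehicle_id, software_version, map_release, count(*) AS events "
        "FROM hard_brake_events WHERE site = :site AND ts > now() - interval '1 hour' "
        "GROUP BY 1, 2, 3 ORDER BY events DESC"
    ),
    "likely_mitigations": [
        "Restrict the affected route or map tile",
        "Roll back the last software/map/model change if correlated",
        "Reduce dispatch rate into the affected zone",
    ],
    "escalation": ["fleet-sre-oncall", "safety-officer-oncall", "incident-commander"],
    "runbook": "runbooks/hard-brake-rate.md",  # stored next to the dashboard and alert
}
```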

2. Detect on symptoms

Alert on user-visible and safety-visible symptoms first:

| Signal | Example alert |
|---|---|
| Safety monitor | Safety-envelope violation, emergency stop, hard brake above baseline, near-miss trigger |
| Fleet outcome | Mission failure rate, intervention rate, remote-assistance requests, rider/customer cancellations |
| Vehicle health | Localization covariance, point-cloud density, camera/lidar/radar heartbeat, actuator faults, thermal throttling |
| Operations | Vehicle stuck, charging failure, depot queue, site dispatch backlog, operator overload |
| Cloud and network | Telemetry ingest delay, command acknowledgement latency, map/config distribution failure |
| Change correlation | Regression after software, model, map, calibration, or config release |

OpenTelemetry should carry correlation IDs from fleet APIs through dispatch, OTA, data ingestion, and operator tools. Vehicle data systems should add the same incident IDs to bag/MCAP metadata so cloud traces can be joined with physical evidence.
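
A minimal sketch of carrying those correlation IDs with the OpenTelemetry Python API, assuming the opentelemetry-api package; the fleet.* attribute and baggage keys are illustrative names rather than established semantic conventions, and writing the same incident ID into bag/MCAP metadata is left to the vehicle data pipeline.

```python
from opentelemetry import baggage, context, trace

tracer = trace.get_tracer("fleet.dispatch")

def dispatch_mission(vehicle_id: str, mission_id: str, incident_id: str | None = None) -> None:
    # Put the correlation IDs in baggage so downstream services (OTA, data
    # ingestion, operator tools) can read them and attach them to their own spans.
    ctx = baggage.set_baggage("fleet.mission_id", mission_id)
    if incident_id:
        ctx = baggage.set_baggage("fleet.incident_id", incident_id, context=ctx)
    token = context.attach(ctx)
    try:
        with tracer.start_as_current_span("dispatch_mission") as span:
            span.set_attribute("fleet.vehicle_id", vehicle_id)
            span.set_attribute("fleet.mission_id", mission_id)
            if incident_id:
                span.set_attribute("fleet.incident_id", incident_id)
            # ... call the dispatch backend; trace context and baggage propagate
            # over instrumented HTTP/gRPC clients to the rest of the pipeline.
    finally:
        context.detach(token)
```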

3. Respond with named roles

Use an incident command structure adapted to fleet operations:

| Role | Responsibility |
|---|---|
| Incident commander | Owns response pace, role assignment, decisions, and closure |
| Safety officer | Can veto unsafe recovery; owns fleet-stop and return-to-service risk posture |
| Operations liaison | Coordinates depot/site/airport/warehouse/customer operations |
| Vehicle/runtime lead | Diagnoses vehicle stack, sensors, actuators, runtime monitors |
| Cloud SRE lead | Diagnoses APIs, telemetry, dispatch, OTA, data platform |
| Security lead | Handles compromise assessment, credential containment, forensic preservation |
| Communications lead | Sends internal and external updates with approved facts |
| Scribe | Maintains timeline, decisions, links, and evidence manifest |

4. Stabilize before diagnosing

The first operational objective is to remove active risk:

  1. Freeze the affected vehicle or cohort if there is credible safety risk.
  2. Preserve volatile data before power cycling or overwriting logs.
  3. Stop new dispatches into the affected ODD, route, map tile, software cohort, or site zone.
  4. Roll back only when the rollback target is known compatible with map/model/config/calibration and has a current safety assessment (see the compatibility check sketched after this list).
  5. Declare return-to-service only after the safety officer accepts residual risk and evidence links are attached.
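
To illustrate step 4, a hypothetical rollback gate could check the candidate against a compatibility matrix before any rollback is issued; the matrix shape, version strings, and field names are assumptions.

```python
# Hypothetical rollback gate: refuse a rollback target unless it is explicitly
# validated against the cohort's current map/model/config and still carries a
# current safety assessment. All structures and names are illustrative.
COMPATIBILITY_MATRIX = {
    "stack-4.17.2": {
        "map_releases": {"map-2026.18", "map-2026.19"},
        "model_ids": {"planner-v31", "planner-v32"},
        "config_sets": {"site-A-cfg-7"},
        "safety_assessment_current": True,
    },
}

def rollback_allowed(target_sw: str, cohort_state: dict) -> tuple[bool, str]:
    entry = COMPATIBILITY_MATRIX.get(target_sw)
    if entry is None:
        return False, f"{target_sw} has no compatibility record"
    if not entry["safety_assessment_current"]:
        return False, f"{target_sw} safety assessment is stale"
    if cohort_state["map_release"] not in entry["map_releases"]:
        return False, "map release not validated against rollback target"
    if cohort_state["model_id"] not in entry["model_ids"]:
        return False, "model not validated against rollback target"
    if cohort_state["config_set"] not in entry["config_sets"]:
        return False, "config set not validated against rollback target"
    return True, "ok"
```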

5. Learn without hiding operational factors

Post-incident reviews should cover detection, mitigation, communication, evidence quality, safety-case impact, and recurrence prevention. They should not stop at the proximate code bug. For AV fleets, common contributing factors include site layout changes, map freshness, operator training, weather, comms, maintenance, and release governance.

Risks and Failure Modes

| Failure mode | Consequence | Control |
|---|---|---|
| No explicit fleet-stop authority | Operators debate while vehicles continue unsafe work | Written authority matrix, drills, and one-click fleet/site stop paths |
| Alerting on causes instead of symptoms | Real customer or safety impact is missed | SLO and safety indicator alerts at fleet/site/cohort layers |
| Missing vehicle evidence | Root cause cannot be proven; regulator trust degrades | Event ring buffers, immutable upload manifests, legal hold workflow |
| Chat-only incident record | Timeline and decisions are incomplete | Incident tool as system of record; chat mirrors only |
| Rollback creates new mismatch | Older software incompatible with current map/model/config | Compatibility matrix and signed release manifests |
| Cloud incident disables safety operations | Vehicles cannot be supervised or commanded | Local safe-stop policy, degraded offline mode, independent emergency channel |
| Over-broad fleet stop | Unnecessary operational loss and alert desensitization | Cohort/site/ODD-scoped stop options with escalation rules |
| Under-broad containment | Systemic defect remains active in another site or cohort | Blast-radius query by artifact version, site, hardware, and ODD (see the query sketch after this table) |
| Blame-oriented postmortem | Near misses stop being reported | Just-culture review format and anonymous reporting path |
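
For the under-broad containment row, a blast-radius query can be as simple as grouping the latest fleet-state snapshots by the implicated artifact; the record fields below are assumptions that mirror the fleet state snapshot artifact.

```python
# Hypothetical blast-radius query over a fleet-state snapshot store.
# Each record mirrors the "fleet state snapshot" artifact; names are illustrative.
from collections import Counter

def blast_radius(snapshots: list[dict], implicated: dict) -> Counter:
    """Count vehicles still running the implicated artifact, grouped by
    (site, hardware_rev, odd_state), so containment can be scoped precisely."""
    exposed = [
        s for s in snapshots
        if all(s.get(key) == value for key, value in implicated.items())
    ]
    return Counter((s["site"], s["hardware_rev"], s["odd_state"]) for s in exposed)

# Example: which sites, hardware revisions, and ODDs still run the suspect map release?
# groups = blast_radius(all_snapshots, {"map_release": "map-2026.19"})
```
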
Related notes

  • 50-cloud-fleet/observability/fleet-anomaly-root-cause-attribution.md
  • 50-cloud-fleet/fleet-management/fleet-management-dispatch.md
  • 50-cloud-fleet/ota/ota-fleet-management.md
  • 40-runtime-systems/data-logging/on-vehicle-data-triage-selective-upload.md
  • 40-runtime-systems/monitoring-observability/hmi-operator-interface.md
  • 60-safety-validation/safety-case/safety-incidents-lessons.md
  • 60-safety-validation/runtime-assurance/runtime-verification-monitoring.md
  • 60-safety-validation/cybersecurity/cybersecurity-airside-av.md

Sources

Compiled from publicly available incident-response and SRE guidance, including PagerDuty's public incident response documentation.