Safety-Certified Runtime Compute

Last updated: 2026-05-09

Why It Matters

High-performance AV compute is not automatically safety-certified compute. Perception accelerators, Linux containers, GPU kernels, neural networks, and recorders can be mission-critical while still being unsuitable as the final safety authority. A certifiable vehicle runtime separates the safety decision path from the high-performance autonomy path, then proves freedom from interference, bounded fallback timing, and traceable assumptions.

The practical pattern is a mixed-criticality compute stack: safety-certified RTOS or hypervisor services supervise a safety controller, network processor, or lockstep MCU, while Linux/ROS/AI workloads run in a constrained domain. Platforms such as QNX OS for Safety, NXP S32G, and NVIDIA DRIVE can contribute evidence, but the vehicle safety case must still prove the final integration.

Architecture Decisions

  • Safety island: put E-stop, brake enable, speed limit, watchdog, geofence, and heartbeat supervision on an independent MCU, safety processor, or certified RTOS partition.
  • AI domain: run perception, prediction, planning, foundation models, and logging on high-performance Linux/GPU compute as QM or lower-ASIL software unless explicitly qualified otherwise.
  • Gateway processor: use a vehicle network processor such as S32G for CAN/Ethernet gateway functions, safety supervision, secure boot, and traffic control when the system needs ASIL-capable networking.
  • Certified OS: use QNX OS for Safety or an equivalent RTOS for safety services that require deterministic scheduling, safety manuals, certified libraries, and qualified toolchains.
  • Hypervisor: partition Linux, QNX, service OSs, and recorder workloads only when CPU, memory, DMA, interrupt, and device ownership can be shown in the safety case.
  • NVIDIA DRIVE/Orin/Thor: treat platform safety certifications, safety manuals, and process certifications as inputs; they do not certify the AV application by themselves.
  • Degraded operation: define which autonomy functions continue after loss of GPU, Linux, network gateway, time source, recorder, or one sensor zone.

Reference runtime split:

Safety domain
QNX / safety RTOS / lockstep MCU
        +-- watchdogs
        +-- speed and geofence limits
        +-- brake and E-stop supervision
        +-- safety CAN / TSN gateway checks

Autonomy domain
Linux / ROS 2 / CUDA / TensorRT
        +-- perception and fusion
        +-- planning and behavior
        +-- data recording and diagnostics
        +-- fleet and developer tools

Hardware controls
MMU, IOMMU, hypervisor, lockstep cores, secure boot, HSE/TPM,
partitioned network ports, bounded watchdog timeouts

Evidence Artifacts

  • Compute safety concept showing ASIL/QM allocation, safety goals, fallback timeouts, and independence assumptions.
  • Platform safety manuals, safety certificates, SEooC assumptions, errata, qualified toolchain records, and OS/hypervisor configuration baselines.
  • Resource partition evidence for CPU, GPU, DLA, memory, DMA, interrupts, storage, Ethernet, CAN, and PCIe devices.
  • Boot chain and update evidence: secure boot state, measured boot logs, rollback policy, signing keys, and recovery image behavior.
  • Watchdog and heartbeat traces for Linux hang, GPU timeout, process crash, scheduler overload, gateway reset, network partition, and time-source loss.
  • Worst-case execution and latency budget for the safety path from hazard detection or heartbeat loss to actuator command.
  • Thermal and power derating evidence proving safety timeouts hold under throttling, brownout, hot soak, and cold start.

Acceptance Checks

  • A full Linux autonomy-domain hang causes a safety response within the approved watchdog timeout.
  • GPU reset, CUDA fault, or perception process crash cannot block the safety controller from commanding brake or safe torque off.
  • Safety-domain code boots, runs, and logs its health without depending on cloud connectivity, ROS master/graph availability, or recorder availability.
  • Shared memory, DMA devices, and PCIe endpoints cannot write into safety-domain memory or corrupt safety I/O.
  • The system refuses mission start when safety manual assumptions are violated: wrong silicon revision, wrong OS build, wrong hypervisor config, or wrong safety-controller firmware.
  • Runtime health messages distinguish degraded AI performance from loss of a safety function.
  • Thermal throttling and power limiting are visible to the supervisor before they violate timing budgets.

Failure Modes

  • Linux or ROS deadlock. Detection: heartbeat timeout, scheduler watchdog, stale command lease. Response: the safety domain commands a controlled stop and blocks new missions.
  • GPU fault or thermal throttling. Detection: driver error, watchdog reset, power/thermal telemetry. Response: drop to reduced perception mode or safe stop, depending on ODD coverage.
  • Hypervisor misconfiguration. Detection: config hash mismatch, partition test failure, device ownership error. Response: refuse release or mission start.
  • DMA or shared-memory interference. Detection: IOMMU fault, memory protection violation, unexplained safety data corruption. Response: isolate the offending partition and enter a safe state.
  • Safety MCU lockstep fault. Detection: lockstep error, ECC fault, reset reason. Response: engage hardware fallback, preserve the event log, require service inspection.
  • Network gateway reset. Detection: missing gateway heartbeat, topology change, CAN/TSN path loss. Response: freeze command output and transition to safe stop if the actuator path is affected.
  • Safety manual assumption drift. Detection: BOM, compiler, OS, or silicon revision mismatch. Response: block the certification claim for that build until re-analysis is complete.
  • Recorder or storage hang. Detection: I/O timeout, queue depth saturation, power rail spike. Response: drop logging load before it can starve safety or control traffic.

Sources

Compiled from publicly available research notes.