
Zoox Perception Stack — Technical Deep Dive

Compiled March 2026 from Zoox journal articles, Amazon Science publications, CVPR/NeurIPS papers, NVIDIA blog posts, FLIR/Hesai datasheets, Zoox Safety Reports, arXiv papers, job postings, and patent filings.


Table of Contents

  1. Sensor Hardware
  2. Sensor Calibration & Synchronization
  3. Early Fusion Architecture (CVPR 2025)
  4. Sensor Staleness Innovation (CVPR 2025)
  5. Three Parallel Perception Systems
  6. Published Neural Network Architectures
  7. Next-Gen Unified Perception Model
  8. Perception Outputs & BEV Representation
  9. Perception → Prediction Interface
  10. Prediction Architecture (CNN + GNN)
  11. Vision-Language-Action Foundation Model
  12. Inference & On-Vehicle Compute
  13. Training Infrastructure
  14. Perception Team Organization
  15. Competitive Analysis
  16. What Remains Undisclosed
  17. Sources

Sensor Hardware

Overview

~64 sensors distributed across four identical sensor pods at the vehicle's four corners. Each corner achieves 270-degree FOV, overlapping to provide full 360-degree coverage extending 150–200+ meters.

LiDAR

| Attribute | Detail |
| --- | --- |
| Supplier | Hesai Technology (multi-year, multi-generation partnership) |
| Count | 8 units (2 per corner pod — 1 long-range + 1 short-range) |
| Wavelength | 905 nm (Class 1 eye-safe) |
| Certifications | ISO 26262 ASIL B, ISO 21434 Cybersecurity |
| IP rating | IP6K7, IP6K9K |
| Per-point data | XYZ position, intensity, per-point timestamp |
| Timestamp | T_L = max(T_i) — sweep completion time |

Long-range units (1 per corner) — Hesai Pandar128 / OT128:

| Attribute | Detail |
| --- | --- |
| Channels | 128 |
| Range | 200 m at 10% reflectivity |
| Points/sec | 3.46M single-return, 6.91M dual-return |
| Horizontal FOV | 360° |
| Vertical FOV | 40° |
| Angular resolution | 0.1° (H) × 0.125° (V) |
| Rotation rate | 10 Hz |
| Designed lifetime | >30,000 hours |

Short-range units (1 per corner) — Hesai QT128:

| Attribute | Detail |
| --- | --- |
| Channels | 128 |
| Range | 20 m at 10% reflectivity (50 m max) |
| Points/sec | 1.15M single-return |
| Horizontal FOV | 360° |
| Vertical FOV | 105° (ultra-wide for near-field) |
| Angular resolution | 0.4° (H) × 0.4° (V) |
| Weight | 700 g |
| Purpose | Blind-spot coverage — pedestrians, animals, small objects at close range |

Sensor Count Summary

| Modality | Count | Source |
| --- | --- | --- |
| RGB cameras | 14 | Zoox Safety Report / teardown analysis |
| Radars | 20 | Zoox Safety Report / teardown analysis |
| LiDARs | 8 | Zoox Safety Report / teardown analysis |
| LWIR thermal | Multiple (exact count undisclosed) | FLIR partnership announcement |
| Ultrasonics | Multiple (close-range blind-spot fill) | Teardown analysis |
| Microphones | Multiple | Zoox journal |
| Total | ~64 sensors | Zoox official |

Cameras

| Attribute | Detail |
| --- | --- |
| Count | 14 RGB cameras across the vehicle |
| Types | Wide field-of-view + telephoto lenses |
| Frame rate | 10 Hz |
| Exposure | 5–15 ms exposure, 25 μs row time (rolling shutter) |
| Timestamp | T_C = first-line exposure-stop time |
| Sync | Clock-synchronized and phase-locked with LiDAR |
| Uses | Color, texture, traffic lights, pedestrian gestures, fine-grained classification |

Longwave Infrared (LWIR) Thermal Cameras

| Attribute | Detail |
| --- | --- |
| Supplier | Teledyne FLIR |
| Model | FLIR Boson + ADK |
| Core | Uncooled VOx microbolometer |
| Resolution | 640 × 512 pixels |
| Spectral band | 8–14 μm |
| Pixel pitch | 12 μm |
| Thermal sensitivity | <50 mK NETD |
| HFOV | 75° |
| Frame rate | 30/60 Hz selectable |
| Onboard VPU | Intel Movidius Myriad 2 |
| Interface | USB, GMSL, Ethernet, FPD-Link III |
| Ingress | IP67-rated, heated window |
| Operating temp | −40°C to +85°C |
| Weight | ~100 g per unit |
| Power | ~4 W average |
| Sensing mode | Passive — no illumination needed |
| Uses | Pedestrian/cyclist/animal detection in darkness, sun glare, fog, smoke, rain, snow |

Radar

| Attribute | Detail |
| --- | --- |
| Count | 20 units (~5 per corner, some mounted low on the vehicle body) |
| Supplier | Not publicly disclosed |
| Frequency | 77 GHz with ~4 GHz bandwidth (industry standard) |
| Type | 4D imaging radar (range, azimuth, elevation, velocity) |
| Sync | NOT phase-locked to other sensors (variable firing frequencies with offsets) |
| Data aggregation | 1-second buffer |
| Per-point data | 3D position, RCS, SNR, Doppler interval, per-point timestamp |
| Timestamp | T_R = max(T_i) — latest radar point in buffer |
| Uses | Direct velocity (Doppler), long range (hundreds of meters), weather-robust, penetrates occlusions |
| Patent | US 2019/0391250 — radar clustering and velocity disambiguation across multiple pulse-Doppler sensors |

Microphones

| Attribute | Detail |
| --- | --- |
| Type | External acoustic array |
| Use | Emergency vehicle siren detection and directional approach determination |
| Techniques | ML models on acoustic data; Direction of Arrival (DoA) estimation |
| Patent | CN114586084A — emergency vehicle detection |

Ultrasonic Sensors

| Attribute | Detail |
| --- | --- |
| Use | Close-range blind-spot fill around the vehicle body |
| Typical specs | 40–100 kHz frequency, 0.3–2.5 m range, 120–160° angular detection |
| Note | Exact Zoox ultrasonic specifications not publicly disclosed |

Additional Sensors (for Localization)

GPS receivers, accelerometers (IMU), gyroscopes, wheel speed sensors, steering angle sensors.


Sensor Calibration & Synchronization

Infrastructure-Free Calibration (CLAMS)

Zoox pioneered automatic extrinsic calibration without calibration targets:

  • Identifies natural environmental features (building edges, tree trunks)
  • Aligns image gradients from cameras with depth edges in LiDAR point clouds
  • Runs continuously during operation — thermal cycling, shock, and vibration cause constant drift
  • No external infrastructure required (unlike traditional checkerboard or target-based calibration)

Clock Synchronization

| Modality | Sync Status | Timing |
| --- | --- | --- |
| LiDAR | Phase-locked | Sweep completion timestamp |
| Cameras | Phase-locked to LiDAR | First-line exposure stop |
| Radar | NOT phase-locked | Variable firing with offsets |
| LWIR | Integrated into the main clock domain | 30/60 Hz |

For a perfectly synchronized (phase-locked) camera, the camera timestamp satisfies:

T_C = T_L − 0.1 × (θ_L − θ_C) / (2π)

where θ_L and θ_C are the LiDAR and camera azimuth angles, and 0.1 s is the LiDAR sweep period at 10 Hz.
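A worked example of this phase-lock relationship, as a minimal sketch (assumptions: θ in radians, 10 Hz sweep; the function name is illustrative):

```python
import math

SWEEP_PERIOD = 0.1  # seconds per 360° LiDAR sweep at 10 Hz

def synced_camera_timestamp(t_lidar, theta_lidar, theta_camera):
    """Ideal camera trigger time so the exposure coincides with the moment
    the LiDAR sweep passes the camera's azimuth."""
    return t_lidar - SWEEP_PERIOD * (theta_lidar - theta_camera) / (2 * math.pi)

# A sweep completing at t = 12.300 s at azimuth 2π means a camera facing
# azimuth π should have been exposed half a sweep (50 ms) earlier:
print(synced_camera_timestamp(12.300, 2 * math.pi, math.pi))  # 12.25
```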

The Staleness Problem

Even with clock synchronization, processing and transmission delays cause modalities to arrive at different times. Real deployment logs show staleness patterns of approximately -0.1s to 0s between camera and LiDAR, with multiple histogram peaks indicating systematic delay patterns.


Early Fusion Architecture (CVPR 2025)

From the paper: "Robust sensor fusion against on-vehicle sensor staleness" (arXiv 2506.05780, CVPR 2025 Precognition Workshop). Authors: Meng Fan, Yifan Zuo, Patrick Blaes, Harley Montgomery, Subhasis Das (all Zoox Inc.).

Architecture Overview

The early fusion operates in perspective view (not BEV):

Camera frames ──→ [YoloXPAFPN backbone] ──→ [B, C_C, H/8, W/8] ──┐
                                                                 │
LiDAR points ───→ [PointPillar backbone] ──→ [B, C_L, H/8, W/8] ─┼─→ [Dynamic Fusion] ──→ [FPN] ──→ [DINO Decoder] ──→ 3D Detections
                  (perspective-view pillars                      │
                   = camera frustum pillars)                     │
Radar points ───→ [PointPillar backbone] ──→ [B, C_R, H/8, W/8] ─┘

Backbones (Per-Modality)

| Modality | Backbone | Notes |
| --- | --- | --- |
| Camera | YoloXPAFPN (YOLOX Path Aggregation Feature Pyramid Network) | Output aligned to stride 8 |
| LiDAR | PointPillar in perspective view | Each "pillar" defined as a camera frustum (not a BEV column) |
| Radar | PointPillar in perspective view | Same frustum-based pillar definition |

Fusion Module

  • Dynamic fusion combines the three backbone feature maps
  • Followed by Feature Pyramid Network (FPN) at strides 8, 16, and 32

Detection Head: DINO Decoder

Adapted DINO (DETR with Improved Denoising Anchor Boxes):

| Component | Loss Function |
| --- | --- |
| Class head | Focal loss |
| 2D box head | GIoU loss |
| 3D box head (center, extents, yaw, velocity) | L1 loss |
| Assignment | Hungarian matching (one-to-one query ↔ ground truth) |

Robustness Training

  • Feature-level sensor dropout: 20% probability of zeroing out any one modality's backbone output during training (see the sketch after this list)
  • Makes the model robust to real-world sensor failures or degradation
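A minimal sketch of how such feature-level dropout could look (illustrative PyTorch, assuming one randomly chosen modality is zeroed when the 20% dropout triggers; not Zoox's code):

```python
import torch

def sensor_dropout(features: dict, p: float = 0.2) -> dict:
    """features: modality name -> backbone feature map, e.g. (B, C, H/8, W/8).
    With probability p, zero out one randomly chosen modality during training."""
    if torch.rand(()).item() < p:
        names = sorted(features)
        victim = names[torch.randint(len(names), ()).item()]
        features = dict(features)
        features[victim] = torch.zeros_like(features[victim])
    return features
```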

Sensor Staleness Innovation (CVPR 2025)

Two Model-Agnostic Innovations

Innovation 1 — Per-Point Timestamp Offset Feature:

For every LiDAR point and radar point, the model receives:

Δt_i = T_C - T_i

(time difference between camera frame timestamp and individual point capture timestamp)

This gives fine-grained temporal awareness at the individual measurement level.

Innovation 2 — Synthetic Stale Data Augmentation:

  1. LiDAR data and ground truth remain unchanged (reference frame)
  2. Compute ideal synchronized camera timestamp: T_C = T_L - 0.1 × (θ_L - θ_C)/(2π)
  3. Generate random jitter: δt ~ Uniform(-0.1s, +0.1s)
  4. Query closest actual camera frame at T' = T_C + δt
  5. For radar: independently jitter T_R and update 1-second buffer
  6. Mix augmented data into training at ratio P_s (optimal: ~0.01 = 1%) — see the sketch below
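A minimal sketch of steps 2–4 (assumptions: camera frames arrive as (timestamp, image) pairs; names are illustrative, not the paper's code):

```python
import math
import random

SWEEP_PERIOD = 0.1  # s, 10 Hz LiDAR

def sample_stale_camera_frame(t_lidar, theta_l, theta_c, camera_frames, max_jitter=0.1):
    """camera_frames: list of (timestamp, image) sorted by timestamp."""
    t_ideal = t_lidar - SWEEP_PERIOD * (theta_l - theta_c) / (2 * math.pi)
    t_query = t_ideal + random.uniform(-max_jitter, max_jitter)  # δt ~ U(-0.1 s, +0.1 s)
    t_frame, image = min(camera_frames, key=lambda f: abs(f[0] - t_query))
    # The per-point offset feature (Innovation 1) is then recomputed against
    # the stale frame: Δt_i = t_frame - t_i for every LiDAR/radar point.
    return t_frame, image
```

Augmented samples produced this way are mixed into training at the small ratio P_s; the paper found ~1% sufficient.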

Experimental Results (Proprietary AV Dataset)

Under Perfect Synchronization (No Staleness)

| Model | Cyclists F1 | Cars F1 | Pedestrians F1 |
| --- | --- | --- | --- |
| Baseline | 32.1% | 52.8% | 28.2% |
| With augmentation | 30.8% | 52.4% | 28.5% |

Minimal degradation — augmentation does NOT hurt synchronized performance.

Under 100 ms Camera Staleness

| Model | Cyclists F1 | Cars F1 | Pedestrians F1 |
| --- | --- | --- | --- |
| Baseline | 14.1% | 36.6% | 6.3% |
| With augmentation | 32.1% | 52.1% | 26.8% |
| Improvement | 2.3× | 1.4× | 4.3× |

The baseline collapses under staleness. The augmented model is virtually unaffected.

Precision/Recall Under 100 ms Staleness

| Model | Precision (Cyc/Car/Ped) | Recall (Cyc/Car/Ped) |
| --- | --- | --- |
| Baseline | 10.2 / 30.6 / 7.1 | 22.9 / 45.6 / 5.7 |
| With augmentation | 23.3 / 40.3 / 22.0 | 51.3 / 73.7 / 34.4 |

Deployment Recommendation

150 ms threshold: Use stale data (with augmentation-trained model) when staleness < 150 ms. Apply modality dropout (zero features) when staleness > 150 ms or sensor fails completely.
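A minimal sketch of that gating logic (illustrative PyTorch; the threshold constant and helper are assumptions):

```python
import torch

STALENESS_LIMIT_S = 0.150

def gate_modality(features, staleness_s):
    """Stale-but-recent features still carry signal for an augmentation-trained
    model; beyond the limit (or on total sensor failure) fall back to the
    trained modality-dropout behavior by zeroing the features."""
    if staleness_s is None or staleness_s > STALENESS_LIMIT_S:
        return torch.zeros_like(features)
    return features
```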

Properties

  • Model-agnostic — works with any fusion architecture
  • Negligible latency impact — suitable for real-time on-vehicle deployment
  • Deployed summer 2025 on the Zoox production fleet

Three Parallel Perception Systems

Zoox runs three architecturally diverse, independent perception systems simultaneously:

System 1: Main AI Perception Stack

| Attribute | Detail |
| --- | --- |
| Type | ML-based (deep learning) |
| Input | All 5 sensor modalities via early fusion |
| Outputs | 3D bounding boxes + velocity, classification, tracking, semantic segmentation, occupancy, dense depth |
| Feeds | Prediction and planning modules |

System 2: Geometric Collision Avoidance (CAS)

| Attribute | Detail |
| --- | --- |
| Type | Interpretable geometric algorithms + deep-learning hybrid |
| Input | Raw sensor data (direct) from a subset of sensors (primarily LiDAR, radar, ToF); CAS team job postings confirm consumption of lidar, radar, vision, and LWIR |
| Output | Near-collision warnings along the intended driving path; trajectory validation verdicts |
| Design | Higher integrity, lower complexity than System 1; shorter processing pipeline; more easily verifiable algorithms; designed for potential ASIL-D certification |
| Response | Ultra-fast braking ("like hitting the brakes if someone steps out") |
| Implementation | C++ with low-latency algorithm design |
| Patent | US20200211394 — Collision Avoidance System (dual-system architecture) |

CAS Perception: Geometric Algorithms

CAS Perception processes raw sensor data using "a combination of geometric, interpretable algorithms and deep learning" to detect near-collisions under tight compute resource constraints. The geometric algorithms are designed to be auditable and verifiable, avoiding the opacity of the Main AI's deep neural networks.

Corridor-Based Spatial Analysis (US11500385B2):

| Component | Detail |
| --- | --- |
| Corridor definition | Bounded region ahead of the vehicle based on planned trajectory, vehicle dimensions, velocity, and steering offset |
| Spatial filtering | Only sensor data within the corridor is analyzed — dramatically reduces computational load |
| Ground modeling | Multi-degree polynomial or spline curves (Bézier, B-spline, NURBS) fitted to the ground profile within the corridor |
| Knot formula | Knot count = control points + curve degree + 1; control points determined from sensor channel density |
| Object classification | Points above the fitted ground curve (or exhibiting elevation discontinuities) are classified as objects; points within a threshold distance below the curve are ground |
| Regression | Weighted least squares with a skewed loss function — penalizes false negatives (missed objects) more heavily than false positives |
| Pre-fitting | Outlier elimination via clustering + RANSAC; data weighting by elevation (lower = heavier weight); binning by sensor channel density |
| Post-fitting | Control-point elevation caps (max sensor data elevation + buffer); knot-spacing optimization for uniform spatial distribution |
| Multi-grade terrain | Handles complex environments (e.g., negative grade → flat → negative grade) using higher-order polynomials; projects 3D data into 2D elevation profiles, fits curves, then projects back |
| Estimators | M-estimators instead of neural networks, for robustness and verifiability |
| Patents | US11500385B2 — Collision Avoidance Perception System; EP4037946A1 — Complex Ground Profile Estimation |

Object Detection (Secondary System):

| Component | Detail |
| --- | --- |
| Approach | Detects objects by sensor data analysis, in many cases without semantic classification |
| Output | Object presence, position, velocity, acceleration, extent (size) — but NOT classification |
| Tracking | Historical position, velocity, acceleration, and orientation tracked over time |
| Motion prediction | Straight-line approximation based on current velocity (simple cases); Extended Kalman Filter / particle filter (sophisticated cases) |
| Ground removal | Eigenvalue decomposition per voxel to filter drivable surfaces from LiDAR data |
| Key distinction | Probabilistic models (Kalman filters, particle filters) rather than neural networks |

CAS Trajectory Validation

The CAS acts as a multiplexer between the Main AI planner and vehicle drive controllers:

  1. Main AI planner generates a primary trajectory and a secondary/contingent trajectory (backup)
  2. Both trajectories are sent to CAS for independent validation
  3. CAS runs its own perception, prediction, and collision checking on each trajectory
  4. Validated trajectory is passed through to controllers; failed trajectories trigger escalation

Two Simultaneous Distance Checks:

| Check | Pass Condition |
| --- | --- |
| Object distance | Nearest detected object must be ≥ threshold distance |
| Ground extent | Furthest detected ground point must be ≥ threshold distance |

Threshold distance is dynamically calculated from: vehicle velocity, stopping distance estimates, environmental gradient, and coefficient of friction. If either check fails, the trajectory is rejected.

Additional Validation Checks:

| Check | Description |
| --- | --- |
| Temporal freshness | Trajectory must have been generated less than a threshold time ago |
| Consistency | Trajectory must be consistent with the current or previous vehicle pose |
| Feasibility | Trajectory must respect vehicle steering and acceleration limits |
| Collision detection | Vehicle bounding box must not intersect predicted object bounding boxes at any common timestep |

CAS Trajectory Hierarchy (4-Level Escalation)

| Level | Trajectory Type | Action | Comfort |
| --- | --- | --- | --- |
| 1 | Primary trajectory | Normal driving (acceleration, lane changes) | Normal |
| 2 | Secondary/contingent | Gentle stop (deceleration < maximum) | Moderate |
| 3 | Collision avoidance | Modified secondary with adjusted deceleration profile | Uncomfortable |
| 4 | Maximum deceleration | Emergency stop (emergency brakes, seatbelt pre-tensioners) | Emergency |

The system selects the least-escalated trajectory that passes validation, minimizing passenger discomfort. A state machine prevents de-escalation (returning to a more comfortable trajectory level) until an explicit release signal is received, which prevents oscillation. A minimal sketch of this logic follows.
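Illustrative Python only — Zoox's CAS is implemented in C++, and the class and method names here are invented:

```python
from enum import IntEnum

class Level(IntEnum):
    PRIMARY = 1              # normal driving
    SECONDARY = 2            # gentle stop
    COLLISION_AVOIDANCE = 3  # adjusted deceleration
    MAX_DECEL = 4            # emergency stop

class EscalationStateMachine:
    def __init__(self):
        self.level = Level.PRIMARY

    def select(self, validated: set) -> Level:
        """Pick the least-escalated validated trajectory, but never
        de-escalate below the current level without an explicit release."""
        candidates = [lv for lv in Level if lv in validated and lv >= self.level]
        self.level = min(candidates, default=Level.MAX_DECEL)
        return self.level

    def release(self):
        """Explicit all-clear — the only path back toward PRIMARY,
        preventing oscillation (chattering) between levels."""
        self.level = Level.PRIMARY
```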

Advisory vs. Override Logic:

| Scenario | CAS Response |
| --- | --- |
| Far-future collision | Warning message to the Main AI with time-to-collision, object extents, velocity, location, and collision point — allowing the AI to re-plan |
| Imminent collision | Direct command to the secondary or collision-avoidance trajectory |
| System failure | Monitor triggers immediate transition to maximum deceleration |

CAS Polygon & Geometric Collision Detection

Bounding Box Collision Checking (Primary Method — US20200211394):

Zoox represents all objects as oriented bounding boxes (cuboids) defined by eight corners with position, orientation, length, width, and height. Collision detection checks whether the ego vehicle's bounding box overlaps with obstacle bounding boxes at each projected timestep, with optional safety margins (enlarged bounding boxes for conservative detection).

Path Polygon Corridors (US20210370921A1 — Perturbed Object Trajectories):

| Component | Detail |
| --- | --- |
| Path polygons | Left/right trajectory boundaries forming a swept 2D region ("corridor") |
| Boundary points | Positioned at 0.2–0.5 s intervals along the trajectory |
| Turn adjustment | Boundaries adjusted outward during turns based on curvature radius |
| Collision test | Time-space overlap analysis — checks temporal coincidence of vehicle and object occupancy within the corridor |
| Perturbation | M × N trajectory variants generated by perturbing acceleration and steering parameters |
| Probability | P(collision) = (trajectories causing collision) / (total perturbed trajectories), weighted by behavioral likelihood |
| Position cones | Min/max velocity envelopes provide conservative temporal bounds |

Convex Polygon Buffer Regions (US20200398833 — Dynamic Collision Checking):

| Region | Description |
| --- | --- |
| Dilated region | Largest drivable-area extent — represents maximum available space |
| Collision region | Smaller drivable region within the dilated region — hard collision boundary |
| Safety region | Buffer zone between the vehicle and the collision boundary |
| Method | Front-bumper position computed at points along the center curve of the predicted travel region; polygons generated for each position and joined with a convex-shape-based algorithm to produce a convex polygonal buffer |

Cost-Based Trajectory Optimization (US20200139959 — Cost Scaling):

Rather than binary collision/no-collision tests, Zoox uses layered cost regions with continuous distance-based costs. The planner optimizes trajectories against graduated penalties from lane boundaries, dilated regions, collision regions, and safety regions — enabling the planner to find optimal trajectories balancing safety margins against other objectives.

Dynamic vs. Static Object Handling

Dynamic Objects (Vehicles, Pedestrians, Cyclists, Animals):

| Aspect | Detail |
| --- | --- |
| Representation | Oriented bounding boxes (cuboids) with position, heading, velocity, acceleration |
| BEV appearance | Boxes in bird's-eye view — "different, smaller boxes" for pedestrians vs. vehicles |
| Prediction | Linear extrapolation (position + velocity) for simple cases; Extended Kalman Filter for complex cases |
| Collision check | Trajectory-level intersection: projected ego bounding-box path checked against projected obstacle bounding-box path at common time windows |
| Perturbation analysis | Multiple trajectory variants generated by perturbing object acceleration/steering; produces a probabilistic collision assessment |
| Radar contribution | Direct Doppler velocity feeds dynamic-object velocity estimation without relying on multi-frame tracking |

Static / Map Objects (Road Boundaries, Buildings, Keep-Clear Zones):

| Aspect | Detail |
| --- | --- |
| Map representation | ZRN uses polygon meshes for surfaces (US20240094009A1); crosswalks, lanes, and drivable surfaces defined as polygon regions |
| CAS map usage | Secondary system performs less localization processing than the primary; may determine pose relative to objects/surfaces rather than a full map |
| Unmapped obstacles | Detected from sensor data (parked cars, construction, debris) using corridor-based obstacle classification |
| Ground removal | Eigenvalue decomposition per voxel separates drivable ground from obstacles in LiDAR data |
| Drivable area | Defined with layered polygon boundaries (lane boundaries; dilated/collision/safety regions) |

Key Distinction: CAS applies uniform bounding box intersection logic across all object types. The differentiation between dynamic and static objects occurs primarily in the prediction stage — dynamic objects have their trajectories propagated forward using motion models, while static objects are checked as fixed obstacles within the corridor.

Algorithms NOT Used (Based on Patent Evidence)

Across all examined Zoox patents, the following computational geometry algorithms are never mentioned:

  • GJK (Gilbert-Johnson-Keerthi) algorithm
  • Separating Axis Theorem (SAT)
  • Minkowski sum/difference
  • Swept volume computation (as a formal algorithm)
  • Convex hull decomposition for collision checking

This is consistent with industry practice — advanced polygon algorithms like GJK and Minkowski sum are more common in robotics manipulation and game physics engines (Bullet, FCL, Box2D). In AV motion planning, objects are typically simple rectangles where bounding box overlap suffices.

CAS Team Organization

| Sub-Team | Focus |
| --- | --- |
| CAS Perception | Raw sensor processing with geometric + DL algorithms for obstacle detection |
| CAS Planner | Motion planning with low-latency C++ algorithms |
| CAS Verification & Validation | Testing and certification of CAS components |

Real-World CAS Performance

No specific CAS intervention statistics (interventions/mile) have been publicly disclosed. However, real-world recalls reveal CAS-related behaviors:

| Date | Recall | Issue |
| --- | --- | --- |
| March 2025 | 25E-029 (258 vehicles) | Unexpected hard braking: (1) over-cautious braking when a cyclist was near an adjacent crosswalk at a newly green signal; (2) incorrect anticipation of a collision from a motorcyclist/bicyclist rapidly approaching from behind |
| April 2025 | 25E-037 | At speeds >40 mph, "inaccurately confident predictions" of perpendicular vehicle behavior from driveways |
| December 2025 | (332 vehicles) | Software causing lane crossings and crosswalk blocking near intersections |

CTO Jesse Levinson noted: "It's really easy to overreact because the car can look at 300 things and think maybe one of them is going to run into it, causing the car to brake."

CAS Patent Summary

| Patent | Title | Key Innovation |
| --- | --- | --- |
| US20200211394 | Collision Avoidance System | Primary/secondary dual-system architecture, trajectory hierarchy, state machine control |
| US11500385B2 | Collision Avoidance Perception System | Spline-based ground modeling, corridor analysis, weighted least squares classification |
| EP4037946A1 | Complex Ground Profile Estimation | Multi-grade terrain handling, NURBS fitting, outlier elimination |
| US20200398833 | Dynamic Collision Checking | Convex polygon buffer regions, bumper-path projection, layered drivable area |
| US20210370921A1 | Perturbed Object Trajectories | Probabilistic collision checking via Monte Carlo-style perturbation analysis |
| US20200139959 | Cost Scaling in Trajectory Generation | Layered collision/safety regions with distance-based cost functions |
| US20240094009A1 | Map Annotation Modification | Polygon meshes for map surfaces, polygon fitting for map element extents |
| US10535138 | Sensor Data Segmentation | Cross-modal sensor data segmentation for perception |

Key inventors: Andrew Lewis King, Jefferson Bradfield Packer, Kristofer Sven Smeds, Robert Edward Somers, Marc Wimmershoff, Michael Carsten Bosse, Jacob Daniel Boydston, Joshua Kriser Cohen, Chuang Wang, Janek Hudecek, David Pfeiffer.

CAS Theoretical & Research Background

The CAS draws on decades of research across robust statistics, computational geometry, state estimation, safety engineering, and formal methods. This section traces each CAS technique to its foundational academic literature.

A. Architectural Design Philosophy

The dual-system architecture (Main AI + CAS) instantiates several well-established patterns from safety-critical systems engineering:

| Pattern | Seminal Work | Year | Key Idea | CAS Mapping |
| --- | --- | --- | --- | --- |
| N-version programming | Avizienis, "The N-Version Approach to Fault-Tolerant Software," IEEE TSE | 1985 | Multiple independently developed implementations reduce common-cause failure probability | Main AI (neural network) and CAS (geometric) are architecturally dissimilar — different algorithms, different failure modes |
| Design diversity critique | Knight & Leveson, "An Experimental Evaluation of the Assumption of Independence in Multiversion Programming," IEEE TSE | 1986 | Correlated failures occur more often than the independence assumption predicts — homogeneous redundancy is insufficient | Justifies why CAS uses fundamentally different algorithms (geometric/interpretable) rather than duplicating the Main AI |
| Simplex architecture | Seto, Krogh, Sha & Chutinan, ACC 1998; Sha, "Using Simplicity to Control Complexity," IEEE Software | 1998/2001 | A verified simple controller bounds the behavior of an unverified complex controller; safety and performance decoupled | Main AI = advanced unverified controller; CAS = verified baseline safety controller with switching logic |
| Monitor-actuator (doer-checker) | Koopman & Wagner, "Challenges in Autonomous Vehicle Testing and Validation," SAE 2016-01-0128 | 2016 | A complex "doer" produces outputs; a simpler "checker" validates them; the checker is certifiable at a higher ASIL due to reduced complexity | CAS monitors and validates Main AI trajectory outputs; can override with a safe trajectory |
| Responsibility-Sensitive Safety (RSS) | Shalev-Shwartz, Shammah & Shashua, "On a Formal Model of Safe and Scalable Self-driving Cars," arXiv:1708.06374 | 2017 | Mathematical definitions of "dangerous situation" and "proper response"; formal safety-distance calculations; blame attribution | CAS trajectory validation enforces formal safety constraints analogous to RSS proper-response rules |
| SOTIF (ISO 21448) | ISO/PAS 21448:2019 (full standard 2022), "Road vehicles — Safety of the intended functionality" | 2019 | Addresses hazards when the system works as designed but has functional limitations (not hardware failure) — especially perception | CAS's dissimilar perception catches Main AI perception-limitation hazards that ISO 26262 does not address |

ASIL-D and ISO 26262 Context:

The CAS targets ASIL-D (the most stringent automotive safety integrity level), which requires:

  • Single-Point Fault Metric (SPFM) ≥ 99%
  • Latent Fault Metric (LFM) ≥ 90%
  • Probabilistic Metric for Hardware Failure (PMHF) < 10⁻⁸/hour

Through ASIL decomposition (ISO 26262 Part 9), the overall ASIL-D safety goal is split between the complex Main AI (lower ASIL) and the simpler CAS (higher ASIL), with sufficient independence demonstrated through dissimilar redundancy. Lower-complexity interpretable systems are fundamentally easier to certify at higher ASIL levels because their state spaces are smaller, amenable to formal/semi-formal analysis, and have enumerable failure modes.

Why Interpretable Algorithms Over Neural Networks for the Safety-Critical Path:

Katz et al. (2017, "Reluplex," CAV) proved that verifying even simple properties of deep neural networks with ReLU activations is NP-complete. The CAS responds to this intractability by using geometric/analytic methods that provide: (1) deterministic outputs for given inputs, (2) bounded execution time amenable to WCET analysis, (3) enumerable failure modes, (4) formal provability of safety properties, and (5) transparency for certification authorities.

B. Ground Profile Estimation: Splines, Robust Fitting & Outlier Rejection

B-Splines, Bezier Curves, and NURBS:

| Contribution | Author(s) | Year | Significance |
| --- | --- | --- | --- |
| Spline functions introduced | I. J. Schoenberg, "Contributions to the Problem of Approximation of Equidistant Data," Q. Appl. Math. 4:45-99 | 1946 | Founded spline theory — smooth curve fitting to discrete noisy data, the core CAS ground-fitting problem |
| Bézier curves (unpublished) | Paul de Casteljau, internal Citroën documents | 1959 | First practical control-point curve evaluation algorithm |
| Bézier curves (published) | Pierre Bézier, Renault; UNISURF system | 1962–72 | First CAD/CAM system using polynomial control-point curves in production |
| Cox-de Boor recursion | Carl de Boor, "On Calculating with B-Splines," J. Approx. Theory 6:50-62; Maurice Cox independently | 1972 | Recursive B-spline basis-function evaluation — computational backbone of all B-spline systems |
| B-splines for CAD | Richard Riesenfeld, PhD thesis, Syracuse | 1973 | Proved B-splines generalize and are superior to Bézier curves for CAD |
| NURBS | Kenneth Versprille, PhD thesis, Syracuse | 1975 | Extended B-splines with rational functions to represent conics and freeform curves in a unified framework |
| Knot vector theory monograph | Carl de Boor, A Practical Guide to Splines, Springer | 1978 | Definitive reference; derives m = n + p + 1 (knots = control points + degree + 1) from the spline-space dimension |
| B-spline ground estimation for AVs | Wirges, Rösch, Bieder & Stiller, "Fast and Robust Ground Surface Estimation from LiDAR Measurements using Uniform B-Splines," FUSION | 2021 | Direct precedent — models the ground surface as a uniform B-spline solved via robust least squares on LiDAR data |

B-splines are ideal for CAS corridor ground modeling because: (a) local control — modifying one control point affects only a local curve segment (critical when a curb or pothole appears in one section); (b) non-uniform point density handling — gracefully adapts to LiDAR's distance-dependent density; (c) tunable smoothness trading off fidelity against noise rejection.

Knot Vector Relationship: The formula m = n + p + 1 (knots = control_points + degree + 1) is a structural consequence of the Cox-de Boor recursion — each basis function of degree p is supported on at most p+1 consecutive knot spans, so defining n basis functions of degree p requires n+p+1 knot values. In the CAS, knot count controls model complexity (more knots → finer ground detail), knot placement controls local resolution (denser at curb transitions, sparser on flat road), and degree selection (typically cubic, p=3, for C² continuity) is physically reasonable for ground surfaces.
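A minimal sketch of the m = n + p + 1 relationship using SciPy's BSpline, which enforces it (illustrative control elevations, not Zoox's solver):

```python
import numpy as np
from scipy.interpolate import BSpline

p = 3                                                  # cubic -> C² continuity
ctrl = np.array([0.0, 0.1, 0.05, -0.2, -0.25, -0.1])   # n = 6 control elevations
n = len(ctrl)
m = n + p + 1                                          # 10 knot values required

# Clamped uniform knot vector: end knots repeated p extra times.
knots = np.concatenate([np.zeros(p), np.linspace(0.0, 1.0, m - 2 * p), np.ones(p)])
assert len(knots) == m                                 # BSpline rejects anything else

ground = BSpline(knots, ctrl, p)  # elevation as a function of arc length in [0, 1]
print(ground(0.5))                # elevation at the corridor midpoint
```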

Weighted Least Squares with Asymmetric Loss:

| Contribution | Author(s) | Year | Significance |
| --- | --- | --- | --- |
| LINEX asymmetric loss function | Hal Varian, "A Bayesian Approach to Real Estate Assessment" | 1975 | First formal asymmetric loss — penalizes overestimation and underestimation differently |
| Bayesian estimation under asymmetric loss | Arnold Zellner, "Bayesian Estimation and Prediction Using Asymmetric Loss Functions," JASA 81(394):446-451 | 1986 | Proved common estimators are inadmissible under asymmetric loss; established a rigorous framework |

Safety motivation: In ground profile estimation, overestimating ground height can mask an obstacle — points near the obstacle's base fall below the inflated ground surface and are classified as ground, a potentially catastrophic false negative. Underestimating ground height merely creates phantom obstacles — a nuisance, but safe. The skewed loss function therefore biases the fit low, ensuring objects on the ground are never masked by an overestimated ground plane.
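A minimal sketch of such a skewed fit via iteratively reweighted least squares (illustrative numpy; the weight value and design matrix X are assumptions):

```python
import numpy as np

def asymmetric_fit(X, y, over_weight=10.0, iters=20):
    """Minimize sum w_i (y_i - X_i @ b)^2 where w_i is boosted whenever the
    fit sits ABOVE a measured point (overestimated ground), biasing it low."""
    w = np.ones(len(y))
    b = np.zeros(X.shape[1])
    for _ in range(iters):
        Xw = X * w[:, None]
        b = np.linalg.solve(Xw.T @ X, Xw.T @ y)    # weighted normal equations
        resid = y - X @ b
        w = np.where(resid < 0, over_weight, 1.0)  # fit above point -> heavy penalty
    return b
```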

M-Estimators:

| Contribution | Author(s) | Year | Significance |
| --- | --- | --- | --- |
| Robust estimation / Huber M-estimator | Peter J. Huber, "Robust Estimation of a Location Parameter," Ann. Math. Stat. 35(1):73-101 | 1964 | Founded robust statistics. Quadratic loss for small residuals (like least squares), linear for large residuals (like LAD); minimizes worst-case asymptotic variance under contamination |
| Robust regression | Huber, "Robust Regression," Ann. Stat. 1(5):799-821 | 1973 | Extended M-estimator theory to regression problems |
| Bisquare redescending estimator | John Tukey | 1970s | Influence function drops to zero for gross outliers — complete rejection of extreme contamination |
| Influence function & breakdown point | Frank Hampel | 1970s | Theoretical framework for comparing robust estimators; breakdown point up to 0.5 (half the data can be corrupted) |

Why M-estimators over neural networks for CAS: (1) Deterministic, analyzable behavior — influence function, breakdown point, asymptotic efficiency are known quantities; (2) No training data dependency — no distribution shift vulnerability; (3) Formal safety certification — bounded influence provides direct evidence for ASIL-D; (4) Computational predictability — deterministic, bounded runtime essential for real-time guarantees.

RANSAC:

| Contribution | Author(s) | Year | Significance |
| --- | --- | --- | --- |
| Random Sample Consensus | Martin A. Fischler & Robert C. Bolles, "Random Sample Consensus," Comm. ACM 24(6):381-395 | 1981 | Inverted conventional fitting: hypothesize a model from minimum samples, then count consensus. De facto standard for outlier rejection in point-cloud processing |

In the CAS corridor pipeline, RANSAC serves as first-stage gross outlier removal (vehicles, pedestrians, vegetation, sensor artifacts), followed by M-estimator-based B-spline fitting for robust fine fitting. This two-stage approach (RANSAC for gross outliers, M-estimator for moderate outliers) is a well-established pattern in robust estimation.
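A minimal sketch of the two-stage pattern on a 1D elevation profile (illustrative numpy; tolerances and the line model are assumptions):

```python
import numpy as np

def ransac_line(x, y, n_iter=200, tol=0.15, rng=np.random.default_rng(0)):
    """Stage 1: gross outlier removal by consensus on a 2-point line model."""
    best = np.zeros(len(x), dtype=bool)
    for _ in range(n_iter):
        i, j = rng.choice(len(x), size=2, replace=False)
        if x[i] == x[j]:
            continue
        slope = (y[j] - y[i]) / (x[j] - x[i])
        inliers = np.abs(y - (y[i] + slope * (x - x[i]))) < tol
        if inliers.sum() > best.sum():
            best = inliers
    return best

def huber_fit(x, y, delta=0.05, iters=30):
    """Stage 2: M-estimator fine fit via IRLS with Huber weights
    (quadratic core, linear tails)."""
    A = np.stack([x, np.ones_like(x)], axis=1)
    w = np.ones_like(y)
    for _ in range(iters):
        coef = np.linalg.lstsq(A * w[:, None], y * w, rcond=None)[0]
        r = np.maximum(np.abs(y - A @ coef), 1e-12)
        w = np.where(r <= delta, 1.0, np.sqrt(delta / r))  # sqrt of Huber weight
    return coef

# inliers = ransac_line(x, y)                  # drop cars, pedestrians, vegetation
# slope, intercept = huber_fit(x[inliers], y[inliers])  # absorb moderate noise
```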

C. Object Tracking & Short-Term Prediction

Extended Kalman Filter (EKF):

| Contribution | Author(s) | Year | Significance |
| --- | --- | --- | --- |
| Kalman filter | R. E. Kalman, "A New Approach to Linear Filtering and Prediction Problems," ASME J. Basic Eng. 82(1):35-45 | 1960 | Recursive optimal state estimation for linear systems from noisy measurements — predict, then update |
| Continuous-time extension | Kalman & Bucy, "New Results in Linear Filtering and Prediction Theory," ASME J. Basic Eng. 83(1):95-108 | 1961 | Extended to continuous-time systems; established the duality between estimation and control |
| Extended Kalman Filter (nonlinear) | Stanley F. Schmidt, NASA Ames (Apollo program) | Early 1960s | Linearizes the nonlinear system around the current estimate via Jacobian matrices at each timestep; used on every Apollo lunar mission |
| Standard tracking reference | Bar-Shalom, Li & Kirubarajan, Estimation with Applications to Tracking and Navigation | 2001 | Definitive reference covering KF, EKF, UKF, IMM, and data association for tracking |

The EKF serves as the CAS workhorse for object state estimation, maintaining per-object state vectors (position, velocity, heading) and recursively fusing LiDAR range-bearing and radar Doppler measurements. Its recursive, constant-time-per-update structure is essential for hard real-time latency constraints.
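A minimal sketch of the predict/update cycle for a constant-velocity state (illustrative numpy; with a linear position measurement this reduces to a plain Kalman filter — the full EKF swaps in Jacobians of the nonlinear range-bearing/Doppler measurement models):

```python
import numpy as np

dt = 0.1                                  # 10 Hz update
F = np.array([[1, 0, dt, 0],              # state: [x, y, vx, vy]
              [0, 1, 0, dt],
              [0, 0, 1,  0],
              [0, 0, 0,  1]], dtype=float)
H = np.array([[1, 0, 0, 0],               # measure position (e.g., LiDAR centroid)
              [0, 1, 0, 0]], dtype=float)
Q = 0.01 * np.eye(4)                      # process noise
R = 0.10 * np.eye(2)                      # measurement noise

def predict(x, P):
    return F @ x, F @ P @ F.T + Q

def update(x, P, z):
    S = H @ P @ H.T + R                   # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)        # Kalman gain
    x = x + K @ (z - H @ x)               # correct with the innovation
    P = (np.eye(4) - K @ H) @ P
    return x, P
```

Each cycle is a fixed sequence of small matrix operations, which is what gives the constant-time-per-update behavior noted above.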

Particle Filters / Sequential Monte Carlo:

| Contribution | Author(s) | Year | Significance |
| --- | --- | --- | --- |
| Bootstrap filter | Gordon, Salmond & Smith, "Novel approach to nonlinear/non-Gaussian Bayesian state estimation," IEE Proc. F 140(2):107-113 | 1993 | Foundational particle filter — represents the posterior as weighted random samples, with resampling to prevent degeneracy |
| CONDENSATION | Isard & Blake, "Conditional Density Propagation for Visual Tracking," IJCV 29(1):5-28 | 1998 | Brought particle filtering to computer vision; demonstrated superiority over Kalman filtering for multi-modal, cluttered tracking |

Why particle filters for CAS pedestrian tracking: The fundamental limitation of Kalman filter families is Gaussian (unimodal) posterior representation. A pedestrian approaching an intersection might turn left, continue straight, or stop — three distinct modes. A single Gaussian must compromise by placing its mean between these possibilities (potentially an implausible trajectory). Particle filters maintain multiple hypotheses simultaneously, naturally representing behavioral ambiguity, data association ambiguity, and occlusion recovery.

Eigenvalue Decomposition for Ground Removal:

For the N LiDAR points within a voxel cell: compute the centroid μ, form the 3×3 sample covariance matrix C, and perform the eigendecomposition C = VΛVᵀ, yielding eigenvalues λ₁ ≥ λ₂ ≥ λ₃ ≥ 0.

| Eigenvalue Pattern | Geometric Interpretation |
| --- | --- |
| λ₁, λ₂ large; λ₃ ≈ 0 | Planar — points lie on a surface. If eigenvector v₃ is approximately vertical → ground |
| λ₁ large; λ₂, λ₃ small | Linear — pole, tree trunk, post |
| λ₁ ≈ λ₂ ≈ λ₃ | Spherical — no dominant structure |

Surface variation σ = λ₃/(λ₁ + λ₂ + λ₃) quantifies planarity (near zero = highly planar).
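A minimal sketch of the per-voxel test (illustrative numpy; thresholds are invented for the example):

```python
import numpy as np

def is_ground_voxel(points, sigma_max=0.02, vertical_min=0.95):
    """points: (N, 3) LiDAR returns inside one voxel."""
    if len(points) < 3:
        return False
    C = np.cov(points.T)                     # 3x3 sample covariance
    lam, V = np.linalg.eigh(C)               # ascending: lam[0] = λ3 (smallest)
    sigma = lam[0] / max(lam.sum(), 1e-12)   # surface variation λ3/(λ1+λ2+λ3)
    normal = V[:, 0]                         # eigenvector of smallest eigenvalue
    return sigma < sigma_max and abs(normal[2]) > vertical_min  # planar AND level
```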

| Key Paper | Authors | Year | Contribution |
| --- | --- | --- | --- |
| PCA foundations | Karl Pearson, "On Lines and Planes of Closest Fit," Phil. Mag. | 1901 | Directions of maximum variance in data |
| Ground Plane Fitting | Zermas, Izzat & Papanikolopoulos, ICRA | 2017 | PCA-based plane estimation per sector with adaptive seed selection |
| Fast point cloud segmentation | Himmelsbach, Hundelshausen & Wuensche, IEEE IV | 2010 | Radial sector discretization for ground estimation on non-flat terrain |
| Eigenvalue geometric features | Demantké, Mallet, David & Vallet, ISPRS Workshop | 2011 | Formalized linearity, planarity, and sphericity features from eigenvalue ratios |

PCA-based ground removal is preferred over RANSAC for the CAS's safety-critical path because PCA is deterministic (unlike RANSAC which is stochastic and can produce different results on different runs), providing predictable, bounded runtime.

Constant Velocity Models for Short-Horizon Prediction:

| Key Paper | Authors | Year | Key Finding |
| --- | --- | --- | --- |
| CV model vs. deep learning | Schöller, Aravantinos, Lay & Knoll, "What the Constant Velocity Model Can Teach Us About Pedestrian Motion Prediction," IEEE RA-L 5(2):1696-1703 | 2020 | A parameter-free constant velocity model outperformed Social Force, Social LSTM, and Social GAN on the ETH/UCY benchmarks |
| Singer acceleration model | R. A. Singer, "Estimating Optimal Tracking Filter Performance," IEEE T-AES | 1970 | Exponentially correlated random acceleration for tracking-filter process-noise design |
| Maneuvering target survey | Li & Jilkov, "Survey of Maneuvering Target Tracking. Part I," IEEE T-AES 39(4):1333-1364 | 2003 | Comprehensive survey from constant velocity through coordinated-turn and IMM approaches |

Why simple models win at short horizons (bias-variance tradeoff): E[Error] = Bias² + Variance + Irreducible Noise. Complex models (LSTMs, GANs) have low bias but high variance (sensitive to training data, scene-specific features). CV models have higher bias but near-zero variance (no learnable parameters). For prediction horizons under ~1-2 seconds, most objects continue at approximately constant velocity — the bias is small and the variance reduction more than compensates. The CAS's relevant prediction horizon (fractions of a second to ~1-2 seconds for last-resort collision avoidance) falls squarely in this regime.

Voxel-Based LiDAR Processing:

| Contribution | Author(s) | Year | Significance |
| --- | --- | --- | --- |
| Octree spatial data structure | Donald Meagher, "Geometric Modeling Using Octree Encoding," CGIP | 1982 | Hierarchical 3D space subdivision — ancestor of all voxel representations |
| Point Cloud Library | Rusu & Cousins, "3D is here: Point Cloud Library," ICRA | 2011 | Open-source standard for point-cloud processing; VoxelGrid filter, octree indexing |
| VoxelNet | Zhou & Tuzel, "VoxelNet: End-to-End Learning for Point Cloud Based 3D Object Detection," CVPR | 2018 | Learned 3D detection from voxelized LiDAR; Voxel Feature Encoding layers |
| PointPillars | Lang, Vora, Caesar, Zhou, Yang & Beijbom, CVPR | 2019 | Collapsed the vertical dimension into 2D pillars — dramatically faster while remaining competitive |

The CAS uses voxel infrastructure (spatial binning, O(1) lookups, deterministic memory layout) but applies geometric algorithms rather than learned detectors — deterministic behavior, bounded compute, and interpretability take precedence over marginal accuracy gains from deep learning.

D. Geometric Collision Detection & Trajectory Validation

Oriented Bounding Box (OBB) Collision Detection:

| Contribution | Author(s) | Year | Significance |
| --- | --- | --- | --- |
| OBBTree | Gottschalk, Lin & Manocha, SIGGRAPH | 1996 | Hierarchical OBB tree for rapid interference detection using the Separating Axis Theorem (SAT); <200 FLOPs per overlap test |
| Hyperplane separation theorem | Hermann Minkowski | c. 1911 | Two disjoint convex sets in Rⁿ can be separated by a hyperplane — the mathematical foundation of SAT |
| Practical reference | Christer Ericson, Real-Time Collision Detection, Morgan Kaufmann | 2005 | Definitive implementation reference for SAT, AABBs, OBBs, k-DOPs, and BVH hierarchies |

Why OBBs over AABBs: AABBs fit poorly around rotated objects — a vehicle rotated 45° wastes ~50% of bounding area, creating excessive false-positive collision reports. OBBs align with the object's principal axes, providing dramatically tighter fits. For two OBBs in 2D (vehicle footprints on road), only 4 separating axes need testing (2 edge normals per rectangle). SAT's finite, deterministic nature makes it verifiable and suitable for safety-critical code paths.
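A minimal sketch of the 4-axis SAT test for two vehicle footprints (illustrative numpy; the (center, half-extents, heading) parameterization is an assumption):

```python
import numpy as np

def obb_corners(center, half, heading):
    c, s = np.cos(heading), np.sin(heading)
    R = np.array([[c, -s], [s, c]])
    offsets = np.array([[1, 1], [1, -1], [-1, -1], [-1, 1]]) * half
    return center + offsets @ R.T

def obbs_overlap(a, b):
    """Only 4 candidate separating axes exist for two rectangles:
    the two edge normals of each box."""
    ca, cb = obb_corners(*a), obb_corners(*b)
    axes = []
    for corners in (ca, cb):
        for e in (corners[1] - corners[0], corners[2] - corners[1]):
            axes.append(np.array([-e[1], e[0]]))   # edge normal
    for axis in axes:
        pa, pb = ca @ axis, cb @ axis              # project both boxes
        if pa.max() < pb.min() or pb.max() < pa.min():
            return False                           # separating axis -> disjoint
    return True                                    # no separating axis -> overlap

# Example: ego footprint vs. an obstacle rotated 45°.
ego = (np.array([0.0, 0.0]), np.array([2.4, 1.0]), 0.0)
obs = (np.array([3.0, 0.5]), np.array([2.4, 1.0]), np.pi / 4)
print(obbs_overlap(ego, obs))
```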

Convex Polygon Buffer Regions / Minkowski Sums:

| Contribution | Author(s) | Year | Significance |
| --- | --- | --- | --- |
| Mathematical morphology (dilation/erosion) | Georges Matheron & Jean Serra, École des Mines; Serra, Image Analysis and Mathematical Morphology | 1964/1982 | Dilation of a set A by structuring element B = Minkowski sum A ⊕ B |
| Configuration-space obstacles | Tomás Lozano-Pérez, "Spatial Planning: A Configuration Space Approach," IEEE T-C C-32(2):108-120 | 1983 | C-obstacle for a translating robot = Minkowski sum of the workspace obstacle and the reflected robot shape |
| Comprehensive textbook | de Berg, van Kreveld, Overmars & Schwarzkopf, Computational Geometry, 3rd ed., Springer | 2008 | Standard reference: Minkowski sums in robot motion planning (Ch. 13) |
| Robot motion planning | Jean-Claude Latombe, Robot Motion Planning, Kluwer | 1991 | Comprehensive treatment of configuration space and Minkowski sum computation |

A buffer zone with clearance r around polygon P is formally P ⊕ D(r) where D(r) is a disk of radius r — precisely a Minkowski sum. The CAS's layered buffer regions (dilated/collision/safety) are concentric Minkowski sums with increasing structuring elements, creating graduated warning/action zones with mathematically guaranteed geometric properties.
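A minimal sketch using Shapely, whose polygon `buffer()` computes exactly this disk Minkowski sum (the layer radii are invented for the example):

```python
from shapely.geometry import Polygon

footprint = Polygon([(0, 0), (4.8, 0), (4.8, 2.0), (0, 2.0)])  # vehicle-sized box

safety_region    = footprint.buffer(0.5)   # P ⊕ D(0.5)
collision_region = footprint.buffer(1.5)   # P ⊕ D(1.5)
dilated_region   = footprint.buffer(3.0)   # P ⊕ D(3.0)

# The layers nest by construction: P ⊕ D(r1) ⊆ P ⊕ D(r2) whenever r1 <= r2.
assert collision_region.contains(safety_region)
assert dilated_region.contains(collision_region)
```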

Monte Carlo Perturbation-Based Probabilistic Collision Checking:

| Contribution | Author(s) | Year | Significance |
| --- | --- | --- | --- |
| Monte Carlo method | Metropolis & Ulam, "The Monte Carlo Method," JASA 44(247):335-341 | 1949 | Transforms intractable analytical problems into statistical sampling problems |
| Adaptive importance sampling for collision probability | Schmerling & Pavone, "Evaluating Trajectory Collision Probability through Adaptive Importance Sampling," RSS | 2017 | Variance reduction makes real-time probabilistic safety evaluation feasible (millisecond scale); provides confidence intervals |
| Chance-constrained path planning | Blackmore, Ono & Williams, "Chance-Constrained Optimal Path Planning With Obstacles," IEEE T-RO 27(6) | 2011 | Plan a trajectory distribution such that P(collision) < ε; theoretical framework for CAS's probabilistic collision threshold |
| Reachability-based safety verification | Althoff & Dolan, "Online Verification of Automated Road Vehicles Using Reachability Analysis," IEEE T-RO | 2014 | Formal online safety via reachable-set computation; the CAS perturbation approach is a practical relaxation of exact reachability |

The CAS generates M×N trajectory variants (M ego perturbations × N obstacle perturbations) by varying acceleration and steering parameters, then computes P(collision) = colliding pairs / total pairs, weighted by behavioral likelihood. This is a runtime implementation of a chance constraint — if the probability exceeds a threshold, the system escalates.
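A minimal sketch of the chance-constraint estimate (illustrative; `rollout()` and `weights()` are hypothetical helpers, and `obbs_overlap()` is the SAT test sketched earlier):

```python
import numpy as np

def collision_probability(ego_traj, obs_traj, weights, M=20, N=20,
                          rng=np.random.default_rng(0)):
    hits = total = 0.0
    for i in range(M):                     # perturbed ego variants
        ego_var = rollout(ego_traj, accel=rng.normal(0, 0.5),
                          steer=rng.normal(0, 0.05))
        for j in range(N):                 # perturbed obstacle variants
            obs_var = rollout(obs_traj, accel=rng.normal(0, 0.5),
                              steer=rng.normal(0, 0.05))
            w = weights(i, j)              # behavioral likelihood of this pair
            total += w
            if any(obbs_overlap(a, b) for a, b in zip(ego_var, obs_var)):
                hits += w                  # overlap at any common timestep
    return hits / total                    # escalate if above the threshold ε
```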

Cost-Based / Potential Field Trajectory Optimization:

| Contribution | Author(s) | Year | Significance |
| --- | --- | --- | --- |
| Artificial potential fields | Oussama Khatib, "Real-Time Obstacle Avoidance for Manipulators and Mobile Robots," IJRR 5(1):90-98 | 1986 | Continuous optimization: an attractive potential toward the goal plus repulsive potentials from obstacles; enabled real-time collision avoidance |
| Navigation functions (no local minima) | Rimon & Koditschek, "Exact Robot Navigation Using Artificial Potential Functions," IEEE T-RA 8(5):501-508 | 1992 | Special potential functions guaranteed to have no local minima; resolved APF's fundamental limitation |
| Occupancy grids | Alberto Elfes | 1989 | Probabilistic environment representation in discrete cells — the bridge from continuous potential fields to discrete cost maps |

The CAS's layered cost regions function as a discrete approximation to a navigation function. Each cost layer adds a "repulsive" cost increment, creating a gradient field steering trajectory optimization away from obstacles toward free space. This retains the gradient-descent intuition of potential fields while accommodating arbitrary cost structures (non-symmetric obstacles, road geometry, traffic rules).

Trajectory Validation State Machines:

The CAS's monotonic escalation state machine (no upward transitions without explicit release) is the software analog of a hardware safety interlock from aviation and industrial safety:

| Standard | Year | Key Principle |
| --- | --- | --- |
| DO-178C (RTCA) — airborne software certification | 2011 | Design Assurance Levels A–E; state machines are natural targets for formal verification via model checking |
| DO-333 — Formal Methods Supplement to DO-178C | 2011 | Guidance on model checking and theorem proving for certification credit |
| ISO 26262 — automotive functional safety | 2011/2018 | Safety mechanisms must detect faults and maintain a safe state; monotonic escalation ensures the mechanism cannot be silently bypassed |
| IEC 61508 — industrial functional safety (parent standard) | 1998/2010 | Fail-safe states, safety interlocks, and permissive interlocks formalized |

The pattern prevents oscillation (chattering) between states due to noisy sensor data and ensures the system errs on the side of safety. Formal model checkers can exhaustively verify properties like "once in EMERGENCY_BRAKE, the system never transitions to NOMINAL without passing through SAFE_STOP and receiving ALL_CLEAR."

Stopping Distance Models:

Total stopping distance = perception-reaction distance + braking distance:

d_total = v × t_pipeline + v² / (2g(μ + G))

where v = velocity, t_pipeline = sensor-to-actuator latency, g = 9.81 m/s², μ = tire-road friction coefficient, G = road gradient.

| Key Reference | Author(s) | Year | Contribution |
| --- | --- | --- | --- |
| Highway design standard | AASHTO, A Policy on Geometric Design of Highways and Streets ("Green Book") | Various | Standard stopping-sight-distance formulas used in road design |
| Vehicle dynamics | Thomas D. Gillespie, Fundamentals of Vehicle Dynamics, SAE R-114 | 1992 | Braking dynamics including weight transfer, ABS, and the friction-slip relationship |
| Tire force modeling (Magic Formula) | Hans B. Pacejka, Tire and Vehicle Dynamics, Elsevier | 2002/2012 | Nonlinear tire force model F = D·sin(C·arctan(Bx − E(Bx − arctan(Bx)))); captures peak friction and tire saturation |
| Vehicle dynamics & control | Rajesh Rajamani, Vehicle Dynamics and Control, Springer | 2006/2011 | Bridge between physics models and control algorithms for autonomous braking |

The CAS dynamically computes threshold distance at each control cycle incorporating: current velocity, estimated friction coefficient (dry asphalt μ≈0.7-0.8; wet μ≈0.3-0.5; ice μ≈0.1-0.2), road gradient from map/IMU, brake system response time, and computational pipeline latency. If an obstacle falls within the stopping distance horizon, the state machine escalates.
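A minimal sketch of the threshold computation (the friction figures are the rule-of-thumb values above, not Zoox parameters):

```python
G_ACCEL = 9.81  # m/s²

def stopping_distance(v, t_pipeline, mu, grade):
    """d_total = v·t_pipeline + v² / (2g(μ + G))."""
    return v * t_pipeline + v**2 / (2 * G_ACCEL * (mu + grade))

# 15 m/s (~34 mph) with 0.3 s sensor-to-actuator latency:
print(stopping_distance(15.0, 0.3, mu=0.75, grade=0.0))    # dry asphalt   ≈ 19.8 m
print(stopping_distance(15.0, 0.3, mu=0.40, grade=0.0))    # wet road      ≈ 33.2 m
print(stopping_distance(15.0, 0.3, mu=0.15, grade=-0.05))  # ice, downhill ≈ 119 m
```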

E. Summary: Theoretical Lineage
| CAS Component | Primary Theoretical Foundations | Key Seminal Works |
| --- | --- | --- |
| Dual-system architecture | N-version programming, Simplex architecture, design diversity | Avizienis 1985; Sha 2001; Knight & Leveson 1986 |
| ASIL-D certification approach | ASIL decomposition, dissimilar redundancy | ISO 26262; Koopman & Wagner 2016 |
| Interpretability requirement | NP-completeness of NN verification | Katz et al. 2017 (Reluplex) |
| Ground profile fitting | B-spline theory, Cox-de Boor recursion | Schoenberg 1946; de Boor 1972/1978 |
| Asymmetric loss | LINEX loss, Bayesian asymmetric estimation | Varian 1975; Zellner 1986 |
| Robust estimation | M-estimators, breakdown point theory | Huber 1964; Hampel |
| Outlier rejection | Random Sample Consensus | Fischler & Bolles 1981 |
| Object tracking (unimodal) | Kalman filter, Extended Kalman Filter | Kalman 1960; Schmidt (Apollo) |
| Object tracking (multi-modal) | Particle filters / Sequential Monte Carlo | Gordon, Salmond & Smith 1993; Isard & Blake 1998 |
| Ground removal | PCA / eigenvalue decomposition on voxelized LiDAR | Pearson 1901; Zermas et al. 2017 |
| Short-horizon prediction | Constant velocity models, bias-variance tradeoff | Schöller et al. 2020; Singer 1970 |
| Spatial indexing | Voxel grids, octrees | Meagher 1982; Rusu & Cousins 2011 |
| Collision detection | Oriented bounding boxes, Separating Axis Theorem | Gottschalk et al. 1996; Minkowski c. 1911 |
| Buffer zones | Minkowski sums, mathematical morphology | Serra 1982; Lozano-Pérez 1983 |
| Probabilistic collision checking | Monte Carlo methods, chance constraints | Metropolis & Ulam 1949; Blackmore et al. 2011 |
| Cost-based planning | Artificial potential fields, navigation functions | Khatib 1986; Rimon & Koditschek 1992 |
| Safety state machine | Safety interlocks, formal verification | DO-178C; ISO 26262; IEC 61508 |
| Stopping distance | Vehicle dynamics, tire force models | Gillespie 1992; Pacejka 2002 |
| Formal safety model | Responsibility-Sensitive Safety | Shalev-Shwartz et al. 2017 |
| Functional limitation safety | SOTIF | ISO 21448:2022 |

Unifying principle: Every CAS technique is chosen for determinism, verifiability, and bounded behavior — properties that enable formal safety argumentation. Where the Main AI trades interpretability for performance (deep learning), the CAS trades performance for certifiability (geometric/analytic methods). The two systems together cover the safety landscape that neither could address alone: the Main AI handles SOTIF's "unknown safe" scenarios through generalization, while the CAS handles "unknown unsafe" scenarios through formally verifiable safety constraints.

System 3: Safety Net

| Attribute | Detail |
| --- | --- |
| Type | Separate ML algorithm (architecturally distinct from System 1) |
| Coverage | 360 degrees (unlike System 2, which focuses on the intended driving path) |
| Function | Detection + short-horizon prediction |
| Trigger | Emergency stop when collision probability exceeds a threshold |
| Independence | A bug in System 1 is unlikely to affect System 3 |

Design rationale: Architectural diversity prevents common-cause failures. Systems 2 + 3 together form the Collision Avoidance System (CAS), parallel to the main AI stack. System 2 provides fast, interpretable geometric checking along the planned path, while System 3 provides ML-based 360-degree coverage as a final safety net.

Degradation handling: If a sensor degrades (debris, damage), diagnostics can activate cleaning systems or switch the vehicle from bidirectional to unidirectional operation, orienting the degraded sensor where it matters least.


Published Neural Network Architectures

PointFusion (CVPR 2018)

"PointFusion: Deep Sensor Fusion for 3D Bounding Box Estimation" — Danfei Xu, Dragomir Anguelov (Zoox), Ashesh Jain.

| Component | Architecture |
| --- | --- |
| Image branch | CNN for RGB feature extraction from cropped patches |
| Point cloud branch | PointNet variant for unordered 3D point sets |
| Fusion | Novel fusion network combining both streams |
| Output | Multiple 3D bounding box hypotheses with confidence scores |
| Key innovation | Dense fusion — per-point spatial offsets using 3D points as spatial anchors |
| Evaluation | KITTI (outdoor) + SUN-RGBD (indoor); dataset-agnostic |

FISHING Net (CVPRW 2020)

"FISHING Net: Future Inference of Semantic Heatmaps In Grids" — Zoox's published BEV architecture.

Core design: Ensemble of per-modality networks, each producing a common top-down BEV semantic grid.

Per-Modality Network Design

LiDAR Branch (8 input channels):

  1. Binary occupancy
  2. Log-normalized LiDAR density
  3. Maximum z-value
  4. Sliced max z in 0.5m intervals from 0–2.5m (5 channels)
  • Architecture: U-Net encoder-decoder with skip connections

Radar Branch (6 input channels):

  1. Binary occupancy
  2. Motion-compensated Doppler velocity (X, Y)
  3. Radar cross section (RCS)
  4. Signal-to-noise ratio (SNR)
  5. Ambiguous Doppler interval
  • Architecture: U-Net encoder-decoder with skip connections

Camera Branch:

  • MLP-based view transformation (similar to VPN) to lift perspective images into BEV
  • Single orthogonal feature transform, without skip connections

BEV Grid Specs

| Attribute | Detail |
| --- | --- |
| Resolution | 10 cm/pixel and 20 cm/pixel |
| Semantic classes | VRU (pedestrians, cyclists, motorcyclists), cars, background |
| Priority pooling | VRU > cars > background |
| Temporal input | 5 historical frames |
| Temporal output | 5 future frames (deterministic BEV prediction) |
| Fusion | Late fusion at the BEV level (each modality produces an independent BEV, then combined) |

Scenario Diffusion (NeurIPS 2023)

"Scenario Diffusion: Controllable Driving Scenario Generation With Diffusion" — Ethan Pronovost, Kai Wang (Zoox).

| Component | Detail |
| --- | --- |
| Architecture | Autoencoder + latent diffusion model |
| Input | BEV renderings of map and entities |
| Output | Sparse bounding box detections with trajectories |
| Agent representation | Feature vector: dimensions, orientation, trajectory |
| Temporal window | 2 s past + 2 s future per agent |
| Controllable tokens | Agent tokens (individual) + global scene tokens (traffic density) |
| Generation speed | ~1 second per scenario on a single GPU |
| Training data | Millions of driving scenarios (public + proprietary) |

Next-Gen Unified Perception Model

Based on Zoox job postings (Sensor Fusion Detection team), the next-generation architecture unifies multiple representations:

Input Representations

The model unifies two LiDAR representations into one architecture:

  • PixelSpace Range View — native 2D projection of LiDAR scan (dense, preserves sensor resolution)
  • VoxelSpace — 3D volumetric discretization of the point cloud

Additional inputs:

  • Multi-frame temporal LiDAR point clouds
  • Multi-frame temporal radar point clouds
  • Multi-view camera images
  • Language and audio inputs

Tokenization

Zoox develops effective tokenization techniques for Vision, LiDAR, and Radar modalities, using LLM techniques to align token embeddings across modalities into a common feature space.

Multi-Task Outputs

All produced simultaneously from a single model:

  • 3D object detection (bounding boxes + velocity)
  • 3D panoptic segmentation (instance + semantic contours)
  • Occlusion estimation
  • Occupancy prediction
  • Scene flow (per-point/per-voxel motion vectors)
  • Object attributes (classification, dimensions, heading)

Efficiency

Exploits sensor data sparsity to reduce training/inference latency, enabling higher-resolution image consumption and increased detection range.


Perception Outputs & BEV Representation

Raw Perception Outputs

| Output | Description |
| --- | --- |
| 3D bounding boxes | Position, dimensions, heading for every agent |
| Velocity vectors | Per-agent current motion |
| Classification | Vehicle type, pedestrian attributes (e.g., "holding smartphone") |
| Tracking IDs | Persistent identity across frames |
| Semantic segmentation | Per-point/per-pixel class labels |
| Occupancy estimation | Volumetric presence/absence |
| Dense depth | Per-pixel depth maps |

The ~60-Channel BEV Representation

This is the critical bridge between perception and prediction. Before data reaches prediction, perception outputs are "instantly boiled down to their essentials, into a format optimized for machine learning."

| Attribute | Detail |
| --- | --- |
| Format | Top-down, spatially accurate bird's-eye-view image |
| Center | Ego vehicle (robotaxi) |
| Channels | ~60 semantic layers |
| Example channels | Agent bounding boxes, agent headings, velocity vectors, pedestrian attributes (phone = 1), ZRN static infrastructure, lane geometry, road boundaries, traffic-signal states, crosswalk status |

Each agent appears as a bounding box with heading, trajectory, and velocity within this multi-channel image. The ~60 channels provide the CNN with rich semantic context about every entity.
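A minimal sketch of what such a rasterization could look like (invented channel layout and grid; the actual ~60 channels are not publicly enumerated):

```python
import numpy as np

GRID, RES = 512, 0.2          # 512×512 cells at 0.2 m/cell, ego-centered
CH_BOX, CH_HEADING, CH_SPEED, CH_PHONE = 0, 1, 2, 3   # illustrative channels

def rasterize(agents, n_channels=60):
    bev = np.zeros((n_channels, GRID, GRID), dtype=np.float32)
    for a in agents:          # a: dict with x, y (m, ego frame), heading, speed
        col = int(GRID / 2 + a["x"] / RES)
        row = int(GRID / 2 - a["y"] / RES)
        if 0 <= row < GRID and 0 <= col < GRID:
            bev[CH_BOX, row, col] = 1.0   # simplified: a point, not the full box
            bev[CH_HEADING, row, col] = a["heading"]
            bev[CH_SPEED, row, col] = a["speed"]
            bev[CH_PHONE, row, col] = float(a.get("phone", False))
    return bev
```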


Perception → Prediction Interface

Data Flow

3D perception outputs + ZRN semantic map
            ↓
~60-channel BEV rasterization
(agent bboxes, headings, velocities,
 attributes, roads, signals, lanes, ...)
            ↓
Prediction module
            ↓
Weighted trajectory distributions
per agent (8 s horizon, 10 Hz)
            ↓
Planning module
(bidirectional feedback loop with prediction)

ZRN Integration

The Zoox Road Network provides the static context layer:

  • Speed limits, traffic lights, stop signs, lane markings
  • Bike lanes, crosswalks, keep-clear zones, one-way streets
  • ZRN Monitor detects real-time divergences from the map
  • Localization at 200 Hz, centimeter accuracy, sub-degree heading

Prediction Architecture (CNN + GNN)

CNN Stage

The ~60-channel BEV image feeds a convolutional neural network that:

  • Extracts spatial features and relationships
  • "Determines what distances matter, what relationships between agents matter"

GNN Stage (Graph Neural Network)

On top of CNN features, a message-passing GNN:

  • All agents and static elements interconnected as graph nodes
  • Explicit encoding of inter-agent relationships
  • Models how relationships develop temporally
  • Produces "prediction of more natural behaviors between agents"

Prediction Evolution: UAP → QTP

| Generation | Architecture | Limitation/Improvement |
| --- | --- | --- |
| UAP (Unified Active Prediction) | Graph-based neural network | "Had trouble modeling unexpected outcomes like jaywalkers or illegal U-turns" |
| QTP (Query-centric Trajectory Prediction) | Query-centric paradigm (related to QCNet, CVPR 2023) | Data-driven behavior modeling without hand-crafted assumptions |

QTP Key Features:

  • Scene encoding independent of global spacetime coordinates → enables reuse of past computations
  • Streaming scene encoding + parallel multi-agent decoding
  • Anchor-free queries for recurrent trajectory proposals at different horizons
  • Refinement module using anchor-based queries
  • Ranked #1 on Argoverse 1 and Argoverse 2 motion forecasting benchmarks

Prediction Specs

| Attribute | Detail |
| --- | --- |
| Horizon | Up to 8 seconds |
| Update rate | 10 Hz (every 100 ms) |
| Output | Probability distribution over trajectories per agent |
| Coverage | Trucks, cars, pedestrians, cyclists, animals |
| Training | Self-supervised: real future trajectories as ground truth (no manual labels) |

Prediction-Planning Feedback Loop

The planner queries prediction with conditional requests:

  • "If I perform action X, Y, or Z, how do agents react?"
  • Creates closed-loop feedback between planning and prediction
  • Enables contingency-aware planning that accounts for ego influence on the scene

Vision-Language-Action Foundation Model

Disclosed at AWS re:Invent 2025, this represents a paradigm shift that collapses the modular perception→prediction→planning boundary.

Architecture

Attribute | Detail
Core | Large Language Model
Base model | Qwen 2/3 VL (vision-language model)
Parameter sizes | 400M → 7B → 32B (in development)

Inputs

  • Text prompts (e.g., "You are the driver of a Zoox robotaxi, what should you do...")
  • Camera/video through pre-trained vision encoders with projection layers (projection sketched after this list)
  • Encoded LiDAR projected into LLM token space
  • Encoded radar projected into LLM token space
  • Existing perception stack outputs (3D bounding boxes)
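
A minimal sketch of such projection layers, mapping each modality encoder's features into the LLM's embedding space (hidden sizes and MLP design are assumptions):

```python
# Sketch: project per-modality encoder outputs into LLM token space so
# camera/LiDAR/radar tokens can sit in one sequence with text tokens.
import torch
import torch.nn as nn

LLM_DIM = 4096   # assumed LLM hidden size

class ModalityProjector(nn.Module):
    """[B, n_tokens, enc_dim] -> [B, n_tokens, LLM_DIM]."""
    def __init__(self, enc_dim):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(enc_dim, LLM_DIM), nn.GELU(),
                                  nn.Linear(LLM_DIM, LLM_DIM))

    def forward(self, feats):
        return self.proj(feats)

cam = ModalityProjector(1024)(torch.randn(1, 256, 1024))   # vision tokens
lidar = ModalityProjector(512)(torch.randn(1, 128, 512))   # LiDAR tokens
radar = ModalityProjector(256)(torch.randn(1, 64, 256))    # radar tokens
text = torch.randn(1, 32, LLM_DIM)                         # embedded prompt
sequence = torch.cat([text, cam, lidar, radar], dim=1)     # one LLM input stream
```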

Outputs

  • Robotic controls (acceleration, braking, steering)
  • 3D object detections
  • Visual question answering
  • Scene descriptions / chain-of-thought reasoning

Three-Stage Training

Stage | Method | Data
1 | Large-scale supervised fine-tuning (behavior cloning) | Tens of thousands of hours of human driving; millions of 3D detection labels
2 | High-quality SFT | Rare objects, difficult scenarios, synthetic chain-of-thought
3 | Reinforcement learning (GRPO + DAPO) | Hardest scenarios
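
The core of GRPO is a group-relative advantage: several rollouts of the same scenario are scored, and each rollout's advantage is its reward normalized within the group, removing the need for a learned value baseline. A sketch below; the reward values and group size are illustrative:

```python
# Sketch of GRPO-style group-relative advantages.
import torch

def grpo_advantages(rewards, eps=1e-6):
    """rewards: [G] scores for G rollouts of the same scenario."""
    return (rewards - rewards.mean()) / (rewards.std() + eps)

rewards = torch.tensor([0.1, 0.9, 0.4, 0.7])  # e.g., scenario-level driving scores
adv = grpo_advantages(rewards)                # positive -> better than group average
```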

Impact on Perception-Prediction Boundary

The foundation model collapses the traditional modular boundary:

  • Replaces separate modules joined by the ~60-channel BEV handoff with a single network that processes raw sensor data end-to-end
  • Outputs both 3D detections (traditionally perception) AND robotic controls (traditionally planning)
  • Potentially subsumes prediction entirely within internal representations

Zero-Shot Goal

Handle long-tail edge cases (jaywalkers, tanks, construction flaggers, fire hoses, animals, unusual vehicles) on first encounter without prior training examples.


Inference & On-Vehicle Compute

Hardware — NVIDIA DRIVE PX Pegasus (Dual Redundant)

Component | Detail
Platform | 2× NVIDIA DRIVE PX Pegasus boards (full redundancy)
Per-board performance | >320 TOPS
Per-board SoCs | 2× Xavier SoCs (30 TOPS each, octa-core ARM + Volta GPU) + 2× discrete GPUs (~130 TOPS each)
Per-board TDP | 500 W
Memory bandwidth | >1 TB/s combined per board
Sensor inputs | 16 dedicated high-speed inputs per board (camera/radar/LiDAR/ultrasonics)
Connectivity | CAN, FlexRay, multiple 10 GbE
Safety certification | Designed for ASIL D
CPUs | 4× Intel Xeon processors (in addition to the Pegasus boards)
Redundancy | Dual mirrored systems with cross-verified logic domains
Data rate | ~4 TB/hour of raw sensor data per vehicle
Upload | Hardwired AWS Data Transfer Terminals, up to 400 Gbps

Inference Performance

Metric | Value
Forward pass (Inception-class network) | 1.767 ms
TensorRT speedup vs. TensorFlow (FP32) | 2–6×
TensorRT speedup vs. TensorFlow (INT8) | 9–19×
Precision modes | FP32, FP16, INT8 (quantized)
Execution | Asynchronous and concurrent via CUDA streams
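
The concurrency pattern can be sketched with CUDA streams; shown here with PyTorch for brevity, though the same pattern applies to TensorRT's asynchronous execution APIs (the models and sizes are placeholders):

```python
# Sketch: run two networks concurrently on separate CUDA streams.
import torch

det = torch.nn.Conv2d(64, 64, 3, padding=1).cuda().eval()   # stand-in "detector"
seg = torch.nn.Conv2d(64, 64, 3, padding=1).cuda().eval()   # stand-in "segmenter"
x = torch.randn(1, 64, 256, 256, device="cuda")
torch.cuda.current_stream().synchronize()   # make the input visible to both streams

s1, s2 = torch.cuda.Stream(), torch.cuda.Stream()
with torch.no_grad():
    with torch.cuda.stream(s1):
        det_out = det(x)     # launched asynchronously on stream 1
    with torch.cuda.stream(s2):
        seg_out = seg(x)     # launched asynchronously on stream 2; may overlap
torch.cuda.synchronize()     # wait for both streams before reading results
```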

TensorRT Deployment Pipeline (4-Stage Validation)

Zoox uses a rigorous 4-stage process for deploying ML models via TensorRT on-vehicle:

Stage | Description
1. Conversion Checker | Validates that the PyTorch → ONNX → TensorRT conversion succeeds without errors
2. Output Deviation | Compares TensorRT outputs against the PyTorch reference on identical inputs; flags numerical drift
3. Layer-by-Layer Inspection | Traces precision loss to individual layers; decides the per-layer FP32/FP16/INT8 mixed-precision configuration
4. Maintenance | Continuous monitoring in production; regression checks when TensorRT or driver versions update
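
Stage 2 amounts to an elementwise comparison of the two runtimes on identical inputs; a sketch of such a check, with tolerances assumed:

```python
# Sketch of a stage-2 output-deviation check between PyTorch and TensorRT.
import numpy as np

def deviation_report(ref_out, trt_out, atol=1e-3):
    diff = np.abs(ref_out - trt_out)
    return {"max_abs": float(diff.max()),
            "mean_abs": float(diff.mean()),
            "frac_over_tol": float((diff > atol).mean())}

ref_out = np.random.rand(1, 100, 7).astype(np.float32)   # PyTorch reference
trt_out = ref_out + np.random.randn(1, 100, 7).astype(np.float32) * 1e-4
report = deviation_report(ref_out, trt_out)
assert report["max_abs"] < 1e-2, f"numerical drift flagged: {report}"
```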

Transformer Efficiency Techniques

For deploying transformer-based models (DINO decoder, VLA) on embedded NVIDIA hardware:

  • Token pruning — drops low-attention tokens mid-inference to reduce compute (see the sketch after this list)
  • Token merging — combines similar tokens to shrink sequence length without losing semantic content
  • Mixed-precision per-layer — critical layers stay FP32, others quantized to FP16/INT8
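
A minimal sketch of attention-score token pruning; the keep-ratio and scoring rule are illustrative, as Zoox's exact technique is not published:

```python
# Sketch: keep only the tokens that receive the most attention from the
# class/query token, shrinking the sequence mid-inference.
import torch

def prune_tokens(tokens, cls_attn, keep_ratio=0.5):
    """tokens: [B, N, D]; cls_attn: [B, N] attention paid to each token."""
    k = max(1, int(tokens.shape[1] * keep_ratio))
    idx = cls_attn.topk(k, dim=1).indices                     # [B, k] kept tokens
    idx = idx.unsqueeze(-1).expand(-1, -1, tokens.shape[-1])  # [B, k, D]
    return tokens.gather(1, idx)

tokens = torch.randn(2, 196, 256)
cls_attn = torch.rand(2, 196)
pruned = prune_tokens(tokens, cls_attn)   # [2, 98, 256]: half the sequence left
```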

Deployment Stack

Context | Tool
On-vehicle production | NVIDIA TensorRT (with 4-stage validation pipeline)
On-vehicle VLA (in development) | TensorRT-LLM
Cloud offline | vLLM
Cloud batch | Amazon EKS + Ray Serve
Monitoring dashboards | Looker, Grafana, Databricks

Training Infrastructure

Cloud (AWS)

Resource | Detail
GPU instances | P5 (H100), P6N
Cluster orchestration | SageMaker HyperPod (auto-recovery, health checks)
Cluster size | 500+ nodes, 64+ GPUs per training job
Networking | EFA at 3,200 Gbps/node (RDMA-enabled)
GPU utilization | 95% achieved
Scaling | Near-linear across multi-node jobs
Storage | S3 (tens of PB active, ~1 EB cold), FSx for Lustre, EFS

On-Premises

Resource | Detail
GPUs | Thousands of NVIDIA GPUs
Storage | Quobyte parallel filesystem: 3 clusters, 30 PB
Previous | Migrated from Ceph (performance/reliability issues)
Tiering | SSD → disk → cloud

Training Configuration

Parameter | Detail
Framework | PyTorch (primary), JAX
Distributed training | HSDP + FSDP + DDP + tensor parallelism
Precision | BF16 with gradient checkpointing
Optimization | torch.compile
Data loading | Mosaic Data Streaming (MDS): deterministic, resumable mid-epoch
Training cadence | ~every 2 weeks
Job duration | 2–3 days per run
Data pipeline | Medallion architecture on S3 with Delta tables + Apache Spark
Orchestration | Apache Airflow
Experiment tracking | Comet ML
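
A condensed sketch of the training-step wiring the table implies (FSDP sharding, BF16 autocast, torch.compile); the exact settings are assumptions:

```python
# Sketch of an FSDP + BF16 + torch.compile training step. Assumes the
# process group was initialized at startup (e.g., torchrun + NCCL) and that
# gradient checkpointing is applied inside the model definition.
import torch
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

def wrap(model):
    model = FSDP(model.cuda())    # shard params, grads, and optimizer state
    return torch.compile(model)   # graph capture + kernel fusion

def train_step(model, optimizer, batch, targets, loss_fn):
    with torch.autocast("cuda", dtype=torch.bfloat16):   # BF16 mixed precision
        loss = loss_fn(model(batch), targets)
    loss.backward()
    optimizer.step()
    optimizer.zero_grad(set_to_none=True)
    return loss.detach()
```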

Annotation & Labeling

Approach | Detail
Platform | Dataloop (external vendor) with custom integrations
Auto-labeling | ML-assisted algorithms reduce manual burden
Embedding indexes | CLIP-based embedding indexes for similarity search across driving scenarios
Active learning | Automated mining of high-uncertainty frames for human review
Self-supervised | Prediction uses actual future trajectories as ground truth
Ground truth | 3D bounding box annotations for BEV tasks
Human labeling | Dedicated Perception Labeling & Tools team with web-based tools
Web tools | React/Angular/Vue frontends + FastAPI/Django backends
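
The embedding-index idea reduces to nearest-neighbor search over normalized CLIP embeddings; a minimal sketch, with the model choice and index details assumed (internal tooling is not public):

```python
# Sketch: cosine-similarity search over CLIP embeddings of logged frames.
import numpy as np

def build_index(embeddings):
    """embeddings: [N, D] CLIP image embeddings; L2-normalize once."""
    return embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)

def query(index, q, top_k=5):
    """Return the frames most similar to a query scenario embedding."""
    q = q / np.linalg.norm(q)
    scores = index @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]

index = build_index(np.random.randn(10_000, 512).astype(np.float32))
hits, scores = query(index, np.random.randn(512).astype(np.float32))
```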

Synthetic Data

Method | Detail
Scenario Diffusion | Latent diffusion model; ~1 s/scenario on a single GPU
3D Sensor Simulation | GenAI/ML + modern 3D graphics to simulate cameras, LiDAR, radar
Neural rendering | Gaussian Splatting, NeRFs for 3D reconstruction
Procedural worlds | Houdini for world creation

Perception Team Organization

Sub-Team | Focus
Object Detection & Tracking | All people and objects capable of moving
Occupancy & Rare Events | Foundation models as perception backbone; long-tail detection; generalization to new geofences
Perception Attributes | Vehicle classification, semantic enrichment, real-time inference
Scene Understanding | Advanced ML for hazard identification
Perception Optimization | Optimized inference pipelines for on-vehicle algorithms
Perception Labeling & Tools | Internal labeling platforms, auto-labeling
Sensor Fusion Detection | Next-gen unified multi-representation model
CAS (Collision Avoidance) | Geometric + ML parallel safety system

Director of Perception: Bat-El Shlomo; also Ruijie He (joined via the Strio.AI acquisition, based in Boston).

Locations: Foster City (HQ), San Francisco, Boston, San Diego.


Competitive Analysis

Zoox vs Industry Perception Approaches

Dimension | Zoox | Waymo | Tesla | Cruise | Aurora | Pony.ai
Fusion strategy | Early fusion (CVPR 2025): perspective-view PointPillar + YoloX PAFPN | LEF (Learned Early Fusion) + PVTransformer | N/A (camera-only) | Mid-level fusion | S2A (Sensor-to-Autonomy) end-to-end | Undisclosed multi-sensor
LiDAR representation | Unified range view + voxel (next-gen) | VoxelNet, PointPillars (published) | None | Standard voxel | Range view → learned features | Voxel-based
Sensor modalities | 5 (camera, LiDAR, radar, LWIR, mics) | 4 (camera, LiDAR, radar, audio) | 1 (camera only) | 3 (camera, LiDAR, radar) | 3 (camera, LiDAR, radar) | 4 (camera, LiDAR, radar, mics)
Sensor count | ~64 (14 cam, 20 radar, 8 LiDAR, LWIR, mics) | ~40 (29 cam, 6 LiDAR, 4 radar) | 12 cameras | ~40 (camera, LiDAR, radar) | ~40 | Gen-7: 12 cam, 8 LiDAR, 6 radar
LWIR thermal | Yes (only AV company) | No | No | No | No | No
Redundant perception | Triple (AI + geometric + safety net) | Dual | Single | Dual | Dual | Dual
Detection head | DINO (DETR-based) | Custom transformer | Occupancy networks | Custom | End-to-end learned | Custom
Staleness handling | Published CVPR 2025 solution | Not published | N/A | Not published | Not published | Not published
BEV representation | ~60-channel semantic grid | Similar concept | Occupancy networks | Similar concept | Learned latent BEV | Similar concept
Prediction | QTP (ranked #1 Argoverse 1 & 2) | Custom motion forecasting | Neural net planner | Custom | Joint perception-prediction | Custom
Vehicle advantage | Purpose-built → optimal sensor placement | Retrofit → compromised placement | Consumer vehicle | Retrofit → compromised | Retrofit (trucks + passenger) | Retrofit

Key Competitive Differentiators by Company

Waymo (LEF + PVTransformer): Uses Learned Early Fusion to combine LiDAR range images with camera features before the backbone, and PVTransformer for cross-attention between perspective and voxel features. Philosophically the most similar to Zoox (both fuse early), but Zoox's frustum-based PointPillar design is architecturally distinct.

Aurora (S2A): Sensor-to-Autonomy is the most ambitious end-to-end approach in the industry: a single neural network from raw sensor input to driving commands. Aurora focuses on trucking (Aurora Driver for trucks) but is expanding to passenger vehicles.

Motional: Uses PointPainting (projecting semantic labels from camera onto LiDAR points before 3D detection), a simpler fusion approach than Zoox's joint-backbone fusion.

Pony.ai Gen-7: Latest platform with 12 cameras, 8 LiDARs, 6 radars. Competing in China and US markets with a sensor suite approaching Zoox's density.

Zoox's Unique Advantages

  1. Only AV company using LWIR thermal cameras — critical for pedestrian detection in darkness/glare
  2. Triple-redundant perception with architectural diversity
  3. Purpose-built vehicle enables optimal sensor pod geometry
  4. Published CVPR 2025 staleness solution — quantified robustness gains
  5. Unified next-gen model consuming language + audio alongside traditional AV modalities
  6. QTP prediction ranked #1 on major public benchmarks
  7. Foundation model approach collapsing perception/prediction/planning boundaries

What Remains Undisclosed

  • Exact latency budget allocation across perception/prediction/planning pipeline
  • Exact CAS latency in milliseconds (confirmed "optimized for low-latency" but no specific ms figure)
  • Exact CAS update rate in Hz (Main AI prediction is 10 Hz; CAS likely same or higher)
  • Exact perception update rate in Hz (only 10 Hz prediction confirmed)
  • Precise BEV channel definitions beyond ~60 count and examples
  • Specific CNN architecture for BEV processing (layers, kernels, feature dims)
  • Specific GNN architecture (message-passing rounds, edge/node feature dims)
  • Exact GPU model on-vehicle (confirmed NVIDIA DRIVE, not which SoC)
  • Exact LWIR thermal camera count (RGB cameras: 14, LiDAR: 8, radar: 20 now confirmed)
  • Occupancy grid resolution and update rate
  • Joint perception-prediction training details in modular stack
  • Whether LWIR feeds into the early fusion pipeline alongside camera/LiDAR/radar, or is processed separately
  • Safety Net architecture specifics (network type, parameters, latency)
  • Specific collision detection algorithm used in CAS (SAT, GJK, or custom — patents describe bounding box overlap but not the implementation algorithm)
  • CAS intervention statistics (interventions per mile, false positive rate)
  • Whether CAS checks static map boundaries via ZRN polygon intersection or only sensor-detected obstacles

Sources

Source categories: Zoox official; research papers; Amazon Science; NVIDIA & compute; CAS & collision avoidance patents; sensor hardware; infrastructure; safety & recalls; job postings.


Compiled from 10 parallel research agents scanning 100+ sources, including CVPR/NeurIPS papers, arXiv preprints, Zoox Safety Reports, patent filings, sensor datasheets, job postings, and conference talks. Updated with corrected sensor counts, deployment pipeline details, and a comprehensive CAS/geometric collision avoidance deep dive drawn from 8 Zoox patents covering corridor analysis, polygon collision detection, trajectory hierarchy, and dynamic vs. static object handling.

Research notes compiled from publicly available sources.