
Pony.ai Non-ML & Hybrid-ML Perception Stack: Exhaustive Technical Deep Dive

Last updated: March 15, 2026


Table of Contents

  1. Dual Perception Architecture
  2. Heuristic Perception Path
  3. Deep Learning Perception Path
  4. Fusion and Arbitration
  5. Why Dual Paths
  6. Hesai AT128 LiDAR Signal Processing
  7. Radar Signal Processing
  8. Camera ISP and Preprocessing Pipeline
  9. Time Synchronization
  10. Multi-Sensor Calibration
  11. Online Calibration
  12. Manufacturing Calibration
  13. Point Cloud Processing
  14. Free Space Estimation
  15. Road Geometry Detection
  16. Traffic Signal Processing
  17. Kalman Filtering and State Estimation
  18. Data Association
  19. Track Management
  20. IMU/GNSS Integration
  21. LiDAR-to-Map Matching
  22. Visual Odometry
  23. Multi-Sensor Localization Fusion
  24. Mixed Traffic Rule-Based Handling
  25. Non-Standard Infrastructure Handling
  26. ML Detections to Classical Tracking
  27. Classical Preprocessing to ML
  28. Kinematic Feasibility Checking
  29. HD Map Integration
  30. 1,000+ Monitoring Mechanisms
  31. 20+ Safety Redundancies

1. Dual Perception Architecture

Architectural Overview

Pony.ai's perception module is explicitly architected around a dual-path design that runs a heuristic (rule-based / classical) perception pipeline in parallel with a deep-learning perception pipeline. This is documented in the company's safety report, SEC filings (F-1, 20-F), and technology pages. The company states that its "Perception module combines the strengths of a heuristic approach and deep learning models to boost performance, while ensuring the safety and operational redundancy of the vehicles."

This is not a simple ensemble or a sequential pipeline where one feeds the other. Both paths run concurrently on every sensor cycle, producing independent perception outputs that are then compared and arbitrated. The architecture is a safety-critical design decision rooted in ISO 26262 functional safety methodology and maps to the "Multi-Algorithm Fusion Redundancy of Key ADS Modules" -- one of Pony.ai's seven types of software system redundancy.

Safety Motivation

The dual-path architecture directly implements Pony.ai's core safety principle:

  • Single-point failure: The vehicle can continue to operate safely. If the deep learning path fails entirely (GPU crash, model produces NaN, inference timeout), the heuristic path provides baseline perception.
  • Dual-point failure: The vehicle can park safely (minimal risk condition). If both perception paths fail, the MRCC (Minimum Risk Condition Controller) uses its own redundant perception to execute a safe stop.

Operational Modes

Mode | Condition | Perception Capability
Normal Operation | All systems healthy | Full dual-path perception, multi-sensor fusion, all 34 sensors active
Degraded Safe Mode | Single-point failure (one LiDAR offline, DNN inference failure, single GPU crash) | Heuristic path + remaining sensors maintain safe driving; system may reduce speed or restrict operational domain
Minimal Risk Condition | Dual-point failure (multiple sensor failures, both perception paths degraded) | Emergency perception via MRCC redundant system; critical blind-spot coverage maintained; vehicle navigates intersections/ramps and pulls over safely

2. Heuristic Perception Path

The heuristic path implements classical algorithmic approaches to perception that do not depend on trained neural network weights. Based on Pony.ai's safety report, patent filings, SEC disclosures, and job posting requirements, the heuristic path comprises the following components:

2.1 Point Cloud Clustering and Segmentation

Ground Plane Estimation:

  • RANSAC (Random Sample Consensus) plane fitting to identify the dominant ground plane in each LiDAR scan
  • The PCL-style implementation constrains the plane fit to near-horizontal orientations (within angular tolerance of gravity vector), rejecting wall surfaces and ramps
  • Inlier points classified as ground are removed; remaining points represent potential obstacles
  • Critical for separating drivable surface from above-ground objects
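
To make the ground-segmentation step concrete, the following is a minimal Python sketch of a gravity-constrained RANSAC plane fit; the array layout, thresholds, and tilt tolerance are illustrative assumptions, not Pony.ai's actual parameters.

# Minimal RANSAC ground-plane fit, assuming `points` is an (N, 3) NumPy array
# in the vehicle frame with +z up. Parameter values are illustrative only.
import numpy as np

def fit_ground_plane(points, iters=100, dist_thresh=0.15, max_tilt_deg=10.0):
    best_inliers = np.zeros(len(points), dtype=bool)
    up = np.array([0.0, 0.0, 1.0])
    rng = np.random.default_rng(0)
    for _ in range(iters):
        # Sample 3 points and fit a candidate plane through them.
        idx = rng.choice(len(points), size=3, replace=False)
        p0, p1, p2 = points[idx]
        normal = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(normal)
        if norm < 1e-6:
            continue  # degenerate (collinear) sample
        normal /= norm
        # Reject candidate planes that are not near-horizontal
        # (angular tolerance around the gravity vector).
        if np.degrees(np.arccos(abs(normal @ up))) > max_tilt_deg:
            continue
        d = -normal @ p0
        dist = np.abs(points @ normal + d)
        inliers = dist < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers = inliers
    return points[best_inliers], points[~best_inliers]  # ground, obstacles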

Voxel Grid Downsampling:

  • Raw point clouds from four AT128 LiDARs generate >6.12 million points/second combined
  • Voxel grid filtering reduces point density while preserving spatial structure
  • Enables tractable processing for downstream clustering without GPU-intensive neural networks

Connected-Component / Euclidean Clustering:

  • After ground removal, Euclidean clustering separates remaining point clouds into distinct object clusters
  • KD-tree-based nearest neighbor search identifies points within a distance threshold as belonging to the same cluster
  • Minimum and maximum cluster size bounds filter out noise (too few points) and merge artifacts (clusters too large to be single objects)
  • Each cluster represents a candidate obstacle
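
A minimal sketch of KD-tree-based Euclidean clustering over the non-ground points follows; the distance tolerance and cluster size bounds are placeholder values, not production settings.

# Euclidean clustering sketch; `points` is an (N, 3) NumPy array of non-ground points.
import numpy as np
from scipy.spatial import cKDTree

def euclidean_cluster(points, tol=0.5, min_pts=10, max_pts=20000):
    tree = cKDTree(points)
    unvisited = set(range(len(points)))
    clusters = []
    while unvisited:
        seed = unvisited.pop()
        frontier, cluster = [seed], [seed]
        while frontier:
            q = frontier.pop()
            # Grow the cluster with all unvisited neighbors within `tol`.
            for n in tree.query_ball_point(points[q], r=tol):
                if n in unvisited:
                    unvisited.remove(n)
                    frontier.append(n)
                    cluster.append(n)
        # Size bounds filter out noise and over-merged clusters.
        if min_pts <= len(cluster) <= max_pts:
            clusters.append(points[cluster])
    return clusters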

2.2 Rule-Based Detection (Bounding Box Fitting)

  • Principal Component Analysis (PCA): Applied to each point cluster to determine object orientation and dimensions
  • L-shape fitting: For partially visible objects (e.g., vehicle seen from one side), L-shape model fitting estimates the full 3D bounding box from visible edges
  • Model fitting: Geometric primitives (rectangles for vehicles, cylinders for pedestrians/poles) are fit to clusters
  • Size-based classification: Cluster dimensions are compared against known size ranges for object classes (car: ~4.5 x 1.8 x 1.5m; truck: ~12 x 2.5 x 3.5m; pedestrian: ~0.5 x 0.5 x 1.7m)
  • Height-based filtering: Objects below a minimum height threshold are classified as ground artifacts; objects above road-surface height are classified as above-ground obstacles
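
A rough illustration of PCA-based oriented bounding-box fitting for one cluster is shown below; it fits a yaw-only box in the ground plane, a common simplification for on-road objects, and is not Pony.ai's production code.

# PCA-based oriented bounding box (yaw-only) for one cluster of (N, 3) points.
import numpy as np

def fit_obb_2d(cluster):
    """Returns box center, (length, width, height), and yaw angle."""
    xy = cluster[:, :2]
    centroid = xy.mean(axis=0)
    # Principal axes of the top-down footprint give the box orientation.
    cov = np.cov((xy - centroid).T)
    eigvals, eigvecs = np.linalg.eigh(cov)
    major = eigvecs[:, np.argmax(eigvals)]
    yaw = np.arctan2(major[1], major[0])
    # Rotate points into the box frame and take axis-aligned extents.
    rot = np.array([[np.cos(-yaw), -np.sin(-yaw)],
                    [np.sin(-yaw),  np.cos(-yaw)]])
    local = (xy - centroid) @ rot.T
    length, width = local.max(axis=0) - local.min(axis=0)
    height = cluster[:, 2].max() - cluster[:, 2].min()
    center = np.array([*centroid, cluster[:, 2].min() + height / 2.0])
    return center, (length, width, height), yaw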

2.3 Geometric Tracking

  • Kalman filtering: Extended Kalman Filter (EKF) or Unscented Kalman Filter (UKF) maintains state estimates for each tracked object (see Section 17)
  • Nearest-neighbor data association: Greedy or Hungarian algorithm matches new detections to existing tracks based on Mahalanobis distance (see Section 18)
  • No learned features: Association uses only geometric proximity and kinematic consistency, not appearance embeddings

2.4 Lane and Boundary Detection

  • Edge detection: Sobel, Canny, or similar gradient-based operators on camera images (Y-channel luminance sufficient) to find lane marking edges
  • Hough transforms: Line detection in image space to identify straight lane markings; extended to parabolic/polynomial models for curved lanes
  • LiDAR ground return analysis: Intensity discontinuities in LiDAR ground returns correspond to painted lane markings (paint has higher reflectivity than asphalt)
  • Model fitting: Polynomial curve fitting (2nd or 3rd order) to detected lane points, constrained by lane width priors and continuity with previous frames

2.5 Advantages of the Heuristic Path

Advantage | Technical Explanation
Predictable behavior | No opaque neural network failure modes; failures are deterministic and root-cause analyzable
No training data dependency | Functions without labeled datasets or GPU-intensive training; works from day one on new sensor configurations
Complementary failure modes | When DL models fail (novel objects, adversarial conditions, distribution shift, compute saturation), the heuristic path may still produce valid detections
Low-latency fallback | Classical algorithms run on CPU; can serve as fast fallback when DNN inference is delayed or GPU memory is exhausted
Certifiable | Rule-based algorithms can be formally verified and tested against specifications per ISO 26262, which current ML cannot

3. Deep Learning Perception Path

The deep learning path uses multi-modal neural networks for higher-accuracy perception:

Component | Approach
3D object detection | Multi-modal DNNs operating on LiDAR voxel/pillar features + camera image features, producing 3D bounding boxes with class labels, confidence scores, heading angles, and velocity estimates
BEV representation | Bird's Eye View as the unified representation: LiDAR features projected to a top-down grid; camera features lifted to BEV via depth estimation (LSS/BEVDet/BEVFormer-style view transformation)
Semantic segmentation | Pixel-level and point-level classification of road surfaces, lane markings, drivable areas, crosswalks
Instance segmentation | Per-object masks combining LiDAR and camera modalities (covered by US Patent 11,250,240)
Learned tracking | Deep feature extraction for appearance-based data association; learned motion models capturing complex agent behaviors beyond constant-velocity assumptions
Traffic element recognition | CNN-based classification of traffic lights (red/yellow/green, arrow directions, countdown timers), signs, and signals

The DL path runs on NVIDIA DRIVE Orin-X chips (3 main + 1 redundant = 1,016 TOPS total), optimized through TensorRT inference and CUDA 11.2+ memory management, achieving >10 Hz across all perception modules at ~80% sustained GPU utilization.

What the DL Path Does That the Heuristic Path Cannot

  • Semantic understanding: Distinguishing between object classes based on appearance (e.g., police vehicle vs. civilian vehicle, construction barrier vs. guardrail)
  • Rich texture-based detection: Detecting objects that have minimal 3D geometry in LiDAR (flat traffic signs, painted road markings)
  • Generalization to novel appearances: Handling objects with unusual shapes, colors, or occlusion patterns that rule-based size/shape filters would miss
  • Dense scene understanding: Producing complete drivable-area segmentation rather than sparse obstacle detection

4. Fusion and Arbitration

Cross-Validation Layer

The outputs of both paths are fused through an arbitration layer implementing these rules:

  1. Agreement (both paths detect): Objects detected by both paths receive high confidence -- the consensus of independent algorithms provides strong evidence of a real object. The fused output uses the higher-precision bounding box (typically from the DL path) but validates existence via the heuristic path.

  2. DL-only detection: If the DL path detects an object the heuristic path misses, the system treats it as present with moderate confidence. This is common for distant, small, or partially occluded objects where LiDAR points are too sparse for clustering but the camera-based DNN can detect the visual signature.

  3. Heuristic-only detection: If the heuristic path detects an object the DL path misses, the system treats it conservatively as present. This is critical for safety -- an obstacle that produces a clear LiDAR cluster but is not recognized by the neural network (novel object, OOD input) must still be avoided.

  4. Disagreement (conflicting classifications): When both paths detect an object at the same location but disagree on classification, the system uses the more safety-conservative interpretation (e.g., if DL says "traffic cone" but heuristic says "pedestrian," the system assumes "pedestrian" until resolved over subsequent frames).

Conservative Default Policy

The arbitration follows a fundamental principle: the union of detections is used, not the intersection. Any object detected by either path is assumed to exist until proven otherwise. This maximizes recall at the cost of some precision -- acceptable for a safety-critical system where missing an obstacle is far worse than a false detection.
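
As an illustration only, the union / conservative-default policy described above can be sketched as follows; the detection data model, safety ranking, and matching function are assumptions for the example, not Pony.ai's implementation.

# Illustrative arbitration sketch of the union / conservative-default policy.
SAFETY_RANK = {"pedestrian": 3, "cyclist": 2, "vehicle": 1, "static_obstacle": 0}

def arbitrate(dl_dets, heuristic_dets, match_fn):
    fused, matched_h = [], set()
    for dl in dl_dets:
        partner = match_fn(dl, heuristic_dets)  # spatial association, or None
        if partner is not None:
            matched_h.add(id(partner))
            # Agreement: keep the higher-precision DL box, but take the more
            # safety-conservative class if the two paths disagree.
            label = max((dl["label"], partner["label"]),
                        key=lambda c: SAFETY_RANK.get(c, 0))
            fused.append({**dl, "label": label, "confidence": "high"})
        else:
            fused.append({**dl, "confidence": "moderate"})  # DL-only detection
    for h in heuristic_dets:
        if id(h) not in matched_h:
            fused.append({**h, "confidence": "conservative"})  # heuristic-only
    return fused  # union of detections, never the intersection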

Fault Detection and System Arbitration Module

This is one of the seven software redundancy types. It monitors both perception paths for:

  • Latency violations: If either path exceeds its real-time deadline
  • Output validity: NaN checks, bounding box sanity (negative dimensions, unreasonable velocities)
  • Consistency monitoring: Sustained disagreement between paths triggers escalation
  • GPU health monitoring: Memory errors, thermal throttling, inference failures

5. Why Dual Paths

5.1 ISO 26262 Diversity Requirement

ISO 26262 (Road Vehicles -- Functional Safety) calls for diverse redundancy in safety-critical systems. Running two perception implementations based on fundamentally different algorithmic paradigms (geometric vs. learned) provides:

  • Algorithmic diversity: A bug or systematic error in one approach (e.g., a blind spot in the neural network's training distribution) is unlikely to manifest in the geometrically different heuristic approach
  • Implementation diversity: Different code paths, different engineers, different failure modes
  • Paradigmatic independence: ML models can fail silently on out-of-distribution inputs; classical algorithms fail loudly with interpretable error states

5.2 ML Opacity Problem

Neural networks are not certifiable under current ISO 26262 methods because:

  • Their behavior cannot be exhaustively specified or tested
  • They can produce confident wrong outputs on adversarial or out-of-distribution inputs
  • Failure modes are not enumerable

The heuristic path provides a certifiable safety baseline against which ML outputs are cross-validated.

5.3 Complementary Failure Modes

Scenario | DL Path | Heuristic Path | Dual-Path Outcome
Novel object (e.g., fallen mattress) | May miss -- not in training distribution | Detects as generic obstacle cluster | Detected
Distant small object (e.g., pedestrian at 180 m) | Detects via camera DNN | Too few LiDAR points for clustering | Detected
GPU crash / inference timeout | Full failure | Unaffected (runs on CPU) | Heuristic path provides fallback
Adversarial pattern / sensor artifact | May produce false classification | Geometric analysis unaffected by visual patterns | Geometric validation catches error
Nighttime, poorly lit area | Camera-based DL may degrade | LiDAR-based heuristic unaffected by lighting | Detected via heuristic path

5.4 CTO Philosophy

Pony.ai's CTO Lou Tiancheng has been described as a "rule-based true believer" in the planning domain. This philosophy extends to perception architecture: the company deliberately maintains strong classical/rule-based components as a counterweight to ML, rather than going all-in on end-to-end neural approaches. Their software is explicitly described as "a combination of AI models and rule-based code."


6. Hesai AT128 LiDAR Signal Processing

6.1 Sensor Overview

The Gen-7 robotaxi uses four Hesai AT128 hybrid solid-state LiDARs as primary perception sensors, each covering 120 degrees of horizontal FOV.

Parameter | Specification
Channels | 128 genuine channels (128 VCSEL arrays)
Scan technology | 128 high-power multi-junction VCSEL arrays, electronic scanning (no mechanical rotation)
Horizontal FOV | 120 degrees
Vertical FOV | 25.4 degrees
Detection range | 210 m @ 10% reflectivity
Ground detection range | Up to 70 m effective
Angular resolution | 0.1 deg (H) x 0.2 deg (V)
Point rate | >1.53 million pts/sec (single return); higher in dual return
Pixel resolution | 1,200 x 128
Range accuracy | +/- 3 cm
Wavelength | 905 nm
Eye safety | Class 1
Functional safety | ISO 26262 ASIL-B certified

6.2 Range Computation (Time-of-Flight)

The AT128 uses a pulsed Time-of-Flight (ToF) ranging method:

  1. Laser emission: Each VCSEL array fires a short (few nanosecond) 905 nm laser pulse
  2. Photon detection: SiPM (Silicon Photomultiplier) detectors receive reflected photons
  3. Time measurement: On-chip TDC (Time-to-Digital Converter) measures the round-trip time from emission to detection with sub-nanosecond resolution
  4. Range calculation: distance = (c * t_roundtrip) / 2 where c = speed of light (~0.3 m/ns), yielding +/- 3 cm accuracy
  5. Waveform digitization: Hesai's proprietary ASIC performs waveform digitization and peak detection to identify return pulses within the time-domain signal

6.3 Multi-Return Handling

The AT128 supports multiple return modes:

  • Single return (strongest): Reports only the strongest return pulse per laser firing. Point rate: 1,536,000 pts/sec. Used when maximum point density is not needed and processing bandwidth is constrained.
  • Single return (last): Reports only the last return pulse, which corresponds to the most distant surface. Useful for seeing through rain, spray, and vegetation.
  • Dual return: Reports both the strongest and last returns per laser firing. Doubles point rate but increases bandwidth and processing load. Critical for adverse weather: the first return may come from rain/spray while the last return reaches the actual obstacle behind it.

The dual return mode is essential for Pony.ai's all-weather operation across Chinese cities. By comparing strongest and last returns, the downstream processing pipeline can distinguish weather-related reflections from solid obstacles.

6.4 Intensity Calibration

Each point includes a reflectivity/intensity value representing the return signal strength relative to a calibrated reference. This value enables:

  • Distinguishing high-reflectivity surfaces (retroreflectors, lane markings, road signs) from low-reflectivity surfaces (asphalt, dark vehicles)
  • Material classification heuristics in the heuristic perception path
  • Lane marking detection from LiDAR ground returns (painted markings have higher reflectivity than bare asphalt)

Factory calibration by Hesai establishes per-channel intensity correction factors to normalize response across all 128 channels.

6.5 Intelligent Point Cloud Engine (IPE) -- Rain/Spray Filtering

Hesai's proprietary Intelligent Point Cloud Engine (IPE), implemented in their ASIC firmware, provides hardware-level noise filtering:

  • Real-time weather detection: Identifies rain, fog, exhaust fumes, and water splashes at the pixel level
  • Per-point marking: Each point receives a confidence/noise flag indicating whether it is likely a weather artifact
  • Filtering rate: Filters out >99.9% of environmental noise in adverse conditions (rain, fog, dust, exhaust)
  • Waveform analysis: The IPE decodes laser return waveforms with nanosecond-level precision, processing 24.6 billion samples per second across all channels
  • Multi-return comparison: Weather artifacts typically appear only in the first/strongest return while solid objects appear in both returns -- the IPE uses this discrepancy to flag noise points

This hardware-level filtering reduces the burden on Pony.ai's downstream software pipeline: the point cloud arriving at the perception stack has already been pre-cleaned by the LiDAR's onboard ASIC.

6.6 Point Cloud Packet Structure

Each AT128 UDP packet contains:

  • Timestamp: Absolute time information for each data block, synchronized to GPS/PTP time source
  • Channel ID: Identifying which of the 128 channels produced each point
  • Range: Distance measurement
  • Intensity/Reflectivity: Calibrated return signal strength
  • Return mode flag: Strongest, last, or dual-return indicator
  • Noise/confidence flag: IPE-generated quality indicator

6.7 Pony.ai's GPU-Side LiDAR Processing

After packets arrive via Ethernet (UDP), Pony.ai's processing pipeline (documented on the NVIDIA Developer Blog) performs:

  1. Packet collection and time sync: The upstream synchronization module (originally FPGA, now NVIDIA DRIVE Orin SoC) collects raw packets, applies time synchronization, and packages data
  2. Structure-of-Array (SoA) conversion: Point cloud data is restructured from packet format into GPU-friendly SoA layout (separate arrays for x, y, z, intensity, timestamp) for coalesced memory access
  3. Page-locked memory transfer: Fields exchanged between CPU and GPU use page-locked (pinned) memory for accelerated PCIe transfers
  4. NVIDIA CUB library operations: Scan/select operations for point filtering run ~58% faster than naive implementations
  5. Ground plane removal: RANSAC-based ground segmentation on GPU
  6. Point cloud filtering: Noise removal, range-gate filtering, removal of points flagged by IPE

Critical path latency reduction: ~4 ms achieved through these GPU optimizations.


7. Radar Signal Processing

7.1 Hardware Configuration

Generation | Configuration | Key Feature
Gen-6 | 4 short-range + 1 long-range forward-facing mmWave radar (5 total) | Conventional mmWave, limited elevation resolution
Gen-7 | 4 x 4D imaging millimeter-wave radar | Dense point cloud, elevation angle, high-resolution azimuth

7.2 4D Imaging Radar Signal Processing Pipeline

The Gen-7's 4D imaging radar follows the standard FMCW (Frequency-Modulated Continuous Wave) radar processing chain, which is entirely classical signal processing:

Step 1: Range-Doppler Map Generation (2D-FFT)

  • Each radar chirp produces a beat frequency signal from mixing transmitted and received waveforms
  • Range FFT: Fast Fourier Transform along the fast-time (samples within one chirp) dimension extracts range bins
  • Doppler FFT: FFT along the slow-time (across chirps within one frame) dimension extracts velocity bins
  • Result: A 2D Range-Doppler (RD) map per virtual antenna, with axes of range and radial velocity
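
A minimal NumPy sketch of the 2D-FFT range-Doppler computation follows, assuming adc holds beat-signal samples for one virtual antenna; windowing and axis conventions vary by radar vendor.

# Range-Doppler map from one virtual antenna's beat-signal samples,
# where `adc` is a (num_chirps, samples_per_chirp) array.
import numpy as np

def range_doppler_map(adc):
    # Range FFT along fast time (samples within one chirp), with a window.
    range_fft = np.fft.fft(adc * np.hanning(adc.shape[1]), axis=1)
    # Doppler FFT along slow time (across chirps), shifted so zero radial
    # velocity sits in the middle of the velocity axis.
    doppler_fft = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)
    return np.abs(doppler_fft)  # (velocity bins, range bins) magnitude map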

Step 2: CFAR Detection (Constant False Alarm Rate)

  • CFAR adaptively sets detection thresholds on the Range-Doppler map
  • Cell-averaging CFAR (CA-CFAR) or Ordered-Statistics CFAR (OS-CFAR): Estimates local noise floor around each cell under test by averaging neighboring cells, then sets threshold as noise floor + margin
  • Target cells exceeding the adaptive threshold are declared detections
  • CFAR ensures reliable detection even in environments with varying clutter levels and low SNR
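
The following is a simple one-dimensional cell-averaging CFAR sketch over a single range profile; window sizes and the threshold scale factor are illustrative, and production radars implement CFAR in dedicated DSP hardware.

# 1D CA-CFAR over a power profile (NumPy array of per-cell power values).
import numpy as np

def ca_cfar(power, num_train=16, num_guard=4, scale=8.0):
    n = len(power)
    half = num_train // 2 + num_guard
    detections = []
    for cut in range(half, n - half):
        # Training cells on both sides of the cell under test, excluding guards.
        lead = power[cut - half: cut - num_guard]
        lag = power[cut + num_guard + 1: cut + half + 1]
        noise_floor = np.mean(np.concatenate([lead, lag]))
        if power[cut] > scale * noise_floor:
            detections.append(cut)  # cell exceeds the adaptive threshold
    return detections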

Step 3: Beamforming and Direction-of-Arrival (DOA) Estimation

  • The 4D imaging radar uses MIMO antenna arrays (multiple TX, multiple RX) to create virtual apertures
  • Digital beamforming (DBF): FFT along the antenna dimension of the virtual array to estimate azimuth angle
  • Elevation estimation: Additional antenna dimension provides elevation angle (the "4th D" in 4D radar)
  • Advanced algorithms: Capon beamforming or MVDR (Minimum Variance Distortionless Response) can nearly double angular resolution compared to standard FFT-based beamforming on the same hardware

Step 4: Point Cloud Generation

  • Detected targets are converted to 3D point cloud format: (range, azimuth, elevation, Doppler velocity)
  • Each radar "point" includes directly measured radial velocity -- a unique capability that LiDAR and cameras cannot provide

Step 5: Velocity Estimation

  • Direct Doppler measurement: Radar provides instantaneous radial velocity for every detected point, without needing frame-to-frame differencing
  • Velocity disambiguation: Phase unwrapping and multi-chirp techniques resolve velocity ambiguity
  • This velocity information is fed to the tracking pipeline (Section 17) to improve state estimation, and is used for online radar calibration (Section 11)

7.3 Radar's Classical Contributions to Perception

All radar signal processing from raw ADC samples through CFAR detection and beamforming is entirely classical -- no neural networks are involved in the radar processing chain. The radar provides:

  • Direct velocity measurements for moving object tracking
  • Weather-robust detections independent of optical conditions
  • Stationary object Doppler signatures used for online sensor calibration (Pony.ai patent US11,454,701)

8. Camera ISP and Preprocessing Pipeline

8.1 Hardware Evolution

Pony.ai's camera pipeline evolved through four major phases (documented on NVIDIA Developer Blog):

Phase | Architecture | Camera Interface
Phase 1 | CPU-based I/O | USB + Ethernet cameras -> CPU -> GPU
Phase 2 | FPGA gateway | Cameras -> FPGA (trigger + sync) -> DMA -> main memory -> GPU (HostToDevice ~1.5 ms)
Phase 3 | GPU Direct RDMA | Cameras -> FPGA -> PCIe switch -> GPU memory directly (~6 GB/s on PCIe Gen3 x8)
Phase 4 (Production) | NVIDIA DRIVE Orin SoC | Cameras -> Orin SoC (ISP + sync + encoding) -> NvStreams -> discrete GPU or host CPU

8.2 ISP Pipeline

The Image Signal Processor handles raw sensor data to produce usable images. Pony.ai's ISP pipeline, implemented on the Orin SoC in production, performs:

  1. Debayering: Converting raw Bayer-pattern data from the image sensor into full-color images
  2. Exposure control: Adaptive exposure management for varying lighting conditions. Pony.ai collaborated with ON Semiconductor (now onsemi) on "next-generation image sensing and processing technologies" specifically addressing "exposure control in imaging for computer vision applications." The critical insight: "exposure parameters cannot be universally optimal, as variations in lighting and conditions affect the visual representation of objects." The ISP dynamically adapts exposure per camera based on scene illumination.
  3. White balance: Color temperature correction to normalize colors across varying ambient light
  4. Noise reduction: Spatial and temporal denoising to reduce sensor noise, especially in low-light conditions
  5. Tone mapping / Gamma correction: Dynamic range compression to preserve detail in both shadows and highlights
  6. HDR processing: Some cameras may employ multi-exposure HDR to handle high-contrast scenes (e.g., tunnel exits, direct sunlight with deep shadows)
  7. YUV420 output: The ISP outputs in YUV420 color space natively

8.3 YUV420 Native Format Strategy

A key optimization: Pony.ai adopted YUV420 throughout the entire pipeline, eliminating YUV-to-RGB conversion:

  • Conversion savings: Eliminating YUV->RGB conversion saves ~0.3 ms per frame
  • Memory savings: YUV420 uses 50% less GPU memory than RGB
  • Luminance-only processing: Perception modules that need only brightness (edge detection, feature point extraction) use only the Y channel, saving ~67% memory relative to a full RGB frame
  • No perceptual loss: Human vision is less sensitive to chrominance than luminance; machine perception benefits from the same data in native format

8.4 Hardware Encoding

  • HEVC encoding: NVIDIA Video Codec dedicated hardware encoders (~3 ms per FHD image)
  • Replaces NvJPEG which required ~4 ms and caused CPU/GPU resource contention
  • Dedicated hardware encoders preserve CUDA cores and CPU resources for neural network inference

8.5 Zero-Copy Memory Architecture

Camera frames reside in GPU memory from the moment they arrive via NvStreams/GPU Direct RDMA:

  • Custom protobuf codegen plugin introduced GpuData field type
  • CameraFrame protobuf messages contain GPU memory pointers, not pixel data copies
  • Multiple perception modules (DL detection, heuristic edge detection, segmentation) receive pointers to the same GPU-resident frame
  • Zero-copy throughout the pipeline: No redundant CPU <-> GPU transfers

8.6 GPU Memory Management for Camera Processing

  • Fixed slot-size GPU memory pool (early approach): Pre-allocated stacks matching camera frame sizes, reducing alloc/free overhead to near zero
  • CUDA 11.2 cudaMemPool (current approach): Dynamic allocation with ~2 microsecond overhead, supporting cameras with varying resolutions (Gen-7 uses 14 cameras including 8MP sensors)
  • Page-locked memory: Used for CPU-GPU field exchanges that must occur during preprocessing

8.7 Traffic Light Camera (PiDC)

Pony.ai designed an in-house traffic light camera (PiDC) specifically for traffic signal detection:

  • Uses a constant exposure time of 11 ms to achieve consistent image capture regardless of ambient light
  • Employs a neutral density (ND) filter to prevent oversaturation at this high-sensitivity setting
  • In the Gen-6 system, the self-developed traffic light camera had 1.5x resolution compared to the previous generation
  • The PiDC eliminates the need for adaptive exposure that would otherwise cause traffic light appearance to vary across frames, simplifying both classical and DL detection

9. Time Synchronization

9.1 The Synchronization Problem

Pony.ai's Gen-7 system must synchronize 34 sensors operating at different frame rates, with different internal clocks, and different physical locations on the vehicle:

  • 9 LiDARs (each at 10 or 20 Hz, electronic scanning)
  • 14 cameras (10-30 Hz depending on type)
  • 4 radars (10-20 Hz)
  • 4 microphones (continuous audio stream)
  • 2 water sensors + 1 collision sensor (event-based)
  • GNSS receiver (1-10 Hz)
  • IMU (100-400 Hz)

9.2 Synchronization Architecture

Hardware trigger (FPGA / Orin SoC): The synchronization arbiter was originally an FPGA, which "handles the camera trigger and synchronization logic to provide better sensor fusion." In production (Gen-7), the NVIDIA DRIVE Orin SoC assumes this role.

The synchronization module:

  1. Generates hardware trigger signals to cameras (all cameras fire simultaneously for spatial consistency)
  2. Receives LiDAR packet timestamps (GPS-disciplined PTP time)
  3. Correlates radar frame timestamps
  4. Produces a synchronized sensor data package with all modalities time-aligned

Time reference sources:

  • GPS Pulse-Per-Second (PPS): A 1 PPS signal from the GNSS receiver provides a reference clock aligned to UTC with nanosecond-level accuracy
  • PTP (Precision Time Protocol, IEEE 1588): PTPv2 distributes precise time across the sensor network via Ethernet, achieving +/- 100 ns synchronization between devices
  • Orin SoC internal clock: Disciplined by PTP/GPS, used as the master clock for all sensor triggering

9.3 Temporal Alignment in Processing

After synchronized capture, the upstream module applies:

  • Timestamp interpolation: For sensors with different frame rates, data is interpolated to a common reference time
  • Motion compensation: Vehicle ego-motion between sensor capture times is compensated using IMU data at high rate (100+ Hz), so that point clouds and images can be fused in a consistent coordinate frame despite being captured at slightly different instants
  • Latency equalization: Different sensors have different inherent processing latencies; the synchronization module accounts for these offsets

10. Multi-Sensor Calibration

10.1 The Calibration Challenge

With 34 sensors in the Gen-7 system, calibration involves determining precise 6-DOF transformations (3 rotations + 3 translations) for every sensor pair:

  • 9 LiDAR-to-vehicle transformations
  • 14 camera-to-vehicle transformations (plus intrinsic calibration: focal length, principal point, lens distortion per camera)
  • 4 radar-to-vehicle transformations
  • LiDAR-to-camera cross-modal alignments (for point cloud projection onto images)
  • Radar-to-LiDAR alignments (for cross-modal association)
  • Temporal calibration: Time offsets between sensor clocks

10.2 LiDAR-Camera Extrinsic Calibration

The LiDAR-camera transformation must enable accurate projection of 3D LiDAR points onto 2D camera images. Methods used in the industry (and likely by Pony.ai given their patent portfolio):

Target-based calibration (factory/initial):

  • Calibration targets (checkerboard patterns, specialized reflective targets) are placed at known positions
  • LiDAR detects target edges/corners in the point cloud; camera detects them in the image
  • The 6-DOF transformation is computed by minimizing reprojection error between corresponding 3D-2D point pairs
  • Used during vehicle assembly and after sensor replacement

Targetless calibration (online/continuous):

  • Extracts natural features (edges, planes, corners) from both LiDAR point clouds and camera images
  • Matches features across modalities to estimate and refine the extrinsic transformation
  • Can run continuously during normal driving to detect and correct calibration drift

10.3 Radar-LiDAR/Camera Cross-Calibration

  • Radar-LiDAR calibration exploits shared detections of the same objects in both modalities
  • Spatial association of radar points and LiDAR clusters at known positions determines the radar-to-vehicle transformation
  • Pony.ai's Doppler-based calibration patent (US11,454,701) provides continuous radar calibration using stationary object Doppler signatures (see Section 11)

10.4 LiDAR Intrinsic Calibration

  • Hesai factory-calibrates each AT128's 128 channels: beam angle offsets, range bias corrections, intensity normalization
  • The AT128's "unstitched" 120-degree FOV from genuine 128 VCSEL channels eliminates seam artifacts that would require additional cross-channel calibration
  • Per-channel corrections are stored in sensor firmware and applied automatically

11. Online Calibration

11.1 Real-Time Doppler-Based Calibration (US Patent 11,454,701)

Pony.ai holds patent US11,454,701: "Real-time and dynamic calibration of active sensors with angle-resolved Doppler information for vehicles."

Algorithm:

  1. During normal driving, the radar continuously measures Doppler velocity of all detected objects
  2. Stationary object identification: Buildings, poles, parked vehicles, guardrails -- objects known to be stationary (zero true velocity)
  3. Expected Doppler computation: For a stationary object, the observed Doppler velocity should equal the negative of the vehicle's own velocity component projected along the radar beam direction:
    v_doppler_expected = -v_ego * cos(theta_object)
    where theta_object is the angle from the radar boresight to the object
  4. Discrepancy measurement: Any systematic discrepancy between expected and measured Doppler across multiple stationary objects indicates sensor mounting angle error
  5. Parameter update: The system automatically adjusts the sensor's angular offset parameters to minimize the observed discrepancy
  6. Continuous operation: This calibration runs in real-time during normal driving, requiring no special calibration targets or procedures
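
A simplified sketch of the underlying estimation problem is shown below: given stationary targets, find the azimuth mounting offset that best explains their measured Doppler velocities. The grid search and variable names are illustrative, not the patent's actual implementation.

# Azimuth-offset estimation from stationary-object Doppler returns, following
# the relation v_doppler = -v_ego * cos(theta + delta), where `delta` is the
# unknown mounting offset (an assumption added for this example).
import numpy as np

def estimate_azimuth_offset(theta_meas, v_doppler, v_ego):
    """theta_meas: target angles (rad); v_doppler: measured radial velocities
    (m/s) of stationary targets; v_ego: ego speed (m/s). Returns offset (rad)."""
    candidates = np.radians(np.linspace(-5.0, 5.0, 1001))
    residuals = [
        np.sum((v_doppler + v_ego * np.cos(theta_meas + d)) ** 2)
        for d in candidates
    ]
    return candidates[int(np.argmin(residuals))]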

Significance for fleet operations:

  • Enables continuous recalibration across 1,000+ vehicle fleet without manual intervention
  • Compensates for thermal expansion, vibration-induced drift, and minor impacts that gradually shift sensor alignment
  • Maintains perception accuracy over the designed 600,000+ km vehicle lifespan

11.2 Static Object-Based LiDAR Calibration (US Patent 12,032,102)

Pony.ai's patent for vehicle sensor calibration using detected static objects:

Algorithm:

  1. Environmental assessment: The system first evaluates whether conditions are suitable for calibration (sufficient static objects, good visibility)
  2. Static object detection: Identifies poles, signs, building corners, and other static landmarks in the point cloud
  3. Height and shape variance check: If detected static objects exceed a predetermined variance threshold (indicating sufficient geometric diversity), calibration proceeds
  4. Transformation matrix computation: The first sensor's local coordinate system is iteratively aligned to a pre-calibrated second sensor's coordinate system
  5. Iterative refinement: If calibration accuracy falls below threshold, additional static objects are detected and the process repeats
  6. Global reference: Calibration is anchored to a global coordinate system using the HD map as reference -- detected landmarks are matched to known map features

This is a fully classical optimization algorithm: iterative closest point (ICP) style transformation estimation using geometric features, with no neural network involvement.

11.3 Calibration Monitoring

The system continuously monitors calibration quality by:

  • Checking consistency of LiDAR-camera point projections against detected image edges
  • Monitoring radar Doppler residuals against ego-motion estimates
  • Flagging sudden calibration shifts (impact detection triggers immediate recalibration check)
  • Running calibration quality checks as part of the 1,000+ monitoring mechanisms

12. Manufacturing Calibration

12.1 Gen-7 Mass Production Context

The Gen-7 is explicitly the "world's first mass-produced L4 autonomous vehicle" with three vehicle platforms (Toyota bZ4X, BAIC ARCFOX Alpha T5, GAC Aion V). Calibration at production scale is fundamentally different from research prototype calibration:

12.2 Platform-Based Design

The Gen-7 features an "enhanced platform-based design that enables rapid adaptation across multiple vehicle models." This means:

  • Standardized sensor mounting points: Each platform has precisely machined mounting locations for the rooftop assembly (4x AT128 + cameras) and body-mounted sensors (5x near-range LiDAR + radar)
  • Pre-assembled sensor modules: Sensors are "pre-assembled" in the highly integrated sensor package before vehicle integration, allowing factory calibration of relative sensor positions within the module
  • Automotive-grade tolerances: 100% automotive-grade components with manufacturing tolerances that minimize initial calibration variation

12.3 Factory Calibration Pipeline

Based on industry practice for mass-produced AV systems and Pony.ai's disclosed capabilities:

  1. Sensor module assembly: Sensors are mounted in the rooftop/body assemblies with precision jigs
  2. Intra-module calibration: Relative positions of sensors within each module are calibrated using target-based methods in a controlled environment
  3. Vehicle integration: Modules are installed on the vehicle platform
  4. Vehicle-level calibration: A complete calibration run establishes the full set of sensor-to-vehicle transformations, potentially using a calibration facility with known reference targets
  5. Calibration verification drive: A short test drive verifies calibration quality using the online calibration system (Section 11) as a checker
  6. Calibration data storage: Calibration parameters are stored in the vehicle's compute system and updated via OTA as online calibration refines values during operation

12.4 Scale Considerations

With 1,000+ vehicles planned for 2025 fleet and 3,000+ by end of 2026:

  • Manual per-vehicle calibration sessions are impractical at this scale
  • Heavy reliance on automated calibration (Sections 11.1, 11.2)
  • Quality control through statistical monitoring of calibration drift rates across the fleet
  • OTA calibration parameter updates when systematic biases are detected

13. Point Cloud Processing

13.1 Classical Point Cloud Processing Pipeline

The point cloud processing pipeline operates on the combined output from all 9 LiDARs (4x AT128 + 5x near-range). Classical processing steps run in the heuristic perception path:

Step 1: Multi-LiDAR Point Cloud Merging

  • Point clouds from individual LiDARs are transformed to the vehicle body frame using calibrated extrinsic transformations
  • Motion compensation using IMU data corrects for vehicle motion during the scan period
  • Combined point cloud represents a unified 360-degree 3D scene

Step 2: Ground Plane Estimation

  • RANSAC plane fitting: Iteratively samples minimal point sets (3 points define a plane), fits candidate planes, counts inliers within a distance threshold
  • Constrained to near-horizontal: Angular tolerance around gravity vector (from IMU) prevents fitting to vertical surfaces
  • Segmented ground model: Rather than a single global plane, the ground is modeled as a piecewise planar surface (tiled grid of local planes), handling slopes, ramps, and road crown
  • Ground points are separated from above-ground points; ground points feed into drivable surface estimation

Step 3: Noise and Outlier Removal

  • Statistical Outlier Removal (SOR): For each point, the mean distance to its k-nearest neighbors is computed; points with mean distances exceeding a threshold (e.g., mean + 2*stddev) are removed
  • Radius-based outlier removal: Points with fewer than N neighbors within radius R are removed
  • IPE-flagged point removal: Points flagged as weather noise by Hesai's IPE are excluded
  • Multi-return disambiguation: In dual-return mode, comparing strongest and last returns to filter weather artifacts

Step 4: Voxel Grid Downsampling

  • 3D space is divided into uniform voxels (e.g., 10-20 cm cubes)
  • Points within each voxel are replaced by their centroid
  • Reduces point count while preserving spatial structure
  • Critical for making downstream clustering computationally tractable at 6M+ pts/sec
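
A compact sketch of voxel-grid centroid downsampling follows; the voxel size is an illustrative value.

# Voxel-grid centroid downsampling; `points` is (N, 3), voxel edge in meters.
import numpy as np

def voxel_downsample(points, voxel=0.15):
    keys = np.floor(points / voxel).astype(np.int64)
    # Group points by voxel index and replace each group with its centroid.
    cells = {}
    for key, p in zip(map(tuple, keys), points):
        cells.setdefault(key, []).append(p)
    return np.array([np.mean(ps, axis=0) for ps in cells.values()])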

Step 5: Above-Ground Clustering

  • Euclidean clustering: KD-tree-based nearest-neighbor search groups points within a distance threshold into clusters
  • Minimum/maximum cluster bounds: Filter out noise clusters (< min_points) and over-merged clusters (> max_points or > max_extent)
  • Each cluster is a candidate obstacle for the heuristic detection path

Step 6: Bounding Box Fitting

  • For each cluster: PCA determines principal axes; oriented bounding box (OBB) is fit aligned to principal components
  • L-shape fitting for partially visible objects (vehicles seen from one corner/side)
  • Height, width, length extracted for size-based classification

13.2 GPU-Optimized Implementation

All point cloud processing runs on GPU with specific optimizations documented by Pony.ai:

  • SoA data layout: Separate contiguous arrays for x, y, z, intensity, timestamp enable coalesced GPU memory access
  • CUB library: NVIDIA CUB scan/select operations for filtering (~58% faster than naive)
  • Page-locked memory: For CPU-GPU exchanges during preprocessing
  • Fixed memory pool / cudaMemPool: Pre-allocated GPU memory to avoid allocation latency

14. Free Space Estimation

14.1 Classical Occupancy Grid

Free space estimation determines which areas around the vehicle are traversable (free of obstacles). The classical approach:

2D Occupancy Grid Construction:

  1. The area around the vehicle is discretized into a 2D grid (cells typically 10-20 cm)
  2. Each cell has one of three states: free, occupied, or unknown
  3. LiDAR points that hit above-ground obstacles mark cells as occupied
  4. Ground-level LiDAR returns (from ground plane estimation) confirm cells as free

Ray Casting:

  • For each LiDAR point, a ray is traced from the sensor origin to the point's location
  • All grid cells along the ray path (before the hit point) are marked as free (the ray passed through them unobstructed)
  • The cell containing the hit point is marked as occupied
  • Cells never traversed by any ray remain unknown
  • This classical method provides a conservative estimate of free space with no ML dependency
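
The ray-casting update can be sketched as below for a 2D grid; coordinates are assumed to be pre-shifted so all indices are non-negative, and the cell resolution is illustrative.

# 2D occupancy-grid ray casting. The grid is initialized to UNKNOWN; cells the
# ray passes through become FREE and the cell containing the return becomes
# OCCUPIED. Coordinates are in meters with (0, 0) at the grid corner.
import numpy as np

UNKNOWN, FREE, OCCUPIED = -1, 0, 1

def ray_cast(grid, origin_xy, hit_xy, res=0.2):
    o = np.asarray(origin_xy) / res
    h = np.asarray(hit_xy) / res
    steps = int(np.ceil(np.linalg.norm(h - o)))
    for t in np.linspace(0.0, 1.0, max(steps, 2), endpoint=False):
        cx, cy = np.floor(o + t * (h - o)).astype(int)
        grid[cx, cy] = FREE          # traversed cells are observed free
    ex, ey = np.floor(h).astype(int)
    grid[ex, ey] = OCCUPIED          # the cell containing the return is occupied
    return grid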

Temporal Accumulation:

  • Occupancy evidence accumulates over multiple scans using log-odds updating (Bayesian occupancy grid)
  • Cells transition from unknown to free or occupied as evidence accumulates
  • Moving objects are handled by decaying occupancy evidence over time

14.2 Near-Field Occupancy

The five body-mounted near-range LiDARs (historically RoboSense Bpearl) provide dense near-field occupancy:

  • Detection within 10 cm of the vehicle body
  • Critical for parking, narrow-passage navigation, and low-speed maneuvering
  • Provides occupancy data where roof-mounted AT128s have blind spots (directly beside and below the vehicle)

14.3 Hybrid Free Space

The DL path provides semantic free space (drivable area segmentation from cameras), while the classical path provides geometric free space (occupancy grid from LiDAR ray casting). Both are fused:

  • Geometric free space is authoritative for physical obstacle presence
  • Semantic free space adds context (road surface type, curb boundaries, sidewalk vs. road)
  • The union is used: if either method indicates occupied, the cell is treated as non-traversable

15. Road Geometry Detection

15.1 Classical Lane Detection Components

LiDAR-based lane marking detection:

  • Ground return intensity analysis: Painted lane markings reflect more 905 nm laser light than asphalt
  • Intensity gradient detection along ground plane points identifies marking edges
  • Model fitting (polynomial curves) to detected intensity edges produces lane boundary estimates
  • Effective up to AT128's 70 m ground detection range

Camera-based classical lane detection:

  • Preprocessing: Y-channel extraction from YUV420 (luminance only, 67% memory savings)
  • Edge detection: Sobel/Canny operators on the Y-channel image detect marking edges
  • Perspective transformation: Inverse perspective mapping (IPM) transforms the camera image to a top-down BEV view, making lane lines parallel
  • Hough line detection: Identifies straight-line candidates in the transformed image
  • Polynomial fitting: 2nd/3rd-order polynomial curves fit to detected marking points, constrained by:
    • Lane width priors (3.0-3.75 m for Chinese national standard lanes)
    • Continuity with previous frame's lane model
    • Symmetry constraints (parallel lane boundaries)
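
For illustration, a minimal OpenCV version of this chain (Canny edges, probabilistic Hough lines, polynomial fit) might look like the following; thresholds are placeholders, and a production pipeline would add IPM, lane-width priors, and temporal smoothing.

# Classical lane-marking chain on the luminance channel (8-bit grayscale image).
import cv2
import numpy as np

def detect_lane_curve(y_channel):
    edges = cv2.Canny(y_channel, 50, 150)
    lines = cv2.HoughLinesP(edges, rho=1, theta=np.pi / 180,
                            threshold=40, minLineLength=30, maxLineGap=10)
    if lines is None:
        return None
    # Collect marking pixels from detected segments and fit a 2nd-order
    # polynomial x = f(y), the usual parameterization for near-vertical lanes.
    xs, ys = [], []
    for x1, y1, x2, y2 in lines[:, 0]:
        xs += [x1, x2]
        ys += [y1, y2]
    return np.polyfit(ys, xs, deg=2)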

Road boundary detection:

  • LiDAR height discontinuities at curb edges (10-25 cm height steps)
  • LiDAR reflectivity changes at road edge (asphalt vs. grass/dirt)
  • Camera-based guardrail/barrier detection via edge detection and template matching

15.2 HD Map-Assisted Road Geometry

Classical lane detection is augmented by the HD map:

  • Known lane geometry from the map provides strong prior constraints
  • Online perception refines/confirms map-based lane positions
  • When lane markings are faded, obscured by snow, or missing, the HD map provides the geometry
  • Discrepancies between perceived and mapped lane geometry trigger construction zone detection

16. Traffic Signal Processing

16.1 Classical Components in Traffic Light Detection

Traffic light detection uses a hybrid of classical and ML approaches:

Classical preprocessing:

  • Color space conversion: RGB or YUV to HSV for color-based segmentation
  • Color thresholding: HSV-space masks isolate red, yellow, and green candidate regions
  • Morphological operations: Erosion/dilation to clean up color masks and remove noise
  • Region of Interest (ROI) extraction: HD map provides expected traffic light positions in image coordinates (projecting known 3D map positions through calibrated camera intrinsics); the classical pipeline constrains detection to these ROIs, dramatically reducing false positives from other red/green light sources (neon signs, tail lights)

Temporal state machine:

  • State transition validation: Traffic light states follow known sequencing constraints -- for example, a light cannot jump directly from green to red without passing through yellow in most configurations, while red typically transitions directly to green
  • Hysteresis filtering: A state transition is only confirmed after the new state is observed for a minimum number of consecutive frames (e.g., 2-3 frames), preventing flicker-induced errors from LED PWM dimming, camera rolling shutter artifacts, or momentary occlusions
  • Countdown timer tracking: For Chinese traffic lights with countdown displays, OCR-based digit recognition provides additional temporal context
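
A minimal sketch of the multi-frame hysteresis filter is shown below; the confirmation count is an assumed value.

# Multi-frame hysteresis confirmation for traffic-light state.
class LightStateFilter:
    def __init__(self, confirm_frames=3):
        self.confirmed = "unknown"
        self.candidate = None
        self.count = 0
        self.confirm_frames = confirm_frames

    def update(self, observed):
        if observed == self.confirmed:
            self.candidate, self.count = None, 0
            return self.confirmed
        # A new state becomes confirmed only after it is seen in N consecutive
        # frames, suppressing PWM flicker and brief occlusions.
        if observed == self.candidate:
            self.count += 1
        else:
            self.candidate, self.count = observed, 1
        if self.count >= self.confirm_frames:
            self.confirmed, self.candidate, self.count = observed, None, 0
        return self.confirmed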

HD map association:

  • Each detected traffic light is matched to a known signal position in the HD map
  • This resolves ambiguity when multiple traffic light groups are visible (common at large Chinese intersections)
  • The map specifies which signal group governs which lane, enabling correct lane-to-signal association

16.2 PiDC Camera for Traffic Lights

Pony.ai's self-developed PiDC (Pony intelligent Detection Camera) is specifically optimized for traffic light detection:

  • Fixed 11 ms exposure: Eliminates exposure variation that would change traffic light appearance across frames
  • Neutral density filter: Prevents oversaturation at the fixed high-sensitivity exposure setting
  • After deploying PiDC, monthly traffic-light-related issues dropped to zero (August 2021)
  • The fixed exposure simplifies both classical color classification (consistent HSV values) and ML-based classification (consistent input distribution)

16.3 Chinese Traffic Light Challenges

Challenge | Classical Handling Approach
Multiple signal groups visible | HD map ROI constraints + spatial association
Vertical and horizontal orientations | Aspect ratio analysis of detected light cluster geometry
Arrow signals (left, right, U-turn) | Shape template matching within detected light region
Countdown timers | OCR digit recognition on timer display region
LED PWM flicker | Temporal hysteresis filtering (multi-frame confirmation)
Nighttime neon interference | HD map ROI + size/shape constraints filter non-signal lights

17. Kalman Filtering and State Estimation

17.1 State Vector

The heuristic tracking path maintains a state vector for each tracked object. The typical state vector for a vehicle target:

x = [px, py, pz, vx, vy, heading, yaw_rate, length, width, height]

Where:

  • (px, py, pz): 3D position in vehicle frame
  • (vx, vy): 2D velocity in the ground plane
  • heading: Yaw angle (orientation)
  • yaw_rate: Angular velocity
  • (length, width, height): Object dimensions

For pedestrians, a simpler state may be used (no yaw_rate, smaller dimension vector).

17.2 Process Model

Constant Turn-Rate and Acceleration (CTRA) model:

px(t+dt) = px(t) + vx(t)*dt + 0.5*ax*dt^2
py(t+dt) = py(t) + vy(t)*dt + 0.5*ay*dt^2
heading(t+dt) = heading(t) + yaw_rate*dt

For simpler cases, Constant Velocity (CV) or Constant Turn-Rate and Velocity (CTRV) models are used. The model selection may be class-dependent:

  • Vehicles: CTRA or CTRV (captures turning behavior)
  • Pedestrians: CV (relatively constant velocity between observations)
  • E-bikes: CTRA (frequent turning, acceleration changes)

17.3 Extended Kalman Filter (EKF) / Unscented Kalman Filter (UKF)

  • Prediction step: Propagate state forward using the process model; propagate covariance through the nonlinear motion model (EKF linearizes via Jacobian; UKF uses sigma points)
  • Update step: When a new detection is associated with the track, incorporate the measurement to correct the predicted state
  • Measurement model: Maps state to expected measurement (position, dimensions, heading from detection)
  • Radar velocity integration: When radar Doppler measurements are available for a tracked object, they provide a direct velocity measurement that dramatically improves velocity estimation accuracy compared to position-only differentiation
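
To make the predict/update cycle concrete, the following sketch uses a linear constant-velocity model in place of the full CTRA EKF; matrices and noise values are illustrative.

# Minimal Kalman predict/update cycle for one track with a CV model.
import numpy as np

def kf_predict(x, P, dt, q=1.0):
    """x = [px, py, vx, vy], P = 4x4 state covariance."""
    F = np.array([[1, 0, dt, 0],
                  [0, 1, 0, dt],
                  [0, 0, 1,  0],
                  [0, 0, 0,  1]], dtype=float)
    Q = q * np.diag([dt**3 / 3, dt**3 / 3, dt, dt])  # simple process noise
    return F @ x, F @ P @ F.T + Q

def kf_update(x, P, z, R):
    """z = measured [px, py] (e.g., a LiDAR cluster centroid), R = 2x2 noise."""
    H = np.array([[1, 0, 0, 0],
                  [0, 1, 0, 0]], dtype=float)
    y = z - H @ x                      # innovation
    S = H @ P @ H.T + R                # innovation covariance
    K = P @ H.T @ np.linalg.inv(S)     # Kalman gain
    return x + K @ y, (np.eye(4) - K @ H) @ P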

17.4 Multi-Sensor State Estimation

The Kalman filter fuses measurements from multiple sensor modalities:

  • LiDAR detections: Provide precise 3D position and dimensions
  • Camera detections: Provide 2D bounding boxes; 3D position estimated via known camera geometry + assumed ground plane or depth estimation
  • Radar detections: Provide range, angle, and direct radial velocity
  • Each measurement source has its own noise model (measurement covariance matrix), reflecting the sensor's accuracy characteristics

18. Data Association

18.1 The Association Problem

Each perception cycle produces new detections from both perception paths and all sensor modalities. Data association answers: "Which new detection corresponds to which existing track?"

18.2 Classical Association Methods

Global Nearest Neighbor (GNN) / Hungarian Algorithm:

  • Construct a cost matrix: rows = existing tracks, columns = new detections
  • Cost = Mahalanobis distance (accounting for state uncertainty) between predicted track position and detection position
  • Hungarian algorithm finds the optimal one-to-one assignment that minimizes total cost
  • Unassigned detections become candidate new tracks; unassigned tracks age without update

Joint Probabilistic Data Association (JPDA):

  • In dense scenes (Chinese urban intersections with closely spaced e-bikes, pedestrians), GNN may make incorrect hard assignments
  • JPDA computes the probability that each detection belongs to each track
  • The state update uses a weighted combination of all plausible associations
  • More robust in clutter but computationally more expensive

Gating:

  • Before association, a gate (validation region) around each track's predicted position filters out implausible associations
  • Mahalanobis distance gating: only detections within the track's predicted uncertainty ellipsoid (e.g., 3-sigma) are considered
  • Reduces computation and prevents gross misassociations
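
A sketch of gated global-nearest-neighbor association using the Hungarian algorithm (via SciPy) follows; the gate threshold corresponds to a chi-square 99% bound for a 2D position measurement and is illustrative.

# Gated GNN assignment: tracks' predicted positions/innovation covariances
# (e.g., from the Kalman filter sketch above) vs. new detection positions.
import numpy as np
from scipy.optimize import linear_sum_assignment

GATE = 9.21  # chi-square 99% gate for 2 degrees of freedom (illustrative)

def associate(track_means, track_covs, det_positions):
    cost = np.full((len(track_means), len(det_positions)), 1e6)
    for i, (m, S) in enumerate(zip(track_means, track_covs)):
        S_inv = np.linalg.inv(S)
        for j, z in enumerate(det_positions):
            d = z - m
            maha = float(d @ S_inv @ d)     # squared Mahalanobis distance
            if maha < GATE:                 # gating: ignore implausible pairs
                cost[i, j] = maha
    rows, cols = linear_sum_assignment(cost)
    # Keep only assignments that actually fell inside the gate.
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < GATE]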

18.3 Multi-Sensor Association

Cross-modal association is particularly challenging:

  • A single physical object may produce detections from multiple LiDARs, multiple cameras, and radar simultaneously
  • Association must merge these into a single track
  • Spatial consistency: Detections from different sensors at the same 3D location (within calibration error) are candidates for association
  • Temporal consistency: Detections arriving at similar timestamps with consistent motion are likely from the same object
  • Classification consistency: If LiDAR classifies a cluster as "vehicle" and camera classifies the same region as "vehicle," association confidence increases

19. Track Management

19.1 Track Lifecycle

Track birth:

  • A new detection not associated with any existing track is initialized as a tentative track
  • Tentative tracks must be confirmed by subsequent detections in consecutive frames (e.g., detected in 3 of 5 frames)
  • Confirmation prevents single false detections from creating permanent tracks

Track maintenance:

  • Active tracks are updated with each associated detection via the Kalman filter
  • Track confidence increases with consistent detections across multiple frames and sensor modalities
  • Classification is refined over time: early frames may be ambiguous; as more observations accumulate, classification certainty increases

Track coasting:

  • When a track receives no associated detection in a frame (occlusion, sensor blind spot, missed detection), it coasts -- the state is propagated forward by the process model without measurement update
  • Coasting increases state uncertainty (covariance grows)
  • Maximum coast duration depends on object class and velocity: high-speed vehicles may coast for fewer frames (their predicted position becomes unreliable faster)

Track death:

  • Tracks that coast beyond a maximum duration without re-detection are terminated
  • Tracks that exit the sensor coverage area are terminated
  • Tracks with unreasonably high uncertainty (covariance exceeds threshold) are terminated

19.2 Track-to-Track Fusion

When the dual perception paths produce independent track lists:

  • Track-to-track association: Heuristic and DL path tracks at similar positions are associated
  • State fusion: If both paths track the same object, a fused state estimate (weighted by respective covariances) produces the output track
  • Discrepancy handling: If one path tracks an object the other doesn't, the single-path track is retained with adjusted confidence

20. IMU/GNSS Integration

20.1 Sensor Hardware

Pony.ai's Gen-7 system includes high-accuracy GNSS and IMU as documented in SEC filings:

  • GNSS receiver: Multi-constellation (GPS, GLONASS, BeiDou, Galileo) with RTK correction capability
  • IMU: 6-DOF inertial measurement unit (3-axis accelerometer + 3-axis gyroscope), likely tactical-grade MEMS for automotive application

20.2 IMU Processing (Classical)

The IMU provides high-rate (100-400 Hz) measurements of vehicle acceleration and angular velocity. Classical processing:

Inertial navigation (dead reckoning):

orientation(t+dt) = orientation(t) + gyro_measurement * dt
velocity(t+dt) = velocity(t) + (rotation_matrix * accel_measurement - gravity) * dt
position(t+dt) = position(t) + velocity(t) * dt + 0.5 * acceleration * dt^2

  • Integration of accelerometer data (after removing gravity and rotating to navigation frame) provides velocity and position updates
  • Integration of gyroscope data provides orientation updates
  • Drift problem: Double integration of accelerometer noise causes position error to grow quadratically with time (~meters per minute without correction)
  • Dead reckoning is therefore only useful for short-term bridge periods when other sensors are unavailable (GPS outage in tunnels, LiDAR-denied environments)

Bias estimation:

  • Accelerometer and gyroscope biases are estimated and compensated in real-time
  • Temperature-dependent bias models correct for thermal drift
  • Turn-on bias calibration occurs during vehicle startup (stationary period)

20.3 GNSS Processing (Classical)

Position fixing:

  • Multi-constellation GNSS provides absolute position in WGS84 coordinates
  • RTK correction from base stations or network corrections provides centimeter-level accuracy in open-sky conditions
  • Accuracy degrades in urban canyons, tunnels, underpasses, and dense tree cover

GPS correction of IMU drift:

  • GNSS position fixes reset accumulated IMU drift
  • The Kalman filter (or error-state Kalman filter) continuously estimates and corrects IMU biases using GNSS measurements as a reference
  • When GNSS is unavailable (tunnel, underground parking), the filter extrapolates using IMU-only dead reckoning, with growing uncertainty

20.4 GNSS-IMU Fusion (Classical Kalman Filter)

A tightly-coupled GNSS-IMU fusion filter is standard practice:

  • State vector: Position, velocity, orientation, IMU biases (accelerometer bias, gyroscope bias), GNSS clock offset
  • Process model: IMU-driven state propagation at high rate (100+ Hz)
  • Measurement model: GNSS pseudorange and carrier phase observations at 1-10 Hz
  • Output: Continuous 6-DOF pose estimate at IMU rate, with accuracy bounded by GNSS corrections

This is entirely classical estimation theory -- no neural networks involved.


21. LiDAR-to-Map Matching

21.1 HD Map for Localization

Pony.ai uses HD maps for centimeter-level localization. The HD map contains:

  • 3D point cloud map (dense reference point cloud from prior mapping runs)
  • Lane-level road geometry
  • Traffic infrastructure positions (traffic lights, signs, poles)
  • Building and curb geometry

21.2 Scan Matching Algorithms

Localization achieves centimeter-level accuracy by matching live LiDAR scans to the pre-built HD map point cloud:

Iterative Closest Point (ICP):

  1. For each point in the live scan, find the closest point in the reference map
  2. Compute the rigid transformation (rotation + translation) that minimizes the sum of squared distances between corresponding point pairs
  3. Apply the transformation and iterate until convergence
  4. Variants: point-to-point ICP, point-to-plane ICP (more robust, faster convergence)

Normal Distributions Transform (NDT):

  1. Discretize the reference map into cells (voxels)
  2. For each cell, compute the mean and covariance of contained points (modeling the local surface as a Gaussian distribution)
  3. Score the live scan against the map by evaluating how well each live point fits the Gaussian of its containing cell
  4. Optimize the 6-DOF pose to maximize the total score
  5. NDT is more computationally efficient than ICP for large maps and provides smoother optimization landscapes

Pony.ai likely uses a variant or combination of these methods, potentially with initial alignment from GNSS-IMU and refinement via scan matching.
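
For reference, a bare-bones point-to-point ICP loop looks roughly like this (SVD-based rigid alignment with nearest-neighbor correspondences; real systems add downsampling, outlier rejection, point-to-plane residuals, and a GNSS-IMU initial guess -- this is a sketch, not Pony.ai's implementation):

import numpy as np
from scipy.spatial import cKDTree

def icp_point_to_point(scan, map_pts, iters=20):
    """Align a live scan (N x 3) to a reference map (M x 3); returns R, t."""
    R, t = np.eye(3), np.zeros(3)
    tree = cKDTree(map_pts)
    for _ in range(iters):
        moved = scan @ R.T + t
        _, idx = tree.query(moved)               # nearest map point for each scan point
        target = map_pts[idx]
        # Kabsch: rigid transform minimizing sum of squared correspondence distances
        mu_s, mu_t = moved.mean(axis=0), target.mean(axis=0)
        H = (moved - mu_s).T @ (target - mu_t)
        U, _, Vt = np.linalg.svd(H)
        D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
        R_step = Vt.T @ D @ U.T
        t_step = mu_t - R_step @ mu_s
        # Compose the incremental correction into the accumulated pose
        R, t = R_step @ R, R_step @ t + t_step
    return R, t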

21.3 Feature-Based Matching

Beyond dense point cloud matching, feature-based approaches provide faster and more robust localization:

  • Pole/post detection: Vertical structures (poles, posts, trees) provide stable localization features
  • Curb edge detection: Road boundary geometry matched against map curb positions
  • Building corner matching: Sharp geometric features in the point cloud matched to map building geometry
  • These features are extracted using classical geometric algorithms (vertical line detection, height discontinuity detection, corner detection in 3D)

21.4 SLAM Patent (US11,908,198)

Pony.ai's patent for "Generating graphical illustrations of point cloud frames for SLAM algorithms" describes:

  • A system that obtains sensor data and determines sensor position/orientation using SLAM algorithms
  • Generates graphical illustrations of captured point cloud frames including trajectory points
  • Visualizes loop closure constraints (when the vehicle revisits a previously mapped area, correcting accumulated drift)
  • Fuses GNSS data and IMU data in generating accurate maps
  • Interactive interface for examining and adjusting loop closure constraints

This patent indicates Pony.ai uses graph-based SLAM for map construction, with pose graph optimization and loop closure -- entirely classical optimization techniques (Levenberg-Marquardt or Gauss-Newton optimization on factor graphs).


22. Visual Odometry

22.1 Camera-Based Motion Estimation

Visual odometry provides ego-motion estimates from camera images, serving as an additional input to the localization fusion:

Feature extraction and matching:

  • Classical feature detectors (FAST, ORB, SIFT/SURF variants) identify salient keypoints in consecutive camera frames
  • Descriptor matching between frames establishes point correspondences
  • Outlier rejection via RANSAC eliminates mismatches

Motion estimation:

  • Essential matrix estimation: From matched feature correspondences, the essential matrix encoding relative rotation and translation (up to scale) is computed via the 5-point or 8-point algorithm
  • PnP (Perspective-n-Point): When 3D positions of features are known (from LiDAR depth or previous triangulation), PnP directly estimates the 6-DOF camera pose
  • Scale estimation: Monocular VO has inherent scale ambiguity; resolved by fusing with LiDAR range measurements or IMU-derived velocity
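
An illustrative two-frame monocular VO step using standard OpenCV primitives (K is the camera intrinsic matrix; the returned translation is unit-norm, reflecting the scale ambiguity noted above):

import cv2
import numpy as np

def relative_pose(img_prev, img_curr, K):
    """Estimate up-to-scale camera motion between two grayscale frames."""
    orb = cv2.ORB_create(nfeatures=2000)
    kp1, des1 = orb.detectAndCompute(img_prev, None)
    kp2, des2 = orb.detectAndCompute(img_curr, None)
    # Brute-force Hamming matching with cross-check for ORB binary descriptors
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des1, des2)
    pts1 = np.float32([kp1[m.queryIdx].pt for m in matches])
    pts2 = np.float32([kp2[m.trainIdx].pt for m in matches])
    # RANSAC essential-matrix estimation rejects outlier correspondences
    E, inliers = cv2.findEssentialMat(pts1, pts2, K,
                                      method=cv2.RANSAC, prob=0.999, threshold=1.0)
    # Decompose E into rotation and unit-norm translation (scale is ambiguous)
    _, R, t, _ = cv2.recoverPose(E, pts1, pts2, K, mask=inliers)
    return R, t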

22.2 Role in Pony.ai's Stack

Visual odometry likely serves as a secondary motion estimation source:

  • Primary localization comes from LiDAR-to-map matching (centimeter accuracy) and GNSS-IMU
  • VO provides additional motion estimates, particularly useful when LiDAR features are sparse (e.g., featureless highway stretches) but the scene remains visually rich
  • VO is entirely classical signal processing and geometric computation

23. Multi-Sensor Localization Fusion

23.1 Fusion Architecture

Pony.ai achieves centimeter-level localization through multi-sensor fusion. The system is described as using "multi-sensor fusion of rich datasets for understanding the static environment" to achieve centimeter-level accuracy. The sensors feeding localization include GNSS, IMU, LiDAR (scan matching), cameras (visual odometry), and wheel odometry.

23.2 Classical Fusion Filter

The localization fusion is typically implemented as a multi-state Kalman filter or a factor graph optimization:

Error-State Extended Kalman Filter (ES-EKF):

  • State: 6-DOF pose (position + orientation), velocity, IMU biases
  • Prediction: IMU-driven at 100+ Hz
  • Updates from multiple asynchronous sources:
    • GNSS: 1-10 Hz, absolute position (when available)
    • LiDAR scan matching: 10 Hz, relative or absolute pose correction
    • Visual odometry: 10-30 Hz, relative motion
    • Wheel odometry: continuous, forward velocity
  • Each update source has its own measurement model and noise covariance
  • The filter seamlessly handles sensor dropouts (e.g., GPS loss in tunnel) by continuing on remaining sensors
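
A deliberately simplified sketch of this multi-rate structure (2-D constant-velocity state with position-only corrections; a real error-state filter also carries orientation and IMU biases, and the noise values here are placeholders):

import numpy as np

class LocalizationKF:
    """Toy 2-D filter: state [x, y, vx, vy], asynchronous position updates."""

    def __init__(self):
        self.x = np.zeros(4)
        self.P = np.eye(4) * 10.0
        self.H = np.array([[1., 0., 0., 0.],
                           [0., 1., 0., 0.]])        # position measurement model
        # Per-source measurement noise (std dev, metres): GNSS-RTK, LiDAR matching, VO
        self.R_source = {"gnss": 0.05, "lidar": 0.10, "vo": 0.50}

    def predict(self, dt, accel_xy, q=0.5):
        """High-rate prediction driven by IMU-derived planar acceleration."""
        F = np.array([[1., 0., dt, 0.],
                      [0., 1., 0., dt],
                      [0., 0., 1., 0.],
                      [0., 0., 0., 1.]])
        B = np.array([[0.5*dt*dt, 0.], [0., 0.5*dt*dt], [dt, 0.], [0., dt]])
        self.x = F @ self.x + B @ np.asarray(accel_xy)
        self.P = F @ self.P @ F.T + np.eye(4) * q * dt

    def update(self, pos_xy, source):
        """Position correction from any available source; dropouts are simply skipped."""
        R = np.eye(2) * self.R_source[source] ** 2
        y = np.asarray(pos_xy) - self.H @ self.x               # innovation
        S = self.H @ self.P @ self.H.T + R
        K = self.P @ self.H.T @ np.linalg.inv(S)               # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ self.H) @ self.P

Calling predict() at the IMU rate and update() whenever a GNSS, scan-matching, or VO fix arrives reproduces the asynchronous behavior described above.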

Factor Graph Optimization (alternative or complement):

  • Pose graph with nodes at each timestep and edges representing:
    • IMU preintegration factors (high-rate relative motion)
    • GNSS absolute position factors
    • LiDAR scan matching relative pose factors
    • Loop closure factors (from SLAM, if applicable)
  • Batch or sliding-window optimization provides globally consistent trajectory estimates
  • More accurate than filtering but computationally more expensive

23.3 GPS-Denied Localization

In GPS-denied environments (tunnels, underground parking, dense urban canyons):

  • LiDAR-to-map matching + IMU dead reckoning provide continuous localization
  • Pre-mapped tunnel/parking structure maps enable LiDAR matching even without GPS
  • Localization uncertainty grows during GPS outages but remains within safe bounds for the duration of typical outages

24. Mixed Traffic Rule-Based Handling

24.1 Chinese Urban Traffic Context

Pony.ai's software is explicitly described as "a combination of AI models and rule-based code, designed to interpret traffic patterns, predict behaviors, and execute driving decisions." The rule-based code is essential for handling China's unique traffic conditions:

24.2 Rule-Based E-Bike/Scooter Handling

E-bikes are the most challenging road user class in Chinese cities. Rule-based handling:

| Behavior | Rule-Based Response |
| --- | --- |
| E-bike riding in vehicle lane | Classify as VRU; increase lateral clearance buffer to 1.5m+; reduce speed |
| E-bike crossing against traffic signal | Apply conservative yield; treat as potential red-light violator; maintain enlarged safety zone |
| E-bike with oversized cargo (delivery packages) | Increase bounding box extent beyond detected boundaries; apply wider clearance |
| E-bike emerging from between parked vehicles | Apply occlusion-aware safety zone; reduce speed in narrow passages with parked vehicles |
| Multiple e-bikes in cluster | Track as group; apply group-level motion prediction with expanded safety zone |
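
As an illustration of how such rules might be encoded (a hypothetical sketch; the attribute names and buffer values mirror the table above and are not Pony.ai parameters):

def vru_lateral_clearance(obj):
    """Pick a lateral clearance buffer (metres) for a tracked e-bike/VRU.

    `obj` is assumed to expose lane occupancy, cargo, and occlusion context
    from the tracking layer; thresholds are illustrative only.
    """
    clearance = 1.0                        # nominal VRU buffer
    if obj.in_vehicle_lane:
        clearance = max(clearance, 1.5)    # e-bike sharing the vehicle lane
    if obj.oversized_cargo:
        clearance += 0.5                   # delivery load extends beyond detected box
    if obj.emerging_from_occlusion:
        clearance += 0.5                   # occlusion-aware safety margin
    return clearance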

24.3 Three-Wheeler and Non-Standard Vehicle Rules

  • Size variance handling: Three-wheelers range from small enclosed vehicles to large open cargo platforms. Rule-based size filters widen acceptance ranges beyond standard vehicle templates
  • Overloaded vehicle rules: When detected object dimensions exceed standard vehicle bounds, apply extended clearance zones
  • Slow-moving vehicle rules: Vehicles moving significantly slower than traffic flow trigger enhanced monitoring and safe passing behavior

24.4 Pedestrian Rule-Based Handling

  • Jaywalking prediction zones: Near certain locations (bus stops, shopping areas, median breaks), pedestrian detection zones are enlarged and yield behavior is activated at lower confidence thresholds
  • Group crossing rules: When multiple pedestrians are detected in proximity, assume coordinated group movement and yield to the entire group
  • Crosswalk state machines: Pedestrians approaching a crosswalk trigger proactive yield behavior even before entering the crosswalk

25. Non-Standard Infrastructure Handling

25.1 Missing or Faded Lane Markings

Classical handling when lane markings are absent or undetectable:

  1. HD map fallback: Lane geometry from the map provides lane positions when online detection fails
  2. Road boundary detection: Curb edges and barriers detected from LiDAR height discontinuities define road extent
  3. Vehicle trajectory following: Other vehicles' trajectories (from tracking) imply lane structure in unmarked areas
  4. Width estimation from road boundaries: When boundaries are detected but markings are not, lane positions are inferred from road width and standard lane width assumptions

25.2 Construction Zone Classical Handling

Pony.ai has published a technical blog post specifically about construction zone handling, revealing classical components:

Live Semantic Map:

  • The system constructs a real-time live semantic map to detect non-movable obstacles (cones, barriers, construction equipment)
  • This map persists obstacle positions even when they are temporarily occluded
  • When construction cones are occluded by passing trucks, the live semantic map remembers their positions

Cone Boundary Formation:

  • Detected construction cones are connected to form boundaries in the perception system
  • This creates a virtual "wall" preventing the vehicle from entering the construction area
  • The boundary formation is a classical geometric operation: connecting detected cone positions in sequence using nearest-neighbor ordering
  • The vehicle then navigates within the corridor formed by cones on both sides, maintaining safe distance from the boundaries
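
A simple greedy nearest-neighbor ordering of detected cone centers into a boundary polyline could look like this (illustrative sketch; a production system would also split chains at large gaps and handle cones on each side of the corridor separately):

import numpy as np

def order_cones(cones):
    """Greedily chain detected cone positions (N x 2) into a boundary polyline."""
    cones = np.asarray(cones, dtype=float)
    remaining = list(range(len(cones)))
    # Start from the cone closest to the ego vehicle (assumed at the origin)
    current = min(remaining, key=lambda i: np.linalg.norm(cones[i]))
    order = [current]
    remaining.remove(current)
    while remaining:
        # Always step to the nearest not-yet-used cone
        nxt = min(remaining, key=lambda i: np.linalg.norm(cones[i] - cones[current]))
        order.append(nxt)
        remaining.remove(nxt)
        current = nxt
    return cones[order]        # ordered vertices of the virtual "wall"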

Map Discrepancy Detection:

  • When perceived road geometry (from online perception) differs significantly from the HD map, the system infers a construction zone or road modification
  • This triggers a switch to a more conservative driving mode relying on real-time perception rather than map-based priors

25.3 Non-Standard Road Geometry

  • Unmarked intersections: Dead-reckoning through the intersection using the HD map's intersection topology
  • Temporary lane shifts: Detected via cone/barrier positions forming new lane boundaries different from the HD map
  • Narrow hutong/alley navigation: Near-range LiDAR (5 body-mounted units) provides 10 cm-level clearance detection; occupancy grid-based path planning ensures collision avoidance

26. ML Detections to Classical Tracking

26.1 The Handoff Interface

The deep learning perception path produces per-frame detections:

  • 3D bounding boxes with class labels, confidence scores, heading angles
  • These are "instantaneous snapshots" -- no temporal continuity

These detections are fed into the classical tracking pipeline (Sections 17-19):

Per-frame DL detections (3D bbox, class, confidence)
    |
    v
Data Association (Hungarian algorithm / JPDA)
    |-- Match to existing tracks (Mahalanobis distance gating)
    |-- Create new tracks for unmatched detections
    |-- Age tracks without matches
    |
    v
Kalman Filter State Update
    |-- Incorporate DL detection as measurement
    |-- Measurement noise set by DL confidence
    |
    v
Track Management (birth / coast / death)
    |
    v
Output: Smoothed tracks with filtered position, velocity,
        classification history, predicted trajectories

26.2 Confidence-Weighted Updates

The DL detection confidence score modulates the Kalman filter measurement noise:

  • High confidence DL detection: Low measurement noise covariance -> filter trusts the detection strongly
  • Low confidence DL detection: High measurement noise covariance -> filter relies more on its prediction
  • This allows the classical filter to gracefully handle uncertain DL outputs
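
One simple way to realize this mapping (the inverse-square-root scaling and constants are illustrative assumptions, not documented values):

import numpy as np

def measurement_noise_from_confidence(conf, base_sigma=0.3, min_conf=0.1):
    """Map a DL detection confidence in (0, 1] to a position measurement covariance.

    Higher confidence -> smaller sigma -> the Kalman update trusts the detection more.
    """
    conf = max(conf, min_conf)                 # avoid dividing by near-zero confidence
    sigma = base_sigma / np.sqrt(conf)         # illustrative scaling law
    return np.eye(2) * sigma ** 2              # R used in the Kalman update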

26.3 Classification Refinement

DL classification (vehicle, pedestrian, cyclist, etc.) feeds into the track's classification history:

  • A Bayesian classification accumulator counts votes from each frame's DL classification
  • Over multiple frames, the most likely class emerges with high confidence
  • This filters out single-frame misclassifications that are common in DL perception
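
A per-track accumulator along these lines (a simple Dirichlet-style count model; the class list and prior are placeholders):

import numpy as np

CLASSES = ["vehicle", "pedestrian", "cyclist", "unknown"]

class ClassAccumulator:
    """Accumulate per-frame DL class scores into a stable per-track class belief."""

    def __init__(self, prior=1.0):
        # Dirichlet-style pseudo-counts, one per class
        self.counts = np.full(len(CLASSES), prior)

    def update(self, frame_scores):
        """frame_scores: per-class confidence vector from this frame's DL detection."""
        self.counts += np.asarray(frame_scores)

    def best_class(self):
        posterior = self.counts / self.counts.sum()
        return CLASSES[int(np.argmax(posterior))], float(posterior.max())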

27. Classical Preprocessing to ML

27.1 LiDAR Preprocessing Before Neural Networks

Before point clouds reach the DL detection networks, classical preprocessing has already been applied:

  1. Time synchronization and motion compensation (Section 9): Nanosecond-level sync and IMU-based motion correction
  2. Multi-LiDAR point cloud merging: Extrinsic calibration transforms all points to vehicle frame
  3. Ground plane estimation (RANSAC): Separating ground from obstacles improves 3D detection by reducing background clutter
  4. Noise filtering: Statistical outlier removal, IPE-flagged point removal
  5. Voxelization or pillar encoding: Discretizing the continuous point cloud into a structured grid for neural network consumption (VoxelNet-style or PointPillars-style encoding)

These classical steps ensure the neural network receives clean, organized, time-aligned data rather than raw, noisy, desynchronized sensor packets.
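
A minimal pillar-style discretization illustrating the last step (grid extents and resolution are placeholders; the learned encoder that consumes these pillars is outside the scope of this sketch):

import numpy as np

def pillarize(points, x_range=(0.0, 70.0), y_range=(-40.0, 40.0), res=0.25):
    """Group an (N, 4) point cloud [x, y, z, intensity] into 2-D pillars.

    Returns a dict mapping (ix, iy) grid cells to the points falling in them,
    which a PointPillars-style encoder would then turn into fixed-size features.
    """
    mask = ((points[:, 0] >= x_range[0]) & (points[:, 0] < x_range[1]) &
            (points[:, 1] >= y_range[0]) & (points[:, 1] < y_range[1]))
    pts = points[mask]
    ix = ((pts[:, 0] - x_range[0]) / res).astype(int)
    iy = ((pts[:, 1] - y_range[0]) / res).astype(int)
    pillars = {}
    for cell, p in zip(zip(ix, iy), pts):
        pillars.setdefault(cell, []).append(p)
    return pillars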

27.2 Camera Preprocessing Before Neural Networks

  1. ISP pipeline (Section 8): Debayering, exposure correction, white balance, noise reduction, tone mapping
  2. YUV420 format: Native ISP output, no conversion overhead
  3. GPU-resident data: Camera frames arrive in GPU memory via NvStreams/GPU Direct RDMA, ready for neural network inference
  4. HEVC encoding: For recording/playback; neural networks consume the raw YUV frames directly
  5. Region of Interest extraction: For specific tasks (traffic light detection), HD map-guided ROI extraction crops the input before feeding the network

27.3 Radar Preprocessing Before Any ML

All radar signal processing (Section 7) -- range-Doppler FFT, CFAR detection, beamforming, DOA estimation -- is classical and occurs before any potential neural network processing of radar data. The neural network (if used for radar object classification) receives already-detected target lists with range, velocity, and angle, not raw ADC samples.


28. Kinematic Feasibility Checking

28.1 Classical Motion Model Constraints on ML Predictions

The prediction module produces future trajectory predictions for other road agents. Classical kinematic feasibility checking validates these predictions:

Vehicle kinematic constraints:

  • Maximum steering angle: Vehicles cannot turn tighter than their minimum turning radius (typically 5-6 m for passenger cars)
  • Maximum acceleration/deceleration: Physical limits on longitudinal acceleration (typically 0.3-0.8g for normal driving, up to ~1g for emergency braking)
  • Maximum lateral acceleration: Tire friction limits (typically 0.3-0.5g for comfortable driving)
  • Maximum yaw rate: Coupled to velocity and steering angle via bicycle model kinematics

Feasibility check:

  • Each predicted trajectory waypoint is checked against kinematic constraints:
    curvature(t) <= 1 / min_turning_radius
    longitudinal_accel(t) <= max_accel
    lateral_accel(t) = v(t)^2 * curvature(t) <= max_lateral_accel
  • Trajectories violating these constraints are either clipped to the constraint boundary or discarded
  • This prevents the ML prediction module from producing physically impossible trajectories (e.g., a truck turning on a dime)
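
The check itself reduces to a few inequalities per waypoint, e.g. (limit values below are the representative figures quoted above, not Pony.ai's calibrated bounds):

import numpy as np

def trajectory_is_feasible(waypoints, dt,
                           min_turn_radius=5.0,      # m, passenger-car scale
                           max_long_accel=8.0,       # m/s^2 (~0.8 g)
                           max_lat_accel=5.0):       # m/s^2 (~0.5 g)
    """Check a predicted trajectory (list of (x, y) at fixed dt) against kinematic limits."""
    wp = np.asarray(waypoints, dtype=float)
    v = np.linalg.norm(np.diff(wp, axis=0), axis=1) / dt          # speed per segment
    a_long = np.diff(v) / dt                                      # longitudinal accel
    # Discrete curvature from heading change per unit arc length
    headings = np.arctan2(np.diff(wp[:, 1]), np.diff(wp[:, 0]))
    dtheta = np.abs(np.diff(np.unwrap(headings)))
    seg_len = np.maximum(v[1:] * dt, 1e-6)
    curvature = dtheta / seg_len
    a_lat = v[1:] ** 2 * curvature                                # v^2 * kappa
    return (np.all(curvature <= 1.0 / min_turn_radius) and
            np.all(np.abs(a_long) <= max_long_accel) and
            np.all(a_lat <= max_lat_accel))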

Class-specific models:

  • Vehicle: Bicycle kinematic model with steering constraints
  • Pedestrian: Point-mass model with maximum acceleration limits (~2 m/s^2)
  • E-bike: Bicycle model with higher acceleration and tighter turning radius than cars
  • Truck: Extended bicycle model with longer wheelbase and wider minimum turning radius

28.2 Ego-Vehicle Trajectory Validation

Classical kinematic checking also validates the planning module's proposed ego trajectories:

  • Every planned trajectory is checked against the vehicle's physical capabilities
  • This is a classical safety check that runs independently of any ML components in the planning pipeline

29. HD Map Integration

29.1 Classical Map Features Informing Perception

The HD map provides strong classical priors that constrain and enhance perception:

| Map Feature | Perception Use |
| --- | --- |
| Lane geometry (center lines, boundaries) | Constrains lane detection; provides geometry when markings are faded |
| Traffic light 3D positions | Generates camera ROIs for traffic light detection; associates detections with correct signal groups |
| Traffic sign positions | Guides sign detection search areas; validates sign classification |
| Speed limits per road segment | Informs kinematic feasibility bounds for prediction |
| Intersection topology | Defines valid turning paths; constrains prediction trajectories to map-legal maneuvers |
| Road boundary geometry | Constrains drivable area estimation; provides curb positions even when LiDAR detection is unreliable |
| Crosswalk positions | Triggers enhanced pedestrian monitoring in classical pipeline |
| Stop line positions | Provides precise stopping points for traffic signal compliance |

29.2 Dynamic Map Updating (US Patent 11,885,624)

Pony.ai's patent for a "Dynamic Map Updating System" describes a classical method for maintaining map accuracy:

  1. Entity change identification: The system identifies map entities that change over time (construction zones, road modifications, new buildings)
  2. Change prediction: Predicts the amount of change over time
  3. Map update: Updates the map based on predicted changes
  4. Vehicle navigation: Adjusts navigation based on updated map

This enables the perception system to anticipate map changes rather than only react to discrepancies between perception and outdated map data.

29.3 Map-Perception Discrepancy Detection

When online perception disagrees with the HD map, classical rules determine the response:

  • Minor discrepancy (< threshold): Likely noise or calibration error; trust the map
  • Persistent discrepancy (sustained over distance/time): Likely real-world change; flag for map update; trust perception
  • Major discrepancy (road blocked, lane closed): Trigger construction zone mode; rely entirely on real-time perception; apply conservative driving behavior

30. 1,000+ Monitoring Mechanisms

30.1 Architecture

Pony.ai states that "based on ISO 26262 functional safety methodology, more than a thousand monitoring mechanisms run in parallel with normal functions, with failure mode and safety state fully taken into consideration."

These monitors are primarily rule-based/classical -- they check deterministic conditions and thresholds rather than using learned models.

30.2 Categories of Monitors

Sensor Health Monitors:

  • LiDAR point cloud rate (each of 9 LiDARs must produce points above minimum threshold)
  • LiDAR point cloud statistics (range histogram, intensity distribution -- sudden changes indicate sensor degradation)
  • Camera frame rate and exposure quality (overexposed, underexposed, frozen frame, black frame)
  • Radar detection rate and statistics
  • GNSS signal quality (number of satellites, HDOP, RTK fix status)
  • IMU measurement validity (accelerometer/gyroscope range, bias drift rate)
  • Microphone audio level (silence detection for acoustic sensor health)
  • Water sensor state (precipitation detection)

Perception Health Monitors:

  • Detection count per frame (sudden drop indicates perception failure)
  • Detection latency (inference time exceeds deadline)
  • DL model output validity (NaN check, bounding box dimension sanity, class confidence distribution)
  • Heuristic path output validity (cluster count, ground plane fit quality)
  • Dual-path consistency (sustained disagreement triggers alarm)
  • Track continuity (tracks disappearing/appearing anomalously)

Calibration Monitors:

  • LiDAR-camera reprojection error statistics
  • Radar Doppler residuals (deviation from expected stationary object velocities)
  • Sudden calibration jumps (impact detection -> recalibration trigger)
  • Inter-LiDAR consistency (overlapping FOV regions should produce consistent point clouds)

Localization Monitors:

  • GNSS-IMU filter innovation sequence (large innovations indicate inconsistency)
  • LiDAR scan matching score (low matching quality indicates localization uncertainty)
  • Localization covariance bounds (position uncertainty exceeds safe threshold)
  • Map-relative position validity (vehicle position within mapped road boundaries)

Compute Health Monitors:

  • GPU temperature, utilization, memory usage
  • CPU load and scheduling latency
  • PCIe bus error rates
  • Memory ECC error counts
  • Inference engine status (TensorRT session health)
  • Power supply monitoring

Communication Monitors:

  • Sensor-to-compute data bus integrity
  • Inter-chip communication (main Orin <-> redundant Orin)
  • Vehicle CAN bus communication
  • Cellular network connectivity (for remote assistance)

Vehicle State Monitors:

  • Wheel speed sensor consistency
  • Steering angle sensor health
  • Brake system pressure monitoring
  • Drive-by-wire (DBW) system status

30.3 Monitor Architecture

Each monitor is a classical rule-based checker:

IF condition_violated(threshold, duration):
    report_fault(severity, subsystem)
    IF severity >= CRITICAL:
        trigger_degradation(appropriate_level)

Monitors run at the frequency of their monitored subsystem (e.g., perception monitors at 10 Hz, IMU monitors at 100+ Hz) and are independent of the main perception/planning pipeline -- they cannot be bypassed by a perception failure.
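
A generic threshold-with-debounce monitor in this style might look like the following (class and field names are illustrative; the fault-reporting and degradation hooks are left abstract):

import time

class ThresholdMonitor:
    """Rule-based monitor: raise a fault when a signal violates its threshold
    continuously for longer than a debounce duration."""

    def __init__(self, name, threshold, max_violation_s, severity):
        self.name = name
        self.threshold = threshold
        self.max_violation_s = max_violation_s
        self.severity = severity
        self._violation_start = None

    def check(self, value, now=None):
        """Call at the monitored subsystem's rate; returns a fault dict or None."""
        now = time.monotonic() if now is None else now
        if value > self.threshold:
            if self._violation_start is None:
                self._violation_start = now
            elif now - self._violation_start >= self.max_violation_s:
                return {"monitor": self.name, "severity": self.severity, "value": value}
        else:
            self._violation_start = None
        return None

# e.g. flag DL inference latency above 100 ms sustained for 0.5 s
latency_monitor = ThresholdMonitor("dl_inference_latency", 0.100, 0.5, "CRITICAL")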


31. 20+ Safety Redundancies

31.1 Complete Redundancy Taxonomy

Pony.ai implements 20+ safety redundancies across 4 categories:

Category 1: Software System Redundancy (7 Types)

| # | Redundancy | Classical/Rule-Based Component |
| --- | --- | --- |
| 1 | Multi-Layer Degradation System | Rule-based state machine governing transitions between Normal -> Degraded -> Minimal Risk Condition based on fault severity |
| 2 | Fault Detection and System Arbitration Module | Rule-based monitors that detect faults and arbitrate between main and fallback systems; threshold-based checks |
| 3 | Heterogeneous Algorithm on Main and Fallback System | The dual perception architecture itself -- heuristic (classical) vs. DL (learned); different algorithmic paradigms provide diversity |
| 4 | Communication Redundancy on Main and Fallback System | Rule-based monitoring of communication buses; automatic switchover protocols |
| 5 | Trajectory Cross-Validation Redundancy | Classical comparison of trajectories generated by main and fallback planning; geometric consistency checks |
| 6 | Multi-Sensor Fusion Perception & Localization Redundancy | Classical sensor fusion (Kalman filter based) combining multiple sensor modalities; continues operating when individual sensors fail |
| 7 | Multi-Algorithm Fusion Redundancy of Key ADS Modules | Running multiple algorithmic implementations (classical + ML) of critical functions and cross-validating outputs |

Category 2: Hardware Component Redundancy (7 Types)

| # | Redundancy | Description |
| --- | --- | --- |
| 1 | N x 360-degree FOV coverage | Overlapping sensor fields of view; any single sensor failure still leaves coverage from adjacent sensors |
| 2 | Redundant computing units | 3 main OrinX chips + 1 dedicated redundant OrinX chip; MRCC provides tertiary compute |
| 3 | Redundant localization sensors | Multiple GNSS receivers and IMU units; continues localizing with partial failures |
| 4 | Redundant cellular communications | Multiple cellular modems for remote assistance and OTA; continues communicating with single modem failure |
| 5 | Redundant accident detection | Collision sensor + multiple sensor modalities that can independently detect impacts |
| 6 | Redundant data storage | Multiple storage units for logging; ensures data preservation for incident reconstruction |
| 7 | Redundant sensor cleaning | Self-developed cleaning system with redundant mechanisms for maintaining sensor clarity |

Category 3: Vehicle Platform Redundancy (5 Types)

| # | Redundancy | Description |
| --- | --- | --- |
| 1 | Parking brake system | Independent parking brake for vehicle securement if primary braking fails |
| 2 | Steering system | Redundant steering actuators/controllers; vehicle remains steerable with single actuator failure |
| 3 | Braking system | Redundant brake circuits; vehicle can stop safely with partial brake failure |
| 4 | Power supply | Redundant power distribution; the ADC supports both liquid and passive cooling, with passive cooling enabling safe vehicle control if liquid cooling fails |
| 5 | Drive-by-wire (DBW) system | Redundant DBW controllers ensuring electronic throttle/brake/steering commands are reliable |

Category 4: Service Redundancy (3 Types)

| # | Redundancy | Description |
| --- | --- | --- |
| 1 | External safety warnings | Multiple alert modalities (visual, acoustic) to warn other road users; includes patented directed acoustic alert (US10,647,250) and directed visual alert (US10,726,687) |
| 2 | Cellphone NFC unlock | Backup vehicle access method if primary digital unlock fails |
| 3 | Emergency call system | Independent emergency communication channel; passengers can reach human operators even if main compute fails |

31.2 MRCC (Minimum Risk Condition Controller)

The MRCC is the ultimate safety fallback -- a separate compute system that can maintain vehicle control even when the primary autonomous driving system has completely failed:

  • Hardware: Runs on the dedicated 4th OrinX chip (separate from the 3 main chips)
  • Perception capability: Maintains critical perception including blind-spot coverage using a subset of sensors connected to the redundant system
  • Driving capability: Can "navigate intersections or ramps and safely pull over, minimizing the risk of traffic disruption or collisions"
  • Independence: Operates even when "main system's power or chassis communication fails"
  • Cooling independence: Passive cooling backup ensures the MRCC remains operational if liquid cooling fails

The MRCC's perception is necessarily simpler than the main system (running on a single OrinX chip vs. three), likely relying heavily on classical/heuristic algorithms that require less compute than full DL inference.


Sources

Public research notes collected from public sources, including Pony.ai primary materials (safety report, SEC filings, technology pages), NVIDIA technical documentation, sensor technology documentation, patent filings, industry analysis, and radar signal processing references.