Pony.ai Non-ML & Hybrid-ML Perception Stack: Exhaustive Technical Deep Dive
Last updated: March 15, 2026
Table of Contents
- Dual Perception Architecture
- Heuristic Perception Path
- Deep Learning Perception Path
- Fusion and Arbitration
- Why Dual Paths
- Hesai AT128 LiDAR Signal Processing
- Radar Signal Processing
- Camera ISP and Preprocessing Pipeline
- Time Synchronization
- Multi-Sensor Calibration
- Online Calibration
- Manufacturing Calibration
- Point Cloud Processing
- Free Space Estimation
- Road Geometry Detection
- Traffic Signal Processing
- Kalman Filtering and State Estimation
- Data Association
- Track Management
- IMU/GNSS Integration
- LiDAR-to-Map Matching
- Visual Odometry
- Multi-Sensor Localization Fusion
- Mixed Traffic Rule-Based Handling
- Non-Standard Infrastructure Handling
- ML Detections to Classical Tracking
- Classical Preprocessing to ML
- Kinematic Feasibility Checking
- HD Map Integration
- 1,000+ Monitoring Mechanisms
- 20+ Safety Redundancies
1. Dual Perception Architecture
Architectural Overview
Pony.ai's perception module is explicitly architected around a dual-path design that runs a heuristic (rule-based / classical) perception pipeline in parallel with a deep-learning perception pipeline. This is documented in the company's safety report, SEC filings (F-1, 20-F), and technology pages. The company states that its "Perception module combines the strengths of a heuristic approach and deep learning models to boost performance, while ensuring the safety and operational redundancy of the vehicles."
This is not a simple ensemble or a sequential pipeline where one feeds the other. Both paths run concurrently on every sensor cycle, producing independent perception outputs that are then compared and arbitrated. The architecture is a safety-critical design decision rooted in ISO 26262 functional safety methodology and maps to the "Multi-Algorithm Fusion Redundancy of Key ADS Modules" -- one of Pony.ai's seven types of software system redundancy.
Safety Motivation
The dual-path architecture directly implements Pony.ai's core safety principle:
- Single-point failure: The vehicle can continue to operate safely. If the deep learning path fails entirely (GPU crash, model produces NaN, inference timeout), the heuristic path provides baseline perception.
- Dual-point failure: The vehicle can park safely (minimal risk condition). If both perception paths fail, the MRCC (Minimum Risk Condition Controller) uses its own redundant perception to execute a safe stop.
Operational Modes
| Mode | Condition | Perception Capability |
|---|---|---|
| Normal Operation | All systems healthy | Full dual-path perception, multi-sensor fusion, all 34 sensors active |
| Degraded Safe Mode | Single-point failure (one LiDAR offline, DNN inference failure, single GPU crash) | Heuristic path + remaining sensors maintain safe driving; system may reduce speed or restrict operational domain |
| Minimal Risk Condition | Dual-point failure (multiple sensor failures, both perception paths degraded) | Emergency perception via MRCC redundant system; critical blind-spot coverage maintained; vehicle navigates intersections/ramps and pulls over safely |
2. Heuristic Perception Path
The heuristic path implements classical algorithmic approaches to perception that do not depend on trained neural network weights. Based on Pony.ai's safety report, patent filings, SEC disclosures, and job posting requirements, the heuristic path comprises the following components:
2.1 Point Cloud Clustering and Segmentation
Ground Plane Estimation:
- RANSAC (Random Sample Consensus) plane fitting to identify the dominant ground plane in each LiDAR scan
- The PCL-style implementation constrains the plane fit to near-horizontal orientations (within angular tolerance of gravity vector), rejecting wall surfaces and ramps
- Inlier points classified as ground are removed; remaining points represent potential obstacles
- Critical for separating drivable surface from above-ground objects
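The ground-fitting steps above can be sketched as follows. This is a minimal illustration, not Pony.ai's implementation; the iteration count, inlier distance, and angular tolerance are placeholder values:

```python
import numpy as np

def ransac_ground_plane(points, n_iters=200, dist_thresh=0.15,
                        max_tilt_deg=10.0, rng=None):
    """Fit a near-horizontal plane n.p + d = 0 to a point cloud with RANSAC.

    Candidate planes tilted more than max_tilt_deg from the gravity
    vector (+z up) are rejected, so walls and steep ramps never win the
    vote. Returns (normal, d, inlier_mask); inliers are ground points.
    """
    rng = rng or np.random.default_rng(0)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = (np.array([0.0, 0.0, 1.0]), 0.0)
    cos_max_tilt = np.cos(np.radians(max_tilt_deg))
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:
            continue  # degenerate (collinear) sample
        normal /= norm
        if normal[2] < 0:
            normal = -normal  # orient upward
        if normal[2] < cos_max_tilt:
            continue  # too tilted relative to gravity: reject
        d = -normal @ sample[0]
        inliers = np.abs(points @ normal + d) < dist_thresh
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, (normal, d)
    return best_plane[0], best_plane[1], best_inliers
```

Points flagged as inliers are removed before clustering; everything else is a candidate obstacle.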
Voxel Grid Downsampling:
- Raw point clouds from four AT128 LiDARs generate >6.12 million points/second combined
- Voxel grid filtering reduces point density while preserving spatial structure
- Enables tractable processing for downstream clustering without GPU-intensive neural networks
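A minimal voxel-grid centroid filter, assuming a placeholder 0.2 m voxel size (the actual grid resolution is not disclosed):

```python
import numpy as np

def voxel_downsample(points, voxel_size=0.2):
    """Voxel-grid downsampling: replace all points falling in the same
    voxel with their centroid, preserving coarse spatial structure
    while cutting point count."""
    keys = np.floor(points / voxel_size).astype(np.int64)
    # inverse maps each point to the row of its unique voxel
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True,
                                   return_counts=True)
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)   # accumulate per-voxel sums
    return sums / counts[:, None]      # centroids
```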
Connected-Component / Euclidean Clustering:
- After ground removal, Euclidean clustering separates remaining point clouds into distinct object clusters
- KD-tree-based nearest neighbor search identifies points within a distance threshold as belonging to the same cluster
- Minimum and maximum cluster size bounds filter out noise (too few points) and merge artifacts (clusters too large to be single objects)
- Each cluster represents a candidate obstacle
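A region-growing sketch of the clustering step, using SciPy's `cKDTree` for the neighbor queries; the distance tolerance and size bounds are illustrative placeholders:

```python
import numpy as np
from scipy.spatial import cKDTree

def euclidean_cluster(points, tol=0.5, min_size=5, max_size=10000):
    """KD-tree Euclidean clustering: points within `tol` of each other
    grow into the same cluster; clusters outside [min_size, max_size]
    are discarded as noise or merge artifacts. Returns index arrays."""
    tree = cKDTree(points)
    visited = np.zeros(len(points), dtype=bool)
    clusters = []
    for seed in range(len(points)):
        if visited[seed]:
            continue
        frontier, members = [seed], []
        visited[seed] = True
        while frontier:
            idx = frontier.pop()
            members.append(idx)
            for nb in tree.query_ball_point(points[idx], tol):
                if not visited[nb]:
                    visited[nb] = True
                    frontier.append(nb)
        if min_size <= len(members) <= max_size:
            clusters.append(np.array(members))
    return clusters
```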
2.2 Rule-Based Detection (Bounding Box Fitting)
- Principal Component Analysis (PCA): Applied to each point cluster to determine object orientation and dimensions
- L-shape fitting: For partially visible objects (e.g., vehicle seen from one side), L-shape model fitting estimates the full 3D bounding box from visible edges
- Model fitting: Geometric primitives (rectangles for vehicles, cylinders for pedestrians/poles) are fit to clusters
- Size-based classification: Cluster dimensions are compared against known size ranges for object classes (car: ~4.5 x 1.8 x 1.5m; truck: ~12 x 2.5 x 3.5m; pedestrian: ~0.5 x 0.5 x 1.7m)
- Height-based filtering: Objects below a minimum height threshold are classified as ground artifacts; objects above road-surface height are classified as above-ground obstacles
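The PCA fit and size-based gates can be sketched together. The size ranges below extrapolate loosely from the nominal dimensions quoted above and are not Pony.ai's actual thresholds:

```python
import numpy as np

def fit_oriented_bbox(cluster):
    """PCA-based oriented bounding box in the ground (x-y) plane.

    Yaw comes from the dominant eigenvector of the x-y covariance;
    extents from the point spread in the rotated frame.
    Returns (center, (length, width, height), yaw)."""
    xy = cluster[:, :2]
    mean = xy.mean(axis=0)
    eigvals, eigvecs = np.linalg.eigh(np.cov((xy - mean).T))
    major = eigvecs[:, np.argmax(eigvals)]        # dominant direction
    yaw = np.arctan2(major[1], major[0])
    rot = np.array([[np.cos(-yaw), -np.sin(-yaw)],
                    [np.sin(-yaw),  np.cos(-yaw)]])
    local = (xy - mean) @ rot.T                   # de-rotate
    length, width = local.max(0) - local.min(0)
    height = cluster[:, 2].max() - cluster[:, 2].min()
    center = np.array([*mean, cluster[:, 2].min() + height / 2])
    return center, (length, width, height), yaw

# Illustrative size gates derived from the nominal class dimensions
SIZE_RANGES = {
    "pedestrian": ((0.2, 1.0), (0.2, 1.0), (1.0, 2.2)),
    "car":        ((3.0, 6.0), (1.4, 2.2), (1.0, 2.0)),
    "truck":      ((6.0, 16.0), (2.0, 3.0), (2.0, 4.5)),
}

def classify_by_size(dims):
    for label, ranges in SIZE_RANGES.items():
        if all(lo <= d <= hi for d, (lo, hi) in zip(dims, ranges)):
            return label
    return "unknown"
```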
2.3 Geometric Tracking
- Kalman filtering: Extended Kalman Filter (EKF) or Unscented Kalman Filter (UKF) maintains state estimates for each tracked object (see Section 17)
- Nearest-neighbor data association: Greedy or Hungarian algorithm matches new detections to existing tracks based on Mahalanobis distance (see Section 18)
- No learned features: Association uses only geometric proximity and kinematic consistency, not appearance embeddings
2.4 Lane and Boundary Detection
- Edge detection: Sobel, Canny, or similar gradient-based operators on camera images (Y-channel luminance sufficient) to find lane marking edges
- Hough transforms: Line detection in image space to identify straight lane markings; extended to parabolic/polynomial models for curved lanes
- LiDAR ground return analysis: Intensity discontinuities in LiDAR ground returns correspond to painted lane markings (paint has higher reflectivity than asphalt)
- Model fitting: Polynomial curve fitting (2nd or 3rd order) to detected lane points, constrained by lane width priors and continuity with previous frames
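The LiDAR-intensity variant reduces to thresholding ground returns and fitting a polynomial. A minimal sketch, with an assumed normalized intensity threshold and no lane-width prior or temporal smoothing:

```python
import numpy as np

def fit_lane_from_intensity(ground_points, intensities,
                            intensity_thresh=0.6, order=2):
    """Fit a polynomial lane model y = f(x) to high-intensity LiDAR
    ground returns (paint reflects more strongly than asphalt).

    ground_points: (N,3) with x forward, y lateral;
    intensities: (N,) normalized reflectivity per point."""
    marking = ground_points[intensities > intensity_thresh]
    coeffs = np.polyfit(marking[:, 0], marking[:, 1], order)
    return np.poly1d(coeffs)
```

A production fit would additionally constrain the curve by lane-width priors and continuity with the previous frame's estimate, as noted above.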
2.5 Advantages of the Heuristic Path
| Advantage | Technical Explanation |
|---|---|
| Predictable behavior | No opaque neural network failure modes; failures are deterministic and root-cause analyzable |
| No training data dependency | Functions without labeled datasets or GPU-intensive training; works from day one on new sensor configurations |
| Complementary failure modes | When DL models fail (novel objects, adversarial conditions, distribution shift, compute saturation), the heuristic path may still produce valid detections |
| Low-latency fallback | Classical algorithms run on CPU; can serve as fast fallback when DNN inference is delayed or GPU memory is exhausted |
| Certifiable | Rule-based algorithms can be formally verified and tested against specifications per ISO 26262, which current ML models cannot |
3. Deep Learning Perception Path
The deep learning path uses multi-modal neural networks for higher-accuracy perception:
| Component | Approach |
|---|---|
| 3D object detection | Multi-modal DNNs operating on LiDAR voxel/pillar features + camera image features, producing 3D bounding boxes with class labels, confidence scores, heading angles, and velocity estimates |
| BEV representation | Bird's Eye View as the unified representation: LiDAR features projected to top-down grid; camera features lifted to BEV via depth estimation (LSS/BEVDet/BEVFormer-style view transformation) |
| Semantic segmentation | Pixel-level and point-level classification of road surfaces, lane markings, drivable areas, crosswalks |
| Instance segmentation | Per-object masks combining LiDAR and camera modalities (covered by US Patent 11,250,240) |
| Learned tracking | Deep feature extraction for appearance-based data association; learned motion models capturing complex agent behaviors beyond constant-velocity assumptions |
| Traffic element recognition | CNN-based classification of traffic lights (red/yellow/green, arrow directions, countdown timers), signs, and signals |
The DL path runs on NVIDIA DRIVE Orin-X chips (3 main + 1 redundant = 1,016 TOPS total), optimized through TensorRT inference and CUDA 11.2+ memory management, achieving >10 Hz across all perception modules at ~80% sustained GPU utilization.
What the DL Path Does That the Heuristic Path Cannot
- Semantic understanding: Distinguishing between object classes based on appearance (e.g., police vehicle vs. civilian vehicle, construction barrier vs. guardrail)
- Rich texture-based detection: Detecting objects that have minimal 3D geometry in LiDAR (flat traffic signs, painted road markings)
- Generalization to novel appearances: Handling objects with unusual shapes, colors, or occlusion patterns that rule-based size/shape filters would miss
- Dense scene understanding: Producing complete drivable-area segmentation rather than sparse obstacle detection
4. Fusion and Arbitration
Cross-Validation Layer
The outputs of both paths are fused through an arbitration layer implementing these rules:
Agreement (both paths detect): Objects detected by both paths receive high confidence -- the consensus of independent algorithms provides strong evidence of a real object. The fused output uses the higher-precision bounding box (typically from the DL path) but validates existence via the heuristic path.
DL-only detection: If the DL path detects an object the heuristic path misses, the system treats it as present with moderate confidence. This is common for distant, small, or partially occluded objects where LiDAR points are too sparse for clustering but the camera-based DNN can detect the visual signature.
Heuristic-only detection: If the heuristic path detects an object the DL path misses, the system treats it conservatively as present. This is critical for safety -- an obstacle that produces a clear LiDAR cluster but is not recognized by the neural network (novel object, OOD input) must still be avoided.
Disagreement (conflicting classifications): When both paths detect an object at the same location but disagree on classification, the system uses the more safety-conservative interpretation (e.g., if DL says "traffic cone" but heuristic says "pedestrian," the system assumes "pedestrian" until resolved over subsequent frames).
Conservative Default Policy
The arbitration follows a fundamental principle: the union of detections is used, not the intersection. Any object detected by either path is assumed to exist until proven otherwise. This maximizes recall at the cost of some precision -- acceptable for a safety-critical system where missing an obstacle is far worse than a false detection.
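A toy sketch of the union-plus-conservative-label policy. The severity ordering, the confidence boost on agreement, and matching by a shared `track_id` are all invented for illustration; a real arbiter associates detections spatially:

```python
from dataclasses import dataclass

# Hypothetical severity ordering: higher rank = treated more cautiously
SEVERITY = {"unknown": 0, "cone": 1, "vehicle": 2, "cyclist": 3,
            "pedestrian": 4}

@dataclass
class Detection:
    track_id: int
    label: str
    confidence: float

def arbitrate(dl_dets, heuristic_dets):
    """Union-based arbitration: keep every detection from either path;
    on overlap, boost confidence (independent agreement is strong
    evidence) and keep the more safety-conservative class label."""
    fused = {}
    for det in dl_dets + heuristic_dets:
        if det.track_id not in fused:
            fused[det.track_id] = det       # either-path detection kept
        else:
            prev = fused[det.track_id]
            label = max(prev.label, det.label, key=SEVERITY.get)
            conf = min(1.0, max(prev.confidence, det.confidence) + 0.2)
            fused[det.track_id] = Detection(det.track_id, label, conf)
    return list(fused.values())
```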
Fault Detection and System Arbitration Module
This is one of the seven software redundancy types. It monitors both perception paths for:
- Latency violations: If either path exceeds its real-time deadline
- Output validity: NaN checks, bounding box sanity (negative dimensions, unreasonable velocities)
- Consistency monitoring: Sustained disagreement between paths triggers escalation
- GPU health monitoring: Memory errors, thermal throttling, inference failures
5. Why Dual Paths
5.1 ISO 26262 Diversity Requirement
ISO 26262 (Road Vehicles -- Functional Safety) calls for diverse redundancy in safety-critical systems. Running two perception implementations based on fundamentally different algorithmic paradigms (geometric vs. learned) provides:
- Algorithmic diversity: A bug or systematic error in one approach (e.g., a blind spot in the neural network's training distribution) is unlikely to manifest in the geometrically different heuristic approach
- Implementation diversity: Different code paths, different engineers, different failure modes
- Paradigmatic independence: ML models can fail silently on out-of-distribution inputs; classical algorithms fail loudly with interpretable error states
5.2 ML Opacity Problem
Neural networks are not certifiable under current ISO 26262 methods because:
- Their behavior cannot be exhaustively specified or tested
- They can produce confident wrong outputs on adversarial or out-of-distribution inputs
- Failure modes are not enumerable
The heuristic path provides a certifiable safety baseline against which ML outputs are cross-validated.
5.3 Complementary Failure Modes
| Scenario | DL Path | Heuristic Path | Dual-Path Outcome |
|---|---|---|---|
| Novel object (e.g., fallen mattress) | May miss -- not in training distribution | Detects as generic obstacle cluster | Detected |
| Distant small object (e.g., pedestrian at 180m) | Detects via camera DNN | Too few LiDAR points for clustering | Detected |
| GPU crash / inference timeout | Full failure | Unaffected (runs on CPU) | Heuristic path provides fallback |
| Adversarial pattern / sensor artifact | May produce false classification | Geometric analysis unaffected by visual patterns | Geometric validation catches error |
| Nighttime, poorly lit area | Camera-based DL may degrade | LiDAR-based heuristic unaffected by lighting | Detected via heuristic path |
5.4 CTO Philosophy
Pony.ai's CTO Lou Tiancheng has been described as a "rule-based true believer" in the planning domain. This philosophy extends to perception architecture: the company deliberately maintains strong classical/rule-based components as a counterweight to ML, rather than going all-in on end-to-end neural approaches. Their software is explicitly described as "a combination of AI models and rule-based code."
6. Hesai AT128 LiDAR Signal Processing
6.1 Sensor Overview
The Gen-7 robotaxi uses four Hesai AT128 hybrid solid-state LiDARs as primary perception sensors, each covering 120 degrees of horizontal FOV.
| Parameter | Specification |
|---|---|
| Channels | 128 genuine channels (128 VCSEL arrays) |
| Scan technology | 128 high-power multi-junction VCSEL arrays, electronic scanning (no mechanical rotation) |
| Horizontal FOV | 120 degrees |
| Vertical FOV | 25.4 degrees |
| Detection range | 210 m @ 10% reflectivity |
| Ground detection range | Up to 70 m effective |
| Angular resolution | 0.1 deg (H) x 0.2 deg (V) |
| Point rate | >1.53 million pts/sec (single return); higher in dual return |
| Pixel resolution | 1,200 x 128 |
| Range accuracy | +/- 3 cm |
| Wavelength | 905 nm |
| Eye safety | Class 1 |
| Functional safety | ISO 26262 ASIL-B certified |
6.2 Range Computation (Time-of-Flight)
The AT128 uses a pulsed Time-of-Flight (ToF) ranging method:
- Laser emission: Each VCSEL array fires a short (few nanosecond) 905 nm laser pulse
- Photon detection: SiPM (Silicon Photomultiplier) detectors receive reflected photons
- Time measurement: On-chip TDC (Time-to-Digital Converter) measures the round-trip time from emission to detection with sub-nanosecond resolution
- Range calculation:
distance = (c * t_roundtrip) / 2, where c = speed of light (~0.3 m/ns), yielding +/- 3 cm accuracy
- Waveform digitization: Hesai's proprietary ASIC performs waveform digitization and peak detection to identify return pulses within the time-domain signal
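The range formula in concrete numbers:

```python
C_M_PER_NS = 0.299792458  # speed of light in m/ns

def tof_range_m(t_roundtrip_ns):
    """Pulsed ToF: range is half the round-trip optical path."""
    return C_M_PER_NS * t_roundtrip_ns / 2.0

# A return from the 210 m spec range arrives ~1.4 microseconds after
# emission. The +/- 3 cm range accuracy corresponds to resolving the
# round trip to 2 * 0.03 / 0.3 = 0.2 ns -- hence the sub-nanosecond TDC.
```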
6.3 Multi-Return Handling
The AT128 supports multiple return modes:
- Single return (strongest): Reports only the strongest return pulse per laser firing. Point rate: 1,536,000 pts/sec. Used when maximum point density is not needed and processing bandwidth is constrained.
- Single return (last): Reports only the last return pulse, which corresponds to the most distant surface. Useful for seeing through rain, spray, and vegetation.
- Dual return: Reports both the strongest and last returns per laser firing. Doubles point rate but increases bandwidth and processing load. Critical for adverse weather: the first return may come from rain/spray while the last return reaches the actual obstacle behind it.
The dual return mode is essential for Pony.ai's all-weather operation across Chinese cities. By comparing strongest and last returns, the downstream processing pipeline can distinguish weather-related reflections from solid obstacles.
6.4 Intensity Calibration
Each point includes a reflectivity/intensity value representing the return signal strength relative to a calibrated reference. This value enables:
- Distinguishing high-reflectivity surfaces (retroreflectors, lane markings, road signs) from low-reflectivity surfaces (asphalt, dark vehicles)
- Material classification heuristics in the heuristic perception path
- Lane marking detection from LiDAR ground returns (painted markings have higher reflectivity than bare asphalt)
Factory calibration by Hesai establishes per-channel intensity correction factors to normalize response across all 128 channels.
6.5 Intelligent Point Cloud Engine (IPE) -- Rain/Spray Filtering
Hesai's proprietary Intelligent Point Cloud Engine (IPE), implemented in their ASIC firmware, provides hardware-level noise filtering:
- Real-time weather detection: Identifies rain, fog, exhaust fumes, and water splashes at the pixel level
- Per-point marking: Each point receives a confidence/noise flag indicating whether it is likely a weather artifact
- Filtering rate: Filters out >99.9% of environmental noise in adverse conditions (rain, fog, dust, exhaust)
- Waveform analysis: The IPE decodes laser return waveforms with nanosecond-level precision, processing 24.6 billion samples per second across all channels
- Multi-return comparison: Weather artifacts typically appear only in the first/strongest return while solid objects appear in both returns -- the IPE uses this discrepancy to flag noise points
This hardware-level filtering reduces the burden on Pony.ai's downstream software pipeline: the point cloud arriving at the perception stack has already been pre-cleaned by the LiDAR's onboard ASIC.
6.6 Point Cloud Packet Structure
Each AT128 UDP packet contains:
- Timestamp: Absolute time information for each data block, synchronized to GPS/PTP time source
- Channel ID: Identifying which of the 128 channels produced each point
- Range: Distance measurement
- Intensity/Reflectivity: Calibrated return signal strength
- Return mode flag: Strongest, last, or dual-return indicator
- Noise/confidence flag: IPE-generated quality indicator
6.7 Pony.ai's GPU-Side LiDAR Processing
After packets arrive via Ethernet (UDP), Pony.ai's processing pipeline (documented on the NVIDIA Developer Blog) performs:
- Packet collection and time sync: The upstream synchronization module (originally FPGA, now NVIDIA DRIVE Orin SoC) collects raw packets, applies time synchronization, and packages data
- Structure-of-Array (SoA) conversion: Point cloud data is restructured from packet format into GPU-friendly SoA layout (separate arrays for x, y, z, intensity, timestamp) for coalesced memory access
- Page-locked memory transfer: Fields exchanged between CPU and GPU use page-locked (pinned) memory for accelerated PCIe transfers
- NVIDIA CUB library operations: Scan/select operations for point filtering achieve ~58% faster performance than naive implementations
- Ground plane removal: RANSAC-based ground segmentation on GPU
- Point cloud filtering: Noise removal, range-gate filtering, removal of points flagged by IPE
Critical path latency reduction: ~4 ms achieved through these GPU optimizations.
7. Radar Signal Processing
7.1 Hardware Configuration
| Generation | Configuration | Key Feature |
|---|---|---|
| Gen-6 | 4 short-range + 1 long-range forward-facing mmWave radar (5 total) | Conventional mmWave, limited elevation resolution |
| Gen-7 | 4 x 4D imaging millimeter-wave radar | Dense point cloud, elevation angle, high-resolution azimuth |
7.2 4D Imaging Radar Signal Processing Pipeline
The Gen-7's 4D imaging radar follows the standard FMCW (Frequency-Modulated Continuous Wave) radar processing chain, which is entirely classical signal processing:
Step 1: Range-Doppler Map Generation (2D-FFT)
- Each radar chirp produces a beat frequency signal from mixing transmitted and received waveforms
- Range FFT: Fast Fourier Transform along the fast-time (samples within one chirp) dimension extracts range bins
- Doppler FFT: FFT along the slow-time (across chirps within one frame) dimension extracts velocity bins
- Result: A 2D Range-Doppler (RD) map per virtual antenna, with axes of range and radial velocity
Step 2: CFAR Detection (Constant False Alarm Rate)
- CFAR adaptively sets detection thresholds on the Range-Doppler map
- Cell-averaging CFAR (CA-CFAR) or Ordered-Statistics CFAR (OS-CFAR): Estimates local noise floor around each cell under test by averaging neighboring cells, then sets threshold as noise floor + margin
- Target cells exceeding the adaptive threshold are declared detections
- CFAR ensures reliable detection even in environments with varying clutter levels and low SNR
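A 1-D cell-averaging CFAR sketch over a power profile (in practice CFAR runs over the 2D Range-Doppler map; window sizes and the threshold scale here are placeholders):

```python
import numpy as np

def ca_cfar(power, n_train=8, n_guard=2, scale=4.0):
    """Cell-averaging CFAR on a 1-D power profile.

    For each cell under test, the noise floor is the mean of n_train
    training cells split across both sides, with n_guard guard cells
    excluded next to the test cell; a detection is declared when the
    cell exceeds scale * noise floor."""
    n = len(power)
    half = n_train // 2 + n_guard
    detections = np.zeros(n, dtype=bool)
    for i in range(half, n - half):
        left = power[i - half : i - n_guard]
        right = power[i + n_guard + 1 : i + half + 1]
        noise = np.mean(np.concatenate([left, right]))
        detections[i] = power[i] > scale * noise
    return detections
```

(Uniform noise below keeps the test deterministic; real clutter is better modeled as exponential, which is exactly why the threshold must adapt.)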
Step 3: Beamforming and Direction-of-Arrival (DOA) Estimation
- The 4D imaging radar uses MIMO antenna arrays (multiple TX, multiple RX) to create virtual apertures
- Digital beamforming (DBF): FFT along the antenna dimension of the virtual array to estimate azimuth angle
- Elevation estimation: Additional antenna dimension provides elevation angle (the "4th D" in 4D radar)
- Advanced algorithms: Capon beamforming or MVDR (Minimum Variance Distortionless Response) can nearly double angular resolution compared to standard FFT-based beamforming on the same hardware
Step 4: Point Cloud Generation
- Detected targets are converted to 3D point cloud format: (range, azimuth, elevation, Doppler velocity)
- Each radar "point" includes directly measured radial velocity -- a unique capability that LiDAR and cameras cannot provide
Step 5: Velocity Estimation
- Direct Doppler measurement: Radar provides instantaneous radial velocity for every detected point, without needing frame-to-frame differencing
- Velocity disambiguation: Phase unwrapping and multi-chirp techniques resolve velocity ambiguity
- This velocity information is fed to the tracking pipeline (Section 17) to improve state estimation, and is used for online radar calibration (Section 11)
7.3 Radar's Classical Contributions to Perception
All radar signal processing from raw ADC samples through CFAR detection and beamforming is entirely classical -- no neural networks are involved in the radar processing chain. The radar provides:
- Direct velocity measurements for moving object tracking
- Weather-robust detections independent of optical conditions
- Stationary object Doppler signatures used for online sensor calibration (Pony.ai patent US11,454,701)
8. Camera ISP and Preprocessing Pipeline
8.1 Hardware Evolution
Pony.ai's camera pipeline evolved through four major phases (documented on NVIDIA Developer Blog):
| Phase | Architecture | Camera Interface |
|---|---|---|
| Phase 1 | CPU-based I/O | USB + Ethernet cameras -> CPU -> GPU |
| Phase 2 | FPGA gateway | Cameras -> FPGA (trigger + sync) -> DMA -> main memory -> GPU (HostToDevice ~1.5ms) |
| Phase 3 | GPU Direct RDMA | Cameras -> FPGA -> PCIe switch -> GPU memory directly (~6 GB/s on PCIe Gen3 x8) |
| Phase 4 (Production) | NVIDIA DRIVE Orin SoC | Cameras -> Orin SoC (ISP + sync + encoding) -> NvStreams -> discrete GPU or host CPU |
8.2 ISP Pipeline
The Image Signal Processor handles raw sensor data to produce usable images. Pony.ai's ISP pipeline, implemented on the Orin SoC in production, performs:
- Debayering: Converting raw Bayer-pattern data from the image sensor into full-color images
- Exposure control: Adaptive exposure management for varying lighting conditions. Pony.ai collaborated with ON Semiconductor (now onsemi) on "next-generation image sensing and processing technologies" specifically addressing "exposure control in imaging for computer vision applications." The critical insight: "exposure parameters cannot be universally optimal, as variations in lighting and conditions affect the visual representation of objects." The ISP dynamically adapts exposure per camera based on scene illumination.
- White balance: Color temperature correction to normalize colors across varying ambient light
- Noise reduction: Spatial and temporal denoising to reduce sensor noise, especially in low-light conditions
- Tone mapping / Gamma correction: Dynamic range compression to preserve detail in both shadows and highlights
- HDR processing: Some cameras may employ multi-exposure HDR to handle high-contrast scenes (e.g., tunnel exits, direct sunlight with deep shadows)
- YUV420 output: The ISP outputs in YUV420 color space natively
8.3 YUV420 Native Format Strategy
A key optimization: Pony.ai adopted YUV420 throughout the entire pipeline, eliminating YUV-to-RGB conversion:
- Conversion savings: Eliminating YUV->RGB conversion saves ~0.3 ms per frame
- Memory savings: YUV420 uses 50% less GPU memory than RGB
- Luminance-only processing: Perception modules that need only brightness (edge detection, feature point extraction) use only the Y channel, saving 67% memory
- No perceptual loss: Human vision is less sensitive to chrominance than luminance; machine perception benefits from the same data in native format
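The memory figures follow directly from bytes per pixel, checked here on an illustrative 3840x2160 (~8.3MP) frame:

```python
def frame_bytes(width, height, fmt):
    """Bytes per frame for common 8-bit pixel layouts.

    RGB: 3 bytes/pixel. YUV420: full-resolution Y plane plus
    quarter-resolution U and V planes -> 1.5 bytes/pixel.
    Y-only: 1 byte/pixel."""
    pixels = width * height
    return {"rgb": 3 * pixels,
            "yuv420": pixels + 2 * (pixels // 4),
            "y_only": pixels}[fmt]

rgb = frame_bytes(3840, 2160, "rgb")
yuv = frame_bytes(3840, 2160, "yuv420")
y_only = frame_bytes(3840, 2160, "y_only")
```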
8.4 Hardware Encoding
- HEVC encoding: NVIDIA Video Codec dedicated hardware encoders (~3 ms per FHD image)
- Replaces NvJPEG which required ~4 ms and caused CPU/GPU resource contention
- Dedicated hardware encoders preserve CUDA cores and CPU resources for neural network inference
8.5 Zero-Copy Memory Architecture
Camera frames reside in GPU memory from the moment they arrive via NvStreams/GPU Direct RDMA:
- Custom protobuf codegen plugin introduced a GpuData field type; CameraFrame protobuf messages contain GPU memory pointers, not pixel data copies
- Multiple perception modules (DL detection, heuristic edge detection, segmentation) receive pointers to the same GPU-resident frame

- Zero-copy throughout the pipeline: No redundant CPU <-> GPU transfers
8.6 GPU Memory Management for Camera Processing
- Fixed slot-size GPU memory pool (early approach): Pre-allocated stacks matching camera frame sizes, reducing alloc/free overhead to near zero
- CUDA 11.2 cudaMemPool (current approach): Dynamic allocation with ~2 microsecond overhead, supporting cameras with varying resolutions (Gen-7 uses 14 cameras including 8MP sensors)
- Page-locked memory: Used for CPU-GPU field exchanges that must occur during preprocessing
8.7 Traffic Light Camera (PiDC)
Pony.ai designed an in-house traffic light camera (PiDC) specifically for traffic signal detection:
- Uses a constant exposure time of 11 ms to achieve consistent image capture regardless of ambient light
- Employs a neutral density (ND) filter to prevent oversaturation at this high-sensitivity setting
- In the Gen-6 system, the self-developed traffic light camera had 1.5x the resolution of the previous generation
- The PiDC eliminates the need for adaptive exposure that would otherwise cause traffic light appearance to vary across frames, simplifying both classical and DL detection
9. Time Synchronization
9.1 The Synchronization Problem
Pony.ai's Gen-7 system must synchronize 34 sensors that operate at different frame rates, run on different internal clocks, and sit at different physical locations on the vehicle:
- 9 LiDARs (each at 10 or 20 Hz, electronic scanning)
- 14 cameras (10-30 Hz depending on type)
- 4 radars (10-20 Hz)
- 4 microphones (continuous audio stream)
- 2 water sensors + 1 collision sensor (event-based)
- GNSS receiver (1-10 Hz)
- IMU (100-400 Hz)
9.2 Synchronization Architecture
Hardware trigger (FPGA / Orin SoC): The synchronization arbiter was originally an FPGA, which "handles the camera trigger and synchronization logic to provide better sensor fusion." In production (Gen-7), the NVIDIA DRIVE Orin SoC assumes this role.
The synchronization module:
- Generates hardware trigger signals to cameras (all cameras fire simultaneously for spatial consistency)
- Receives LiDAR packet timestamps (GPS-disciplined PTP time)
- Correlates radar frame timestamps
- Produces a synchronized sensor data package with all modalities time-aligned
Time reference sources:
- GPS Pulse-Per-Second (PPS): A 1 PPS signal from the GNSS receiver provides a reference clock aligned to UTC with nanosecond-level accuracy
- PTP (Precision Time Protocol, IEEE 1588): PTPv2 distributes precise time across the sensor network via Ethernet, achieving +/- 100 ns synchronization between devices
- Orin SoC internal clock: Disciplined by PTP/GPS, used as the master clock for all sensor triggering
9.3 Temporal Alignment in Processing
After synchronized capture, the upstream module applies:
- Timestamp interpolation: For sensors with different frame rates, data is interpolated to a common reference time
- Motion compensation: Vehicle ego-motion between sensor capture times is compensated using IMU data at high rate (100+ Hz), so that point clouds and images can be fused in a consistent coordinate frame despite being captured at slightly different instants
- Latency equalization: Different sensors have different inherent processing latencies; the synchronization module accounts for these offsets
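The motion-compensation step can be sketched under a constant-velocity, constant-yaw-rate, small-angle planar model; the production pipeline integrates high-rate IMU data rather than this simplification:

```python
import numpy as np

def motion_compensate(points, timestamps, t_ref, v_ego, yaw_rate):
    """Move each point into the vehicle frame at reference time t_ref.

    points: (N,3) in the vehicle frame at each point's capture time;
    timestamps: (N,) capture time per point (s);
    v_ego: (3,) ego velocity (m/s); yaw_rate: ego yaw rate (rad/s).
    Assumes constant velocity/yaw rate over the scan (small angles)."""
    dt = t_ref - timestamps                  # time each point must "age"
    shift = dt[:, None] * v_ego[None, :]     # ego translation over dt
    ang = -yaw_rate * dt                     # frame heading change, negated
    c, s = np.cos(ang), np.sin(ang)
    out = np.empty_like(points)
    out[:, 0] = c * points[:, 0] - s * points[:, 1] - shift[:, 0]
    out[:, 1] = s * points[:, 0] + c * points[:, 1] - shift[:, 1]
    out[:, 2] = points[:, 2] - shift[:, 2]
    return out
```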
10. Multi-Sensor Calibration
10.1 The Calibration Challenge
With 34 sensors in the Gen-7 system, calibration involves determining precise 6-DOF transformations (3 rotations + 3 translations) for every sensor pair:
- 9 LiDAR-to-vehicle transformations
- 14 camera-to-vehicle transformations (plus intrinsic calibration: focal length, principal point, lens distortion per camera)
- 4 radar-to-vehicle transformations
- LiDAR-to-camera cross-modal alignments (for point cloud projection onto images)
- Radar-to-LiDAR alignments (for cross-modal association)
- Temporal calibration: Time offsets between sensor clocks
10.2 LiDAR-Camera Extrinsic Calibration
The LiDAR-camera transformation must enable accurate projection of 3D LiDAR points onto 2D camera images. Methods used in the industry (and likely by Pony.ai given their patent portfolio):
Target-based calibration (factory/initial):
- Calibration targets (checkerboard patterns, specialized reflective targets) are placed at known positions
- LiDAR detects target edges/corners in the point cloud; camera detects them in the image
- The 6-DOF transformation is computed by minimizing reprojection error between corresponding 3D-2D point pairs
- Used during vehicle assembly and after sensor replacement
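The reprojection error being minimized can be made concrete with a pinhole model; the intrinsic values in the test are illustrative:

```python
import numpy as np

def project_points(pts_lidar, R, t, K):
    """Project 3D LiDAR points into a pinhole camera image.

    R, t: extrinsic rotation/translation (LiDAR frame -> camera frame);
    K: 3x3 intrinsic matrix. Returns (N,2) pixel coordinates."""
    cam = pts_lidar @ R.T + t          # into the camera frame
    uv = cam @ K.T                     # apply intrinsics
    return uv[:, :2] / uv[:, 2:3]      # perspective divide

def reprojection_error(pts_lidar, pixels, R, t, K):
    """Mean pixel distance between projected 3D target points and their
    detected 2D image locations -- the quantity target-based
    calibration minimizes over the 6-DOF transform."""
    proj = project_points(pts_lidar, R, t, K)
    return np.mean(np.linalg.norm(proj - pixels, axis=1))
```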
Targetless calibration (online/continuous):
- Extracts natural features (edges, planes, corners) from both LiDAR point clouds and camera images
- Matches features across modalities to estimate and refine the extrinsic transformation
- Can run continuously during normal driving to detect and correct calibration drift
10.3 Radar-LiDAR/Camera Cross-Calibration
- Radar-LiDAR calibration exploits shared detections of the same objects in both modalities
- Spatial association of radar points and LiDAR clusters at known positions determines the radar-to-vehicle transformation
- Pony.ai's Doppler-based calibration patent (US11,454,701) provides continuous radar calibration using stationary object Doppler signatures (see Section 11)
10.4 LiDAR Intrinsic Calibration
- Hesai factory-calibrates each AT128's 128 channels: beam angle offsets, range bias corrections, intensity normalization
- The AT128's "unstitched" 120-degree FOV from genuine 128 VCSEL channels eliminates seam artifacts that would require additional cross-channel calibration
- Per-channel corrections are stored in sensor firmware and applied automatically
11. Online Calibration
11.1 Real-Time Doppler-Based Calibration (US Patent 11,454,701)
Pony.ai holds patent US11,454,701: "Real-time and dynamic calibration of active sensors with angle-resolved Doppler information for vehicles."
Algorithm:
- During normal driving, the radar continuously measures Doppler velocity of all detected objects
- Stationary object identification: Buildings, poles, parked vehicles, guardrails -- objects known to be stationary (zero true velocity)
- Expected Doppler computation: For a stationary object, the observed Doppler velocity should equal the negative of the vehicle's own velocity component projected along the radar beam direction:
  v_doppler_expected = -v_ego * cos(theta_object)
  where theta_object is the angle from the radar boresight to the object
- Discrepancy measurement: Any systematic discrepancy between expected and measured Doppler across multiple stationary objects indicates sensor mounting angle error
- Parameter update: The system automatically adjusts the sensor's angular offset parameters to minimize the observed discrepancy
- Continuous operation: This calibration runs in real-time during normal driving, requiring no special calibration targets or procedures
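The algorithm above can be illustrated with a minimal grid-search sketch. Everything here is an assumption for illustration: the function name, the +/- 5 degree search window, and the least-squares cost are not Pony.ai's implementation, only the Doppler relation from the patent description.

```python
import numpy as np

def estimate_boresight_offset(thetas, dopplers, v_ego):
    """Estimate a radar mounting-angle offset from stationary-object returns.

    thetas   : measured angles (rad) from the nominal radar boresight
    dopplers : measured Doppler velocities (m/s) of stationary objects
    v_ego    : ego speed (m/s)

    For a stationary object, expected Doppler = -v_ego * cos(theta + delta),
    where delta is the unknown mounting offset. A grid search over delta
    minimizes the summed squared residual across all returns.
    """
    candidates = np.deg2rad(np.linspace(-5.0, 5.0, 1001))  # assumed +/- 5 deg window
    costs = [np.sum((dopplers + v_ego * np.cos(thetas + d)) ** 2)
             for d in candidates]
    return candidates[int(np.argmin(costs))]

# Synthetic check: recover a simulated 1.2-degree mounting offset
rng = np.random.default_rng(0)
true_delta = np.deg2rad(1.2)
theta = rng.uniform(-0.8, 0.8, 200)     # angles of stationary objects
v = 15.0                                # ego speed, m/s
meas = -v * np.cos(theta + true_delta) + rng.normal(0, 0.05, 200)
est = estimate_boresight_offset(theta, meas, v)
```

Because many stationary returns are averaged, even noisy Doppler measurements pin down the offset to a small fraction of a degree.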
Significance for fleet operations:
- Enables continuous recalibration across 1,000+ vehicle fleet without manual intervention
- Compensates for thermal expansion, vibration-induced drift, and minor impacts that gradually shift sensor alignment
- Maintains perception accuracy over the designed 600,000+ km vehicle lifespan
11.2 Static Object-Based LiDAR Calibration (US Patent 12,032,102)
Pony.ai's patent for vehicle sensor calibration using detected static objects:
Algorithm:
- Environmental assessment: The system first evaluates whether conditions are suitable for calibration (sufficient static objects, good visibility)
- Static object detection: Identifies poles, signs, building corners, and other static landmarks in the point cloud
- Height and shape variance check: If detected static objects exceed a predetermined variance threshold (indicating sufficient geometric diversity), calibration proceeds
- Transformation matrix computation: The first sensor's local coordinate system is iteratively aligned to a pre-calibrated second sensor's coordinate system
- Iterative refinement: If calibration accuracy falls below threshold, additional static objects are detected and the process repeats
- Global reference: Calibration is anchored to a global coordinate system using the HD map as reference -- detected landmarks are matched to known map features
This is a fully classical optimization algorithm: iterative closest point (ICP) style transformation estimation using geometric features, with no neural network involvement.
11.3 Calibration Monitoring
The system continuously monitors calibration quality by:
- Checking consistency of LiDAR-camera point projections against detected image edges
- Monitoring radar Doppler residuals against ego-motion estimates
- Flagging sudden calibration shifts (impact detection triggers immediate recalibration check)
- Running calibration quality checks as part of the 1,000+ monitoring mechanisms
12. Manufacturing Calibration
12.1 Gen-7 Mass Production Context
The Gen-7 is explicitly the "world's first mass-produced L4 autonomous vehicle" with three vehicle platforms (Toyota bZ4X, BAIC ARCFOX Alpha T5, GAC Aion V). Calibration at production scale is fundamentally different from research prototype calibration.
12.2 Platform-Based Design
The Gen-7 features an "enhanced platform-based design that enables rapid adaptation across multiple vehicle models." This means:
- Standardized sensor mounting points: Each platform has precisely machined mounting locations for the rooftop assembly (4x AT128 + cameras) and body-mounted sensors (5x near-range LiDAR + radar)
- Pre-assembled sensor modules: Sensors are "pre-assembled" in the highly integrated sensor package before vehicle integration, allowing factory calibration of relative sensor positions within the module
- Automotive-grade tolerances: 100% automotive-grade components with manufacturing tolerances that minimize initial calibration variation
12.3 Factory Calibration Pipeline
Based on industry practice for mass-produced AV systems and Pony.ai's disclosed capabilities:
- Sensor module assembly: Sensors are mounted in the rooftop/body assemblies with precision jigs
- Intra-module calibration: Relative positions of sensors within each module are calibrated using target-based methods in a controlled environment
- Vehicle integration: Modules are installed on the vehicle platform
- Vehicle-level calibration: A complete calibration run establishes the full set of sensor-to-vehicle transformations, potentially using a calibration facility with known reference targets
- Calibration verification drive: A short test drive verifies calibration quality using the online calibration system (Section 11) as a checker
- Calibration data storage: Calibration parameters are stored in the vehicle's compute system and updated via OTA as online calibration refines values during operation
12.4 Scale Considerations
With 1,000+ vehicles planned for 2025 fleet and 3,000+ by end of 2026:
- Manual per-vehicle calibration sessions are impractical at this scale
- Heavy reliance on automated calibration (Sections 11.1, 11.2)
- Quality control through statistical monitoring of calibration drift rates across the fleet
- OTA calibration parameter updates when systematic biases are detected
13. Point Cloud Processing
13.1 Classical Point Cloud Processing Pipeline
The point cloud processing pipeline operates on the combined output from all 9 LiDARs (4x AT128 + 5x near-range). Classical processing steps run in the heuristic perception path:
Step 1: Multi-LiDAR Point Cloud Merging
- Point clouds from individual LiDARs are transformed to the vehicle body frame using calibrated extrinsic transformations
- Motion compensation using IMU data corrects for vehicle motion during the scan period
- Combined point cloud represents a unified 360-degree 3D scene
Step 2: Ground Plane Estimation
- RANSAC plane fitting: Iteratively samples minimal point sets (3 points define a plane), fits candidate planes, counts inliers within a distance threshold
- Constrained to near-horizontal: Angular tolerance around gravity vector (from IMU) prevents fitting to vertical surfaces
- Segmented ground model: Rather than a single global plane, the ground is modeled as a piecewise planar surface (tiled grid of local planes), handling slopes, ramps, and road crown
- Ground points are separated from above-ground points; ground points feed into drivable surface estimation
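The RANSAC step with the near-horizontal constraint can be sketched as follows (function name, iteration count, and thresholds are illustrative assumptions, not Pony.ai's values):

```python
import numpy as np

def ransac_ground_plane(points, n_iters=200, dist_thresh=0.15,
                        max_tilt_deg=15.0, rng=None):
    """RANSAC plane fit constrained to near-horizontal candidate planes.

    points: (N, 3) array. Returns (normal, d, inlier_mask) for the plane
    normal . p + d = 0 with the most inliers, rejecting sampled planes
    whose normal tilts more than max_tilt_deg from the gravity axis (+z).
    """
    rng = rng or np.random.default_rng(0)
    best = (None, None, np.zeros(len(points), dtype=bool))
    cos_max = np.cos(np.deg2rad(max_tilt_deg))
    for _ in range(n_iters):
        p0, p1, p2 = points[rng.choice(len(points), 3, replace=False)]
        n = np.cross(p1 - p0, p2 - p0)
        norm = np.linalg.norm(n)
        if norm < 1e-9:
            continue                     # degenerate (collinear) sample
        n = n / norm
        if abs(n[2]) < cos_max:
            continue                     # too far from horizontal
        d = -n @ p0
        inliers = np.abs(points @ n + d) < dist_thresh
        if inliers.sum() > best[2].sum():
            best = (n, d, inliers)
    return best

# Synthetic scene: a flat ground patch plus a vertical wall
rng = np.random.default_rng(1)
xy = rng.uniform(-20.0, 20.0, (500, 2))
ground = np.column_stack([xy, rng.normal(0.0, 0.02, 500)])
wall = np.column_stack([np.full(100, 5.0),
                        rng.uniform(-2.0, 2.0, 100),
                        rng.uniform(0.0, 2.0, 100)])
n, d, mask = ransac_ground_plane(np.vstack([ground, wall]))
```

The tilt gate is what prevents the fitter from locking onto the wall even though it is also planar.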
Step 3: Noise and Outlier Removal
- Statistical Outlier Removal (SOR): For each point, the mean distance to its k-nearest neighbors is computed; points with mean distances exceeding a threshold (e.g., mean + 2*stddev) are removed
- Radius-based outlier removal: Points with fewer than N neighbors within radius R are removed
- IPE-flagged point removal: Points flagged as weather noise by Hesai's IPE are excluded
- Multi-return disambiguation: In dual-return mode, comparing strongest and last returns to filter weather artifacts
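Statistical Outlier Removal as described above has a compact form (a generic textbook sketch, not Pony.ai's code; the k and multiplier values are assumptions):

```python
import numpy as np
from scipy.spatial import cKDTree

def statistical_outlier_removal(points, k=8, std_mult=2.0):
    """Drop points whose mean k-NN distance exceeds mean + std_mult * stddev."""
    tree = cKDTree(points)
    # query k+1 neighbors because each point's nearest neighbor is itself
    dists, _ = tree.query(points, k=k + 1)
    mean_knn = dists[:, 1:].mean(axis=1)
    keep = mean_knn < mean_knn.mean() + std_mult * mean_knn.std()
    return points[keep], keep

# Dense cluster plus a handful of isolated (e.g., rain/dust) returns
rng = np.random.default_rng(2)
dense = rng.normal(0.0, 0.5, (300, 3))
sparse = rng.uniform(20.0, 30.0, (5, 3))
filtered, keep = statistical_outlier_removal(np.vstack([dense, sparse]))
```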
Step 4: Voxel Grid Downsampling
- 3D space is divided into uniform voxels (e.g., 10-20 cm cubes)
- Points within each voxel are replaced by their centroid
- Reduces point count while preserving spatial structure
- Critical for making downstream clustering computationally tractable at 6M+ pts/sec
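The voxel-centroid reduction can be done fully vectorized; this is a generic sketch (the 0.2 m voxel size is one of the values mentioned above, not a confirmed parameter):

```python
import numpy as np

def voxel_downsample(points, voxel=0.2):
    """Replace all points inside each voxel cube by their centroid."""
    keys = np.floor(points / voxel).astype(np.int64)
    # unique voxel id per point; `inverse` maps each point to its voxel
    _, inverse, counts = np.unique(keys, axis=0, return_inverse=True,
                                   return_counts=True)
    inverse = inverse.ravel()
    sums = np.zeros((len(counts), 3))
    np.add.at(sums, inverse, points)     # accumulate per-voxel sums
    return sums / counts[:, None]

pts = np.array([[0.01, 0.01, 0.01],
                [0.05, 0.05, 0.05],
                [1.00, 1.00, 1.00]])
out = voxel_downsample(pts, 0.2)         # two occupied voxels remain
```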
Step 5: Above-Ground Clustering
- Euclidean clustering: KD-tree-based nearest-neighbor search groups points within a distance threshold into clusters
- Minimum/maximum cluster bounds: Filter out noise clusters (< min_points) and over-merged clusters (> max_points or > max_extent)
- Each cluster is a candidate obstacle for the heuristic detection path
Step 6: Bounding Box Fitting
- For each cluster: PCA determines principal axes; oriented bounding box (OBB) is fit aligned to principal components
- L-shape fitting for partially visible objects (vehicles seen from one corner/side)
- Height, width, length extracted for size-based classification
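The PCA-based box fit reduces, in the ground plane, to an eigendecomposition of the cluster covariance (an illustrative 2D sketch; full 3D OBB and L-shape fitting add more machinery):

```python
import numpy as np

def fit_obb_2d(cluster_xy):
    """Fit a ground-plane oriented bounding box to a cluster via PCA.

    Returns (center, yaw, length, width): yaw is the heading of the
    principal axis; length/width are extents along the principal axes.
    """
    center = cluster_xy.mean(axis=0)
    cov = np.cov((cluster_xy - center).T)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues ascending
    major = eigvecs[:, -1]                   # principal direction
    yaw = np.arctan2(major[1], major[0])
    # project points onto the principal axes to get extents
    R = np.column_stack([major, eigvecs[:, 0]])
    local = (cluster_xy - center) @ R
    length = np.ptp(local[:, 0])
    width = np.ptp(local[:, 1])
    return center, yaw, length, width

# Synthetic 4 m x 2 m vehicle-like cluster rotated by 30 degrees
gx, gy = np.meshgrid(np.linspace(-2, 2, 21), np.linspace(-1, 1, 11))
pts0 = np.column_stack([gx.ravel(), gy.ravel()])
th = np.deg2rad(30.0)
Rm = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
center, yaw, length, width = fit_obb_2d(pts0 @ Rm.T)
```

Note that PCA yaw is ambiguous by 180 degrees; resolving the true heading requires motion or appearance cues.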
13.2 GPU-Optimized Implementation
All point cloud processing runs on GPU with specific optimizations documented by Pony.ai:
- SoA data layout: Separate contiguous arrays for x, y, z, intensity, timestamp enable coalesced GPU memory access
- CUB library: NVIDIA CUB scan/select operations for filtering (~58% faster than naive)
- Page-locked memory: For CPU-GPU exchanges during preprocessing
- Fixed memory pool / cudaMemPool: Pre-allocated GPU memory to avoid allocation latency
14. Free Space Estimation
14.1 Classical Occupancy Grid
Free space estimation determines which areas around the vehicle are traversable (free of obstacles). The classical approach:
2D Occupancy Grid Construction:
- The area around the vehicle is discretized into a 2D grid (cells typically 10-20 cm)
- Each cell has one of three states: free, occupied, or unknown
- LiDAR points that hit above-ground obstacles mark cells as occupied
- Ground-level LiDAR returns (from ground plane estimation) confirm cells as free
Ray Casting:
- For each LiDAR point, a ray is traced from the sensor origin to the point's location
- All grid cells along the ray path (before the hit point) are marked as free (the ray passed through them unobstructed)
- The cell containing the hit point is marked as occupied
- Cells never traversed by any ray remain unknown
- This classical method provides a conservative estimate of free space with no ML dependency
Temporal Accumulation:
- Occupancy evidence accumulates over multiple scans using log-odds updating (Bayesian occupancy grid)
- Cells transition from unknown to free or occupied as evidence accumulates
- Moving objects are handled by decaying occupancy evidence over time
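Ray casting plus log-odds accumulation can be sketched in a few lines (the increment and clamp values are assumed tuning constants, not disclosed parameters):

```python
import numpy as np

L_OCC, L_FREE = 0.85, -0.4    # log-odds increments (assumed tuning values)
L_MIN, L_MAX = -4.0, 4.0      # clamping keeps cells responsive to change

def bresenham_ray(x0, y0, x1, y1):
    """Integer grid cells traversed from (x0, y0) up to (x1, y1), exclusive."""
    cells, dx, dy = [], abs(x1 - x0), abs(y1 - y0)
    sx, sy = (1 if x1 > x0 else -1), (1 if y1 > y0 else -1)
    err, x, y = dx - dy, x0, y0
    while (x, y) != (x1, y1):
        cells.append((x, y))
        e2 = 2 * err
        if e2 > -dy:
            err -= dy; x += sx
        if e2 < dx:
            err += dx; y += sy
    return cells

def update_grid(grid, sensor_cell, hit_cell):
    """Log-odds Bayesian update along one ray: free until the hit cell."""
    for cx, cy in bresenham_ray(*sensor_cell, *hit_cell):
        grid[cy, cx] = np.clip(grid[cy, cx] + L_FREE, L_MIN, L_MAX)
    hx, hy = hit_cell
    grid[hy, hx] = np.clip(grid[hy, hx] + L_OCC, L_MIN, L_MAX)

grid = np.zeros((20, 20))                 # log-odds 0 == unknown (p = 0.5)
for _ in range(3):                        # three consistent scans
    update_grid(grid, (0, 0), (10, 0))
p_occ = 1.0 / (1.0 + np.exp(-grid))       # log-odds -> probability
```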
14.2 Near-Field Occupancy
The five body-mounted near-range LiDARs (historically RoboSense Bpearl) provide dense near-field occupancy:
- Detection within 10 cm of the vehicle body
- Critical for parking, narrow-passage navigation, and low-speed maneuvering
- Provides occupancy data where roof-mounted AT128s have blind spots (directly beside and below the vehicle)
14.3 Hybrid Free Space
The DL path provides semantic free space (drivable area segmentation from cameras), while the classical path provides geometric free space (occupancy grid from LiDAR ray casting). Both are fused:
- Geometric free space is authoritative for physical obstacle presence
- Semantic free space adds context (road surface type, curb boundaries, sidewalk vs. road)
- The union is used: if either method indicates occupied, the cell is treated as non-traversable
15. Road Geometry Detection
15.1 Classical Lane Detection Components
LiDAR-based lane marking detection:
- Ground return intensity analysis: Painted lane markings reflect more 905 nm laser light than asphalt
- Intensity gradient detection along ground plane points identifies marking edges
- Model fitting (polynomial curves) to detected intensity edges produces lane boundary estimates
- Effective up to AT128's 70 m ground detection range
Camera-based classical lane detection:
- Preprocessing: Y-channel extraction from YUV420 (luminance only, 67% memory savings)
- Edge detection: Sobel/Canny operators on the Y-channel image detect marking edges
- Perspective transformation: Inverse perspective mapping (IPM) transforms the camera image to a top-down BEV view, making lane lines parallel
- Hough line detection: Identifies straight-line candidates in the transformed image
- Polynomial fitting: 2nd/3rd-order polynomial curves fit to detected marking points, constrained by:
- Lane width priors (3.0-3.75 m for Chinese national standard lanes)
- Continuity with previous frame's lane model
- Symmetry constraints (parallel lane boundaries)
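The polynomial-fit and lane-width-prior steps can be sketched as follows (function names and the evaluation range are illustrative; only the 3.0-3.75 m prior comes from the text above):

```python
import numpy as np

def fit_lane_boundary(xs, ys, order=2):
    """Fit y = f(x) polynomial (x forward, y left) to marking points.
    Returns coefficients in ascending-power order."""
    return np.polynomial.polynomial.polyfit(xs, ys, order)

def lane_width_plausible(left_coef, right_coef, x_eval,
                         min_w=3.0, max_w=3.75):
    """Check left/right boundary separation against lane-width priors."""
    yl = np.polynomial.polynomial.polyval(x_eval, left_coef)
    yr = np.polynomial.polynomial.polyval(x_eval, right_coef)
    widths = yl - yr
    return bool(np.all((widths >= min_w) & (widths <= max_w)))

# Two gently curving boundaries 3.5 m apart, and an implausible pair
xs = np.linspace(0.0, 50.0, 30)
left = fit_lane_boundary(xs, 1.75 + 0.001 * xs**2)
right = fit_lane_boundary(xs, -1.75 + 0.001 * xs**2)
narrow = fit_lane_boundary(xs, -0.50 + 0.001 * xs**2)
x_eval = np.linspace(5.0, 40.0, 8)
```

A candidate pair failing the width check would be rejected in favor of the previous frame's lane model or the HD map prior.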
Road boundary detection:
- LiDAR height discontinuities at curb edges (10-25 cm height steps)
- LiDAR reflectivity changes at road edge (asphalt vs. grass/dirt)
- Camera-based guardrail/barrier detection via edge detection and template matching
15.2 HD Map-Assisted Road Geometry
Classical lane detection is augmented by the HD map:
- Known lane geometry from the map provides strong prior constraints
- Online perception refines/confirms map-based lane positions
- When lane markings are faded, obscured by snow, or missing, the HD map provides the geometry
- Discrepancies between perceived and mapped lane geometry trigger construction zone detection
16. Traffic Signal Processing
16.1 Classical Components in Traffic Light Detection
Traffic light detection uses a hybrid of classical and ML approaches:
Classical preprocessing:
- Color space conversion: RGB or YUV to HSV for color-based segmentation
- Color thresholding: HSV-space masks isolate red, yellow, and green candidate regions
- Morphological operations: Erosion/dilation to clean up color masks and remove noise
- Region of Interest (ROI) extraction: HD map provides expected traffic light positions in image coordinates (projecting known 3D map positions through calibrated camera intrinsics); the classical pipeline constrains detection to these ROIs, dramatically reducing false positives from other red/green light sources (neon signs, tail lights)
Temporal state machine:
- State transition validation: Traffic light states follow physical constraints -- in most configurations a light cannot transition directly from green to red without passing through yellow, so observed sequences that violate the known phase order are rejected
- Hysteresis filtering: A state transition is only confirmed after the new state is observed for a minimum number of consecutive frames (e.g., 2-3 frames), preventing flicker-induced errors from LED PWM dimming, camera rolling shutter artifacts, or momentary occlusions
- Countdown timer tracking: For Chinese traffic lights with countdown displays, OCR-based digit recognition provides additional temporal context
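The transition validation and hysteresis logic amounts to a small state machine. A minimal sketch (the phase order and the 3-frame confirmation count are assumptions consistent with the description above, not confirmed parameters):

```python
# Allowed phase successors (assumed typical phase order:
# red -> green -> yellow -> red)
SUCCESSORS = {
    "red": {"red", "green"},
    "green": {"green", "yellow"},
    "yellow": {"yellow", "red"},
}

class TrafficLightFilter:
    """Confirm a state change only after N consecutive consistent frames."""

    def __init__(self, confirm_frames=3):
        self.state = "red"
        self.candidate = None
        self.count = 0
        self.confirm_frames = confirm_frames

    def observe(self, raw_state):
        if raw_state == self.state:
            self.candidate, self.count = None, 0    # flicker resolved
            return self.state
        if raw_state not in SUCCESSORS[self.state]:
            return self.state                        # illegal transition
        if raw_state == self.candidate:
            self.count += 1
        else:
            self.candidate, self.count = raw_state, 1
        if self.count >= self.confirm_frames:
            self.state, self.candidate, self.count = raw_state, None, 0
        return self.state

flt = TrafficLightFilter()
```

A single spurious green frame (PWM flicker, occlusion) leaves the published state at red; only a sustained run of consistent frames flips it.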
HD map association:
- Each detected traffic light is matched to a known signal position in the HD map
- This resolves ambiguity when multiple traffic light groups are visible (common at large Chinese intersections)
- The map specifies which signal group governs which lane, enabling correct lane-to-signal association
16.2 PiDC Camera for Traffic Lights
Pony.ai's self-developed PiDC (Pony intelligent Detection Camera) is specifically optimized for traffic light detection:
- Fixed 11 ms exposure: Eliminates exposure variation that would change traffic light appearance across frames
- Neutral density filter: Prevents oversaturation at the fixed high-sensitivity exposure setting
- After deploying PiDC, monthly traffic-light-related issues dropped to zero (August 2021)
- The fixed exposure simplifies both classical color classification (consistent HSV values) and ML-based classification (consistent input distribution)
16.3 Chinese Traffic Light Challenges
| Challenge | Classical Handling Approach |
|---|---|
| Multiple signal groups visible | HD map ROI constraints + spatial association |
| Vertical and horizontal orientations | Aspect ratio analysis of detected light cluster geometry |
| Arrow signals (left, right, U-turn) | Shape template matching within detected light region |
| Countdown timers | OCR digit recognition on timer display region |
| LED PWM flicker | Temporal hysteresis filtering (multi-frame confirmation) |
| Nighttime neon interference | HD map ROI + size/shape constraints filter non-signal lights |
17. Kalman Filtering and State Estimation
17.1 State Vector
The heuristic tracking path maintains a state vector for each tracked object. The typical state vector for a vehicle target:
x = [px, py, pz, vx, vy, heading, yaw_rate, length, width, height]

Where:
- (px, py, pz): 3D position in vehicle frame
- (vx, vy): 2D velocity in the ground plane
- heading: Yaw angle (orientation)
- yaw_rate: Angular velocity
- (length, width, height): Object dimensions
For pedestrians, a simpler state may be used (no yaw_rate, smaller dimension vector).
17.2 Process Model
Constant Turn-Rate and Acceleration (CTRA) model:
px(t+dt) = px(t) + vx(t)*dt + 0.5*ax*dt^2
py(t+dt) = py(t) + vy(t)*dt + 0.5*ay*dt^2
heading(t+dt) = heading(t) + yaw_rate*dt

For simpler cases, Constant Velocity (CV) or Constant Turn-Rate and Velocity (CTRV) models are used. The model selection may be class-dependent:
- Vehicles: CTRA or CTRV (captures turning behavior)
- Pedestrians: CV (relatively constant velocity between observations)
- E-bikes: CTRA (frequent turning, acceleration changes)
17.3 Extended Kalman Filter (EKF) / Unscented Kalman Filter (UKF)
- Prediction step: Propagate state forward using the process model; propagate covariance through the nonlinear motion model (EKF linearizes via Jacobian; UKF uses sigma points)
- Update step: When a new detection is associated with the track, incorporate the measurement to correct the predicted state
- Measurement model: Maps state to expected measurement (position, dimensions, heading from detection)
- Radar velocity integration: When radar Doppler measurements are available for a tracked object, they provide a direct velocity measurement that dramatically improves velocity estimation accuracy compared to position-only differentiation
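The predict/update cycle with a direct velocity measurement can be sketched with a linear CV-model filter (a simplification: real radar gives radial velocity only, and the class names and noise values here are illustrative assumptions):

```python
import numpy as np

class CVKalman:
    """Constant-velocity Kalman filter sketch: state [px, py, vx, vy].

    LiDAR supplies position measurements; radar is modeled here as a
    direct 2D velocity measurement, which tightens the velocity estimate
    far faster than differentiating positions.
    """

    def __init__(self):
        self.x = np.zeros(4)
        self.P = np.eye(4) * 10.0

    def predict(self, dt, q=1.0):
        F = np.eye(4)
        F[0, 2] = F[1, 3] = dt
        self.x = F @ self.x
        self.P = F @ self.P @ F.T + q * np.eye(4) * dt

    def _update(self, z, H, R):
        y = z - H @ self.x                       # innovation
        S = H @ self.P @ H.T + R                 # innovation covariance
        K = self.P @ H.T @ np.linalg.inv(S)      # Kalman gain
        self.x = self.x + K @ y
        self.P = (np.eye(4) - K @ H) @ self.P

    def update_lidar_position(self, pos, sigma=0.1):
        H = np.array([[1.0, 0, 0, 0], [0, 1.0, 0, 0]])
        self._update(np.asarray(pos), H, np.eye(2) * sigma**2)

    def update_radar_velocity(self, vel, sigma=0.2):
        H = np.array([[0, 0, 1.0, 0], [0, 0, 0, 1.0]])
        self._update(np.asarray(vel), H, np.eye(2) * sigma**2)

# Track a target moving at 5 m/s along +x
kf = CVKalman()
for k in range(30):
    kf.predict(0.1)
    t = (k + 1) * 0.1
    kf.update_lidar_position([5.0 * t, 0.0])
    kf.update_radar_velocity([5.0, 0.0])
```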
17.4 Multi-Sensor State Estimation
The Kalman filter fuses measurements from multiple sensor modalities:
- LiDAR detections: Provide precise 3D position and dimensions
- Camera detections: Provide 2D bounding boxes; 3D position estimated via known camera geometry + assumed ground plane or depth estimation
- Radar detections: Provide range, angle, and direct radial velocity
- Each measurement source has its own noise model (measurement covariance matrix), reflecting the sensor's accuracy characteristics
18. Data Association
18.1 The Association Problem
Each perception cycle produces new detections from both perception paths and all sensor modalities. Data association answers: "Which new detection corresponds to which existing track?"
18.2 Classical Association Methods
Global Nearest Neighbor (GNN) / Hungarian Algorithm:
- Construct a cost matrix: rows = existing tracks, columns = new detections
- Cost = Mahalanobis distance (accounting for state uncertainty) between predicted track position and detection position
- Hungarian algorithm finds the optimal one-to-one assignment that minimizes total cost
- Unassigned detections become candidate new tracks; unassigned tracks age without update
Joint Probabilistic Data Association (JPDA):
- In dense scenes (Chinese urban intersections with closely spaced e-bikes, pedestrians), GNN may make incorrect hard assignments
- JPDA computes the probability that each detection belongs to each track
- The state update uses a weighted combination of all plausible associations
- More robust in clutter but computationally more expensive
Gating:
- Before association, a gate (validation region) around each track's predicted position filters out implausible associations
- Mahalanobis distance gating: only detections within the track's predicted uncertainty ellipsoid (e.g., 3-sigma) are considered
- Reduces computation and prevents gross misassociations
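GNN with Mahalanobis gating and Hungarian assignment fits in a short function (a textbook sketch; the 3-sigma gate matches the text above, everything else is illustrative):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_means, track_covs, detections, gate=3.0):
    """GNN association: Mahalanobis cost + gating + Hungarian assignment.

    Returns a list of (track_idx, det_idx) pairs; gated-out pairs get a
    prohibitively large cost so they are never selected."""
    n_t, n_d = len(track_means), len(detections)
    cost = np.full((n_t, n_d), 1e6)
    for i, (m, P) in enumerate(zip(track_means, track_covs)):
        Pinv = np.linalg.inv(P)
        for j, z in enumerate(detections):
            d = z - m
            maha = np.sqrt(d @ Pinv @ d)
            if maha < gate:
                cost[i, j] = maha
    rows, cols = linear_sum_assignment(cost)
    return [(i, j) for i, j in zip(rows, cols) if cost[i, j] < 1e6]

tracks = [np.array([0.0, 0.0]), np.array([10.0, 0.0])]
covs = [np.eye(2) * 0.25, np.eye(2) * 0.25]
dets = [np.array([9.8, 0.1]), np.array([0.2, -0.1]), np.array([50.0, 50.0])]
pairs = associate(tracks, covs, dets)
```

The detection at (50, 50) falls outside every gate and is left unassigned, making it a candidate new track.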
18.3 Multi-Sensor Association
Cross-modal association is particularly challenging:
- A single physical object may produce detections from multiple LiDARs, multiple cameras, and radar simultaneously
- Association must merge these into a single track
- Spatial consistency: Detections from different sensors at the same 3D location (within calibration error) are candidates for association
- Temporal consistency: Detections arriving at similar timestamps with consistent motion are likely from the same object
- Classification consistency: If LiDAR classifies a cluster as "vehicle" and camera classifies the same region as "vehicle," association confidence increases
19. Track Management
19.1 Track Lifecycle
Track birth:
- A new detection not associated with any existing track is initialized as a tentative track
- Tentative tracks must be confirmed by subsequent detections in consecutive frames (e.g., detected in 3 of 5 frames)
- Confirmation prevents single false detections from creating permanent tracks
Track maintenance:
- Active tracks are updated with each associated detection via the Kalman filter
- Track confidence increases with consistent detections across multiple frames and sensor modalities
- Classification is refined over time: early frames may be ambiguous; as more observations accumulate, classification certainty increases
Track coasting:
- When a track receives no associated detection in a frame (occlusion, sensor blind spot, missed detection), it coasts -- the state is propagated forward by the process model without measurement update
- Coasting increases state uncertainty (covariance grows)
- Maximum coast duration depends on object class and velocity: high-speed vehicles may coast for fewer frames (their predicted position becomes unreliable faster)
Track death:
- Tracks that coast beyond a maximum duration without re-detection are terminated
- Tracks that exit the sensor coverage area are terminated
- Tracks with unreasonably high uncertainty (covariance exceeds threshold) are terminated
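The birth/confirmation/coasting/death lifecycle can be sketched as a small state machine. This simplification uses consecutive-hit confirmation rather than the M-of-N ("3 of 5 frames") scheme mentioned above, and the thresholds are assumed values:

```python
class Track:
    """Minimal tentative/confirmed/coasting lifecycle (assumed thresholds)."""

    CONFIRM_HITS = 3      # detections needed to confirm a tentative track
    MAX_COAST = 5         # frames a confirmed track may coast before death

    def __init__(self):
        self.status = "tentative"
        self.hits = 1      # born from one detection
        self.misses = 0

    def on_detection(self):
        self.hits += 1
        self.misses = 0    # an associated detection ends any coasting
        if self.status == "tentative" and self.hits >= self.CONFIRM_HITS:
            self.status = "confirmed"

    def on_miss(self):
        self.misses += 1
        if self.status == "tentative" or self.misses > self.MAX_COAST:
            self.status = "dead"   # tentative tracks die on first miss
```

A per-class MAX_COAST (shorter for fast vehicles, as noted above) would replace the constant in a fuller implementation.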
19.2 Track-to-Track Fusion
When the dual perception paths produce independent track lists:
- Track-to-track association: Heuristic and DL path tracks at similar positions are associated
- State fusion: If both paths track the same object, a fused state estimate (weighted by respective covariances) produces the output track
- Discrepancy handling: If one path tracks an object the other doesn't, the single-path track is retained with adjusted confidence
20. IMU/GNSS Integration
20.1 Sensor Hardware
Pony.ai's Gen-7 system includes high-accuracy GNSS and IMU as documented in SEC filings:
- GNSS receiver: Multi-constellation (GPS, GLONASS, BeiDou, Galileo) with RTK correction capability
- IMU: 6-DOF inertial measurement unit (3-axis accelerometer + 3-axis gyroscope), likely tactical-grade MEMS for automotive application
20.2 IMU Processing (Classical)
The IMU provides high-rate (100-400 Hz) measurements of vehicle acceleration and angular velocity. Classical processing:
Inertial navigation (dead reckoning):
orientation(t+dt) = orientation(t) + gyro_measurement * dt
velocity(t+dt) = velocity(t) + (rotation_matrix * accel_measurement - gravity) * dt
position(t+dt) = position(t) + velocity(t) * dt + 0.5 * acceleration * dt^2

- Integration of accelerometer data (after removing gravity and rotating to navigation frame) provides velocity and position updates
- Integration of gyroscope data provides orientation updates
- Drift problem: Double integration of accelerometer noise causes position error to grow quadratically with time (~meters per minute without correction)
- Dead reckoning is therefore only useful for short-term bridge periods when other sensors are unavailable (GPS outage in tunnels, LiDAR-denied environments)
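The integration equations above, reduced to the 2D plane, look like this (a planar sketch with gravity already removed; a real strapdown mechanization works in 3D with quaternion attitude):

```python
import numpy as np

def dead_reckon(accels, gyros, dt, pose0=(0.0, 0.0, 0.0), vel0=(0.0, 0.0)):
    """Planar strapdown dead reckoning sketch (gravity already removed).

    accels: (N, 2) body-frame accelerations; gyros: (N,) yaw rates.
    Returns final (x, y, yaw). Accelerometer noise double-integrates
    into position error growing roughly quadratically with time.
    """
    x, y, yaw = pose0
    vx, vy = vel0
    for a, w in zip(accels, gyros):
        c, s = np.cos(yaw), np.sin(yaw)
        ax = c * a[0] - s * a[1]          # rotate body accel to nav frame
        ay = s * a[0] + c * a[1]
        x += vx * dt + 0.5 * ax * dt * dt
        y += vy * dt + 0.5 * ay * dt * dt
        vx += ax * dt
        vy += ay * dt
        yaw += w * dt
    return x, y, yaw

# 1 s of constant 1 m/s^2 forward acceleration at 100 Hz -> 0.5 m traveled
accels = np.tile([1.0, 0.0], (100, 1))
x, y, yaw = dead_reckon(accels, np.zeros(100), dt=0.01)
```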
Bias estimation:
- Accelerometer and gyroscope biases are estimated and compensated in real-time
- Temperature-dependent bias models correct for thermal drift
- Turn-on bias calibration occurs during vehicle startup (stationary period)
20.3 GNSS Processing (Classical)
Position fixing:
- Multi-constellation GNSS provides absolute position in WGS84 coordinates
- RTK correction from base stations or network corrections provides centimeter-level accuracy in open-sky conditions
- Accuracy degrades in urban canyons, tunnels, underpasses, and dense tree cover
GPS correction of IMU drift:
- GNSS position fixes reset accumulated IMU drift
- The Kalman filter (or error-state Kalman filter) continuously estimates and corrects IMU biases using GNSS measurements as a reference
- When GNSS is unavailable (tunnel, underground parking), the filter extrapolates using IMU-only dead reckoning, with growing uncertainty
20.4 GNSS-IMU Fusion (Classical Kalman Filter)
A tightly-coupled GNSS-IMU fusion filter is standard practice:
- State vector: Position, velocity, orientation, IMU biases (accelerometer bias, gyroscope bias), GNSS clock offset
- Process model: IMU-driven state propagation at high rate (100+ Hz)
- Measurement model: GNSS pseudorange and carrier phase observations at 1-10 Hz
- Output: Continuous 6-DOF pose estimate at IMU rate, with accuracy bounded by GNSS corrections
This is entirely classical estimation theory -- no neural networks involved.
21. LiDAR-to-Map Matching
21.1 HD Map for Localization
Pony.ai uses HD maps for centimeter-level localization. The HD map contains:
- 3D point cloud map (dense reference point cloud from prior mapping runs)
- Lane-level road geometry
- Traffic infrastructure positions (traffic lights, signs, poles)
- Building and curb geometry
21.2 Scan Matching Algorithms
Localization achieves centimeter-level accuracy by matching live LiDAR scans to the pre-built HD map point cloud:
Iterative Closest Point (ICP):
- For each point in the live scan, find the closest point in the reference map
- Compute the rigid transformation (rotation + translation) that minimizes the sum of squared distances between corresponding point pairs
- Apply the transformation and iterate until convergence
- Variants: point-to-point ICP, point-to-plane ICP (more robust, faster convergence)
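A minimal point-to-point ICP in 2D, using the closed-form Kabsch/SVD rigid-transform solution per iteration (a generic textbook sketch, not Pony.ai's localizer):

```python
import numpy as np
from scipy.spatial import cKDTree

def icp_2d(src, ref, n_iters=20):
    """Point-to-point ICP sketch (2D): align src onto ref.

    Each iteration pairs every source point with its nearest reference
    point, then solves the optimal rigid transform in closed form via
    the Kabsch/Procrustes SVD solution."""
    tree = cKDTree(ref)
    R_total, t_total = np.eye(2), np.zeros(2)
    cur = src.copy()
    for _ in range(n_iters):
        _, idx = tree.query(cur)
        matched = ref[idx]
        mu_s, mu_r = cur.mean(axis=0), matched.mean(axis=0)
        H = (cur - mu_s).T @ (matched - mu_r)
        U, _, Vt = np.linalg.svd(H)
        R = Vt.T @ U.T
        if np.linalg.det(R) < 0:          # guard against reflections
            Vt[-1] *= -1
            R = Vt.T @ U.T
        t = mu_r - R @ mu_s
        cur = cur @ R.T + t
        R_total, t_total = R @ R_total, R @ t_total + t
    return R_total, t_total

# Reference "map" grid, and a scan offset by 2 degrees and (0.1, -0.05)
xs = np.arange(-5.0, 6.0)
gx, gy = np.meshgrid(xs, xs)
ref = np.column_stack([gx.ravel(), gy.ravel()])
th = np.deg2rad(2.0)
Rt = np.array([[np.cos(th), -np.sin(th)], [np.sin(th), np.cos(th)]])
src = ref @ Rt.T + np.array([0.1, -0.05])
R_est, t_est = icp_2d(src, ref)
aligned = src @ R_est.T + t_est
```

In practice the GNSS-IMU pose supplies the initial guess so the nearest-neighbor pairing starts close to correct, as the section notes.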
Normal Distributions Transform (NDT):
- Discretize the reference map into cells (voxels)
- For each cell, compute the mean and covariance of contained points (modeling the local surface as a Gaussian distribution)
- Score the live scan against the map by evaluating how well each live point fits the Gaussian of its containing cell
- Optimize the 6-DOF pose to maximize the total score
- NDT is more computationally efficient than ICP for large maps and provides smoother optimization landscapes
Pony.ai likely uses a variant or combination of these methods, potentially with initial alignment from GNSS-IMU and refinement via scan matching.
21.3 Feature-Based Matching
Beyond dense point cloud matching, feature-based approaches provide faster and more robust localization:
- Pole/post detection: Vertical structures (poles, posts, trees) provide stable localization features
- Curb edge detection: Road boundary geometry matched against map curb positions
- Building corner matching: Sharp geometric features in the point cloud matched to map building geometry
- These features are extracted using classical geometric algorithms (vertical line detection, height discontinuity detection, corner detection in 3D)
21.4 SLAM Patent (US11,908,198)
Pony.ai's patent for "Generating graphical illustrations of point cloud frames for SLAM algorithms" describes:
- A system that obtains sensor data and determines sensor position/orientation using SLAM algorithms
- Generates graphical illustrations of captured point cloud frames including trajectory points
- Visualizes loop closure constraints (when the vehicle revisits a previously mapped area, correcting accumulated drift)
- Fuses GNSS data and IMU data in generating accurate maps
- Interactive interface for examining and adjusting loop closure constraints
This patent indicates Pony.ai uses graph-based SLAM for map construction, with pose graph optimization and loop closure -- entirely classical optimization techniques (Levenberg-Marquardt or Gauss-Newton optimization on factor graphs).
22. Visual Odometry
22.1 Camera-Based Motion Estimation
Visual odometry provides ego-motion estimates from camera images, serving as an additional input to the localization fusion:
Feature extraction and matching:
- Classical feature detectors (FAST, ORB, SIFT/SURF variants) identify salient keypoints in consecutive camera frames
- Descriptor matching between frames establishes point correspondences
- Outlier rejection via RANSAC eliminates mismatches
Motion estimation:
- Essential matrix estimation: From matched feature correspondences, the essential matrix encoding relative rotation and translation (up to scale) is computed via the 5-point or 8-point algorithm
- PnP (Perspective-n-Point): When 3D positions of features are known (from LiDAR depth or previous triangulation), PnP directly estimates the 6-DOF camera pose
- Scale estimation: Monocular VO has inherent scale ambiguity; resolved by fusing with LiDAR range measurements or IMU-derived velocity
22.2 Role in Pony.ai's Stack
Visual odometry likely serves as a secondary motion estimation source:
- Primary localization comes from LiDAR-to-map matching (centimeter accuracy) and GNSS-IMU
- VO provides additional motion estimates, particularly useful when LiDAR features are sparse (e.g., featureless highway stretches) but rich visual texture is available
- VO is entirely classical signal processing and geometric computation
23. Multi-Sensor Localization Fusion
23.1 Fusion Architecture
Pony.ai achieves centimeter-level localization through multi-sensor fusion. The system is described as using "multi-sensor fusion of rich datasets for understanding the static environment" to achieve centimeter-level accuracy. The sensors feeding localization include GNSS, IMU, LiDAR (scan matching), cameras (visual odometry), and wheel odometry.
23.2 Classical Fusion Filter
The localization fusion is a multi-state Kalman filter or factor graph optimization:
Error-State Extended Kalman Filter (ES-EKF):
- State: 6-DOF pose (position + orientation), velocity, IMU biases
- Prediction: IMU-driven at 100+ Hz
- Updates from multiple asynchronous sources:
- GNSS: 1-10 Hz, absolute position (when available)
- LiDAR scan matching: 10 Hz, relative or absolute pose correction
- Visual odometry: 10-30 Hz, relative motion
- Wheel odometry: continuous, forward velocity
- Each update source has its own measurement model and noise covariance
- The filter seamlessly handles sensor dropouts (e.g., GPS loss in tunnel) by continuing on remaining sensors
Factor Graph Optimization (alternative or complement):
- Pose graph with nodes at each timestep and edges representing:
- IMU preintegration factors (high-rate relative motion)
- GNSS absolute position factors
- LiDAR scan matching relative pose factors
- Loop closure factors (from SLAM, if applicable)
- Batch or sliding-window optimization provides globally consistent trajectory estimates
- More accurate than filtering but computationally more expensive
23.3 GPS-Denied Localization
In GPS-denied environments (tunnels, underground parking, dense urban canyons):
- LiDAR-to-map matching + IMU dead reckoning provide continuous localization
- Pre-mapped tunnel/parking structure maps enable LiDAR matching even without GPS
- Localization uncertainty grows during GPS outages but remains within safe bounds for the duration of typical outages
24. Mixed Traffic Rule-Based Handling
24.1 Chinese Urban Traffic Context
Pony.ai's software is explicitly described as "a combination of AI models and rule-based code, designed to interpret traffic patterns, predict behaviors, and execute driving decisions." The rule-based code is essential for handling China's unique traffic:
24.2 Rule-Based E-Bike/Scooter Handling
E-bikes are the most challenging road user class in Chinese cities. Rule-based handling:
| Behavior | Rule-Based Response |
|---|---|
| E-bike riding in vehicle lane | Classify as VRU; increase lateral clearance buffer to 1.5m+; reduce speed |
| E-bike crossing against traffic signal | Apply conservative yield; treat as potential red-light violator; maintain enlarged safety zone |
| E-bike with oversized cargo (delivery packages) | Increase bounding box extent beyond detected boundaries; apply wider clearance |
| E-bike emerging from between parked vehicles | Apply occlusion-aware safety zone; reduce speed in narrow passages with parked vehicles |
| Multiple e-bikes in cluster | Track as group; apply group-level motion prediction with expanded safety zone |
24.3 Three-Wheeler and Non-Standard Vehicle Rules
- Size variance handling: Three-wheelers range from small enclosed vehicles to large open cargo platforms. Rule-based size filters widen acceptance ranges beyond standard vehicle templates
- Overloaded vehicle rules: When detected object dimensions exceed standard vehicle bounds, apply extended clearance zones
- Slow-moving vehicle rules: Vehicles moving significantly slower than traffic flow trigger enhanced monitoring and safe passing behavior
24.4 Pedestrian Rule-Based Handling
- Jaywalking prediction zones: Near certain locations (bus stops, shopping areas, median breaks), pedestrian detection zones are enlarged and yield behavior is activated at lower confidence thresholds
- Group crossing rules: When multiple pedestrians are detected in proximity, assume coordinated group movement and yield to the entire group
- Crosswalk state machines: Pedestrians approaching a crosswalk trigger proactive yield behavior even before entering the crosswalk
25. Non-Standard Infrastructure Handling
25.1 Missing or Faded Lane Markings
Classical handling when lane markings are absent or undetectable:
- HD map fallback: Lane geometry from the map provides lane positions when online detection fails
- Road boundary detection: Curb edges and barriers detected from LiDAR height discontinuities define road extent
- Vehicle trajectory following: Other vehicles' trajectories (from tracking) imply lane structure in unmarked areas
- Width estimation from road boundaries: When boundaries are detected but markings are not, lane positions are inferred from road width and standard lane width assumptions
25.2 Construction Zone Classical Handling
Pony.ai has published a technical blog post specifically about construction zone handling, revealing classical components:
Live Semantic Map:
- The system constructs a real-time live semantic map to detect non-movable obstacles (cones, barriers, construction equipment)
- This map persists obstacle positions even when they are temporarily occluded
- When construction cones are occluded by passing trucks, the live semantic map remembers their positions
Cone Boundary Formation:
- Detected construction cones are connected to form boundaries in the perception system
- This creates a virtual "wall" preventing the vehicle from entering the construction area
- The boundary formation is a classical geometric operation: connecting detected cone positions in sequence using nearest-neighbor ordering
- The vehicle then navigates within the corridor formed by cones on both sides, maintaining safe distance from the boundaries
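The nearest-neighbor ordering step described above can be sketched as a greedy chain: start from one cone and repeatedly connect the closest remaining cone. This is an illustrative reconstruction of the stated geometric operation, not Pony.ai's implementation.

```python
import math

def order_cones_nearest_neighbor(cones, start_idx=0):
    """Greedily chain detected cone positions (x, y) into an ordered
    polyline -- the virtual 'wall' bounding the construction area."""
    remaining = list(range(len(cones)))
    order = [remaining.pop(start_idx)]
    while remaining:
        last = cones[order[-1]]
        nxt = min(remaining, key=lambda i: math.dist(last, cones[i]))
        remaining.remove(nxt)
        order.append(nxt)
    return [cones[i] for i in order]
```

Greedy chaining is O(n^2) but trivially fast for the dozens of cones in a typical zone; a production system would also need outlier rejection and a rule for splitting left- and right-side boundaries.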
Map Discrepancy Detection:
- When perceived road geometry (from online perception) differs significantly from the HD map, the system infers a construction zone or road modification
- This triggers a switch to a more conservative driving mode relying on real-time perception rather than map-based priors
25.3 Non-Standard Road Geometry
- Unmarked intersections: Dead-reckoning through the intersection using the HD map's intersection topology
- Temporary lane shifts: Detected via cone/barrier positions forming new lane boundaries different from the HD map
- Narrow hutong/alley navigation: Near-range LiDAR (5 body-mounted units) provides 10 cm-level clearance detection; occupancy grid-based path planning ensures collision avoidance
26. ML Detections to Classical Tracking
26.1 The Handoff Interface
The deep learning perception path produces per-frame detections:
- 3D bounding boxes with class labels, confidence scores, heading angles
- These are "instantaneous snapshots" -- no temporal continuity
These detections are fed into the classical tracking pipeline (Sections 17-19):
Per-frame DL detections (3D bbox, class, confidence)
|
v
Data Association (Hungarian algorithm / JPDA)
|-- Match to existing tracks (Mahalanobis distance gating)
|-- Create new tracks for unmatched detections
|-- Age tracks without matches
|
v
Kalman Filter State Update
|-- Incorporate DL detection as measurement
|-- Measurement noise set by DL confidence
|
v
Track Management (birth / coast / death)
|
v
Output: Smoothed tracks with filtered position, velocity,
classification history, predicted trajectories
26.2 Confidence-Weighted Updates
The DL detection confidence score modulates the Kalman filter measurement noise:
- High confidence DL detection: Low measurement noise covariance -> filter trusts the detection strongly
- Low confidence DL detection: High measurement noise covariance -> filter relies more on its prediction
- This allows the classical filter to gracefully handle uncertain DL outputs
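The confidence-to-noise mapping can be sketched as follows. The inverse-linear mapping and the variance bounds are illustrative assumptions; the key property is only that higher confidence yields a smaller measurement noise covariance R, so the Kalman gain weights that detection more heavily.

```python
import numpy as np

def measurement_noise_from_confidence(conf, r_min=0.05, r_max=2.0):
    """Map a DL detection confidence in (0, 1] to a position measurement
    noise covariance R: high confidence -> low variance -> strong trust.
    r_min/r_max (m^2) are illustrative bounds."""
    conf = min(max(conf, 1e-3), 1.0)
    var = r_min + (r_max - r_min) * (1.0 - conf)
    return np.diag([var, var])  # R for an (x, y) position measurement
```

In the filter update, this R enters the innovation covariance S = H P H^T + R, so a low-confidence detection inflates S and shrinks the Kalman gain, letting the motion-model prediction dominate.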
26.3 Classification Refinement
DL classification (vehicle, pedestrian, cyclist, etc.) feeds into the track's classification history:
- A Bayesian classification accumulator counts votes from each frame's DL classification
- Over multiple frames, the most likely class emerges with high confidence
- This filters out single-frame misclassifications that are common in DL perception
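A minimal sketch of such an accumulator, under the assumption that each frame's DL head emits a class probability vector: summing log-probabilities per track implements a naive-Bayes vote, so a single misclassified frame is outvoted by the history. Class names and the uniform prior are illustrative.

```python
import numpy as np

CLASSES = ["vehicle", "pedestrian", "cyclist"]

class ClassAccumulator:
    """Per-track Bayesian class accumulator over per-frame DL scores."""
    def __init__(self):
        self.log_belief = np.zeros(len(CLASSES))  # uniform prior

    def update(self, frame_probs):
        """Fold in one frame's class probability vector."""
        p = np.clip(np.asarray(frame_probs, dtype=float), 1e-6, 1.0)
        self.log_belief += np.log(p)

    def most_likely(self):
        return CLASSES[int(np.argmax(self.log_belief))]
```
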
27. Classical Preprocessing to ML
27.1 LiDAR Preprocessing Before Neural Networks
Before point clouds reach the DL detection networks, classical preprocessing has already been applied:
- Time synchronization and motion compensation (Section 9): Nanosecond-level sync and IMU-based motion correction
- Multi-LiDAR point cloud merging: Extrinsic calibration transforms all points to vehicle frame
- Ground plane estimation (RANSAC): Separating ground from obstacles improves 3D detection by reducing background clutter
- Noise filtering: Statistical outlier removal, IPE-flagged point removal
- Voxelization or pillar encoding: Discretizing the continuous point cloud into a structured grid for neural network consumption (VoxelNet-style or PointPillars-style encoding)
These classical steps ensure the neural network receives clean, organized, time-aligned data rather than raw, noisy, desynchronized sensor packets.
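Of the steps above, RANSAC ground plane estimation is the most self-contained to sketch. The following is a toy version assuming a plane of the form z = ax + by + c; iteration count and the 0.15 m inlier threshold are illustrative, not Pony.ai's values.

```python
import numpy as np

def ransac_ground_plane(points, n_iters=100, dist_thresh=0.15, rng=None):
    """Fit z = a*x + b*y + c to an (N, 3) point cloud by RANSAC.
    Returns a boolean inlier mask marking likely ground points."""
    if rng is None:
        rng = np.random.default_rng(0)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        sample = points[rng.choice(len(points), 3, replace=False)]
        A = np.c_[sample[:, :2], np.ones(3)]
        try:
            coef = np.linalg.solve(A, sample[:, 2])  # [a, b, c]
        except np.linalg.LinAlgError:
            continue  # degenerate (collinear) sample
        resid = np.abs(points[:, :2] @ coef[:2] + coef[2] - points[:, 2])
        mask = resid < dist_thresh
        if mask.sum() > best_mask.sum():
            best_mask = mask
    return best_mask
```

Points outside the mask (obstacles, curbs) are what the detector actually needs to reason about, which is why removing the ground background improves 3D detection.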
27.2 Camera Preprocessing Before Neural Networks
- ISP pipeline (Section 8): Debayering, exposure correction, white balance, noise reduction, tone mapping
- YUV420 format: Native ISP output, no conversion overhead
- GPU-resident data: Camera frames arrive in GPU memory via NvStreams/GPU Direct RDMA, ready for neural network inference
- HEVC encoding: For recording/playback; neural networks consume the raw YUV frames directly
- Region of Interest extraction: For specific tasks (traffic light detection), HD map-guided ROI extraction crops the input before feeding the network
27.3 Radar Preprocessing Before Any ML
All radar signal processing (Section 7) -- range-Doppler FFT, CFAR detection, beamforming, DOA estimation -- is classical and occurs before any potential neural network processing of radar data. The neural network (if used for radar object classification) receives already-detected target lists with range, velocity, and angle, not raw ADC samples.
28. Kinematic Feasibility Checking
28.1 Classical Motion Model Constraints on ML Predictions
The prediction module produces future trajectory predictions for other road agents. Classical kinematic feasibility checking validates these predictions:
Vehicle kinematic constraints:
- Maximum steering angle: Vehicles cannot turn tighter than their minimum turning radius (typically 5-6 m for passenger cars)
- Maximum acceleration/deceleration: Physical limits on longitudinal acceleration (typically 0.3-0.8g for normal driving, up to ~1g for emergency braking)
- Maximum lateral acceleration: Tire friction limits (typically 0.3-0.5g for comfortable driving)
- Maximum yaw rate: Coupled to velocity and steering angle via bicycle model kinematics
Feasibility check:
- Each predicted trajectory waypoint is checked against kinematic constraints:
  curvature(t) <= 1 / min_turning_radius
  longitudinal_accel(t) <= max_accel
  lateral_accel(t) = v(t)^2 * curvature(t) <= max_lateral_accel
- Trajectories violating these constraints are either clipped to the constraint boundary or discarded
- This prevents the ML prediction module from producing physically impossible trajectories (e.g., a truck turning on a dime)
Class-specific models:
- Vehicle: Bicycle kinematic model with steering constraints
- Pedestrian: Point-mass model with maximum acceleration limits (~2 m/s^2)
- E-bike: Bicycle model with higher acceleration and tighter turning radius than cars
- Truck: Extended bicycle model with longer wheelbase and wider minimum turning radius
28.2 Ego-Vehicle Trajectory Validation
Classical kinematic checking also validates the planning module's proposed ego trajectories:
- Every planned trajectory is checked against the vehicle's physical capabilities
- This is a classical safety check that runs independently of any ML components in the planning pipeline
29. HD Map Integration
29.1 Classical Map Features Informing Perception
The HD map provides strong classical priors that constrain and enhance perception:
| Map Feature | Perception Use |
|---|---|
| Lane geometry (center lines, boundaries) | Constrains lane detection; provides geometry when markings are faded |
| Traffic light 3D positions | Generates camera ROIs for traffic light detection; associates detections with correct signal groups |
| Traffic sign positions | Guides sign detection search areas; validates sign classification |
| Speed limits per road segment | Informs kinematic feasibility bounds for prediction |
| Intersection topology | Defines valid turning paths; constrains prediction trajectories to map-legal maneuvers |
| Road boundary geometry | Constrains drivable area estimation; provides curb positions even when LiDAR detection is unreliable |
| Crosswalk positions | Triggers enhanced pedestrian monitoring in classical pipeline |
| Stop line positions | Provides precise stopping points for traffic signal compliance |
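The traffic-light ROI row in the table amounts to a pinhole projection of the map-stored 3D light position into the image, padded to a search window. A minimal sketch, assuming the light position is already transformed into the camera frame and K is the intrinsic matrix; the half-width and pixel margin are illustrative.

```python
import numpy as np

def traffic_light_roi(light_xyz_cam, K, box_half_m=0.5, margin_px=20):
    """Project a traffic light's 3D position (camera frame) through
    pinhole intrinsics K, then expand to a square pixel ROI."""
    X, Y, Z = light_xyz_cam
    if Z <= 0:
        return None  # behind the camera: no ROI
    u = K[0, 0] * X / Z + K[0, 2]
    v = K[1, 1] * Y / Z + K[1, 2]
    # Convert the physical half-extent to pixels at depth Z, add margin.
    half = K[0, 0] * box_half_m / Z + margin_px
    return (int(u - half), int(v - half), int(u + half), int(v + half))
```

Cropping to this ROI before inference both cuts compute and ties each detection to a specific map signal group, since the ROI was generated from that group's surveyed position.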
29.2 Dynamic Map Updating (US Patent 11,885,624)
Pony.ai's patent for a "Dynamic Map Updating System" describes a classical method for maintaining map accuracy:
- Entity change identification: The system identifies map entities that change over time (construction zones, road modifications, new buildings)
- Change prediction: Predicts the amount of change over time
- Map update: Updates the map based on predicted changes
- Vehicle navigation: Adjusts navigation based on updated map
This enables the perception system to anticipate map changes rather than only react to discrepancies between perception and outdated map data.
29.3 Map-Perception Discrepancy Detection
When online perception disagrees with the HD map, classical rules determine the response:
- Minor discrepancy (< threshold): Likely noise or calibration error; trust the map
- Persistent discrepancy (sustained over distance/time): Likely real-world change; flag for map update; trust perception
- Major discrepancy (road blocked, lane closed): Trigger construction zone mode; rely entirely on real-time perception; apply conservative driving behavior
30. 1,000+ Monitoring Mechanisms
30.1 Architecture
Pony.ai states that "based on ISO 26262 functional safety methodology, more than a thousand monitoring mechanisms run in parallel with normal functions, with failure mode and safety state fully taken into consideration."
These monitors are primarily rule-based/classical -- they check deterministic conditions and thresholds rather than using learned models.
30.2 Categories of Monitors
Sensor Health Monitors:
- LiDAR point cloud rate (each of 9 LiDARs must produce points above minimum threshold)
- LiDAR point cloud statistics (range histogram, intensity distribution -- sudden changes indicate sensor degradation)
- Camera frame rate and exposure quality (overexposed, underexposed, frozen frame, black frame)
- Radar detection rate and statistics
- GNSS signal quality (number of satellites, HDOP, RTK fix status)
- IMU measurement validity (accelerometer/gyroscope range, bias drift rate)
- Microphone audio level (silence detection for acoustic sensor health)
- Water sensor state (precipitation detection)
Perception Health Monitors:
- Detection count per frame (sudden drop indicates perception failure)
- Detection latency (inference time exceeds deadline)
- DL model output validity (NaN check, bounding box dimension sanity, class confidence distribution)
- Heuristic path output validity (cluster count, ground plane fit quality)
- Dual-path consistency (sustained disagreement triggers alarm)
- Track continuity (tracks disappearing/appearing anomalously)
Calibration Monitors:
- LiDAR-camera reprojection error statistics
- Radar Doppler residuals (deviation from expected stationary object velocities)
- Sudden calibration jumps (impact detection -> recalibration trigger)
- Inter-LiDAR consistency (overlapping FOV regions should produce consistent point clouds)
Localization Monitors:
- GNSS-IMU filter innovation sequence (large innovations indicate inconsistency)
- LiDAR scan matching score (low matching quality indicates localization uncertainty)
- Localization covariance bounds (position uncertainty exceeds safe threshold)
- Map-relative position validity (vehicle position within mapped road boundaries)
Compute Health Monitors:
- GPU temperature, utilization, memory usage
- CPU load and scheduling latency
- PCIe bus error rates
- Memory ECC error counts
- Inference engine status (TensorRT session health)
- Power supply monitoring
Communication Monitors:
- Sensor-to-compute data bus integrity
- Inter-chip communication (main Orin <-> redundant Orin)
- Vehicle CAN bus communication
- Cellular network connectivity (for remote assistance)
Vehicle State Monitors:
- Wheel speed sensor consistency
- Steering angle sensor health
- Brake system pressure monitoring
- Drive-by-wire (DBW) system status
30.3 Monitor Architecture
Each monitor is a classical rule-based checker:
IF condition_violated(threshold, duration):
    report_fault(severity, subsystem)
    IF severity >= CRITICAL:
        trigger_degradation(appropriate_level)
Monitors run at the frequency of their monitored subsystem (e.g., perception monitors at 10 Hz, IMU monitors at 100+ Hz) and are independent of the main perception/planning pipeline -- they cannot be bypassed by a perception failure.
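A concrete version of that checker pattern, sketched under the assumption that the `duration` parameter means the condition must stay violated continuously before a fault is raised (debouncing transient glitches). All names and thresholds here are illustrative.

```python
CRITICAL = 2  # illustrative severity scale

class ThresholdMonitor:
    """Duration-debounced threshold monitor: reports a fault only if the
    monitored value stays below threshold for duration_s seconds."""
    def __init__(self, threshold, duration_s, severity, subsystem):
        self.threshold = threshold
        self.duration_s = duration_s
        self.severity = severity
        self.subsystem = subsystem
        self._violated_since = None

    def step(self, value, now):
        """Call at the subsystem's native rate; returns a fault or None."""
        if value < self.threshold:           # e.g., LiDAR point rate too low
            if self._violated_since is None:
                self._violated_since = now
            if now - self._violated_since >= self.duration_s:
                return {"subsystem": self.subsystem,
                        "severity": self.severity,
                        "degrade": self.severity >= CRITICAL}
        else:
            self._violated_since = None      # condition cleared: reset
        return None
```

Because each monitor holds only its own trivial state, it keeps running even when the pipeline it watches has failed, which is the property the text emphasizes.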
31. 20+ Safety Redundancies
31.1 Complete Redundancy Taxonomy
Pony.ai implements 20+ safety redundancies across 4 categories:
Category 1: Software System Redundancy (7 Types)
| # | Redundancy | Classical/Rule-Based Component |
|---|---|---|
| 1 | Multi-Layer Degradation System | Rule-based state machine governing transitions between Normal -> Degraded -> Minimal Risk Condition based on fault severity |
| 2 | Fault Detection and System Arbitration Module | Rule-based monitors that detect faults and arbitrate between main and fallback systems; threshold-based checks |
| 3 | Heterogeneous Algorithm on Main and Fallback System | The dual perception architecture itself -- heuristic (classical) vs. DL (learned); different algorithmic paradigms provide diversity |
| 4 | Communication Redundancy on Main and Fallback System | Rule-based monitoring of communication buses; automatic switchover protocols |
| 5 | Trajectory Cross-Validation Redundancy | Classical comparison of trajectories generated by main and fallback planning; geometric consistency checks |
| 6 | Multi-Sensor Fusion Perception & Localization Redundancy | Classical sensor fusion (Kalman filter based) combining multiple sensor modalities; continues operating when individual sensors fail |
| 7 | Multi-Algorithm Fusion Redundancy of Key ADS Modules | Running multiple algorithmic implementations (classical + ML) of critical functions and cross-validating outputs |
Category 2: Hardware Component Redundancy (7 Types)
| # | Redundancy | Description |
|---|---|---|
| 1 | N x 360-degree FOV coverage | Overlapping sensor fields of view; any single sensor failure still leaves coverage from adjacent sensors |
| 2 | Redundant computing units | 3 main OrinX chips + 1 dedicated redundant OrinX chip; MRCC provides tertiary compute |
| 3 | Redundant localization sensors | Multiple GNSS receivers and IMU units; continues localizing with partial failures |
| 4 | Redundant cellular communications | Multiple cellular modems for remote assistance and OTA; continues communicating with single modem failure |
| 5 | Redundant accident detection | Collision sensor + multiple sensor modalities that can independently detect impacts |
| 6 | Redundant data storage | Multiple storage units for logging; ensures data preservation for incident reconstruction |
| 7 | Redundant sensor cleaning | Self-developed cleaning system with redundant mechanisms for maintaining sensor clarity |
Category 3: Vehicle Platform Redundancy (5 Types)
| # | Redundancy | Description |
|---|---|---|
| 1 | Parking brake system | Independent parking brake for vehicle securement if the primary braking system fails |
| 2 | Steering system | Redundant steering actuators/controllers; vehicle remains steerable with single actuator failure |
| 3 | Braking system | Redundant brake circuits; vehicle can stop safely with partial brake failure |
| 4 | Power supply | Redundant power distribution; the ADC supports both liquid and passive cooling, with passive cooling enabling safe vehicle control if liquid cooling fails |
| 5 | Drive-by-wire (DBW) system | Redundant DBW controllers ensuring electronic throttle/brake/steering commands are reliable |
Category 4: Service Redundancy (3 Types)
| # | Redundancy | Description |
|---|---|---|
| 1 | External safety warnings | Multiple alert modalities (visual, acoustic) to warn other road users; includes patented directed acoustic alert (US10,647,250) and directed visual alert (US10,726,687) |
| 2 | Cellphone NFC unlock | Backup vehicle access method if primary digital unlock fails |
| 3 | Emergency call system | Independent emergency communication channel; passengers can reach human operators even if main compute fails |
31.2 MRCC (Minimum Risk Condition Controller)
The MRCC is the ultimate safety fallback -- a separate compute system that can maintain vehicle control even when the primary autonomous driving system has completely failed:
- Hardware: Runs on the dedicated 4th OrinX chip (separate from the 3 main chips)
- Perception capability: Maintains critical perception including blind-spot coverage using a subset of sensors connected to the redundant system
- Driving capability: Can "navigate intersections or ramps and safely pull over, minimizing the risk of traffic disruption or collisions"
- Independence: Operates even when "main system's power or chassis communication fails"
- Cooling independence: Passive cooling backup ensures the MRCC remains operational if liquid cooling fails
The MRCC's perception is necessarily simpler than the main system (running on a single OrinX chip vs. three), likely relying heavily on classical/heuristic algorithms that require less compute than full DL inference.
Sources
Primary Sources (Pony.ai)
- Pony.ai Technology Page
- Pony.ai Safety Report (March 2022)
- Pony.ai SEC Form 20-F (April 2025)
- Pony.ai SEC Form F-1 (October 2024)
- Pony.ai Construction Zone Blog Post
- Pony.ai Traffic Light Camera Blog Post
- Pony.ai L4 Domain Controller Milestone (July 2025)
- Pony.ai Gen-7 Mass Production (July 2025)
- Pony.ai ISP Data Augmentation Presentation (2020)
NVIDIA Technical Documentation
- Accelerating the Pony AV Sensor Data Processing Pipeline (NVIDIA Developer Blog)
- Van, Go: Pony.ai Robotaxi Fleet on DRIVE Orin (NVIDIA Blog)
- Pony.ai DRIVE Orin ADC Mass Production (BusinessWire)
Sensor Technology
- Hesai AT128 Product Page
- Hesai OT128 Product Page (IPE details)
- Hesai AT128 Selection for Gen-7 Robotaxis (Hesai)
- Hesai Fourth Generation Chip Architecture Analysis
- ON Semiconductor / Pony.ai ISP Collaboration (GlobeNewsWire)
Patent Sources
- US11,250,240B1 -- Instance Segmentation (Cross-Modal)
- US11,454,701 -- Real-Time Doppler Calibration
- US12,032,102 -- Static Object Calibration
- US11,908,198 -- SLAM Point Cloud Visualization
- US11,885,624 -- Dynamic Map Updating System
- US11,774,978 -- GAN-Based Scenario Generation
- US20230384451A1 -- FMCW LiDAR Sensor
- Pony.ai Patents Overview (GreyB)
- Pony.ai Patent Filings (USPTO Report)
Industry Analysis
- 4D Millimeter-Wave Radar in Autonomous Driving: A Survey
- Towards BEV+Transformer Autonomous Driving: Chinese Robotaxi Case Study (maadaa.ai)
- Inside Pony.ai's Staying Power (KrASIA)
- Automatic Targetless LiDAR-Camera Calibration: A Survey
- Pony AI Update: Robotaxis and Robotruck Services (EETimes)
- Grizzly Research Pony.ai Report