Motional Perception Stack: Exhaustive Deep Dive

Last updated: March 2026


Table of Contents

  1. nuTonomy Heritage
  2. Sensor Fusion Architecture
  3. LiDAR Perception
  4. Camera Perception
  5. Radar Perception
  6. 3D Object Detection
  7. Object Tracking
  8. Prediction
  9. Large Driving Models (LDM) Pivot
  10. nuScenes Dataset Architecture
  11. nuPlan Planning Dataset
  12. Semantic Segmentation
  13. Occupancy Prediction
  14. BEV Representation
  15. Temporal Fusion
  16. Traffic Infrastructure
  17. Pedestrian Detection
  18. Non-ML Components
  19. Auto-Labeling
  20. Key Papers
  21. Key Patents
  22. Perception Metrics

1. nuTonomy Heritage

Founding and Formal Methods Roots

Motional's perception stack traces its intellectual lineage to nuTonomy, co-founded in 2013 by Dr. Karl Iagnemma (director of MIT's Robotic Mobility Group) and Dr. Emilio Frazzoli (MIT professor of Aeronautics and Astronautics, now at ETH Zurich). nuTonomy was a spinout from MIT research with a distinguishing technical philosophy: the use of formal methods -- formal logic, sampling-based motion planning, and model checking -- to produce provably correct autonomous driving behaviors.

Attribute | Detail
Founded | 2013, as MIT spinout
Technical Distinction | Formal logic-based decision making; provable safety guarantees
Key IP | Sampling-based motion planning with completeness, correctness, and optimality guarantees
Acquired By | Delphi Automotive (now Aptiv) for ~$450M in October 2017

Singapore Deployment: The World's First Robotaxi

In August 2016, nuTonomy launched the world's first public robotaxi pilot in Singapore's one-north business district, using modified Renault Zoes and Mitsubishi i-MiEVs. The perception system for this deployment was comparatively early-generation but established several foundational principles that persist in Motional's architecture today:

  • Multi-modal sensing: The Singapore vehicles carried LiDARs on the roof and around the front bumper, radar, and cameras providing near-complete surround coverage -- establishing the pattern of overlapping multi-sensor fields of view that Motional's IONIQ 5 would later scale to 30+ sensors.
  • LiDAR-centric localization: nuTonomy used LiDAR data as the primary source for localization, providing more accurate determination of the vehicle's position within its environment than GPS alone.
  • Formal logic for behavior: Rather than purely reactive control, nuTonomy's cars used decision-making software based on formal logic. As Karl Iagnemma described it: "a rigorous algorithmic process that's translating specifications on how the car should behave into verifiable software." The formal logic told the taxis when low-priority "rules of the road" could be broken safely (e.g., crossing a center line to navigate around a double-parked vehicle) while maintaining safety invariants.

Perception Architecture Evolution: nuTonomy to Motional

The progression from nuTonomy to Motional represents three distinct architectural eras:

Era 1: nuTonomy (2013-2017) -- Formal Methods + Classical Perception

  • Classical computer vision and early deep learning for object detection
  • LiDAR-centric perception with camera augmentation
  • Formal methods for decision making with provable guarantees
  • Rule-based behavior specifications using formal languages and model checking
  • Compositional data-driven approaches for formal collision risk estimation

Era 2: Aptiv/Early Motional (2017-2024) -- Modular ML Stack

  • Progressive replacement of classical perception with deep neural networks
  • Introduction of PointPillars, PointPainting, and other ML-based detectors
  • Modular pipeline: Perception --> Prediction --> Planning --> Control
  • Each module independently developed and optimized
  • Transformer neural networks (TNNs) adopted for perception
  • BEV representations introduced for multi-camera fusion

Era 3: AI-First / LDM (2024-present) -- End-to-End Foundation Models

  • Perception integrated into Large Driving Models (LDMs) as part of unified perception-prediction-planning
  • Shared transformer backbone across all modules
  • End-to-end training replacing stitched module outputs
  • Safety guardrail system running in parallel for edge cases
  • Embodied foundation models trained on diverse multi-city datasets

Key nuTonomy/Motional Perception Researchers

Name | Role | Key Contributions
Oscar Beijbom | Sr. Director of ML at Motional (now at Zoox/Nyckel) | nuScenes, PointPillars, PointPainting; PhD in CV/ML from UCSD
Holger Caesar | Research Scientist (now Asst. Prof. at TU Delft) | nuScenes, nuPlan, Panoptic nuScenes
Alex H. Lang | Research Scientist (later at Waymo) | PointPillars, PointPainting
Sourabh Vora | ML Research Engineer | PointPillars, PointPainting
Whye Kit Fong | Research Scientist | Panoptic nuScenes, nuScenes expansions
Lubing Zhou | Research Scientist | PointPillars, Panoptic nuScenes

2. Sensor Fusion Architecture

IONIQ 5 Robotaxi Sensor Suite

The Hyundai IONIQ 5 robotaxi carries an industry-leading 30+ sensor configuration providing 360-degree perception with redundant overlapping coverage:

Sensor Type | Count | Specifications | Role
Cameras | 13 | Multiple focal lengths and FOVs; varying lenses for near-field and far-field | High-resolution imaging, object classification, lane/sign recognition
Radar | 11 | Aptiv FLR4+ long-range radars; 360-degree coverage; 200m+ range; 77 GHz mmWave | Doppler velocity measurement, all-weather detection, AEB
LiDAR (Long-Range) | 1+ | Ouster Alpha Prime VLS-128; 300m range; 0.1-degree resolution; 360-degree | Primary 3D depth sensing, surround point cloud
LiDAR (Short-Range) | 4 | Hesai Technology units | Close-range around-vehicle coverage
GPS | Yes | Position reference | Coarse localization
IMU | Yes | Inertial measurement | Motion tracking, ego-motion estimation

Key design principle: Hyundai and Motional teams spent months co-designing the placement of every sensor. Unlike earlier AV platforms where sensors were bolted on as aftermarket additions, the IONIQ 5's sensors are aesthetically integrated into the body design while achieving 360-degree perception with no blind spots.

Fusion Architecture: Multi-Level, Multi-Modal

Motional's sensor fusion operates at multiple levels, combining data from all three primary modalities (camera, LiDAR, radar) using neural network-based fusion:

                        ┌─────────────────────────────────────┐
                        │         Unified BEV Feature Space    │
                        │   (Bird's-Eye View Representation)   │
                        └──────────────┬──────────────────────┘

                    ┌──────────────────┼──────────────────┐
                    │                  │                  │
            ┌───────┴───────┐  ┌──────┴──────┐  ┌───────┴───────┐
            │ Camera Branch │  │ LiDAR Branch│  │ Radar Branch  │
            │               │  │             │  │               │
            │ 13 cameras    │  │ 5+ LiDARs   │  │ 11 radars     │
            │ Image backbone│  │ Voxelization│  │ Point cloud   │
            │ View transform│  │ 3D backbone │  │ ML processing │
            │ --> BEV feats │  │ --> BEV feats│  │ --> BEV feats │
            └───────────────┘  └─────────────┘  └───────────────┘

Level 1: Point-Level Fusion (PointPainting)

Motional's first major fusion approach, PointPainting (CVPR 2020), established a sequential fusion paradigm:

  1. Run a 2D image semantic segmentation network on camera images
  2. Project LiDAR points into the segmented image space
  3. Append per-class segmentation scores to each LiDAR point
  4. Feed the "painted" point cloud to any LiDAR-only 3D detector

This approach allowed LiDAR detectors (PointPillars, VoxelNet, PointRCNN) to benefit from camera semantic information without requiring architectural modifications to the detector itself. On the nuScenes benchmark, PointPainting improved detection performance across all tested backbones.
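
A minimal sketch of the painting step, assuming a calibrated pinhole camera (intrinsics K, a LiDAR-to-camera extrinsic transform) and an image segmentation network that outputs per-pixel class scores; the function and variable names are illustrative, not Motional's internal API:

import numpy as np

def paint_points(points_xyz, seg_scores, K, T_cam_from_lidar):
    """Append per-pixel class scores to LiDAR points, PointPainting-style.

    points_xyz:       (N, 3) LiDAR points in the LiDAR frame.
    seg_scores:       (H, W, C) per-class scores from an image segmentation net.
    K:                (3, 3) camera intrinsic matrix.
    T_cam_from_lidar: (4, 4) extrinsic transform from LiDAR to camera frame.
    """
    N = points_xyz.shape[0]
    H, W, C = seg_scores.shape

    # Transform points into the camera frame (homogeneous coordinates).
    pts_h = np.hstack([points_xyz, np.ones((N, 1))])
    pts_cam = (T_cam_from_lidar @ pts_h.T).T[:, :3]

    # Project with the intrinsics; only points in front of the camera count.
    z = np.maximum(pts_cam[:, 2:3], 1e-6)
    uv = (K @ pts_cam.T).T[:, :2] / z
    u, v = uv[:, 0].astype(int), uv[:, 1].astype(int)
    valid = (pts_cam[:, 2] > 0.1) & (u >= 0) & (u < W) & (v >= 0) & (v < H)

    # Points that fall outside the image keep an all-zero score vector.
    painted = np.zeros((N, C), dtype=np.float32)
    painted[valid] = seg_scores[v[valid], u[valid]]

    # The "painted" cloud is the original geometry plus class scores, ready to
    # feed into any LiDAR-only detector (PointPillars, VoxelNet, PointRCNN).
    return np.hstack([points_xyz, painted])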

Level 2: Neural Network Point Cloud Fusion (LiDAR + Radar)

Motional uses a neural network to fuse LiDAR and radar point clouds, preserving more information than processing each point cloud separately. The fused LiDAR-radar point cloud is then combined with camera data via PointPainting, creating a three-modality fusion:

Radar Point Cloud ──┐
                    ├── Neural Net Fusion ──> Fused Point Cloud ──> PointPainting with Camera ──> Detection
LiDAR Point Cloud ──┘

This approach is significant because radar provides instantaneous Doppler velocity measurements that neither LiDAR nor cameras can directly measure, while LiDAR provides precise 3D geometry that radar lacks.

Level 3: BEV-Space Fusion (TransFusion / BEVFusion)

More recent fusion approaches operate in the BEV feature space, where features from all modalities are projected into a common bird's-eye-view representation:

  • TransFusion (CVPR 2022) uses a soft-association mechanism via transformer attention to fuse LiDAR and camera features, avoiding the brittleness of hard geometric projection. A two-layer transformer decoder first generates initial bounding boxes from LiDAR, then adaptively fuses with camera features using attention-based spatial and contextual relationships.

  • BEVFusion (ICRA 2023, MIT Han Lab) unifies multi-modal features in a shared BEV space, applying modality-specific encoders before transforming all features into BEV. It achieves 1.3% higher mAP/NDS on nuScenes detection and 13.6% higher mIoU on BEV map segmentation with 1.9x lower computation cost.

Level 4: End-to-End LDM Fusion (Current Architecture)

In Motional's current LDM architecture, sensor fusion is subsumed into the end-to-end model. Raw sensor inputs from all modalities feed into the Large Driving Model, which jointly learns perception, prediction, and planning representations. This eliminates the information loss that occurred at module boundaries in the modular stack.

Redundancy and Graceful Degradation

The 30+ sensor configuration provides multiple layers of redundancy:

  • Cross-modal redundancy: Any single sensor failure (e.g., a camera failure) is compensated by other modalities (LiDAR, radar) covering the same region
  • Within-modality redundancy: 13 cameras provide overlapping fields of view; 11 radars provide 360-degree coverage with overlap
  • Weather resilience: Radar maintains functionality in rain, snow, fog, and darkness where cameras and LiDAR may degrade
  • Range coverage: Long-range LiDAR (300m) and radar (200m+) cover far-field; short-range LiDAR and cameras cover near-field

3. LiDAR Perception

Hardware: Ouster Alpha Prime VLS-128 (Long-Range)

Specification | Value
Supplier | Ouster (exclusive long-range LiDAR supplier through 2026)
Model | Alpha Prime VLS-128
Beams | 128 channels
Range | Up to 300 meters
Resolution | Up to 0.1-degree vertical and horizontal
Coverage | 360-degree surround view
Data Output | Real-time 3D point cloud
Points/Second | Up to ~2.6 million (128 channels x 20 Hz rotation)

The Alpha Prime VLS-128 represents a significant upgrade from the Velodyne HDL-32E (32-beam, 70m range, ~1.39M pts/sec) used in the nuScenes data collection era. The 4x increase in beam count and 4x increase in range provide substantially denser point clouds at greater distances, improving detection of small objects (pedestrians, cyclists) at long range.

Hardware: Hesai Technology (Short-Range)

Specification | Value
Supplier | Hesai Technology
Count | 4 units per vehicle
Coverage | Close-range, around-the-car perimeter
Purpose | Near-field blind spot coverage, low-speed maneuvering, parking

The Hesai units fill the near-field coverage gaps that the roof-mounted Ouster cannot see (e.g., small objects directly adjacent to the vehicle, curbs, low obstacles).

Historical Context: Velodyne HDL-32E (nuScenes Era)

The nuScenes dataset was collected using a Velodyne HDL-32E, which constrains the characteristics of all models trained and evaluated on nuScenes:

Specification | Value
Beams | 32
Capture Frequency | 20 Hz
Points per Ring | ~1,080 (+/- 10)
Usable Range | Up to 70 meters
Accuracy | +/- 2 cm
Points/Second | Up to ~1.39 million

LiDAR Point Cloud Processing Pipeline

Motional's LiDAR processing follows a well-established pipeline that has evolved through their published research:

Raw Point Cloud --> Preprocessing --> Feature Extraction --> Detection Head --> 3D Bounding Boxes
                        │                    │                    │
                   - Range filter       - Voxelization       - Heatmap head
                   - Ground removal       (VoxelNet)           (center detection)
                   - Ego-motion comp.   - Pillarization      - Regression heads
                   - Multi-sweep         (PointPillars)        (size, orientation,
                     aggregation        - Sparse 3D conv        velocity)
                                        - BEV flattening

Voxelization (VoxelNet-Based)

The 3D space around the vehicle is divided into a regular grid of voxels (3D pixels). Points within each voxel are encoded using a small PointNet-like network that captures local geometry. Sparse 3D convolutions process the voxel features, and the resulting 3D feature volume is flattened along the height dimension to produce a 2D BEV feature map.

Pillarization (PointPillars-Based)

PointPillars (CVPR 2019), developed by nuTonomy researchers Alex Lang, Sourabh Vora, Holger Caesar, and Oscar Beijbom, introduced a faster alternative to voxelization:

  • The point cloud is organized into vertical columns (pillars) instead of 3D voxels
  • A simplified PointNet encodes the points within each pillar
  • The resulting representation is a 2D pseudo-image (BEV) where each pixel corresponds to a pillar
  • All subsequent operations are standard 2D convolutions, enabling GPU-efficient processing
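
A toy illustration of this pillar encoding, where a simple per-pillar max over point features stands in for the learned simplified PointNet; the grid extents, cell size, and feature width are assumptions chosen for the example, not the paper's exact configuration:

import numpy as np

def pillarize(points, x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), cell=0.25):
    """Scatter LiDAR points into vertical pillars and build a BEV pseudo-image.

    points: (N, F) array whose first three columns are x, y, z in the ego frame.
    Returns an (H, W, F) pseudo-image; each pillar is summarized here by a simple
    per-feature max, standing in for PointPillars' learned per-pillar PointNet.
    """
    W = int((x_range[1] - x_range[0]) / cell)
    H = int((y_range[1] - y_range[0]) / cell)
    F = points.shape[1]
    bev = np.zeros((H, W, F), dtype=np.float32)

    # Each point's pillar index in the BEV grid.
    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < W) & (iy >= 0) & (iy < H)

    # Per-pillar aggregation; after this, everything downstream in PointPillars
    # is ordinary 2D convolution over the (H, W) pseudo-image.
    for x, y, feat in zip(ix[keep], iy[keep], points[keep]):
        bev[y, x] = np.maximum(bev[y, x], feat)
    return bev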

Performance characteristics:

  • 62 Hz detection rate (vs. ~2-10 Hz for earlier methods)
  • A faster variant matched state-of-the-art at 105 Hz
  • Despite using LiDAR only, outperformed fusion methods on KITTI bird's-eye view detection
  • Became the backbone for auto-labeling in the nuPlan dataset

Multi-Sweep Aggregation

For temporal context, multiple LiDAR sweeps (typically 10 sweeps = 0.5 seconds at 20 Hz) are aggregated after ego-motion compensation. Each point is tagged with its relative timestamp, providing the network with motion cues (moving objects produce "trails" in the aggregated cloud). This temporal aggregation was used extensively in CenterPoint and subsequent detectors.
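
A simplified sketch of this aggregation, assuming each sweep's points are already expressed in that sweep's ego frame and that 4x4 ego-to-global pose matrices are available for every sweep:

import numpy as np

def aggregate_sweeps(sweeps, poses, times):
    """Merge past LiDAR sweeps into the newest sweep's ego frame.

    sweeps: list of (N_i, 3) point arrays, oldest to newest (e.g., 10 sweeps).
    poses:  list of (4, 4) ego-to-global transforms, one per sweep.
    times:  list of per-sweep timestamps in seconds.
    Returns an (M, 4) array of [x, y, z, dt] points in the newest ego frame.
    """
    T_now_inv = np.linalg.inv(poses[-1])          # global -> current ego frame
    t_now = times[-1]
    merged = []
    for pts, T_sweep, t in zip(sweeps, poses, times):
        pts_h = np.hstack([pts, np.ones((len(pts), 1))])
        # Ego-motion compensation: sweep's ego frame -> global -> current ego frame.
        pts_now = (T_now_inv @ T_sweep @ pts_h.T).T[:, :3]
        # Tag every point with its age so the network can see motion "trails".
        dt = np.full((len(pts), 1), t_now - t)
        merged.append(np.hstack([pts_now, dt]))
    return np.vstack(merged)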


4. Camera Perception

13-Camera Pipeline

The IONIQ 5 robotaxi uses 13 cameras with varying focal lengths and fields of view to achieve 360-degree visual coverage. This represents a significant expansion from the 6-camera setup used in nuScenes data collection.

Camera Configuration | nuScenes (Historical) | IONIQ 5 (Current)
Count | 6 | 13
Model | Basler acA1600-60gc | Undisclosed (production-grade)
Resolution | 1600x900 ROI | Higher resolution (undisclosed)
Capture Rate | 12 Hz | Higher (undisclosed)
Coverage | 360-degree with 1 rear camera | 360-degree with multiple focal lengths

Surround-View Image Networks (Transformer-Based)

Motional uses Surround-View Image Networks built on Transformer Neural Networks (TNNs) to convert camera inputs into BEV representations. The key technical aspects:

Why Transformers over CNNs: Motional adopted transformers because they "capture global dependencies and long-range interactions within the data." Traditional CNNs process local patches with limited receptive fields, while transformers can attend to any part of the image, enabling better understanding of scene context. TNNs excel at "blocking background noise" and focusing on critical objects through their "long-distance attention module."

Camera-to-BEV View Transformation: The fundamental challenge in camera perception for autonomous driving is converting 2D perspective images into 3D world-frame representations. As Motional describes it: "we must convert that two-dimensional, street-level image into a 3D object viewable from overhead." This view transformation is performed by the Surround-View Image Network, which:

  1. Encodes each camera image independently using a vision transformer backbone
  2. Lifts 2D features into 3D using depth estimation and camera intrinsic/extrinsic parameters
  3. Projects the 3D features into a unified BEV grid
  4. Applies BEV-space feature processing for downstream tasks
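
A heavily simplified, lift-and-splat style sketch of steps 2-3, using a single expected depth per pixel rather than a learned depth distribution; the grid extents, cell size, and simple overwrite-splat are illustrative assumptions rather than Motional's network design:

import numpy as np

def lift_to_bev(feats, depth, K, T_ego_from_cam,
                x_range=(-50.0, 50.0), y_range=(-50.0, 50.0), cell=0.5):
    """Project per-pixel image features into a BEV grid.

    feats: (H, W, C) feature map from the camera backbone.
    depth: (H, W) estimated depth in meters per feature-map pixel.
    K:     (3, 3) camera intrinsics, scaled to the feature-map resolution.
    T_ego_from_cam: (4, 4) camera-to-ego extrinsic transform.
    """
    H, W, C = feats.shape
    gw = int((x_range[1] - x_range[0]) / cell)
    gh = int((y_range[1] - y_range[0]) / cell)
    bev = np.zeros((gh, gw, C), dtype=np.float32)

    # Back-project every pixel to a 3D point in the camera frame (step 2: lift).
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    rays = np.linalg.inv(K) @ np.stack([u.ravel(), v.ravel(), np.ones(H * W)])
    pts_cam = rays * depth.ravel()
    pts_ego = (T_ego_from_cam @ np.vstack([pts_cam, np.ones(H * W)]))[:3].T

    # Splat features into BEV cells (step 3); real systems pool rather than overwrite.
    ix = ((pts_ego[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((pts_ego[:, 1] - y_range[0]) / cell).astype(int)
    keep = (ix >= 0) & (ix < gw) & (iy >= 0) & (iy < gh)
    bev[iy[keep], ix[keep]] = feats.reshape(-1, C)[keep]
    return bev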

Inference Optimization: In their CVPR 2023 Workshop paper "Training Strategies for Vision Transformers for Object Detection," Motional evaluated strategies to optimize inference time of vision transformer-based detection. They achieved a 63% improvement in inference time at the cost of only 3% performance drop through:

  • Reduced input resolution
  • Image pre-cropping
  • Query embedding adjustments
  • On-vehicle network pruning and quantization

Camera-Only Detection Capabilities

Cameras provide capabilities that LiDAR and radar cannot:

  • Color and texture recognition: Distinguishing between a pedestrian and a traffic cone based on visual appearance
  • Traffic light state detection: Red/green/yellow recognition requires color discrimination
  • Sign reading: Speed limit, stop sign, construction zone identification
  • Lane marking detection: Dashed vs. solid lines, lane colors
  • Fine-grained classification: Vehicle make/model, pedestrian attributes (adult vs. child, carrying objects)

5. Radar Perception

Hardware Configuration

Motional uses 11 radar units (more than double the typical L2/L3 vehicle configuration) providing 360-degree coverage:

Specification | Value
Count | 11 units
Primary Model | Aptiv FLR4+ long-range radar
Frequency | 77 GHz millimeter-wave (mmWave)
Detection Range | Beyond 200 meters
Coverage | Full 360-degree (front, sides, rear)
Key Capability | Direct Doppler velocity measurement
Weather Resilience | Functional in rain, snow, fog, dust, darkness

Unlike Level 2/3 vehicles that typically mount a single forward-facing radar for adaptive cruise control, Motional deploys radars in a surround configuration including rear-facing units -- critical for detecting vehicles approaching from behind during lane changes.

Low-Level Radar Data: The Paradigm Shift

Motional has made a deliberate strategic decision to move beyond conventional radar processing. Traditional automotive radar systems use on-chip Digital Signal Processors (DSPs) that process raw data locally and output only a sparse set of detections (a few hundred per frame). This pre-processing discards significant semantic information from the low-level radar signal.

Motional's approach replaces this with a centralized low-level radar architecture:

Traditional Approach:
  Radar Frontend --> On-Chip DSP --> Sparse Detections (~100-300 points/frame)

Motional's Approach:
  Radar Frontend --> Raw ADC Data --> Central Computer --> ML Pipeline --> Dense Radar Imagery
  (multi-Gbps)       (preserved)     (GPU processing)    (end-to-end)   (20M+ pts/sec equiv.)

Key innovations in Motional's imaging radar architecture:

  1. Raw ADC Processing: The end-to-end perception model is trained directly from the radar's raw Analog-to-Digital Converter (ADC) output, bypassing traditional signal processing entirely.

  2. Multi-Channel Multi-Scan (MCMS) Aggregation: An ML module aggregates low-level radar data across multiple channels and multiple scans, producing high-fidelity radar images.

  3. Radar Point Cloud Density: The system generates the equivalent of over 20 million points per second -- compared to "a LiDAR system generating 2 million points per second" -- a dramatic improvement over conventional systems producing "merely a few hundred detections per frame."

  4. Update Rate: High-fidelity, low-latency radar images are produced at 20 Hz, matching LiDAR frame rates.

  5. VRU Detection: The ML-trained radar perception achieves 3x Average Precision (AP) improvement in Vulnerable Road User (VRU) detection compared to conventional radar processing.
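
For context on what "low-level" radar data looks like before any learning is applied, conventional FMCW processing turns raw ADC samples into a range-Doppler map with two FFTs, then thresholds it (CFAR) into a sparse detection list -- the step where information is lost. The generic sketch below illustrates only that conventional intermediate; Motional's approach instead feeds data at this raw level into its ML models:

import numpy as np

def range_doppler_map(adc):
    """Form a range-Doppler map from raw FMCW radar ADC samples (one antenna).

    adc: (n_chirps, n_samples) complex baseband samples for a single channel.
    A range FFT within each chirp resolves distance; a Doppler FFT across chirps
    resolves radial velocity. A conventional DSP would next run CFAR thresholding
    and emit only a sparse detection list -- the information loss that processing
    low-level data centrally is meant to avoid.
    """
    n_chirps, n_samples = adc.shape
    rng = np.fft.fft(adc * np.hanning(n_samples), axis=1)
    dop = np.fft.fftshift(np.fft.fft(rng * np.hanning(n_chirps)[:, None], axis=0), axes=0)
    return 20.0 * np.log10(np.abs(dop) + 1e-12)   # magnitude in dB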

Radar Dataset

Motional has curated a petabyte-scale multi-modality dataset that integrates low-level radar output with synchronized camera and LiDAR data. This dataset is enriched through both automated labeling and manual annotation methods, enabling iterative improvement of the radar perception pipeline.

Radar as Primary Sensor: The Strategic Vision

Motional has publicly articulated a vision for radar to become "the central sensing modality" for future AV platforms, potentially reducing dependence on expensive LiDAR:

Factor | Radar Advantage
Cost | Automotive-grade radars cost 5-10x less than LiDAR
Maturity | 70+ years of industrial and defense applications
Durability | Solid-state electronics, no moving parts
Weather | Retains functionality in rain, snow, fog, dust, darkness
Velocity | Direct Doppler measurement provides instantaneous velocity (critical for prediction)
Range Fidelity | Scans maintain fidelity beyond 200 meters

Motional is "studying whether future iterations could utilize radars as more of a central sensing modality, without sacrificing performance" -- a strategy that could "enable a faster pathway to profit" by dramatically reducing per-vehicle sensor costs.

Radar-LiDAR-Camera Fusion

The current radar fusion pipeline:

  1. Radar point cloud generated from low-level data via ML
  2. Neural network fuses radar and LiDAR point clouds, preserving more data than separate processing
  3. PointPainting projects the fused point cloud onto camera imagery
  4. Detection head produces 3D bounding boxes with velocity estimates

Motional is collaborating with Aptiv to improve AI/ML radar classification capabilities for distinguishing vehicles, pedestrians, and cyclists using radar data alone.


6. 3D Object Detection

Detection Architectures Used and Developed by Motional/nuTonomy

PointPillars (CVPR 2019) -- Developed at nuTonomy

Authors: Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, Oscar Beijbom

PointPillars is a LiDAR-only 3D detector that organizes point clouds into vertical pillars and processes them with 2D convolutions. It was the first real-time 3D detector that could run at automotive-grade speeds.

Feature | Detail
Encoding | Points organized into vertical columns (pillars); PointNet encodes each pillar
Processing | Entirely 2D convolutional after pillar encoding
Speed | 62 Hz standard; 105 Hz fast variant
Output | 3D bounding boxes with class, position, size, orientation
Significance | First real-time LiDAR detector; used as backbone in nuPlan auto-labeling

PointPainting (CVPR 2020) -- Developed at Motional

Authors: Sourabh Vora, Alex H. Lang, Bassam Helou, Oscar Beijbom

PointPainting is a sequential camera-LiDAR fusion method:

  1. Run image semantic segmentation on camera images
  2. Project LiDAR points into segmented images
  3. Append per-class scores to each point's feature vector
  4. Feed "painted" point cloud to any LiDAR detector

Results showed large improvements when applied to PointRCNN, VoxelNet, and PointPillars on both KITTI and nuScenes. The "painted" PointRCNN achieved state-of-the-art on KITTI bird's-eye view detection.

CenterPoint (CVPR 2021) -- Key Researchers Later at Motional

Authors: Tianwei Yin, Xingyi Zhou, Philipp Krahenbuhl (UT Austin)

CenterPoint became a dominant detection architecture on the nuScenes benchmark and is widely used in the AV industry. While developed at UT Austin rather than directly at Motional, it became foundational to the nuScenes ecosystem and was used in nuPlan auto-labeling alongside PointPillars.

Feature | Detail
Approach | Center-based detection: detect object centers first, then regress attributes
Backbone | VoxelNet or PointPillars for point cloud feature extraction
BEV Flattening | 3D features flattened to BEV; keypoint detector applied to BEV map
Detection Heads | Center heatmap head (K channels for K classes); regression heads for 3D size, orientation, velocity
Two-Stage Refinement | Second stage refines estimates using additional point features on detected objects
Tracking | Simplifies 3D MOT to greedy closest-point matching on detected centers
Performance | 65.5 NDS, 63.8 AMOTA on nuScenes (single model); 1st place among LiDAR-only submissions on Waymo Open Dataset
Impact | 3 of the top 4 entries in the NeurIPS 2020 nuScenes 3D Detection Challenge used CenterPoint

TransFusion (CVPR 2022)

Authors: Xuyang Bai, Zeyu Hu, Xinge Zhu, et al.

TransFusion addresses the brittleness of hard LiDAR-camera association (via calibration matrices) with a transformer-based soft-association mechanism:

Feature | Detail
Architecture | Two-layer transformer decoder
Layer 1 | Generates initial bounding boxes from LiDAR using sparse object queries
Layer 2 | Adaptively fuses object queries with image features via attention
Key Innovation | Attention determines where and what information to extract from images
Robustness | Resilient to degraded image quality and calibration errors
Query Initialization | Image-guided strategy for objects difficult to detect in point clouds
Performance | 1st place on nuScenes tracking leaderboard

BEVFusion (ICRA 2023, MIT Han Lab)

Authors: Zhijian Liu, Haotian Tang, et al.

BEVFusion unifies multi-modal features in a shared BEV representation:

Feature | Detail
Approach | Modality-specific encoders; unified BEV feature space
Key Bottleneck | Identified and resolved camera-to-BEV transformation as the latency bottleneck
Optimization | BEV pooling optimization reduces view transformation latency by 40x
Multi-Task | Supports 3D detection and BEV segmentation with the same architecture
Improvement over CenterPoint | +3.0-7.1% mAP with LiDAR-camera fusion
Improvement over PointPillars | +18.4% mAP
BEV Segmentation | +13.6% mIoU vs. prior methods
Computation | 1.9x lower cost than comparable methods

MVFuseNet (CVPR 2021 Workshop) -- Developed at Motional

Authors: Ankit Laddha, Shivam Gautam, et al. (Motional)

MVFuseNet is Motional's internally developed multi-view temporal fusion network:

Feature | Detail
Innovation | First to use both Range View (RV) and BEV for LiDAR feature learning
Temporal Fusion | Sequential aggregation of sweeps by projecting between consecutive sweeps
Multi-Scale | Multi-view features at multiple spatial scales in the backbone
Tasks | Joint object detection and motion forecasting (end-to-end)
Performance | State-of-the-art on large-scale self-driving datasets
Efficiency | Scales to large operating ranges while maintaining real-time performance

Detection Classes

The 10 classes used in the nuScenes detection challenge (merged from the full 23):

Class | Description
car | Passenger vehicles
truck | Cargo vehicles
bus | Public transit buses (bendy + rigid merged)
trailer | Towed cargo units
construction_vehicle | Bulldozers, excavators, etc.
pedestrian | Adults, children, construction workers, police (merged)
motorcycle | Two-wheeled motorized vehicles
bicycle | Human-powered two-wheeled vehicles
barrier | Road barriers, Jersey barriers
traffic_cone | Orange/safety cones

The full 23 nuScenes annotation classes provide finer granularity:

Category | Subclasses
vehicle | car, truck, bus.bendy, bus.rigid, construction, trailer, motorcycle, bicycle, emergency.ambulance, emergency.police
human.pedestrian | adult, child, construction_worker, police_officer, personal_mobility, stroller, wheelchair
movable_object | barrier, debris, pushable_pullable, trafficcone
static_object | bicycle_rack
animal | (single class)

7. Object Tracking

Unified End-to-End Tracking Model

Motional has developed a unified end-to-end tracking model that consolidates traditionally separate tracking components into a single inference pass. The tracking module operates between detection and prediction in the pipeline and handles three fundamental tasks:

  1. Data Association: Linking detections across individual time frames to form coherent object trajectories
  2. Motion Estimation: Providing position, velocity, acceleration, and other kinematic estimates for each tracked object
  3. Information Fusion: Combining detection data, segmentation masks, and fine-grained object attributes from upstream perception

Architecture

Rather than using multiple individual models for each tracking subtask, Motional designed a single unified model that performs all tracking components in one inference pass:

Detections (per frame) ──> Unified Tracking Model ──> Tracked Objects (with trajectories)

                            ┌───────┼───────┐
                            │       │       │
                      Data Assoc.  Motion  Info
                                   Est.    Fusion
                            │       │       │
                            └───────┼───────┘

                            Feature Sharing
                            (common features)

Key advantages of the unified approach:

  • Feature sharing: Context features learned for data association can be reused for motion estimation, making the model easier to train and more parameter-efficient
  • Single inference: Only one forward pass needed at runtime, reducing latency
  • Polygon support: Through the data-driven approach, the model processes irregular polygon-shaped objects (not just bounding boxes), better representing objects like construction barriers or oddly-shaped vehicles

CenterPoint Tracking

CenterPoint simplified 3D multi-object tracking to greedy closest-point matching: detected object centers in the current frame are matched to tracked object centers from the previous frame using closest-point association. This simple approach achieved 63.8 AMOTA on nuScenes, demonstrating that strong detection reduces the complexity of the tracking problem.
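
A minimal sketch of greedy closest-point association in that spirit; the distance gate, velocity-based prediction, and dictionary format are illustrative simplifications rather than CenterPoint's exact tracker:

import numpy as np

def greedy_track(prev_tracks, detections, dt=0.1, max_dist=2.0):
    """Associate current detections to existing tracks by closest predicted center.

    prev_tracks: list of dicts with 'id', 'center' (x, y), 'velocity' (vx, vy).
    detections:  list of dicts with 'center' (x, y); ids are written in place.
    """
    next_id = max((t['id'] for t in prev_tracks), default=-1) + 1
    # Predict where each existing track's center should be in the current frame.
    predicted = [np.asarray(t['center']) + dt * np.asarray(t['velocity'])
                 for t in prev_tracks]
    used = set()
    for det in detections:
        c = np.asarray(det['center'])
        dists = [np.linalg.norm(c - p) if i not in used else np.inf
                 for i, p in enumerate(predicted)]
        if dists and min(dists) < max_dist:
            i = int(np.argmin(dists))          # greedy: take the closest center
            det['id'] = prev_tracks[i]['id']
            used.add(i)
        else:
            det['id'] = next_id                # unmatched detection starts a new track
            next_id += 1
    return detections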

Tracking Metrics (nuScenes)

Metric | Description
AMOTA (primary) | Average Multi-Object Tracking Accuracy; averages MOTA across 40 recall thresholds
AMOTP | Average Multi-Object Tracking Precision; averages MOTP across recall thresholds
IDS | Identity Switches -- how often a track is associated with the wrong detection
FP | False Positives -- tracks reported where no real object exists
FN | False Negatives -- real objects not tracked

AMOTA and AMOTP are computed using 40-point interpolation over the MOTA/MOTP curves, excluding points with recall < 0.1 to avoid noise.
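
For reference, the recall-averaged formulation used by the nuScenes tracking benchmark is commonly written as follows, where MOTAR is the recall-normalized MOTA variant, P is the number of ground-truth positives, and n = 40 recall thresholds (a close paraphrase of the benchmark definition, not a verbatim quotation):

AMOTA = \frac{1}{n-1} \sum_{r \in \{\frac{1}{n-1}, \frac{2}{n-1}, \ldots, 1\}} \mathrm{MOTAR}(r), \qquad
\mathrm{MOTAR}(r) = \max\!\left(0,\; 1 - \frac{\mathrm{IDS}_r + \mathrm{FP}_r + \mathrm{FN}_r - (1 - r)\,P}{r\,P}\right)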


8. Prediction

Behavior Prediction Architecture

Motional uses a Graph Attention Network (GAT), running on the vehicle's onboard compute, for trajectory prediction. The system models the prediction problem as a graph where:

  • Agent nodes (blue in Motional's visualizations): Represent vehicles, pedestrians, cyclists, and other dynamic agents
  • Map element nodes (orange): Represent lane segments, crosswalks, traffic signals, and other static infrastructure

The graph attention mechanism learns attention weights from data to understand how agents interact with each other and with the road geometry.

Input Features

For each agent, the prediction model ingests:

  • Position (x, y in world coordinates)
  • Velocity (magnitude and direction)
  • Acceleration (magnitude and direction)
  • Road geometry (lane boundaries, curvature, connectivity)
  • Historical trajectory (past positions over multiple timesteps)

Multi-Modal Trajectory Prediction

Rather than predicting a single future trajectory, the system generates multiple trajectories with associated probabilities. This is critical because:

  • A vehicle at an intersection might go straight, turn left, or turn right
  • A uni-modal prediction model would predict the average of multiple modes, producing an unrealistic trajectory (e.g., predicting a car will drive into the median, which is the average of "go straight" and "turn left")
  • Each predicted trajectory waypoint is represented with a 2D Gaussian distribution (mean center position + covariance matrix), providing uncertainty estimates
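
To make the Gaussian-waypoint idea concrete, one common way to train such a multi-modal head is a winner-takes-all negative log-likelihood over the K predicted modes; this is a generic sketch of that loss, not Motional's specific formulation:

import numpy as np

def multimodal_nll(pred_means, pred_covs, mode_probs, gt_traj):
    """Winner-takes-all NLL for K candidate trajectories with Gaussian waypoints.

    pred_means: (K, T, 2) predicted waypoint means for K modes over T timesteps.
    pred_covs:  (K, T, 2, 2) per-waypoint covariance matrices.
    mode_probs: (K,) predicted probability of each mode.
    gt_traj:    (T, 2) observed future trajectory.
    """
    K, T, _ = pred_means.shape
    nll = np.zeros(K)
    for k in range(K):
        for t in range(T):
            d = gt_traj[t] - pred_means[k, t]
            cov = pred_covs[k, t]
            # 2D Gaussian negative log-likelihood of the true waypoint.
            nll[k] += 0.5 * (d @ np.linalg.inv(cov) @ d
                             + np.log(np.linalg.det(cov))
                             + 2.0 * np.log(2.0 * np.pi))
    # Regress only the closest mode (winner takes all) so distinct maneuvers
    # (go straight / turn left / turn right) stay distinct instead of averaging.
    best = int(np.argmin(nll))
    return nll[best] - np.log(mode_probs[best] + 1e-12)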

Integration with Planning

The prediction system runs thousands of times per minute and directly informs the planning module. The multi-modal predictions with confidence scores allow the planner to:

  • Plan around the most likely trajectories of other agents
  • Maintain contingency plans for less likely but dangerous scenarios
  • Adjust confidence dynamically as new observations arrive

Training Data

The prediction model is trained on thousands of hours of auto-labeled data from Motional's fleet operations. The Continuous Learning Framework continuously identifies high-error prediction scenarios for targeted retraining, enabling "the prediction model to continuously improve with every mile."


9. Large Driving Models (LDM) Pivot

The 2024 Strategic Decision

In 2024, Motional made a fundamental decision to redesign its autonomous driving system architecture around AI, transitioning from a traditional modular stack to Large Driving Models (LDMs). This decision was described as "an important turning point in autonomous driving technology development" and coincided with the company's major restructuring.

LDM Architecture for Perception

LDMs are described by Motional as "embodied foundation models" -- not generic language models, but purpose-built models that understand the physical world through sensor data.

MODULAR STACK (Pre-2024):
  Sensors --> Perception --> Prediction --> Planning --> Control
                Module         Module        Module      Module
              (separate ML)  (separate ML) (separate ML)

LDM ARCHITECTURE (2024-present):
  Sensors ──────────> Large Driving Model ──────────> Control

                   ┌────────┼────────┐
                   │        │        │
              Perception Prediction Planning
              (shared transformer backbone)
                   │        │        │
                   └────────┼────────┘

                    Safety Guardrail System (parallel)

How LDM Changes Perception

The LDM architecture transforms perception in several fundamental ways:

  1. Shared Representations: Instead of perception producing a fixed intermediate representation (e.g., a list of detected objects) that prediction and planning consume, the LDM learns shared embeddings that serve all downstream tasks simultaneously. Features useful for detection are also useful for prediction and planning.

  2. Joint Training: Perception, prediction, and planning are co-trained, enabling bidirectional knowledge transfer. Planning requirements can influence what perception learns to focus on, and perception features directly inform prediction without lossy intermediate representations.

  3. PredictNet as Scene Encoder: Motional's PredictNet component learns "spatiotemporal relationships between vehicles, pedestrians, and static map elements, encoding each scene's semantic contexts into a structured, high-dimensional latent representation." These embeddings "retain rich contextual information" and reason about future behaviors to inform planning.

  4. Self-Attention Across Modalities: The transformer backbone encodes relationships between agent-agent, agent-ego, and ego-environment interactions using self-attention. This allows the model to track multi-agent interactions over time.

  5. Encoder-Generator-Ranker Pipeline: The architecture follows:

    • PredictNet generates transformer-based embeddings capturing agent interactions
    • An optimization-based trajectory generator creates motion plan candidates
    • An ML ranker evaluates trajectories using multi-objective loss balancing safety, comfort, and human-likeness

Training Methodology

LDM training incorporates:

  • Supervised learning on expert driving demonstrations
  • Unsupervised learning for representation learning from raw sensor data
  • Reinforcement learning in simulated environments for closed-loop performance
  • Closed-loop training with distributed infrastructure to address distribution shift

Training data comes from "extremely diverse" datasets collected across Las Vegas, Pittsburgh, Boston, Los Angeles, and Singapore.

Safety Guardrail System

A critical complement to the LDM is a parallel safety guardrail system:

Aspect | Detail
Coverage | Handles ~1% edge cases (unexpected events)
Approach | Rule-based, deterministic, validated over an extended period
Function | Prevents LDM from making erroneous decisions in unusual scenarios
Independence | Runs in parallel with, not inside, the LDM
Validation | Has been validated extensively on real-world data

For ~90% of general driving situations, the E2E LDM handles all decisions. The guardrail system provides deterministic safety guarantees for edge cases where the learned model may have insufficient training data -- a design philosophy influenced by nuTonomy's formal methods heritage.

Key LDM Design Principles

Motional identifies four requirements for their LDMs:

  1. Achieve driverless safety benchmarks
  2. Reduce costs for rapid geographic scaling (a single model architecture that works across cities)
  3. Enable efficient training across vast datasets
  4. Provide sufficient introspection to understand and solve long-tail issues (unlike black-box E2E models)

The emphasis on introspection is notable: Motional explicitly states their LDMs provide "enough introspection to really understand what's happening so that we can more easily improve the system and solve long tail issues." This suggests the LDM maintains some internal structure (not a fully opaque end-to-end model) that allows engineers to diagnose failure modes.


10. nuScenes Dataset Architecture

Overview

nuScenes (nuTonomy Scenes) is the first large-scale, multimodal dataset to provide data from the full autonomous vehicle sensor suite, released by Motional (then nuTonomy) and published at CVPR 2020. It has become the de facto benchmark for 3D perception in autonomous driving.

Attribute | Value
Paper | "nuScenes: A multimodal dataset for autonomous driving" (CVPR 2020)
Authors | Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, Oscar Beijbom
Downloads | 12,000+
Citing Publications | 600+ (as of initial reports; now substantially higher -- 8,000+ researchers have used the dataset)
Impact | Pioneered AV data sharing movement; 10+ new public datasets released industry-wide in the following 18 months

Sensor Configuration

Data was collected using two Renault Zoe supermini electric cars with identical sensor layouts in Boston and Singapore:

Sensor | Model | Count | Frequency | Specifications
LiDAR | Velodyne HDL-32E | 1 | 20 Hz | 32 beams; ~1,080 pts/ring; 70m range; +/- 2cm accuracy
Cameras | Basler acA1600-60gc | 6 | 12 Hz | 1600x900 ROI; 360-degree surround view
Radar | Continental ARS 408-21 | 5 | 13 Hz | 77 GHz; up to 250m; FMCW; measures distance and velocity
GPS/IMU | Advanced Navigation Spatial | 1 | -- | 20mm position accuracy

Dataset Scale

Component | Count
Scenes | 1,000 (each 20 seconds long)
Keyframes | 40,000 (annotated at 2 Hz)
Camera Images | ~1.4 million
LiDAR Sweeps | ~390,000
Radar Sweeps | ~1.4 million
3D Bounding Boxes | ~1.4 million (across 40,000 keyframes)
Data Split | 700 train, 150 val, 150 test scenes

Annotation Architecture

Each object in every keyframe is annotated with:

Annotation | Detail
Semantic Category | One of 23 object classes
Attributes | Visibility level, activity state, pose
Instance Identifier | Unique ID linking the same object across frames
3D Bounding Box | x, y, z (center), width, length, height, yaw angle
Velocity | 2D velocity vector derived from consecutive annotations

Coordinate System

nuScenes uses three coordinate frames:

Frame | Description | Use
Global | Fixed world coordinate frame | All annotations are stored in global coordinates
Ego Vehicle | Defined at the midpoint of the rear axle | Extrinsic sensor calibrations are relative to the ego frame
Sensor | Each sensor's local coordinate frame | Raw sensor data (e.g., radar points) are in sensor coordinates

Transformations between frames use 4x4 rigid-body transformation matrices:

  • Sensor-to-Ego: calibrated_sensor table provides rotation (quaternion) and translation
  • Ego-to-Global: ego_pose table provides rotation (quaternion: w, x, y, z) and translation (meters: x, y, z)
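
A small sketch of composing these transforms to move a point from a sensor frame into global coordinates using the quaternion + translation convention above; the calibration values are placeholders for illustration, and the nuScenes devkit provides equivalent helpers:

import numpy as np

def quat_to_rot(w, x, y, z):
    """Convert a unit quaternion (w, x, y, z) to a 3x3 rotation matrix."""
    return np.array([
        [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
        [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
        [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
    ])

def make_transform(rotation_quat, translation):
    """Build a 4x4 rigid-body transform from a quaternion and a translation."""
    T = np.eye(4)
    T[:3, :3] = quat_to_rot(*rotation_quat)
    T[:3, 3] = translation
    return T

# Placeholder calibration values purely for illustration.
cs_rotation, cs_translation = (1.0, 0.0, 0.0, 0.0), (1.5, 0.0, 1.8)       # calibrated_sensor
ep_rotation, ep_translation = (1.0, 0.0, 0.0, 0.0), (400.0, 1100.0, 0.0)  # ego_pose

# Sensor frame -> ego frame -> global frame.
T_ego_from_sensor = make_transform(cs_rotation, cs_translation)
T_global_from_ego = make_transform(ep_rotation, ep_translation)
point_sensor = np.array([10.0, 2.0, 0.5])   # e.g., a LiDAR return in the sensor frame
point_global = (T_global_from_ego @ T_ego_from_sensor @ np.append(point_sensor, 1.0))[:3]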

The 23 Object Classes (Full Taxonomy)

nuScenes Class (Category.Subclass) | Detection Class (Merged)
vehicle.car | car
vehicle.truck | truck
vehicle.bus.bendy | bus
vehicle.bus.rigid | bus
vehicle.trailer | trailer
vehicle.construction | construction_vehicle
vehicle.motorcycle | motorcycle
vehicle.bicycle | bicycle
vehicle.emergency.ambulance | (excluded from detection)
vehicle.emergency.police | (excluded from detection)
human.pedestrian.adult | pedestrian
human.pedestrian.child | pedestrian
human.pedestrian.construction_worker | pedestrian
human.pedestrian.police_officer | pedestrian
human.pedestrian.personal_mobility | (excluded from detection)
human.pedestrian.stroller | (excluded from detection)
human.pedestrian.wheelchair | (excluded from detection)
movable_object.barrier | barrier
movable_object.debris | (excluded from detection)
movable_object.pushable_pullable | (excluded from detection)
movable_object.trafficcone | traffic_cone
static_object.bicycle_rack | (excluded from detection)
animal | (excluded from detection)

How nuScenes Reflects Motional's Perception Architecture

The nuScenes sensor configuration (1 LiDAR + 6 cameras + 5 radars) represents a scaled-down version of Motional's production architecture (5+ LiDARs + 13 cameras + 11 radars). The design principles are identical:

  • Multi-modal coverage: Every point in space is observed by multiple sensor types
  • Complementary modalities: LiDAR for geometry, cameras for semantics, radar for velocity
  • 360-degree surround: No azimuthal blind spots
  • Temporal annotations: Instance tracking across keyframes enables temporal perception research

The class taxonomy directly reflects Motional's operational priorities on public roads: vehicles of all types, vulnerable road users (pedestrians, cyclists), and road infrastructure (barriers, traffic cones).


11. nuPlan Planning Dataset

Overview

nuPlan is the world's first closed-loop ML-based planning benchmark for autonomous vehicles, developed and open-sourced by Motional.

Attribute | Detail
Paper | "nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles"
Scale | 1,500 hours of human driving data
Cities | Boston, Pittsburgh, Las Vegas, Singapore
Data Types | Auto-labeled object tracks, traffic light data, camera images, LiDAR point clouds, localization, steering inputs
Total Data Volume | 200+ TB (full dataset)
Sensor Data Release | 120 hours of raw sensor data (~16 TB) -- 10% of the full dataset
Images | ~500 million (full dataset)
LiDAR Scans | ~100 million (full dataset)
Sensors | 8 cameras, 5 LiDARs
Availability | Free for academic use; commercial licensing; hosted on AWS Open Data Registry

How Perception Feeds Planning

nuPlan bridges perception and planning by providing perception outputs as inputs to the planning benchmark:

  1. Auto-Labeling: Object tracks in nuPlan are generated using Motional's offline perception system, employing state-of-the-art detectors including PointPillars and CenterPoint
  2. Noise Injection: To capture the realistic uncertainty of online perception, uniform noise is injected into the auto-labeled detections, with variance calibrated by comparing offline and online perception outputs
  3. Scenario Mining: Attributes for scenario mining (vehicle speed, lane occupancy, agent proximity, etc.) are inferred from offline perception tracks and traffic light states
  4. Closed-Loop Evaluation: Unlike open-loop benchmarks that compare planned trajectories to recorded expert trajectories (using L2 distance, which is "not suitable for fairly evaluating long-term planning"), nuPlan provides closed-loop simulation where simulated agents react to the ego vehicle's planned trajectory

Three Core Components

  1. Large-Scale Driving Dataset: 1,500 hours of real-world driving across 4 cities
  2. Lightweight Closed-Loop Simulator: Reactive simulation environment
  3. Planning-Specific Metrics: Traffic rule compliance, vehicle dynamics, goal achievement, passenger comfort (e.g., acceleration in turns)

12. Semantic Segmentation

nuScenes-lidarseg

Released in July 2020, nuScenes-lidarseg provides per-point semantic annotations for every LiDAR point in the nuScenes keyframes.

Attribute | Value
Annotated Points | 1.4 billion
Point Clouds | 40,000 (all keyframes from 1,000 scenes)
Classes | 32 (23 foreground "things" + 9 background "stuff")
Challenge Classes | 16 (merged/filtered for benchmark evaluation)
Annotation | Each LiDAR point assigned exactly one semantic label
Split | 850 scenes (train/val), 150 scenes (test)

The 32 Semantic Classes

Foreground Classes (Things) -- 23 classes: The same 23 object classes used for bounding box annotation (vehicles, pedestrians, movable objects, etc.)

Background Classes (Stuff) -- 9 classes:

Class | Description
flat.driveable_surface | All paved or unpaved surfaces a car can drive on
flat.sidewalk | Sidewalks, pedestrian walkways, bike paths
flat.terrain | Natural horizontal surfaces: ground-level vegetation, grass, hills, soil, sand, gravel
flat.other | Other flat surfaces
static.manmade | Ground-level structures not in other categories (walls, fences, buildings)
static.vegetation | Trees, bushes, hedges (non-ground-level vegetation)
static.other | Other static objects
vehicle.ego | The ego vehicle itself
noise | Points that are noise/artifacts

16 Challenge Classes

For the official lidar segmentation challenge, similar classes are merged and rare classes are removed, resulting in 16 evaluation classes.

How Per-Point Segmentation Works

Per-point semantic segmentation assigns a class label to every individual point in the LiDAR point cloud. This is fundamentally different from 3D bounding box detection:

  • Bounding boxes provide coarse object localization (a box around a car)
  • Per-point segmentation provides fine-grained scene understanding (which points belong to the road, which to the sidewalk, which to vegetation)

This enables understanding of free space (drivable area), road boundaries, and scene layout -- critical for planning and localization.

Evaluation Metrics for Segmentation

Metric | Description
mIoU | Mean Intersection-over-Union; primary metric; averaged across all classes
fwIoU | Frequency-weighted IoU; weights each class by its frequency in the dataset
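
A compact sketch of how per-point mIoU is computed from predicted and ground-truth labels; the class count and the ignored noise index are assumptions chosen for illustration:

import numpy as np

def mean_iou(pred, gt, num_classes=16, ignore_index=0):
    """Mean IoU over classes for per-point semantic labels.

    pred, gt: (N,) integer class labels per LiDAR point.
    Points whose ground truth equals `ignore_index` (e.g., noise) are excluded.
    """
    keep = gt != ignore_index
    pred, gt = pred[keep], gt[keep]
    ious = []
    for c in range(num_classes):
        if c == ignore_index:
            continue
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:                       # skip classes absent from both
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0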

Panoptic nuScenes

Motional extended nuScenes further with Panoptic nuScenes (2021), providing:

  • Panoptic segmentation: Combined semantic and instance segmentation (each point gets both a class label and an instance ID)
  • Panoptic tracking: Instance tracking across frames in the panoptic segmentation
  • Scale: 1,000 scenes with over 1.1 billion annotated points
  • Authors: Whye Kit Fong, Rohit Mohan, Juana Valeria Hurtado, Lubing Zhou, Holger Caesar, Oscar Beijbom, Abhinav Valada

13. Occupancy Prediction

nuScenes-Based Occupancy Benchmarks

While Motional has not published a dedicated occupancy prediction method, the nuScenes dataset has become the foundation for the field's primary occupancy benchmarks:

Occ3D-nuScenes

Attribute | Detail
Source | Tsinghua-MARS Lab (built on nuScenes)
Task | Dense 3D semantic occupancy prediction
Annotation | Dense voxel annotations with robust occlusion reasoning
Sensors Used | Synchronized 6 cameras, LiDAR, radars from nuScenes
Evaluation | Per-class IoU, mIoU, RayIoU

SurroundOcc

SurroundOcc produces dense occupancy labels using spatial attention to reproject 2D camera features back into 3D voxel space. It has been evaluated on nuScenes with missing-view protocols.

Relationship to BEV Perception

Semantic occupancy prediction extends BEV perception from 2D overhead maps to dense 3D voxel grids. Two mainstream approaches (BEVDet and BEVFormer) have been adapted for occupancy by replacing detection decoders with occupancy decoders while retaining their BEV feature encoders.

Relevance to Motional

Occupancy prediction is relevant to Motional's architecture because:

  • It provides a sensor-agnostic scene representation (every voxel is classified as occupied/free/unknown)
  • It can detect arbitrarily-shaped obstacles that bounding boxes poorly represent (e.g., fallen trees, construction debris, unusual structures)
  • Motional's patent portfolio includes work on dynamic occupancy grids (DOGs) generated from LiDAR data combined with semantic maps

14. BEV Representation

What Is BEV and Why Motional Uses It

Bird's-Eye View (BEV) representation is a top-down 2D feature map where each pixel corresponds to a location in the ground plane around the vehicle. Motional uses BEV as the central representation for multi-sensor fusion because:

  1. LiDAR is naturally BEV: After voxelization and height flattening, LiDAR features are inherently in BEV
  2. Camera-to-BEV is the key challenge: Motional's Surround-View Image Networks perform the view transformation from perspective camera images to BEV
  3. Radar projects easily to BEV: Radar returns (range, azimuth, velocity) map directly to BEV
  4. Fusion is straightforward: With all modalities in BEV, fusion becomes element-wise operations on aligned feature maps
  5. Planning operates in BEV: Motion planning is naturally performed in the ground plane

Motional's BEV Pipeline

Cameras (13) ──> Vision Transformer Backbone ──> View Transformation ──> Camera BEV Features ──┐

LiDARs (5+) ──> Voxelization/Pillarization ──> 3D Backbone ──> Height Flatten ──> LiDAR BEV ──>├──> Fused BEV ──> Detection/Tracking/Prediction

Radars (11) ──> ML Point Cloud Generation ──> Radar Feature Extraction ──> Radar BEV Features ──┘

Coordinate Frames in BEV

Motional's BEV representation uses the ego vehicle coordinate frame (centered at the midpoint of the rear axle):

Axis | Direction | Convention
X | Forward (longitudinal) | Positive ahead of the vehicle
Y | Left (lateral) | Positive to the left
Z | Up (vertical) | Positive upward

The BEV grid typically covers a region such as [-50m, 50m] x [-50m, 50m] around the ego vehicle, with resolution determined by the grid cell size (e.g., 0.25m per cell = 400x400 grid).

Temporal BEV Aggregation

BEV features from multiple timesteps can be aggregated to provide temporal context:

  • Previous BEV frames are ego-motion compensated (warped to current ego frame)
  • Deformable attention or feature concatenation merges temporal features
  • This captures agent motion and provides velocity cues without explicit flow estimation
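
A rough sketch of the ego-motion warp for a single previous BEV feature map, assuming planar motion (dx, dy, dyaw) between frames and a PyTorch affine grid sample; sign conventions depend on the chosen grid layout, and this stands in for the deformable-attention variants mentioned above:

import math
import torch
import torch.nn.functional as F

def warp_prev_bev(prev_bev, dx, dy, dyaw, grid_extent=50.0):
    """Warp a previous BEV feature map into the current ego frame.

    prev_bev:    (1, C, H, W) features from the previous timestep.
    dx, dy:      ego translation in meters between the two frames.
    dyaw:        ego rotation in radians between the two frames.
    grid_extent: the BEV grid spans [-grid_extent, grid_extent] meters per axis.
    """
    c, s = math.cos(dyaw), math.sin(dyaw)
    # 2x3 affine mapping current-frame normalized grid coordinates back into the
    # previous frame; translation is expressed in normalized grid units.
    theta = torch.tensor([[c, -s, dx / grid_extent],
                          [s,  c, dy / grid_extent]], dtype=torch.float32).unsqueeze(0)
    grid = F.affine_grid(theta, list(prev_bev.shape), align_corners=False)
    # Sample previous features at the locations they occupy in the current frame.
    return F.grid_sample(prev_bev, grid, align_corners=False)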

15. Temporal Fusion

Multi-Sweep LiDAR Aggregation

Temporal fusion of LiDAR data is a fundamental technique used throughout Motional's perception pipeline:

Standard Multi-Sweep Aggregation:

  • Typically 10 past LiDAR sweeps (0.5 seconds at 20 Hz) are aggregated
  • Each past sweep is ego-motion compensated using the vehicle's odometry
  • Points are tagged with relative timestamps, allowing the network to learn motion
  • Stationary objects produce dense point clusters; moving objects produce trailing patterns

MVFuseNet Temporal Fusion (Motional):

  • Sequential aggregation projecting data from one sweep to the next in the temporal sequence
  • Operates in both Range View (RV) and BEV for richer spatio-temporal features
  • Enables joint detection and motion forecasting from temporal LiDAR sequences

Video-Based Camera Perception

Temporal fusion for camera perception extends single-frame detection to video understanding:

  • Temporal BEV fusion: BEV features from consecutive camera frames are aligned via ego-motion and aggregated
  • Recurrent architectures: Systems like OnlineBEV use recurrent structures with spatio-temporal deformable attention to align BEV features across frames
  • Motion context: Features from adjacent frames are used to extract motion context, improving velocity estimation and handling of occluded objects

Temporal Fusion in the LDM

In Motional's current LDM architecture, temporal fusion is implicit in the transformer backbone. The self-attention mechanism operates over temporal sequences, allowing the model to:

  • Track objects through occlusions using learned object permanence
  • Estimate velocity and acceleration from sequential observations
  • Build contextual understanding of evolving traffic scenarios

16. Traffic Infrastructure

Traffic Light Detection

Traffic light detection is handled by the camera perception pipeline, as it requires color discrimination that neither LiDAR nor radar can provide:

  • Detection: Camera backbone identifies traffic light regions in images
  • State Classification: The system classifies traffic light state (red, green, yellow, flashing, arrow states)
  • Association: Detected traffic lights are associated with specific lanes and intersections using the HD map
  • Temporal tracking: Traffic light state is tracked across frames to prevent erroneous state changes from single-frame noise

In nuScenes and nuPlan, traffic light data is provided as auto-labeled annotations, and nuPlan specifically includes traffic light states as inputs to the planning benchmark.

Traffic Sign Detection

Traffic sign detection in Motional's stack relies on:

  • Camera-based recognition using deep learning classifiers
  • HD map prior knowledge (sign locations are encoded in the semantic map layer)
  • Cross-validation between detected signs and map expectations

Lane Marking and Road Boundary Detection

Lane markings and road boundaries are detected through:

  • Camera-based semantic segmentation (distinguishing dashed lines, solid lines, road edges)
  • LiDAR-based curb detection (height discontinuities at road edges)
  • HD map matching for validation

17. Pedestrian Detection

VRU Detection: The Core Challenge

Vulnerable Road User (VRU) detection is one of the most safety-critical perception tasks. Motional's perception stack must detect and classify:

  • Adult pedestrians
  • Children
  • Construction workers
  • Police officers
  • People in wheelchairs
  • People pushing strollers
  • Cyclists
  • Scooter riders

Las Vegas Strip: The Ultimate Edge Case Environment

Las Vegas presents uniquely challenging perception scenarios that no other testing environment provides:

Unusual Pedestrian Appearances:

  • Performers in large feathery costumes (wings, elaborate headdresses)
  • People walking on stilts
  • Costumed characters (clowns, showgirls, mascots)
  • Tourists carrying oversized objects (yard-long drinks, large signs)

In a documented edge case from Motional's testing, a clown juggling pins on the sidewalk dropped a pin and stepped into the street to retrieve it directly in front of an IONIQ 5 robotaxi. Despite the bizarre costume and unexpected behavior, the vehicle recognized the risk in advance and stopped safely.

Unusual Vehicle Types:

  • Stretch limousines (much longer than standard cars)
  • Billboard trucks (large flat surfaces, unusual geometry)
  • Trike motorcycles
  • Classic and exotic cars (Rolls Royce, Lamborghini -- unusual shapes)

Environmental Challenges:

  • Bright neon signs and dynamic lighting on The Strip
  • Large crowds with dense pedestrian clusters
  • Jaywalking across wide boulevards
  • Costumed performers who may appear as non-human objects to naive detectors

VRU Detection in Radar

Motional's imaging radar architecture specifically targets VRU detection improvement. The ML-trained radar perception achieves 3x Average Precision (AP) improvement in VRU detection compared to conventional radar processing. This is critical because:

  • Pedestrians have small radar cross-sections
  • Cyclists and scooter riders move at varying speeds
  • Traditional radar processing often cannot distinguish pedestrians from clutter

How the Perception Stack Handles Edge Cases

  1. Multi-modal verification: A pedestrian in an unusual costume may confuse one sensor modality but is unlikely to confuse all three (camera, LiDAR, radar) simultaneously
  2. Continuous Learning Framework: Edge cases encountered in Las Vegas (costumed performers, unusual vehicles) are mined from fleet data and used to retrain perception models
  3. Offline perception with object permanence: Temporarily occluded pedestrians are maintained in the scene model using temporal reasoning
  4. Conservative default behavior: When perception confidence is low, the AV defaults to treating ambiguous objects as vulnerable road users

18. Non-ML Components

Sensor Calibration

Calibration is a critical classical (non-ML) component that ensures all sensors are correctly aligned in the vehicle's coordinate frame:

Intrinsic Calibration:

  • Camera intrinsics (focal length, principal point, distortion coefficients) are calibrated using standard checkerboard procedures
  • LiDAR intrinsics are factory-calibrated by the manufacturer

Extrinsic Calibration:

  • The 6-DOF rigid-body transformation (rotation + translation) between each sensor and the ego vehicle body frame
  • In nuScenes, a laser liner is used to accurately measure the relative location of the LiDAR to the ego frame
  • For the IONIQ 5 with 30+ sensors, extrinsic calibration is performed at Motional's Autonomous Vehicle Integration Center at HMGICS in Singapore
  • Includes camera-to-LiDAR, radar-to-LiDAR, camera-to-ego, and sensor-to-sensor calibrations

Cross-Sensor Temporal Calibration:

  • Sensors operate at different frame rates (cameras at 12 Hz, LiDAR at 20 Hz, radar at 13 Hz in nuScenes)
  • Temporal synchronization is required to align data from non-synchronized sensors
  • Motional holds a patent (DK180393B1) on data fusion for vehicles equipped with non-synchronized perception sensors

Sensor Preprocessing

LiDAR Preprocessing:

  • Range filtering (removing returns beyond maximum reliable range)
  • Ground plane removal (optional; separates ground points from object points)
  • Ego-motion compensation (correcting for vehicle motion during a single LiDAR rotation)
  • Point cloud accumulation (aggregating multiple sweeps with timestamp tagging)
  • Coordinate transformation (sensor frame --> ego frame --> global frame)
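
A minimal sketch of the last three steps (ego-motion compensation, sweep accumulation, and coordinate transformation), assuming the 4x4 transforms are already available from calibration and localization. The data layout and the function name (accumulate_sweeps) are illustrative, in the spirit of nuScenes-style multi-sweep accumulation with a per-point time-lag channel.

  # Minimal sketch of multi-sweep accumulation with ego-motion compensation.
  # Each past sweep is mapped: sensor frame -> ego frame (at capture time) ->
  # global frame -> current ego frame, and each point is tagged with its time lag.
  import numpy as np

  def accumulate_sweeps(sweeps, current_ego_from_global):
      """sweeps: list of dicts with keys
           'points'          : (N, 3) array in the sensor frame
           'ego_from_sensor' : 4x4 extrinsic calibration
           'global_from_ego' : 4x4 ego pose at capture time
           'time_lag'        : seconds before the current frame
         Returns an (M, 4) array of [x, y, z, time_lag] in the current ego frame."""
      out = []
      for s in sweeps:
          pts = np.hstack([s['points'], np.ones((len(s['points']), 1))])
          T = current_ego_from_global @ s['global_from_ego'] @ s['ego_from_sensor']
          pts_now = (T @ pts.T).T[:, :3]
          lag = np.full((len(pts_now), 1), s['time_lag'])
          out.append(np.hstack([pts_now, lag]))
      return np.vstack(out)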

Camera Preprocessing:

  • Lens distortion correction using intrinsic calibration parameters
  • Exposure and white balance normalization
  • Image cropping/resizing to model input resolution
  • Data augmentation during training (random flipping, scaling, rotation, color jitter)
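
As an illustration of the first step, distortion correction can be performed with the intrinsic parameters and an off-the-shelf tool such as OpenCV; the camera matrix, distortion coefficients, and file path below are placeholders, not Motional's calibration values.

  # Minimal sketch of lens distortion correction from intrinsic calibration,
  # using OpenCV purely as an illustrative tool (placeholder values throughout).
  import cv2
  import numpy as np

  # Camera matrix: focal lengths (fx, fy) and principal point (cx, cy).
  K = np.array([[1266.4, 0.0, 816.3],
                [0.0, 1266.4, 491.5],
                [0.0, 0.0, 1.0]])
  # Distortion coefficients in OpenCV order: k1, k2, p1, p2, k3.
  dist = np.array([-0.12, 0.05, 0.001, 0.0005, 0.0])

  img = cv2.imread("frame.jpg")                # raw camera frame (placeholder path)
  undistorted = cv2.undistort(img, K, dist)    # rectified image for the detector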

Radar Preprocessing (Traditional):

  • CFAR (Constant False Alarm Rate) detection for target extraction
  • Clustering of radar returns into object hypotheses
  • Doppler velocity estimation
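
A minimal sketch of cell-averaging CFAR (one common CFAR variant) on a 1D range profile: the noise level is estimated from training cells around the cell under test, guard cells are excluded, and a detection is declared when the cell exceeds the scaled noise estimate. The training/guard cell counts and false-alarm rate are illustrative, and this is not Motional's radar DSP.

  # Minimal sketch of 1D cell-averaging CFAR (CA-CFAR) target extraction.
  import numpy as np

  def ca_cfar(power, num_train=10, num_guard=2, pfa=1e-3):
      """power: 1D array of range-bin powers. Returns indices of detections."""
      n = len(power)
      num_cells = 2 * num_train
      # Threshold scaling factor for CA-CFAR with 2*num_train training cells.
      alpha = num_cells * (pfa ** (-1.0 / num_cells) - 1.0)
      detections = []
      for i in range(num_train + num_guard, n - num_train - num_guard):
          lead = power[i - num_train - num_guard : i - num_guard]
          lag = power[i + num_guard + 1 : i + num_guard + 1 + num_train]
          noise = (lead.sum() + lag.sum()) / num_cells
          if power[i] > alpha * noise:
              detections.append(i)
      return np.array(detections)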

Radar Preprocessing (Motional's ML Approach):

  • Bypasses traditional DSP entirely
  • Raw ADC data streamed to central computer
  • ML pipeline processes raw data end-to-end

Classical Algorithms Still in Use

Component | Algorithm | Purpose
Ego-Motion Estimation | IMU integration, wheel odometry | Dead reckoning between LiDAR/GPS updates
Map Matching | ICP, NDT, or learned matching | Localization against HD map
Ground Segmentation | RANSAC plane fitting or height thresholding | Separating ground from non-ground points
Coordinate Transforms | Rigid-body transformations (4x4 matrices) | Converting between sensor, ego, and global frames
SLAM | Graph SLAM (for map creation) | Building geometric maps from sensor data
AEB | Rule-based emergency braking | Time-to-collision computation for safety-critical braking
Kalman Filtering | Extended/Unscented Kalman Filter | State estimation, sensor fusion (in classical tracking)
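
As a small illustration of the classical estimation machinery in the last row, here is a minimal constant-velocity Kalman filter predict/update cycle; the state layout and noise levels are illustrative, not tuned values.

  # Minimal sketch of a constant-velocity Kalman filter of the kind used in
  # classical tracking and state estimation.
  import numpy as np

  dt = 0.05                                    # 20 Hz update
  F = np.array([[1, 0, dt, 0],                 # state transition for [x, y, vx, vy]
                [0, 1, 0, dt],
                [0, 0, 1, 0],
                [0, 0, 0, 1]], dtype=float)
  H = np.array([[1, 0, 0, 0],                  # we observe position only
                [0, 1, 0, 0]], dtype=float)
  Q = np.eye(4) * 0.1                          # process noise
  R = np.eye(2) * 0.5                          # measurement noise

  def predict(x, P):
      return F @ x, F @ P @ F.T + Q

  def update(x, P, z):
      y = z - H @ x                            # innovation
      S = H @ P @ H.T + R
      K = P @ H.T @ np.linalg.inv(S)           # Kalman gain
      return x + K @ y, (np.eye(4) - K @ H) @ P

  x, P = np.zeros(4), np.eye(4)                # initial state and covariance
  x, P = predict(x, P)
  x, P = update(x, P, np.array([10.2, -3.1]))  # fuse a position measurement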

19. Auto-Labeling

Offline Perception System

Motional's auto-labeling system is built on a cloud-based offline perception pipeline that operates fundamentally differently from the real-time onboard (online) perception:

Aspect | Online Perception | Offline Perception
Environment | Onboard vehicle computer | Cloud-based distributed computing
Latency | Must run in real time (~50-100 ms) | No latency constraints
Compute | Limited to onboard GPUs | Multiple machines and GPUs in parallel
Temporal Access | Causal only (past + present) | Full temporal access (past + present + future)
Processing Time | -- | Weeks --> hours for training data (massive parallelism)

Foresight and Hindsight

The key innovation of offline perception is temporal analysis with both foresight and hindsight:

  • Hindsight: Using past observations to confirm present detections. A distant light at night can be confirmed as an approaching vehicle by checking earlier frames when it was closer.
  • Foresight: Using future frames to validate current detections. A truck's dimensions can be assessed from a better vantage point after the ego vehicle overtakes it.
  • Object Permanence: The system infers that "a pedestrian that has been observed in the past and in the future, must also be there in the present" -- even if the pedestrian is momentarily occluded (e.g., hidden behind a tree) in the current frame.

This produces a "globally consistent estimate of the scene" that approaches human-level annotation accuracy while operating at orders of magnitude greater speed.
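
A minimal sketch of what object permanence can look like in an offline pipeline: a track with confirmed observations before and after a short gap keeps its identity, and the occluded frames are filled by interpolation. The data layout and function name (fill_occlusion_gaps) are assumptions for illustration, not Motional's implementation.

  # Minimal sketch of offline object permanence: a track observed before and
  # after an occlusion gap is kept alive by interpolating the missing frames.
  import numpy as np

  def fill_occlusion_gaps(track, max_gap=10):
      """track: dict mapping frame index -> (x, y) observed center.
         Returns a new dict with short gaps filled by linear interpolation."""
      frames = sorted(track)
      filled = dict(track)
      for f0, f1 in zip(frames[:-1], frames[1:]):
          gap = f1 - f0
          if 1 < gap <= max_gap:                      # occluded for a few frames
              p0, p1 = np.array(track[f0]), np.array(track[f1])
              for k in range(1, gap):
                  t = k / gap
                  filled[f0 + k] = tuple((1 - t) * p0 + t * p1)
      return filled

  # A pedestrian visible at frames 0-4 and 9-12, hidden behind a tree in between:
  obs = {**{f: (2.0 + 0.3 * f, 5.0) for f in range(5)},
         **{f: (2.0 + 0.3 * f, 5.0) for f in range(9, 13)}}
  complete = fill_occlusion_gaps(obs)                 # frames 5-8 are now inferred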

Continuous Learning Framework Integration

Auto-labeling powers Motional's Continuous Learning Framework (CLF) in several ways:

  1. Scenario Mining: By comparing online and offline perception outputs, the system automatically identifies perception failures:

    • Online perception misses a pedestrian behind a tree, but offline perception detects it through object permanence
    • This disagreement is flagged as a scenario for targeted retraining
  2. Error Attribution: Discrepancies between online and offline perception are decomposed into:

    • Detection errors (missed objects, false positives)
    • Tracking errors (ID switches, fragmented tracks)
    • Prediction errors (incorrect trajectory forecasts)
  3. Auto-Labeled Training Data at Scale: Motional can now "annotate any amount of data collected by its fleet with a system that approaches the same level of accuracy as human-labeled data," reducing annotation time from weeks to hours.

  4. nuPlan Dataset: The nuPlan dataset (1,500 hours at announcement, later described as 1,800 hours) is entirely auto-labeled using this offline perception system, representing unprecedented scale for AV ML development.

Omnitag: ML-Powered Multimodal Data Mining

Omnitag is Motional's framework for transforming raw driving data into targeted training data. It operates on three pillars:

Pillar 1: Multimodal Encoding

  • Uses pretrained multimodal foundation models from the open-source community
  • Encodes preprocessed data (image, video, audio, LiDAR, world-state) into high-dimensional embeddings preserving semantic and contextual information
  • Cross-modal disambiguation (e.g., using LiDAR to clarify visual occlusions)

Pillar 2: RAG-Driven Few-Shot Dataset Creation

  • Retrieval-Augmented Generation (RAG) loop surfaces informative positive and negative examples
  • Users interactively curate few-shot datasets with minimal manual effort
  • Both few-shot decoding (lightweight decoders on cached embeddings) and zero-shot decoding (in-context prompting) are supported
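
A minimal sketch of the retrieval step in Pillar 2, under the assumption that fleet data has already been encoded into cached embeddings: a handful of user-picked positives define a query, and cosine similarity surfaces candidates for curation. The embeddings here are random placeholders rather than outputs of any particular foundation model.

  # Minimal sketch of retrieval over cached embeddings for few-shot curation.
  import numpy as np

  rng = np.random.default_rng(0)
  bank = rng.normal(size=(100_000, 512))            # cached fleet-data embeddings
  bank /= np.linalg.norm(bank, axis=1, keepdims=True)

  positives = bank[[17, 523, 9041]]                 # few-shot examples picked by a user
  query = positives.mean(axis=0)
  query /= np.linalg.norm(query)

  scores = bank @ query                             # cosine similarity (unit vectors)
  candidates = np.argsort(-scores)[:200]            # top candidates for human review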

Pillar 3: Encoder-Decoder Adaptation

  • Domain-specific fine-tuning on representative data from target operational domains
  • Continuous feedback loop with incremental model adaptation as rare events are discovered
  • Supports the "teacher-student paradigm" where powerful offline models prepare high-quality datasets for lighter on-car models

Scenario Mining Pipeline

AV Sensor Logs
    |
    v
ML-Powered Offline Perception
    |
    v
Auto Ground-Truth Labels
    |
    v
Attribute Computation (hundreds of searchable attributes)
    |--> AV State Attributes (speed, lane)
    |--> Agent-Based Attributes (type, distance, speed)
    |--> Error Attributes (detection, prediction)
    |
    v
Searchable Scenario Database
    |
    v
SQL-Based Queries
    ("find all scenarios where online perception missed a pedestrian at > 50m")

20. Key Papers

Papers Authored by Motional/nuTonomy Researchers

Year | Paper | Venue | Authors (Affiliation) | Contribution
2019 | PointPillars: Fast Encoders for Object Detection from Point Clouds | CVPR 2019 | Alex H. Lang, Sourabh Vora, Holger Caesar, Lubing Zhou, Jiong Yang, Oscar Beijbom (nuTonomy) | First real-time LiDAR detector (62 Hz); pillar-based encoding; KITTI SOTA
2020 | nuScenes: A Multimodal Dataset for Autonomous Driving | CVPR 2020 | Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, Oscar Beijbom (Motional) | First full-sensor-suite AV dataset; 23 classes; NDS metric; 12,000+ downloads
2020 | PointPainting: Sequential Fusion for 3D Object Detection | CVPR 2020 | Sourabh Vora, Alex H. Lang, Bassam Helou, Oscar Beijbom (Motional) | Sequential camera-LiDAR fusion; improved all tested LiDAR detectors
2021 | nuPlan: A Closed-Loop ML-Based Planning Benchmark for Autonomous Vehicles | arXiv/ICRA | Holger Caesar et al. (Motional) | World's first ML planning benchmark; 1,500 hours; 4 cities
2021 | MVFuseNet: Improving End-to-End Object Detection and Motion Forecasting through Multi-View Fusion of LiDAR Data | CVPR 2021 Workshop (WAD) | Ankit Laddha, Shivam Gautam et al. (Motional) | First dual RV+BEV LiDAR temporal fusion; joint detection + forecasting
2021 | Panoptic nuScenes: A Large-Scale Benchmark for LiDAR Panoptic Segmentation and Tracking | IEEE RA-L 2021 | Whye Kit Fong, Rohit Mohan, Juana Valeria Hurtado, Lubing Zhou, Holger Caesar, Oscar Beijbom, Abhinav Valada (Motional + U. Freiburg) | 1.1B annotated points; panoptic segmentation + tracking
2023 | Training Strategies for Vision Transformers for Object Detection | CVPR 2023 Workshop (WAD) | (Motional) | 63% inference speedup with 3% performance drop for ViT-based detection
2023 | Offline Tracking with Object Permanence | CVPR/arXiv 2023 | (Related research) | Temporal reasoning for occluded object tracking
2024 | nuScenes Revisited: Progress and Challenges in Autonomous Driving | arXiv 2024 | Whye Kit Fong, Venice Erin Liong, Kok Seang Tan, Holger Caesar | Retrospective on nuScenes impact and future directions

Influential External Papers Built on nuScenes

Year | Paper | Venue | Relevance to Motional
2021 | CenterPoint: Center-based 3D Object Detection and Tracking | CVPR 2021 | State-of-the-art on nuScenes; used in nuPlan auto-labeling
2022 | TransFusion: Robust LiDAR-Camera Fusion for 3D Object Detection with Transformers | CVPR 2022 | 1st place on nuScenes tracking; soft-association fusion approach
2022/23 | BEVFusion: Multi-Task Multi-Sensor Fusion with Unified Bird's-Eye View Representation | NeurIPS 2022 / ICRA 2023 | SOTA on nuScenes detection and BEV segmentation; unified BEV fusion
2023 | Occ3D: A Large-Scale 3D Occupancy Prediction Benchmark for Autonomous Driving | NeurIPS 2023 | Built on nuScenes; occupancy prediction benchmark

nuTonomy Foundational Research

Topic | Contribution
Sampling-Based Motion Planning | Algorithms (RRT*, PRM) with provable guarantees of completeness, correctness, and optimality
Formal Language Specifications | Translating driving rules into formal logic that can be verified
Model Checking | Automated verification of control software against formal specifications
Formal Collision Risk Estimation | Compositional data-driven approaches for probabilistic collision risk
Rules of the Road Compliance | Formal methods for traffic law compliance in dynamic environments

21. Key Patents

Motional AD LLC Patent Portfolio

Motional AD LLC holds patents across perception, planning, and safety. In the LiDAR software sector specifically, Motional holds 5 patent assets in the Software section of Automobile Vision: LIDAR -- the second-highest count, behind only General Motors.

Patent | Title | Technology Area
US20210080558A1 | Extended Object Tracking Using RADAR | Radar-based extended object tracking using RLS-based velocity estimation; originally assigned to Aptiv, reassigned to Motional AD LLC (2020)
US20190004159A1 | LiDAR Sensor Alignment System | Imaging device + LiDAR alignment; object classification for orientation confirmation; assigned Aptiv --> Motional (2020)
US10598791B2 | Object Detection Based on LiDAR Intensity | Determining object characteristics from LiDAR intensity values
US10366294B2 | Transparency-Characteristic Based Object Classification | Object classification for automated vehicles using LiDAR; assigned Aptiv --> Motional (2022)
DK180393B1 | Data Fusion System for Non-Synchronized Perception Sensors | Temporal synchronization of multi-sensor data with different timestamps (filed 2018, granted 2021)
US10126136B2 | Route Planning for an Autonomous Vehicle | Route planning algorithms (nuTonomy heritage)
US9645577 | Facilitating Vehicle Driving and Self-Driving | Core autonomous driving facilitation (nuTonomy heritage)
(Unpublished) | Scene-Dependent Object Queries for Bounding Box Generation | Perception system generating bounding boxes using scene-dependent object queries
(Unpublished) | Dynamic Occupancy Grid from LiDAR + Semantic Map | DOG generation with per-cell probability density functions from LiDAR data

Planning and Safety Patents

Patent | Title | Technology Area
EP3593337A4 | Planning for Unknown Objects by an Autonomous Vehicle | Handling unknown/novel objects in motion planning
DE112019005425T5 | Redundancy in Autonomous Vehicles | Sensor and compute redundancy architecture

Patent Portfolio Characteristics

  • Heritage transfer: Multiple patents were originally assigned to Aptiv Technologies Limited and subsequently reassigned to Motional AD LLC upon JV formation
  • Karl Iagnemma: 50+ issued/filed patents across his career (many in the Motional/nuTonomy portfolio)
  • Focus areas: LiDAR processing, radar tracking, sensor fusion, object classification, motion planning, safety systems

22. Perception Metrics

nuScenes Detection Score (NDS)

The nuScenes Detection Score (NDS) is the primary metric for evaluating 3D object detection on the nuScenes benchmark. It was designed by Motional's research team to provide a single number that captures both detection accuracy and the quality of additional object attributes.

NDS Formula

NDS = (1/10) * [5 * mAP + Σ_i (1 - min(1, mTP_i))],  summed over i in {ATE, ASE, AOE, AVE, AAE}

Or equivalently:

NDS = (1/10) * [5*mAP + (1-min(1,mATE)) + (1-min(1,mASE)) + (1-min(1,mAOE)) + (1-min(1,mAVE)) + (1-min(1,mAAE))]

Weighting: mAP receives weight 5; each of the 5 True Positive (TP) metrics receives weight 1; the weights sum to 10, hence the 1/10 normalization.
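
A direct transcription of the formula above into code, with made-up scores for illustration:

  # Computing NDS from mAP and the five mTP errors, per the formula above.
  # The example values are illustrative, not a real model's scores.
  def nds(m_ap, m_ate, m_ase, m_aoe, m_ave, m_aae):
      tp_errors = [m_ate, m_ase, m_aoe, m_ave, m_aae]
      return (5 * m_ap + sum(1 - min(1.0, e) for e in tp_errors)) / 10.0

  print(nds(m_ap=0.60, m_ate=0.30, m_ase=0.25, m_aoe=0.40, m_ave=0.35, m_aae=0.15))
  # 0.1 * (3.0 + 0.70 + 0.75 + 0.60 + 0.65 + 0.85) = 0.655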

Component Metrics

Metric | Full Name | Description | Unit
mAP | Mean Average Precision | Detection accuracy; uses 2D center distance on ground plane (not IoU) as matching criterion | --
mATE | Mean Average Translation Error | 2D Euclidean center distance (on ground plane) | meters
mASE | Mean Average Scale Error | 1 - IoU after aligning centers and orientation | --
mAOE | Mean Average Orientation Error | Smallest yaw angle difference between prediction and ground truth | radians
mAVE | Mean Average Velocity Error | Absolute velocity error (2D) | m/s
mAAE | Mean Average Attribute Error | 1 - accuracy of attribute classification (e.g., parked vs. moving) | --

mAP Matching: Center Distance, Not IoU

A critical design decision: nuScenes uses 2D center distance on the ground plane instead of intersection-over-union (IoU) for matching detections to ground truth. This was deliberate because:

  • At long range, small errors in size estimation can drastically change IoU even when the center is well-localized
  • Center distance is more interpretable (in meters) than IoU (dimensionless)
  • Size and orientation errors are captured separately by mASE and mAOE

The matching thresholds for mAP are: {0.5m, 1.0m, 2.0m, 4.0m} for 2D center distance.
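
Assuming per-class AP has already been computed at each of these thresholds, the aggregation into mAP is a simple double average, sketched below with made-up AP values:

  # Minimal sketch of nuScenes-style mAP aggregation: per-class AP values are
  # averaged over the four center-distance thresholds, then over classes.
  import numpy as np

  thresholds_m = [0.5, 1.0, 2.0, 4.0]
  ap = {                                   # ap[class][threshold] from the evaluator
      "car":        {0.5: 0.78, 1.0: 0.84, 2.0: 0.88, 4.0: 0.90},
      "pedestrian": {0.5: 0.62, 1.0: 0.71, 2.0: 0.76, 4.0: 0.79},
  }
  per_class = {c: np.mean([ap[c][t] for t in thresholds_m]) for c in ap}
  m_ap = float(np.mean(list(per_class.values())))   # mean over classes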

Decomposability

NDS is designed to be decomposable on multiple levels:

  • The overall NDS breaks down into mAP and 5 TP metrics
  • Each TP metric is an average across all 10 detection classes
  • mAP can be examined per-class to understand which object types are hardest to detect
  • This allows detailed analysis of model strengths and weaknesses

nuScenes Tracking Metrics

Metric | Description
AMOTA (primary) | Average Multi-Object Tracking Accuracy; integrates MOTA over 40 recall thresholds (n=40 point interpolation); excludes recall < 0.1
AMOTP | Average Multi-Object Tracking Precision; integrates MOTP over recall thresholds
MOTA | Multi-Object Tracking Accuracy (at a single threshold)
MOTP | Multi-Object Tracking Precision (position error of matched tracks)
IDS | Identity Switches
FP | False Positives
FN | False Negatives

AMOTA is the primary ranking metric for the nuScenes tracking challenge. It remedies limitations of single-threshold MOTA by averaging across a range of recall rates, providing a more robust evaluation.

Segmentation Metrics

Metric | Task | Description
mIoU | Semantic/Panoptic Segmentation | Mean Intersection-over-Union across all classes
fwIoU | Semantic Segmentation | Frequency-weighted IoU (classes weighted by frequency)
PQ | Panoptic Segmentation | Panoptic Quality = SQ * RQ (segmentation quality * recognition quality)
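
For reference, a minimal sketch of how PQ decomposes into SQ and RQ for a single class, with illustrative inputs (matches count as true positives when segment IoU exceeds 0.5):

  # Minimal sketch of Panoptic Quality for one class: PQ = SQ * RQ.
  def panoptic_quality(matched_ious, num_fp, num_fn):
      """matched_ious: IoUs of true-positive segment matches (each > 0.5)."""
      tp = len(matched_ious)
      if tp == 0:
          return 0.0, 0.0, 0.0
      sq = sum(matched_ious) / tp                      # segmentation quality
      rq = tp / (tp + 0.5 * num_fp + 0.5 * num_fn)     # recognition quality
      return sq * rq, sq, rq

  pq, sq, rq = panoptic_quality([0.9, 0.8, 0.75], num_fp=1, num_fn=2)
  # sq ~= 0.817, rq = 3 / 4.5 ~= 0.667, pq ~= 0.544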

Motional Internal Metrics (Inferred)

While Motional has not fully disclosed its internal evaluation metrics, the following are known or inferred:

  • High-performance metric computation framework: A flexible system designed for efficient evaluation at scale (referenced in CLF documentation)
  • Online vs. Offline disagreement rate: Used in scenario mining to identify perception failures
  • VRU detection AP: Specifically tracked for radar perception improvement (3x improvement cited)
  • Safety-relevant detection metrics: False negative rate for safety-critical objects (pedestrians, vehicles in path) is almost certainly tracked with tighter thresholds than general mAP

Sources

Public research notes collected from public sources.