Mobileye Perception Stack: Exhaustive Deep Dive
Last updated: March 2026
Table of Contents
- True Redundancy Architecture
- EyeQ Chip Perception Accelerators
- Camera-Only Perception Subsystem
- ViDAR (Vision as Virtual LiDAR)
- Road Experience Management (REM)
- LiDAR/Radar Subsystem
- Free Space Detection
- Lane Detection
- Traffic Light and Sign Detection
- Pedestrian/VRU Detection
- Object Detection Architecture
- Object Tracking
- Semantic Segmentation
- Occupancy Prediction
- Depth Estimation
- Temporal Processing
- RSS Integration with Perception
- VLSA (Vision Language Safety Agent)
- Model Efficiency
- Calibration
- Key Patents
- Key Publications
- Perception Evolution: EyeQ1 through EyeQ Ultra
1. True Redundancy Architecture
Philosophical Foundation
Mobileye's True Redundancy is a fundamental architectural divergence from the industry-standard sensor fusion paradigm. Rather than fusing multiple sensor modalities into a single unified world model -- the approach used by Waymo, Cruise, and most other AV developers -- Mobileye maintains two fully independent perception subsystems, each of which must independently achieve safety-level perception.
The rationale is rooted in validation complexity. A fused perception system creates a single pipeline whose failure modes are combinatorially complex: sensor interactions, calibration drift, and cross-modal ambiguities compound into a validation burden that Mobileye estimates requires hundreds of millions of hours of real-world driving data. With True Redundancy, each independent channel requires only tens of thousands of hours of validation. The probability of both independent subsystems failing simultaneously on the same hazard is the product of their individual failure rates -- a dramatically smaller number than the single-system failure rate.
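As a back-of-the-envelope illustration of this argument (the failure rates below are arbitrary placeholders, not Mobileye figures), the joint failure probability of two truly independent channels is the product of their individual probabilities:

```python
# Illustrative only: hypothetical per-hour failure probabilities, not Mobileye figures.
p_camera_only = 1e-4   # probability the camera subsystem misses a hazard in a given hour
p_radar_lidar = 1e-4   # probability the radar/LiDAR subsystem misses the same hazard

# If the two channels fail independently, the joint failure probability is the product.
p_both_fail = p_camera_only * p_radar_lidar
print(f"Single channel: {p_camera_only:.0e} / hour")
print(f"Both channels simultaneously: {p_both_fail:.0e} / hour")  # 1e-08: four orders of magnitude rarer
```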
The Two Subsystems
Camera-Only Subsystem
- Processes data from up to 13 cameras (seven 8MP, four to six 2MP depending on platform)
- Must independently detect all road users (vehicles, pedestrians, cyclists, animals, debris), lanes, drivable paths, traffic signs, traffic signals, and road geometry
- Runs entirely on EyeQ silicon using deep learning perception
- Produces a complete, standalone world model including object lists with position, velocity, classification, and predicted trajectories
- The camera subsystem serves as the "backbone" of the production AV
Radar/LiDAR Subsystem
- Processes data from imaging radar (proprietary 4D imaging radar SoC) and, on the Drive platform, LiDAR sensors
- Must also independently detect all critical road users and road geometry with zero camera input
- Builds its own complete world model: object lists, drivable space, hazard map
- Serves as a "diversified safety backup" that provides a significantly higher mean time between failures (MTBF)
Algorithmic Redundancy Within Subsystems
Mobileye implements a further layer of redundancy within each subsystem. Within the camera channel, critical perception functions are developed using multiple independent algorithmic principles:
- Vehicle detection: Performed both via pattern recognition (monocular deep learning) and via triangulation (multi-camera stereo geometry). These are fundamentally different mathematical approaches -- one is appearance-based, the other is geometry-based.
- Depth estimation: Computed both from monocular depth networks and from multi-view stereo/structure-from-motion across temporally adjacent frames.
- Free space: Estimated both from semantic segmentation (per-pixel classification) and from geometric methods (ground plane estimation, structure from motion).
This means the camera subsystem itself has internal redundancy: even within a single sensor modality, Mobileye develops multiple diverse algorithms for each safety-critical function.
Cross-Check Mechanism
The two subsystems do not fuse at the perception level. Instead, each independently filters its world model through Mobileye's RSS (Responsibility-Sensitive Safety) framework. The planning layer receives two independent RSS-filtered world models. The system follows the more conservative response when the two models disagree -- if either channel detects a hazard, the vehicle acts on it.
Mobileye maintains this independence rigorously in development: the company operates two separate developmental AV fleets -- one driving camera-only (no radar or LiDAR installed), and another driving radar/LiDAR-only (cameras physically covered or absent). Each fleet must demonstrate safe autonomous operation independently.
Primary-Guardian-Fallback (PGF) Architecture
Building on True Redundancy, Mobileye's safety architecture uses a layered decision-making model called Primary-Guardian-Fallback:
- Primary: The main self-driving system generates a trajectory. It has full access to the camera subsystem's world model and REM maps.
- Guardian: An independent monitoring layer evaluates the Primary's planned trajectory against RSS safety constraints and perception data from both subsystems. The Guardian can veto any trajectory it deems unsafe.
- Fallback: A separate self-driving system capable of generating its own trajectory. If the Guardian rejects the Primary's plan, the system switches to the Fallback trajectory.
PGF ensures that a failure requires at least two independent subsystems to fail simultaneously. With three different sensor types (camera, radar, LiDAR) covering the same field of view, the probability of undetected failure drops to negligible levels.
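A minimal control-flow sketch of the PGF idea, with hypothetical data structures and a deliberately simplified stand-in for the RSS check (this is not Mobileye's interface, only an illustration of the veto-and-switch structure):

```python
from dataclasses import dataclass

@dataclass
class Plan:
    label: str
    closest_approach_m: dict  # hypothetical: per-channel predicted closest approach to any hazard

def guardian_ok(plan: Plan, d_min: float = 20.0) -> bool:
    """Toy Guardian: veto the plan if *either* independent channel predicts that executing it
    would bring the vehicle closer than the (highly simplified) RSS safe distance d_min."""
    return all(gap >= d_min for gap in plan.closest_approach_m.values())

def choose(primary: Plan, fallback: Plan) -> Plan:
    # If the Guardian rejects the Primary plan, switch to the independently generated Fallback.
    return primary if guardian_ok(primary) else fallback

primary  = Plan("primary",  {"camera": 40.0, "radar_lidar": 12.0})  # radar channel sees a near hazard
fallback = Plan("fallback", {"camera": 60.0, "radar_lidar": 55.0})
print(choose(primary, fallback).label)   # -> fallback
```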
2. EyeQ Chip Perception Accelerators
Heterogeneous Architecture Philosophy
Mobileye's EyeQ chips achieve their extreme power efficiency (5--40W for full autonomous perception) through a heterogeneous computing architecture specifically co-designed with Mobileye's perception software. Rather than using general-purpose GPUs (which waste power on capabilities unnecessary for automotive perception), each EyeQ contains four classes of proprietary accelerators, each optimized for a specific family of algorithms.
The Four Accelerator Classes
XNN -- Deep Learning Accelerator
The XNN is Mobileye's dedicated CNN/deep learning inference engine. It is purpose-built to execute convolutional neural networks, transformer attention computations, and dense matrix operations at maximum throughput per watt.
- Architecture: Systolic-array-like dataflow architecture optimized for matrix multiply-accumulate operations
- Precision: Int8 inference with support for quantization-aware training (QAT) and post-training quantization (PTQ) to maintain accuracy
- EyeQ5: Contains dedicated XNN cores capable of 24 TOPS peak deep learning performance
- EyeQ Ultra: Contains 16 XNN CNN accelerator cores, providing the bulk of the 176 TOPS rating
- Workload: All CNN inference -- object detection, semantic segmentation, depth estimation, lane detection, traffic sign classification. The XNN handles the heaviest computational load in the perception pipeline.
PMA -- Programmable Macro Array (CGRA)
The PMA is a coarse-grained reconfigurable architecture (CGRA) -- a dataflow machine with a unique architecture that delivers outstanding performance for dense, regular computational patterns.
- Architecture: An array of processing elements connected by a configurable interconnect, programmed at the macro-operation level rather than individual instructions
- EyeQ Ultra: Contains 8 PMA cores
- Workload: Dense computer vision algorithms (feature extraction, image preprocessing, filtering), deep learning operations that map well onto dataflow architectures. PMA cores can also run deep neural networks "extremely efficiently," providing low-level support for any dense-resolution sensor (cameras, next-generation LiDARs, and radars)
VMP -- Vector Microcode Processor (SIMD/VLIW)
The VMP is a vector processor combining single-instruction multiple-data (SIMD) parallelism with very-long-instruction-word (VLIW) scheduling for maximum throughput on vectorizable workloads.
- Architecture: SIMD VLIW machine -- multiple operations packed into a single wide instruction word, each operating on vectors of data simultaneously
- Workload: FFT (Fast Fourier Transform) processing required for radar and ultrasonic sensor data; signal processing chains; image filtering operations where the computation pattern is regular but requires precise programmatic control. The VMP is particularly important for the radar/LiDAR subsystem.
MPC -- Multi-Thread Processor Cluster
The MPC is a cluster of barrel-threaded CPU cores designed for high-throughput, latency-tolerant workloads.
- Architecture: Barrel-threaded (hardware multi-threaded) cores that can switch between threads on every cycle, hiding memory latency and keeping execution units busy
- Workload: Control flow-heavy algorithms, tracking, state management, data marshaling between accelerators, and perception tasks that require complex conditional logic rather than pure data parallelism
Workload Distribution Across Accelerators
The perception pipeline is decomposed across accelerators based on computational characteristics:
Raw Camera Frames
|
v
[ISP] -- Image Signal Processing (dedicated hardware block)
|
v
[PMA] -- Preprocessing: image normalization, feature pyramid construction
|
v
[XNN] -- CNN inference: object detection, segmentation, depth estimation
|
v
[MPC] -- Post-processing: NMS, tracking state updates, constellation formation
|
v
[VMP] -- Radar/LiDAR signal processing (FFT, beamforming)
|
v
[CPU] -- Planning integration, RSS computation, trajectory generation

This heterogeneous design means EyeQ achieves its performance targets with far fewer raw TOPS than competing solutions: the EyeQ Ultra's 176 TOPS delivers the equivalent perception capability of systems rated at 500+ TOPS on general-purpose architectures, because every transistor is tuned for its assigned workload.
EyeQ Ultra Silicon Specifications
| Attribute | Specification |
|---|---|
| Process node | 5 nm (TSMC) |
| CPU cores | 12 dual-threaded RISC-V cores (first EyeQ to abandon MIPS ISA) |
| Accelerator cores | 64 total: 16 XNN, 8 PMA (CGRA), additional VMP and MPC cores |
| GPU | Arm GPU, up to 256 GFLOPS |
| VPU | Vision Processing Unit for traditional CV operations |
| ISP | Integrated Image Signal Processor |
| Video encoding | H.264/H.265 encoding cores |
| DL performance | 176 TOPS (int8) |
| Power envelope | < 100 W |
| Target | Single-chip L4 AV-on-chip: processes both camera and radar/LiDAR subsystems, central computing, HD map, and driving policy |
EyeQ6H Silicon Specifications
| Attribute | Specification |
|---|---|
| Process node | 7 nm |
| CPU/threads | 8 cores / 32 threads |
| DL performance | 34 DL TOPS (int8) |
| Power | ~6.25 W (25% more than EyeQ5H for 3x the compute) |
| Built-in blocks | Dedicated ISP, GPU, video encoder |
| Benchmark | > 1,000 FPS on pixel-labeling neural networks (vs. 91 FPS on EyeQ5) |
| Efficiency | ResNet-50 single-stream latency: 0.5 ms (vs. 1.64 ms for Jetson AGX Orin at 275 TOPS) |
EyeQ5 Silicon Specifications
| Attribute | Specification |
|---|---|
| Process node | 7 nm FinFET (TSMC, 15 metal layers) |
| CPU cores | 8 multithreaded |
| Vision processor cores | 18 next-generation |
| Accelerator classes | XNN, PMA, VMP, MPC |
| DL performance | Up to 24 TOPS (peak); 12 TOPS typical; EyeQ5M variant: 4.6 DL TOPS int8 |
| Power | < 5 W (typical operating) |
| Sensors | Fuses up to 20 sensors |
| Co-design partner | STMicroelectronics (design); TSMC (fabrication) |
3. Camera-Only Perception Subsystem
Overview
The camera-only perception subsystem is the heart of Mobileye's autonomy stack -- it is the one subsystem that is present across every product tier from basic L0 ADAS (one camera, one EyeQ4) to full L4 Drive (13 cameras, 4x EyeQ6H). The entire subsystem runs exclusively on camera data processed by EyeQ chips, with no radar or LiDAR input.
Camera Array Configuration
SuperVision (L2++):
- 7 x 8MP cameras: front narrow, front wide, 2x side-front, 2x side-rear, rear
- 4 x 2MP cameras: short-range surround (parking/close proximity)
- Total: 11 cameras, 360-degree coverage
Drive (L4, ID.Buzz AD):
- Up to 13 cameras with overlapping fields of view for stereo triangulation
All cameras are high-dynamic-range CMOS sensors operational in low light, high glare, direct sun, and tunnel transitions.
Perception Pipeline Architecture
The camera perception pipeline runs at approximately 10 Hz (10 full perception cycles per second) and produces a comprehensive scene understanding through multiple specialized neural networks running in parallel on EyeQ accelerators.
Raw Camera Images (11-13 streams)
|
v
+------------------+
| Image Signal | -- Debayering, HDR tone mapping, noise reduction,
| Processing (ISP) | lens distortion correction
+------------------+
|
v
+------------------+
| Feature | -- Multi-scale backbone (shared across tasks)
| Extraction | Produces feature pyramid at multiple resolutions
+------------------+
|
+--------+--------+--------+--------+--------+--------+
| | | | | | |
v v v v v v v
[Object] [Lane] [Free] [Depth] [Seg] [TSR] [TFL]
[Detect] [Det.] [Space] [Est.] [ment] [Sign] [Light]
| | | | | | |
v v v v v v v
+-----------------------------------------------------------+
| Post-Processing & Association |
| -- NMS, multi-camera projection, temporal fusion, |
| track management, 3D lifting |
+-----------------------------------------------------------+
|
v
+-----------------------------------------------------------+
| World Model Construction |
| -- Object list (position, velocity, class, uncertainty) |
| -- Road model (lanes, boundaries, drivable area) |
| -- Infrastructure (signs, signals, construction zones) |
+-----------------------------------------------------------+
|
v
[RSS Safety Layer] --> [Planning/Policy]

Multi-Scale Feature Extraction
Mobileye uses a shared backbone network that produces a feature pyramid at multiple spatial resolutions. This enables:
- High-resolution features (1/4 or 1/8 of input): Capture fine-grained details needed for small/distant object detection, lane marking localization, and sign text recognition
- Mid-resolution features (1/16): Balanced detail and context for standard vehicle/pedestrian detection
- Low-resolution features (1/32 or 1/64): Large receptive field for scene-level understanding, road type classification, and weather/lighting estimation
The shared backbone amortizes compute cost -- feature extraction is performed once, and multiple task-specific heads consume features from the pyramid at appropriate scales.
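A minimal sketch of this shared-backbone-plus-heads pattern (PyTorch-style, with illustrative layer sizes and an arbitrary set of task heads; not Mobileye's actual network):

```python
import torch
import torch.nn as nn

class SharedBackbone(nn.Module):
    """Toy multi-scale backbone: one pass over the image yields a small feature pyramid."""
    def __init__(self, c: int = 32):
        super().__init__()
        self.stage1 = nn.Sequential(nn.Conv2d(3, c, 3, stride=4, padding=1), nn.ReLU())          # 1/4 res
        self.stage2 = nn.Sequential(nn.Conv2d(c, 2 * c, 3, stride=4, padding=1), nn.ReLU())      # 1/16 res
        self.stage3 = nn.Sequential(nn.Conv2d(2 * c, 4 * c, 3, stride=2, padding=1), nn.ReLU())  # 1/32 res

    def forward(self, img):
        p4 = self.stage1(img)
        p16 = self.stage2(p4)
        p32 = self.stage3(p16)
        return {"1/4": p4, "1/16": p16, "1/32": p32}

# Each task head reads the pyramid level appropriate to its scale; feature extraction
# is paid for once and amortized across all heads.
backbone = SharedBackbone()
heads = nn.ModuleDict({
    "lanes":      nn.Conv2d(32, 8, 1),    # fine detail -> 1/4 features
    "detection":  nn.Conv2d(64, 16, 1),   # vehicles/pedestrians -> 1/16 features
    "scene_type": nn.Conv2d(128, 4, 1),   # coarse context -> 1/32 features
})

pyramid = backbone(torch.randn(1, 3, 256, 512))
outputs = {
    "lanes": heads["lanes"](pyramid["1/4"]),
    "detection": heads["detection"](pyramid["1/16"]),
    "scene_type": heads["scene_type"](pyramid["1/32"]),
}
print({k: tuple(v.shape) for k, v in outputs.items()})
```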
2D Detection to 3D Lifting
Camera perception inherently operates in 2D image space. Mobileye employs multiple techniques to lift detections into 3D world coordinates:
- Monocular 3D estimation: Neural networks trained to predict 3D bounding box parameters (depth, height, width, orientation) directly from 2D image features, using learned priors about object sizes and road geometry
- Multi-camera triangulation: Overlapping camera fields of view allow stereo-like triangulation to compute metric depth for objects visible in multiple cameras
- Temporal structure from motion: Camera motion between frames (known from ego-motion estimation) creates parallax that enables depth computation through multi-view geometry
- ViDAR point cloud generation: Deep networks predict dense 3D point clouds from camera images (see Section 4)
- Ground plane projection: For objects assumed to rest on the road surface, 2D bounding box bottom edges combined with camera calibration yield metric distance estimates
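The ground-plane projection in the last bullet reduces to a simple pinhole relation on a flat road; a minimal sketch with assumed calibration values (not Mobileye's calibration or code):

```python
def distance_from_bbox_bottom(v_bottom_px: float, v_horizon_px: float,
                              focal_px: float, camera_height_m: float) -> float:
    """Flat-road pinhole approximation: an object resting on the ground whose bounding-box
    bottom edge lies (v_bottom - v_horizon) pixels below the horizon line is at roughly
    Z = f * h / (v_bottom - v_horizon). Illustrative numbers only."""
    dv = v_bottom_px - v_horizon_px
    if dv <= 0:
        raise ValueError("bottom edge must be below the horizon for a grounded object")
    return focal_px * camera_height_m / dv

# Example: 2000 px focal length, camera mounted 1.4 m above the road,
# bounding-box bottom edge 50 px below the horizon  ->  ~56 m away.
print(round(distance_from_bbox_bottom(590.0, 540.0, 2000.0, 1.4), 1))
```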
Road Model Generation
The camera subsystem constructs a comprehensive road model:
- Lane geometry: Represented as polynomial curves (3rd or 4th degree) in bird's-eye-view coordinates, with lane type classification (solid, dashed, double, Botts' dots)
- Lane topology: Graph structure capturing merges, splits, and intersection connectivity
- Road edges: Curbs, barriers, guardrails, and soft road boundaries (grass, gravel)
- Drivable path: The full free-space region where the vehicle can safely travel
- Road surface: Estimated curvature, slope, bank angle, and surface condition
Outputs
The camera-only subsystem produces a complete, self-contained world model sufficient for autonomous driving:
| Output Category | Specifics |
|---|---|
| Object list | 3D position, velocity, acceleration, heading, classification (vehicle, pedestrian, cyclist, motorcycle, animal, debris), dimensions, uncertainty bounds |
| Lane model | Per-lane polynomial geometry, lane type, lane graph topology, merge/split points |
| Free space | Per-direction free-space boundary with boundary classification (curb, vehicle, barrier, unknown) |
| Traffic infrastructure | Sign detections with classification, signal detections with state (red/yellow/green/arrow), construction zone boundaries |
| Road model | Road geometry (curvature, slope, bank), road type classification, speed limit inference |
| Depth map | Dense per-pixel depth estimates (ViDAR) |
| Semantic map | Per-pixel semantic labels for entire scene |
4. ViDAR (Vision as Virtual LiDAR)
Concept
ViDAR (Vision Detection and Ranging, or Visual LiDAR) is Mobileye's technique for generating LiDAR-like 3D point clouds from camera data alone. The fundamental idea: if a deep neural network can learn the mapping from 2D camera images to 3D point cloud representations, then the camera subsystem can reason about 3D geometry with the same data format as a LiDAR -- enabling a single set of downstream algorithms (path planning, obstacle avoidance, free-space estimation) to operate on either real LiDAR data or camera-derived virtual LiDAR data.
Architecture
ViDAR uses multiple views of the environment (both spatial -- from multiple cameras -- and temporal -- from vehicle motion) to produce dense depth estimates that are converted into a 3D point cloud.
The architecture broadly follows a three-stage pipeline:
History Encoder: A visual backbone encodes multiple camera images (potentially from multiple timesteps) into a bird's-eye-view (BEV) feature representation. This step lifts 2D features into a 3D-aware latent space.
Latent Rendering Operator: A novel module that models the 3D geometric latent space and bridges the encoder to the decoder. This operator converts the BEV features into a form suitable for 3D point prediction by reasoning about geometric relationships in latent space.
Point Cloud Decoder: Predicts dense 3D point clouds at target locations, generating the virtual LiDAR output. In the forecasting variant, an auto-regressive transformer iteratively predicts future point clouds for arbitrary timestamps.
Training Methodology
ViDAR networks are trained on paired camera + LiDAR data:
- Data: Driving sequences where camera images and simultaneous real LiDAR point clouds are captured
- Supervision: The real LiDAR point cloud serves as ground truth; the network learns to predict 3D structure from camera images alone
- Loss: Geometric losses (chamfer distance, depth error) ensure the predicted point cloud matches the real LiDAR spatially
- Augmentation: The training framework can also leverage "visual point cloud forecasting" -- predicting future point clouds from historical visual inputs -- which provides simultaneous supervision of semantics, 3D structure, and temporal dynamics
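As a toy illustration of the geometric supervision above, a symmetric chamfer distance between a predicted and a real point cloud can be computed as follows (minimal NumPy sketch, not Mobileye's training code):

```python
import numpy as np

def chamfer_distance(pred: np.ndarray, gt: np.ndarray) -> float:
    """Symmetric chamfer distance between two point clouds of shape (N, 3) and (M, 3):
    average nearest-neighbour distance from pred to gt plus from gt to pred. Shown only to
    illustrate the kind of geometric loss that supervises a camera-to-point-cloud network
    against real LiDAR ground truth."""
    d = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=-1)  # (N, M) pairwise distances
    return d.min(axis=1).mean() + d.min(axis=0).mean()

pred = np.random.rand(128, 3) * 50.0                          # predicted "virtual LiDAR" points (toy)
gt   = pred + np.random.normal(scale=0.2, size=pred.shape)    # real LiDAR points (toy)
print(round(chamfer_distance(pred, gt), 3))
```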
Accuracy vs. Real LiDAR
ViDAR does not fully match real LiDAR in raw point density or range precision, but it achieves sufficient accuracy for safe driving decisions:
- At close range (0--30 m): Depth accuracy is high due to strong stereo/motion cues
- At medium range (30--100 m): Accuracy degrades gracefully; the system compensates through learned priors about object size and road geometry
- At long range (100+ m): ViDAR produces sparser, less certain estimates, but imaging radar fills this gap in the True Redundancy architecture
ViDAR has been validated on four downstream perception tasks: 3D object detection, semantic occupancy prediction, map segmentation, and multi-object tracking. Notably, ViDAR pre-trained solely on paired image-LiDAR sequences outperforms supervised 3D-detection pre-training on these benchmarks.
Role in Production
In Mobileye's SuperVision system (camera-only), ViDAR provides the 3D perception backbone that allows the camera subsystem to maintain spatial awareness comparable to vehicles equipped with active sensors. The ViDAR output feeds directly into free-space estimation, path planning, and RSS computations.
For the Chauffeur and Drive platforms (which also have radar/LiDAR), ViDAR serves as the camera subsystem's 3D backbone, while real LiDAR/radar point clouds feed the independent active-sensor subsystem. The two 3D representations are never fused -- they are compared only at the planning layer, maintaining True Redundancy.
5. Road Experience Management (REM)
Architecture: Three-Layer System
REM is built on three layers that form a closed loop between millions of production vehicles and Mobileye's cloud infrastructure:
Layer 1: Harvesting Agents
Every Mobileye-equipped ADAS vehicle on the road (150+ million as of 2025) is a potential harvesting agent. The onboard EyeQ chip performs real-time analysis of camera feeds and extracts:
- Driving path geometry: The trajectory the vehicle follows, encoded as a 3D spline curve
- Stationary landmarks: Traffic signs (including text, colors, shape), lane markings, directional arrows, stop lines, traffic signals, guardrails, barriers, poles, road edges
- Road surface features: Line representations of lane markings and road edges
- Road signature profiles: Elevation and terrain characteristics
- Traffic patterns: Dynamic observations of typical speeds, congestion, behavior
The critical innovation is on-device compression: all of this information is compressed into Road Segment Data (RSD) packets at approximately 10 KB per kilometer -- roughly the size of a plain text email. This is achieved through Mobileye's real-time geometrical and semantic analysis algorithms, which extract only the essential map-relevant information and discard raw imagery. The data is fully anonymized before transmission.
Layer 2: Map Aggregating Server (Cloud)
Road Segment Data from millions of vehicles is uploaded to Mobileye's cloud infrastructure (running on AWS with 500,000 peak CPU cores, Apache Spark on Amazon EKS). The cloud pipeline:
- Aggregates multiple drives over the same road segment from different vehicles and time periods
- Aligns overlapping data using landmark detection and matching
- Reconciles conflicting observations through statistical fusion
- Detects changes (new construction, removed signs, changed lane markings) through sophisticated change-detection algorithms
- Builds/updates the Mobileye Roadbook -- the output HD map
Map generation is fully automated -- Mobileye demonstrated mapping all of Japan (25,000 km of roads) in 24 hours at the push of a button, with the entire map occupying only 400 MB (~16 KB/km).
Layer 3: Map-Consuming Agents
Autonomous and semi-autonomous vehicles receive the Roadbook via OTA (over-the-air) updates and use it for:
- Centimeter-accurate localization: The vehicle detects landmarks stored in the Roadbook using its cameras, then computes its precise position relative to those landmarks. This achieves 5--10 cm localization accuracy -- far superior to GPS alone (see the sketch after this list)
- Path planning: Lane-level geometry and topology provide the reference trajectory
- Anticipation: The map tells the vehicle what to expect ahead (upcoming curves, intersections, speed changes), enabling smoother driving
- Change detection feedback: If the vehicle observes a discrepancy between the Roadbook and reality (e.g., a sign has been removed), it reports this back to the cloud, closing the loop
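The centimeter-accurate localization step can be illustrated with a toy landmark-alignment computation -- standard least-squares rigid alignment over made-up landmarks, not Mobileye's localization code:

```python
import numpy as np

def align_2d(map_pts: np.ndarray, obs_pts: np.ndarray):
    """Toy landmark-based localization: given N landmarks with known 2D map positions (map_pts)
    and the same landmarks observed in the vehicle frame (obs_pts), solve for the rigid transform
    (rotation + translation) that best maps vehicle frame -> map frame via least-squares
    (Kabsch) alignment. Shown only to illustrate the idea."""
    mc, oc = map_pts.mean(axis=0), obs_pts.mean(axis=0)
    H = (obs_pts - oc).T @ (map_pts - mc)
    U, _, Vt = np.linalg.svd(H)
    R = Vt.T @ U.T
    if np.linalg.det(R) < 0:            # guard against a reflection solution
        Vt[-1] *= -1
        R = Vt.T @ U.T
    t = mc - R @ oc
    return R, t                          # vehicle pose: position t, heading from R

# Example with made-up landmarks (poles/signs) at known map positions.
map_pts = np.array([[105.0, 20.0], [112.0, 24.0], [98.0, 31.0]])
# The same landmarks as measured in the vehicle frame (vehicle actually at map position (100, 22)).
obs_pts = map_pts - np.array([100.0, 22.0])
R, t = align_2d(map_pts, obs_pts)
print(np.round(t, 2))                    # -> [100.  22.]  (estimated ego position in the map)
```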
Roadbook Contents
| Category | Data Stored |
|---|---|
| Lane geometry | 3D polynomial representations of each lane boundary and center |
| Lane topology | Graph structure: merge/split points, connectivity through intersections |
| Road signs | Position, type, text content, colors, relevancy to each lane |
| Traffic signals | Position, type, number of lights, applicable lanes |
| Road boundaries | Curbs, barriers, guardrails with 3D positions |
| Landmarks | Distinctive features used for localization (poles, sign posts, specific markings) |
| Driving culture | Typical speeds, common driving behaviors, traffic patterns per time of day |
| Road surface | Curvature, slope, bank angle, elevation profile |
| Stop/yield lines | Position and type |
| Directional arrows | Lane-level directional information |
Scale and Update Cadence
- Harvesting fleet: 150+ million vehicles equipped with Mobileye technology
- Data harvested in 2024: 29.6 billion miles
- Update frequency: Near real-time for high-traffic areas; change-detection algorithms trigger updates when discrepancies are detected
- Coverage: Global, with particularly dense coverage in Europe, North America, and East Asia
- Map freshness: Measured in hours/days for heavily traveled roads, not months or years
Compression: 10 KB/km
The 10 KB/km figure is achieved because REM transmits semantic abstractions rather than raw sensor data:
- Instead of transmitting images or point clouds (megabytes per frame), the harvesting agent extracts the geometric and semantic essence: "there is a speed limit sign reading 50 at position (x, y, z) relative to the lane center"
- Landmarks are stored as compact geometric descriptors (position, type, orientation) rather than image patches
- Driving paths are stored as polynomial curves (a few coefficients) rather than dense point sequences
- Traffic patterns are statistical summaries (average speed, variance) rather than raw trajectory logs
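A back-of-the-envelope sketch of why semantic abstractions are so compact (the record layout and landmark counts below are invented for illustration and are not the actual RSD format; the real Roadbook carries more content, hence ~10 KB/km rather than the few hundred bytes shown here):

```python
from dataclasses import dataclass
import struct

@dataclass
class Landmark:
    kind: int         # e.g. 3 = speed-limit sign (toy encoding)
    x: float          # position relative to the lane centre, metres
    y: float
    z: float
    value: int        # e.g. 50 for "speed limit 50"

def pack(lm: Landmark) -> bytes:
    # One landmark as a fixed 17-byte record: type (1B) + 3 floats (12B) + value (4B).
    return struct.pack("<BfffI", lm.kind, lm.x, lm.y, lm.z, lm.value)

# Toy kilometre of road: 40 landmarks, two lane boundaries as cubic polynomials (4 floats each),
# and one ego driving path as a cubic polynomial.
landmark_bytes = 40 * len(pack(Landmark(3, 1.5, 0.0, 2.1, 50)))
spline_bytes   = (2 + 1) * 4 * 4
total = landmark_bytes + spline_bytes
print(f"{total} bytes/km")   # a few hundred bytes -- orders of magnitude below raw imagery
```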
6. LiDAR/Radar Subsystem
Imaging Radar
Mobileye has developed a proprietary software-defined imaging radar from the ground up -- silicon, algorithms, and signal processing -- designed to serve as an independent perception channel that operates without any camera input.
Hardware Architecture
| Specification | Value |
|---|---|
| Processing SoC | Proprietary radar processor, 11 TOPS compute |
| RFIC | Custom Mobileye-designed radar RF integrated circuits |
| BSR (forward-facing) virtual channels | > 1,500 at 20 FPS |
| BSRC (corner-mounted) virtual channels | > 300 |
| Angular resolution | < 0.5 degrees |
| Sidelobe suppression | -40 dBc (vs. typical -20 to -30 dBc in automotive radars) |
| Dynamic range | 100 dB (vs. ~60 dB in conventional automotive radars) |
| Frame rate | 20 Hz |
| Manufacturing partner | WNC (collaboration for production) |
The massive antenna array combined with Mobileye-designed RFICs allows exceptional flexibility in signal transmission and the ability to receive and sample the entire radar signal in a wide bandwidth while keeping noise at a low level. Mobileye achieved a 12x resolution increase without a proportional compute increase through algorithmic innovation in signal processing.
Detection Capabilities
| Detection Target | Range |
|---|---|
| Road users (pedestrians, motorcycles, cyclists) | Up to 315 m |
| Potential hazards (debris, tire on road) | Up to 230 m |
| Vehicles | Up to 300+ m |
| Operating speed | Reliable up to 130 km/h |
The imaging radar's key differentiator over conventional automotive radar is its ability to provide height (elevation) information as an additional dimension beyond traditional 2D angular and range data, producing a "rich point cloud" that enables:
- Exact lane assignment of detected objects
- Distinction between overhead signs/bridges and road-level obstacles
- Detection of stationary vehicles under bridges (a notorious failure mode of conventional radar)
- Detection of a child at 150 m when a bus is only 10 m away (demonstrating exceptional dynamic range)
Independent Perception
The imaging radar is designed to function as a complete, independent perception system aligned with True Redundancy principles. It:
- Detects both dynamic and static objects independently of cameras or LiDAR
- Generates point clouds dense enough for object classification and tracking
- Operates reliably in adverse conditions (rain, fog, dust, direct sunlight, darkness) where cameras degrade
- Has been selected by a global automaker for SAE Level 3 eyes-off driving applications (announced May 2025)
LiDAR Strategy
Former FMCW LiDAR Program (Ended September 2024)
Mobileye had been developing a proprietary FMCW (Frequency-Modulated Continuous-Wave) LiDAR-on-a-chip:
- 4D measurements: position (x, y, z) + instantaneous velocity
- Range: up to 300 m
- Point density: 600 points per degree
- Pulse rate: 2 million laser pulses per second
- Advantage over ToF LiDAR: Direct velocity measurement and operation at lower, safer power levels
The ~100-person LiDAR division was shut down in September 2024 after advances in camera perception (EyeQ6-based) and imaging radar performance made proprietary FMCW LiDAR "less essential."
Current LiDAR Approach
- Drive (L4): Uses third-party LiDARs from Innoviz Technologies
- Chauffeur (L3): Front-facing LiDAR from Luminar (e.g., Polestar 4 integration announced)
- ID.Buzz AD configuration: 9 LiDARs + 5 radars + 13 cameras = 27 total sensors
Radar/LiDAR Subsystem Perception Pipeline
The active-sensor subsystem runs its own independent perception pipeline:
- Radar signal processing: Raw antenna data --> FFT (on VMP accelerators) --> beamforming --> detection --> point cloud generation (a range-Doppler FFT sketch follows this list)
- LiDAR point cloud processing: Raw returns --> filtering --> ground plane removal --> clustering
- Object detection: 3D object detection on radar/LiDAR point clouds, independent of any camera data
- Tracking: Multi-object tracking using radar/LiDAR detections only
- Free space estimation: Drivable area determined from point cloud density and ground plane analysis
- World model: Complete object list + drivable space map, fed independently to RSS
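The FFT stage at the top of this pipeline can be illustrated with generic FMCW range-Doppler processing on synthetic data (textbook two-stage FFT, not Mobileye's radar code):

```python
import numpy as np

# Toy FMCW radar cube: (chirps, samples per chirp) for one receive channel, synthetic data.
n_chirps, n_samples = 64, 256
rng = np.random.default_rng(0)
cube = rng.normal(size=(n_chirps, n_samples)) * 0.1
# Inject one synthetic target: beat frequency ~ range, phase slope across chirps ~ velocity.
t = np.arange(n_samples) / n_samples
for c in range(n_chirps):
    cube[c] += np.cos(2 * np.pi * 40 * t + 0.3 * c)

# Classic two-stage FFT: a range FFT along each chirp, then a Doppler FFT across chirps.
range_fft = np.fft.fft(cube, axis=1)
range_doppler = np.fft.fftshift(np.fft.fft(range_fft, axis=0), axes=0)
power = np.abs(range_doppler[:, :n_samples // 2]) ** 2

# The peak of the range-Doppler map gives the target's range bin and Doppler (velocity) bin.
doppler_bin, range_bin = np.unravel_index(np.argmax(power), power.shape)
print(f"target at range bin {range_bin}, Doppler bin {doppler_bin - n_chirps // 2}")
```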
7. Free Space Detection
Definition
Free space detection determines the drivable area surrounding the vehicle -- the region of road surface where the vehicle can safely travel without collision. It is one of the most safety-critical perception outputs: errors in free space estimation directly translate to either unnecessary stopping (false obstacles) or collisions (missed boundaries).
Camera-Based Free Space Methods
Mobileye employs multiple complementary algorithms for free space detection from cameras:
Semantic Segmentation Approach
A fully convolutional neural network classifies every pixel in the camera image into semantic categories. The "road surface" class defines the drivable area. Pixels classified as non-road (vehicles, pedestrians, curbs, barriers, vegetation) define the free-space boundary.
- Advantages: Dense, per-pixel output; can handle arbitrary road shapes and complex intersections
- Challenges: Requires robust classification at boundaries; errors at road edges are most safety-critical
Top-View Free Space Method
A dedicated algorithm projects camera features into a bird's-eye-view (BEV) representation and estimates the free-space boundary as a polar function around the ego vehicle -- for each angular direction, the algorithm determines the distance to the nearest obstacle or road boundary.
This method operates similarly to how a LiDAR system would determine free space: for each radial direction, find the first non-traversable point.
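A toy version of this polar formulation, run on a synthetic BEV occupancy grid (grid size and angular resolution are arbitrary, and the real system operates on learned BEV features rather than a binary grid):

```python
import numpy as np

def polar_free_space(occupancy: np.ndarray, ego_rc: tuple, n_angles: int = 72, max_r: int = 60):
    """For each angular direction around the ego cell, step outward through a BEV occupancy grid
    (1 = obstacle/boundary, 0 = free) and record the distance to the first occupied cell --
    a toy version of the 'distance-to-nearest-obstacle per direction' representation."""
    er, ec = ego_rc
    boundary = np.full(n_angles, float(max_r))
    for i, theta in enumerate(np.linspace(0.0, 2 * np.pi, n_angles, endpoint=False)):
        for r in range(1, max_r):
            rr = int(round(er + r * np.sin(theta)))
            cc = int(round(ec + r * np.cos(theta)))
            if not (0 <= rr < occupancy.shape[0] and 0 <= cc < occupancy.shape[1]) or occupancy[rr, cc]:
                boundary[i] = r
                break
    return boundary  # free-space radius (in cells) per direction

grid = np.zeros((100, 100), dtype=np.uint8)
grid[50, 70:] = 1                                     # a wall of obstacles to one side of the ego cell
print(polar_free_space(grid, ego_rc=(50, 50))[0])     # direction theta = 0 -> boundary at 20 cells
```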
Column-Wise Boundary Regression
Rather than classifying every pixel, this approach treats each image column as a 1D classification problem: for each column, predict the row (height) at which the free-space boundary lies. This is computationally cheaper than full semantic segmentation and produces smooth, continuous boundary curves.
The boundary is further classified by type:
- Traversable boundaries: Lane markings (can be crossed if needed)
- Non-traversable boundaries: Curbs, barriers, walls, vehicles
- Specific classes: Pedestrian, vehicle, curb, barrier, unknown
Geometric Methods
Structure-from-motion and ground plane estimation provide geometric free-space cues:
- Points that are consistent with the estimated ground plane model are classified as road surface
- Points that deviate from the ground plane (elevated or depressed) indicate obstacles or boundaries
Boundary Classification
Mobileye's free space system doesn't just detect the boundary -- it classifies what forms the boundary. This classification is safety-critical: the proper response to a curb boundary differs from the proper response to a pedestrian boundary.
| Boundary Type | Example | Response Implication |
|---|---|---|
| Vehicle | Parked car, moving vehicle | Full stop or lane change |
| Pedestrian | Person standing at curb | Maximum caution, full stop |
| Curb | Raised curb edge | Traversable in emergency; normally a hard boundary |
| Barrier | Jersey barrier, guardrail | Non-traversable |
| Vegetation | Grass, bushes | Soft boundary; traversable in emergency |
| Construction | Cones, temporary barriers | Non-traversable |
| Unknown | Unclassified obstacle | Treated conservatively |
Fusion of Free Space Methods
The multiple free-space algorithms produce overlapping estimates. Post-processing fuses these into a single free-space map using conservative logic: if any algorithm detects a boundary, it is included in the final free-space map. This "union of hazards" approach ensures maximum safety.
Snow and Edge Cases
Mobileye holds patents covering free-space detection in challenging conditions, including:
- Identifying probable road edge locations when lane markings are obscured by snow
- Using road geometry history from the REM Roadbook to infer boundaries when visual cues are absent
- Leveraging tire tracks in snow as implicit lane marking substitutes
8. Lane Detection
Mobileye's Lane Detection Heritage
Lane detection was one of Mobileye's founding capabilities -- the EyeQ1 (2008) shipped with lane departure warning (LDW) as a core function. Over nearly two decades, Mobileye has evolved from simple edge-detection-based lane finding to neural-network-based 3D lane detection with full topological understanding.
Classical Lane Model
In earlier EyeQ generations and the Mobileye 560/630 aftermarket systems, lanes were represented as polynomial curves:
- Model: Third-degree polynomial in the vehicle's coordinate system: y = a0 + a1*x + a2*x^2 + a3*x^3
- Parameters detected: Lane width, curvature, curvature derivative, heading angle, lateral offset
- Output: Coefficients for left and right lane boundaries, lane width, confidence level
This polynomial representation is compact and computationally efficient, suitable for highway driving where lanes follow smooth, predictable curves.
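For concreteness, evaluating the cubic model above and its curvature at a look-ahead distance is straightforward (the coefficient values below are arbitrary examples):

```python
def lane_lateral_offset(x: float, a0: float, a1: float, a2: float, a3: float) -> float:
    """Lateral offset y of the lane boundary at longitudinal distance x (metres),
    for the cubic model y = a0 + a1*x + a2*x^2 + a3*x^3."""
    return a0 + a1 * x + a2 * x**2 + a3 * x**3

def lane_curvature(x: float, a2: float, a3: float) -> float:
    """Approximate curvature (1/m) from the second derivative, valid for small heading angles:
    y'' = 2*a2 + 6*a3*x."""
    return 2 * a2 + 6 * a3 * x

# Example: boundary 1.8 m to the left, small heading offset, gentle curve.
a0, a1, a2, a3 = 1.8, 0.01, -2e-4, 1e-6
print(round(lane_lateral_offset(50.0, a0, a1, a2, a3), 2))    # offset at 50 m ahead
print(lane_curvature(0.0, a2, a3))                             # curvature at the vehicle
```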
3D-LaneNet: Neural 3D Lane Detection
In 2019, Mobileye researchers Noa Garnett, Rafi Cohen, Tomer Pe'er, Roee Lahav, and Dan Levi published "3D-LaneNet" (ICCV 2019), introducing the first end-to-end neural network for 3D multiple lane detection from a single image.
Key Innovations
Intra-Network IPM (Inverse Perspective Mapping): The network includes an internal projection layer that converts image-view features into top-view (BEV) features, enabling dual-representation information flow. The network processes features in both views simultaneously, allowing image-view features to capture appearance details while top-view features capture spatial relationships.
Anchor-Based Lane Representation: Rather than detecting lane pixels and then fitting curves, the network uses an "anchor-per-column" output representation that directly predicts lane positions. This replaces heuristic post-processing (clustering, outlier rejection) with a learned end-to-end pipeline, casting lane estimation as an object detection problem.
3D Prediction: The network directly predicts lanes in 3D -- not just 2D image coordinates, but 3D road coordinates including elevation variations (uphills, overpasses, road undulations). This was the first approach to predict 3D lane layouts from monocular images without assuming constant lane width or flat road surfaces.
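The intra-network IPM learns this image-to-top-view projection, but the underlying geometry can be sketched with a plain flat-ground back-projection using assumed calibration values (this is the classical geometric IPM, not the learned projection layer itself):

```python
import numpy as np

def pixel_to_ground(u: float, v: float, K: np.ndarray, cam_height: float) -> np.ndarray:
    """Back-project pixel (u, v) onto the flat ground plane located cam_height metres below the
    camera, assuming a forward-looking camera with zero pitch/roll (camera frame: x right,
    y down, z forward). Returns the 3D point in the camera frame."""
    ray = np.linalg.inv(K) @ np.array([u, v, 1.0])   # direction of the viewing ray
    if ray[1] <= 0:
        raise ValueError("pixel is at or above the horizon; it does not hit the ground plane")
    scale = cam_height / ray[1]                      # stretch the ray until it reaches the ground
    return ray * scale                               # (lateral, down, forward) in metres

K = np.array([[2000.0, 0.0, 960.0],   # toy intrinsics: fx = fy = 2000 px, principal point (960, 540)
              [0.0, 2000.0, 540.0],
              [0.0, 0.0, 1.0]])
pt = pixel_to_ground(u=1100.0, v=590.0, K=K, cam_height=1.4)
print(np.round(pt, 2))   # -> roughly [ 3.92  1.4  56. ]: 56 m ahead, ~4 m to the right
```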
Follow-up: 3D-LaneNet+
Mobileye researchers subsequently published "3D-LaneNet+: Anchor Free Lane Detection using a Semi-Local Representation" (2020), which improved on the original by removing the need for predefined anchors and using a semi-local representation that better handles complex lane topologies.
Lane Graph Topology
Modern Mobileye systems go beyond individual lane boundary detection to construct a full lane graph:
- Lane connectivity: Which lanes connect to which at intersections, merges, and splits
- Lane type: Driving lane, turn lane, bus lane, bike lane, shoulder
- Lane direction: Direction of travel for each lane
- Lane boundaries: Type (solid, dashed, double solid, Botts' dots) and color (white, yellow, blue)
- Lane changes: Where lane changes are permitted or prohibited
Intersection Handling
Intersections present unique challenges because lane markings often disappear or become ambiguous. Mobileye addresses this through:
- REM Roadbook: The HD map provides pre-mapped intersection topology, including the connection routes between entrance and exit lanes
- Implicit lane inference: When markings are absent, the system infers lane structure from road geometry, curb positions, and vehicle trajectories observed in the Roadbook
- Traffic signal association: Determining which signal controls which lane at complex intersections
Lane Detection Without Lane Lines
At CES 2025, Mobileye indicated that its advanced systems no longer rely solely on painted lane lines for autonomous driving. The REM Roadbook provides lane-level geometry independent of paint markings, and the perception system can determine lane structure from:
- Road edges and curbs
- Vehicle behavior patterns from the Roadbook
- Road geometry (width, curvature)
- Traffic infrastructure (signals, signs, islands)
9. Traffic Light and Sign Detection
Traffic Sign Recognition (TSR)
Traffic sign recognition was one of Mobileye's original ADAS capabilities, shipping on the EyeQ1 in 2008. It has evolved into a comprehensive, multi-layered recognition system.
Detection and Classification Pipeline
Detection: A CNN-based detector identifies candidate sign regions in the camera image, handling varying sizes (from distant highway signs to nearby street signs), partial occlusion, and diverse environmental conditions (rain, glare, night)
Classification: Detected sign regions are classified into specific sign types through a multi-network pipeline:
- Shape-based classification: Geometric shape (circle, triangle, octagon, diamond, rectangle) provides initial categorization
- Appearance-based classification: CNN classifiers determine the specific sign meaning (speed limit value, warning type, regulatory instruction)
- OCR (Optical Character Recognition): For signs containing text (speed limit numbers, street names, variable message signs), OCR networks read the text content
- Signature-based classification: A novel Mobileye approach where signs are matched against a database of visual "signatures." This allows the system to be updated post-deployment when new sign types are introduced -- the vehicle receives OTA updates with new sign signatures without requiring full model retraining
Relevancy determination: A critical innovation -- the system determines which lane each sign applies to. A speed limit sign on a highway offramp should not affect the main lanes. The relevancy algorithm considers:
- Sign position relative to lane boundaries
- Sign orientation and facing direction
- Sign mounting height and type (overhead, post-mounted, pole-mounted)
- Road geometry context
Intelligent Speed Assist (ISA)
Mobileye launched the world's first camera-only Intelligent Speed Assist system certified across all 27 EU countries plus Norway, Switzerland, and Turkey, meeting the EU General Safety Regulation (GSR) mandate effective July 2022 for new models and July 2024 for existing production lines.
The ISA system integrates five technologies:
| Technology | Function |
|---|---|
| Traffic Sign Recognition | Identifies explicit speed limit signs |
| Implicit Sign Recognition | Detects signs indicating speed changes (school zones, construction, highway entrances/exits) |
| Traffic Sign Relevancy | Determines which lane a sign applies to |
| Signature-Based Classification | Future-proof recognition of new sign types via OTA updates |
| Road-Type Classification | Deep neural network determines road type (highway, urban, rural) to infer applicable speed limit |
The system was validated by six independent labs across five European countries, achieving >90% correct speed limit determination across total distance and >80% per road type (exceeding EU minimums).
Training Data
The TSR system is trained on "tens of millions of video clips, encompassing an enormous variety of parameters and features of roadways from around the world," covering diverse sign designs across different countries and continents.
Traffic Light Detection
Traffic light detection follows a similar pipeline with additional challenges:
- Detection: Locate traffic signal heads in the image, distinguishing them from other light sources (tail lights, streetlights, reflections)
- State classification: Determine the active state: red, yellow, green, flashing, arrow (and arrow direction)
- Association: Determine which signal controls which approach lane -- a particularly difficult problem at complex intersections with multiple signal heads
- Relevancy: Confirm the signal applies to the ego vehicle's current path
The REM Roadbook pre-maps traffic signal locations, types, and lane associations, providing a strong prior that helps resolve ambiguities in real-time detection.
10. Pedestrian/VRU Detection
ADAS Heritage
Pedestrian detection is one of Mobileye's deepest areas of expertise. The EyeQ2 (2010) introduced pedestrian detection as a key capability, enabling automatic emergency braking (AEB) for pedestrian collision avoidance. Today, approximately 125-150 million vehicles worldwide use Mobileye's pedestrian detection technology.
The Insurance Institute for Highway Safety (IIHS) found that AEB can reduce the rate of pedestrian collisions by over a quarter. Mobileye estimates that if vehicles involved in 2020 pedestrian collisions had been equipped with AEB, approximately 1,700 lives could have been saved in the United States alone.
Multi-Method Detection Architecture
Mobileye employs multiple parallel detection algorithms operating simultaneously on camera feeds, with each method designed to catch cases the others might miss:
Classic Pattern Recognition
The foundational method: a trained classifier (CNN-based) identifies and classifies objects and road users based on learned visual patterns. The network has been trained on millions of annotated pedestrian examples across diverse conditions (day, night, rain, partial occlusion, varying clothing, various poses).
Full Image Detection
For larger objects in close proximity to the vehicle, a wide-area detection method covers the full image frame rather than scanning local regions. This catches pedestrians who are very close (and therefore very large in the image) or who are at unusual positions (e.g., crossing from behind a parked vehicle).
Segmentation Method
A semantic segmentation network labels individual pixels and groups of pixels to identify smaller elements in the driving environment, including pedestrians and cyclists. This method excels at detecting partially occluded pedestrians (only legs visible under a vehicle, only upper body visible behind a bollard).
Top-View Free Space Method
Distinguishes road users from the road surface itself in the bird's-eye-view representation. A pedestrian standing on the road creates a distinct free-space boundary that this method detects even when the person is at an unusual angle or partially blended with the background.
Wheel Detection
A specialized algorithm that classifies vehicles by identifying their wheels. While primarily for vehicle detection, it also contributes to distinguishing between pedestrians and vehicles in ambiguous situations.
ViDAR-Based 3D Detection
The ViDAR system creates a 3D point cloud from camera data. Pedestrians appear as distinctive 3D clusters in this representation, providing an additional detection channel that is fundamentally different from 2D image-based methods.
Specialized VRU Detection
Beyond standard pedestrian detection, Mobileye has developed dedicated algorithms for:
- Baby strollers: Detecting strollers even when the pushing adult is not visible
- Wheelchairs: Recognizing wheelchair users with their distinctive profile
- Open car doors: Detecting when a parked car's door is opening into the path of travel
- Cyclists: Dedicated bicycle and e-scooter detection
- Motorcyclists: Detection at extended ranges
Behavioral Prediction
Mobileye's pedestrian detection goes beyond mere detection to behavioral analysis:
- Orientation estimation: Determining which direction a pedestrian is facing
- Posture analysis: Standing, walking, running, crouching, bending
- Gesture recognition: Detecting hand signals, waving, pointing
- Intent prediction: Using orientation, posture, and trajectory history to predict whether a pedestrian is likely to step into the road
Euro NCAP Compliance
Mobileye's pedestrian detection supports OEM compliance with Euro NCAP VRU (Vulnerable Road User) testing protocols:
- AEB Pedestrian scenarios test detection and braking from speeds of 10 km/h upward
- Day and night scenarios are required
- The system must detect pedestrians crossing from both sides, including partially occluded scenarios
- Recent protocols require detection of cyclists and other VRUs
11. Object Detection Architecture
Design Philosophy
Mobileye uses multiple specialized neural network architectures rather than a single monolithic model. Each network is optimized for its specific task and the computational profile of EyeQ accelerators.
Multi-Scale Detection
Object detection operates across the feature pyramid produced by the shared backbone:
- Small/distant objects (pedestrians at 100+ m, small debris): Detected at high-resolution feature levels where spatial detail is preserved
- Medium objects (vehicles at 20-80 m): Detected at mid-resolution levels balancing detail and context
- Large/close objects (adjacent vehicles, trucks): Detected at low-resolution levels where large receptive fields capture the full object
Detection Head Architecture
Mobileye's detection heads follow a region-based or anchor-based paradigm (similar in spirit to CenterPoint-style detection), with outputs including:
- 2D bounding box: Location in image coordinates
- 3D bounding box parameters: Depth, 3D dimensions (height, width, length), orientation angle
- Classification: Multi-class output (car, truck, bus, motorcycle, bicycle, pedestrian, animal, debris, cone, barrier)
- Confidence score: Per-class probability
- Attributes: Brake lights, turn signals, hazard lights (for vehicles); orientation, posture (for pedestrians)
Teacher-Student Training Framework
Mobileye employs a teacher-student knowledge distillation framework:
- Teacher network: A large, computationally expensive model is trained on the full dataset using high-precision (FP32) computation, potentially running offline on cloud GPUs
- Automatic Ground Truth (Auto GT): Advanced sensors and offline processing generate accurate representations of driving scenarios automatically, reducing manual labeling dependency
- Student network: A smaller, efficient model is trained to mimic the teacher's outputs while being tailored for specific EyeQ hardware configurations (EyeQ4 vs. EyeQ6H vs. EyeQ Ultra)
This approach enables Mobileye to deploy different-complexity models across different product tiers while maintaining accuracy: the EyeQ4 student model is simpler but benefits from knowledge distilled from a much more capable teacher.
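A minimal sketch of the distillation objective (generic soft-target distillation in PyTorch; Mobileye's actual training recipe is not public):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor, teacher_logits: torch.Tensor,
                      labels: torch.Tensor, T: float = 4.0, alpha: float = 0.5) -> torch.Tensor:
    """Generic soft-target distillation: the student matches the teacher's temperature-softened
    class distribution in addition to the (e.g. auto-generated) ground-truth labels."""
    soft = F.kl_div(F.log_softmax(student_logits / T, dim=-1),
                    F.softmax(teacher_logits / T, dim=-1),
                    reduction="batchmean") * (T * T)
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy example: 8 detections, 10 object classes.
student = torch.randn(8, 10, requires_grad=True)
teacher = torch.randn(8, 10)                 # produced offline by the large teacher model
labels  = torch.randint(0, 10, (8,))         # e.g. Auto GT labels
loss = distillation_loss(student, teacher, labels)
loss.backward()
print(float(loss))
```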
Quantization
All inference models are quantized for Int8 execution on XNN accelerators:
- Quantization-Aware Training (QAT): During training, fake quantization nodes simulate the precision loss of Int8, allowing the network to adapt its weights to maintain accuracy under quantization
- Post-Training Quantization (PTQ): For faster deployment, models can be quantized after training with calibration on representative data
- EyeQ6H is specifically "designed for Int8 computation" -- the hardware is optimized for this precision level
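A minimal illustration of the fake-quantization trick behind QAT -- values are rounded to Int8 levels in the forward pass while gradients flow through unchanged (toy code, not Mobileye's toolchain):

```python
import torch

def fake_quantize(x: torch.Tensor, num_bits: int = 8) -> torch.Tensor:
    """Simulate symmetric Int8 quantization in the forward pass while keeping gradients flowing
    (straight-through estimator): scale, round to integer levels, then dequantize."""
    qmax = 2 ** (num_bits - 1) - 1                      # 127 for Int8
    scale = x.detach().abs().max().clamp(min=1e-8) / qmax
    q = torch.clamp(torch.round(x / scale), -qmax - 1, qmax)
    x_q = q * scale
    return x + (x_q - x).detach()                       # forward: quantized value; backward: identity

w = torch.randn(16, requires_grad=True)
w_q = fake_quantize(w)
w_q.sum().backward()                                    # gradients pass straight through
print(torch.allclose(w.grad, torch.ones_like(w)))       # -> True
print(float((w - w_q).abs().max()))                     # quantization error bounded by scale/2
```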
12. Object Tracking
Camera-Only Tracking
In the camera subsystem, object tracking maintains consistent identity and trajectory estimation for detected objects across frames:
Association
Each detection in a new frame is matched to an existing track using:
- Position prediction: Kalman filter or similar state estimator predicts where each tracked object should appear in the new frame
- Appearance matching: Feature vectors extracted by the detection network are compared between tracks and new detections
- 3D motion consistency: The predicted 3D trajectory must be physically plausible
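A minimal association step -- constant-velocity prediction plus greedy, gated nearest-neighbour matching -- conveys the structure, though the production tracker is far more sophisticated (illustrative code, hypothetical data layout):

```python
import numpy as np

def associate(tracks: dict, detections: np.ndarray, dt: float = 0.1, gate: float = 2.0):
    """tracks: {track_id: (position (2,), velocity (2,))}; detections: (M, 2) positions.
    Predict each track forward with a constant-velocity model, then greedily match each
    prediction to the nearest unassigned detection within the gating distance."""
    matches, used = {}, set()
    for tid, (pos, vel) in tracks.items():
        pred = pos + vel * dt                                   # constant-velocity prediction
        dists = np.linalg.norm(detections - pred, axis=1)
        for j in np.argsort(dists):
            if j not in used and dists[j] < gate:
                matches[tid] = int(j)
                used.add(j)
                break
    unmatched = [j for j in range(len(detections)) if j not in used]
    return matches, unmatched       # unmatched detections can initiate new tracks

tracks = {7: (np.array([10.0, 2.0]), np.array([5.0, 0.0]))}    # one track moving forward at 5 m/s
dets = np.array([[10.6, 2.1], [30.0, -3.0]])
print(associate(tracks, dets))   # -> ({7: 0}, [1]): detection 0 continues track 7, detection 1 is new
```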
State Estimation
For each tracked object, the tracker maintains:
- 3D position (x, y, z) in ego-vehicle coordinates
- 3D velocity (vx, vy, vz)
- 3D acceleration estimates
- Heading angle and yaw rate
- Object dimensions
- Classification history
- Confidence/track quality
Multi-Camera Tracking
With 11-13 cameras providing 360-degree coverage, objects transition between camera views as they move around the vehicle. Mobileye's tracker must:
- Re-identify objects when they move from one camera's field of view to another
- Maintain continuous tracks through camera transitions
- Handle temporary occlusion (object briefly hidden behind another vehicle, pole, etc.)
- Manage track initiation (new objects entering the scene) and termination (objects leaving)
Radar/LiDAR-Augmented Tracking
In the radar/LiDAR subsystem (Chauffeur and Drive platforms), tracking operates independently on active-sensor detections:
- Radar provides direct radial velocity measurements for each detection, greatly simplifying velocity estimation
- LiDAR provides precise 3D point clouds that enable robust association even when objects are closely spaced
- The active-sensor tracker produces its own independent track list, which is compared to (but not fused with) the camera tracker's output at the planning layer
Temporal Consistency
The tracker enforces temporal consistency constraints:
- Objects cannot teleport -- position changes must be physically plausible given estimated velocity
- Classifications should be stable -- a vehicle should not suddenly become a pedestrian (classification history is used to smooth transient misclassifications)
- Dimensions should be consistent -- an object's estimated size should not vary wildly between frames
13. Semantic Segmentation
Per-Pixel Classification
Mobileye's semantic segmentation network assigns a class label to every pixel in each camera image. On the EyeQ6H, this runs at over 1,000 FPS -- enabling multiple segmentation passes per perception cycle, or the use of higher-resolution inputs.
Segmentation Classes
Mobileye's segmentation network covers a comprehensive set of urban and highway driving classes:
| Category | Classes |
|---|---|
| Road surface | Drivable road, non-drivable road, crosswalk, speed bump |
| Lane markings | Solid line, dashed line, double line, Botts' dots, stop line, directional arrow |
| Vehicles | Car, truck, bus, motorcycle, bicycle |
| VRUs | Pedestrian, cyclist, rider |
| Infrastructure | Curb, sidewalk, barrier, guardrail, fence, wall |
| Vegetation | Tree, bush, grass |
| Traffic elements | Traffic sign, traffic light, pole, utility structure |
| Construction | Cone, temporary barrier, construction zone |
| Sky | Sky region |
| Terrain | Unpaved ground, gravel, dirt |
Architecture
The segmentation network uses a fully convolutional architecture (no fully connected layers) that:
- Consumes multi-scale features from the shared backbone
- Produces a dense prediction at the target resolution
- Can operate at multiple resolutions depending on the task (full-resolution for lane markings, lower resolution for coarse scene understanding)
The EyeQ6H's achievement of >1,000 FPS on pixel-labeling networks (compared to 91 FPS on EyeQ5) represents more than a 10x improvement, enabling denser and more frequent segmentation than previous generations.
Applications
Semantic segmentation feeds into multiple downstream tasks:
- Free space estimation: The road surface class defines drivable area
- Lane detection: Lane marking classes provide input to lane geometry estimation
- Object detection validation: Segmentation provides a complementary signal to bounding-box detection
- Scene understanding: Overall scene composition (urban, highway, rural, parking lot) is inferred from the distribution of semantic classes
Point Density Improvement
The EyeQ6H improves pixel segmentation capabilities through a dynamic neural network architecture that achieves more than double the point density compared to the EyeQ4M era. This means finer-grained boundary detection, more precise lane marking localization, and better separation of closely spaced objects.
14. Occupancy Prediction
Mobileye's Approach to 3D Scene Representation
While Tesla popularized the "occupancy network" concept at AI Day 2022, Mobileye's approach to 3D scene representation predates this and takes a different architectural path aligned with Compound AI principles.
Camera-Based 3D Occupancy
Mobileye's perception stack constructs a 3D understanding of the scene through multiple complementary representations:
- ViDAR Point Clouds: Dense 3D point cloud predictions from camera images (see Section 4) provide explicit 3D occupancy information -- regions with predicted points are occupied, regions without are free
- BEV Feature Maps: Multi-camera features are lifted into bird's-eye-view space, where occupancy can be inferred from feature density and classification
- Semantic Voxels: The combination of depth estimation and semantic segmentation produces a semantically labeled 3D voxel representation of the scene
Free Space as Inverse Occupancy
Mobileye's free space detection (Section 7) is effectively the complement of occupancy prediction -- it determines where the vehicle can drive, which is equivalent to determining where obstacles are not. The multiple free-space algorithms together produce a comprehensive occupancy map.
RSS-Compatible Occupancy
A critical distinction in Mobileye's approach: the occupancy representation must be compatible with RSS computations. RSS requires:
- Discrete object representations (position, velocity, dimensions) for each road user
- Well-defined geometric relationships between the ego vehicle and each object
- Conservative uncertainty handling (absence of detection does not mean the space is clear)
This means Mobileye's occupancy representation goes beyond generic voxel grids to maintain object-level semantics necessary for formal safety reasoning.
Integration with REM
The REM Roadbook provides a static occupancy prior -- the map tells the system where permanent structures (buildings, barriers, poles) are located. The real-time perception system then detects dynamic occupancy (vehicles, pedestrians, temporary obstacles) on top of this static layer, enabling efficient change detection and reducing the compute burden of full 3D reconstruction.
15. Depth Estimation
Multi-Method Depth Estimation
Mobileye employs multiple complementary methods for estimating depth (distance to objects and surfaces) from cameras:
Monocular Depth Estimation
A deep neural network predicts per-pixel depth from a single camera image. The network learns depth cues from:
- Perspective: Objects appear smaller with distance
- Occlusion: Closer objects occlude farther ones
- Texture gradients: Road texture density increases with distance
- Known object sizes: Learned priors about typical sizes of cars, people, signs
- Atmospheric perspective: Distant objects have reduced contrast
Mobileye holds patents on joint depth estimation and semantic segmentation from a single image (see Section 21), where the system simultaneously predicts both depth and semantic labels for each pixel, with the two tasks mutually reinforcing each other.
Multi-Camera Stereo
With 11-13 cameras on SuperVision/Drive, overlapping fields of view between adjacent cameras create stereo pairs:
- Front narrow + front wide cameras provide stereo depth for the forward direction
- Side-front + side-rear cameras provide stereo depth for lateral regions
- The geometric baseline between cameras is known from vehicle design, enabling triangulated depth computation
This approach provides metric depth without learned priors -- it is purely geometric, making it a valuable redundant signal alongside monocular estimation.
Structure from Motion (SfM)
As the vehicle moves, each camera captures the scene from slightly different viewpoints across successive frames. This temporal parallax enables depth estimation through structure from motion:
- Ego-motion is estimated from visual odometry, IMU, and wheel speed sensors
- Feature correspondences between frames are triangulated using the known ego-motion
- This produces a sparse 3D point cloud that is densified through interpolation
SfM is particularly effective for static scene elements (road surface, buildings, infrastructure) and provides an independent depth estimate that does not rely on learned networks.
ViDAR Dense Depth
The ViDAR system (Section 4) produces the densest depth representation, predicting a full 3D point cloud from camera images. This combines the advantages of monocular depth learning (dense predictions, works for dynamic objects) with geometric reasoning (multi-view consistency).
Depth Fusion
The multiple depth estimates are fused to produce a robust, comprehensive depth map (a fusion sketch follows this list):
- Monocular depth provides dense initialization everywhere in the image
- Stereo depth provides precise metric calibration where overlapping cameras exist
- SfM provides temporal validation for static scene elements
- ViDAR integrates all cues into a unified 3D representation
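One common way to combine such estimates is inverse-variance weighting. The sketch below assumes per-pixel depth and uncertainty maps are available for each method; it illustrates the principle and is not Mobileye's actual fusion scheme.

```python
import numpy as np

def fuse_depth(estimates, variances):
    """Inverse-variance fusion of per-pixel depth maps (e.g. mono, stereo, SfM).

    estimates : list of (H, W) depth maps in meters (np.nan where unavailable).
    variances : list of matching (H, W) variance maps (larger = less trusted).
    """
    num = np.zeros_like(estimates[0])
    den = np.zeros_like(estimates[0])
    for d, v in zip(estimates, variances):
        w = np.where(np.isnan(d), 0.0, 1.0 / v)   # ignore missing pixels
        num += w * np.nan_to_num(d)
        den += w
    return num / np.maximum(den, 1e-6)            # fused depth map
```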
Patent: Estimating Distance Using Monocular Camera Sequences (US8164628B2)
Filed by Mobileye Technologies Ltd in 2006, this foundational patent covers estimating distance to an object using a sequence of images recorded by a monocular camera. The method uses the changing apparent size and position of objects across frames, combined with known camera motion, to compute metric distances.
16. Temporal Processing
Multi-Frame Aggregation
Single-frame perception is inherently noisy and incomplete. Mobileye's perception stack aggregates information across multiple frames to improve accuracy, consistency, and completeness.
Feature-Level Temporal Fusion
Multi-frame features are aggregated in BEV space (a warping sketch follows this list):
- Features from consecutive frames are warped to compensate for ego-motion (using estimated vehicle odometry)
- Warped features are aggregated through learned temporal fusion modules (e.g., deformable convolution, attention mechanisms)
- The aggregated features contain temporal context that individual frames lack: motion information, occluded region filling, and noise reduction
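A minimal numpy sketch of the ego-motion warping step, using nearest-neighbor resampling of a BEV grid. The planar-motion model, cell resolution, and the simple averaging at the end are simplifying assumptions; production systems use learned fusion modules as noted above.

```python
import numpy as np

def warp_bev_features(prev_feat, dx, dy, dyaw, resolution=0.5):
    """Re-express a previous-frame BEV feature map in the current ego frame.

    prev_feat : (C, H, W) features in the ego frame at time t-1.
    dx, dy    : ego translation from t-1 to t, expressed in the t-1 frame [m].
    dyaw      : ego heading change from t-1 to t [rad].
    resolution: BEV cell size [m/cell]; ego assumed at the grid center.
    """
    C, H, W = prev_feat.shape
    # Metric coordinates of every cell center in the *current* (t) ego frame.
    ys, xs = np.meshgrid(np.arange(H), np.arange(W), indexing="ij")
    x_t = (xs - W / 2.0) * resolution   # lateral
    y_t = (ys - H / 2.0) * resolution   # longitudinal
    # Map current-frame coordinates back into the t-1 ego frame:
    # p_{t-1} = R(dyaw) @ p_t + [dx, dy]
    c, s = np.cos(dyaw), np.sin(dyaw)
    x_p = c * x_t - s * y_t + dx
    y_p = s * x_t + c * y_t + dy
    # Convert to source cell indices (nearest neighbor), masking out-of-range.
    src_x = np.round(x_p / resolution + W / 2.0).astype(int)
    src_y = np.round(y_p / resolution + H / 2.0).astype(int)
    valid = (src_x >= 0) & (src_x < W) & (src_y >= 0) & (src_y < H)
    warped = np.zeros_like(prev_feat)
    warped[:, ys[valid], xs[valid]] = prev_feat[:, src_y[valid], src_x[valid]]
    return warped, valid

# Simplest possible aggregation: blend the warped history with the new frame.
# aggregated = 0.5 * current_feat + 0.5 * warped
```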
Detection-Level Temporal Smoothing
Object detections are smoothed across frames through the tracking system (see the sketch after this list):
- Transient false detections are suppressed (an object must be detected consistently across multiple frames to be confirmed)
- Transient missed detections are bridged (tracking maintains an object's predicted state through brief occlusion)
- Classification is stabilized (the most frequent classification over recent frames is used)
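A toy sketch of these three mechanisms: M-of-N confirmation, occlusion bridging via a miss budget, and majority-vote class stabilization. The thresholds are illustrative, not Mobileye's tuning.

```python
from collections import Counter, deque

class TrackSmoother:
    """Per-track confirmation, coasting, and class stabilization."""

    def __init__(self, confirm_hits=3, max_misses=5, class_window=10):
        self.hits, self.misses = 0, 0
        self.confirm_hits, self.max_misses = confirm_hits, max_misses
        self.classes = deque(maxlen=class_window)

    def update(self, detected, class_label=None):
        if detected:
            self.hits += 1
            self.misses = 0
            if class_label is not None:
                self.classes.append(class_label)
        else:
            self.misses += 1          # bridge brief occlusions before dropping
        confirmed = self.hits >= self.confirm_hits
        alive = self.misses <= self.max_misses
        label = Counter(self.classes).most_common(1)[0][0] if self.classes else None
        return confirmed, alive, label
```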
Map-Level Temporal Accumulation
The REM system provides the ultimate temporal accumulation: information from millions of vehicles over months and years is aggregated into the Roadbook. This provides a robust prior that supplements real-time single-vehicle perception.
Ego-Motion Estimation
Accurate ego-motion estimation is essential for temporal processing -- without knowing precisely how the vehicle moved between frames, multi-frame aggregation degrades. Mobileye estimates ego-motion from:
- Visual odometry (feature matching across camera frames)
- Vehicle dynamics sensors (wheel speed, steering angle, IMU)
- REM map-based localization (matching observed landmarks to the Roadbook)
STAT Transformer for Temporal Efficiency
Mobileye's STAT (Sparse Typed Attention) transformer architecture (described in Section 19) is used for temporal processing. By organizing tokens into structured groups with "manager tokens" enabling controlled communication, STAT achieves 100x efficiency improvements over standard transformer attention, making temporal multi-frame processing practical on EyeQ hardware.
17. RSS Integration with Perception
Perception as RSS Input
The RSS (Responsibility-Sensitive Safety) model operates on the output of perception. It does not process raw sensor data -- it consumes a structured representation of the driving scene.
Required Perception Outputs for RSS
| Perception Output | RSS Usage |
|---|---|
| Object position (x, y, z) | Input to safe distance calculation |
| Object velocity (vx, vy) | Used in longitudinal and lateral safe distance formulas |
| Object classification | Determines applicable RSS rules (vehicle vs. pedestrian vs. cyclist) |
| Object dimensions | Defines the spatial extent for collision geometry |
| Lane model | Determines lateral relationships (same lane, adjacent lane, oncoming) |
| Free space | Constrains available escape maneuvers |
| Traffic signals/signs | Determines right-of-way assignments |
| Ego position and velocity | One side of every constellation pair |
Constellation Formation
For each detected object, RSS creates a "constellation" -- a pairwise geometric relationship between the ego vehicle and that object. The constellation captures:
- Longitudinal distance and relative velocity
- Lateral distance and relative velocity
- Lane relationship (same lane, adjacent, oncoming, crossing)
- Right-of-way status (who has priority)
- Visibility status (is the object in a blind spot or occluded area?)
Safe Distance Computation
For each constellation, RSS computes the minimum safe distance using the mathematical formulas:
Longitudinal safe distance (d_min_longitudinal): Ensures that even if the leading vehicle brakes at maximum deceleration instantaneously, the ego vehicle -- after a bounded response time rho -- can brake to a stop without collision.
Parameters:
- v_r: rear (ego) vehicle velocity
- v_f: front vehicle velocity
- rho: response time (perception + actuation latency)
- a_max_accel: maximum acceleration during response time
- a_min_brake: minimum braking deceleration of ego vehicle
- a_max_brake: maximum braking deceleration of lead vehicle
Lateral safe distance (d_min_lateral): An analogous formula accounting for lateral velocities, lateral response time, and lateral braking capabilities.
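For reference, the longitudinal formula as published in the RSS paper (arXiv:1708.06374) can be sketched directly from the parameters above; the numeric values in the example are illustrative, not calibrated RSS parameters.

```python
def rss_longitudinal_min_distance(v_rear, v_front, rho,
                                  a_max_accel, a_min_brake, a_max_brake):
    """Minimum longitudinal safe distance per the RSS paper (arXiv:1708.06374),
    using the parameter names from the list above; clamped at zero."""
    d = (v_rear * rho
         + 0.5 * a_max_accel * rho ** 2
         + (v_rear + rho * a_max_accel) ** 2 / (2.0 * a_min_brake)
         - v_front ** 2 / (2.0 * a_max_brake))
    return max(d, 0.0)

# Illustrative example: ego at 20 m/s behind a lead vehicle at 15 m/s,
# rho = 0.5 s, a_max_accel = 2, a_min_brake = 4, a_max_brake = 8 (m/s^2)
# -> roughly 51 m of required headway.
```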
Proper Response
When the measured distance drops below d_min, the situation is classified as "dangerous" and RSS computes a "proper response" -- acceleration limits [a_min, a_max] for both longitudinal and lateral dimensions that will restore safety. The proper response guarantees that the ego vehicle will not cause a collision.
Aggregation
All per-object proper responses are combined into a single actuation constraint. The planner must generate trajectories within this constraint envelope. Any trajectory satisfying all constraints is provably safe with respect to RSS.
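A minimal sketch of this aggregation as an intersection of per-constellation acceleration intervals (longitudinal only, for brevity); the handling of an empty intersection is a simplifying assumption.

```python
def aggregate_proper_responses(responses):
    """Intersect per-constellation acceleration envelopes into one constraint.

    responses: list of (a_min, a_max) longitudinal acceleration bounds [m/s^2],
               one per dangerous constellation.
    Returns the envelope the planner must respect, or None if the intersection
    is empty (in which case the most restrictive braking bound applies).
    """
    a_lo = max((r[0] for r in responses), default=float("-inf"))
    a_hi = min((r[1] for r in responses), default=float("inf"))
    return (a_lo, a_hi) if a_lo <= a_hi else None
```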
RSS Dependency on Perception Quality
RSS's safety guarantees are only as good as the perception that feeds it. This creates specific requirements:
- Recall: Perception must not miss safety-critical objects. A missed pedestrian means RSS cannot create a constellation for them, invalidating the safety guarantee. True Redundancy addresses this: if either subsystem detects the object, RSS will account for it.
- Position accuracy: RSS safe distance formulas are sensitive to position error. Overestimating distance to a hazard can lead to insufficient braking. The 5-10 cm localization accuracy from REM helps ground the perception in metric coordinates.
- Latency: The response time rho in RSS includes perception latency. Lower perception latency allows shorter rho values, which translate to smaller required safe distances and more agile driving behavior.
RSS Rule 4: Occluded Regions
RSS Rule 4 ("Be cautious with limited visibility") specifically addresses perception limitations. When perception cannot see into an occluded region (behind a parked truck, around a blind corner), RSS assumes a hazard may exist there and computes a conservative speed limit that would allow the vehicle to stop if a hazard emerges.
This is a formal way of encoding "absence of evidence is not evidence of absence" into the driving policy.
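A simplified sketch of the resulting speed cap, assuming a static hazard that may appear at the edge of the visible region and the same response-time and braking parameters as above; real Rule 4 reasoning also accounts for hazards that emerge with their own motion.

```python
import math

def max_speed_for_visibility(d_visible, rho, a_min_brake):
    """Largest speed from which the ego vehicle can still stop within the
    visible distance, i.e. v*rho + v^2 / (2*a_min_brake) <= d_visible."""
    b = a_min_brake
    v = -b * rho + math.sqrt((b * rho) ** 2 + 2.0 * b * d_visible)
    return max(v, 0.0)

# Illustrative example: 25 m of visible road past a parked truck,
# rho = 0.5 s, braking 4 m/s^2 -> roughly 12.3 m/s (~44 km/h) as the cap.
```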
18. VLSA (Vision Language Safety Agent)
Compound AI Architecture: Fast-Think / Slow-Think
Mobileye's autonomy software architecture, unveiled across CES 2025 and 2026, replaces monolithic end-to-end approaches with a modular Compound AI System (CAIS). The core principle is the separation of driving intelligence into two processing frequencies:
Fast-Think System (~10 Hz)
- Responsibility: All safety-critical, reflexive decisions
- Content: RSS enforcement, immediate obstacle avoidance, emergency braking, lane keeping
- Compute location: Entirely on-vehicle (EyeQ chips)
- Characteristics: Deterministic, formally verifiable, low latency
- Models: Purpose-built perception networks (CNNs, transformers) + formal RSS rules
Slow-Think System (~1 Hz)
- Responsibility: Scene-level reasoning, complex semantic understanding, edge-case resolution
- Content: Interpreting ambiguous situations, understanding construction zones, navigating unusual intersections, predicting complex multi-agent interactions
- Compute location: On-vehicle + cloud (leveraging powerful VLMs at lower frequency)
- Characteristics: Stochastic, probabilistic, higher latency acceptable
- Models: Vision-Language Models (VLMs) and VLSA
VLSA: Vision Language Safety Agent
VLSA is the slow-think component. It is a vision-language model that processes deep scene semantics, described by Mobileye as functioning "like an adult accompanying a young driver in complex driving situations."
What VLSA Does
- Processes scene semantics: Understands not just what objects are present, but what they mean in context (e.g., a police officer directing traffic overrides traffic lights; a school bus with flashing lights requires a full stop)
- Provides structured semantic guidance: Outputs structured information to the planning system -- not direct vehicle control commands, but semantic annotations like "this is a construction zone requiring lane shift left" or "the pedestrian group at the crosswalk is likely to cross"
- Resolves edge cases: Handles situations that fast-think cannot resolve through reflexive rules alone: unusual road configurations, ambiguous signage, complex social driving situations
What VLSA Does NOT Do
- Does NOT sit in the safety loop: Safety remains in the fast-think system governed by RSS. If VLSA produces incorrect guidance, the fast-think system will still maintain safety.
- Does NOT control the vehicle directly: VLSA provides semantic context; the fast-think system converts this into safe trajectories.
- Does NOT need to run in real-time: Operating at ~1 Hz (or even lower for some decisions), VLSA can use larger, more capable models than would be possible at 10 Hz.
Cloud Connectivity
VLSA can partially run in the cloud, calling powerful vision-language models at low frequency. This enables:
- Use of much larger models than could fit on-vehicle
- Continuous model improvement via cloud updates without vehicle hardware changes
- In many cases, VLSA can replace human remote operators for edge-case resolution
ACI: Artificial Community Intelligence
Complementing VLSA, Mobileye developed ACI -- a self-play-based framework for training driving policy:
- Approach: Uses sensing-state simulation (abstract state representations) rather than photorealistic imagery
- Map integration: Places simulated agents (cars, pedestrians, buses) onto real road layouts from the REM Roadbook
- Behavioral diversity: Each agent can exhibit millions of possible driving behaviors, including rare and reckless ones
- Scale: Generates billions of simulated driving hours overnight
- Purpose: Trains driving policy to handle complex multi-agent interactions, understanding how one decision affects the behavior of others and how that chain of reactions feeds back to the ego vehicle
ACI is built on multi-agent reinforcement learning, where the key insight is that driving is a social activity -- each participant's actions influence others' responses.
19. Model Efficiency
The Efficiency Challenge
Mobileye faces a unique constraint in the autonomous driving industry: its perception stack must run on custom silicon consuming from a few watts to under 100 W (EyeQ6L at ~4 W, scaling up to EyeQ Ultra at <100 W), while competitors like Waymo run on data-center-class hardware consuming hundreds of watts. This constraint drives extreme optimization at every level.
Hardware-Software Co-Design
Mobileye's fundamental efficiency advantage comes from designing the EyeQ hardware and the perception software together:
- XNN accelerators are designed for the specific matrix sizes, sparsity patterns, and data flow requirements of Mobileye's neural networks
- PMA CGRA cores map exactly to the computational patterns of Mobileye's vision processing algorithms
- Memory hierarchy is tuned for the access patterns of automotive perception (sequential frame processing, feature pyramid construction, multi-scale detection)
This co-design means EyeQ chips deliver more useful perception computation per watt than general-purpose alternatives, even though their raw TOPS rating is lower.
Quantization Strategy
All inference runs at Int8 precision on XNN accelerators (a fake-quantization sketch follows this list):
- Quantization-Aware Training (QAT): Networks are trained with simulated Int8 quantization in the forward pass, allowing weights to adapt to precision loss during training
- Post-Training Quantization (PTQ): For rapid deployment, pre-trained models are calibrated on representative automotive datasets to determine optimal quantization parameters
- The EyeQ6H achieves a 0.5 ms latency on ResNet-50 at Int8, compared to 1.64 ms on the NVIDIA Jetson AGX Orin at FP16 -- despite the Orin having ~8x the raw TOPS rating
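A minimal sketch of the fake-quantization step that QAT inserts into the forward pass, using symmetric per-tensor scaling; real deployments typically use per-channel scales and a straight-through estimator for gradients, and this is not Mobileye's actual quantization tooling.

```python
import numpy as np

def fake_quantize_int8(x, scale=None):
    """Simulated symmetric Int8 quantization: values are rounded to the Int8
    grid and immediately dequantized, so a network trained with this in the
    forward pass adapts to the precision loss it will see at inference."""
    if scale is None:
        scale = np.max(np.abs(x)) / 127.0 + 1e-12   # per-tensor calibration
    q = np.clip(np.round(x / scale), -128, 127)
    return q * scale, scale

# PTQ differs mainly in where the scale comes from: it is calibrated offline
# on representative data instead of being adapted to during training.
weights = np.random.randn(64, 64).astype(np.float32)
w_q, s = fake_quantize_int8(weights)
```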
STAT: Sparse Typed Attention Transformer
Mobileye's most significant efficiency innovation for transformer-based models is STAT (Sparse Typed Attention), whose attention pattern is sketched after the list below:
- Problem: Standard transformer self-attention has O(n^2) complexity, making it prohibitively expensive for automotive perception on power-constrained hardware
- Solution: STAT organizes tokens into structured groups with defined types (e.g., "regular tokens" and "manager tokens"). Rather than every token attending to every other token, communication flows through the manager tokens, which serve as information hubs
- Example: 300 regular tokens + 32 manager (link) tokens, where regular tokens only attend to their local group and the manager tokens, dramatically reducing attention computation
- Result: 100x efficiency improvement over standard transformers without compromising accuracy
- Impact: Makes transformer-based perception practical on EyeQ6H hardware, enabling attention-based temporal fusion, cross-camera feature aggregation, and global scene reasoning at automotive frame rates
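The attention pattern can be sketched as a boolean mask over token pairs. The grouping scheme below is illustrative; with the small 300+32 example the pair-count reduction is only a few times, and the quadratic-versus-near-linear gap behind the 100x figure shows up at the much larger token counts of multi-camera, multi-frame perception.

```python
import numpy as np

def stat_attention_mask(n_regular=300, n_managers=32, group_size=None):
    """Boolean attention mask for a STAT-style pattern: regular tokens attend
    only to their local group and to manager tokens; managers attend to all.
    The grouping details here are illustrative assumptions."""
    if group_size is None:
        group_size = int(np.ceil(n_regular / n_managers))
    n = n_regular + n_managers
    mask = np.zeros((n, n), dtype=bool)
    groups = np.arange(n_regular) // group_size
    mask[:n_regular, :n_regular] = groups[:, None] == groups[None, :]  # local
    mask[:n_regular, n_regular:] = True                                # -> managers
    mask[n_regular:, :] = True                                         # managers -> all
    return mask

mask = stat_attention_mask()
dense_pairs = mask.size            # full self-attention over all tokens
sparse_pairs = mask.sum()          # pairs actually attended under STAT
print(f"attended pairs: {sparse_pairs} vs {dense_pairs} "
      f"({dense_pairs / sparse_pairs:.1f}x fewer)")
```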
Knowledge Distillation
Mobileye uses teacher-student distillation to deploy efficient models:
- A large "teacher" network is trained on cloud GPUs with full-precision computation
- A smaller "student" network, sized for specific EyeQ hardware, is trained to match the teacher's outputs
- The student benefits from the teacher's superior feature representations without requiring the teacher's compute budget at inference time
This enables Mobileye to deploy different complexity models across product tiers: a simpler student for EyeQ4-based ADAS, a more capable student for EyeQ6H-based SuperVision, without retraining from scratch.
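A standard formulation of the distillation objective is sketched below: a KL term between temperature-softened teacher and student distributions plus cross-entropy on the hard labels. The temperature and weighting are illustrative hyperparameters, not Mobileye's.

```python
import numpy as np

def softmax(z, T=1.0):
    z = z / T
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.7):
    """Teacher-student loss: softened KL divergence plus hard-label CE.
    student_logits, teacher_logits: (N, C) arrays; labels: (N,) int array."""
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kl = np.sum(p_t * (np.log(p_t + 1e-12) - np.log(p_s + 1e-12)), axis=-1)
    ce = -np.log(softmax(student_logits)[np.arange(len(labels)), labels] + 1e-12)
    return np.mean(alpha * (T ** 2) * kl + (1.0 - alpha) * ce)
```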
Automatic Ground Truth (Auto GT)
Mobileye uses "advanced sensors and offline processing" to automatically generate accurate ground truth labels for training data, reducing dependence on manual annotation. This enables:
- Faster iteration on model training
- More consistent labeling quality
- Scaling to the massive 200+ PB dataset Mobileye has collected
Multi-Task Network Sharing
Multiple perception tasks share a common backbone feature extractor, as the sketch after this list illustrates:
- Object detection, lane detection, free space estimation, semantic segmentation, and depth estimation all consume features from the same backbone
- The backbone computation (the most expensive part) is amortized across all tasks
- Task-specific heads are lightweight and add minimal overhead
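A structural sketch of the shared-backbone layout in PyTorch; the layer sizes, head definitions, and task set are placeholders chosen for brevity, not Mobileye's network design.

```python
import torch
import torch.nn as nn

class MultiTaskPerception(nn.Module):
    """Illustrative shared-backbone layout: one feature extractor amortized
    across several lightweight task heads."""

    def __init__(self, n_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(              # the expensive, shared part
            nn.Conv2d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.detection_head = nn.Conv2d(64, 6, 1)        # box params per cell
        self.segmentation_head = nn.Conv2d(64, n_classes, 1)
        self.depth_head = nn.Conv2d(64, 1, 1)
        self.freespace_head = nn.Conv2d(64, 1, 1)

    def forward(self, image):
        feats = self.backbone(image)                 # computed once per frame
        return {
            "detection": self.detection_head(feats),
            "segmentation": self.segmentation_head(feats),
            "depth": self.depth_head(feats),
            "free_space": self.freespace_head(feats),
        }
```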
Benchmark Results: EyeQ6H vs. NVIDIA Jetson AGX Orin
| Benchmark | EyeQ6H (34 TOPS, Int8) | Jetson AGX Orin (275 TOPS, FP16) |
|---|---|---|
| ResNet-50 single-stream latency | 0.5 ms | 1.64 ms |
| EfficientViT-B1 (224x224, 9.1M params) | 0.564 ms | 1.48 ms |
| EfficientViT-B2 (224x224, 24M params) | 0.932 ms | 2.63 ms |
| Pixel-labeling NN throughput | >1,000 FPS | Not published |
Despite having roughly one-eighth the raw TOPS, EyeQ6H consistently achieves 2-3x lower latency on representative perception workloads.
20. Calibration
The Calibration Challenge
With 11-13 cameras mounted on a vehicle, precise calibration is essential: even a 0.1-degree error in camera pitch can translate into multi-meter range errors at 100 m when distance is inferred from the road plane. Mobileye addresses calibration at multiple levels.
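To make that sensitivity concrete, the sketch below works through the range error introduced by a 0.1-degree pitch miscalibration when distance is read off the ground plane; the 1.3 m camera height is an assumed, illustrative value.

```python
import math

def ground_plane_range(camera_height_m, angle_below_horizon_rad):
    """Range to a point on a flat road, inferred from the angle below the
    horizon at which it appears (the classic monocular range cue)."""
    return camera_height_m / math.tan(angle_below_horizon_rad)

# A point 100 m ahead seen from a 1.3 m camera sits ~0.74 deg below the horizon.
h = 1.3
true_angle = math.atan(h / 100.0)
for err_deg in (0.0, 0.1):
    est = ground_plane_range(h, true_angle - math.radians(err_deg))
    print(f"pitch error {err_deg:.1f} deg -> estimated range {est:.1f} m")
# 0.0 deg -> 100.0 m ; 0.1 deg -> roughly 115 m, i.e. a multi-meter range error.
```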
Factory Calibration
Initial camera calibration is performed during vehicle manufacture or ADAS installation:
- Intrinsic parameters: Focal length, principal point, distortion coefficients -- characterize the camera optics
- Extrinsic parameters: Position and orientation of each camera relative to the vehicle body -- characterize the mounting
Automatic Self-Calibration
Mobileye's systems include automatic calibration that runs during normal driving, eliminating the need for calibration targets or special procedures; a vanishing-point sketch follows the list below:
- Procedure: Drive the vehicle normally for 5-15 minutes
- Method: The system uses vanishing points (from road markings, building edges, and other structured elements in the scene) and road plane estimation to determine camera orientation relative to the road surface
- Parameters recovered: Camera pitch (tilt), yaw (rotation), and roll relative to the road plane; camera height above road surface
- Online refinement: Calibration is continuously refined during driving, compensating for:
- Suspension changes due to load
- Tire pressure variations
- Temperature-induced mechanical drift
- Windshield camera re-mounting after service
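A simplified sketch of the vanishing-point step, assuming straight, level lane markings, zero camera roll, and known intrinsics; the pixel values in the example are illustrative.

```python
import math

def pitch_yaw_from_vanishing_point(vp_x, vp_y, cx, cy, fx, fy):
    """Camera pitch and yaw relative to a straight, level road, from the
    vanishing point of the lane markings. Intrinsics (fx, fy, cx, cy) come
    from factory calibration; this is a simplified pinhole-model estimate."""
    yaw = math.atan((vp_x - cx) / fx)     # positive = rotated toward +x
    pitch = math.atan((vp_y - cy) / fy)   # positive = camera pointing down
    return pitch, yaw

# Illustrative example: a 1920x1080 image, principal point (960, 540),
# fx = fy = 2000 px. If straight lane markings converge at pixel (980, 560):
# yaw  = atan(20/2000) ~ 0.57 deg, pitch = atan(20/2000) ~ 0.57 deg
```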
Fleet-Scale Calibration
With 150+ million Mobileye-equipped vehicles, calibration operates at fleet scale:
- REM data quality depends on accurate calibration across millions of harvesting vehicles
- The cloud aggregation pipeline statistically detects and compensates for poorly calibrated vehicles
- Consistent landmark observations from many vehicles create a calibration cross-check: if one vehicle's observations are systematically offset, its calibration is flagged
Multi-Camera Extrinsic Calibration
For multi-camera systems (SuperVision, Drive), cross-camera calibration is critical:
- Overlapping fields of view between adjacent cameras provide mutual calibration constraints
- Feature correspondences in overlapping regions constrain relative camera poses
- The vehicle's motion (known from odometry) provides additional constraints through temporal structure-from-motion
Calibration for REM
The REM harvesting pipeline includes calibration-aware processing:
- Landmark positions are corrected for estimated calibration errors before transmission
- The cloud pipeline further aligns observations from multiple vehicles, effectively averaging out individual calibration errors
- This fleet-scale statistical averaging produces map accuracy (5-10 cm) that exceeds any individual vehicle's calibration precision
21. Key Patents
Mobileye Vision Technologies Ltd. and Mobileye Technologies Ltd. collectively hold thousands of patents spanning the full perception stack. Below are representative patents across key perception areas.
Object Detection and Obstacle Avoidance
| Patent | Title | Year | Key Innovation |
|---|---|---|---|
| US7113867B1 | System and method for detecting obstacles to vehicle motion and determining time to contact therewith using sequences of images | 2006 | Foundational monocular obstacle detection using image sequences; time-to-contact computation |
| US8164628B2 | Estimating distance to an object using a sequence of images recorded by a monocular camera | 2006 | Monocular distance estimation from temporal image sequences |
| US9205776B2 | Vehicle vision system using kinematic model of vehicle motion | — | Kinematic model integration with vision for motion prediction |
Free Space and Drivable Area
| Patent | Title | Key Innovation |
|---|---|---|
| Multiple patents | Systems for mapping road segment free spaces for autonomous vehicle navigation | Determine lateral free space regions adjacent to road segments; update navigation models across fleet |
| Multiple patents | Determining drivable free-space for autonomous vehicles | Camera-based boundary detection with classification (curb, vehicle, barrier); traversable vs. non-traversable boundary types |
| Multiple patents | Identifying probable road edge locations under snow cover | Free space estimation when lane markings are obscured |
Lane Detection and Mapping
| Patent | Title | Year | Key Innovation |
|---|---|---|---|
| US9665100B2 | Sparse map for autonomous vehicle navigation | 2016 | Sparse map structure with lane-level geometry; polynomial representations of road features |
| US9946260 | Sparse map autonomous vehicle navigation | 2016 | Navigation using sparse map with data density < 1 MB/km |
| US11086334 | Crowdsourcing a sparse map for autonomous vehicle navigation | 2017 | Fleet-scale crowdsourced map data collection, alignment, road surface information, and vehicle localization using lane measurements |
| US20180024568A1 | Crowdsourcing a sparse map for autonomous vehicle navigation | 2017 | Landmark-based localization, 3D spline curve trajectories, road signature profiles |
Depth Estimation and Semantic Segmentation
| Patent | Title | Key Innovation |
|---|---|---|
| US10019657B2 | Joint depth estimation and semantic segmentation from a single image | Simultaneously estimates depth and semantic labels using coarse-to-fine pipeline with global templates and local segment analysis |
| US20160350930A1 | Joint depth estimation and semantic segmentation from a single image | Machine learning approach merging global and local depth/semantic layouts |
Traffic Sign Recognition
| Patent | Title | Year |
|---|---|---|
| US20080137908A1 | Detecting and recognizing traffic signs | 2008 |
Navigation and Mapping
| Patent | Title | Key Innovation |
|---|---|---|
| CN112384760A | System and method for autonomous vehicle navigation | Lane marking mapping, directional arrow mapping, selective road information harvesting based on data quality, free space mapping, traffic light mapping |
Key Patent Statistics
- Amnon Shashua: 94+ patents in computer vision, ADAS, and autonomous driving
- Mobileye Vision Technologies Ltd.: Hundreds of assigned patents
- Core areas: Monocular and multi-camera depth estimation, crowdsourced HD map construction, formal safety envelope computation, low-power CNN accelerator architectures, camera-based free-space estimation, real-time semantic segmentation, traffic sign recognition, pedestrian detection
- European Inventor Award: Amnon Shashua was a finalist (2019, European Patent Office) for his patent portfolio
22. Key Publications
Foundational Papers by Amnon Shashua
Multi-View Geometry and Tensor Theory
| Paper | Venue | Year | Significance |
|---|---|---|---|
| Projective Structure from Uncalibrated Images: Structure from Motion and Recognition | IEEE TPAMI, Vol. 16(8), pp. 778-790 | 1994 | Foundational work on recovering 3D structure from uncalibrated cameras |
| Trilinearity of Three Perspective Views and Its Associated Tensor | ICCV, Boston | 1995 | Introduced the trifocal tensor -- a fundamental construct relating three views of a scene (with M. Werman) |
| Relative Affine Structure: Canonical Model for 3D from 2D Geometry and Applications | IEEE TPAMI, 18(9):873-883 | 1996 | Canonical representations for multi-view reconstruction (with N. Navab) |
| Duality of Multi-Point and Multi-Frame Geometry: Fundamental Shape Matrices and Tensors | ECCV | 1996 | Dual theory of multi-view geometry (with D. Weinshall and M. Werman) |
| Trilinear Tensor: The Fundamental Construct of Multiple-view Geometry and its Applications | AFPAC Workshop, Kiel | 1997 | Comprehensive treatment of the trilinear tensor |
| Tensor Embedding of the Fundamental Matrix | SMILE Workshop | 1998 | Embedding the fundamental matrix in the trifocal tensor (with S. Avidan) |
Photometry and Visual Recognition
| Paper | Venue | Year | Significance |
|---|---|---|---|
| Geometry and Photometry in 3D Visual Recognition | MIT PhD Dissertation | 1992 | Foundational thesis on separating geometric and photometric factors in visual recognition |
| On Photometric Issues in 3D Visual Recognition from a Single 2D Image | IJCV, Vol. 21, pp. 99-122 | 1997 | Analysis of illumination effects on 3D object recognition |
These early papers established Shashua as a leading figure in multi-view geometry -- the mathematical framework for understanding how 3D scenes relate to their 2D projections. This theoretical foundation directly underlies Mobileye's camera-based 3D perception: structure from motion, stereo depth estimation, and multi-camera calibration are all practical applications of Shashua's multi-view geometry research.
RSS and Safety Papers
| Paper | Authors | Year | Venue | Significance |
|---|---|---|---|---|
| On a Formal Model of Safe and Scalable Self-driving Cars | Shalev-Shwartz, Shammah, Shashua | 2017 | arXiv:1708.06374 | Introduces RSS; foundational formal safety model |
| Safe, Multi-Agent, Reinforcement Learning for Autonomous Driving | Shalev-Shwartz, Shammah, Shashua | 2016 | arXiv:1610.03295 | Multi-agent RL framework; deep RL for driving strategy |
| Responsibility-Sensitive Safety (extended) | Mobileye | 2022 | arXiv:2206.03418 | Comprehensive technical specification of RSS |
| Implementing the RSS Model on NHTSA Pre-Crash Scenarios | Mobileye | — | Technical report | RSS validated across all 37 NHTSA scenario types |
| A Safety Architecture for Self-Driving Systems | Mobileye | — | Whitepaper | Full SDoV safety architecture with PGF framework |
Perception Papers by Mobileye Researchers
| Paper | Authors | Year | Venue | Significance |
|---|---|---|---|---|
| 3D-LaneNet: End-to-End 3D Multiple Lane Detection | Garnett, Cohen, Pe'er, Lahav, Levi | 2019 | ICCV | First end-to-end 3D lane detection from single image; intra-network IPM; anchor-based lane representation |
| 3D-LaneNet+: Anchor Free Lane Detection using a Semi-Local Representation | Efrat, Bluvstein, Oron, Levi, Garnett, Shlomo | 2020 | ML4AD Workshop | Improved 3D lane detection without anchor dependency |
Machine Learning Theory
| Paper | Authors | Year | Significance |
|---|---|---|---|
| Understanding Machine Learning: From Theory to Algorithms | Shalev-Shwartz, Ben-David | 2014 | Authoritative textbook on ML theory; foundational reference for ML practitioners |
Researcher Profiles
- Prof. Amnon Shashua (CEO): 160+ publications, 94+ patents, and more than 30,000 citations. Research spans multi-view geometry, photometric stereo, illumination cone geometry, visual recognition, and deep learning for autonomous driving.
- Prof. Shai Shalev-Shwartz (CTO): Machine learning theorist; co-authored foundational textbook on ML theory. Leads ADAS/AV/RSS/REM development. Research includes online learning, reinforcement learning for autonomous driving, and formal safety models.
- Noa Garnett: Lead researcher in 3D lane detection and perception.
- Dan Levi: Senior researcher in depth estimation and perception architectures.
Open-Source Projects
- intel/ad-rss-lib: C++ library implementing the RSS model for autonomous vehicles; includes situation analysis, response computation, and integration examples
23. Perception Evolution: EyeQ1 through EyeQ Ultra
EyeQ1 (2008) -- Birth of Camera Perception
- Process: 180 nm
- Silicon: Sampled in 2004; commercial launch 2008
- Perception capabilities:
- Lane Departure Warning (LDW)
- Forward Collision Warning (FCW)
- Traffic Sign Recognition (TSR)
- Adaptive Headlight Control (AHC)
- Architecture: Single camera processing; hand-crafted vision algorithms (Haar cascades, HOG features, edge detection); minimal ML
- Significance: Proved that a single camera + custom silicon could deliver production-grade ADAS. This was radical in an era when ADAS was synonymous with expensive radar systems.
EyeQ2 (2010) -- Pedestrian Detection
- Performance: 6x EyeQ1
- Key perception advance: Pedestrian detection -- the first EyeQ to reliably detect people in camera images
- Enabled features: Automatic Emergency Braking (AEB) for pedestrian scenarios; urban operation
- Architecture: Early pattern recognition classifiers; beginning of ML-based approaches
- Significance: Opened the door to Euro NCAP pedestrian AEB testing and scoring, driving massive OEM adoption
EyeQ3 (2014) -- Surround Perception
- Performance: 6x EyeQ2 (~0.25 TOPS)
- Key perception advance: Multi-camera input -- the first EyeQ to accept inputs from surround-view camera systems
- Enabled features: Safety "cocoon" around the vehicle; L2 capability
- Architecture: Multiple camera streams processed; more sophisticated classifiers
- Significance: Transition from single-camera ADAS to surround perception; enabled early highway assist features
EyeQ4 (2018) -- Deep Learning at Scale
- Process: 28 nm FD-SOI (STMicro)
- Performance: 2.5 TOPS at ~4.5 W
- Key perception advance: Deep learning inference -- first EyeQ with dedicated CNN acceleration
- Architecture: 4 MIPS CPU cores + proprietary vision processing accelerators; fuses up to 8 sensors
- Enabled features: Multi-sensor fusion; higher-resolution processing; improved detection accuracy through CNNs
- Significance: The mass-market workhorse -- tens of millions shipped. Brought deep learning from research to production ADAS at automotive scale and power budget.
EyeQ5 (2021) -- Autonomous-Grade Perception
- Process: 7 nm FinFET (TSMC)
- Performance: 24 TOPS peak; <5 W typical
- Key perception advance: Full autonomous perception pipeline -- sufficient compute for L2++ (SuperVision) with two chips
- Architecture: 8 CPU cores, 18 vision processor cores, 4 accelerator classes (XNN, PMA, VMP, MPC); fuses up to 20 sensors
- Enabled features: SuperVision hands-free highway driving; ViDAR; full semantic segmentation; 3D lane detection; dense depth estimation
- Perception benchmark: 91 FPS on pixel-labeling neural networks
- Significance: Enabled SuperVision deployment at production scale (Zeekr). Proved that camera-only hands-free driving was achievable on <10W silicon.
EyeQ6L (2024) -- Cost-Optimized Next-Gen ADAS
- Process: 7 nm
- Performance: 5 DL TOPS (int8); 4.5x EyeQ4M compute at half the die area
- Key perception advance: Higher-resolution front camera processing (8 MP with 120-degree lateral FOV -- 20-degree improvement over EyeQ4M); improved pixel segmentation with more than double the point density
- Enabled features: Enhanced L1-L2 ADAS with better detection range and accuracy
- Significance: Cost-effective replacement for EyeQ4 in new designs; nominated volumes grew 3.5x in 2025 vs. 2024
EyeQ6H (2025) -- High-Performance Perception
- Process: 7 nm
- Performance: 34 DL TOPS (int8); 3x EyeQ5H compute at 25% more power
- Key perception advance: 1,000+ FPS on pixel-labeling neural networks (10x improvement over EyeQ5); transformer model support; integrated ISP, GPU, video encoder
- Architecture: 8 cores / 32 threads; built-in ISP eliminates external ISP chip; GPU enables bird's-eye-view display and parking visualization
- Enabled features: Surround ADAS on single chip; SuperVision with 2x EyeQ6H; Chauffeur with 3x; Drive with 4x; driver monitoring; video recording; automated parking
- Significance: The universal building block -- a single chip type scales from Surround ADAS to L4 through chip count. EyeQ Kit SDK opens development to third parties.
EyeQ Ultra (2025+) -- Full L4 on a Single Chip
- Process: 5 nm (TSMC)
- Performance: 176 TOPS; <100 W
- Key perception advance: Single-chip processing of both True Redundancy subsystems -- camera-only + radar/LiDAR perception, HD map processing, and driving policy all on one chip
- Architecture: 12 RISC-V cores (dual-threaded), 64 accelerator cores (16 XNN CNN accelerators, 8 PMA CGRA, VMP, MPC), Arm GPU (256 GFLOPS), VPU, ISP, H.264/H.265 encoding
- Enabled features: Full L4 autonomous driving for consumer vehicles (not just robotaxis); complete perception-to-actuation pipeline on a single SoC
- Significance: Brings L4 autonomy cost down to consumer vehicle economics. The RISC-V ISA transition signals Mobileye's long-term architectural direction away from MIPS.
Perception Capability Progression Summary
| Generation | Year | TOPS | Key Perception Milestone |
|---|---|---|---|
| EyeQ1 | 2008 | — | Lane detection, FCW from single camera |
| EyeQ2 | 2010 | — | Pedestrian detection and AEB |
| EyeQ3 | 2014 | ~0.25 | Multi-camera surround perception |
| EyeQ4 | 2018 | 2.5 | Deep learning CNN inference at scale |
| EyeQ5 | 2021 | 24 | Full autonomous perception pipeline; ViDAR; SuperVision |
| EyeQ6L | 2024 | 5 | 2x segmentation density; 8MP wide-FOV camera |
| EyeQ6H | 2025 | 34 | 1,000+ FPS segmentation; transformer support; Surround ADAS to L4 |
| EyeQ Ultra | 2025+ | 176 | Single-chip L4; both True Redundancy subsystems |
The roughly two-decade progression from EyeQ1 to EyeQ Ultra represents an increase of about 700x in deep learning compute relative to EyeQ3 (the first generation with a quoted TOPS figure), moving from hand-crafted vision algorithms detecting lane markings at 180 nm to transformer-based 3D scene understanding at 5 nm -- all while maintaining the automotive power and cost constraints that define Mobileye's approach.
Appendix: Architecture Diagrams
True Redundancy Data Flow
CAMERA ARRAY (11-13 cameras)
|
+--------------------+
| Camera Perception |
| (EyeQ XNN/PMA) |
| - Object detection |
| - Lane detection |
| - Free space |
| - Depth/ViDAR |
| - Segmentation |
| - TSR/TFL |
+--------------------+
|
Camera World Model
|
+--------------------+
| RSS Safety Check |
| (Camera channel) |
+--------------------+
|
v
+--------------------+
| PLANNING LAYER |<---- REM Roadbook
| - Trajectory gen. |<---- VLSA guidance
| - Comfort opt. |
| - RSS envelope |
+--------------------+
^
|
+--------------------+
| RSS Safety Check |
| (Radar/LiDAR ch.) |
+--------------------+
|
Radar/LiDAR World Model
|
+--------------------+
| Radar/LiDAR |
| Perception |
| (EyeQ VMP/XNN) |
| - Point cloud det. |
| - Object detect. |
| - Free space |
| - Tracking |
+--------------------+
|
IMAGING RADAR (BSR/BSRC) + LiDAR

EyeQ Accelerator Workload Map
+------------------------------------------------------------------+
| EyeQ SoC |
| |
| +----------+ +----------+ +----------+ +----------+ |
| | XNN | | PMA | | VMP | | MPC | |
| | CNN/DL | | CGRA | | SIMD/VLIW| | Barrel- | |
| | Accel. | | Dataflow | | Vector | | threaded | |
| +----------+ +----------+ +----------+ +----------+ |
| | | | | |
| Object det. Feature Radar FFT Tracking |
| Segmentation extraction Beamforming State mgmt |
| Depth est. Image proc. Signal proc. Control flow |
| Lane det. DNN layers Ultrasonic Data marshal |
| Sign class. Sensor proc. processing Post-proc. |
| |
| +----------+ +----------+ +----------+ +----------+ |
| | CPU | | ISP | | GPU | | Video | |
| | RISC-V | | Image | | Arm GPU | | H.264/5 | |
| | General | | Signal | | Render | | Encode | |
| +----------+ +----------+ +----------+ +----------+ |
| | | | | |
| RSS compute Debayer BEV display Recording |
| Planning HDR tone map Parking vis. DMS video |
| Scheduling Distortion Driver mon. OTA upload |
| correction |
+------------------------------------------------------------------+

Sources: Mobileye corporate technology pages, CES 2025/2026 presentations by Prof. Amnon Shashua, Mobileye AI Day presentations, EyeQ chip specification pages and benchmark data, Mobileye blog posts on True Redundancy/REM/imaging radar/ISA/pedestrian detection, IEEE/CVF publications (3D-LaneNet ICCV 2019), arXiv papers (RSS 1708.06374, 2206.03418; Safe Multi-Agent RL 1610.03295), Google Patents (US7113867B1, US8164628B2, US9665100B2, US9946260, US10019657B2, US11086334, US20150206015A1, US20160350930A1, US20180024568A1, US20190286153A1, CN112384760A), WikiChip EyeQ5 specifications, EE Times/Electronic Design technical articles, AWS Mobileye case studies, Amnon Shashua academic publications database (Hebrew University/Stanford).