Tesla FSD Perception Stack: Comprehensive Technical Deep Dive
Last Updated: March 2026
Table of Contents
- Vision-Only Philosophy
- Camera Configuration
- Image Signal Processing & Raw Data Pipeline
- Calibration
- HydraNet / Backbone Architecture
- BEV (Bird's Eye View) Transformer
- Occupancy Networks
- Temporal Module
- Object Detection & 3D Bounding Boxes
- Depth Estimation
- Lane Detection & Road Geometry
- Traffic Light & Sign Detection
- Semantic Segmentation
- Object Tracking
- End-to-End Architecture (v12+)
- Neural Network Planner Integration
- Auto-Labeling Pipeline
- Data Engine
- Model Compilation & Inference
- Key Architectural Evolution (HW2 to AI5)
- Model Sizes & Performance Numbers
- Patents
- Published Talks & Presentations
1. Vision-Only Philosophy
The Core Thesis
Tesla is the only major autonomous vehicle company pursuing a pure camera-only perception approach -- no LiDAR, no radar (removed 2021--2023), and no ultrasonic sensors (removed 2022--2023). The rationale, articulated most clearly by Andrej Karpathy (Tesla's former Sr. Director of AI, 2017--2022) and reinforced by Elon Musk, rests on a first-principles argument:
"Humans drive with vision alone. A sufficiently capable neural network, given the same visual input as a human driver, should be able to match or exceed human driving performance."
This argument was formalized by Karpathy in his CVPR 2021 keynote, where he demonstrated that Tesla's new vision-only approach for Autopilot had higher precision and recall than the prior sensor-fusion approach that combined cameras with radar.
Technical Justification
1. Information Density
Cameras capture far richer information than any other sensor modality:
- Texture, color, and contextual information that point-cloud systems cannot detect (e.g., text on road signs, traffic light colors, brake lights, turn signals)
- At 5.4 MP per camera (HW4), the 8 cameras together deliver roughly 43 million pixels of information per timestep
- LiDAR point clouds are sparse by comparison -- typically 100K--300K points per scan, with no color or texture information
2. Scalability
Karpathy's central argument at CVPR 2021: camera-based perception scales in a way that LiDAR-based systems cannot.
- HD LiDAR maps are expensive to build and maintain, and become stale quickly as the road environment changes
- Vision-based perception requires no pre-mapped infrastructure -- it generalizes to any road on Earth
- Tesla's fleet of 9+ million vehicles acts as a distributed data-collection platform, generating the equivalent of 500 years of driving data every day -- impossible to replicate with a LiDAR-equipped test fleet
3. Cost Efficiency
A camera module costs approximately $10--20; a high-end automotive LiDAR costs $500--10,000+. Tesla's camera suite for the entire vehicle costs a fraction of a single LiDAR unit, enabling deployment at consumer price points.
4. Sensor Contention Elimination
A recurring Musk argument: when LiDAR/radar and cameras disagree, the system must arbitrate. This "sensor contention" introduces a fundamental ambiguity. With cameras only, the neural network receives a single, unified modality and learns to interpret it end-to-end.
Acknowledged Limitations
| Limitation | Description | Mitigation |
|---|---|---|
| Low-light / Night | Cameras degrade in darkness; LiDAR is light-invariant | 120 dB HDR sensors (IMX490); IR-capable optics; 12-bit raw data with 4,096 brightness levels |
| Direct sun / Glare | Saturation and lens flare | Multi-exposure HDR; ISP tone mapping; sun visor occlusion handled via temporal persistence |
| Heavy rain / Snow / Fog | Reduced visibility | Temporal aggregation (remembers environment from before degradation); occupancy persistence from prior frames |
| Depth accuracy at range | Monocular depth estimation degrades with distance | Narrow FOV telephoto camera (250 m range); multi-frame triangulation; stereo from overlapping camera views |
| No active ranging | Cannot measure distance to featureless surfaces (e.g., flat walls) | SDF-based occupancy prediction learns surface geometry from training data |
The "Bitter Lesson" Alignment
Tesla's research philosophy explicitly follows Rich Sutton's "Bitter Lesson": general methods that leverage computation and data at scale consistently outperform hand-engineered domain-specific approaches. Rather than engineering specialized sensor pipelines for each modality, Tesla bets that a sufficiently large neural network, trained on sufficiently large data, will learn to extract all necessary information from cameras.
2. Camera Configuration
HW3 Camera System (2019--2023)
Eight external cameras plus one cabin-facing camera, all using the ON Semiconductor (Aptina) AR0136AT CMOS image sensor.
| Camera | Position | FOV | Max Range | Resolution | Frame Rate |
|---|---|---|---|---|---|
| Narrow Forward | Windshield, center top | ~35 deg | 250 m | 1280 x 960 (1.2 MP) | 36 fps |
| Main Forward | Windshield, center top | ~50 deg | 150 m | 1280 x 960 (1.2 MP) | 36 fps |
| Wide Forward | Windshield, center top | ~150 deg (fisheye) | 60 m | 1280 x 960 (1.2 MP) | 36 fps |
| Left B-Pillar | Driver-side B-pillar | ~90 deg | 80 m | 1280 x 960 (1.2 MP) | 36 fps |
| Right B-Pillar | Passenger-side B-pillar | ~90 deg | 80 m | 1280 x 960 (1.2 MP) | 36 fps |
| Left Repeater | Left front fender (turn signal housing) | ~90 deg | 80 m | 1280 x 960 (1.2 MP) | 36 fps |
| Right Repeater | Right front fender (turn signal housing) | ~90 deg | 80 m | 1280 x 960 (1.2 MP) | 36 fps |
| Rear | Above license plate | ~130 deg | 50 m | 1280 x 960 (1.2 MP) | 36 fps |
| Cabin | Above rearview mirror | -- | -- | -- | IR-capable |
AR0136AT Sensor Specifications:
- Pixel size: 3.75 um
- 12-bit HDR output
- RCCC (Red-Clear-Clear-Clear) color filter array on most cameras
- Rolling shutter
Three Forward Cameras are co-located behind the windshield in a tri-focal cluster:
- The narrow camera provides long-range perception (up to 250 m) for highway-speed object detection
- The main camera covers the primary driving field of view
- The wide camera captures the full intersection scene, nearby vehicles, and pedestrians entering from the side
HW4 (AI4) Camera System (2023--present)
| Camera | Position | FOV | Resolution | Sensor |
|---|---|---|---|---|
| Main Forward | Windshield, center | Wide (~120 deg) | 2896 x 1876 (5.4 MP) | Sony IMX963 (custom IMX490 variant) |
| Narrow Forward | Windshield, center | Telephoto | 2896 x 1876 (5.4 MP) | Sony IMX963 |
| Wide Forward | Windshield, center (dummy/inactive on some models) | Fisheye | -- | Position retained for HW3 compatibility |
| Left B-Pillar | Driver-side B-pillar | ~90 deg (side + forward) | 2896 x 1876 (5.4 MP) | Sony IMX490 |
| Right B-Pillar | Passenger-side B-pillar | ~90 deg (side + forward) | 2896 x 1876 (5.4 MP) | Sony IMX490 |
| Left C-Pillar | Driver-side C-pillar (new position) | ~90 deg (side + rearward) | 2896 x 1876 (5.4 MP) | Sony IMX490 |
| Right C-Pillar | Passenger-side C-pillar (new position) | ~90 deg (side + rearward) | 2896 x 1876 (5.4 MP) | Sony IMX490 |
| Rear | Above license plate | Wide fisheye | 2896 x 1876 (5.4 MP) | Sony IMX490 |
| Cabin | Above rearview mirror | Wide | -- | IR + visible light |
Sony IMX490/IMX963 Sensor Specifications:
- Resolution: 5.4 MP (2896 x 1876)
- Pixel size: 3.0 um
- HDR: 120 dB dynamic range (on-sensor HDR via sub-pixel architecture)
- LED flicker mitigation (critical for reading electronic road signs and traffic lights at high frame rates)
- 12-bit raw output with 4,096 brightness levels per pixel
- Red-tinted lens coatings for improved HDR and low-light performance
Key HW3-to-HW4 Camera Changes:
| Feature | HW3 | HW4 |
|---|---|---|
| Resolution per camera | 1.2 MP | 5.4 MP (4.5x) |
| Total pixels per frame (8 cams) | ~9.8 MP | ~43.2 MP |
| Sensor type | ON Semi AR0136AT | Sony IMX490 / IMX963 |
| Dynamic range | ~80 dB | ~120 dB |
| Color filter array | RCCC | RCCC (custom variant) |
| Flicker mitigation | No | Yes |
| Fender repeater cameras | Yes (front fender) | No (replaced by C-pillar) |
| C-pillar cameras | No | Yes (new position) |
| Forward camera count | 3 active | 2 active + 1 dummy |
| Camera input capacity | 8 | Up to 13 (future-proofed) |
| Front bumper cameras | No | Added in 2025 lineup update |
Camera Placement Rationale
The C-pillar cameras (HW4) replace the fender-mounted repeaters (HW3), providing better rearward-lateral coverage. The B-pillar cameras were retained but now look more forward-and-side, while C-pillar cameras handle side-and-rearward views. This arrangement eliminates blind spots in the rear quarter and provides better overlap between camera fields of view for stereo-like depth estimation.
3. Image Signal Processing & Raw Data Pipeline
Traditional ISP (HW3 On-Chip)
The HW3 FSD chip contains a dedicated Image Signal Processor (ISP) with the following specifications:
| Parameter | Specification |
|---|---|
| Pipeline depth | 24-bit internal processing |
| Throughput | Up to 1 billion pixels per second |
| Camera serial interface (CSI) | Up to 2.5 billion pixels per second input capacity |
| Tone mapping | Yes -- exposes details in shadows and bright spots |
| Noise reduction | Yes -- spatial and temporal |
| HDR processing | Multi-exposure merge |
| Video encoder | H.265 (HEVC) for dashcam, cloud clip logging |
The traditional ISP pipeline performs: demosaicing, white balance, color correction, noise reduction, sharpening, tone mapping, dynamic range compression, lens distortion correction, and compression (e.g., JPEG/H.265).
The ISP Bypass: Tesla's Raw Vision Approach
Tesla has taken a radical departure from conventional image processing. Rather than feeding ISP-processed images to the neural network, Tesla bypasses the ISP and feeds raw sensor data directly into the neural network.
Why bypass the ISP?
Traditional ISP processing compresses 12-bit raw data (4,096 brightness levels) down to 8-bit RGB (256 levels). This compression discards information that is critical for autonomous driving:
| Data Stage | Bit Depth | Brightness Levels | Information |
|---|---|---|---|
| Raw sensor output | 12-bit | 4,096 per pixel | Full photon count; maximum dynamic range |
| After ISP (standard) | 8-bit | 256 per pixel | Compressed; tuned for human viewing |
| Neural network receives | 12-bit raw | 4,096 per pixel | All information preserved |
RCCC Color Filter Array:
Most Tesla cameras use a non-standard RCCC (Red-Clear-Clear-Clear) color filter array instead of the conventional RGGB Bayer pattern:
- Three "clear" (unfiltered) sub-pixels capture raw photon counts across the full visible spectrum, maximizing light sensitivity
- One red-filtered sub-pixel is sufficient for detecting red traffic lights, brake lights, and emergency vehicle colors
- The RCCC configuration prioritizes luminance resolution and low-light sensitivity over color accuracy -- because the neural network does not need color-accurate images for driving; it needs maximum information about the scene geometry, edges, and motion
Raw Data Pipeline:
Photons -> RCCC Sensor (12-bit, 4096 levels)
-> No demosaicing
-> No color correction
-> No dynamic range compression
-> No JPEG encoding
-> Raw 12-bit data -> Neural Network input

The neural network learns to interpret the raw RCCC mosaic pattern directly. This means:
- The network implicitly learns its own "demosaicing"
- The network decides what is "relevant" in the image, not a hand-tuned ISP
- In extreme lighting transitions (tunnels, sunrise/sunset), the raw data preserves far more recoverable information than an ISP-processed image would
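As a concrete illustration of what "raw in, no ISP" means, the sketch below packs a 12-bit RCCC mosaic into a 4-channel tensor and normalizes the full 4,096-level range -- no demosaicing, no tone mapping. The 2x2 sub-pixel layout and the channel ordering are assumptions for illustration; Tesla has not published its actual input format.

```python
import numpy as np

def pack_rccc_mosaic(raw: np.ndarray) -> np.ndarray:
    """Pack a 12-bit RCCC mosaic (H, W) into a 4-channel, half-resolution
    tensor without demosaicing. The 2x2 layout (R at top-left) and the
    channel order are assumptions for illustration."""
    r  = raw[0::2, 0::2]   # red-filtered sub-pixel
    c1 = raw[0::2, 1::2]   # clear (unfiltered) sub-pixels
    c2 = raw[1::2, 0::2]
    c3 = raw[1::2, 1::2]
    packed = np.stack([r, c1, c2, c3]).astype(np.float32)
    return packed / 4095.0  # preserve the full 4,096-level range in [0, 1]

frame = np.random.randint(0, 4096, size=(960, 1280), dtype=np.uint16)
net_input = pack_rccc_mosaic(frame)  # (4, 480, 640) -- fed to the network as-is
```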
Practical Impact:
The IMX490 sensor's 120 dB HDR combined with 12-bit raw output allows the neural network to perceive detail simultaneously in deep shadows and bright highlights. A standard ISP would compress this into a range optimized for human viewing on an 8-bit display -- losing precisely the information needed for safe autonomous driving in challenging lighting.
4. Calibration
The Calibration Problem
Tesla ships 9+ million vehicles with cameras that are:
- Installed with manufacturing tolerances (slight position/angle variations)
- Subject to shift over time from vibrations, temperature cycling, and minor impacts
- Viewing through windshields with varying optical properties
For the neural network to produce consistent results across the entire fleet, all cameras must present a standardized view of the world.
Online Calibration Neural Network
Tesla solves this with a calibration neural network that runs as the first stage of the perception pipeline:
- Camera Rectification Transform: Each of the 8 cameras is warped by a learned transformation into a synthetic virtual camera with standardized intrinsic and extrinsic parameters
- Fleet Consistency: After rectification, the image from any given camera position on any Tesla in the fleet should look the same, regardless of manufacturing variations
- Continuous Update: The calibration is not a one-time process -- it updates continuously as the vehicle drives, compensating for any drift in camera alignment
Calibration Process for New Vehicles:
| Stage | Method | Duration |
|---|---|---|
| Factory calibration (2025+) | Automated: vehicle drives autonomously ~2 km on factory grounds | Minutes |
| Post-delivery (legacy) | Manual driving on roads with clear lane markings | 20--25 miles typical; up to 100 miles maximum |
| Ongoing | Continuous background refinement while driving | Perpetual |
Technical Implementation:
The rectification transform converts all raw images into a common virtual camera coordinate system before they enter the backbone network. This is a geometric operation (homography + lens distortion correction) whose parameters are inferred by the calibration neural network from visual features (vanishing points, lane line parallelism, horizon position).
After rectification, the images are passed to the RegNet backbone. This two-stage approach (calibrate first, then extract features) ensures that the learned features in the backbone are invariant to camera mounting variations.
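A minimal sketch of the rectification step, assuming a rotation-only viewpoint correction: once per-camera intrinsics, distortion, and orientation have been estimated (by the calibration network, in Tesla's stack), the warp into a fleet-standard virtual camera reduces to an undistortion followed by a homography. All parameters here are placeholders; Tesla's actual transform is learned rather than hand-coded like this.

```python
import numpy as np
import cv2

def rectify_to_virtual_camera(img, K_actual, dist, R, K_virtual):
    """Warp one camera image into the fleet-standard virtual camera.
    K_actual, dist, R are per-vehicle estimates; K_virtual is the shared
    fleet-wide intrinsic matrix."""
    undistorted = cv2.undistort(img, K_actual, dist)
    # A rotation-only viewpoint change is a homography: H = K_v R K_a^-1
    H = K_virtual @ R @ np.linalg.inv(K_actual)
    h, w = img.shape[:2]
    return cv2.warpPerspective(undistorted, H, (w, h))
```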
Extrinsic Calibration Sources
The calibration network estimates the 6-DOF extrinsic pose (position + orientation) of each camera relative to the vehicle body frame using:
- Vanishing point geometry from parallel lines (lane markings, building edges)
- Horizon detection for pitch and roll estimation
- Multi-camera consistency constraints where overlapping FOVs must agree on 3D geometry
- IMU data for ground-truth orientation reference during calibration
5. HydraNet / Backbone Architecture
Overview
HydraNet is Tesla's multi-task learning architecture, introduced at AI Day 2021. The name references the mythological Hydra -- a single body (shared backbone) with multiple heads (task-specific decoders).
Architecture Pipeline
8 Raw Camera Images
-> Calibration Neural Net (rectification to virtual camera)
-> RegNet Backbone (per-camera feature extraction)
-> BiFPN (multi-scale feature fusion)
-> Transformer Module (image space -> BEV vector space)
-> Feature Queue (temporal caching)
-> Video Module / Spatial RNN (temporal aggregation)
-> Task-Specific Trunks + Heads (detection, segmentation, etc.)

Stage 1: RegNet Backbone
Tesla replaced an earlier ResNet-50-based backbone with RegNets ("regular networks," from the design-space search work of Radosavovic et al.), which provide better accuracy-latency tradeoffs through a simplified design-space approach.
Multi-Scale Feature Extraction:
Each camera image (1280 x 960 on HW3) is processed independently through the RegNet backbone, producing a feature pyramid at multiple resolutions:
| Feature Level | Spatial Resolution | Channel Count | Purpose |
|---|---|---|---|
| Level 1 (finest) | 160 x 120 | Low (~64) | Fine details: lane markings, text, small objects |
| Level 2 | 80 x 60 | Medium (~128) | Mid-range features: vehicle shapes, signs |
| Level 3 | 40 x 30 | Higher (~256) | Larger structures: building outlines, road geometry |
| Level 4 (coarsest) | 20 x 15 | Highest (~512) | Full-scene context: scene type, spatial layout |
The key tradeoff: low-level features have high spatial resolution but limited semantic context, while high-level features have rich semantic context but coarse spatial resolution.
Stage 2: BiFPN (Bi-directional Feature Pyramid Network)
After the RegNet backbone extracts multi-scale features, they are fused through a BiFPN -- a weighted bi-directional feature pyramid network:
- Bi-directional flow: Information flows both top-down (high-level context enriches low-level detail) and bottom-up (fine details propagate to contextual features)
- Weighted fusion: The network learns the importance of each feature scale through learnable weights, with weight normalization for training stability
- Efficiency optimization: Single-input nodes are removed to reduce computation
- Cross-scale communication: Each scale can directly exchange information with adjacent scales, enabling the detection head for small, distant objects to benefit from high-level scene understanding
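The weighted fusion rule here is the "fast normalized fusion" published in the EfficientDet BiFPN paper; whether Tesla uses this exact variant is not public. A minimal PyTorch sketch:

```python
import torch
import torch.nn as nn

class WeightedFusion(nn.Module):
    """Fast normalized fusion (EfficientDet BiFPN):
    out = sum_i(w_i * f_i) / (sum_i w_i + eps), with learnable w_i >= 0."""
    def __init__(self, num_inputs: int, eps: float = 1e-4):
        super().__init__()
        self.weights = nn.Parameter(torch.ones(num_inputs))
        self.eps = eps

    def forward(self, features):  # features: list of same-shape tensors
        w = torch.relu(self.weights)      # keep fusion weights non-negative
        w = w / (w.sum() + self.eps)      # normalize for training stability
        return sum(wi * f for wi, f in zip(w, features))

fuse = WeightedFusion(num_inputs=2)
a, b = torch.randn(1, 128, 80, 60), torch.randn(1, 128, 80, 60)
out = fuse([a, b])  # (1, 128, 80, 60)
```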
Stage 3: Multi-Camera Fusion
After per-camera feature extraction and BiFPN fusion, the features from all 8 cameras must be combined into a single unified representation. Tesla uses a transformer-based multi-camera fusion module:
- Key-Value Generation: Each camera's BiFPN features generate key and value vectors
- Query Generation: A raster matching the desired output space (BEV grid or 3D volume) is tiled with positional encodings and encoded via MLP into query vectors
- Cross-Attention: Queries attend to keys across all 8 cameras simultaneously, pulling relevant values from whichever camera(s) observe each spatial location
- Output: A unified multi-camera feature map in the desired output coordinate system
Stage 4: Task-Specific Heads
The fused features branch into multiple trunks (shared per-task-group computation) and terminals (task-specific output layers):
| Task Head | Output | Purpose |
|---|---|---|
| Object Detection | 3D bounding boxes + class labels + velocities | Detect vehicles, pedestrians, cyclists, etc. |
| Lane Lines | Polyline geometry in BEV + lane type | Lane boundary detection |
| Road Edges | Boundary curves | Drivable area delimitation |
| Road Surface | 3D height map | Ground plane estimation |
| Drivable Space | Binary mask in BEV | Where the vehicle can physically drive |
| Traffic Lights | Bounding box + state (color, arrow, relevance) | Traffic signal interpretation |
| Traffic Signs | Class + position | Speed limits, stop signs, yield signs |
| Depth | Per-pixel depth map | Monocular depth estimation |
| Semantic Segmentation | Per-pixel class labels | Road, sidewalk, vegetation, building, etc. |
| Velocity Estimation | Per-object velocity vectors | Dynamic object motion |
Key Design Benefit -- Task Decoupling:
Each head can be fine-tuned independently without affecting other tasks. If traffic light detection needs improvement, the traffic light head can be retrained while the backbone and other heads remain frozen. This dramatically accelerates iteration speed for a team of ~20 engineers all working on a single neural network simultaneously.
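A sketch of the decoupled fine-tuning workflow this enables, using hypothetical module names (`hydranet.backbone`, `hydranet.heads`; Tesla's internal code is not public): freeze everything, unfreeze one head, train.

```python
import torch

def finetune_single_head(hydranet, head_name, loader, loss_fn, lr=1e-4):
    """Freeze the shared backbone and every other head, then train only
    the selected task head -- the decoupling workflow described above."""
    for p in hydranet.parameters():
        p.requires_grad = False
    head = hydranet.heads[head_name]          # e.g., "traffic_lights"
    for p in head.parameters():
        p.requires_grad = True

    opt = torch.optim.Adam(head.parameters(), lr=lr)
    for images, targets in loader:
        features = hydranet.backbone(images)  # frozen shared computation
        loss = loss_fn(head(features), targets)
        opt.zero_grad()
        loss.backward()                       # gradients reach only the head
        opt.step()
```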
Scale of the Modular Stack
At peak complexity (pre-v12), the HydraNet system comprised:
- 48 distinct neural networks operating in concert
- Producing 1,000 distinct output tensors per timestep
- Running on 2 NPUs across the dual FSD chips
6. BEV (Bird's Eye View) Transformer
The Fundamental Problem
Cameras produce 2D images, but driving requires understanding 3D space. The BEV transformer solves the 2D-to-3D transformation problem: converting multi-camera 2D image features into a unified top-down (bird's eye view) spatial representation where distances, sizes, and spatial relationships are metrically accurate.
Architecture
Tesla's BEV transformer, first detailed at AI Day 2021, uses a spatial cross-attention mechanism:
Step 1: Initialize the BEV Grid
A regular 2D grid in BEV space (top-down view, centered on the vehicle) is initialized. Each grid cell represents a physical location in the world (e.g., 0.5 m x 0.5 m resolution covering an area around the vehicle).
Step 2: Positional Encoding
Each BEV grid cell is assigned positional encodings using:
- Sinusoidal functions (sine and cosine at different frequencies) encoding the (x, y) position of each cell in the physical world
- These positional encodings are processed through an MLP to produce a set of query vectors -- one per BEV grid cell
Step 3: Image Feature Key-Value Pairs
For each of the 8 cameras:
- The BiFPN features are projected into key and value vectors
- The keys encode "what spatial information is available at each position in this camera's image"
- The values encode "the actual feature content at that position"
Step 4: Cross-Attention
The BEV queries attend to the image keys across all cameras simultaneously:
Attention(Q, K, V) = softmax(Q * K^T / sqrt(d_k)) * V
where:
Q = BEV positional queries (what 3D location am I asking about?)
K = Image feature keys (where in the images can I find information?)
V = Image feature values (what does the image say about that location?)

The attention weights learn the geometric mapping from BEV locations to camera pixel locations. Through training, the network discovers which camera pixels correspond to which 3D world positions, effectively learning the camera projection geometry implicitly.
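Putting Steps 2--4 together, a minimal single-head sketch (feature width, grid size, and layer shapes are illustrative; Tesla's are not public):

```python
import torch
import torch.nn as nn

class BEVCrossAttention(nn.Module):
    """Single-head sketch: sinusoidal BEV positions -> MLP queries;
    flattened multi-camera features -> keys/values; softmax attention."""
    def __init__(self, d=256):
        super().__init__()
        self.query_mlp = nn.Sequential(nn.Linear(d, d), nn.ReLU(), nn.Linear(d, d))
        self.to_k = nn.Linear(d, d)
        self.to_v = nn.Linear(d, d)

    def forward(self, bev_pos_enc, image_feats):
        # bev_pos_enc: (num_bev_cells, d) sinusoidal (x, y) encodings
        # image_feats: (num_pixels, d) flattened features from all 8 cameras
        Q = self.query_mlp(bev_pos_enc)
        K, V = self.to_k(image_feats), self.to_v(image_feats)
        attn = torch.softmax(Q @ K.T / K.shape[-1] ** 0.5, dim=-1)
        return attn @ V  # (num_bev_cells, d): one feature vector per cell

bev = BEVCrossAttention()
out = bev(torch.randn(20 * 20, 256), torch.randn(8 * 20 * 15, 256))  # tiny demo sizes
```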
Step 5: Multi-Head Attention
Multiple attention heads operate in parallel, each specializing in different aspects of the projection:
- Some heads may focus on near-field mapping
- Others on long-range correspondences
- Others on cross-camera overlap regions where stereo-like depth cues are available
Step 6: Output
The result is a dense BEV feature map -- a top-down spatial representation where each cell contains rich learned features about that location in the world. This BEV feature map is the primary intermediate representation consumed by downstream tasks (occupancy prediction, lane detection, object detection, planning).
Why Transformers for BEV?
Before Tesla popularized this approach (sparking the academic BEVFormer line of work), BEV projection was done using explicit geometric transformations (inverse perspective mapping, depth-based lifting). These approaches required known camera intrinsics/extrinsics and explicit depth estimates.
The transformer-based approach is superior because:
- It learns the projection rather than computing it from calibration parameters, making it robust to calibration errors
- It handles occluded and ambiguous geometry by aggregating information across multiple cameras and multiple attention heads
- It naturally handles varying camera configurations across hardware generations (HW3 vs HW4) by learning different attention patterns
Relationship to BEVFormer
Tesla's approach predates and inspired the academic BEVFormer paper (ECCV 2022, by Zhiqi Li et al.), which formalized similar ideas with:
- Spatial cross-attention between BEV queries and multi-camera image features
- Temporal self-attention between current and previous BEV features
- ResNet-101-DCN + FPN backbone (vs. Tesla's RegNet + BiFPN)
Tesla's implementation likely differs in many specifics (not publicly documented), but the core principle -- learned spatial cross-attention from images to BEV -- is shared.
7. Occupancy Networks
Introduction
Occupancy Networks were introduced at CVPR 2022 (workshop keynote by Ashok Elluswamy) and detailed at AI Day 2022 (September 2022). They represent Tesla's approach to general 3D scene understanding -- replacing the need for explicit object detection for many safety-critical decisions.
Core Concept
Rather than detecting objects and fitting bounding boxes, the occupancy network divides the 3D space around the vehicle into a dense grid of voxels (volumetric pixels -- small cubes) and predicts whether each voxel is free or occupied.
This approach handles:
- Arbitrary geometry that bounding boxes cannot capture (ladders on trucks, side trailers, overhanging structures, construction equipment)
- Novel objects never seen in training (the network only needs to predict "something is here," not "this is a bicycle")
- Continuous surfaces (curbs, walls, guardrails) that are poorly represented by individual bounding boxes
Architecture
8 Camera Images
-> RegNet + BiFPN Backbone (per-camera features)
-> Spatial Attention Module (image features -> 3D occupancy feature volume)
- Inputs: Image Key, Image Value, 3D Spatial Queries
-> Temporal Fusion (merge with t-1, t-2, t-3, ... feature volumes)
-> Deconvolution Layers
-> Outputs: Occupancy Volume + Occupancy Flow

Step-by-step (per Tesla patent US20240185445A1):
- Camera Input (Step 210): Eight camera feeds are captured simultaneously
- Featurization (Step 220): RegNet + BiFPN extracts and fuses multi-scale features; generates multi-camera query embeddings
- 3D Reconstruction (Step 230): A transformer aggregates overlapping camera views into a unified 3D representation using 3D spatial queries on 2D featurized image data
- Temporal Fusion (Step 240): The 3D representation at timestamp t is fused with representations from t-1, t-2, t-3 to produce spatial-temporal features
- Deconvolution (Step 250): Mathematical operations reverse convolution effects, transforming fused features back to voxel space
- Volume Output (Step 260): Generates per-voxel occupancy predictions
Voxel Resolution
| Parameter | Value |
|---|---|
| Default voxel size | 33 cm per edge (~1 foot cubes) |
| Refined voxel size (near ego, occupied) | 10 cm per edge |
| Adaptive refinement | Voxels near the vehicle or on occupied surfaces are dynamically subdivided |
| Sub-voxel analysis | Trilinear interpolation estimates occupancy within partially-occupied voxels |
| Resolution increase (v13) | 8x increase in voxel resolution vs. initial implementation |
Output Categories
The occupancy network produces four categories of output per voxel:
| Output | Description |
|---|---|
| Occupancy Volume | Binary or probabilistic designation: occupied (1) or free (0) for each voxel |
| Occupancy Flow | Velocity vector for each occupied voxel -- how fast and in what direction the mass is moving |
| Shape Information | Surface geometry of the occupied region, using regression to identify object shapes |
| Semantic Labels | Classification: car, truck, pedestrian, street curb, building, road surface, vegetation, etc.; plus static vs. moving |
Signed Distance Fields (SDF)
Rather than binary occupied/free predictions, Tesla uses Signed Distance Fields for smoother, more precise geometry:
- For any point in the 3D grid, the model predicts its distance to the nearest solid surface
- Positive values: point is outside the object (in free space)
- Negative values: point is inside the object
- Zero: point is exactly on the surface
- The SDF provides continuous, sub-voxel geometry -- vastly more precise than binary occupancy
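A small numpy sketch of how downstream code might consume such an SDF grid: binary occupancy from the sign, a surface shell from the magnitude, and sub-voxel surface positions from zero crossings. The sign convention follows the list above; interpolation along a single axis stands in for full trilinear interpolation.

```python
import numpy as np

def occupancy_from_sdf(sdf: np.ndarray, voxel_size: float = 0.33):
    """sdf: (X, Y, Z) grid of signed distances in meters
    (positive = free space, negative = inside, zero = on the surface)."""
    occupied = sdf < 0.0                     # binary occupancy per voxel
    near_surface = np.abs(sdf) < voxel_size  # shell of surface voxels
    # Sub-voxel surface along +x: linear interpolation at the zero crossing
    # between a free voxel and its occupied neighbor.
    d0, d1 = sdf[:-1], sdf[1:]
    crossing = (d0 > 0) & (d1 <= 0)
    frac = np.where(crossing, d0 / np.maximum(d0 - d1, 1e-9), np.nan)
    return occupied, near_surface, frac      # frac in [0, 1] at crossings
```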
Performance
| Metric | Value |
|---|---|
| Inference speed | >100 FPS (3x faster than camera frame rate) |
| Memory efficiency | "Super memory efficient" (Tesla's characterization) |
| Temporal persistence | Maintains 3D "Voxel Map" across time; tracks objects even when temporarily occluded |
Occupancy Flow: Motion Understanding
The flow output uses color encoding for visualization:
- Red: Forward-moving voxels
- Blue: Backward-moving voxels
- Grey: Stationary voxels
This enables the system to:
- Distinguish parked cars from moving cars without explicit object detection
- Predict where occupied regions will be in the near future
- Handle objects whose velocity is difficult to estimate from a single frame (e.g., objects moving laterally)
How Occupancy Networks Replace LiDAR
| Capability | LiDAR | Tesla Occupancy Network |
|---|---|---|
| 3D structure | Direct measurement (sparse points) | Predicted (dense voxels) |
| Range | ~200 m (automotive grade) | Entire camera range (~250 m narrow, ~60 m wide) |
| Refresh rate | 10--20 Hz | >100 Hz |
| Novel object detection | Natural (any physical object reflects) | Learned from training data; generalizes via SDF |
| Texture/color | None | Full (from camera images) |
| Weather robustness | Moderate (degrades in rain/fog) | Moderate (degrades in visual occlusion) |
| Cost per unit | $500--$10,000+ | $0 (software on existing cameras) |
Temporal Persistence and Occlusion Handling
The occupancy map is persistent across time -- the vehicle maintains a 3D voxel map of its surroundings that updates incrementally:
- When a pedestrian walks behind a parked car, their last known position and velocity are retained in the occupancy map
- The system continues to predict the occluded pedestrian's likely position based on their trajectory
- When the pedestrian re-emerges, the prediction is validated and updated
8. Temporal Module
The Video Problem
A single-frame perception system cannot handle:
- Occlusion (objects hidden behind other objects)
- Velocity estimation (requires multiple frames)
- Road geometry prediction beyond the current visible extent
- Scene understanding in ambiguous situations (is that a shadow or a pothole?)
Tesla addresses this with a sophisticated temporal processing pipeline that turns the perception system from a "snapshot camera" into a "video understanding system."
Feature Queue
The feature queue is the temporal memory of the perception system, caching features from recent timesteps:
What is cached per timestep:
- Multi-camera fused features (output of the multi-camera transformer)
- Ego-vehicle kinematics (position, velocity, heading from IMU + odometry)
- Positional encodings (encoding the vehicle's world position at that timestep)
Two types of queues with different push rules:
| Queue Type | Push Rule | Purpose |
|---|---|---|
| Time-based queue | Push every ~27 milliseconds (36 Hz) | Handle occlusion: if an object disappears behind an obstacle, the time-based queue remembers it was there moments ago |
| Space-based queue | Push every 1 meter of vehicle travel | Road geometry prediction: road markings and road edges from 50 m behind the vehicle are used to predict the geometry ahead |
The feature queue concatenates: multi-camera features + kinematics + positional encodings for each cached timestep.
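A minimal sketch of the two push rules, with `feat` standing for the concatenated entry described above; queue capacities and the exact clock are illustrative.

```python
from collections import deque

class FeatureQueues:
    """Two caches with different push rules: one on a fixed time clock,
    one on distance traveled."""
    def __init__(self, maxlen=16):
        self.time_q = deque(maxlen=maxlen)    # pushed every ~27 ms
        self.space_q = deque(maxlen=maxlen)   # pushed every 1 m of travel
        self._last_push_t = float("-inf")
        self._dist_since_push = 0.0

    def step(self, feat, t_seconds, dist_delta_m):
        if t_seconds - self._last_push_t >= 0.027:   # ~36 Hz time-based push
            self.time_q.append(feat)
            self._last_push_t = t_seconds
        self._dist_since_push += dist_delta_m
        if self._dist_since_push >= 1.0:             # space-based push
            self.space_q.append(feat)
            self._dist_since_push = 0.0
```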
Spatial RNN (Recurrent Neural Network)
The video module consumes the feature queue through a spatial RNN -- one of Tesla's most innovative architectural choices:
Architecture:
- RNN cells are organized as a 2D lattice representing the two-dimensional surface the vehicle drives on
- Each cell in the lattice has a hidden state that tracks various aspects of the road at that location:
- Lane centers
- Road edges
- Lane lines
- Road surface characteristics
Update Rules:
- Hidden state cells are updated only when the car is nearby or has visibility of that region
- Kinematics (from IMU) are used to integrate the vehicle's position into the hidden feature grid, so that features are properly registered in world coordinates as the car moves
- This selective update is highly efficient: only a small fraction of the 2D lattice is updated at each timestep
Why Spatial RNN (not just temporal attention)?
A standard temporal attention module processes features frame-by-frame. The spatial RNN instead maintains a persistent map-like representation anchored in world coordinates. As the vehicle drives, the RNN builds up a spatial memory of the environment -- similar to how a human driver remembers the road they passed moments ago.
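A toy version of this selective-update lattice, using a GRU cell as the recurrent unit (Tesla has not disclosed the cell type) and a visibility mask so that only cells near or visible to the vehicle are touched each step:

```python
import torch
import torch.nn as nn

class SpatialRNN(nn.Module):
    """2D lattice of recurrent cells with selective updates. Grid size
    and feature widths are illustrative."""
    def __init__(self, grid=(256, 256), hidden=64, feat=64):
        super().__init__()
        self.cell = nn.GRUCell(feat, hidden)
        self.h = torch.zeros(grid[0] * grid[1], hidden)  # persistent map state
        self.grid = grid

    @torch.no_grad()
    def step(self, feats, visible_mask):
        # feats:        (H*W, feat) features registered into world coordinates
        # visible_mask: (H*W,) bool, cells currently near/visible to the car
        idx = visible_mask.nonzero(as_tuple=True)[0]
        self.h[idx] = self.cell(feats[idx], self.h[idx])  # update only those
        return self.h.view(*self.grid, -1)
```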
Kinematics Integration
The IMU provides real-time kinematics (position, velocity, acceleration, angular rates) that are critical for temporal alignment:
- Features from frame t-1 must be registered to the vehicle's current position at frame t
- Without kinematics, temporal aggregation would fail due to ego-motion
- The kinematics are fed both to the feature queue (for positional encoding) and to the spatial RNN (for hidden state position tracking)
However, accurate alignment ultimately relies on the trained transformer network, not just raw IMU data -- the network learns to compensate for IMU noise and drift.
Temporal Processing in v13+
FSD v13 introduced a 10-second recursive video buffer:
- The system maintains the last 10 seconds of video context
- At 36 fps, this represents ~360 frames of temporal memory
- The occupancy network was upgraded to use video instead of single-timestep images, enabling:
- Robustness to temporary occlusions
- Prediction of occupancy flow (object trajectories)
- Better depth estimation through multi-frame triangulation
- Scene understanding from motion parallax
9. Object Detection & 3D Bounding Boxes
Detection in Image Space vs. Vector Space
Tesla's object detection operates in two stages:
Stage 1 -- Image-Space Detection:
- Each camera's features (after backbone + BiFPN) produce a detection raster
- The raster contains 1 bit per position indicating whether there is an object at that location
- Additional attributes per detection: class, 2D bounding box, partial 3D information
Stage 2 -- Vector-Space 3D Detection:
- After multi-camera fusion and BEV projection, objects are detected in the unified 3D vector space
- Each detection includes:
- 3D bounding box (position, dimensions, orientation in world coordinates)
- Object class
- Velocity vector (speed + heading)
- Acceleration estimate
- Existence probability
3D Bounding Box Estimation from Cameras
Estimating 3D bounding boxes from 2D camera images requires solving two problems: what the object is and where it is in 3D space.
Depth estimation for bounding boxes:
- Monocular depth cues: apparent size, texture gradient, ground-plane constraints, learned priors
- Multi-frame triangulation: tracking the same object across consecutive frames as the ego vehicle moves provides stereo-like depth
- Multi-camera overlap: where multiple cameras see the same object, geometric triangulation provides precise depth
- Temporal aggregation: velocity integration from optical flow constrains depth
Training approach:
- Validation vehicles equipped with auxiliary sensors (LiDAR, radar) capture ground-truth 3D measurements
- By tracking objects across frames, the system correlates visual features with precise 3D positions
- This generates a massive, highly accurate training dataset that the fleet-deployed vision-only network learns from
Detected Object Classes
Vehicles:
- Sedan / passenger car
- Minivan / SUV
- Pickup truck
- Small truck / box truck
- Tractor-trailer / semi
- Bus
- Motorcycle
- Emergency vehicles (ambulance, fire truck -- added v14)
- Garbage truck (added v14)
- Street sweeper (added v14)
- Golf cart (added v14)
Vulnerable Road Users (VRUs):
- Pedestrians (with walking animation for moving)
- Cyclists / bicycles
- Baby carriages / strollers
- Skateboarders
- Animals (dogs and similarly-sized animals)
Road Infrastructure:
- Traffic cones
- Construction barrels
- Debris / unidentified obstacles (added FSD Beta 10.69.2)
- Garbage/recycling bins
- Poles
Foveated Processing Optimization:
To manage computational cost on fixed hardware, Tesla uses a foveated rendering approach for detection:
- A high-resolution crop of the horizon region is processed at full resolution for detecting distant, small objects
- A downsampled version of the rest of the image is processed for closer, larger objects
- The two processed views are fused, providing long-range precision without processing every pixel at maximum resolution
This is analogous to foveated rendering in VR, but applied in reverse -- the system focuses computational resources on the region where driving decisions depend on detecting distant objects.
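A sketch of the two-view input scheme, assuming a full-resolution horizon strip plus a half-resolution full frame (crop height and scale factor are illustrative):

```python
import torch.nn.functional as F

def foveated_views(img, horizon_row, crop_h=256):
    """Split one frame into (a) a full-resolution strip around the horizon
    for small, distant objects and (b) a downsampled full frame for large,
    nearby ones. img: (1, C, H, W) float tensor."""
    top = max(horizon_row - crop_h // 2, 0)
    horizon_crop = img[:, :, top:top + crop_h, :]            # full resolution
    coarse = F.interpolate(img, scale_factor=0.5, mode="bilinear",
                           align_corners=False)              # half resolution
    return horizon_crop, coarse  # each runs through detection, then fused
```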
10. Depth Estimation
The Core Challenge
Without active ranging sensors (LiDAR, radar, sonar), Tesla must estimate depth from 2D camera images alone. This is one of the most technically challenging aspects of the vision-only approach.
Methods Used
1. Self-Supervised Monocular Depth
The primary depth estimation approach uses self-supervised learning:
- During training, the network learns to predict depth from single images by exploiting the geometric consistency of consecutive frames
- If the depth prediction is correct, warping a previous frame to the current viewpoint (using the predicted depth and known ego-motion) should reproduce the current frame
- The photometric loss between the warped and actual frame drives the depth learning
- No ground-truth depth labels are required for this training signal
Advantages:
- Scales to Tesla's massive fleet data (billions of frames)
- No per-frame depth labels needed
- Captures scene-level depth (road surface, buildings, vegetation) not just object depth
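A minimal version of this training signal, assuming a pinhole camera and known ego-motion: back-project the current frame's pixels with the predicted depth, reproject into the previous frame, sample, and penalize the photometric difference. Production losses typically add SSIM terms and occlusion masking, omitted here.

```python
import torch
import torch.nn.functional as F

def photometric_loss(depth, prev_frame, curr_frame, K, T_prev_curr):
    """depth: (1, 1, H, W) predicted depth for the current frame
    prev_frame, curr_frame: (1, 3, H, W) images
    K: (3, 3) intrinsics; T_prev_curr: (4, 4) ego-motion (curr -> prev)."""
    _, _, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(3, -1)
    cam = torch.linalg.inv(K) @ pix * depth.view(1, -1)        # back-project
    cam_h = torch.cat([cam, torch.ones(1, cam.shape[1])], 0)   # homogeneous
    proj = K @ (T_prev_curr @ cam_h)[:3]                       # into prev view
    uv = proj[:2] / proj[2].clamp(min=1e-6)
    grid = torch.stack([uv[0] / (W - 1) * 2 - 1,               # to [-1, 1]
                        uv[1] / (H - 1) * 2 - 1], -1).view(1, H, W, 2)
    warped = F.grid_sample(prev_frame, grid, align_corners=True)
    return (warped - curr_frame).abs().mean()  # L1 photometric error
```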
2. Multi-Frame Triangulation
By tracking features or objects across multiple frames as the vehicle moves:
- The ego-motion between frames creates a baseline (similar to stereo vision)
- Known features visible in multiple frames can be triangulated to estimate their 3D position
- The longer the temporal baseline, the more accurate the depth estimate (up to a point)
This is particularly effective for:
- Stationary objects (buildings, parked cars, signs) where ego-motion provides the stereo baseline
- Slow-moving objects where the object's own motion is small relative to ego-motion
3. Multi-Camera Stereo
Several camera pairs have overlapping fields of view:
- Left and right B-pillar cameras create a wide stereo baseline
- Forward cameras overlap with side cameras at close range
- B-pillar and C-pillar cameras (HW4) have lateral overlap
Where cameras share overlapping views, classical stereo matching principles provide direct depth measurements.
4. Learned Depth Priors
The neural network learns strong priors about depth from training data:
- Apparent object size (a car at 100 m appears much smaller than at 10 m)
- Ground-plane geometry (road markings converge at the vanishing point)
- Texture gradients (road texture becomes finer with distance)
- Atmospheric perspective (distant objects appear hazier)
- Object class constraints (pedestrians are ~1.7 m tall; stop signs are ~0.75 m wide)
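The class-size prior reduces to the pinhole relation Z = f * H / h. A worked example with an illustrative (not Tesla-published) focal length:

```python
def depth_from_known_height(f_px: float, real_h_m: float, image_h_px: float) -> float:
    """Pinhole-model depth prior: Z = f * H / h, with f in pixels."""
    return f_px * real_h_m / image_h_px

# A ~1.7 m pedestrian subtending 50 px under a hypothetical 1000 px focal length:
print(depth_from_known_height(1000.0, 1.7, 50.0))  # -> 34.0 (meters)
```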
5. Training with Ground-Truth Sensors
Tesla's validation engineering vehicles are equipped with high-precision auxiliary sensors (including LiDAR and radar, ironically) to capture ground-truth depth measurements. These ground-truth datasets are used to:
- Supervise the depth estimation network during training
- Validate the self-supervised depth predictions
- Generate pseudo-LiDAR labels for the auto-labeling pipeline
The production vehicles do not have these sensors -- they learn to infer depth from cameras alone using the knowledge distilled from the ground-truth-equipped training fleet.
Depth Estimation Accuracy
Per Tesla's claims and independent analyses, the vision-based depth estimation achieves precision close to that of auxiliary sensors for objects within the typical driving envelope (~0--200 m). Accuracy degrades:
- At extreme range (>200 m) where monocular cues become ambiguous
- For featureless surfaces (blank walls, snow-covered ground)
- In low-texture environments
11. Lane Detection & Road Geometry
Lane Detection Architecture
Lane detection in Tesla's perception stack operates in BEV/vector space, not in image space:
- Image Features: RegNet + BiFPN extract per-camera features including lane line responses
- BEV Projection: The transformer module projects lane features from image space into the top-down BEV representation
- Spatial RNN Memory: The space-based feature queue (push every 1 m of travel) remembers lane markings from the recent past -- markings from 50+ meters behind the vehicle constrain the predicted lane geometry ahead
- Lane Head: A task-specific head predicts lane geometry in BEV coordinates
Lane Representation
Rather than per-pixel segmentation of lane markings, Tesla predicts lane geometry as structured polylines and graphs in vector space:
| Output | Description |
|---|---|
| Lane centerlines | Polyline geometry for each lane center |
| Lane boundaries | Left and right edge polylines per lane |
| Lane types | Solid, dashed, double, yellow, white |
| Road edges | Curbs, barriers, guardrails, grass/dirt edges |
| Road surface | 3D height map of the drivable surface |
| Drivable space | Binary mask of physically traversable area |
Lane Connectivity Network
A particularly innovative component is the Lane Connectivity Network, which uses transformer-based autoregressive blocks to understand road layouts:
- Similar to how a language model generates text token-by-token, the lane connectivity network predicts the road graph node-by-node
- It reasons about intersection topology: which lanes connect to which at an intersection
- It predicts the number of lanes, their connectivity, merge/split points, and turn lanes
- This enables the vehicle to plan routes through complex intersections without HD maps
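A toy decoder in this spirit: lane-graph nodes are emitted token-by-token, each step cross-attending to BEV features much as a language model attends to its prompt. Vocabulary, dimensions, and greedy decoding are illustrative choices, not Tesla's.

```python
import torch
import torch.nn as nn

class LaneGraphDecoder(nn.Module):
    """Autoregressive lane-graph sketch: each discretized node token is
    predicted from the previous tokens plus BEV memory."""
    def __init__(self, vocab=1024, d=256, max_nodes=64):
        super().__init__()
        self.embed = nn.Embedding(vocab, d)
        self.decoder = nn.TransformerDecoderLayer(d, nhead=8, batch_first=True)
        self.head = nn.Linear(d, vocab)
        self.max_nodes = max_nodes

    @torch.no_grad()
    def generate(self, bev_memory, start_token=0):
        # bev_memory: (1, S, d) flattened BEV features as cross-attn memory
        tokens = torch.tensor([[start_token]])
        for _ in range(self.max_nodes):
            x = self.embed(tokens)                 # (1, T, d)
            x = self.decoder(x, bev_memory)        # cross-attend to BEV
            next_tok = self.head(x[:, -1]).argmax(-1, keepdim=True)
            tokens = torch.cat([tokens, next_tok], dim=1)  # greedy decode
        return tokens  # discretized lane-graph node sequence

# graph = LaneGraphDecoder().generate(torch.randn(1, 400, 256))
```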
Road Geometry Prediction
Beyond lane markings, the perception stack predicts:
- Ground surface height: The 3D height of the road surface, enabling the vehicle to handle hills, dips, speed bumps, and uneven terrain
- Road curvature: Predicted ahead of the visible extent using spatial memory
- Road semantics: Travel space, parking areas, driveways, turn pockets
No HD Maps Required
A critical design decision: Tesla's lane detection works without pre-built HD maps. This is in stark contrast to Waymo, Cruise, and other AV companies that rely on centimeter-accurate maps. Tesla's lane network predicts road geometry on-the-fly from visual cues, making it operational on any road worldwide.
12. Traffic Light & Sign Detection
Traffic Light Detection
Traffic lights are handled by a dedicated task head in the HydraNet architecture:
Detection Pipeline:
- Full-resolution features from the backbone (especially fine-grained Level 1 features at 160x120) are used to localize small, distant traffic lights
- The detection head outputs:
- Bounding box around each traffic light
- State classification: Red, yellow, green, flashing, off
- Arrow detection: Left, right, straight, U-turn
- Relevance determination: Which traffic light applies to the ego vehicle's lane
Relevance is the hardest sub-problem: at a complex intersection, there may be 10+ traffic lights visible, but only 1--2 are relevant to the ego vehicle's current lane and intended direction. The network learns relevance from training data showing which light the human driver obeyed.
LED Flicker Mitigation (HW4):
- LED traffic lights flicker at frequencies that can cause them to appear off in individual camera frames
- The Sony IMX490 sensor includes built-in LED flicker mitigation
- This was a known issue with HW3 cameras (AR0136AT), partially mitigated in software
Traffic Sign Detection
Traffic signs are detected and classified by a separate task head:
| Sign Type | Detection Capability |
|---|---|
| Speed limit signs | Detected and value extracted; displayed to driver |
| Stop signs | Detected; vehicle behavior triggered |
| Yield signs | Detected; context-aware behavior |
| Road name signs | Detected; text recognized from a predetermined vocabulary of road words |
| Construction signs | Detected; associated with construction zones |
| School zone signs | Detected; speed reduction triggered |
| No-turn signs | Detected; route planning constraint |
End-to-End Handling (v12+)
In the end-to-end architecture (FSD v12+), traffic light and sign detection are not separate explicit modules. Instead:
- The unified model learns to respond to traffic lights and signs implicitly through imitation learning
- Engineers can still "prompt" the model to output auxiliary traffic light/sign predictions for debugging and safety verification
- The model's internal representations still encode traffic light state, but this is a learned feature rather than an explicit detection head
13. Semantic Segmentation
Segmentation Outputs
Tesla's perception stack performs dense semantic segmentation in both image space and BEV space:
Image-Space Segmentation (per camera):
| Class | Description |
|---|---|
| Road surface | Paved driving surface |
| Lane markings | Solid lines, dashed lines, stop bars, crosswalks, arrows, chevrons |
| Curbs | Raised edges delimiting the road |
| Sidewalks | Pedestrian walkways |
| Vegetation | Trees, bushes, grass |
| Buildings | Permanent structures |
| Sky | Above the horizon |
| Vehicles | Cars, trucks, buses, motorcycles |
| Pedestrians | Persons |
| Cyclists | Bicycles with riders |
| Traffic infrastructure | Poles, signs, lights |
BEV-Space Segmentation:
| Class | Description |
|---|---|
| Drivable area | Where the vehicle can physically drive |
| Road body | The paved road surface (displayed white on gray in FSD visualization) |
| Road edges | Boundaries of the drivable area (displayed red in FSD visualization) |
| Lane boundaries | Separations between lanes |
| Crosswalks | Pedestrian crossing zones |
| Parking spaces | Detected parking spots |
3D Semantic Occupancy Grid
The occupancy network extends segmentation into 3D:
- Each occupied voxel receives a semantic label
- Classes include: vehicle, pedestrian, curb, road surface, building, vegetation, low obstacle, generic occupied
- Static vs. dynamic classification per voxel
- This 3D segmentation is the primary input for safe navigation -- the planner knows not just where objects are, but what category they belong to
Road Marking Detection
Tesla's system detects and classifies an extensive set of road markings:
- Single and double yellow/white lines (continuous and dashed)
- Stop bars
- Crosswalks
- Road arrows (turn, straight, merge)
- Road chevrons
- Bicycle lane markings
- Railroad crossings
- Handicap parking symbols
- Text on roads (from a predetermined vocabulary)
14. Object Tracking
Multi-Camera Multi-Object Tracking
Object tracking in a camera-only system is challenging because:
- Objects move between cameras as they (or the ego vehicle) move
- There is no direct range measurement to disambiguate similar-looking objects
- Occlusions can cause objects to temporarily disappear
Tracking Architecture
Tesla's tracking operates in the 3D vector space (BEV coordinates), not in individual camera images:
- Detection in 3D: Objects are detected in the unified BEV/occupancy representation, providing 3D position estimates
- Data Association: Across timesteps, detections are associated with existing tracks using:
- Predicted position (from previous velocity + kinematics model)
- Appearance features (learned embedding similarity)
- Size and class consistency
- Track Maintenance: Each tracked object maintains:
- Position history (trajectory)
- Velocity and acceleration estimates
- Class label (with confidence)
- Existence probability
- Occlusion Handling: When an object is occluded:
- The temporal persistence of the occupancy map retains its last known position and velocity
- The track continues to predict the object's position based on its last known trajectory
- When the object re-appears, it is re-associated with the persisted track
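A minimal data-association sketch in BEV coordinates using only the predicted-position term (a real tracker would blend in the appearance-embedding and class-consistency terms listed above):

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(tracks, detections, dt, max_cost=4.0):
    """Hungarian association: cost = distance between each track's
    predicted position and each detection, in meters (BEV frame).
    tracks: dicts with "pos"/"vel" (2,) arrays; detections: dicts with "pos"."""
    pred = np.array([t["pos"] + t["vel"] * dt for t in tracks])  # (T, 2)
    det = np.array([d["pos"] for d in detections])               # (D, 2)
    cost = np.linalg.norm(pred[:, None] - det[None], axis=-1)    # (T, D)
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
```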
Multi-Camera Association
When an object transitions from one camera's FOV to another:
- The 3D vector space representation provides a common coordinate frame
- An object detected by the left B-pillar camera at position (x, y, z) in world coordinates is the same object detected by the rear camera at the same (x, y, z)
- No explicit "handoff" between cameras is needed -- the fusion happens in the BEV transformer
Tracking in End-to-End (v12+)
In the end-to-end architecture, tracking is implicit:
- The temporal module (10-second video buffer) provides the network with object persistence information
- The network learns to maintain internal representations of tracked objects through its recurrent/temporal structure
- Explicit track IDs may no longer be maintained; instead, the model reasons about "the world state" holistically
15. End-to-End Architecture (v12+)
The Paradigm Shift
FSD v12 (November 2023) marked the most radical architectural change in Tesla's Autopilot history: replacing the modular perception + planning pipeline with a single end-to-end neural network.
Before (Modular, v11 and earlier):
Cameras -> Perception (HydraNet) -> Intermediate Representations
-> Planning (MCTS + Rules) -> Control Commands

- ~300,000 lines of C++ code for planning and control
- Explicit rules for every driving scenario (e.g., "stop 3 seconds at stop signs")
- Human engineers manually encoded driving behavior
- Planning was a hybrid symbolic-learning system
After (End-to-End, v12+):
Cameras (8 cams x N frames) -> Single Neural Network -> Control Commands
(steering angle, acceleration, braking)

- ~2,000--3,000 lines of "glue code" for network activation, safety monitors, and vehicle interface
- Driving behavior learned from 10+ million human driving video clips
- No explicit rules; behavior emerges from data
- Fully differentiable end-to-end
How It Works
Input:
- Raw pixel data from 8 cameras (12-bit RCCC sensor data)
- Temporal context from the last 10 seconds of video (~360 frames)
- Navigation data (desired route/destination)
- Vehicle kinematics (speed, steering angle, IMU data)
- Audio input (v14+, for emergency vehicle siren detection)
Single Neural Network:
- Processes all inputs through a unified architecture
- Internal representations include BEV features, occupancy predictions, lane understanding, and object awareness -- but these are learned latent representations, not explicitly engineered modules
- Gradients flow all the way from the control output back to the sensor inputs, optimizing the entire pipeline holistically
Output:
- Steering angle
- Acceleration/throttle command
- Braking command
- Turn signal activation
- (Future: horn, hazard lights)
Training Approach
Behavioral Cloning (Imitation Learning):
- The network is trained to replicate expert human driving behavior
- Training data: millions of hours of human driving from Tesla's fleet, graded by driver quality
- The network learns the mapping: (visual input, route) -> control output
- High-quality drivers' data is weighted more heavily
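The behavioral-cloning objective, as a sketch: regress the human's controls from the same inputs and weight each clip by driver quality. Field and model names are hypothetical.

```python
import torch

def behavioral_cloning_loss(model, batch):
    """Imitation objective: match human controls, weighting clips from
    higher-quality drivers more heavily."""
    pred = model(batch["video"], batch["route"], batch["kinematics"])
    err = (pred - batch["human_controls"]) ** 2   # steering, accel, brake
    return (batch["driver_quality_weight"] * err.mean(dim=-1)).mean()
```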
Reinforcement Learning:
- Applied on top of the imitation-learned base policy
- Reward function evaluates:
- Positive: safe navigation, smooth lane changes, comfortable ride
- Negative: traffic violations, unsafe proximity, uncomfortable maneuvers, hesitant behavior
- Particularly important for safety-critical edge cases underrepresented in the imitation data
Interpretability Despite End-to-End
Despite being a "black box" end-to-end model, the system maintains interpretability:
- Engineers can "prompt" the model to output auxiliary predictions: 3D occupancy, road boundaries, objects, signs, traffic lights, etc.
- These auxiliary outputs are for debugging and safety verification only -- they do not directly control vehicle behavior
- Natural language querying (v14+): engineers can ask the model why it made a certain decision
- The model outputs language descriptions of its reasoning (potentially via an integrated VLA -- Vision Language Action -- model, though Tesla has not confirmed this)
Version Progression
| Version | Key Architecture Changes |
|---|---|
| v12.0 (Nov 2023) | First end-to-end; replaces C++ planner with neural network for city driving |
| v12.4 (Jun 2024) | Camera-based driver monitoring; improved E2E model |
| v12.5 (Aug 2024) | E2E extended to highway driving (previously highway used modular stack); larger model |
| v13.0 (Nov 2024) | Temporal-voxel transformer; 10-second recursive video buffer; HW4 only; 3x parameters vs v12; 8x voxel resolution; native HW4 camera resolution |
| v13.3 (2025) | Single large Vision Transformer for entire pipeline |
| v14.0 (Oct 2025) | ~10x parameter count vs v12; auto-regressive transformers; audio input; extended context |
| v14.2 (Nov 2025) | 95% reduction in hesitant behaviors; refinements |
16. Neural Network Planner Integration
Phase 2: Modular Planning (2021--2023)
In the modular architecture, perception outputs fed into a separate planning system:
Perception Outputs -> Planning Inputs:
| Perception Output | How Planning Used It |
|---|---|
| 3D Vector Space (objects, lanes, signs) | Scene graph for rule-based and learned decisions |
| Occupancy Volume (free/occupied voxels) | Collision checking for candidate trajectories |
| Occupancy Flow (voxel velocities) | Prediction of future obstacle positions |
| Lane Graph (centerlines, boundaries, topology) | Route-following and lane change opportunities |
| Traffic Light State (color, relevance) | Stop/go decisions |
| Drivable Space (BEV mask) | Feasibility constraints for trajectory generation |
Monte-Carlo Tree Search (MCTS) + Neural Network Planner:
- Generate multiple candidate trajectories from the current vehicle state
- For each trajectory, the neural network scores it using a cost function:
- Collision probability (from occupancy predictions)
- Comfort (jerk, lateral acceleration)
- Human-likeness (similarity to human driving behavior)
- Intervention likelihood
- Travel time / efficiency
- MCTS explores the trajectory tree to select the optimal path
- Selected trajectory is converted to steering, throttle, and brake commands
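A sketch of the kind of cost function such a planner minimizes over candidate trajectories; the term weights and the comfort proxy (second differences) are illustrative, not Tesla's.

```python
def trajectory_cost(traj, occupancy, weights=None):
    """traj: list of (x, y, t) states; occupancy(x, y, t) returns collision
    probability from the occupancy/flow predictions."""
    w = weights or {"collision": 100.0, "comfort": 1.0, "time": 0.1}
    collision = max(occupancy(x, y, t) for x, y, t in traj)
    xs = [x for x, _, _ in traj]
    # Crude comfort proxy: second differences penalize jerky lateral motion
    comfort = sum(abs(xs[i + 1] - 2 * xs[i] + xs[i - 1])
                  for i in range(1, len(xs) - 1))
    travel_time = traj[-1][2] - traj[0][2]
    return (w["collision"] * collision + w["comfort"] * comfort
            + w["time"] * travel_time)

# The search scores many candidates and expands the lowest-cost ones:
# best = min(candidates, key=lambda tr: trajectory_cost(tr, occupancy))
```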
Limitations of this approach:
- Manual rules were required for edge cases (e.g., nudging around double-parked cars)
- ~300,000 lines of C++ encoded these rules
- Adding new driving behaviors required engineering effort, not just data
- Perception errors could not be compensated by planning (no gradient flow)
Phase 3: End-to-End Planning (v12+)
In the end-to-end architecture, the separation between perception and planning dissolves:
- No explicit intermediate representations are passed between modules
- The neural network's internal activations contain perception-like features (BEV, occupancy, object representations), but these are latent -- they emerge from training, not engineering
- Gradients flow end-to-end: errors in planning (e.g., hesitating at an intersection) propagate back through the entire network, improving perception representations that are relevant for that driving scenario
- Joint optimization: perception learns to extract exactly the features that planning needs, rather than extracting "general" features that may not capture driving-relevant nuances
Intermediate Representations in End-to-End
Even in the end-to-end model, internal representations are structured:
- The model still forms a BEV-like spatial representation internally
- Occupancy-like activations exist in the middle layers
- Object-like features emerge in intermediate representations
- But these are not constrained to match predefined formats -- the model discovers the most useful representation for driving
This can be verified by probing intermediate layers: when engineers add auxiliary loss functions (e.g., "also predict occupancy from layer N"), the model produces accurate occupancy maps -- confirming that the internal representation encodes rich 3D scene understanding.
17. Auto-Labeling Pipeline
Overview
Tesla's auto-labeling pipeline is one of the most technically sophisticated and strategically important components of the perception system. It transforms raw fleet driving data into the labeled training datasets that the neural network learns from -- with minimal human intervention.
Pipeline Stages
Fleet Vehicles (raw data) -> Cloud Upload -> Offline Processing ->
4D Reconstruction -> Auto-Label Generation -> Human QA ->
Training Dataset

Stage 1: Raw Data Ingestion
Fleet vehicles upload:
- 45-second to 1-minute video clips from all 8 cameras
- IMU data (accelerometer + gyroscope)
- GPS coordinates
- Wheel odometry
- Vehicle CAN bus data (speed, steering angle, brake pressure)
Stage 2: Offline Neural Network Processing
The key insight: offline processing has access to both past AND future frames, unlike real-time processing. Tesla runs a much heavier, more accurate neural network than could ever run in real-time on the vehicle:
- This offline model has unlimited compute budget
- It can process frames bidirectionally (forward and backward in time)
- It produces "near-perfect" labels as a first pass
As Karpathy described at CVPR 2021: "Using a much heavier model than you could ever use in production to do a first stab at data labeling offline, to then be cleaned up by a human, is very powerful."
Stage 3: 4D Vector Space Reconstruction
This is Tesla's most innovative labeling technique:
- Multi-camera fusion into 3D: All 8 cameras' video streams are fused into 3D point clouds using structure-from-motion and learned depth
- Multi-frame temporal alignment: Multiple frames are aligned using ego-motion and SLAM to build temporally consistent 3D reconstructions
- 4D Vector Space (3D + time): The reconstruction exists in 4D -- 3D geometry evolving over time
- Single label, many views: A single label placed in the 4D vector space automatically projects into all 8 camera views across all frames -- making each labeling effort 100x more efficient than per-image labeling
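The "100x" leverage comes from plain projective geometry: label a point once in world coordinates, then project it into every camera at every timestep. A minimal sketch:

```python
import numpy as np

def project_label(point_world, K, T_world_to_cam):
    """Project one 3D label point into one camera at one timestep.
    Calling this per camera x per frame propagates a single 4D label
    into every view."""
    p = T_world_to_cam @ np.append(point_world, 1.0)  # world -> camera frame
    if p[2] <= 0:
        return None                                   # behind this camera
    uv = (K @ p[:3]) / p[2]                           # perspective divide
    return uv[:2]                                     # pixel coordinates

# for t, extrinsics in enumerate(poses):        # per frame
#     for cam, (K, T) in extrinsics.items():    # per camera
#         uv = project_label(label_xyz, K, T)
```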
Stage 4: Multi-Trip Aggregation
Multiple Tesla vehicles driving through the same location at different times contribute to a shared reconstruction:
- Observations from multiple vehicles are aligned using road features (lane lines, road edges, landmarks)
- Static elements (buildings, signs, road geometry) are reinforced with each additional pass
- Moving objects are identified by their temporal inconsistency across trips -- they appear in some passes but not others
- The aggregated reconstruction becomes progressively more accurate with more data
- Fleet averaging solves individual-trip problems: blurred images, rain, fog, partial occlusion are averaged out across many observations
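A toy sketch of the static/dynamic separation, assuming each trip contributes a set of occupied voxel indices in a shared, already-aligned world frame; the thresholds are illustrative:
```python
from collections import Counter

def aggregate_trips(trips, static_ratio=0.8, dynamic_ratio=0.2):
    """trips: list of sets of voxel indices (ix, iy, iz), already aligned.
    Voxels seen in most trips are treated as static structure; voxels seen
    in only a few trips are candidate moving objects."""
    counts = Counter(v for trip in trips for v in trip)
    n = len(trips)
    static = {v for v, c in counts.items() if c >= static_ratio * n}
    dynamic = {v for v, c in counts.items() if c <= dynamic_ratio * n}
    return static, dynamic
```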
Stage 5: Moving Object Trajectory Extraction
For dynamic objects:
- Temporal inconsistencies reveal which objects are moving
- Full kinematic trajectories (position, velocity, acceleration over time) are extracted
- Even objects that were only partially visible or temporarily occluded can be fully reconstructed from multi-frame data
- These trajectories become training labels for the object detection and tracking heads
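One simple way to extract such a trajectory -- a hedged sketch, not Tesla's published method -- is to fit a constant-acceleration model x(t) = x0 + v t + a t^2 / 2 to the observed positions by least squares; occluded frames are simply omitted from the fit and interpolated by the model:
```python
import numpy as np

def fit_kinematics(t: np.ndarray, positions: np.ndarray):
    """t: (N,) timestamps of frames where the object was visible.
    positions: (N, 3) observed object centers in world coordinates.
    Returns (x0, v, a), each of shape (3,)."""
    A = np.stack([np.ones_like(t), t, 0.5 * t**2], axis=1)  # (N, 3) design matrix
    coeffs, *_ = np.linalg.lstsq(A, positions, rcond=None)  # (3, 3) solution
    return coeffs[0], coeffs[1], coeffs[2]
```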
Stage 6: Human Quality Assurance
Auto-generated labels are spot-checked by Tesla's labeling team:
- High-confidence auto-labels are accepted without human review
- Edge cases, novel situations, and low-confidence predictions are flagged for human review
- The human labelers work in the 4D vector space representation, making corrections highly efficient
Generative Gaussian Splatting for Ground Truth
Tesla has developed a custom ultra-fast Gaussian Splatting system for 3D scene reconstruction:
| Feature | Specification |
|---|---|
| Speed | ~220 milliseconds per scene |
| Initialization | Not required (unlike standard 3DGS) |
| Dynamic objects | Can model moving objects |
| Joint training | Can be jointly trained with the end-to-end AI model |
| Camera views | Generates all 8 camera views simultaneously |
| Quality | Superior visual fidelity to standard Gaussian Splatting approaches |
Traditional 3D Gaussian Splatting struggles with driving scenes because the vehicle moves mostly forward along a nearly straight path, yielding little viewpoint diversity (a small effective stereo baseline between frames). Tesla's custom approach overcomes this limitation, producing high-quality 3D reconstructions from limited viewpoint variation.
Uses:
- Generating photo-realistic ground truth labels for training
- Creating synthetic training data with controlled variations
- Debugging by allowing engineers to virtually "fly through" reconstructed driving scenes
- Validating perception outputs against known 3D geometry
NeRF (Neural Radiance Fields) for Validation
Tesla uses NeRFs as an offline validation tool:
- Predicted occupancy volumes are compared against NeRF-reconstructed 3D scenes
- Discrepancies indicate perception errors that need targeted data collection and retraining
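A natural form for such a comparison is voxel-level IoU between the predicted and reconstructed occupancy grids; the sketch below is illustrative, not a published metric:
```python
import numpy as np

def occupancy_iou(pred: np.ndarray, ref: np.ndarray) -> float:
    """pred, ref: boolean (X, Y, Z) voxel grids in the same frame.
    Low IoU flags a scene for targeted data collection and retraining."""
    intersection = np.logical_and(pred, ref).sum()
    union = np.logical_or(pred, ref).sum()
    return float(intersection) / max(float(union), 1.0)
```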
18. Data Engine
The Closed-Loop System
Tesla's Data Engine is the flywheel that continuously improves the perception stack:
Deploy Model -> Fleet Driving -> Shadow Mode + Triggers ->
Identify Weaknesses -> Collect Targeted Data -> Auto-Label ->
Retrain Model -> Validate -> Deploy Updated Model
Shadow Mode
What it is: The FSD neural network runs silently in the background on all Tesla vehicles (not just FSD subscribers), comparing the system's predicted driving action against the human driver's actual action.
How it works:
- FSD receives the same camera inputs as the human driver
- FSD computes what it would do (steering, acceleration, braking)
- If FSD's prediction diverges significantly from the human's action, a disagreement event is flagged
- The flagged clip (video + telemetry) is uploaded to Tesla's servers
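A minimal sketch of the disagreement check, with thresholds and field names chosen purely for illustration:
```python
from dataclasses import dataclass

@dataclass
class Action:
    steering: float  # steering angle, rad
    accel: float     # longitudinal acceleration, m/s^2

def is_disagreement(human: Action, shadow: Action,
                    steer_tol: float = 0.05, accel_tol: float = 1.0) -> bool:
    """Flag the surrounding clip for upload when the shadow model's
    predicted action diverges significantly from the human's action."""
    return (abs(human.steering - shadow.steering) > steer_tol
            or abs(human.accel - shadow.accel) > accel_tol)
```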
Scale: With 9+ million vehicles running Shadow Mode, Tesla effectively has the world's largest passive data-gathering network for autonomous driving.
Trigger-Based Data Collection
Tesla uses 221+ manually implemented triggers (as of CVPR 2021) to identify scenarios worth collecting; a sketch of the mechanism follows the table:
| Trigger Category | Examples | Priority |
|---|---|---|
| Hard clips | AEB activation, sudden steering correction, collision | Highest -- novel and safety-critical |
| Soft clips | Model prediction diverges from human action | High -- systematic weaknesses |
| Shadow disagreements | Background FSD would have driven differently | Medium -- general improvement |
| Novelty detection | Object detector encounters unprecedented input | High -- data distribution gaps |
| Uncertainty estimation | Model outputs low-confidence prediction | Medium -- model improvement |
| Scenario-based | Unprotected left turns, construction zones, school zones | Configurable -- targeted collection |
| Deep-learning queries | Specific objects (bears, construction equipment) or situations (driving into sun, tunnel entry/exit) | Configurable -- specific gap filling |
| Sensor contention | Detection inconsistency between cameras | High -- system validation |
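Conceptually, each trigger is a cheap predicate evaluated over on-vehicle telemetry, and any firing trigger marks the surrounding clip for upload. A minimal illustrative sketch (all field names hypothetical):
```python
# Each trigger maps a name to a predicate over one frame of telemetry.
TRIGGERS = {
    "aeb_activation":  lambda f: f["aeb_active"],
    "hard_steer":      lambda f: abs(f["steer_rate"]) > 1.0,  # rad/s
    "shadow_disagree": lambda f: f["shadow_divergence"],
    "low_confidence":  lambda f: f["detector_confidence"] < 0.3,
}

def fired_triggers(frame: dict) -> list[str]:
    """Return the names of all triggers that fire on this frame; a
    non-empty result marks the surrounding clip for upload."""
    return [name for name, pred in TRIGGERS.items() if pred(frame)]
```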
Clip Mining
Active learning identifies the most informative training examples from the massive pool of uploaded clips:
- Searches for scenarios similar to known failure modes
- Prioritizes rare, safety-critical situations underrepresented in training data
- Examples: unusual intersection geometry, rare weather conditions, novel obstacle types
- Can search across the entire fleet's historical data for matching situations
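A common way to implement such a search -- offered here as a hedged sketch, since Tesla has not published its retrieval method -- is nearest-neighbor lookup over clip embeddings:
```python
import numpy as np

def mine_clips(clip_embs: np.ndarray, query: np.ndarray, k: int = 100):
    """clip_embs: (N, D) L2-normalized clip embeddings; query: (D,)
    L2-normalized embedding of a known failure mode. Returns the indices
    of the k most similar clips for labeling and retraining."""
    sims = clip_embs @ query  # cosine similarity
    return np.argsort(-sims)[:k]
```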
Data Scale
| Metric | Value |
|---|---|
| Fleet size | 9+ million vehicles |
| FSD subscribers | 1.1 million (Q4 2025) |
| Data collection vehicles | All Tesla vehicles (via Shadow Mode) |
| Video clips processed | 400,000 per second (fleet-wide) |
| Driving data equivalent generated daily | ~500 years |
| Training data per cycle | 1.5+ petabytes |
| Training video clips | 10+ million curated clips |
19. Model Compilation & Inference
FSD Chip NPU Architecture (HW3)
The Neural Processing Unit is the largest and most important component on the FSD chip:
| Parameter | Specification |
|---|---|
| NPU count per chip | 2 |
| Chips per vehicle | 2 (dual redundant) |
| Total NPUs per vehicle | 4 |
| MAC array size | 96 x 96 = 9,216 MACs per NPU |
| MAC array type | Independent single-cycle feedback loops (NOT a systolic array; no inter-cell data shifting) |
| Data precision | 8-bit integer multiply with 32-bit integer accumulate |
| Clock speed | 2 GHz (production frequency) |
| Peak performance per NPU | 36.86 TOPS (INT8) |
| Total performance per chip | 73.7 TOPS |
| Total performance per vehicle | ~144 TOPS (dual chip) |
| Local SRAM cache | 32 MiB per NPU, highly banked |
| Cache read bandwidth | 384 bytes/cycle (256B data + 128B weights) |
| Cache peak bandwidth | 768 GB/s per NPU (384 bytes/cycle x 2 GHz) |
| Write-back throughput | 128 bytes/cycle to SRAM |
| NPU power | 7.5 W per NPU (~21% of chip power budget) |
| Power efficiency | ~4.9 TOPS/W |
| ISA | 8 instructions total (2 DMA, 3 dot-product variants, scale, element-wise add) |
| Instruction width | 32 to 256 bytes |
Processing Pipeline:
Data loaded from DRAM -> SRAM cache -> MAC array -> SIMD unit
(sigmoid, tanh, argmax) -> Pooling unit (2x2, 3x3) ->
Write-combine buffer -> SRAM (NO DRAM interaction after initial load)
The key design principle: once data is loaded into the 32 MiB SRAM, all compute happens in SRAM -- no DRAM interaction until the final output. This eliminates the memory bandwidth bottleneck that limits many neural network accelerators.
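The NPU's arithmetic contract (8-bit multiplies, 32-bit accumulation) can be modeled in a few lines of numpy; this illustrates the precision rules only, not the 96x96 MAC tiling or SRAM scheduling:
```python
import numpy as np

def int8_matmul(a_int8: np.ndarray, b_int8: np.ndarray) -> np.ndarray:
    """a: (M, K) int8 activations; b: (K, N) int8 weights.
    Products are widened to int32 before accumulation, mirroring the
    hardware's 8-bit multiply / 32-bit accumulate datapath, so sums of
    tens of thousands of worst-case products cannot overflow."""
    return a_int8.astype(np.int32) @ b_int8.astype(np.int32)
```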
Model Compilation
Tesla uses a custom neural network compiler (not TensorRT):
Two-Stage Compilation:
- Coarse Pass: Topology mapping -- partitions the neural network into tile-based workloads, mapping each tile's weight matrices to specific SRAM banks
- Fine Pass: Weight pruning and quantization, producing a highly optimized binary for the target chip revision
Compiler Capabilities:
- Layer fusion: Combines conv-scale-activation-pooling operations to maximize data reuse
- SRAM bank allocation: The compiler maps individual SRAM banks, controlling memory layout at the hardware level
- Quantization-aware compilation: Models are pre-quantized to INT8 during compilation
- Hardware-specific optimization: Different binaries for HW3, HW4, and future AI5
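Layer fusion of this kind often reduces to algebra on the weights. As one plausible example (a sketch, not Tesla's compiler), a per-channel scale -- e.g. a folded batch-norm -- can be absorbed into the preceding convolution so the fused op touches SRAM only once:
```python
import numpy as np

def fold_scale_into_conv(w: np.ndarray, b: np.ndarray, s: np.ndarray):
    """w: (C_out, C_in, kH, kW) conv weights; b: (C_out,) bias;
    s: (C_out,) per-channel scale. Because convolution is linear,
    (conv(x; w) + b) * s == conv(x; w * s) + b * s."""
    return w * s[:, None, None, None], b * s
```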
Quantization Strategy
| Aspect | Detail |
|---|---|
| Training precision | FP32 / BF16 (in data center on GPUs) |
| Deployment precision | INT8 (on-vehicle NPU) |
| Quantization method | Quantization-aware training (QAT) |
| Weight storage | 8-bit integers |
| Activation storage | 8-bit integers |
| Accumulation | 32-bit integers |
Quantization-Aware Training (QAT):
- During training in the data center, the network is trained with simulated INT8 quantization
- The model learns to be robust to the precision loss of INT8 representation
- This produces higher-quality INT8 models than post-training quantization
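The core QAT building block is "fake quantization": the forward pass rounds tensors to INT8 levels while the backward pass uses a straight-through estimator. Tesla's exact scheme is not public; the PyTorch sketch below uses a per-tensor scale for brevity:
```python
import torch

def fake_quant_int8(x: torch.Tensor) -> torch.Tensor:
    """Simulate symmetric INT8 quantization during training."""
    scale = x.abs().max().clamp(min=1e-8) / 127.0
    q = torch.clamp(torch.round(x / scale), -128, 127) * scale
    # Straight-through estimator: quantized values in the forward pass,
    # identity gradient in the backward pass.
    return x + (q - x).detach()
```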
Multi-Hardware Deployment
Tesla's patent "System and method for adapting a neural network model on a hardware platform" describes how a single high-precision model is adapted for different hardware tiers:
| Hardware | Model Adaptation |
|---|---|
| HW3 | Aggressive pruning + INT8 quantization; lighter model variants; constrained to ~144 TOPS budget |
| HW4 | Full model with INT8 quantization; 3--5x more FLOPS budget; native 5.4 MP resolution |
| AI5 (future) | Full model with potentially mixed precision; 10x HW4 FLOPS; up to 800 W power envelope |
Inference Latency
| Version | Hardware | Approximate Latency |
|---|---|---|
| FSD v12 | HW3 | Runs within frame budget at 36 fps (~28 ms per frame) |
| FSD v13 | HW3 | Does NOT fit; HW3 stays on v12.6 |
| FSD v13 | HW4 | Runs within frame budget (3x parameters vs v12) |
| FSD v14 | HW4 | Runs within budget (10x parameters vs v12; likely uses efficient architectures like MoE) |
On-Vehicle Deployment
- Containerization: Trained and compiled model is packaged into a lightweight inference package
- Canary Release: Internal test vehicles receive the update first; telemetry monitored for anomalies
- Staged Rollout: ~10% of fleet per day, monitoring for regressions
- OTA Delivery: Models distributed via over-the-air software updates
20. Key Architectural Evolution (HW2 to AI5)
HW1: Mobileye Era (2014--2016)
| Aspect | Detail |
|---|---|
| Chip | Mobileye EyeQ3 (40 nm) |
| Cameras | 1 forward-facing |
| Perception | Mobileye's proprietary mono-camera perception |
| Capabilities | Lane keeping, basic forward collision warning |
| Tesla control | Minimal -- Mobileye provided a black-box solution |
| Limitation | Tesla had no ability to modify or improve the perception stack |
HW2 / HW2.5: NVIDIA Era (2016--2019)
| Aspect | Detail |
|---|---|
| Chip | NVIDIA Drive PX2 (Parker SoC, 16 nm) |
| Performance | ~24 TOPS |
| Cameras | 8 external cameras (first full surround vision) |
| Radar | 1 forward-facing radar |
| Ultrasonics | 12 ultrasonic sensors |
| Perception | Tesla's first in-house neural networks; per-camera CNNs |
| Architecture | Early convolutional networks; no multi-camera fusion; per-camera processing |
| Frame rate | ~110 FPS processing capacity |
| Limitation | Insufficient compute for real-time multi-camera fusion or BEV transformers |
HW3: Tesla's Own Silicon (2019--2023)
| Aspect | Detail |
|---|---|
| Chip | Tesla FSD Chip (14 nm Samsung, dual chip) |
| Performance | 144 TOPS (dual chip) |
| Cameras | 8 x 1.2 MP (AR0136AT) at 36 fps |
| Perception | HydraNet (shared backbone + multi-task heads); BEV transformer; occupancy networks |
| Architecture | RegNet backbone, BiFPN, transformer-based BEV projection, spatial RNN |
| Radar | Removed May 2021 |
| Ultrasonics | Removed Oct 2022 |
| Peak version | FSD v12.6 (end-to-end, but constrained by compute budget) |
| Limitation | Cannot run v13+ models (3x+ parameters); stuck on v12.6 branch |
Key Architectural Innovations on HW3:
- First deployment of BEV transformers in production vehicles (2021)
- First deployment of occupancy networks in production vehicles (2022)
- First end-to-end neural network for autonomous driving (v12, 2023)
- Spatial RNN for persistent environmental memory
- Multi-camera transformer fusion
HW4 (AI4): Next-Generation Silicon (2023--present)
| Aspect | Detail |
|---|---|
| Chip | Tesla FSD Chip 2 (7 nm Samsung, dual chip) |
| Performance | 2.5--8x HW3 (~360--1,150 TOPS estimated) |
| Memory | 16 GB GDDR6 (2x HW3); 224 GB/s bandwidth (3.3x HW3) |
| Storage | 256 GB NVMe (4x HW3) |
| Cameras | 7--8 x 5.4 MP (Sony IMX490/IMX963) |
| Camera inputs | Up to 13 supported |
| Perception | Full v13/v14 E2E model; native 5.4 MP resolution; 3--10x parameter models |
Key Perception Changes Enabled by HW4:
- Native 5.4 MP processing: HW3 was limited to 1.2 MP; HW4 runs FSD at full sensor resolution (FSD v13.2.1 was the first version to use native resolution for all cameras)
- Temporal-Voxel Transformer: v13's 10-second video buffer requires substantially more compute and memory than HW3 can provide
- 8x voxel resolution increase: Only possible with HW4's increased TOPS and memory bandwidth
- Larger model capacity: v14's 10x parameter model requires HW4's 3--5x FLOPS increase
Initial HW4 Software Challenge: HW4 initially ran FSD by emulating HW3 -- downsizing the 5.4 MP camera images to 1.2 MP and running the HW3 model. It took approximately 6 months before HW4-specific neural networks were trained and deployed. This highlights the challenge of training new models for a new sensor configuration while maintaining fleet compatibility.
AI5: The Next Leap (Planned 2027)
| Aspect | Detail |
|---|---|
| Process | 5 nm (TSMC, Arizona) |
| Performance | ~10x HW4 (estimated 2,000+ TOPS) |
| Power | Up to 800 W when processing complex environments (vs. 100 W HW3, 160 W HW4) |
| Capability | Inference + training capable on-vehicle |
| Status | Production pushed to early 2027; design complete |
| Significance | Planned as the "last hardware iteration installed in vehicles" |
AI5 Perception Implications:
- Potentially enables on-device fine-tuning (not just inference)
- May support full-precision inference (FP16 or BF16) rather than INT8 quantization
- Could run perception models 10x+ larger than current v14
- Camera input support for even higher resolution sensors or additional cameras
Software-Hardware Divergence (Late 2024+)
Starting November 2024, Tesla began shipping different FSD versions for different hardware:
- HW3 vehicles: capped at FSD v12.6.x (no upgrade path to v13 or v14)
- HW4 vehicles: receive FSD v13.x and v14.x
- This represents a permanent architectural fork -- HW3 cannot run the models that HW4 enables
21. Model Sizes & Performance Numbers
Known Parameter Counts
| Version | Approximate Parameters | Source |
|---|---|---|
| Pre-v12 HydraNet (48 networks) | ~10 million total (across all heads) | Industry estimates |
| FSD v12 E2E model | ~10 million (initial E2E baseline) | Industry reporting |
| FSD v13 | ~30 million (3x v12) | Tesla release notes |
| FSD v14 (initial) | ~45 million (4.5x v12) | Ashok Elluswamy |
| FSD v14 (full) | ~100 million (10x v12) | Musk / Elluswamy |
| Future AI5 models | ~1 billion+ (speculative) | Extrapolation from trends |
Note: These numbers are remarkably small by LLM standards (GPT-4 has ~1.7 trillion parameters). This reflects the efficiency of vision models and the extreme latency constraints of real-time autonomous driving.
Hardware Compute Budgets
| Hardware | TOPS (INT8) | Memory | Memory BW | Power |
|---|---|---|---|---|
| HW3 (dual chip) | 144 | 8 GB LPDDR4 | 68 GB/s | ~72 W |
| HW4 (dual chip) | ~360--1,150 (est.) | 16 GB GDDR6 | 224 GB/s | ~160 W |
| AI5 (planned) | ~2,000+ (est.) | TBD | TBD | up to 800 W |
Inference Characteristics
| Metric | Value |
|---|---|
| Camera frame rate | 36 fps |
| Per-frame latency budget | ~28 ms (at 36 fps) |
| Occupancy network speed | >100 fps (3x+ faster than cameras) |
| ISP throughput | 1 billion pixels/sec |
| Camera input bandwidth | Up to 2.5 billion pixels/sec |
| Total pixels processed per second (HW3) | ~353 million (8 cams x 1.2 MP x 36 fps) |
| Total pixels processed per second (HW4) | ~1.56 billion (8 cams x 5.4 MP x 36 fps) |
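The last two rows follow directly from the camera specs (taking "1.2 MP" as 1280 x 960, which matches the ~353 million figure):
```python
hw3_pixels = 8 * (1280 * 960) * 36  # = 353,894,400   (~353 million/s)
hw4_pixels = 8 * 5_400_000 * 36     # = 1,555,200,000 (~1.56 billion/s)
```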
Convolutional Operations Dominance
On the HW3 NPU, convolutional operations account for 98.1% of all operations -- reflecting the heavy use of conv-based architectures (RegNet backbone, BiFPN, deconvolution layers). The remaining 1.9% covers attention mechanisms, normalization, and non-linear activations.
22. Patents
Key Perception-Related Tesla Patents
| Patent / Application | Title | Key Technical Focus |
|---|---|---|
| US20240185445A1 | Artificial Intelligence Modeling Techniques for Vision-Based Occupancy Determination | Core occupancy network patent; describes camera-to-voxel pipeline, 33 cm default voxel size with 10 cm refinement, SDF-based shape prediction, temporal fusion, four categories of occupancy output |
| WO2025193615 | AI Modeling Techniques for Vision-Based High-Fidelity Occupancy Determination and Assisted Parking | Extended occupancy patent for parking applications; sub-voxel accuracy; camera-only operation |
| WO2024073033A1 | Automated Data Labeling System | Auto-labeling pipeline using fleet data; 3D environment reconstruction; automated label generation from multi-trip aggregation |
| WO2019245618 | Data Pipeline and Deep Learning System for Autonomous Driving | Multi-layered image processing; preserves sensor data fidelity; avoids compression/downsampling that reduces signal quality |
| US20230057509A1 | Vision-Based Machine Learning Model for Autonomous Driving with Adjustable Virtual Camera | Calibration and virtual camera system; adjustable viewpoint for training data augmentation |
| System and Method for Obtaining Training Data (Karpathy, sole inventor) | Fleet-sourced training data | Trigger classifiers on intermediate neural network results determine which sensor data to transmit from vehicles to cloud |
| System and Method for Adapting a Neural Network Model on a Hardware Platform | Multi-hardware model deployment | Quantization-aware training; compiler toolchains for CPU/GPU/NPU optimization; single model adapted for HW3/HW4/AI5 |
| Systems and Methods for Training Machine Models with Augmented Data | Training data augmentation | Augmented camera images for generalization; improved robustness for object detection |
| Estimating Object Properties Using Image Data | Object property estimation | Depth, size, velocity estimation from camera images |
Patent Coverage Areas
Tesla's perception patent portfolio covers:
- 3D Occupancy Prediction: Camera-to-voxel transformation, SDF-based rendering, adaptive resolution
- Auto-Labeling: Fleet-sourced data, trigger-based collection, 4D reconstruction, automated labeling
- Neural Network Optimization: Hardware-adaptive deployment, quantization, pruning, compiler optimization
- Image Processing: Raw data pipeline, ISP optimization, multi-resolution processing
- Depth Estimation: Vision-based distance detection, monocular and stereo approaches
- Data Augmentation: Synthetic data generation, augmented training images
- FSD Visualization: 3D rendering of perception outputs for driver display
- Calibration: Virtual camera systems, online extrinsic calibration
23. Published Talks & Presentations
Major Presentations
Autonomy Day (April 2019)
| Detail | Content |
|---|---|
| Presenters | Elon Musk, Pete Bannon, Stuart Bowers |
| Key reveals | HW3 FSD Chip architecture; 144 TOPS; custom silicon strategy; dual-chip redundancy; ISP with 1B pixel/sec throughput |
| Perception content | Overview of camera-based perception; plans for end-to-end learning |
AI Day 2021 (August 2021)
| Detail | Content |
|---|---|
| Presenters | Andrej Karpathy (vision), others (Dojo, planning, bot) |
| Key perception reveals | HydraNet multi-task architecture; RegNet backbone with multi-scale features (160x120 to 20x15); BiFPN for feature fusion; transformer-based multi-camera fusion; BEV projection with positional encodings (sine/cosine via MLP); feature queue (time-based at 27 ms, space-based at 1 m); spatial RNN with 2D lattice hidden state; calibration neural network for fleet-wide virtual camera normalization; 48 networks producing 1,000 output tensors |
| Technical depth | Most detailed public disclosure of Tesla's perception architecture to date |
Andrej Karpathy -- CVPR 2021 Workshop on Autonomous Vehicles (June 2021)
| Detail | Content |
|---|---|
| Title | "Tesla's Vision-Only Approach to Autonomous Driving" |
| Key technical details | Vision-only superiority over sensor fusion (higher precision and recall); 221 manually implemented triggers for fleet data collection; auto-labeling with heavyweight offline networks; 4D vector space labeling (100x efficiency); compute cluster: 720 nodes x 8 A100 GPUs = 5,760 GPUs, 10 PB NVMe, 640 Tbps networking; team of ~20 engineers training one neural network full-time |
| Scalability argument | HD LiDAR maps are unscalable; vision generalizes to any road |
Andrej Karpathy -- CVPR 2022 Workshop (June 2022)
| Detail | Content |
|---|---|
| Key technical details | Scaling training data; advanced auto-labeling techniques; multi-trip aggregation for 4D reconstruction; fleet averaging for robust 3D scene reconstruction |
AI Day 2022 (September 2022)
| Detail | Content |
|---|---|
| Presenters | Tesla AI team |
| Key perception reveals | Occupancy Networks (3D voxel predictions); occupancy flow; SDF-based geometry; >100 FPS performance; planning architecture (MCTS + neural network planner); cost function details |
| Significance | First public description of occupancy networks; sparked academic research wave |
Ashok Elluswamy -- CVPR 2023 (Keynote)
| Detail | Content |
|---|---|
| Title | "Building Foundational Models for Robotics at Tesla" (CVPR workshop) |
| Key content | First external presentation after Karpathy's departure; occupancy network architecture details; vision-based perception for both vehicles and robots |
Ashok Elluswamy -- ICCV 2024 (October 2024)
| Detail | Content |
|---|---|
| Title | "Building Foundational Models for Robotics at Tesla" |
| Key technical details | End-to-end FSD architecture confirmed; "billions of input tokens" from cameras, navigation maps, kinematic data; "Niagara Falls of data" from fleet (500 years of driving daily); neural world simulator generating 8 camera feeds simultaneously; Gaussian Splatting for 3D debugging; same architecture transfers to Optimus robot; auxiliary output probing for interpretability (3D occupancy, road boundaries, objects, signs, traffic lights); natural language querying for decision explanation |
| Multi-modal inputs | Camera video + navigation data + vehicle motion state + (later) audio |
| Output diversity | Panoramic segmentation, 3D occupancy, 3D Gaussian rendering, language output, action inference |
| Significance | First time Tesla publicly confirmed the full end-to-end architecture externally; most detailed post-Karpathy technical disclosure |
Ashok Elluswamy -- 2025 Presentations
| Detail | Content |
|---|---|
| Topics covered | Neural world simulator details; unified FSD + Optimus architecture; Generative Gaussian Splatting (~220 ms, no initialization needed, models dynamic objects); FSD v14 auto-regressive transformers; audio input integration; model scaling roadmap |
Key Technical Blog Posts and Analyses
| Source | Title/Topic | Key Contribution |
|---|---|---|
| ThinkAutonomous | "Tesla's HydraNet - How Tesla's Autopilot Works" | Detailed multi-stage pipeline breakdown; temporal processing; task head architecture |
| ThinkAutonomous | "A Look at Tesla's Occupancy Networks" | Architecture details; SDF explanation; NeRF validation; fleet averaging |
| ThinkAutonomous | "Tesla's Transition from Modular to E2E Deep Learning" | MCTS planning details; perception-planning integration; joint training |
| WikiChip Fuse | "Inside Tesla's Neural Processor in the FSD Chip" | NPU MAC array (96x96); ISA (8 instructions); SRAM architecture; power analysis |
| Kimbo Chen | "Tesla AI Day - Vision" | Feature queue specifics (27 ms time, 1 m space push rules); BEV positional encoding details; spatial RNN lattice architecture |
Appendix: Architecture Diagram (Text)
Phase 2 Architecture (2021--2023, Modular)
                +-----------+
                | 8 Cameras |
                | (36 fps)  |
                +-----+-----+
                      |
                +-----v------+
                | Calibration|
                | Neural Net |
                | (rectify)  |
                +-----+------+
                      |
          +-----------v-----------+
          |    RegNet Backbone    |
          |  (per-camera, multi-  |
          |   scale: 160x120 ->   |
          |   20x15, ~64-512 ch)  |
          +-----------+-----------+
                      |
          +-----------v-----------+
          |         BiFPN         |
          |     (weighted bi-     |
          |    directional FPN)   |
          +-----------+-----------+
                      |
          +-----------v-----------+
          |     Multi-Camera      |
          |  Transformer Fusion   |
          |   (cross-attention,   |
          |   positional enc.)    |
          +-----------+-----------+
                      |
          +-----------v-----------+
          |     Feature Queue     |
          |    (time: 27ms,       |
          |    space: 1m push)    |
          +-----------+-----------+
                      |
          +-----------v-----------+
          |      Spatial RNN      |
          |     (2D lattice,      |
          |  kinematics-aligned)  |
          +-----------+-----------+
                      |
     +----------------+----------------+
     |                |                |
+----v----+    +------v------+  +------v------+
| Object  |    |  Occupancy  |  |  Lane/Road  |
| Detect. |    |   Network   |  |  Geometry   |
|  Head   |    |    Head     |  |    Head     |
+---------+    +-------------+  +-------------+
     |                |                |
+----v----+    +------v------+  +------v------+
| Traffic |    |    Depth    |  |  Drivable   |
| Lights  |    | Estimation  |  |   Space     |
|  Head   |    |    Head     |  |    Head     |
+---------+    +-------------+  +-------------+
     |                |                |
     +----------------+----------------+
                      |
          +-----------v-----------+
          |    3D Vector Space    |
          |    (unified scene     |
          |    representation)    |
          +-----------+-----------+
                      |
          +-----------v-----------+
          |   MCTS + Neural Net   |
          |        Planner        |
          +-----------+-----------+
                      |
          +-----------v-----------+
          |   Vehicle Controls    |
          | (steering, throttle,  |
          |        brake)         |
          +-----------------------+
Phase 3 Architecture (v12+, End-to-End)
         +------------------+
         |   8 Cameras x    |
         | N frames (10 sec)|
         |   + Navigation   |
         |   + Kinematics   |
         |  + Audio (v14+)  |
         +--------+---------+
                  |
         +--------v---------+
         |                  |
         |  Single Unified  |
         |  Neural Network  |
         |                  |
         |  (Vision Trans-  |
         |   former based,  |
         |  auto-regressive |
         |   transformer,   |
         |   ~100M params   |
         |      v14)        |
         |                  |
         |  Internal:       |
         |  - BEV features  |
         |  - Occupancy     |
         |  - Object repr.  |
         |  - Lane repr.    |
         |  - Traffic state |
         |                  |
         +--------+---------+
                  |
     +------------+------------+
     |            |            |
+----v---+  +-----v----+  +----v----+
|Steering|  |Accel/    |  |  Brake  |
|Angle   |  |Throttle  |  | Command |
+--------+  +----------+  +---------+
Auxiliary Outputs (for debugging/safety):
- 3D Occupancy map
- Detected objects
- Lane boundaries
- Traffic light states
- Natural language explanations (v14+)
Glossary
| Term | Definition |
|---|---|
| BEV | Bird's Eye View -- top-down spatial representation centered on the vehicle |
| BiFPN | Bi-directional Feature Pyramid Network -- multi-scale feature fusion mechanism |
| E2E | End-to-End -- single differentiable model from sensor input to control output |
| HydraNet | Tesla's multi-task learning architecture with shared backbone and task-specific heads |
| ISP | Image Signal Processor -- hardware that converts raw sensor data to viewable images |
| MAC | Multiply-Accumulate -- fundamental operation in neural network inference |
| MCTS | Monte-Carlo Tree Search -- planning algorithm that explores trajectory trees |
| NPU | Neural Processing Unit -- specialized hardware accelerator for neural network inference |
| QAT | Quantization-Aware Training -- training with simulated lower precision |
| RCCC | Red-Clear-Clear-Clear -- color filter array used on Tesla cameras |
| RegNet | Regularized Network -- efficient CNN backbone architecture |
| SDF | Signed Distance Field -- continuous function representing distance to nearest surface |
| TOPS | Tera Operations Per Second -- measure of neural network accelerator performance |
| VRU | Vulnerable Road User -- pedestrians, cyclists, and similar road users |
This document was compiled from publicly available information including Tesla AI Day 2021 and 2022 presentations, Andrej Karpathy's CVPR 2021 and 2022 workshop talks, Ashok Elluswamy's CVPR 2023 and ICCV 2024 presentations, Tesla patent filings (US and WIPO), WikiChip hardware analyses, independent teardown reports, technical blog analyses (ThinkAutonomous, NotATeslaApp, AutopilotReview), and industry reporting as of March 2026.