Wayve Perception Stack: Exhaustive Deep Dive

Last updated: March 2026


Table of Contents

  1. End-to-End Philosophy: Why Wayve Rejects Modular Perception
  2. Learned Perception: What Is Learned End-to-End vs. Modular
  3. GAIA-1 World Model Perception
  4. GAIA-2 Perception Evolution
  5. GAIA-3 Perception Evolution
  6. LINGO-2 Perception: Vision-Language-Action
  7. Camera-Centric Architecture
  8. Sensor Fusion
  9. Depth Estimation
  10. Semantic Understanding
  11. Uncertainty Estimation
  12. PRISM-1 Scene Reconstruction
  13. Temporal Processing
  14. Geographic Generalization
  15. No-HD-Map Perception
  16. Self-Supervised Learning
  17. Foundation Model Architecture
  18. Training Pipeline
  19. Calibration
  20. Key Publications
  21. Key Patents

1. End-to-End Philosophy: Why Wayve Rejects Modular Perception {#1-end-to-end-philosophy}

The Fundamental Divergence

Wayve's perception philosophy represents a radical departure from the modular autonomous driving paradigm. Where companies like Waymo, Aurora, and Cruise decompose driving into a sequential pipeline of perception, prediction, and planning -- each a separate, hand-engineered module with explicit interfaces -- Wayve replaces the entire stack with a single neural network trained end-to-end.

This is not simply an architectural choice. It is a philosophical position rooted in Alex Kendall's academic research: that human-defined intermediate representations impose information bottlenecks that fundamentally limit system performance.

Why Separate Perception Is a Bottleneck

In a traditional AV1.0 stack, the perception module must compress the full richness of sensory data into a fixed, human-defined vocabulary: bounding boxes, lane lines, traffic light states, semantic labels. This compression introduces three critical failure modes:

  1. Information loss at module boundaries. The perception module must decide what is "relevant" before passing data downstream. Objects or scene features that do not fit predefined categories are discarded, even if they are safety-critical (e.g., an unusual road obstacle, a construction barrier arrangement never seen in the label taxonomy).

  2. Error propagation. Errors in perception cascade downstream through prediction and planning. A missed detection in the perception module cannot be recovered later. As McAllister, Kendall, and colleagues argued in "Concrete Problems for Autonomous Vehicle Safety" (IJCAI 2017), propagating uncertainty through the pipeline is essential, but modular systems struggle to do this coherently.

  3. Loss of high-dimensional context. Bounding boxes and semantic labels are low-dimensional summaries of rich visual information. The texture of a road surface, the body language of a pedestrian, the subtle signs of a car about to change lanes -- all of this is lost when perception is compressed into predefined categories.

Wayve's Alternative: Emergent Representations

Instead of human-defined intermediate representations, Wayve generates what they call "emergent AI representations" -- abstract, high-dimensional feature vectors that are learned end-to-end and optimized directly for the driving task. These representations are not interpretable in human terms (they are not "bounding boxes" or "lane lines"), but they maximize the information available for producing safe driving actions.

As Wayve describes it: rather than using human-defined concepts, the system generates "abstract representations of the environment generated by AI through mathematical transformations that are optimally informative for maximizing the learning objective."

This approach is philosophically aligned with the observation in large language models that scaling model size and data diversity yields emergent capabilities that were never explicitly programmed. Wayve bets that a single, sufficiently large neural network trained on diverse driving data will develop internal representations that are richer than anything a human engineer could design.

Comparison to Competitors

| Company | Perception Approach | Explicit Perception Outputs | End-to-End Training |
| --- | --- | --- | --- |
| Wayve | Unified end-to-end model; perception implicit in latent space | Auxiliary outputs decoded from latent states for interpretability only | Full end-to-end from sensors to trajectory |
| Waymo | Historically modular (dedicated LiDAR/camera fusion, detection, tracking); converging toward E2E elements | Explicit 3D bounding boxes, tracks, semantic labels as module outputs | Increasingly end-to-end, but retains module boundaries |
| Aurora | Fully modular with HD maps; Aurora Driver maintains clear module boundaries | Explicit perception outputs from FirstLight LiDAR sensor fusion | Not end-to-end |
| Tesla | Evolved from modular to end-to-end with FSD v12+; philosophically closest to Wayve | Historically explicit (bounding boxes in occupancy network); now more implicit | FSD v12+ uses end-to-end for planning; vision backbone still somewhat modular |
| Mobileye | Modular with RSS safety framework; crowdsourced maps | Explicit perception outputs; SuperVision uses camera-first approach | Not end-to-end |

What Wayve Means by "End-to-End"

It is important to be precise about what Wayve means by end-to-end. The system is not a single monolithic function from pixel values to steering angle. Rather, it is a differentiable neural network with structured internal components:

  • A vision backbone that processes multi-camera images
  • Spatial-temporal reasoning modules
  • A motion planning head that outputs a trajectory

What makes it end-to-end is that:

  1. All components are trained jointly to optimize driving performance
  2. No hand-coded interfaces or fixed representations exist between components
  3. The internal representation is free to learn whatever features are most useful for driving
  4. Gradients flow from the driving loss all the way back through the entire model

The auxiliary outputs (depth, semantics, flow) are decoded from intermediate latent states as additional training signals and for interpretability, but they are not used in the decision pipeline itself.


2. Learned Perception: What Is Learned End-to-End vs. Modular {#2-learned-perception}

The Dual Regime: End-to-End Core + Auxiliary Decoders

Wayve's perception is organized into two regimes:

Regime 1: End-to-End Learned (Core Pipeline)

The following perception tasks are learned implicitly within the end-to-end driving model. They are not separate modules -- they are capabilities that emerge from the latent representation optimized for driving:

  • 3D scene understanding -- the spatial layout of the driving environment
  • Dynamic object recognition and tracking -- understanding of other road users and their behavior
  • Road structure understanding -- lanes, intersections, road edges, without HD maps
  • Traffic state comprehension -- traffic lights, signs, right-of-way
  • Predictive understanding -- anticipation of how the scene will evolve
  • Ego-motion understanding -- where the vehicle is and how it is moving

These are not decoded or evaluated as separate outputs. They exist as capabilities embedded in the model's latent state, evidenced by the model's ability to drive safely through complex scenarios.

Regime 2: Auxiliary Decoders (Interpretability and Training Signals)

The following perception outputs are decoded from the model's intermediate latent states. They serve two purposes: (a) providing additional training signals that accelerate learning (multi-task learning), and (b) enabling human interpretability and safety monitoring.

| Auxiliary Output | Training Signal Type | Purpose |
| --- | --- | --- |
| Semantic segmentation | Supervised (labeled data) | Interpretability; inductive bias for scene understanding |
| Traffic light state detection | Supervised (labeled data) | Safety-critical output; explicit verification |
| Depth estimation | Self-supervised (geometric consistency) | Geometric understanding; 3D scene structure |
| Surface normals | Self-supervised | Geometric reasoning about road and obstacle surfaces |
| Optical flow / motion estimation | Self-supervised (frame-to-frame correspondence) | Dynamic scene understanding |
| Future prediction | Self-supervised | Anticipatory driving; world model capabilities |

Multi-Task Learning: Uncertainty-Weighted Loss Functions

The training of these auxiliary tasks alongside the primary driving task uses a multi-task learning framework directly inspired by Alex Kendall's seminal CVPR 2018 paper, "Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics" (Kendall, Gal, Cipolla).

In this framework, each task's loss function is weighted by a learned homoscedastic uncertainty parameter. Rather than manually tuning the relative weight of depth loss vs. segmentation loss vs. driving loss, the model learns the optimal weighting automatically. Tasks with higher inherent noise (aleatoric uncertainty) receive lower weight, preventing noisy signals from dominating the gradient.

This principled multi-task approach was demonstrated to learn per-pixel depth regression, semantic segmentation, and instance segmentation from a monocular input simultaneously -- precisely the combination needed for autonomous driving perception.
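
To make the weighting mechanism concrete, the following is a minimal PyTorch sketch of homoscedastic uncertainty weighting in the spirit of the CVPR 2018 formulation. The task names and loss values are illustrative placeholders, and the loss form is simplified (the paper uses slightly different constants for regression and classification terms); this is not Wayve's production objective.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Kendall-Gal-Cipolla style loss weighting (simplified sketch).

    Each task i gets a learned log-variance s_i = log(sigma_i^2); the combined
    loss is sum_i exp(-s_i) * L_i + s_i, so inherently noisy tasks are
    down-weighted automatically instead of hand-tuning the weights.
    """

    def __init__(self, task_names):
        super().__init__()
        # One learnable log-variance per task, initialized to 0 (sigma = 1).
        self.log_vars = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(())) for name in task_names}
        )

    def forward(self, task_losses):
        total = 0.0
        for name, loss in task_losses.items():
            s = self.log_vars[name]
            total = total + torch.exp(-s) * loss + s
        return total

# Illustrative usage with hypothetical per-task losses.
criterion = UncertaintyWeightedLoss(["driving", "depth", "segmentation"])
losses = {
    "driving": torch.tensor(0.8),       # e.g. trajectory imitation loss
    "depth": torch.tensor(0.25),        # self-supervised photometric loss
    "segmentation": torch.tensor(1.4),  # supervised cross-entropy
}
combined = criterion(losses)  # also differentiable w.r.t. the log-variances
```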

Five Training Objectives

Wayve's "Reimagining an Autonomous Vehicle" paper (Hawke, E, Badrinarayanan, Kendall, 2021) and associated blog posts describe five distinct training objectives combined in the multi-task framework:

  1. Imitation learning -- learning to mimic expert human driving behavior from recorded data
  2. Reinforcement learning -- learning from safety driver interventions and corrective actions
  3. Safety driver corrections -- learning from post-intervention corrective driving
  4. Dynamics modeling and future prediction -- learning to predict future states from off-policy data
  5. Computer vision representations -- learning semantic, geometric, and motion representations via supervised and self-supervised signals

What Makes This Different from Modular Multi-Task

In a modular system, multi-task learning might train a shared backbone to produce multiple outputs (detection, segmentation, depth), but those outputs are then consumed by downstream modules with fixed interfaces. In Wayve's system, the multi-task outputs are auxiliary -- the primary output is the driving trajectory, and the internal representation is free to develop features that do not correspond to any of the auxiliary tasks if they are useful for driving.


3. GAIA-1 World Model Perception {#3-gaia-1-world-model-perception}

How Perception Is Implicit in World Modeling

GAIA-1 (Generative AI for Autonomy, 9.1 billion parameters) is Wayve's first-generation generative world model. While it is not a perception model per se -- it is a generative model that produces video -- it demonstrates a deep form of implicit perception. To generate realistic driving videos, the model must have learned to perceive and understand:

  • 3D scene geometry -- objects maintain correct size, perspective, and occlusion relationships
  • Dynamic behavior -- vehicles accelerate, brake, and turn realistically
  • Temporal consistency -- scenes evolve coherently over time
  • Environmental conditions -- weather, lighting, road surfaces are rendered accurately
  • Causal relationships -- the generated future depends correctly on the conditioned ego-actions

This "perception through generation" is a key insight: a model that can accurately generate the future of a driving scene must have internalized a representation of that scene's structure, dynamics, and semantics.

Architecture

GAIA-1 consists of two components:

Component 1: World Model (6.5B parameters)

An autoregressive transformer that predicts the next set of image tokens in a sequence, conditioned on three modalities:

  • Video encoder (0.3B parameters): Discretizes each video frame using a VQ-VAE (Vector Quantized Variational Autoencoder). Each frame (resized to 9:16 aspect ratio) is encoded into 576 discrete tokens drawn from a learned codebook. The VQ approach converts continuous pixel data into a sequence of discrete symbols, enabling the transformer to treat video generation as a sequence prediction problem analogous to language modeling.

  • Text encoder: Discretizes and embeds natural language descriptions of driving scenarios into the shared representation space.

  • Action encoder: Projects scalar action values (steering angle, throttle, brake) into the shared representation space via learned projections.

All three modalities are projected into a shared representation space and temporally aligned. The autoregressive transformer then predicts future image tokens conditioned on this multimodal context.

Component 2: Video Diffusion Decoder (2.6B parameters)

A denoising video diffusion model that translates the predicted discrete image tokens back into pixel space. Critically, this operates on sequences of frames (not individual frames) to ensure temporal consistency. The diffusion process models frame sequences jointly, preventing temporal discontinuities that would arise from independent per-frame generation.
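
Wayve has not published GAIA-1's implementation, but the sequence-modeling formulation described above can be sketched in a few lines: each frame becomes a fixed-length sequence of discrete tokens, and a causal transformer predicts the next tokens given the multimodal context. All sizes below (codebook size, model width, depth) are assumed toy values, not GAIA-1's.

```python
import torch
import torch.nn as nn

TOKENS_PER_FRAME = 576     # as described above for GAIA-1
VOCAB_SIZE = 8192          # assumed VQ codebook size (not published here)
D_MODEL = 512              # toy width; the real world model is 6.5B parameters

class TinyWorldModel(nn.Module):
    """Autoregressive next-token predictor over interleaved
    text / action / image tokens, in the spirit of GAIA-1."""

    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        # tokens: (batch, seq_len) discrete ids for past frames/text/actions
        seq_len = tokens.shape[1]
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        x = self.token_emb(tokens)
        x = self.backbone(x, mask=causal)
        return self.head(x)  # logits for the next token at every position

model = TinyWorldModel()
past = torch.randint(0, VOCAB_SIZE, (1, 2 * TOKENS_PER_FRAME))  # two frames
logits = model(past)  # the next frame's tokens are decoded autoregressively
```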

Perception Capabilities Demonstrated

| Capability | Evidence |
| --- | --- |
| 3D geometry understanding | Generated scenes maintain correct perspective, depth ordering, and object scaling |
| Occlusion reasoning | Objects correctly appear and disappear as they move behind other objects |
| Dynamic agent modeling | Other vehicles and pedestrians move with physically plausible dynamics |
| Environmental understanding | Accurate rendering of weather, lighting, road conditions from text prompts |
| Action-conditioned prediction | Ego-vehicle behavior correctly responds to input action tokens |
| Scene composition | Complex multi-agent urban scenes generated with correct spatial relationships |

Training

| Specification | Value |
| --- | --- |
| Total parameters | ~9.1B (6.5B world model + 2.6B decoder) |
| World model training | 15 days on 64x NVIDIA A100 GPUs |
| Video decoder training | 15 days on 32x NVIDIA A100 GPUs |
| Training data | 4,700 hours of proprietary London driving data (2019--2023) |
| Video encoder parameters | 0.3B |
| Tokens per frame | 576 discrete tokens |
| Aspect ratio | 9:16 |

4. GAIA-2 Perception Evolution {#4-gaia-2-perception-evolution}

Architectural Leap: Discrete Tokens to Continuous Latent Diffusion

GAIA-2 represents a fundamental architectural shift from GAIA-1. Where GAIA-1 used discrete VQ tokens and autoregressive prediction, GAIA-2 moves to a continuous latent space with a latent diffusion model. This has significant implications for perception quality.

Video Tokenizer Architecture

The video tokenizer is a space-time factorized transformer with an asymmetric encoder-decoder design:

Encoder (85M parameters):

  • Input: raw video frames
  • Two downsampling convolutional blocks:
    • First block: stride 2x8x8, embedding dimension 512
    • Second block: stride 2x2x2, embedding dimension 512
  • 24 spatial transformer blocks (512 dimensions, 16 attention heads)
  • Final convolution (stride 1x2x2) projecting to 2L channels for Gaussian distribution parameters (mean and standard deviation)
  • Total compression: 384x (32x spatial, 8x temporal, latent dimension L=64)
  • Encoder maps 8 frames to a single temporal latent independently (no temporal attention in encoder)

Decoder (200M parameters):

  • Linear projection from latent dimension to 512
  • First upsampling block (stride 1x2x2)
  • 16 space-time factorized transformer blocks (with both spatial and temporal attention)
  • Second upsampling (stride 2x2x2) + 8 additional transformer blocks
  • Final upsampling (stride 2x8x8) to 3 RGB channels
  • Key asymmetry: decoder jointly decodes 3 temporal latents to 24 frames, using temporal context for consistency
  • Rolling inference: for long sequences, overlapping strides generate new frames conditioned on previously generated frames in a sliding window fashion

Training: 300,000 steps, batch size 128, on 128 NVIDIA H100 GPUs. Losses include L1/L2 pixel reconstruction, LPIPS perceptual loss, DINO feature distillation, and KL divergence, with GAN fine-tuning for visual quality.

Latent World Model (8.4B parameters)

The world model is a space-time factorized transformer trained via flow matching (a more stable alternative to standard diffusion):

  • 22 transformer blocks with hidden dimension C=4096 and 32 attention heads
  • Each block contains:
    • Spatial attention (across space and camera views)
    • Temporal attention layer
    • Cross-attention layer (for conditioning)
    • MLP with adaptive layer norm
    • Query-key normalization before attention
  • Flow matching objective: predicts velocity targets v_{t+1:T} = x_{t+1:T} - epsilon_{t+1:T} (see the sketch after this list)
  • Training: 460,000 steps, batch size 256, on 256 NVIDIA H100 GPUs
  • Uses bimodal logit-normal time distribution for training schedule
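
A toy sketch of the two ideas in this list: a space-time factorized block that attends spatially within a frame and then temporally across frames, and a flow-matching style training target that regresses the velocity v = x - epsilon. Dimensions are illustrative, and the block omits cross-attention conditioning and adaptive layer norm; it is not GAIA-2's implementation.

```python
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """Toy space-time factorized block: spatial attention within each frame,
    then temporal attention across frames at each spatial location."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, space, dim) latent tokens
        b, t, s, d = x.shape
        # Spatial attention: fold time into the batch dimension.
        xs = self.norm1(x).reshape(b * t, s, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, s, d)
        # Temporal attention: fold space into the batch dimension.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(b, s, t, d).permute(0, 2, 1, 3)
        return x

# Flow-matching style target: interpolate clean latents x with noise eps
# and regress the velocity v = x - eps, as in the objective above.
x = torch.randn(2, 8, 64, 128)          # (batch, time, space, dim) latents
eps = torch.randn_like(x)
tau = torch.rand(2, 1, 1, 1)            # per-sample noise level in [0, 1]
noisy = tau * x + (1.0 - tau) * eps
velocity_target = x - eps
pred = SpaceTimeBlock()(noisy)          # stand-in for the full world model
loss = torch.mean((pred - velocity_target) ** 2)
```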

Conditioning Mechanisms

GAIA-2 implements sophisticated conditioning that demonstrates deep scene understanding:

| Conditioning Type | Implementation | Perception Implication |
| --- | --- | --- |
| Ego-action (speed, curvature) | Adaptive layer norm injection (found more accurate than cross-attention) | Model understands ego-dynamics |
| Camera parameters (intrinsics, extrinsics, distortion) | Three separate learnable embeddings summed into unified encoding; sinusoidal positional encodings | Model understands multi-view geometry |
| Environmental (weather, time of day, lighting) | Cross-attention conditioning | Model perceives and can control environmental conditions |
| Road configuration (lanes, speed limits, crossings, intersections) | Cross-attention conditioning | Model understands road structure semantically |
| Dynamic agents (3D bounding boxes, trajectories) | Cross-attention conditioning with class-based IoU evaluation | Model perceives and can control other agents |
| Scenario embeddings (from proprietary driving model) | Cross-attention with external latent embeddings | Model leverages driving model's internal representations |

Multi-View Generation

GAIA-2 generates up to 5 temporally and spatially synchronized camera views at 448x960 resolution per view. Each camera view is encoded independently, then combined with camera geometry embeddings before transformer processing. This multi-view consistency demonstrates that the model has learned a coherent 3D representation of the scene, not just independent per-camera generation.

Perception Improvements Over GAIA-1

| Dimension | GAIA-1 | GAIA-2 |
| --- | --- | --- |
| Spatial fidelity | Discrete VQ tokens (lossy quantization) | Continuous latent space (higher fidelity) |
| Temporal consistency | Per-frame autoregressive (sequential error accumulation) | Joint sequence diffusion (global temporal coherence) |
| Multi-view coherence | Single camera view | Up to 5 synchronized views with geometric consistency |
| Geographic diversity | London only | UK, US, Germany |
| Scene control granularity | Text + action | Fine-grained control over agents, weather, road config |
| Agent understanding | Implicit | Explicit 3D bounding box conditioning with class-based metrics |

5. GAIA-3 Perception Evolution {#5-gaia-3-perception-evolution}

Scale and Architecture

GAIA-3 (launched December 2, 2025) doubles GAIA-2 to 15 billion parameters and introduces perception-specific advances:

Video Tokenizer (2x GAIA-2 size):

  • Captures safety-critical spatial and temporal structures that GAIA-2's tokenizer missed
  • Enhanced fidelity for: subtle pedestrian motion, fast-moving vehicles, road signs, traffic lights, small objects
  • More faithful representation of real-world physics and causality

Training Scale:

  • ~10x more data than GAIA-2
  • Data spans 9 countries across 3 continents
  • 5x more compute than GAIA-2

Perception-Specific Innovations

1. Unified Perception-Prediction Representation: GAIA-3 "unified perception, prediction, and scene understanding around a single world representation, creating a feedback loop where improvements in one system directly informed the other." This means the perception capabilities of the world model and the driving model are co-optimized.

2. Safety-Critical Perception Validation: GAIA-3 introduces LiDAR-based validation of generated scenes. Real LiDAR point clouds are overlaid on generated frames to verify that "spatial structure and realism" are preserved during counterfactual scenario generation. This provides a ground-truth check on the world model's implicit perception.

3. World-on-Rails Perturbations: GAIA-3 can alter the ego-vehicle's trajectory while keeping other scene elements consistent, generating counterfactual collision scenarios. This demonstrates that the model has learned to disentangle ego-motion from scene perception -- a deep perception capability.

4. Embodiment Transfer: GAIA-3 re-renders scenes from new sensor configurations using "only a small, unpaired sample from the target rig." This demonstrates that the model's perception is not tied to a specific camera configuration but has learned a sensor-agnostic scene representation.

5. Synthetic-Test Fidelity: GAIA-3 reduced synthetic-test rejection rates fivefold compared to previous generations, indicating that the model's implicit perception of scene structure is approaching the fidelity needed for reliable safety evaluation.


6. LINGO-2 Perception: Vision-Language-Action {#6-lingo-2-perception}

Architecture Overview

LINGO-2 is the world's first closed-loop vision-language-action model (VLAM) tested on public roads. Its perception capabilities are embedded in a two-module architecture:

Module 1: Wayve Vision Model

  • Processes camera images from consecutive timestamps into a sequence of visual tokens
  • The exact backbone architecture is proprietary, but based on Wayve's published work (FIERY, MILE, Rig3R), it likely uses a transformer-based vision encoder that processes multi-camera inputs and lifts them into a unified representation

Module 2: Auto-regressive Language Model

  • Receives visual tokens from the vision model, plus conditioning variables (route, current speed, speed limit)
  • Trained to jointly predict: (a) a driving trajectory, and (b) commentary text
  • Bidirectional: language can be both input (instructions) and output (explanations)
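
Wayve has not released LINGO-2's implementation; the following is a minimal sketch, with assumed toy dimensions, of the two-module structure just described: visual tokens plus scalar conditioning feed a shared backbone that decodes both commentary-token logits and a trajectory.

```python
import torch
import torch.nn as nn

class TinyVLAM(nn.Module):
    """Illustrative structure only: visual tokens plus conditioning variables
    feed a backbone that jointly decodes commentary logits and waypoints."""

    def __init__(self, d_model=256, text_vocab=1000, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.cond_proj = nn.Linear(3, d_model)   # speed, speed limit, route cue
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.text_head = nn.Linear(d_model, text_vocab)      # commentary logits
        self.traj_head = nn.Linear(d_model, horizon * 2)     # (x, y) waypoints

    def forward(self, visual_tokens, conditioning):
        # visual_tokens: (batch, n_tokens, d_model) from the vision model
        # conditioning:  (batch, 3) scalar driving context
        cond = self.cond_proj(conditioning).unsqueeze(1)
        h = self.backbone(torch.cat([cond, visual_tokens], dim=1))
        summary = h[:, 0]                        # pooled conditioning slot
        trajectory = self.traj_head(summary).view(-1, self.horizon, 2)
        return self.text_head(h), trajectory

model = TinyVLAM()
vis = torch.randn(1, 64, 256)                # stand-in visual token sequence
cond = torch.tensor([[8.0, 13.4, 1.0]])      # speed m/s, limit m/s, route flag
text_logits, trajectory = model(vis, cond)
```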

How LINGO-2 Perceives Driving Scenes

LINGO-2's perception is demonstrated through its language capabilities. The model can:

  1. Describe what it sees: "There is a cyclist ahead on the left side of the road" -- demonstrating object detection and localization
  2. Explain its decisions: "I am slowing down because the traffic light ahead is red" -- demonstrating traffic state perception
  3. Respond to instructions: "Pull over on the left" -- demonstrating spatial understanding of road structure
  4. Ground language in vision: referential segmentation capabilities link language descriptions to specific image regions

SimLingo and CarLLaVA: Research Extensions

Wayve published two additional vision-language driving models that provide insight into their perception architecture:

CarLLaVA (CARLA Challenge 2024 Winner):

  • Uses LLaVA VLM with LLaMA backbone
  • Images split into two halves, independently encoded, concatenated, downsampled, and projected into the LLM
  • Label-free approach: no BEV, depth, or semantic segmentation labels required
  • Leverages vision encoder pre-trained on internet-scale vision-language data
  • Won 1st place in CARLA Autonomous Driving Challenge 2.0 sensor track (458% improvement over prior state-of-the-art)

SimLingo (CVPR 2025 Spotlight):

  • Vision encoder: InternViT-300M-448px (from InternVL2-1B)
  • Images split into N tiles of 448x448 pixels, each encoded independently
  • Pixel unshuffle technique downsamples tokens by 4x (each tile = 256 visual tokens; see the sketch after this list)
  • LLM backbone: Qwen2-0.5B-Instruct, fine-tuned with LoRA (alpha=64, r=32, dropout=0.1)
  • Disentangled waypoint representation:
    • Temporal speed waypoints: coordinates every 0.25 seconds (for speed control)
    • Geometric path waypoints: coordinates every meter (for lateral control)
    • This disentanglement yielded 39.9% increase in driving score
  • Action Dreaming: novel technique generating synthetic instruction-action pairs using a kinematic bicycle model and world-on-rails assumption
  • Training: 14 epochs on 8x A100 80GB GPUs, 24 hours, 3.1M samples at 4fps
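
As a concrete illustration of the token-reduction step referenced above, here is a small sketch of pixel-unshuffle style downsampling of patch tokens. The grid size (32x32 tokens per 448x448 tile, consistent with 14x14 patches) and the feature width are assumptions for illustration.

```python
import torch

def pixel_unshuffle_tokens(tokens, grid=32, factor=2):
    """Reduce the number of visual tokens by factor**2 by folding each
    factor x factor neighborhood of tokens into the channel dimension.
    Assumes the tokens form a square grid (e.g. 32x32 = 1024 patch tokens
    for a 448x448 tile with 14x14 patches)."""
    b, n, c = tokens.shape
    assert n == grid * grid
    x = tokens.reshape(b, grid, grid, c)
    x = x.reshape(b, grid // factor, factor, grid // factor, factor, c)
    x = x.permute(0, 1, 3, 2, 4, 5)          # group neighboring tokens
    return x.reshape(b, (grid // factor) ** 2, factor * factor * c)

tile_tokens = torch.randn(1, 1024, 768)        # one tile (assumed feature dim)
reduced = pixel_unshuffle_tokens(tile_tokens)  # (1, 256, 3072): 4x fewer tokens
```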

7. Camera-Centric Architecture {#7-camera-centric-architecture}

Camera Configuration

Wayve's core R&D system uses 6 monocular cameras providing 360-degree surround view. The configuration varies by platform:

| Platform | Camera Count | Configuration |
| --- | --- | --- |
| Core R&D fleet | 6 monocular cameras | 360-degree surround view |
| Nissan ProPILOT prototype | 11 cameras | Extended coverage with redundancy |
| Gen 3 L4 robotaxi | Multi-camera surround | Full redundancy for driverless operation |
| OEM consumer vehicles (2027+) | Flexible (camera-first) | OEM-determined based on vehicle architecture |

Why Camera-First

Wayve began with a camera-only sensor suite because:

  1. Information density: Cameras capture color, texture, semantic, and geometric information that LiDAR cannot. A single camera frame contains orders of magnitude more raw measurements than a typical LiDAR sweep, and each pixel carries appearance cues that a LiDAR return does not.

  2. Cost efficiency: Cameras cost orders of magnitude less than LiDAR sensors, critical for mass-market consumer vehicle deployment.

  3. AI-friendly: Modern vision transformers (ViTs) have demonstrated that 3D understanding can be extracted from 2D images with sufficient training data and model capacity. Kendall's PhD thesis was built on this principle.

  4. Scalability: Every car already has cameras. Adding more cameras is straightforward and cheap; fitting LiDAR to every car is not economically viable.

  5. Rapid prototyping: Starting camera-only was "the fastest way to prototype our AV2.0 approach" (Wayve blog).

Vision Backbone Evolution

Wayve's vision backbone has evolved significantly across their research publications:

Early Work (2018-2019):

  • Small CNNs (4 convolutional layers + 3 fully connected layers, ~10K parameters for the "Learning to Drive in a Day" RL agent)
  • SegNet-based encoder-decoder for semantic segmentation

FIERY Era (2021):

  • Multi-camera surround input
  • Per-pixel depth probability distribution for 3D lifting
  • Spatial Transformer module for ego-motion compensation in BEV
  • 3D convolutional temporal model

MILE Era (2022):

  • CNN-based image encoder
  • Depth probability distribution over predefined bins, using camera intrinsics and extrinsics
  • 3D feature voxels projected to BEV via sum-pooling on a predefined grid

Current Foundation Model:

  • Transformer-based vision backbone (likely ViT-Large or similar, based on Rig3R's use of ViT-Large)
  • Multi-camera image features extracted and lifted into 3D
  • Self-attention mechanisms for spatial and temporal reasoning
  • "Tens of millions of parameters" in the deployed driving model

Feature Extraction Pipeline

Based on Wayve's published papers, the feature extraction pipeline follows this general flow:

Multi-Camera Images (6 views)
        |
        v
Vision Backbone (per-camera feature extraction)
        |
        v
Depth Probability Distribution (per-pixel depth prediction)
        |
        v
3D Feature Lifting (using depth + camera intrinsics/extrinsics)
        |
        v
BEV Projection (sum-pooling of 3D features onto ground plane grid)
        |
        v
Spatial-Temporal Reasoning (transformer attention / 3D convolutions)
        |
        v
Latent State (compressed 1D vector encoding world state)
        |
        v
Motion Planning Head --> Trajectory
        |
        v
Auxiliary Decoders --> Depth, Semantics, Flow (for interpretability)
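
The lifting and BEV sum-pooling steps in this pipeline can be sketched for a single camera as follows, in the spirit of the FIERY/MILE lift-splat formulation. Grid size, resolution, and depth bins are toy values, and the production multi-camera implementation is not published.

```python
import torch

def lift_to_bev(features, depth_logits, intrinsics_inv, cam_to_ego,
                depth_bins, bev_size=50, bev_res=2.0):
    """Lift per-pixel image features into 3D with a depth distribution and
    sum-pool them onto a ground-plane BEV grid (single-camera sketch)."""
    c, h, w = features.shape
    d = depth_bins.shape[0]
    # Outer product: every pixel contributes a feature at every depth bin,
    # weighted by its predicted depth probability (soft 3D lifting).
    depth_prob = depth_logits.softmax(dim=0)                    # (d, h, w)
    lifted = depth_prob.unsqueeze(1) * features.unsqueeze(0)    # (d, c, h, w)

    # Back-project pixel rays into the ego frame at each candidate depth.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()  # (3, h, w)
    rays = intrinsics_inv @ pix.reshape(3, -1)                       # (3, h*w)
    pts = depth_bins.view(d, 1, 1) * rays.view(1, 3, -1)             # (d, 3, h*w)
    pts = pts.permute(1, 0, 2).reshape(3, -1)                        # (3, d*h*w)
    pts_ego = cam_to_ego[:3, :3] @ pts + cam_to_ego[:3, 3:]          # (3, d*h*w)

    # Sum-pool features into BEV cells indexed by (x, y) in the ego frame.
    bev = torch.zeros(c, bev_size, bev_size)
    ix = (pts_ego[0] / bev_res + bev_size / 2).long().clamp(0, bev_size - 1)
    iy = (pts_ego[1] / bev_res + bev_size / 2).long().clamp(0, bev_size - 1)
    flat = lifted.permute(1, 0, 2, 3).reshape(c, -1)                 # (c, d*h*w)
    bev.view(c, -1).index_add_(1, iy * bev_size + ix, flat)
    return bev

# Toy usage with identity camera geometry and 48 depth bins.
feat = torch.randn(64, 24, 48)
logits = torch.randn(48, 24, 48)
bev = lift_to_bev(feat, logits, torch.eye(3), torch.eye(4),
                  depth_bins=torch.linspace(2.0, 50.0, 48))
```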

BEV Representation

The Bird's-Eye View (BEV) representation is central to Wayve's perception stack. Key details from published papers:

  • MILE: 3D feature voxels converted to BEV through sum-pooling on a predefined grid. The observation decoder and BEV decoder use StyleGAN-like architecture: prediction starts as a learned constant tensor, progressively upsampled with latent state injected via adaptive instance normalization.

  • FIERY: BEV spans a 100m x 100m area around the vehicle. Features from surround cameras are lifted to 3D using predicted depth distributions, projected to BEV, and registered to the present reference frame using past ego-motion via a Spatial Transformer module.

  • OFT (Orthographic Feature Transform): Wayve's 2019 paper (Roddick, Kendall, Cipolla) proposed mapping image features to an orthographic BEV representation without explicit depth estimation. The key insight: "as much reasoning as possible should be performed in this orthographic space rather than directly on the pixel-based image domain. Under this orthographic birds-eye-view representation, scale is homogeneous; appearance is largely viewpoint-independent; and distances between objects are meaningful."


8. Sensor Fusion {#8-sensor-fusion}

Camera-Radar Fusion

Wayve introduced radar to complement cameras starting with their second-generation autonomous driving system. Their fusion approach is fundamentally different from traditional hand-engineered sensor fusion:

Traditional Sensor Fusion (AV1.0):

  • Manually designed algorithms align LiDAR point clouds with camera images
  • Hand-coded rules determine which sensor to trust in different conditions
  • Fixed fusion pipelines with explicit geometric calibration
  • Failure modes are addressed individually with engineering patches

Wayve's Learned Fusion (AV2.0):

  • The end-to-end neural network learns to fuse camera and radar data automatically
  • "Our end-to-end neural network is not constrained by a hand-engineered scene representation. Instead, it learns a representation that best enables our system to leverage the complementary strengths of disparate sensing modalities." (Wayve blog)
  • Transformer architectures "are very capable of aligning representations between camera and radar data modalities"
  • The model autonomously learns optimal integration strategies without manual engineering (see the sketch below)
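
Wayve has not published its fusion architecture; as a rough illustration of how cross-attention can align the two modalities, here is a toy sketch in which camera tokens attend over projected radar returns. The radar feature layout and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CameraRadarFusion(nn.Module):
    """Toy learned fusion: camera tokens attend over radar return tokens,
    letting the network decide what to take from each modality."""

    def __init__(self, dim=128, heads=4, radar_feat=4):
        super().__init__()
        # Each radar return: e.g. (range, azimuth, Doppler velocity, RCS).
        self.radar_proj = nn.Linear(radar_feat, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, camera_tokens, radar_returns):
        # camera_tokens: (batch, n_cam_tokens, dim) from the vision backbone
        # radar_returns: (batch, n_returns, radar_feat) point-level returns
        radar_tokens = self.radar_proj(radar_returns)
        fused, _ = self.cross_attn(camera_tokens, radar_tokens, radar_tokens)
        return self.norm(camera_tokens + fused)

fusion = CameraRadarFusion()
cam = torch.randn(1, 300, 128)
radar = torch.randn(1, 64, 4)
tokens = fusion(cam, radar)   # camera tokens enriched with radar evidence
```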

What Radar Provides That Cameras Cannot

| Capability | Camera | Radar | Fusion Benefit |
| --- | --- | --- | --- |
| Illumination independence | Dependent on ambient light | Active illumination via RF waves | Robust day/night operation |
| Direct velocity measurement | Requires multi-frame optical flow estimation | Per-frame Doppler velocity measurement | Precise speed detection of other agents |
| Weather resilience | Degraded by rain, fog, snow, glare | Different weather phenomenology; complementary strength in inclement weather | Robust all-weather perception |
| Failure mode correlation | Affected by lens obstruction, sun glare, dynamic range limits | Different hardware failure risks | Uncorrelated failure modes enhance safety |
| Range measurement | Inferred from learned depth estimation | Direct distance measurement per frame | Complementary depth sources |

LiDAR Integration

LiDAR is optional in Wayve's architecture. The core AI system does not require it, but the architecture is sensor-agnostic and can ingest LiDAR data when available:

  • Used in some development vehicles for ground-truth validation and enhanced perception during R&D
  • Nissan ProPILOT prototype includes 1 next-gen LiDAR sensor alongside 11 cameras and 5 radars
  • LiDAR data is used for GAIA-3 validation: real LiDAR point clouds are overlaid on generated scenes to verify spatial consistency
  • Not required for deployment; positioned as an optional add-on for OEMs who want additional redundancy

Sensor Configuration by Deployment Tier

| Tier | Sensors | Use Case |
| --- | --- | --- |
| Minimum viable | 6 cameras | R&D prototyping, L2+ consumer |
| Standard consumer | Cameras + automotive radar | L2+/L3 consumer deployment (2027+) |
| Advanced OEM | 11 cameras + 5 radars + 1 LiDAR | Nissan ProPILOT integration |
| L4 robotaxi | Multi-camera surround + radar + optional LiDAR | Driverless operation with full redundancy |

9. Depth Estimation {#9-depth-estimation}

Alex Kendall's Foundational Depth Work

Depth estimation is arguably the single perception task most central to Alex Kendall's academic career, and this expertise is deeply embedded in Wayve's technology.

GC-Net: End-to-End Learning of Geometry and Context for Deep Stereo Regression (ICCV 2017)

Kendall's GC-Net (Geometry and Context Network) was a seminal contribution to stereo depth estimation:

  • Proposed a novel architecture for regressing disparity from rectified stereo images
  • Used knowledge of the problem's geometry to form a cost volume using deep feature representations
  • Applied 3D convolutions over the cost volume to incorporate contextual information
  • Introduced a differentiable soft argmin operation to regress sub-pixel disparity values
  • This replaced traditional hand-crafted matching cost functions with learned features while maintaining geometric structure

PoseNet: Camera Relocalization (ICCV 2015)

Kendall's PoseNet was the first CNN to regress full 6-DOF camera pose from a single RGB image end-to-end. This work established the principle that deep learning could directly estimate geometric quantities from images without traditional geometric computation pipelines.

Geometric Loss Functions for Camera Pose Regression (CVPR 2017)

Kendall and Cipolla explored novel loss functions based on geometry and scene reprojection error, showing how to automatically learn optimal weighting to simultaneously regress position and orientation. This work underpins Wayve's approach to learning geometric representations.

Wayve's Self-Supervised Depth Estimation

In the deployed system, depth estimation is self-supervised -- it is learned from geometric consistency across views and over time, without requiring ground-truth depth labels from LiDAR or other sources:

  1. Multi-view geometric consistency: With known camera intrinsics and extrinsics, the model learns depth by ensuring that features from different cameras are consistent when projected into 3D space.

  2. Temporal photometric consistency: Using consecutive frames and estimated ego-motion, the model learns depth by ensuring that a pixel in frame t, when warped to frame t+1 using the predicted depth and ego-motion, produces a photometrically consistent image (see the sketch after this list).

  3. Depth probability distributions: Rather than predicting a single depth value per pixel, Wayve's models (MILE, FIERY) predict a probability distribution over depth bins. This captures depth uncertainty and enables soft 3D lifting of features, avoiding hard depth decisions that could propagate errors.
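
A minimal sketch of the temporal photometric consistency objective in item 2, in the style of monodepth-like self-supervision (the exact losses Wayve uses are not published): back-project pixels with the predicted depth, transform them with the ego-motion, re-project into the next frame, and penalize the photometric difference.

```python
import torch
import torch.nn.functional as F

def photometric_depth_loss(frame_t, frame_t1, depth_t, K, K_inv, T_t_to_t1):
    """Warp frame t+1 back into frame t using predicted depth and ego-motion,
    then penalize the photometric difference (self-supervised depth sketch)."""
    b, _, h, w = frame_t.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(3, -1)

    # Back-project pixels of frame t to 3D, move them with the ego-motion,
    # and re-project into frame t+1.
    cam_pts = K_inv @ pix * depth_t.view(b, 1, -1)                 # (b, 3, h*w)
    ones = torch.ones(b, 1, h * w)
    moved = T_t_to_t1 @ torch.cat([cam_pts, ones], dim=1)          # (b, 4, h*w)
    proj = K @ moved[:, :3]
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)

    # Normalize to [-1, 1] and sample frame t+1 at the warped locations.
    grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], dim=-1) * 2 - 1
    warped = F.grid_sample(frame_t1, grid.view(b, h, w, 2), align_corners=True)
    return torch.mean(torch.abs(warped - frame_t))

# Toy usage: identity motion and unit depth just exercise the shapes.
f0, f1 = torch.rand(1, 3, 32, 48), torch.rand(1, 3, 32, 48)
loss = photometric_depth_loss(f0, f1, torch.ones(1, 1, 32, 48),
                              torch.eye(3), torch.eye(3), torch.eye(4))
```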

Depth in the Perception Stack

Depth serves multiple roles in Wayve's stack:

| Role | How Depth Is Used |
| --- | --- |
| 3D feature lifting | Predicted depth distributions are used with camera intrinsics/extrinsics to project 2D image features into 3D voxel space for BEV construction (MILE, FIERY) |
| Auxiliary training signal | Self-supervised depth loss provides geometric inductive bias that accelerates learning of spatial understanding |
| PRISM-1 reconstruction | Depth estimation is one of the geometric inductive biases used in 4D scene reconstruction |
| Interpretability | Decoded depth maps allow engineers to verify the model's geometric understanding |
| Rig3R | Dense per-pixel 3D point prediction with confidence scores provides depth as part of geometric foundation model |

Depth Without LiDAR

Wayve's ability to estimate accurate depth from cameras alone -- without LiDAR ground truth during training -- is a direct product of Kendall's PhD research. The key insight from his thesis: "end-to-end deep learning architectures for core computer vision problems including...stereo vision" can be trained with geometric self-supervision, leveraging "underlying geometry of problems such as epipolar geometry for unsupervised learning."

This is why Wayve can operate without LiDAR while competitors like Waymo rely on LiDAR for both perception and training data generation. Wayve's depth estimation is a learned capability, not dependent on an expensive depth sensor.


10. Semantic Understanding {#10-semantic-understanding}

Semantic Segmentation: From SegNet to End-to-End

Wayve's semantic understanding capabilities trace directly to Alex Kendall's work on SegNet:

SegNet (PAMI 2017, Badrinarayanan, Kendall, Cipolla):

  • A deep fully convolutional encoder-decoder architecture for pixel-wise semantic segmentation
  • Encoder topologically identical to VGG16's 13 convolutional layers
  • Decoder uses pooling indices from the encoder's max-pooling layers for non-linear upsampling
  • Designed for memory efficiency and real-time inference -- critical for on-vehicle deployment
  • Originally demonstrated on road scene segmentation into 11 classes for autonomous driving

Bayesian SegNet (BMVC 2017, Kendall, Badrinarayanan, Cipolla):

  • Extended SegNet with uncertainty estimation via Monte Carlo dropout
  • Produced per-pixel uncertainty maps alongside semantic predictions
  • Demonstrated 2-3% segmentation improvement from uncertainty modeling
  • Established the principle that perception outputs should come with confidence estimates

Semantic Segmentation in Current System

In Wayve's current end-to-end driving model, semantic segmentation exists as an auxiliary output decoded from the model's latent representation:

  • Training signal type: Supervised (requires labeled data)
  • Purpose: Provides semantic inductive bias during training; enables interpretability monitoring
  • Not used in decision pipeline: The driving model does not consume explicit semantic labels; instead, semantic understanding is implicit in the learned representation

Scene Understanding Hierarchy

Wayve's model demonstrates understanding across multiple semantic levels:

  1. Pixel-level semantics: Road surface, sidewalk, vegetation, sky, buildings, vehicles, pedestrians, traffic signs, traffic lights
  2. Object-level understanding: Individual road users with implicit tracking (no explicit object detection module)
  3. Scene-level context: Type of road (urban, suburban, highway), intersection type, road complexity
  4. Behavioral semantics: Aggressive vs. cautious drivers, pedestrian intent, cyclist behavior
  5. Cultural semantics: Driving norms, right-of-way conventions, regional traffic patterns (learned from multi-country data)

Evolution from Explicit to Implicit Semantics

Wayve's blog post on driving computer vision (2018) described their perception system predicting "the semantic class of each pixel and the spatial layout of the scene" at 25 Hz on an NVIDIA Drive PX2. By 2022, the emphasis shifted from explicit semantic outputs to emergent representations:

"Deep-convolutional network architectures have replaced human-defined approaches, such as edge detection techniques used for lane detection." (Wayve blog)

The transition from explicit semantic segmentation (SegNet era) to implicit semantic understanding (foundation model era) reflects Wayve's core thesis: learned representations should be optimized for driving, not for human interpretability.

Traffic Light and Sign Perception

Traffic light state detection remains one of the few perception tasks that uses supervised learning with explicit labels in Wayve's system. This is likely because:

  • Traffic light states have direct safety implications
  • The binary/categorical nature of traffic light states makes them easy to label
  • Explicit traffic light detection provides a verifiable safety check

11. Uncertainty Estimation {#11-uncertainty-estimation}

Alex Kendall's Pioneering Contributions

Uncertainty estimation in deep learning-based perception is one of Alex Kendall's most significant scientific contributions, with two foundational papers:

"What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" (NeurIPS 2017, Spotlight Oral)

This paper, co-authored with Yarin Gal, established the framework for understanding uncertainty in vision systems:

  • Aleatoric uncertainty: Captures noise inherent in the observations (e.g., sensor noise, ambiguous scene elements). This cannot be reduced with more data. It is further divided into:
    • Homoscedastic aleatoric uncertainty: constant for all inputs (task-dependent noise)
    • Heteroscedastic aleatoric uncertainty: varies per input (data-dependent noise)
  • Epistemic uncertainty: Captures uncertainty in the model parameters -- uncertainty that can be reduced with more training data. This is modeled through Bayesian inference over network weights.

The paper demonstrated that combining both types of uncertainty improved performance on semantic segmentation and depth regression tasks.

"Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding" (BMVC 2017)

This paper applied Bayesian deep learning to semantic segmentation:

  • Used Monte Carlo dropout (MC-Dropout) at test time to approximate Bayesian inference
  • Multiple forward passes with different dropout masks produce a distribution of predictions
  • The variance of these predictions estimates model (epistemic) uncertainty
  • Demonstrated 2-3% segmentation improvement by leveraging uncertainty
  • Showed that uncertainty maps correlate with actual prediction errors -- high uncertainty regions are indeed where the model makes mistakes
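
The MC-Dropout procedure itself is simple to state in code. The sketch below uses a toy convolutional head rather than Bayesian SegNet; only the sampling procedure is the point.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, n_samples=20):
    """Monte Carlo dropout: keep dropout layers active at test time and run
    several stochastic forward passes; the mean is the prediction and the
    variance across passes approximates epistemic (model) uncertainty."""
    model.eval()
    for m in model.modules():                 # re-enable dropout only
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.var(dim=0)

# Toy segmentation-style head with dropout (illustrative, not Bayesian SegNet).
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Dropout(p=0.5),
                    nn.Conv2d(16, 11, 1))     # 11 classes, as in CamVid
image = torch.randn(1, 3, 64, 64)
mean_logits, epistemic = mc_dropout_predict(net, image)
```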

"Concrete Dropout" (NeurIPS 2017)

Co-authored with Yarin Gal and Jiri Hron, this paper proposed:

  • A continuous relaxation of dropout's discrete masks, enabling the dropout probability itself to be learned through backpropagation
  • Principled optimization of dropout rates in large models (grid-search over dropout probabilities is prohibitive)
  • In RL settings, allows agents to adapt uncertainty estimates dynamically as more data is observed

"Concrete Problems for Autonomous Vehicle Safety: Advantages of Bayesian Deep Learning" (IJCAI 2017)

Co-authored with McAllister, Gal, van der Wilk, Shah, Cipolla, and Weller, this paper articulated:

  • Three core challenges for AV safety: safety, interpretability, and compliance
  • That Bayesian deep learning addresses all three by quantifying uncertainty
  • That propagating uncertainty through the AV pipeline allows the system to "avert disaster" by recognizing when its perception is unreliable
  • That uncertainty estimation enables the system to know when it doesn't know -- a critical safety capability

How Uncertainty Is Used in Wayve's System

While Wayve does not publish full details of their production system's uncertainty handling, their published research and technical philosophy indicate several applications:

1. Perception Confidence:

  • Depth estimates come with uncertainty (probability distributions over depth bins, not point estimates)
  • Semantic predictions can be accompanied by per-pixel confidence (via Bayesian/MC-Dropout approaches)
  • The model can signal when it is uncertain about its perception, triggering safety behaviors

2. Multi-Task Loss Weighting:

  • Homoscedastic uncertainty is used to automatically weight the multiple loss functions in multi-task training (CVPR 2018 paper)
  • This prevents noisy tasks from dominating the gradient and allows the model to learn optimal task balancing

3. Anomaly Detection:

  • High epistemic uncertainty signals out-of-distribution inputs -- scenarios the model has not seen in training
  • This is critical for the "long tail" problem: the model can recognize when it is in unfamiliar territory

4. Safety Monitoring:

  • Auxiliary perception outputs (decoded from latent states) can be compared against expected values
  • Disagreement between auxiliary outputs and the model's behavior flags potential issues

5. Active Learning:

  • Uncertainty estimates identify the most informative scenarios from fleet data for retraining
  • High-uncertainty scenarios are prioritized in the training curriculum

6. Rig3R Confidence Scores:

  • Rig3R predicts dense per-pixel 3D points with confidence scores
  • These confidence-weighted predictions enable the model to express uncertainty about its 3D reconstruction

Uncertainty Propagation

One of Kendall's key insights (from "Concrete Problems for Autonomous Vehicle Safety") is that uncertainty must be propagated through the entire system, not just estimated at the perception layer. In Wayve's end-to-end architecture, this happens naturally: because there are no hard module boundaries, uncertainty in early perception features flows continuously through the network to influence the motion planning output. The model can produce more conservative trajectories when its internal representation is uncertain.


12. PRISM-1 Scene Reconstruction {#12-prism-1-scene-reconstruction}

Overview

PRISM-1 (Photorealistic Reconstruction In Static and dynamic scenes) is Wayve's scene reconstruction model for creating photorealistic 4D simulations (3D space + time) from camera-only driving data. While primarily a simulation tool, it has deep perception implications.

Core Representation: 3D Gaussian Splatting

PRISM-1 is built on 3D Gaussian Splatting as its primary scene representation (evidenced by characteristic Gaussian artifacts visible in outputs). Each Gaussian primitive encodes:

  • 3D position (mean)
  • 3D covariance (shape/orientation)
  • Color/appearance (view-dependent)
  • Opacity

The scene is represented as a collection of these Gaussians, which are rasterized via differentiable splatting for rendering.
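
A minimal sketch of one such primitive and its covariance, matching the fields listed above. PRISM-1's exact parameterization is not published; the spherical-harmonic color term and log-scale parameterization are standard 3D Gaussian Splatting conventions assumed here.

```python
import torch
from dataclasses import dataclass

@dataclass
class GaussianPrimitive:
    """One splat of a 3D Gaussian scene representation (illustrative fields)."""
    mean: torch.Tensor        # (3,) world-space position
    log_scale: torch.Tensor   # (3,) per-axis extent (log for positivity)
    rotation: torch.Tensor    # (4,) unit quaternion orienting the covariance
    color_sh: torch.Tensor    # (k, 3) spherical-harmonic coefficients (view-dependent color)
    opacity: torch.Tensor     # () scalar in [0, 1]

def covariance(g: GaussianPrimitive) -> torch.Tensor:
    """Sigma = R S S^T R^T, the anisotropic 3x3 covariance of the splat."""
    w, x, y, z = g.rotation / g.rotation.norm()
    R = torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])
    S = torch.diag(g.log_scale.exp())
    return R @ S @ S.T @ R.T

g = GaussianPrimitive(
    mean=torch.zeros(3), log_scale=torch.log(torch.tensor([2.0, 0.5, 0.5])),
    rotation=torch.tensor([1.0, 0.0, 0.0, 0.0]),   # identity orientation
    color_sh=torch.zeros(1, 3), opacity=torch.tensor(0.9),
)
sigma = covariance(g)   # 3x3 covariance consumed by the differentiable rasterizer
```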

Perception Components

PRISM-1 achieves generalization through both geometric and semantic inductive biases:

Geometric Inductive Biases:

  • Depth estimation: Provides geometric structure for placing Gaussians in 3D
  • Surface normals: Constrains the orientation of Gaussian primitives
  • Optical flow: Provides motion supervision for dynamic elements

Semantic Inductive Biases:

  • Semantic segmentation: Helps disentangle static background from dynamic foreground
  • Foundation vision model features: Leverages representations from large pre-trained vision models (e.g., DINO/DINOv2) for semantic understanding

Self-Supervised Scene Disentanglement

PRISM-1 separates static and dynamic scene elements in a self-supervised manner:

  • No explicit labels, scene graphs, or bounding boxes required
  • The model learns to distinguish static background (buildings, road) from dynamic foreground (vehicles, pedestrians, cyclists)
  • Handles complex dynamic elements: cyclists, pedestrians, brake lights, opening car doors, road debris
  • Maintains geometric consistency through implicit scene flow inference

Camera-Only Operation

PRISM-1 operates on camera-only inputs:

  • No LiDAR required for reconstruction
  • Generalizes across arbitrary camera rigs without additional sensors
  • Reconstructs scenes from partially seen or unseen viewpoints via novel view synthesis
  • Uses only image-level 2D self-supervision without explicit 3D labels

4D Reconstruction Capabilities

| Capability | Description |
| --- | --- |
| Novel view synthesis | Render scenes from arbitrary camera viewpoints not in original data |
| Freeze time | Camera pans around the scene while time is frozen |
| Freeze position | Ego-vehicle stationary, observe temporal motion |
| Dynamic reconstruction | Cyclists, pedestrians, deformable objects reconstructed |
| Perception outputs | Depth maps, 3D velocity magnitude from reconstruction |
| Temporal consistency | Implicit scene flow maintains coherent 4D representation |

Relationship to Perception and Ghost Gym

PRISM-1 serves as the reconstruction backbone for Ghost Gym, Wayve's closed-loop neural simulator. The perception connection is bidirectional:

  1. Perception feeds reconstruction: PRISM-1 uses perception outputs (depth, normals, flow, semantics) as geometric and semantic priors for reconstruction
  2. Reconstruction feeds perception training: Ghost Gym generates photorealistic re-simulations of real driving scenarios with modified ego trajectories, providing diverse training data for the perception-driving model
  3. Perception validation: reconstructed scenes can be used to verify that the perception model produces consistent outputs under novel viewpoints and conditions

WayveScenes101 Benchmark

Alongside PRISM-1, Wayve released the WayveScenes101 dataset for benchmarking novel view synthesis in driving:

  • 101 driving scenes from the UK and US
  • 20 seconds per scene, 10 FPS per camera, 5 synchronized cameras
  • 101,000 camera images with poses from COLMAP
  • Urban, suburban, and highway environments
  • Various weather and lighting conditions
  • Evaluation protocol includes held-out camera for off-axis reconstruction quality
  • Metrics: PSNR, SSIM, LPIPS, FID
  • Open-source code and data at github.com/wayveai/wayve_scenes

13. Temporal Processing {#13-temporal-processing}

The Critical Role of Time in Driving Perception

Driving is fundamentally a temporal task. A single frame provides a snapshot; understanding driving requires reasoning over time -- predicting where other agents are going, recognizing traffic light transitions, understanding road geometry from parallax, and planning actions that are temporally consistent.

Temporal Processing Across Wayve's Models

FIERY: 3D Convolutional Temporal Model (ICCV 2021)

FIERY processes temporal information through a dedicated 3D convolutional module:

  • Multiple past frames are lifted to BEV independently
  • BEV features are registered to the present frame using known ego-motion (via Spatial Transformer)
  • A 3D convolutional temporal model learns spatio-temporal state from the registered BEV sequence
  • Future states are predicted via conditional variational inference, with present and future distributions
  • Produces multimodal future trajectories (multiple plausible futures)

MILE: Recurrent Temporal Dynamics (NeurIPS 2022)

MILE models temporal dynamics with a recurrent neural network (RNN):

  • Observations are encoded into a compressed BEV latent vector
  • An RNN predicts the next latent state from the previous state and action
  • This enables "imagining" diverse and plausible futures
  • StyleGAN-like decoders reconstruct BEV segmentation from predicted latent states
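
A toy sketch of this recurrent "imagination" loop, rolling a compressed latent forward under a sequence of candidate actions. The dimensions and the GRU cell are illustrative stand-ins for MILE's dynamics model.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Toy recurrent dynamics: predict the next compressed latent state
    from the previous state and the action, in the spirit of MILE."""

    def __init__(self, latent_dim=64, action_dim=2):
        super().__init__()
        self.cell = nn.GRUCell(action_dim, latent_dim)

    def imagine(self, state, actions):
        # state: (batch, latent_dim); actions: (batch, horizon, action_dim)
        rollout = []
        for t in range(actions.shape[1]):
            state = self.cell(actions[:, t], state)   # next latent given action
            rollout.append(state)
        return torch.stack(rollout, dim=1)            # imagined latent futures

dyn = LatentDynamics()
z0 = torch.randn(1, 64)                               # current BEV latent
plan = torch.randn(1, 12, 2)                          # candidate steering/accel
futures = dyn.imagine(z0, plan)                       # (1, 12, 64)
```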

Probabilistic Future Prediction (ECCV 2020, Hu, Cotter, Mohan, Gurau, Kendall)

This earlier Wayve paper introduced a five-module architecture for temporal perception:

  • Perception module: learns representation from RGB video with spatio-temporal convolutional module
  • Dynamics module: models how the world evolves over time
  • Present/Future Distributions: conditional variational approach for stochastic future prediction
  • Future Prediction module: decodes predicted future to semantic segmentation, depth, and optical flow
  • Control module: drives from the learned temporal representation

The paper was the first to jointly predict ego-motion, static scene, and dynamic agent motion in a probabilistic manner.

Video Instance Segmentation: Spatio-Temporal Embedding (2019, Hu, Kendall et al.)

This Wayve paper proposed:

  • A spatio-temporal embedding loss for temporally consistent video instance segmentation
  • A 3D causal convolutional network for modeling motion (entirely causal -- no future frame information)
  • Integration of appearance, motion, and geometry cues (including a monocular self-supervised depth loss)
  • In the embedding space, video-pixels of the same instance cluster together while being separated from others, enabling natural tracking without complex post-processing
  • Real-time operation with causal architecture

GAIA-1: Autoregressive Temporal Prediction

GAIA-1 processes temporal information through autoregressive next-token prediction:

  • All encoders (video, text, action) are temporally aligned to ensure coherent timeline
  • The transformer predicts future tokens conditioned on all past tokens
  • Temporal consistency is enforced by the video diffusion decoder, which operates on frame sequences

GAIA-2: Space-Time Factorized Attention

GAIA-2 separates spatial and temporal processing explicitly:

  • Spatial attention: operates within each frame, attending across space and camera views
  • Temporal attention: operates across frames, learning temporal dynamics
  • This factorization is more efficient than full space-time attention while maintaining temporal coherence
  • The video tokenizer encoder processes 8 frames per temporal latent (temporal compression 8x)
  • The decoder jointly decodes 3 temporal latents to 24 frames for smooth temporal transitions

LINGO-2: Sequential Token Processing

LINGO-2 processes temporal information through:

  • The vision model processes camera images from consecutive timestamps into a sequence of tokens
  • The auto-regressive language model processes these temporal visual tokens alongside conditioning variables
  • This enables the model to reason about temporal context when predicting driving actions

Foundation Model Temporal Processing

Wayve's current foundation driving model uses transformer-based temporal attention:

  • Self-attention mechanisms attend across time steps, learning temporal dependencies
  • The model processes multiple past frames to build a temporal understanding of scene dynamics
  • Temporal reasoning is not a separate module -- it is integrated into the unified model

14. Geographic Generalization {#14-geographic-generalization}

The Generalization Challenge

Traditional AV systems require per-city HD map creation, rule tuning, and extensive testing before deployment in a new geography. Wayve's system has been tested in 500+ cities across Europe, North America, and Japan without city-specific fine-tuning.

How Wayve's Perception Generalizes

1. Foundation Model with Universal Backbone: The model is trained on a "universal backbone" -- a foundation model trained on petabyte-scale datasets that "encodes rich, transferable driving behaviors." This backbone learns features that transfer across geographies rather than overfitting to specific cities.

2. Data Diversity Strategy:

  • Training data from multiple countries (UK, US, Germany, Canada, Japan)
  • Includes lower-fidelity driving videos from diverse sources
  • GAIA-3 trained on data spanning 9 countries across 3 continents

3. Cross-Geographic Data Network Effect: Wayve demonstrated that adding geographically diverse data improves performance everywhere. Training on UK and US data together resulted in 3x performance improvement in the UK compared to adding the same volume of UK-only data.

4. Rapid Adaptation:

  • US deployment: 500 hours of incremental data over 8 weeks achieved UK-equivalent performance
  • Only 100 hours of data showed "strong improvements" in behavioral competencies
  • Germany: 3x better zero-shot performance than initial US deployment (benefiting from UK+US training)

Perception Challenges Across Geographies

| Challenge | How Wayve Handles It |
| --- | --- |
| Driving side (left vs. right) | Learned from data; model adapts to mirror geometry |
| Traffic sign differences | Learned from visual appearance; no pre-programmed sign database |
| Intersection rules | Learned from observed behavior (4-way stops, roundabouts, unprotected turns) |
| Road marking styles | Learned from visual features; no explicit lane line detector |
| Cultural driving norms | Learned implicitly from driving data ("too nuanced to program manually") |
| Vehicle platform differences | 100 hours of vehicle-specific training for sensor configuration adaptation |

Zero-Shot vs. Few-Shot Generalization

Wayve distinguishes between:

  • Zero-shot generalization: deploying in a new country with no local training data (demonstrated with 3x improvement in Germany due to UK+US training)
  • Few-shot adaptation: rapid adaptation with a small amount of local data (500 hours for full US equivalence)

This is fundamentally different from the AV1.0 approach, which requires complete per-city re-engineering.


15. No-HD-Map Perception {#15-no-hd-map-perception}

How Wayve Perceives Road Structure Without Prior Mapping

Traditional AV systems rely on pre-built HD maps that encode:

  • Precise lane geometry (centerlines, boundaries, widths)
  • Traffic sign and signal locations
  • Speed limits
  • Intersection topology
  • Crosswalk locations
  • Road surface markings

Wayve's system replaces all of this with learned perception from raw sensor data, augmented only by standard satellite navigation (turn-by-turn directions).

What the Model Perceives in Real-Time

Road Geometry:

  • The model learns road edges, lane boundaries, and road curvature from visual features and driving behavior
  • No explicit lane line detector -- road structure is part of the emergent representation
  • The BEV representation encodes drivable area implicitly

Intersection Topology:

  • Intersection type, topology, and right-of-way rules are learned from observed driving behavior
  • The model handles: roundabouts, 4-way stops, T-junctions, unprotected turns, signalized intersections
  • These capabilities are "not mapped out or explicitly specified" -- they emerge from training data

Traffic Infrastructure:

  • Traffic light states are detected with supervised learning (one of the few explicitly labeled tasks)
  • Traffic signs are perceived through the vision backbone's learned features
  • Speed limits, pedestrian crossings, and road configuration are understood implicitly

Dynamic Road Changes:

  • Construction zones, temporary road markings, and detours are handled through visual perception
  • No need to update maps when road conditions change -- the model perceives the current state

The Sat-Nav Interface

The only map-like input to Wayve's system is standard satellite navigation providing turn-by-turn directions. This serves as a high-level routing signal (turn left at next junction, go straight, take the second exit at the roundabout), not as a geometric reference. The model must perceive the road structure in real-time to execute these high-level commands.
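The turn-by-turn signal can therefore be thought of as a low-dimensional conditioning input to the driving model rather than a geometric reference. Below is a minimal sketch of that idea in the style of command-conditioned driving; the module names, dimensions, and command vocabulary are illustrative assumptions, not Wayve's interface.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a driving policy conditioned on a discrete sat-nav command.
# The command (e.g. left / straight / right / take-exit) is embedded and concatenated
# with the learned scene representation; road geometry itself must come from perception.

class CommandConditionedPolicy(nn.Module):
    def __init__(self, scene_dim=512, num_commands=4, cmd_dim=32, num_waypoints=10):
        super().__init__()
        self.num_waypoints = num_waypoints
        self.cmd_embed = nn.Embedding(num_commands, cmd_dim)
        self.head = nn.Sequential(
            nn.Linear(scene_dim + cmd_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_waypoints * 2),          # (x, y) waypoints in the ego frame
        )

    def forward(self, scene_features, command_id):
        cmd = self.cmd_embed(command_id)                 # (B, cmd_dim) routing signal
        x = torch.cat([scene_features, cmd], dim=-1)
        return self.head(x).view(-1, self.num_waypoints, 2)

policy = CommandConditionedPolicy()
waypoints = policy(torch.randn(2, 512), torch.tensor([0, 2]))
print(waypoints.shape)   # torch.Size([2, 10, 2])
```

The same high-level command can produce very different trajectories depending on what the perception backbone extracts from the current scene.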

Advantages of Map-Free Perception

  1. Instant deployability: no per-city mapping required
  2. Robustness to change: road construction, temporary changes, and map errors do not affect the system
  3. Lower cost: no mapping fleet, no map maintenance infrastructure
  4. Better generalization: the model learns to perceive road structure from any location, not just pre-mapped areas

The Philosophical Connection

Wayve's map-free approach is consistent with their end-to-end philosophy: HD maps are, in essence, a hand-crafted perception cache -- pre-computed perception stored in a database. By replacing this cache with real-time learned perception, Wayve eliminates a major source of rigidity and fragility in the AV stack.


16. Self-Supervised Learning {#16-self-supervised-learning}

The Core Principle

The majority of Wayve's training is self-supervised, meaning models learn from raw, unlabeled driving data without requiring expensive per-frame human annotations. This is a critical competitive advantage: while competitors must pay for millions of labeled frames (bounding boxes, segmentations, lane markings), Wayve's data scales with fleet miles driven, not annotation budget.

Self-Supervised Perception Objectives

1. Depth Estimation (Geometric Consistency)

  • The model learns depth by ensuring that features from different cameras (spatial multi-view) and consecutive frames (temporal) are geometrically consistent when projected into 3D (a minimal reprojection-loss sketch appears after this list)
  • Uses camera intrinsics and extrinsics as supervision signals
  • No ground-truth depth labels (from LiDAR or stereo) are required
  • Leverages epipolar geometry as an unsupervised learning signal (a direct application of Kendall's PhD thesis findings)

2. Optical Flow (Frame-to-Frame Correspondence)

  • The model learns pixel-level motion (optical flow) by predicting how pixels move between consecutive frames
  • This provides a learning signal for understanding dynamic scene elements
  • Flow prediction is self-supervised: the model must predict frame-to-frame correspondences without labels

3. Ego-Motion Estimation

  • The model learns to estimate its own motion from odometry signals (wheel encoders, IMU)
  • This provides a self-supervised signal for understanding the relationship between ego-motion and visual change

4. Future Prediction

  • The model learns to predict future frames, latent states, or driving scenarios from current observations
  • This is inherently self-supervised: the future is the "label" for the current observation
  • GAIA-1/2/3 take this to the extreme, learning to generate entire future driving videos

5. Contrastive Learning and Unsupervised Object Discovery

  • Wayve has mentioned using "unsupervised object discovery" and "contrastive learning" to reduce manual segmentation labeling requirements
  • Contrastive methods learn discriminative features by pulling together representations of similar scenes/objects and pushing apart dissimilar ones

6. Image Reconstruction

  • Video tokenizers (in GAIA models) are trained with self-supervised reconstruction losses (L1, L2, LPIPS perceptual loss)
  • DINO feature distillation provides additional self-supervised semantic learning
  • PRISM-1 uses image-level 2D self-supervision without explicit 3D labels
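To make the depth objective in item 1 concrete, the sketch below shows a monodepth-style photometric reprojection loss: the only supervision comes from warping one frame into another using predicted depth, known intrinsics, and a relative pose. This is a generic, simplified formulation assumed for illustration, not Wayve's training code.

```python
import torch
import torch.nn.functional as F

# Minimal, generic sketch of self-supervised depth via photometric reprojection
# (monodepth-style): predicted depth + a known relative pose warp the source frame
# into the target frame, and the photometric error is the loss. No depth labels used.

def backproject(depth, K_inv):
    """Lift every pixel to a 3D point using predicted depth and inverse intrinsics."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).float()          # (3, H, W) homogeneous pixels
    rays = K_inv @ pix.view(3, -1)                                     # (3, H*W) viewing rays
    return depth.view(B, 1, -1) * rays.unsqueeze(0)                    # (B, 3, H*W) 3D points

def reprojection_loss(target, source, depth, T_target_to_source, K, K_inv):
    B, _, H, W = target.shape
    pts = backproject(depth, K_inv)                                    # points in the target camera frame
    pts = T_target_to_source[:, :3, :3] @ pts + T_target_to_source[:, :3, 3:]   # move into the source frame
    proj = K @ pts                                                     # project into the source image
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                     # pixel coordinates (B, 2, H*W)
    uv = uv.view(B, 2, H, W)
    grid = torch.stack(                                                # normalise to [-1, 1] for grid_sample
        [2 * uv[:, 0] / (W - 1) - 1, 2 * uv[:, 1] / (H - 1) - 1], dim=-1
    )
    warped = F.grid_sample(source, grid, align_corners=True)           # source resampled into the target view
    return F.l1_loss(warped, target)                                   # photometric error supervises depth
```

In practice this is typically combined with multi-scale depth, SSIM terms, and masking of static pixels, but the core learning signal is exactly this geometric consistency check.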

Self-Supervised vs. Supervised Components

| Component | Self-Supervised | Supervised | Notes |
| --- | --- | --- | --- |
| Depth estimation | Yes (geometric consistency) | No | Core self-supervised output |
| Surface normals | Yes | No | Derived from depth/geometry |
| Optical flow | Yes (frame correspondence) | No | Motion understanding |
| Future prediction | Yes (next frame/state) | No | World modeling |
| Ego-motion | Yes (odometry signals) | No | Pose estimation |
| Semantic segmentation | Partially (contrastive learning) | Partially (labeled data) | Hybrid approach |
| Traffic light detection | No | Yes (explicit labels) | Safety-critical; requires labels |
| Driving behavior | Yes (imitation of expert data) | No | Self-supervised from recorded driving |

The "Revolution Will Not Be Supervised"

Wayve titled one of their key blog posts "The revolution will not be supervised," emphasizing that:

  • Self-supervised learning eliminates the annotation bottleneck
  • Human-defined labels impose an information ceiling (you can only learn what you label)
  • Self-supervised representations can capture nuances that defy categorization
  • Scaling self-supervised learning with data follows similar power laws to large language models

17. Foundation Model Architecture {#17-foundation-model-architecture}

Transformer-Based Core

Wayve's foundation driving model is transformer-based, using self-attention mechanisms for both spatial and temporal reasoning. While exact architectural details of the production driving model are proprietary, the published research papers provide detailed architectural insight:

Published Architectures

FIERY (ICCV 2021):

  • CNN-based image encoder (per-camera)
  • Lift-splat-shoot style 3D lifting with depth probability distributions (the lift step is sketched below)
  • Spatial Transformer for BEV registration
  • 3D convolutional temporal model
  • Probabilistic future prediction with conditional variational inference
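The "lift" step above can be written as an outer product between per-pixel context features and a predicted categorical depth distribution. The following is a simplified sketch for illustration; the splat/pooling into the BEV grid is omitted, and channel and bin counts are assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

# Simplified sketch of the "lift" step in lift-splat-shoot / FIERY-style BEV perception:
# each image feature predicts a categorical distribution over depth bins, and the outer
# product places feature mass at every candidate depth along its viewing ray.

class LiftToFrustum(nn.Module):
    def __init__(self, in_channels=64, feat_channels=64, num_depth_bins=48):
        super().__init__()
        self.num_depth_bins = num_depth_bins
        # one conv predicts both the depth distribution and the context features
        self.head = nn.Conv2d(in_channels, num_depth_bins + feat_channels, kernel_size=1)

    def forward(self, image_features):                   # (B, C_in, H, W)
        x = self.head(image_features)
        depth_logits = x[:, : self.num_depth_bins]        # (B, D, H, W)
        context = x[:, self.num_depth_bins :]             # (B, C, H, W)
        depth_prob = depth_logits.softmax(dim=1)          # categorical depth distribution per pixel
        # outer product: feature vector weighted by its probability at each depth bin
        frustum = depth_prob.unsqueeze(1) * context.unsqueeze(2)   # (B, C, D, H, W)
        return frustum                                     # next step: splat/pool into the BEV grid

lift = LiftToFrustum()
frustum = lift(torch.randn(1, 64, 28, 60))
print(frustum.shape)   # torch.Size([1, 64, 48, 28, 60])
```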

MILE (NeurIPS 2022):

  • CNN-based image encoder
  • BEV projection via depth bins + sum-pooling
  • RNN-based temporal dynamics in latent space
  • StyleGAN-like decoders for observation/BEV reconstruction
  • 1D latent vector encoding world state

Rig3R (NeurIPS 2025 Spotlight):

  • ViT-Large encoder for image encoding
  • ViT-Large decoder for multi-view fusion
  • Patch tokens with 2D sine-cosine positional embeddings
  • Three prediction heads: pointmap, pose raymap, rig raymap
  • Joint attention across all images and timesteps
  • Single forward pass for dense 3D reconstruction, camera pose estimation, and rig calibration

SimLingo (CVPR 2025 Spotlight):

  • InternViT-300M-448px vision encoder
  • Qwen2-0.5B-Instruct LLM backbone
  • Tile-based image encoding (448x448 tiles)
  • LoRA fine-tuning for efficient adaptation
  • Disentangled waypoint prediction (temporal speed + geometric path)

GAIA-2 (March 2025):

  • Video tokenizer: asymmetric space-time factorized transformer (85M encoder / 200M decoder)
  • World model: 8.4B parameter space-time factorized transformer
  • 22 transformer blocks, hidden dim 4096, 32 attention heads
  • Flow matching training objective
  • Cross-attention conditioning with adaptive layer norm

Attention Mechanisms

Wayve's models use several forms of attention (a minimal factorized space-time sketch follows the table):

| Attention Type | Where Used | Purpose |
| --- | --- | --- |
| Spatial self-attention | Within each frame (GAIA-2, Rig3R) | Understanding spatial relationships between objects |
| Temporal self-attention | Across frames (GAIA-2) | Understanding temporal dynamics |
| Cross-camera attention | Across camera views (GAIA-2, Rig3R) | Multi-view geometric consistency |
| Cross-attention | Conditioning injection (GAIA-2) | Integrating action, metadata, agent information |
| Adaptive layer norm | Action injection (GAIA-2) | Efficient conditioning for continuous signals |
| Joint multi-view attention | Rig3R decoder | Fusing spatial, temporal, and geometric cues |
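A space-time factorized block, as described for the GAIA-2 world model, alternates attention over the spatial axis and the temporal axis instead of attending over all tokens jointly. The sketch below illustrates only the factorization pattern; dimensions and layer layout are assumptions, not the production architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of space-time factorized self-attention: tokens are arranged as
# (batch, time, space, channels); spatial attention mixes tokens within each frame,
# temporal attention mixes each spatial location across frames. This keeps attention
# cost near O(T*S^2 + S*T^2) instead of O((T*S)^2) for joint attention.

class FactorizedSpaceTimeBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (B, T, S, C)
        B, T, S, C = x.shape
        # spatial attention: fold time into the batch dimension
        h = self.norm1(x).reshape(B * T, S, C)
        h, _ = self.spatial_attn(h, h, h)
        x = x + h.reshape(B, T, S, C)
        # temporal attention: fold space into the batch dimension
        h = self.norm2(x).permute(0, 2, 1, 3).reshape(B * S, T, C)
        h, _ = self.temporal_attn(h, h, h)
        x = x + h.reshape(B, S, T, C).permute(0, 2, 1, 3)
        return x

block = FactorizedSpaceTimeBlock()
out = block(torch.randn(2, 4, 100, 256))           # 2 clips, 4 frames, 100 spatial tokens each
print(out.shape)   # torch.Size([2, 4, 100, 256])
```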

How Perception Is Embedded in the Foundation Model

In the foundation driving model, perception is not a separate module but a capability distributed across the network's layers:

  • Early layers extract low-level visual features (edges, textures, colors) from raw camera images
  • Middle layers develop increasingly abstract representations (object-like features, spatial structure, depth cues)
  • Late layers integrate spatial-temporal context for scene-level understanding
  • The motion planning head reads the final representation to produce a trajectory

The key architectural insight is that no layer boundary corresponds to a "perception/planning boundary." Features at every level contribute to both understanding the scene and deciding how to drive.

Compute Requirements for Inference

| Platform | Compute Hardware | Performance |
| --- | --- | --- |
| Early R&D (2018) | NVIDIA Drive PX2 | Real-time at 25 Hz |
| Current R&D fleet | NVIDIA GPU compute units | Real-time multi-camera processing |
| Gen 3 L4 platform | NVIDIA DRIVE AGX Thor (Blackwell, 2000 FP4 TFLOPS) | Full L3/L4 inference |
| Consumer production | Qualcomm Snapdragon Ride SoC | Energy-efficient on-device AI inference |

18. Training Pipeline {#18-training-pipeline}

Infrastructure

Wayve's training infrastructure is built on Microsoft Azure, with NVIDIA GPU hardware:

| Component | Specification |
| --- | --- |
| Cloud platform | Microsoft Azure (partnership since 2020) |
| Training GPUs (GAIA-1 era) | 64x NVIDIA A100 (world model) + 32x NVIDIA A100 (decoder) |
| Training GPUs (GAIA-2 era) | 128x NVIDIA H100 (tokenizer) + 256x NVIDIA H100 (world model) |
| Storage | Azure Blob Storage: Archive tier (raw fleet data) + Hot tier (curated training datasets) |
| Orchestration | Apache Airflow for workflow orchestration |
| Data processing | Apache Spark / Hadoop for distributed processing |
| GPU provisioning | Mix of reserved instances (base load) + spot/pre-emptible instances (burst) |
| Network | Up to 400 Gbps theoretical throughput for distributed training |

Data Pipeline

Fleet Vehicles (UK, US, Germany, Canada, Japan)
        |
        v
Raw Data Upload (Azure Blob Storage - Archive)
        |
        v
Data Processing (Apache Spark / Hadoop)
        |
        v
Active Learning Selection (identify most informative scenarios)
        |
        v
Curated Training Dataset (Azure Blob Storage - Hot)
        |
        v
Training (Azure GPU clusters, PyTorch)
        |
        v
Validation (Ghost Gym + GAIA simulation)
        |
        v
Model Deployment (to fleet vehicles via OTA)
        |
        v
Fleet Learning (real-world performance data flows back)

Training Data Scale

| Data Source | Scale |
| --- | --- |
| GAIA-1 training data | 4,700 hours of London driving data |
| Total proprietary corpus | Thousands of hours (significantly larger than the GAIA-1 subset) |
| GAIA-2 training data | Multi-country data with 25M+ sequences |
| GAIA-3 training data | 10x GAIA-2 scale, spanning 9 countries across 3 continents |
| Fleet testing coverage | 500+ cities across Europe, North America, and Japan |
| Synthetic data | Generated by GAIA models for augmentation |
| Language data | Expert drivers providing spoken commentary while driving |

Training Methodology

Driving Model Training:

  1. Multi-task learning with uncertainty-weighted losses (a minimal weighting sketch follows this list)
  2. Imitation learning from expert driving data
  3. Self-supervised objectives (depth, flow, future prediction)
  4. Active learning to prioritize challenging scenarios
  5. Continuous iteration: models trained centrally, deployed to fleet, performance data flows back
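For item 1, the uncertainty-weighted loss from Kendall et al. (CVPR 2018) learns a log-variance per task and uses it to scale that task's loss, so noisier tasks are automatically down-weighted. Below is a minimal sketch of the commonly used simplified form; the task names are illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch of uncertainty-weighted multi-task loss (after Kendall et al., CVPR 2018):
# each task i gets a learned log-variance s_i, and the combined loss is
#   sum_i  exp(-s_i) * L_i + s_i
# (the commonly used simplified form), so high-noise tasks are down-weighted automatically.

class UncertaintyWeightedLoss(nn.Module):
    def __init__(self, task_names):
        super().__init__()
        self.log_vars = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(())) for name in task_names}
        )

    def forward(self, losses):                      # losses: dict of per-task scalar losses
        total = 0.0
        for name, loss in losses.items():
            s = self.log_vars[name]
            total = total + torch.exp(-s) * loss + s
        return total

criterion = UncertaintyWeightedLoss(["depth", "segmentation", "flow"])
total = criterion({
    "depth": torch.tensor(0.8),
    "segmentation": torch.tensor(1.2),
    "flow": torch.tensor(0.5),
})
print(total)
```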

World Model Training (GAIA):

  • GAIA-1: Autoregressive next-token prediction, 15 days on 64x A100
  • GAIA-2: Flow matching with L2 velocity prediction loss, 460K steps on 256x H100 (objective sketched after this list)
  • GAIA-3: 5x more compute than GAIA-2, 10x more data
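The flow-matching objective noted for GAIA-2 can be illustrated with the standard rectified-flow-style formulation: interpolate between noise and a data latent and regress the model's predicted velocity onto the straight-line velocity. The model signature below is a placeholder for illustration, not Wayve's training code.

```python
import torch

# Schematic sketch of a flow-matching training step: sample a time t, linearly
# interpolate between Gaussian noise x0 and a data latent x1, and regress the model's
# predicted velocity onto the constant target velocity (x1 - x0) with an L2 loss.

def flow_matching_loss(model, x1, cond):
    """x1: clean latents (B, ...); cond: conditioning (actions, metadata, camera params)."""
    x0 = torch.randn_like(x1)                               # noise sample
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))     # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1                               # point on the straight path
    target_velocity = x1 - x0                                # d(xt)/dt along that path
    pred_velocity = model(xt, t.flatten(), cond)             # model predicts the velocity field
    return ((pred_velocity - target_velocity) ** 2).mean()   # L2 velocity prediction loss
```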

Perception-Specific Training:

  • Supervised perception tasks (semantics, traffic lights) use labeled subsets of fleet data
  • Self-supervised perception tasks (depth, flow, normals) use the full unlabeled corpus
  • Foundation vision model features (DINO/DINOv2) provide pre-trained semantic representations
  • Tokenizer training uses reconstruction losses (L1, L2, LPIPS) with GAN fine-tuning (a rough loss sketch follows)
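A rough sketch of such a reconstruction objective, combining L1, L2, and LPIPS terms via the public lpips package; the weights here are illustrative and the adversarial fine-tuning term is omitted.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips -- standard perceptual-loss package

# Rough sketch of a tokenizer reconstruction objective combining L1, L2 and LPIPS terms.
# Weights are illustrative; GAN fine-tuning (an additional adversarial term) is omitted.

perceptual = lpips.LPIPS(net="vgg")

def reconstruction_loss(reconstruction, target, w_l1=1.0, w_l2=1.0, w_lpips=0.5):
    l1 = F.l1_loss(reconstruction, target)
    l2 = F.mse_loss(reconstruction, target)
    perc = perceptual(reconstruction, target).mean()   # LPIPS expects images scaled to [-1, 1]
    return w_l1 * l1 + w_l2 * l2 + w_lpips * perc
```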

MLOps and Deployment

Wayve implements MLOps workflows for continuous model development:

  • "Convergent and predictably rewarding training cycles" through active learning
  • Continuous validation in simulation before real-world deployment
  • Automated evaluation loops
  • Customer model customization per OEM
  • Operational readiness certification

19. Calibration {#19-calibration}

Rig3R: Learned Camera Calibration

Wayve's most significant contribution to camera calibration is Rig3R (NeurIPS 2025, Spotlight), a geometric foundation model that jointly performs 3D reconstruction, camera pose estimation, and rig calibration in a single forward pass.

Architecture

Image Encoder:

  • ViT-Large processes each input image into patch tokens with 2D sine-cosine positional embeddings
  • Multiple camera views processed simultaneously

Metadata Integration:

  • Accepts optional metadata tuples: camera ID, timestamp, rig calibration (as raymaps)
  • Raymaps: per-pixel rays encoding rig-relative camera poses (a construction sketch follows this list)
  • During training, metadata fields are randomly dropped to encourage robustness when information is unavailable
  • Discrete metadata uses 1D sine-cosine embeddings; raymaps undergo linear projection
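A raymap of this kind can be constructed directly from intrinsics and a camera-to-rig transform: each pixel stores its viewing-ray direction in the rig frame together with the camera center. The sketch below is a generic construction for illustration; Rig3R's exact encoding may differ.

```python
import torch

# Rough sketch of building a raymap: for each pixel, compute its viewing-ray direction
# in the rig frame (via inverse intrinsics and the camera-to-rig rotation) and pair it
# with the camera center in the rig frame.

def make_raymap(K, T_cam_to_rig, H, W):
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()   # (H, W, 3) homogeneous pixels
    dirs_cam = pix @ torch.linalg.inv(K).T                              # ray directions in the camera frame
    dirs_cam = dirs_cam / dirs_cam.norm(dim=-1, keepdim=True)           # unit-length rays
    R, t = T_cam_to_rig[:3, :3], T_cam_to_rig[:3, 3]
    dirs_rig = dirs_cam @ R.T                                           # rotate rays into the rig frame
    centers = t.expand(H, W, 3)                                         # camera center, tiled per pixel
    return torch.cat([dirs_rig, centers], dim=-1)                       # (H, W, 6) raymap

K = torch.tensor([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
raymap = make_raymap(K, torch.eye(4), H=720, W=1280)
print(raymap.shape)   # torch.Size([720, 1280, 6])
```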

Multi-View Decoder:

  • A second ViT-Large jointly attends across all images and timesteps
  • Fuses "spatial, temporal, and geometric cues within a shared latent space"

Three Prediction Heads

  1. Pointmap Head: Dense per-pixel 3D points with confidence scores (depth estimation)
  2. Pose Raymap Head: Per-pixel ray directions and global camera centers (camera pose estimation)
  3. Rig Raymap Head: Per-pixel rays and rig-frame camera centers (camera calibration)

Key Innovation: Rig Awareness

Rig3R is the first learned method to explicitly leverage rig constraints when available:

  • When calibration is known, it uses rig constraints to enhance 3D reconstruction accuracy
  • When calibration is unknown or incomplete, it infers rig structure and calibration from image content
  • Seamlessly handles everything from unstructured images to synchronized rigs of varying configurations

Performance

  • Outperforms traditional and learned methods by 17-45% mAA on real-world driving benchmarks
  • Evaluated on Waymo Open validation set (LiDAR ground truth) and WayveScenes101 (COLMAP ground truth)
  • On unseen rig configurations, incorporating rig constraints substantially improves accuracy
  • Robust under challenging conditions: day-night transitions, motion blur, rain, snow, glare, low-texture scenes

Practical Impact

Rig3R enables Wayve to deploy across multiple hardware configurations without bespoke calibration or brittle geometry pipelines. This is critical for their OEM licensing model, where different automakers use different camera configurations. Rather than requiring precise factory calibration for each vehicle variant, Rig3R can infer or refine calibration on-the-fly.

Camera Parameter Integration in GAIA-2

GAIA-2 also demonstrates learned calibration handling:

  • Camera intrinsics (focal lengths, principal points), extrinsics (pose), and distortion parameters are each embedded via separate learnable linear projections
  • These are summed into a unified camera encoding added to spatial tokens
  • This allows the world model to generate geometrically correct multi-view content from arbitrary camera configurations (a schematic of this pattern follows)
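Schematically, that conditioning pattern amounts to separate learnable linear projections whose outputs are summed into one camera encoding and broadcast onto the spatial tokens. The field sizes below are assumptions for illustration, not GAIA-2's actual parameterization.

```python
import torch
import torch.nn as nn

# Schematic of the camera-conditioning pattern described above: intrinsics, extrinsics,
# and distortion parameters each pass through their own learnable linear projection,
# and the results are summed into a single encoding added to the spatial tokens.

class CameraEncoding(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.proj_intrinsics = nn.Linear(4, dim)    # fx, fy, cx, cy
        self.proj_extrinsics = nn.Linear(12, dim)   # flattened 3x4 camera pose
        self.proj_distortion = nn.Linear(5, dim)    # e.g. k1, k2, k3, p1, p2

    def forward(self, intrinsics, extrinsics, distortion, spatial_tokens):
        cam = (
            self.proj_intrinsics(intrinsics)
            + self.proj_extrinsics(extrinsics)
            + self.proj_distortion(distortion)
        )                                            # (B, dim) unified camera encoding
        return spatial_tokens + cam.unsqueeze(1)     # broadcast-add to every spatial token

enc = CameraEncoding()
tokens = enc(torch.randn(2, 4), torch.randn(2, 12), torch.randn(2, 5), torch.randn(2, 100, 512))
print(tokens.shape)   # torch.Size([2, 100, 512])
```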

20. Key Publications {#20-key-publications}

Alex Kendall's Foundational Perception Papers

| Paper | Venue/Year | Key Perception Contribution |
| --- | --- | --- |
| PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization | ICCV 2015 | First CNN for end-to-end camera pose regression from a single RGB image |
| SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation | PAMI 2017 | Efficient encoder-decoder for real-time semantic segmentation; pooling-index upsampling |
| Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures | BMVC 2017 | MC-Dropout uncertainty estimation for segmentation; 2-3% improvement from uncertainty |
| Modelling Uncertainty in Deep Learning for Camera Relocalization | ICRA 2016 | Bayesian uncertainty for pose estimation |
| What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? | NeurIPS 2017 (Spotlight) | Distinguished aleatoric vs. epistemic uncertainty; framework for vision uncertainty |
| Concrete Dropout | NeurIPS 2017 | Learnable dropout rates for principled uncertainty estimation in large models |
| Geometric Loss Functions for Camera Pose Regression with Deep Learning | CVPR 2017 (Spotlight) | Geometry-based loss functions; automatic position/orientation weighting |
| End-to-End Learning of Geometry and Context for Deep Stereo Regression (GC-Net) | ICCV 2017 (Spotlight) | 3D cost volume with learned features for stereo depth; differentiable soft argmin |
| Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics | CVPR 2018 (Spotlight) | Homoscedastic uncertainty for multi-task loss weighting; joint depth + segmentation |
| Concrete Problems for Autonomous Vehicle Safety: Advantages of Bayesian Deep Learning | IJCAI 2017 | Framework for AV safety through uncertainty quantification |
| PhD Thesis: Geometry and Uncertainty in Deep Learning for Computer Vision | Cambridge 2017 | Unified framework; 2018 BMVA Prize, 2019 ELLIS Prize |

Wayve Perception Research Publications

| Paper | Venue/Year | Key Perception Contribution |
| --- | --- | --- |
| Learning to Drive in a Day | ICRA 2019 | First deep RL for autonomous driving; 10K-parameter CNN |
| Learning to Drive from Simulation without Real World Labels | ICRA 2019 | Sim-to-real transfer via unsupervised domain adaptation for perception |
| Orthographic Feature Transform for Monocular 3D Object Detection | BMVC 2019 (Oral) | BEV feature projection without explicit depth; viewpoint-independent representations |
| Learning a Spatio-Temporal Embedding for Video Instance Segmentation | 2019 | Temporal perception via spatio-temporal embeddings; causal 3D convolutions |
| Urban Driving with Conditional Imitation Learning | ICRA 2020 | Conditional imitation learning for urban driving perception-to-action |
| Probabilistic Future Prediction for Video Scene Understanding | ECCV 2020 | Joint ego-motion, static scene, dynamic agent prediction; 5-module architecture |
| FIERY: Future Instance Prediction in Bird's-Eye View from Surround Monocular Cameras | ICCV 2021 (Oral) | BEV perception from surround cameras; probabilistic future instance prediction |
| Reimagining an Autonomous Vehicle | arXiv 2021 | Manifesto for E2E driving; auxiliary self-supervised perception outputs |
| Model-Based Imitation Learning for Urban Driving (MILE) | NeurIPS 2022 | BEV perception + world model + driving policy; StyleGAN decoders |
| Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | ICRA 2024 | Object-level vector input to LLM for driving; 160K QA pairs |
| GAIA-1: A Generative World Model for Autonomous Driving | arXiv 2023 | 9.1B parameter world model; perception-through-generation |
| LingoQA: Visual Question Answering for Autonomous Driving | ECCV 2024 | 419K QA pairs; Lingo-Judge evaluation metric |
| LINGO-2: Driving with Natural Language | 2024 | First closed-loop VLA model on public roads; vision-to-language-to-action |
| CarLLaVA: Vision Language Models for Camera-Only Closed-Loop Driving | 2024 | VLM for driving; CARLA Challenge 2024 winner (458% improvement) |
| WayveScenes101: A Dataset and Benchmark for Novel View Synthesis | 2024 | 101-scene driving benchmark for scene reconstruction |
| GAIA-2: A Controllable Multi-View Generative World Model | arXiv 2025 | 8.4B latent diffusion world model; multi-view generation; flow matching |
| SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment | CVPR 2025 (Spotlight) | VLM for driving; Action Dreaming; disentangled waypoints |
| PRISM-1: Photorealistic Reconstruction in Static and Dynamic Scenes | 2025 | 4D scene reconstruction from camera-only; 3D Gaussian Splatting |
| Rig3R: Rig-Aware Conditioning and Discovery for 3D Reconstruction | NeurIPS 2025 (Spotlight) | Geometric foundation model; learned calibration; 17-45% improvement |
| GAIA-3: Scaling World Models to Power Safety and Evaluation | 2025 | 15B parameter world model; safety evaluation; embodiment transfer |

Open-Source Code Repositories

| Repository | URL | Content |
| --- | --- | --- |
| FIERY | github.com/wayveai/fiery | BEV future prediction from surround cameras |
| LingoQA | github.com/wayveai/LingoQA | VQA benchmark for autonomous driving |
| WayveScenes101 | github.com/wayveai/wayve_scenes | Novel view synthesis dataset and benchmark |
| Driving with LLMs | github.com/wayveai/Driving-with-LLMs | Object-level LLM driving |
| SimLingo | github.com/RenzKa/simlingo | Vision-only closed-loop driving with language |

21. Key Patents {#21-key-patents}

Patent Portfolio Status

Wayve Technologies maintains a patent portfolio covering their core autonomous driving technologies. However, as a UK-headquartered company focused on commercialization rather than patent assertion, Wayve appears to emphasize trade secrets and speed of innovation over extensive public patent filings.

Expected Patent Coverage Areas

Based on Wayve's published research, product announcements, and technology disclosures, their patent portfolio likely covers:

  1. End-to-End Driving Architecture: Methods for training a single neural network from sensor inputs to driving trajectory without modular decomposition
  2. Self-Supervised Perception Training: Methods for learning depth, flow, and geometry from unlabeled driving data
  3. World Model Architecture: GAIA-series generative world models for autonomous driving (autoregressive transformer + diffusion, latent diffusion world models)
  4. Vision-Language-Action Models: LINGO-series VLA models that combine driving, language commentary, and instruction-following
  5. 4D Scene Reconstruction: PRISM-1 Gaussian Splatting-based reconstruction for driving simulation
  6. Multi-Task Uncertainty Weighting: Automatic loss weighting for joint perception task training (building on Kendall's CVPR 2018 work)
  7. Rig-Aware 3D Reconstruction: Rig3R methods for learned calibration and 3D perception across camera configurations
  8. Neural Simulation: Ghost Gym closed-loop neural simulator architecture
  9. Geographic Generalization: Methods for rapid adaptation of driving models to new countries without HD maps
  10. Sensor Fusion: Learned camera-radar fusion approaches using transformer architectures

Academic IP Foundation

Wayve's intellectual property builds on a foundation of academic research from the University of Cambridge, particularly Alex Kendall's lab. Key IP concepts include:

  • Bayesian deep learning for perception uncertainty (NeurIPS 2017)
  • Multi-task learning with homoscedastic uncertainty (CVPR 2018)
  • End-to-end learned camera pose regression (ICCV 2015, CVPR 2017)
  • Encoder-decoder architectures for segmentation (PAMI 2017)
  • Deep stereo regression with geometric cost volumes (ICCV 2017)

These academic contributions are extensively cited (52,000+ Google Scholar citations for Kendall) and form the theoretical basis for much of Wayve's commercial technology.


Sources

Public research notes collected from public sources.