Wayve Perception Stack: Exhaustive Deep Dive

Last updated: March 2026


Table of Contents

  1. End-to-End Philosophy: Why Wayve Rejects Modular Perception
  2. Learned Perception: What Is Learned End-to-End vs. Modular
  3. GAIA-1 World Model Perception
  4. GAIA-2 Perception Evolution
  5. GAIA-3 Perception Evolution
  6. LINGO-2 Perception: Vision-Language-Action
  7. Camera-Centric Architecture
  8. Sensor Fusion
  9. Depth Estimation
  10. Semantic Understanding
  11. Uncertainty Estimation
  12. PRISM-1 Scene Reconstruction
  13. Temporal Processing
  14. Geographic Generalization
  15. No-HD-Map Perception
  16. Self-Supervised Learning
  17. Foundation Model Architecture
  18. Training Pipeline
  19. Calibration
  20. Key Publications
  21. Key Patents

1. End-to-End Philosophy: Why Wayve Rejects Modular Perception {#1-end-to-end-philosophy}

The Fundamental Divergence

Wayve's perception philosophy represents a radical departure from the modular autonomous driving paradigm. Where companies like Waymo, Aurora, and Cruise decompose driving into a sequential pipeline of perception, prediction, and planning -- each a separate, hand-engineered module with explicit interfaces -- Wayve replaces the entire stack with a single neural network trained end-to-end.

This is not simply an architectural choice. It is a philosophical position rooted in Alex Kendall's academic research: that human-defined intermediate representations impose information bottlenecks that fundamentally limit system performance.

Why Separate Perception Is a Bottleneck

In a traditional AV1.0 stack, the perception module must compress the full richness of sensory data into a fixed, human-defined vocabulary: bounding boxes, lane lines, traffic light states, semantic labels. This compression introduces three critical failure modes:

  1. Information loss at module boundaries. The perception module must decide what is "relevant" before passing data downstream. Objects or scene features that do not fit predefined categories are discarded, even if they are safety-critical (e.g., an unusual road obstacle, a construction barrier arrangement never seen in the label taxonomy).

  2. Error propagation. Errors in perception cascade downstream through prediction and planning. A missed detection in the perception module cannot be recovered later. As McAllister, Kendall, and colleagues argued in "Concrete Problems for Autonomous Vehicle Safety" (IJCAI 2017), propagating uncertainty through the pipeline is essential, but modular systems struggle to do this coherently.

  3. Loss of high-dimensional context. Bounding boxes and semantic labels are low-dimensional summaries of rich visual information. The texture of a road surface, the body language of a pedestrian, the subtle signs of a car about to change lanes -- all of this is lost when perception is compressed into predefined categories.

Wayve's Alternative: Emergent Representations

Instead of human-defined intermediate representations, Wayve generates what they call "emergent AI representations" -- abstract, high-dimensional feature vectors that are learned end-to-end and optimized directly for the driving task. These representations are not interpretable in human terms (they are not "bounding boxes" or "lane lines"), but they maximize the information available for producing safe driving actions.

As Wayve describes it: rather than using human-defined concepts, the system generates "abstract representations of the environment generated by AI through mathematical transformations that are optimally informative for maximizing the learning objective."

This approach is philosophically aligned with the observation in large language models that scaling model size and data diversity yields emergent capabilities that were never explicitly programmed. Wayve bets that a single, sufficiently large neural network trained on diverse driving data will develop internal representations that are richer than anything a human engineer could design.

Comparison to Competitors

| Company | Perception Approach | Explicit Perception Outputs | End-to-End Training |
| --- | --- | --- | --- |
| Wayve | Unified end-to-end model; perception implicit in latent space | Auxiliary outputs decoded from latent states for interpretability only | Full end-to-end from sensors to trajectory |
| Waymo | Historically modular (dedicated LiDAR/camera fusion, detection, tracking); converging toward E2E elements | Explicit 3D bounding boxes, tracks, semantic labels as module outputs | Increasingly end-to-end, but retains module boundaries |
| Aurora | Fully modular with HD maps; Aurora Driver maintains clear module boundaries | Explicit perception outputs from FirstLight LiDAR sensor fusion | Not end-to-end |
| Tesla | Evolved from modular to end-to-end with FSD v12+; philosophically closest to Wayve | Historically explicit (bounding boxes in occupancy network); now more implicit | FSD v12+ uses end-to-end for planning; vision backbone still somewhat modular |
| Mobileye | Modular with RSS safety framework; crowdsourced maps | Explicit perception outputs; SuperVision uses camera-first approach | Not end-to-end |

What Wayve Means by "End-to-End"

It is important to be precise about what Wayve means by end-to-end. The system is not a single monolithic function from pixel values to steering angle. Rather, it is a differentiable neural network with structured internal components:

  • A vision backbone that processes multi-camera images
  • Spatial-temporal reasoning modules
  • A motion planning head that outputs a trajectory

What makes it end-to-end is that:

  1. All components are trained jointly to optimize driving performance
  2. No hand-coded interfaces or fixed representations exist between components
  3. The internal representation is free to learn whatever features are most useful for driving
  4. Gradients flow from the driving loss all the way back through the entire model

The auxiliary outputs (depth, semantics, flow) are decoded from intermediate latent states as additional training signals and for interpretability, but they are not used in the decision pipeline itself.


2. Learned Perception: What Is Learned End-to-End vs. Modular {#2-learned-perception}

The Dual Regime: End-to-End Core + Auxiliary Decoders

Wayve's perception is organized into two regimes:

Regime 1: End-to-End Learned (Core Pipeline)

The following perception tasks are learned implicitly within the end-to-end driving model. They are not separate modules -- they are capabilities that emerge from the latent representation optimized for driving:

  • 3D scene understanding -- the spatial layout of the driving environment
  • Dynamic object recognition and tracking -- understanding of other road users and their behavior
  • Road structure understanding -- lanes, intersections, road edges, without HD maps
  • Traffic state comprehension -- traffic lights, signs, right-of-way
  • Predictive understanding -- anticipation of how the scene will evolve
  • Ego-motion understanding -- where the vehicle is and how it is moving

These are not decoded or evaluated as separate outputs. They exist as capabilities embedded in the model's latent state, evidenced by the model's ability to drive safely through complex scenarios.

Regime 2: Auxiliary Decoders (Interpretability and Training Signals)

The following perception outputs are decoded from the model's intermediate latent states. They serve two purposes: (a) providing additional training signals that accelerate learning (multi-task learning), and (b) enabling human interpretability and safety monitoring.

| Auxiliary Output | Training Signal Type | Purpose |
| --- | --- | --- |
| Semantic segmentation | Supervised (labeled data) | Interpretability; inductive bias for scene understanding |
| Traffic light state detection | Supervised (labeled data) | Safety-critical output; explicit verification |
| Depth estimation | Self-supervised (geometric consistency) | Geometric understanding; 3D scene structure |
| Surface normals | Self-supervised | Geometric reasoning about road and obstacle surfaces |
| Optical flow / motion estimation | Self-supervised (frame-to-frame correspondence) | Dynamic scene understanding |
| Future prediction | Self-supervised | Anticipatory driving; world model capabilities |

Multi-Task Learning: Uncertainty-Weighted Loss Functions

The training of these auxiliary tasks alongside the primary driving task uses a multi-task learning framework directly inspired by Alex Kendall's seminal CVPR 2018 paper, "Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics" (Kendall, Gal, Cipolla).

In this framework, each task's loss function is weighted by a learned homoscedastic uncertainty parameter. Rather than manually tuning the relative weight of depth loss vs. segmentation loss vs. driving loss, the model learns the optimal weighting automatically. Tasks with higher inherent noise (aleatoric uncertainty) receive lower weight, preventing noisy signals from dominating the gradient.

This principled multi-task approach was demonstrated to learn per-pixel depth regression, semantic segmentation, and instance segmentation from a monocular input simultaneously -- precisely the combination needed for autonomous driving perception.
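
To make the weighting mechanism concrete, the following is a minimal PyTorch sketch of homoscedastic uncertainty weighting in the spirit of the CVPR 2018 formulation. The task names and loss values are illustrative placeholders, and the loss form is simplified (the paper uses slightly different constants for regression and classification terms); this is not Wayve's production objective.

```python
import torch
import torch.nn as nn

class UncertaintyWeightedLoss(nn.Module):
    """Kendall-Gal-Cipolla style loss weighting (simplified sketch).

    Each task i gets a learned log-variance s_i = log(sigma_i^2); the combined
    loss is sum_i exp(-s_i) * L_i + s_i, so inherently noisy tasks are
    down-weighted automatically instead of hand-tuning the weights.
    """

    def __init__(self, task_names):
        super().__init__()
        # One learnable log-variance per task, initialized to 0 (sigma = 1).
        self.log_vars = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(())) for name in task_names}
        )

    def forward(self, task_losses):
        total = 0.0
        for name, loss in task_losses.items():
            s = self.log_vars[name]
            total = total + torch.exp(-s) * loss + s
        return total

# Illustrative usage with hypothetical per-task losses.
criterion = UncertaintyWeightedLoss(["driving", "depth", "segmentation"])
losses = {
    "driving": torch.tensor(0.8),       # e.g. trajectory imitation loss
    "depth": torch.tensor(0.25),        # self-supervised photometric loss
    "segmentation": torch.tensor(1.4),  # supervised cross-entropy
}
combined = criterion(losses)  # also differentiable w.r.t. the log-variances
```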

Five Training Objectives

Wayve's "Reimagining an Autonomous Vehicle" paper (Hawke, E, Badrinarayanan, Kendall, 2021) and associated blog posts describe five distinct training objectives combined in the multi-task framework:

  1. Imitation learning -- learning to mimic expert human driving behavior from recorded data
  2. Reinforcement learning -- learning from safety driver interventions and corrective actions
  3. Safety driver corrections -- learning from post-intervention corrective driving
  4. Dynamics modeling and future prediction -- learning to predict future states from off-policy data
  5. Computer vision representations -- learning semantic, geometric, and motion representations via supervised and self-supervised signals

What Makes This Different from Modular Multi-Task

In a modular system, multi-task learning might train a shared backbone to produce multiple outputs (detection, segmentation, depth), but those outputs are then consumed by downstream modules with fixed interfaces. In Wayve's system, the multi-task outputs are auxiliary -- the primary output is the driving trajectory, and the internal representation is free to develop features that do not correspond to any of the auxiliary tasks if they are useful for driving.


3. GAIA-1 World Model Perception {#3-gaia-1-world-model-perception}

How Perception Is Implicit in World Modeling

GAIA-1 (Generative AI for Autonomy, 9.1 billion parameters) is Wayve's first-generation generative world model. While it is not a perception model per se -- it is a generative model that produces video -- it demonstrates a deep form of implicit perception. To generate realistic driving videos, the model must have learned to perceive and understand:

  • 3D scene geometry -- objects maintain correct size, perspective, and occlusion relationships
  • Dynamic behavior -- vehicles accelerate, brake, and turn realistically
  • Temporal consistency -- scenes evolve coherently over time
  • Environmental conditions -- weather, lighting, road surfaces are rendered accurately
  • Causal relationships -- the generated future depends correctly on the conditioned ego-actions

This "perception through generation" is a key insight: a model that can accurately generate the future of a driving scene must have internalized a representation of that scene's structure, dynamics, and semantics.

Architecture

GAIA-1 consists of two components:

Component 1: World Model (6.5B parameters)

An autoregressive transformer that predicts the next set of image tokens in a sequence, conditioned on three modalities:

  • Video encoder (0.3B parameters): Discretizes each video frame using a VQ-VAE (Vector Quantized Variational Autoencoder). Each frame (resized to 9:16 aspect ratio) is encoded into 576 discrete tokens drawn from a learned codebook. The VQ approach converts continuous pixel data into a sequence of discrete symbols, enabling the transformer to treat video generation as a sequence prediction problem analogous to language modeling.

  • Text encoder: Discretizes and embeds natural language descriptions of driving scenarios into the shared representation space.

  • Action encoder: Projects scalar action values (steering angle, throttle, brake) into the shared representation space via learned projections.

All three modalities are projected into a shared representation space and temporally aligned. The autoregressive transformer then predicts future image tokens conditioned on this multimodal context.

Component 2: Video Diffusion Decoder (2.6B parameters)

A denoising video diffusion model that translates the predicted discrete image tokens back into pixel space. Critically, this operates on sequences of frames (not individual frames) to ensure temporal consistency. The diffusion process models frame sequences jointly, preventing temporal discontinuities that would arise from independent per-frame generation.
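
Wayve has not published GAIA-1's implementation, but the sequence-modeling formulation described above can be sketched in a few lines: each frame becomes a fixed-length sequence of discrete tokens, and a causal transformer predicts the next tokens given the multimodal context. All sizes below (codebook size, model width, depth) are assumed toy values, not GAIA-1's.

```python
import torch
import torch.nn as nn

TOKENS_PER_FRAME = 576     # as described above for GAIA-1
VOCAB_SIZE = 8192          # assumed VQ codebook size (not published here)
D_MODEL = 512              # toy width; the real world model is 6.5B parameters

class TinyWorldModel(nn.Module):
    """Autoregressive next-token predictor over interleaved
    text / action / image tokens, in the spirit of GAIA-1."""

    def __init__(self):
        super().__init__()
        self.token_emb = nn.Embedding(VOCAB_SIZE, D_MODEL)
        layer = nn.TransformerEncoderLayer(D_MODEL, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(D_MODEL, VOCAB_SIZE)

    def forward(self, tokens):
        # tokens: (batch, seq_len) discrete ids for past frames/text/actions
        seq_len = tokens.shape[1]
        causal = nn.Transformer.generate_square_subsequent_mask(seq_len)
        x = self.token_emb(tokens)
        x = self.backbone(x, mask=causal)
        return self.head(x)  # logits for the next token at every position

model = TinyWorldModel()
past = torch.randint(0, VOCAB_SIZE, (1, 2 * TOKENS_PER_FRAME))  # two frames
logits = model(past)  # the next frame's tokens are decoded autoregressively
```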

Perception Capabilities Demonstrated

| Capability | Evidence |
| --- | --- |
| 3D geometry understanding | Generated scenes maintain correct perspective, depth ordering, and object scaling |
| Occlusion reasoning | Objects correctly appear and disappear as they move behind other objects |
| Dynamic agent modeling | Other vehicles and pedestrians move with physically plausible dynamics |
| Environmental understanding | Accurate rendering of weather, lighting, road conditions from text prompts |
| Action-conditioned prediction | Ego-vehicle behavior correctly responds to input action tokens |
| Scene composition | Complex multi-agent urban scenes generated with correct spatial relationships |

Training

| Specification | Value |
| --- | --- |
| Total parameters | ~9.1B (6.5B world model + 2.6B decoder) |
| World model training | 15 days on 64x NVIDIA A100 GPUs |
| Video decoder training | 15 days on 32x NVIDIA A100 GPUs |
| Training data | 4,700 hours of proprietary London driving data (2019--2023) |
| Video encoder parameters | 0.3B |
| Tokens per frame | 576 discrete tokens |
| Aspect ratio | 9:16 |

4. GAIA-2 Perception Evolution {#4-gaia-2-perception-evolution}

Architectural Leap: Discrete Tokens to Continuous Latent Diffusion

GAIA-2 represents a fundamental architectural shift from GAIA-1. Where GAIA-1 used discrete VQ tokens and autoregressive prediction, GAIA-2 moves to a continuous latent space with a latent diffusion model. This has significant implications for perception quality.

Video Tokenizer Architecture

The video tokenizer is a space-time factorized transformer with an asymmetric encoder-decoder design:

Encoder (85M parameters):

  • Input: raw video frames
  • Two downsampling convolutional blocks:
    • First block: stride 2x8x8, embedding dimension 512
    • Second block: stride 2x2x2, embedding dimension 512
  • 24 spatial transformer blocks (512 dimensions, 16 attention heads)
  • Final convolution (stride 1x2x2) projecting to 2L channels for Gaussian distribution parameters (mean and standard deviation)
  • Total compression: 384x (32x spatial, 8x temporal, latent dimension L=64)
  • Encoder maps 8 frames to a single temporal latent independently (no temporal attention in encoder)

Decoder (200M parameters):

  • Linear projection from latent dimension to 512
  • First upsampling block (stride 1x2x2)
  • 16 space-time factorized transformer blocks (with both spatial and temporal attention)
  • Second upsampling (stride 2x2x2) + 8 additional transformer blocks
  • Final upsampling (stride 2x8x8) to 3 RGB channels
  • Key asymmetry: decoder jointly decodes 3 temporal latents to 24 frames, using temporal context for consistency
  • Rolling inference: for long sequences, overlapping strides generate new frames conditioned on previously generated frames in a sliding window fashion

Training: 300,000 steps, batch size 128, on 128 NVIDIA H100 GPUs. Losses include L1/L2 pixel reconstruction, LPIPS perceptual loss, DINO feature distillation, and KL divergence, with GAN fine-tuning for visual quality.

Latent World Model (8.4B parameters)

The world model is a space-time factorized transformer trained via flow matching (a more stable alternative to standard diffusion):

  • 22 transformer blocks with hidden dimension C=4096 and 32 attention heads
  • Each block contains:
    • Spatial attention (across space and camera views)
    • Temporal attention layer
    • Cross-attention layer (for conditioning)
    • MLP with adaptive layer norm
    • Query-key normalization before attention
  • Flow matching objective: predicts velocity targets v_{t+1:T} = x_{t+1:T} - epsilon_{t+1:T} (see the sketch after this list)
  • Training: 460,000 steps, batch size 256, on 256 NVIDIA H100 GPUs
  • Uses bimodal logit-normal time distribution for training schedule
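
A toy sketch of the two ideas in this list: a space-time factorized block that attends spatially within a frame and then temporally across frames, and a flow-matching style training target that regresses the velocity v = x - epsilon. Dimensions are illustrative, and the block omits cross-attention conditioning and adaptive layer norm; it is not GAIA-2's implementation.

```python
import torch
import torch.nn as nn

class SpaceTimeBlock(nn.Module):
    """Toy space-time factorized block: spatial attention within each frame,
    then temporal attention across frames at each spatial location."""

    def __init__(self, dim=128, heads=4):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):
        # x: (batch, time, space, dim) latent tokens
        b, t, s, d = x.shape
        # Spatial attention: fold time into the batch dimension.
        xs = self.norm1(x).reshape(b * t, s, d)
        xs, _ = self.spatial_attn(xs, xs, xs)
        x = x + xs.reshape(b, t, s, d)
        # Temporal attention: fold space into the batch dimension.
        xt = self.norm2(x).permute(0, 2, 1, 3).reshape(b * s, t, d)
        xt, _ = self.temporal_attn(xt, xt, xt)
        x = x + xt.reshape(b, s, t, d).permute(0, 2, 1, 3)
        return x

# Flow-matching style target: interpolate clean latents x with noise eps
# and regress the velocity v = x - eps, as in the objective above.
x = torch.randn(2, 8, 64, 128)          # (batch, time, space, dim) latents
eps = torch.randn_like(x)
tau = torch.rand(2, 1, 1, 1)            # per-sample noise level in [0, 1]
noisy = tau * x + (1.0 - tau) * eps
velocity_target = x - eps
pred = SpaceTimeBlock()(noisy)          # stand-in for the full world model
loss = torch.mean((pred - velocity_target) ** 2)
```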

Conditioning Mechanisms

GAIA-2 implements sophisticated conditioning that demonstrates deep scene understanding:

| Conditioning Type | Implementation | Perception Implication |
| --- | --- | --- |
| Ego-action (speed, curvature) | Adaptive layer norm injection (found more accurate than cross-attention) | Model understands ego-dynamics |
| Camera parameters (intrinsics, extrinsics, distortion) | Three separate learnable embeddings summed into unified encoding; sinusoidal positional encodings | Model understands multi-view geometry |
| Environmental (weather, time of day, lighting) | Cross-attention conditioning | Model perceives and can control environmental conditions |
| Road configuration (lanes, speed limits, crossings, intersections) | Cross-attention conditioning | Model understands road structure semantically |
| Dynamic agents (3D bounding boxes, trajectories) | Cross-attention conditioning with class-based IoU evaluation | Model perceives and can control other agents |
| Scenario embeddings (from proprietary driving model) | Cross-attention with external latent embeddings | Model leverages driving model's internal representations |

Multi-View Generation

GAIA-2 generates up to 5 temporally and spatially synchronized camera views at 448x960 resolution per view. Each camera view is encoded independently, then combined with camera geometry embeddings before transformer processing. This multi-view consistency demonstrates that the model has learned a coherent 3D representation of the scene, not just independent per-camera generation.

Perception Improvements Over GAIA-1

| Dimension | GAIA-1 | GAIA-2 |
| --- | --- | --- |
| Spatial fidelity | Discrete VQ tokens (lossy quantization) | Continuous latent space (higher fidelity) |
| Temporal consistency | Per-frame autoregressive (sequential error accumulation) | Joint sequence diffusion (global temporal coherence) |
| Multi-view coherence | Single camera view | Up to 5 synchronized views with geometric consistency |
| Geographic diversity | London only | UK, US, Germany |
| Scene control granularity | Text + action | Fine-grained control over agents, weather, road config |
| Agent understanding | Implicit | Explicit 3D bounding box conditioning with class-based metrics |

5. GAIA-3 Perception Evolution {#5-gaia-3-perception-evolution}

Scale and Architecture

GAIA-3 (launched December 2, 2025) doubles GAIA-2 to 15 billion parameters and introduces perception-specific advances:

Video Tokenizer (2x GAIA-2 size):

  • Captures safety-critical spatial and temporal structures that GAIA-2's tokenizer missed
  • Enhanced fidelity for: subtle pedestrian motion, fast-moving vehicles, road signs, traffic lights, small objects
  • More faithful representation of real-world physics and causality

Training Scale:

  • ~10x more data than GAIA-2
  • Data spans 9 countries across 3 continents
  • 5x more compute than GAIA-2

Perception-Specific Innovations

1. Unified Perception-Prediction Representation: GAIA-3 "unified perception, prediction, and scene understanding around a single world representation, creating a feedback loop where improvements in one system directly informed the other." This means the perception capabilities of the world model and the driving model are co-optimized.

2. Safety-Critical Perception Validation: GAIA-3 introduces LiDAR-based validation of generated scenes. Real LiDAR point clouds are overlaid on generated frames to verify that "spatial structure and realism" are preserved during counterfactual scenario generation. This provides a ground-truth check on the world model's implicit perception.

3. World-on-Rails Perturbations: GAIA-3 can alter the ego-vehicle's trajectory while keeping other scene elements consistent, generating counterfactual collision scenarios. This demonstrates that the model has learned to disentangle ego-motion from scene perception -- a deep perception capability.

4. Embodiment Transfer: GAIA-3 re-renders scenes from new sensor configurations using "only a small, unpaired sample from the target rig." This demonstrates that the model's perception is not tied to a specific camera configuration but has learned a sensor-agnostic scene representation.

5. Synthetic-Test Fidelity: GAIA-3 reduced synthetic-test rejection rates fivefold compared to previous generations, indicating that the model's implicit perception of scene structure is approaching the fidelity needed for reliable safety evaluation.


6. LINGO-2 Perception: Vision-Language-Action {#6-lingo-2-perception}

Architecture Overview

LINGO-2 is the world's first closed-loop vision-language-action model (VLAM) tested on public roads. Its perception capabilities are embedded in a two-module architecture:

Module 1: Wayve Vision Model

  • Processes camera images from consecutive timestamps into a sequence of visual tokens
  • The exact backbone architecture is proprietary, but based on Wayve's published work (FIERY, MILE, Rig3R), it likely uses a transformer-based vision encoder that processes multi-camera inputs and lifts them into a unified representation

Module 2: Auto-regressive Language Model

  • Receives visual tokens from the vision model, plus conditioning variables (route, current speed, speed limit)
  • Trained to jointly predict: (a) a driving trajectory, and (b) commentary text
  • Bidirectional: language can be both input (instructions) and output (explanations)
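
Wayve has not released LINGO-2's implementation; the following is a minimal sketch, with assumed toy dimensions, of the two-module structure just described: visual tokens plus scalar conditioning feed a shared backbone that decodes both commentary-token logits and a trajectory.

```python
import torch
import torch.nn as nn

class TinyVLAM(nn.Module):
    """Illustrative structure only: visual tokens plus conditioning variables
    feed a backbone that jointly decodes commentary logits and waypoints."""

    def __init__(self, d_model=256, text_vocab=1000, horizon=10):
        super().__init__()
        self.horizon = horizon
        self.cond_proj = nn.Linear(3, d_model)   # speed, speed limit, route cue
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        self.text_head = nn.Linear(d_model, text_vocab)      # commentary logits
        self.traj_head = nn.Linear(d_model, horizon * 2)     # (x, y) waypoints

    def forward(self, visual_tokens, conditioning):
        # visual_tokens: (batch, n_tokens, d_model) from the vision model
        # conditioning:  (batch, 3) scalar driving context
        cond = self.cond_proj(conditioning).unsqueeze(1)
        h = self.backbone(torch.cat([cond, visual_tokens], dim=1))
        summary = h[:, 0]                        # pooled conditioning slot
        trajectory = self.traj_head(summary).view(-1, self.horizon, 2)
        return self.text_head(h), trajectory

model = TinyVLAM()
vis = torch.randn(1, 64, 256)                # stand-in visual token sequence
cond = torch.tensor([[8.0, 13.4, 1.0]])      # speed m/s, limit m/s, route flag
text_logits, trajectory = model(vis, cond)
```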

How LINGO-2 Perceives Driving Scenes

LINGO-2's perception is demonstrated through its language capabilities. The model can:

  1. Describe what it sees: "There is a cyclist ahead on the left side of the road" -- demonstrating object detection and localization
  2. Explain its decisions: "I am slowing down because the traffic light ahead is red" -- demonstrating traffic state perception
  3. Respond to instructions: "Pull over on the left" -- demonstrating spatial understanding of road structure
  4. Ground language in vision: referential segmentation capabilities link language descriptions to specific image regions

SimLingo and CarLLaVA: Research Extensions

Wayve published two additional vision-language driving models that provide insight into their perception architecture:

CarLLaVA (CARLA Challenge 2024 Winner):

  • Uses LLaVA VLM with LLaMA backbone
  • Images split into two halves, independently encoded, concatenated, downsampled, and projected into the LLM
  • Label-free approach: no BEV, depth, or semantic segmentation labels required
  • Leverages vision encoder pre-trained on internet-scale vision-language data
  • Won 1st place in CARLA Autonomous Driving Challenge 2.0 sensor track (458% improvement over prior state-of-the-art)

SimLingo (CVPR 2025 Spotlight):

  • Vision encoder: InternViT-300M-448px (from InternVL2-1B)
  • Images split into N tiles of 448x448 pixels, each encoded independently
  • Pixel unshuffle technique downsamples tokens by 4x (each tile = 256 visual tokens; see the sketch after this list)
  • LLM backbone: Qwen2-0.5B-Instruct, fine-tuned with LoRA (alpha=64, r=32, dropout=0.1)
  • Disentangled waypoint representation:
    • Temporal speed waypoints: coordinates every 0.25 seconds (for speed control)
    • Geometric path waypoints: coordinates every meter (for lateral control)
    • This disentanglement yielded 39.9% increase in driving score
  • Action Dreaming: novel technique generating synthetic instruction-action pairs using a kinematic bicycle model and world-on-rails assumption
  • Training: 14 epochs on 8x A100 80GB GPUs, 24 hours, 3.1M samples at 4fps
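
As a concrete illustration of the token-reduction step referenced above, here is a small sketch of pixel-unshuffle style downsampling of patch tokens. The grid size (32x32 tokens per 448x448 tile, consistent with 14x14 patches) and the feature width are assumptions for illustration.

```python
import torch

def pixel_unshuffle_tokens(tokens, grid=32, factor=2):
    """Reduce the number of visual tokens by factor**2 by folding each
    factor x factor neighborhood of tokens into the channel dimension.
    Assumes the tokens form a square grid (e.g. 32x32 = 1024 patch tokens
    for a 448x448 tile with 14x14 patches)."""
    b, n, c = tokens.shape
    assert n == grid * grid
    x = tokens.reshape(b, grid, grid, c)
    x = x.reshape(b, grid // factor, factor, grid // factor, factor, c)
    x = x.permute(0, 1, 3, 2, 4, 5)          # group neighboring tokens
    return x.reshape(b, (grid // factor) ** 2, factor * factor * c)

tile_tokens = torch.randn(1, 1024, 768)        # one tile (assumed feature dim)
reduced = pixel_unshuffle_tokens(tile_tokens)  # (1, 256, 3072): 4x fewer tokens
```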

7. Camera-Centric Architecture {#7-camera-centric-architecture}

Camera Configuration

Wayve's core R&D system uses 6 monocular cameras providing 360-degree surround view. The configuration varies by platform:

| Platform | Camera Count | Configuration |
| --- | --- | --- |
| Core R&D fleet | 6 monocular cameras | 360-degree surround view |
| Nissan ProPILOT prototype | 11 cameras | Extended coverage with redundancy |
| Gen 3 L4 robotaxi | Multi-camera surround | Full redundancy for driverless operation |
| OEM consumer vehicles (2027+) | Flexible (camera-first) | OEM-determined based on vehicle architecture |

Why Camera-First

Wayve began with a camera-only sensor suite because:

  1. Information density: Cameras capture color, texture, semantic, and geometric information that LiDAR cannot. A single camera frame contains orders of magnitude more raw measurements than a typical LiDAR sweep, and each pixel carries appearance cues that a LiDAR return does not.

  2. Cost efficiency: Cameras cost orders of magnitude less than LiDAR sensors, critical for mass-market consumer vehicle deployment.

  3. AI-friendly: Modern vision transformers (ViTs) have demonstrated that 3D understanding can be extracted from 2D images with sufficient training data and model capacity. Kendall's PhD thesis was built on this principle.

  4. Scalability: Every car already has cameras. Adding more cameras is straightforward and cheap; fitting LiDAR to every car is not economically viable.

  5. Rapid prototyping: Starting camera-only was "the fastest way to prototype our AV2.0 approach" (Wayve blog).

Vision Backbone Evolution

Wayve's vision backbone has evolved significantly across their research publications:

Early Work (2018-2019):

  • Small CNNs (4 convolutional layers + 3 fully connected layers, ~10K parameters for the "Learning to Drive in a Day" RL agent)
  • SegNet-based encoder-decoder for semantic segmentation

FIERY Era (2021):

  • Multi-camera surround input
  • Per-pixel depth probability distribution for 3D lifting
  • Spatial Transformer module for ego-motion compensation in BEV
  • 3D convolutional temporal model

MILE Era (2022):

  • CNN-based image encoder
  • Depth probability distribution over predefined bins, using camera intrinsics and extrinsics
  • 3D feature voxels projected to BEV via sum-pooling on a predefined grid

Current Foundation Model:

  • Transformer-based vision backbone (likely ViT-Large or similar, based on Rig3R's use of ViT-Large)
  • Multi-camera image features extracted and lifted into 3D
  • Self-attention mechanisms for spatial and temporal reasoning
  • "Tens of millions of parameters" in the deployed driving model

Feature Extraction Pipeline

Based on Wayve's published papers, the feature extraction pipeline follows this general flow:

Multi-Camera Images (6 views)
        |
        v
Vision Backbone (per-camera feature extraction)
        |
        v
Depth Probability Distribution (per-pixel depth prediction)
        |
        v
3D Feature Lifting (using depth + camera intrinsics/extrinsics)
        |
        v
BEV Projection (sum-pooling of 3D features onto ground plane grid)
        |
        v
Spatial-Temporal Reasoning (transformer attention / 3D convolutions)
        |
        v
Latent State (compressed 1D vector encoding world state)
        |
        v
Motion Planning Head --> Trajectory
        |
        v
Auxiliary Decoders --> Depth, Semantics, Flow (for interpretability)
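
The lifting and BEV sum-pooling steps in this pipeline can be sketched for a single camera as follows, in the spirit of the FIERY/MILE lift-splat formulation. Grid size, resolution, and depth bins are toy values, and the production multi-camera implementation is not published.

```python
import torch

def lift_to_bev(features, depth_logits, intrinsics_inv, cam_to_ego,
                depth_bins, bev_size=50, bev_res=2.0):
    """Lift per-pixel image features into 3D with a depth distribution and
    sum-pool them onto a ground-plane BEV grid (single-camera sketch)."""
    c, h, w = features.shape
    d = depth_bins.shape[0]
    # Outer product: every pixel contributes a feature at every depth bin,
    # weighted by its predicted depth probability (soft 3D lifting).
    depth_prob = depth_logits.softmax(dim=0)                    # (d, h, w)
    lifted = depth_prob.unsqueeze(1) * features.unsqueeze(0)    # (d, c, h, w)

    # Back-project pixel rays into the ego frame at each candidate depth.
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=0).float()  # (3, h, w)
    rays = intrinsics_inv @ pix.reshape(3, -1)                       # (3, h*w)
    pts = depth_bins.view(d, 1, 1) * rays.view(1, 3, -1)             # (d, 3, h*w)
    pts = pts.permute(1, 0, 2).reshape(3, -1)                        # (3, d*h*w)
    pts_ego = cam_to_ego[:3, :3] @ pts + cam_to_ego[:3, 3:]          # (3, d*h*w)

    # Sum-pool features into BEV cells indexed by (x, y) in the ego frame.
    bev = torch.zeros(c, bev_size, bev_size)
    ix = (pts_ego[0] / bev_res + bev_size / 2).long().clamp(0, bev_size - 1)
    iy = (pts_ego[1] / bev_res + bev_size / 2).long().clamp(0, bev_size - 1)
    flat = lifted.permute(1, 0, 2, 3).reshape(c, -1)                 # (c, d*h*w)
    bev.view(c, -1).index_add_(1, iy * bev_size + ix, flat)
    return bev

# Toy usage with identity camera geometry and 48 depth bins.
feat = torch.randn(64, 24, 48)
logits = torch.randn(48, 24, 48)
bev = lift_to_bev(feat, logits, torch.eye(3), torch.eye(4),
                  depth_bins=torch.linspace(2.0, 50.0, 48))
```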

BEV Representation

The Bird's-Eye View (BEV) representation is central to Wayve's perception stack. Key details from published papers:

  • MILE: 3D feature voxels converted to BEV through sum-pooling on a predefined grid. The observation decoder and BEV decoder use StyleGAN-like architecture: prediction starts as a learned constant tensor, progressively upsampled with latent state injected via adaptive instance normalization.

  • FIERY: BEV spans a 100m x 100m area around the vehicle. Features from surround cameras are lifted to 3D using predicted depth distributions, projected to BEV, and registered to the present reference frame using past ego-motion via a Spatial Transformer module.

  • OFT (Orthographic Feature Transform): Wayve's 2019 paper (Roddick, Kendall, Cipolla) proposed mapping image features to an orthographic BEV representation without explicit depth estimation. The key insight: "as much reasoning as possible should be performed in this orthographic space rather than directly on the pixel-based image domain. Under this orthographic birds-eye-view representation, scale is homogeneous; appearance is largely viewpoint-independent; and distances between objects are meaningful."


8. Sensor Fusion {#8-sensor-fusion}

Camera-Radar Fusion

Wayve introduced radar to complement cameras starting with their second-generation autonomous driving system. Their fusion approach is fundamentally different from traditional hand-engineered sensor fusion:

Traditional Sensor Fusion (AV1.0):

  • Manually designed algorithms align LiDAR point clouds with camera images
  • Hand-coded rules determine which sensor to trust in different conditions
  • Fixed fusion pipelines with explicit geometric calibration
  • Failure modes are addressed individually with engineering patches

Wayve's Learned Fusion (AV2.0):

  • The end-to-end neural network learns to fuse camera and radar data automatically
  • "Our end-to-end neural network is not constrained by a hand-engineered scene representation. Instead, it learns a representation that best enables our system to leverage the complementary strengths of disparate sensing modalities." (Wayve blog)
  • Transformer architectures "are very capable of aligning representations between camera and radar data modalities"
  • The model autonomously learns optimal integration strategies without manual engineering (see the sketch below)
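
Wayve has not published its fusion architecture; as a rough illustration of how cross-attention can align the two modalities, here is a toy sketch in which camera tokens attend over projected radar returns. The radar feature layout and dimensions are assumptions.

```python
import torch
import torch.nn as nn

class CameraRadarFusion(nn.Module):
    """Toy learned fusion: camera tokens attend over radar return tokens,
    letting the network decide what to take from each modality."""

    def __init__(self, dim=128, heads=4, radar_feat=4):
        super().__init__()
        # Each radar return: e.g. (range, azimuth, Doppler velocity, RCS).
        self.radar_proj = nn.Linear(radar_feat, dim)
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, camera_tokens, radar_returns):
        # camera_tokens: (batch, n_cam_tokens, dim) from the vision backbone
        # radar_returns: (batch, n_returns, radar_feat) point-level returns
        radar_tokens = self.radar_proj(radar_returns)
        fused, _ = self.cross_attn(camera_tokens, radar_tokens, radar_tokens)
        return self.norm(camera_tokens + fused)

fusion = CameraRadarFusion()
cam = torch.randn(1, 300, 128)
radar = torch.randn(1, 64, 4)
tokens = fusion(cam, radar)   # camera tokens enriched with radar evidence
```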

What Radar Provides That Cameras Cannot

| Capability | Camera | Radar | Fusion Benefit |
| --- | --- | --- | --- |
| Illumination independence | Dependent on ambient light | Active illumination via RF waves | Robust day/night operation |
| Direct velocity measurement | Requires multi-frame optical flow estimation | Per-frame Doppler velocity measurement | Precise speed detection of other agents |
| Weather resilience | Degraded by rain, fog, snow, glare | Different weather phenomenology; complementary strength in inclement weather | Robust all-weather perception |
| Failure mode correlation | Affected by lens obstruction, sun glare, dynamic range limits | Different hardware failure risks | Uncorrelated failure modes enhance safety |
| Range measurement | Inferred from learned depth estimation | Direct distance measurement per frame | Complementary depth sources |

LiDAR Integration

LiDAR is optional in Wayve's architecture. The core AI system does not require it, but the architecture is sensor-agnostic and can ingest LiDAR data when available:

  • Used in some development vehicles for ground-truth validation and enhanced perception during R&D
  • Nissan ProPILOT prototype includes 1 next-gen LiDAR sensor alongside 11 cameras and 5 radars
  • LiDAR data is used for GAIA-3 validation: real LiDAR point clouds are overlaid on generated scenes to verify spatial consistency
  • Not required for deployment; positioned as an optional add-on for OEMs who want additional redundancy

Sensor Configuration by Deployment Tier

| Tier | Sensors | Use Case |
| --- | --- | --- |
| Minimum viable | 6 cameras | R&D prototyping, L2+ consumer |
| Standard consumer | Cameras + automotive radar | L2+/L3 consumer deployment (2027+) |
| Advanced OEM | 11 cameras + 5 radars + 1 LiDAR | Nissan ProPILOT integration |
| L4 robotaxi | Multi-camera surround + radar + optional LiDAR | Driverless operation with full redundancy |

9. Depth Estimation {#9-depth-estimation}

Alex Kendall's Foundational Depth Work

Depth estimation is arguably the single perception task most central to Alex Kendall's academic career, and this expertise is deeply embedded in Wayve's technology.

GC-Net: End-to-End Learning of Geometry and Context for Deep Stereo Regression (ICCV 2017)

Kendall's GC-Net (Geometry and Context Network) was a seminal contribution to stereo depth estimation:

  • Proposed a novel architecture for regressing disparity from rectified stereo images
  • Used knowledge of the problem's geometry to form a cost volume using deep feature representations
  • Applied 3D convolutions over the cost volume to incorporate contextual information
  • Introduced a differentiable soft argmin operation to regress sub-pixel disparity values
  • This replaced traditional hand-crafted matching cost functions with learned features while maintaining geometric structure

PoseNet: Camera Relocalization (ICCV 2015)

Kendall's PoseNet was the first CNN to regress full 6-DOF camera pose from a single RGB image end-to-end. This work established the principle that deep learning could directly estimate geometric quantities from images without traditional geometric computation pipelines.

Geometric Loss Functions for Camera Pose Regression (CVPR 2017)

Kendall and Cipolla explored novel loss functions based on geometry and scene reprojection error, showing how to automatically learn optimal weighting to simultaneously regress position and orientation. This work underpins Wayve's approach to learning geometric representations.

Wayve's Self-Supervised Depth Estimation

In the deployed system, depth estimation is self-supervised -- it is learned from geometric consistency across views and over time, without requiring ground-truth depth labels from LiDAR or other sources:

  1. Multi-view geometric consistency: With known camera intrinsics and extrinsics, the model learns depth by ensuring that features from different cameras are consistent when projected into 3D space.

  2. Temporal photometric consistency: Using consecutive frames and estimated ego-motion, the model learns depth by ensuring that a pixel in frame t, when warped to frame t+1 using the predicted depth and ego-motion, produces a photometrically consistent image (see the sketch after this list).

  3. Depth probability distributions: Rather than predicting a single depth value per pixel, Wayve's models (MILE, FIERY) predict a probability distribution over depth bins. This captures depth uncertainty and enables soft 3D lifting of features, avoiding hard depth decisions that could propagate errors.
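
A minimal sketch of the temporal photometric consistency objective in item 2, in the style of monodepth-like self-supervision (the exact losses Wayve uses are not published): back-project pixels with the predicted depth, transform them with the ego-motion, re-project into the next frame, and penalize the photometric difference.

```python
import torch
import torch.nn.functional as F

def photometric_depth_loss(frame_t, frame_t1, depth_t, K, K_inv, T_t_to_t1):
    """Warp frame t+1 back into frame t using predicted depth and ego-motion,
    then penalize the photometric difference (self-supervised depth sketch)."""
    b, _, h, w = frame_t.shape
    ys, xs = torch.meshgrid(torch.arange(h), torch.arange(w), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], 0).float().view(3, -1)

    # Back-project pixels of frame t to 3D, move them with the ego-motion,
    # and re-project into frame t+1.
    cam_pts = K_inv @ pix * depth_t.view(b, 1, -1)                 # (b, 3, h*w)
    ones = torch.ones(b, 1, h * w)
    moved = T_t_to_t1 @ torch.cat([cam_pts, ones], dim=1)          # (b, 4, h*w)
    proj = K @ moved[:, :3]
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)

    # Normalize to [-1, 1] and sample frame t+1 at the warped locations.
    grid = torch.stack([uv[:, 0] / (w - 1), uv[:, 1] / (h - 1)], dim=-1) * 2 - 1
    warped = F.grid_sample(frame_t1, grid.view(b, h, w, 2), align_corners=True)
    return torch.mean(torch.abs(warped - frame_t))

# Toy usage: identity motion and unit depth just exercise the shapes.
f0, f1 = torch.rand(1, 3, 32, 48), torch.rand(1, 3, 32, 48)
loss = photometric_depth_loss(f0, f1, torch.ones(1, 1, 32, 48),
                              torch.eye(3), torch.eye(3), torch.eye(4))
```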

Depth in the Perception Stack

Depth serves multiple roles in Wayve's stack:

| Role | How Depth Is Used |
| --- | --- |
| 3D feature lifting | Predicted depth distributions are used with camera intrinsics/extrinsics to project 2D image features into 3D voxel space for BEV construction (MILE, FIERY) |
| Auxiliary training signal | Self-supervised depth loss provides geometric inductive bias that accelerates learning of spatial understanding |
| PRISM-1 reconstruction | Depth estimation is one of the geometric inductive biases used in 4D scene reconstruction |
| Interpretability | Decoded depth maps allow engineers to verify the model's geometric understanding |
| Rig3R | Dense per-pixel 3D point prediction with confidence scores provides depth as part of geometric foundation model |

Depth Without LiDAR

Wayve's ability to estimate accurate depth from cameras alone -- without LiDAR ground truth during training -- is a direct product of Kendall's PhD research. The key insight from his thesis: "end-to-end deep learning architectures for core computer vision problems including...stereo vision" can be trained with geometric self-supervision, leveraging "underlying geometry of problems such as epipolar geometry for unsupervised learning."

This is why Wayve can operate without LiDAR while competitors like Waymo rely on LiDAR for both perception and training data generation. Wayve's depth estimation is a learned capability, not dependent on an expensive depth sensor.


10. Semantic Understanding {#10-semantic-understanding}

Semantic Segmentation: From SegNet to End-to-End

Wayve's semantic understanding capabilities trace directly to Alex Kendall's work on SegNet:

SegNet (PAMI 2017, Badrinarayanan, Kendall, Cipolla):

  • A deep fully convolutional encoder-decoder architecture for pixel-wise semantic segmentation
  • Encoder topologically identical to VGG16's 13 convolutional layers
  • Decoder uses pooling indices from the encoder's max-pooling layers for non-linear upsampling
  • Designed for memory efficiency and real-time inference -- critical for on-vehicle deployment
  • Originally demonstrated on road scene segmentation into 11 classes for autonomous driving

Bayesian SegNet (BMVC 2017, Kendall, Badrinarayanan, Cipolla):

  • Extended SegNet with uncertainty estimation via Monte Carlo dropout
  • Produced per-pixel uncertainty maps alongside semantic predictions
  • Demonstrated 2-3% segmentation improvement from uncertainty modeling
  • Established the principle that perception outputs should come with confidence estimates

Semantic Segmentation in Current System

In Wayve's current end-to-end driving model, semantic segmentation exists as an auxiliary output decoded from the model's latent representation:

  • Training signal type: Supervised (requires labeled data)
  • Purpose: Provides semantic inductive bias during training; enables interpretability monitoring
  • Not used in decision pipeline: The driving model does not consume explicit semantic labels; instead, semantic understanding is implicit in the learned representation

Scene Understanding Hierarchy

Wayve's model demonstrates understanding across multiple semantic levels:

  1. Pixel-level semantics: Road surface, sidewalk, vegetation, sky, buildings, vehicles, pedestrians, traffic signs, traffic lights
  2. Object-level understanding: Individual road users with implicit tracking (no explicit object detection module)
  3. Scene-level context: Type of road (urban, suburban, highway), intersection type, road complexity
  4. Behavioral semantics: Aggressive vs. cautious drivers, pedestrian intent, cyclist behavior
  5. Cultural semantics: Driving norms, right-of-way conventions, regional traffic patterns (learned from multi-country data)

Evolution from Explicit to Implicit Semantics

Wayve's blog post on driving computer vision (2018) described their perception system predicting "the semantic class of each pixel and the spatial layout of the scene" at 25 Hz on an NVIDIA Drive PX2. By 2022, the emphasis shifted from explicit semantic outputs to emergent representations:

"Deep-convolutional network architectures have replaced human-defined approaches, such as edge detection techniques used for lane detection." (Wayve blog)

The transition from explicit semantic segmentation (SegNet era) to implicit semantic understanding (foundation model era) reflects Wayve's core thesis: learned representations should be optimized for driving, not for human interpretability.

Traffic Light and Sign Perception

Traffic light state detection remains one of the few perception tasks that uses supervised learning with explicit labels in Wayve's system. This is likely because:

  • Traffic light states have direct safety implications
  • The binary/categorical nature of traffic light states makes them easy to label
  • Explicit traffic light detection provides a verifiable safety check

11. Uncertainty Estimation {#11-uncertainty-estimation}

Alex Kendall's Pioneering Contributions

Uncertainty estimation in deep learning-based perception is one of Alex Kendall's most significant scientific contributions, with two foundational papers:

"What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision?" (NeurIPS 2017, Spotlight Oral)

This paper, co-authored with Yarin Gal, established the framework for understanding uncertainty in vision systems:

  • Aleatoric uncertainty: Captures noise inherent in the observations (e.g., sensor noise, ambiguous scene elements). This cannot be reduced with more data. It is further divided into:
    • Homoscedastic aleatoric uncertainty: constant for all inputs (task-dependent noise)
    • Heteroscedastic aleatoric uncertainty: varies per input (data-dependent noise)
  • Epistemic uncertainty: Captures uncertainty in the model parameters -- uncertainty that can be reduced with more training data. This is modeled through Bayesian inference over network weights.

The paper demonstrated that combining both types of uncertainty improved performance on semantic segmentation and depth regression tasks.

"Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures for Scene Understanding" (BMVC 2017)

This paper applied Bayesian deep learning to semantic segmentation:

  • Used Monte Carlo dropout (MC-Dropout) at test time to approximate Bayesian inference
  • Multiple forward passes with different dropout masks produce a distribution of predictions
  • The variance of these predictions estimates model (epistemic) uncertainty
  • Demonstrated 2-3% segmentation improvement by leveraging uncertainty
  • Showed that uncertainty maps correlate with actual prediction errors -- high uncertainty regions are indeed where the model makes mistakes
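
The MC-Dropout procedure itself is simple to state in code. The sketch below uses a toy convolutional head rather than Bayesian SegNet; only the sampling procedure is the point.

```python
import torch
import torch.nn as nn

def mc_dropout_predict(model, x, n_samples=20):
    """Monte Carlo dropout: keep dropout layers active at test time and run
    several stochastic forward passes; the mean is the prediction and the
    variance across passes approximates epistemic (model) uncertainty."""
    model.eval()
    for m in model.modules():                 # re-enable dropout only
        if isinstance(m, nn.Dropout):
            m.train()
    with torch.no_grad():
        samples = torch.stack([model(x) for _ in range(n_samples)])
    return samples.mean(dim=0), samples.var(dim=0)

# Toy segmentation-style head with dropout (illustrative, not Bayesian SegNet).
net = nn.Sequential(nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
                    nn.Dropout(p=0.5),
                    nn.Conv2d(16, 11, 1))     # 11 classes, as in CamVid
image = torch.randn(1, 3, 64, 64)
mean_logits, epistemic = mc_dropout_predict(net, image)
```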

"Concrete Dropout" (NeurIPS 2017)

Co-authored with Yarin Gal and Jiri Hron, this paper proposed:

  • A continuous relaxation of dropout's discrete masks, enabling the dropout probability itself to be learned through backpropagation
  • Principled optimization of dropout rates in large models (grid-search over dropout probabilities is prohibitive)
  • In RL settings, allows agents to adapt uncertainty estimates dynamically as more data is observed

"Concrete Problems for Autonomous Vehicle Safety: Advantages of Bayesian Deep Learning" (IJCAI 2017)

Co-authored with McAllister, Gal, van der Wilk, Shah, Cipolla, and Weller, this paper articulated:

  • Three core challenges for AV safety: safety, interpretability, and compliance
  • That Bayesian deep learning addresses all three by quantifying uncertainty
  • That propagating uncertainty through the AV pipeline allows the system to "avert disaster" by recognizing when its perception is unreliable
  • That uncertainty estimation enables the system to know when it doesn't know -- a critical safety capability

How Uncertainty Is Used in Wayve's System

While Wayve does not publish full details of their production system's uncertainty handling, their published research and technical philosophy indicate several applications:

1. Perception Confidence:

  • Depth estimates come with uncertainty (probability distributions over depth bins, not point estimates)
  • Semantic predictions can be accompanied by per-pixel confidence (via Bayesian/MC-Dropout approaches)
  • The model can signal when it is uncertain about its perception, triggering safety behaviors

2. Multi-Task Loss Weighting:

  • Homoscedastic uncertainty is used to automatically weight the multiple loss functions in multi-task training (CVPR 2018 paper)
  • This prevents noisy tasks from dominating the gradient and allows the model to learn optimal task balancing

3. Anomaly Detection:

  • High epistemic uncertainty signals out-of-distribution inputs -- scenarios the model has not seen in training
  • This is critical for the "long tail" problem: the model can recognize when it is in unfamiliar territory

4. Safety Monitoring:

  • Auxiliary perception outputs (decoded from latent states) can be compared against expected values
  • Disagreement between auxiliary outputs and the model's behavior flags potential issues

5. Active Learning:

  • Uncertainty estimates identify the most informative scenarios from fleet data for retraining
  • High-uncertainty scenarios are prioritized in the training curriculum

6. Rig3R Confidence Scores:

  • Rig3R predicts dense per-pixel 3D points with confidence scores
  • These confidence-weighted predictions enable the model to express uncertainty about its 3D reconstruction

Uncertainty Propagation

One of Kendall's key insights (from "Concrete Problems for Autonomous Vehicle Safety") is that uncertainty must be propagated through the entire system, not just estimated at the perception layer. In Wayve's end-to-end architecture, this happens naturally: because there are no hard module boundaries, uncertainty in early perception features flows continuously through the network to influence the motion planning output. The model can produce more conservative trajectories when its internal representation is uncertain.


12. PRISM-1 Scene Reconstruction {#12-prism-1-scene-reconstruction}

Overview

PRISM-1 (Photorealistic Reconstruction In Static and dynamic scenes) is Wayve's scene reconstruction model for creating photorealistic 4D simulations (3D space + time) from camera-only driving data. While primarily a simulation tool, it has deep perception implications.

Core Representation: 3D Gaussian Splatting

PRISM-1 is built on 3D Gaussian Splatting as its primary scene representation (evidenced by characteristic Gaussian artifacts visible in outputs). Each Gaussian primitive encodes:

  • 3D position (mean)
  • 3D covariance (shape/orientation)
  • Color/appearance (view-dependent)
  • Opacity

The scene is represented as a collection of these Gaussians, which are rasterized via differentiable splatting for rendering.
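
A minimal sketch of one such primitive and its covariance, matching the fields listed above. PRISM-1's exact parameterization is not published; the spherical-harmonic color term and log-scale parameterization are standard 3D Gaussian Splatting conventions assumed here.

```python
import torch
from dataclasses import dataclass

@dataclass
class GaussianPrimitive:
    """One splat of a 3D Gaussian scene representation (illustrative fields)."""
    mean: torch.Tensor        # (3,) world-space position
    log_scale: torch.Tensor   # (3,) per-axis extent (log for positivity)
    rotation: torch.Tensor    # (4,) unit quaternion orienting the covariance
    color_sh: torch.Tensor    # (k, 3) spherical-harmonic coefficients (view-dependent color)
    opacity: torch.Tensor     # () scalar in [0, 1]

def covariance(g: GaussianPrimitive) -> torch.Tensor:
    """Sigma = R S S^T R^T, the anisotropic 3x3 covariance of the splat."""
    w, x, y, z = g.rotation / g.rotation.norm()
    R = torch.stack([
        torch.stack([1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)]),
        torch.stack([2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)]),
        torch.stack([2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)]),
    ])
    S = torch.diag(g.log_scale.exp())
    return R @ S @ S.T @ R.T

g = GaussianPrimitive(
    mean=torch.zeros(3), log_scale=torch.log(torch.tensor([2.0, 0.5, 0.5])),
    rotation=torch.tensor([1.0, 0.0, 0.0, 0.0]),   # identity orientation
    color_sh=torch.zeros(1, 3), opacity=torch.tensor(0.9),
)
sigma = covariance(g)   # 3x3 covariance consumed by the differentiable rasterizer
```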

Perception Components

PRISM-1 achieves generalization through both geometric and semantic inductive biases:

Geometric Inductive Biases:

  • Depth estimation: Provides geometric structure for placing Gaussians in 3D
  • Surface normals: Constrains the orientation of Gaussian primitives
  • Optical flow: Provides motion supervision for dynamic elements

Semantic Inductive Biases:

  • Semantic segmentation: Helps disentangle static background from dynamic foreground
  • Foundation vision model features: Leverages representations from large pre-trained vision models (e.g., DINO/DINOv2) for semantic understanding

Self-Supervised Scene Disentanglement

PRISM-1 separates static and dynamic scene elements in a self-supervised manner:

  • No explicit labels, scene graphs, or bounding boxes required
  • The model learns to distinguish static background (buildings, road) from dynamic foreground (vehicles, pedestrians, cyclists)
  • Handles complex dynamic elements: cyclists, pedestrians, brake lights, opening car doors, road debris
  • Maintains geometric consistency through implicit scene flow inference

Camera-Only Operation

PRISM-1 operates on camera-only inputs:

  • No LiDAR required for reconstruction
  • Generalizes across arbitrary camera rigs without additional sensors
  • Reconstructs scenes from partially seen or unseen viewpoints via novel view synthesis
  • Uses only image-level 2D self-supervision without explicit 3D labels

4D Reconstruction Capabilities

| Capability | Description |
| --- | --- |
| Novel view synthesis | Render scenes from arbitrary camera viewpoints not in original data |
| Freeze time | Camera pans around the scene while time is frozen |
| Freeze position | Ego-vehicle stationary, observe temporal motion |
| Dynamic reconstruction | Cyclists, pedestrians, deformable objects reconstructed |
| Perception outputs | Depth maps, 3D velocity magnitude from reconstruction |
| Temporal consistency | Implicit scene flow maintains coherent 4D representation |

Relationship to Perception and Ghost Gym

PRISM-1 serves as the reconstruction backbone for Ghost Gym, Wayve's closed-loop neural simulator. The perception connection is bidirectional:

  1. Perception feeds reconstruction: PRISM-1 uses perception outputs (depth, normals, flow, semantics) as geometric and semantic priors for reconstruction
  2. Reconstruction feeds perception training: Ghost Gym generates photorealistic re-simulations of real driving scenarios with modified ego trajectories, providing diverse training data for the perception-driving model
  3. Perception validation: reconstructed scenes can be used to verify that the perception model produces consistent outputs under novel viewpoints and conditions

WayveScenes101 Benchmark

Alongside PRISM-1, Wayve released the WayveScenes101 dataset for benchmarking novel view synthesis in driving:

  • 101 driving scenes from the UK and US
  • 20 seconds per scene, 10 FPS per camera, 5 synchronized cameras
  • 101,000 camera images with poses from COLMAP
  • Urban, suburban, and highway environments
  • Various weather and lighting conditions
  • Evaluation protocol includes held-out camera for off-axis reconstruction quality
  • Metrics: PSNR, SSIM, LPIPS, FID
  • Open-source code and data at github.com/wayveai/wayve_scenes

13. Temporal Processing {#13-temporal-processing}

The Critical Role of Time in Driving Perception

Driving is fundamentally a temporal task. A single frame provides a snapshot; understanding driving requires reasoning over time -- predicting where other agents are going, recognizing traffic light transitions, understanding road geometry from parallax, and planning actions that are temporally consistent.

Temporal Processing Across Wayve's Models

FIERY: 3D Convolutional Temporal Model (ICCV 2021)

FIERY processes temporal information through a dedicated 3D convolutional module:

  • Multiple past frames are lifted to BEV independently
  • BEV features are registered to the present frame using known ego-motion (via Spatial Transformer)
  • A 3D convolutional temporal model learns spatio-temporal state from the registered BEV sequence
  • Future states are predicted via conditional variational inference, with present and future distributions
  • Produces multimodal future trajectories (multiple plausible futures)

MILE: Recurrent Temporal Dynamics (NeurIPS 2022)

MILE models temporal dynamics with a recurrent neural network (RNN):

  • Observations are encoded into a compressed BEV latent vector
  • An RNN predicts the next latent state from the previous state and action
  • This enables "imagining" diverse and plausible futures
  • StyleGAN-like decoders reconstruct BEV segmentation from predicted latent states
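
A toy sketch of this recurrent "imagination" loop, rolling a compressed latent forward under a sequence of candidate actions. The dimensions and the GRU cell are illustrative stand-ins for MILE's dynamics model.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Toy recurrent dynamics: predict the next compressed latent state
    from the previous state and the action, in the spirit of MILE."""

    def __init__(self, latent_dim=64, action_dim=2):
        super().__init__()
        self.cell = nn.GRUCell(action_dim, latent_dim)

    def imagine(self, state, actions):
        # state: (batch, latent_dim); actions: (batch, horizon, action_dim)
        rollout = []
        for t in range(actions.shape[1]):
            state = self.cell(actions[:, t], state)   # next latent given action
            rollout.append(state)
        return torch.stack(rollout, dim=1)            # imagined latent futures

dyn = LatentDynamics()
z0 = torch.randn(1, 64)                               # current BEV latent
plan = torch.randn(1, 12, 2)                          # candidate steering/accel
futures = dyn.imagine(z0, plan)                       # (1, 12, 64)
```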

Probabilistic Future Prediction (ECCV 2020, Hu, Cotter, Mohan, Gurau, Kendall)

This earlier Wayve paper introduced a five-module architecture for temporal perception:

  • Perception module: learns representation from RGB video with spatio-temporal convolutional module
  • Dynamics module: models how the world evolves over time
  • Present/Future Distributions: conditional variational approach for stochastic future prediction
  • Future Prediction module: decodes predicted future to semantic segmentation, depth, and optical flow
  • Control module: drives from the learned temporal representation

The paper was the first to jointly predict ego-motion, static scene, and dynamic agent motion in a probabilistic manner.

Video Instance Segmentation: Spatio-Temporal Embedding (2019, Hu, Kendall et al.)

This Wayve paper proposed:

  • A spatio-temporal embedding loss for temporally consistent video instance segmentation
  • A 3D causal convolutional network for modeling motion (entirely causal -- no future frame information)
  • Integration of appearance, motion, and geometry cues (including a monocular self-supervised depth loss)
  • In the embedding space, video-pixels of the same instance cluster together while being separated from others, enabling natural tracking without complex post-processing
  • Real-time operation with causal architecture

GAIA-1: Autoregressive Temporal Prediction

GAIA-1 processes temporal information through autoregressive next-token prediction:

  • All encoders (video, text, action) are temporally aligned to ensure coherent timeline
  • The transformer predicts future tokens conditioned on all past tokens
  • Temporal consistency is enforced by the video diffusion decoder, which operates on frame sequences

GAIA-2: Space-Time Factorized Attention

GAIA-2 separates spatial and temporal processing explicitly:

  • Spatial attention: operates within each frame, attending across space and camera views
  • Temporal attention: operates across frames, learning temporal dynamics
  • This factorization is more efficient than full space-time attention while maintaining temporal coherence
  • The video tokenizer encoder processes 8 frames per temporal latent (temporal compression 8x)
  • The decoder jointly decodes 3 temporal latents to 24 frames for smooth temporal transitions

LINGO-2: Sequential Token Processing

LINGO-2 processes temporal information through:

  • The vision model processes camera images from consecutive timestamps into a sequence of tokens
  • The auto-regressive language model processes these temporal visual tokens alongside conditioning variables
  • This enables the model to reason about temporal context when predicting driving actions

Foundation Model Temporal Processing

Wayve's current foundation driving model uses transformer-based temporal attention:

  • Self-attention mechanisms attend across time steps, learning temporal dependencies
  • The model processes multiple past frames to build a temporal understanding of scene dynamics
  • Temporal reasoning is not a separate module -- it is integrated into the unified model

14. Geographic Generalization {#14-geographic-generalization}

The Generalization Challenge

Traditional AV systems require per-city HD map creation, rule tuning, and extensive testing before deployment in a new geography. Wayve's system has been tested in 500+ cities across Europe, North America, and Japan without city-specific fine-tuning.

How Wayve's Perception Generalizes

1. Foundation Model with Universal Backbone: The model is trained on a "universal backbone" -- a foundation model trained on petabyte-scale datasets that "encodes rich, transferable driving behaviors." This backbone learns features that transfer across geographies rather than overfitting to specific cities.

2. Data Diversity Strategy:

  • Training data from multiple countries (UK, US, Germany, Canada, Japan)
  • Includes lower-fidelity driving videos from diverse sources
  • GAIA-3 trained on data spanning 9 countries across 3 continents

3. Cross-Geographic Data Network Effect: Wayve demonstrated that adding geographically diverse data improves performance everywhere. Training on UK and US data together resulted in 3x performance improvement in the UK compared to adding the same volume of UK-only data.

4. Rapid Adaptation:

  • US deployment: 500 hours of incremental data over 8 weeks achieved UK-equivalent performance
  • Only 100 hours of data showed "strong improvements" in behavioral competencies
  • Germany: 3x better zero-shot performance than initial US deployment (benefiting from UK+US training)

Perception Challenges Across Geographies

| Challenge | How Wayve Handles It |
| --- | --- |
| Driving side (left vs. right) | Learned from data; model adapts to mirror geometry |
| Traffic sign differences | Learned from visual appearance; no pre-programmed sign database |
| Intersection rules | Learned from observed behavior (4-way stops, roundabouts, unprotected turns) |
| Road marking styles | Learned from visual features; no explicit lane line detector |
| Cultural driving norms | Learned implicitly from driving data ("too nuanced to program manually") |
| Vehicle platform differences | 100 hours of vehicle-specific training for sensor configuration adaptation |

Zero-Shot vs. Few-Shot Generalization

Wayve distinguishes between:

  • Zero-shot generalization: deploying in a new country with no local training data (demonstrated with 3x improvement in Germany due to UK+US training)
  • Few-shot adaptation: rapid adaptation with a small amount of local data (500 hours for full US equivalence)

This is fundamentally different from the AV1.0 approach, which requires complete per-city re-engineering.


15. No-HD-Map Perception {#15-no-hd-map-perception}

How Wayve Perceives Road Structure Without Prior Mapping

Traditional AV systems rely on pre-built HD maps that encode:

  • Precise lane geometry (centerlines, boundaries, widths)
  • Traffic sign and signal locations
  • Speed limits
  • Intersection topology
  • Crosswalk locations
  • Road surface markings

Wayve's system replaces all of this with learned perception from raw sensor data, augmented only by standard satellite navigation (turn-by-turn directions).

What the Model Perceives in Real-Time

Road Geometry:

  • The model learns road edges, lane boundaries, and road curvature from visual features and driving behavior
  • No explicit lane line detector -- road structure is part of the emergent representation
  • The BEV representation encodes drivable area implicitly

Intersection Topology:

  • Intersection type, topology, and right-of-way rules are learned from observed driving behavior
  • The model handles: roundabouts, 4-way stops, T-junctions, unprotected turns, signalized intersections
  • These capabilities are "not mapped out or explicitly specified" -- they emerge from training data

Traffic Infrastructure:

  • Traffic light states are detected with supervised learning (one of the few explicitly labeled tasks)
  • Traffic signs are perceived through the vision backbone's learned features
  • Speed limits, pedestrian crossings, and road configuration are understood implicitly

Dynamic Road Changes:

  • Construction zones, temporary road markings, and detours are handled through visual perception
  • No need to update maps when road conditions change -- the model perceives the current state

The Sat-Nav Interface

The only map-like input to Wayve's system is standard satellite navigation providing turn-by-turn directions. This serves as a high-level routing signal (turn left at next junction, go straight, take the second exit at the roundabout), not as a geometric reference. The model must perceive the road structure in real-time to execute these high-level commands.
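The turn-by-turn signal can therefore be thought of as a low-dimensional conditioning input to the driving model rather than a geometric reference. Below is a minimal sketch of that idea in the style of command-conditioned driving; the module names, dimensions, and command vocabulary are illustrative assumptions, not Wayve's interface.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: a driving policy conditioned on a discrete sat-nav command.
# The command (e.g. left / straight / right / take-exit) is embedded and concatenated
# with the learned scene representation; road geometry itself must come from perception.

class CommandConditionedPolicy(nn.Module):
    def __init__(self, scene_dim=512, num_commands=4, cmd_dim=32, num_waypoints=10):
        super().__init__()
        self.num_waypoints = num_waypoints
        self.cmd_embed = nn.Embedding(num_commands, cmd_dim)
        self.head = nn.Sequential(
            nn.Linear(scene_dim + cmd_dim, 256),
            nn.ReLU(),
            nn.Linear(256, num_waypoints * 2),          # (x, y) waypoints in the ego frame
        )

    def forward(self, scene_features, command_id):
        cmd = self.cmd_embed(command_id)                 # (B, cmd_dim) routing signal
        x = torch.cat([scene_features, cmd], dim=-1)
        return self.head(x).view(-1, self.num_waypoints, 2)

policy = CommandConditionedPolicy()
waypoints = policy(torch.randn(2, 512), torch.tensor([0, 2]))
print(waypoints.shape)   # torch.Size([2, 10, 2])
```

The same high-level command can produce very different trajectories depending on what the perception backbone extracts from the current scene.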

Advantages of Map-Free Perception

  1. Instant deployability: no per-city mapping required
  2. Robustness to change: road construction, temporary changes, and map errors do not affect the system
  3. Lower cost: no mapping fleet, no map maintenance infrastructure
  4. Better generalization: the model learns to perceive road structure from any location, not just pre-mapped areas

The Philosophical Connection

Wayve's map-free approach is consistent with their end-to-end philosophy: HD maps are, in essence, a hand-crafted perception cache -- pre-computed perception stored in a database. By replacing this cache with real-time learned perception, Wayve eliminates a major source of rigidity and fragility in the AV stack.


16. Self-Supervised Learning {#16-self-supervised-learning}

The Core Principle

The majority of Wayve's training is self-supervised, meaning models learn from raw, unlabeled driving data without requiring expensive per-frame human annotations. This is a critical competitive advantage: while competitors must pay for millions of labeled frames (bounding boxes, segmentations, lane markings), Wayve's data scales with fleet miles driven, not annotation budget.

Self-Supervised Perception Objectives

1. Depth Estimation (Geometric Consistency)

  • The model learns depth by ensuring that features from different cameras (spatial multi-view) and consecutive frames (temporal) are geometrically consistent when projected into 3D (a minimal reprojection-loss sketch appears after this list)
  • Uses camera intrinsics and extrinsics as supervision signals
  • No ground-truth depth labels (from LiDAR or stereo) are required
  • Leverages epipolar geometry as an unsupervised learning signal (a direct application of Kendall's PhD thesis findings)

2. Optical Flow (Frame-to-Frame Correspondence)

  • The model learns pixel-level motion (optical flow) by predicting how pixels move between consecutive frames
  • This provides a learning signal for understanding dynamic scene elements
  • Flow prediction is self-supervised: the model must predict frame-to-frame correspondences without labels

3. Ego-Motion Estimation

  • The model learns to estimate its own motion from odometry signals (wheel encoders, IMU)
  • This provides a self-supervised signal for understanding the relationship between ego-motion and visual change

4. Future Prediction

  • The model learns to predict future frames, latent states, or driving scenarios from current observations
  • This is inherently self-supervised: the future is the "label" for the current observation
  • GAIA-1/2/3 take this to the extreme, learning to generate entire future driving videos

5. Contrastive Learning and Unsupervised Object Discovery

  • Wayve has mentioned using "unsupervised object discovery" and "contrastive learning" to reduce manual segmentation labeling requirements
  • Contrastive methods learn discriminative features by pulling together representations of similar scenes/objects and pushing apart dissimilar ones

6. Image Reconstruction

  • Video tokenizers (in GAIA models) are trained with self-supervised reconstruction losses (L1, L2, LPIPS perceptual loss)
  • DINO feature distillation provides additional self-supervised semantic learning
  • PRISM-1 uses image-level 2D self-supervision without explicit 3D labels
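To make the depth objective in item 1 concrete, the sketch below shows a monodepth-style photometric reprojection loss: the only supervision comes from warping one frame into another using predicted depth, known intrinsics, and a relative pose. This is a generic, simplified formulation assumed for illustration, not Wayve's training code.

```python
import torch
import torch.nn.functional as F

# Minimal, generic sketch of self-supervised depth via photometric reprojection
# (monodepth-style): predicted depth + a known relative pose warp the source frame
# into the target frame, and the photometric error is the loss. No depth labels used.

def backproject(depth, K_inv):
    """Lift every pixel to a 3D point using predicted depth and inverse intrinsics."""
    B, _, H, W = depth.shape
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)]).float()          # (3, H, W) homogeneous pixels
    rays = K_inv @ pix.view(3, -1)                                     # (3, H*W) viewing rays
    return depth.view(B, 1, -1) * rays.unsqueeze(0)                    # (B, 3, H*W) 3D points

def reprojection_loss(target, source, depth, T_target_to_source, K, K_inv):
    B, _, H, W = target.shape
    pts = backproject(depth, K_inv)                                    # points in the target camera frame
    pts = T_target_to_source[:, :3, :3] @ pts + T_target_to_source[:, :3, 3:]   # move into the source frame
    proj = K @ pts                                                     # project into the source image
    uv = proj[:, :2] / proj[:, 2:].clamp(min=1e-6)                     # pixel coordinates (B, 2, H*W)
    uv = uv.view(B, 2, H, W)
    grid = torch.stack(                                                # normalise to [-1, 1] for grid_sample
        [2 * uv[:, 0] / (W - 1) - 1, 2 * uv[:, 1] / (H - 1) - 1], dim=-1
    )
    warped = F.grid_sample(source, grid, align_corners=True)           # source resampled into the target view
    return F.l1_loss(warped, target)                                   # photometric error supervises depth
```

In practice this is typically combined with multi-scale depth, SSIM terms, and masking of static pixels, but the core learning signal is exactly this geometric consistency check.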

Self-Supervised vs. Supervised Components

| Component | Self-Supervised | Supervised | Notes |
| --- | --- | --- | --- |
| Depth estimation | Yes (geometric consistency) | No | Core self-supervised output |
| Surface normals | Yes | No | Derived from depth/geometry |
| Optical flow | Yes (frame correspondence) | No | Motion understanding |
| Future prediction | Yes (next frame/state) | No | World modeling |
| Ego-motion | Yes (odometry signals) | No | Pose estimation |
| Semantic segmentation | Partially (contrastive learning) | Partially (labeled data) | Hybrid approach |
| Traffic light detection | No | Yes (explicit labels) | Safety-critical; requires labels |
| Driving behavior | Yes (imitation of expert data) | No | Self-supervised from recorded driving |

The "Revolution Will Not Be Supervised"

Wayve titled one of their key blog posts "The revolution will not be supervised," emphasizing that:

  • Self-supervised learning eliminates the annotation bottleneck
  • Human-defined labels impose an information ceiling (you can only learn what you label)
  • Self-supervised representations can capture nuances that defy categorization
  • Scaling self-supervised learning with data follows similar power laws to large language models

17. Foundation Model Architecture {#17-foundation-model-architecture}

Transformer-Based Core

Wayve's foundation driving model is transformer-based, using self-attention mechanisms for both spatial and temporal reasoning. While exact architectural details of the production driving model are proprietary, the published research papers provide detailed architectural insight:

Published Architectures

FIERY (ICCV 2021):

  • CNN-based image encoder (per-camera)
  • Lift-splat-shoot style 3D lifting with depth probability distributions (the lift step is sketched below)
  • Spatial Transformer for BEV registration
  • 3D convolutional temporal model
  • Probabilistic future prediction with conditional variational inference
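The "lift" step above can be written as an outer product between per-pixel context features and a predicted categorical depth distribution. The following is a simplified sketch for illustration; the splat/pooling into the BEV grid is omitted, and channel and bin counts are assumptions, not the paper's values.

```python
import torch
import torch.nn as nn

# Simplified sketch of the "lift" step in lift-splat-shoot / FIERY-style BEV perception:
# each image feature predicts a categorical distribution over depth bins, and the outer
# product places feature mass at every candidate depth along its viewing ray.

class LiftToFrustum(nn.Module):
    def __init__(self, in_channels=64, feat_channels=64, num_depth_bins=48):
        super().__init__()
        self.num_depth_bins = num_depth_bins
        # one conv predicts both the depth distribution and the context features
        self.head = nn.Conv2d(in_channels, num_depth_bins + feat_channels, kernel_size=1)

    def forward(self, image_features):                   # (B, C_in, H, W)
        x = self.head(image_features)
        depth_logits = x[:, : self.num_depth_bins]        # (B, D, H, W)
        context = x[:, self.num_depth_bins :]             # (B, C, H, W)
        depth_prob = depth_logits.softmax(dim=1)          # categorical depth distribution per pixel
        # outer product: feature vector weighted by its probability at each depth bin
        frustum = depth_prob.unsqueeze(1) * context.unsqueeze(2)   # (B, C, D, H, W)
        return frustum                                     # next step: splat/pool into the BEV grid

lift = LiftToFrustum()
frustum = lift(torch.randn(1, 64, 28, 60))
print(frustum.shape)   # torch.Size([1, 64, 48, 28, 60])
```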

MILE (NeurIPS 2022):

  • CNN-based image encoder
  • BEV projection via depth bins + sum-pooling
  • RNN-based temporal dynamics in latent space
  • StyleGAN-like decoders for observation/BEV reconstruction
  • 1D latent vector encoding world state

Rig3R (NeurIPS 2025 Spotlight):

  • ViT-Large encoder for image encoding
  • ViT-Large decoder for multi-view fusion
  • Patch tokens with 2D sine-cosine positional embeddings
  • Three prediction heads: pointmap, pose raymap, rig raymap
  • Joint attention across all images and timesteps
  • Single forward pass for dense 3D reconstruction, camera pose estimation, and rig calibration

SimLingo (CVPR 2025 Spotlight):

  • InternViT-300M-448px vision encoder
  • Qwen2-0.5B-Instruct LLM backbone
  • Tile-based image encoding (448x448 tiles)
  • LoRA fine-tuning for efficient adaptation
  • Disentangled waypoint prediction (temporal speed + geometric path)

GAIA-2 (March 2025):

  • Video tokenizer: asymmetric space-time factorized transformer (85M encoder / 200M decoder)
  • World model: 8.4B parameter space-time factorized transformer
  • 22 transformer blocks, hidden dim 4096, 32 attention heads
  • Flow matching training objective
  • Cross-attention conditioning with adaptive layer norm

Attention Mechanisms

Wayve's models use several forms of attention (a minimal factorized space-time sketch follows the table):

| Attention Type | Where Used | Purpose |
| --- | --- | --- |
| Spatial self-attention | Within each frame (GAIA-2, Rig3R) | Understanding spatial relationships between objects |
| Temporal self-attention | Across frames (GAIA-2) | Understanding temporal dynamics |
| Cross-camera attention | Across camera views (GAIA-2, Rig3R) | Multi-view geometric consistency |
| Cross-attention | Conditioning injection (GAIA-2) | Integrating action, metadata, agent information |
| Adaptive layer norm | Action injection (GAIA-2) | Efficient conditioning for continuous signals |
| Joint multi-view attention | Rig3R decoder | Fusing spatial, temporal, and geometric cues |
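A space-time factorized block, as described for the GAIA-2 world model, alternates attention over the spatial axis and the temporal axis instead of attending over all tokens jointly. The sketch below illustrates only the factorization pattern; dimensions and layer layout are assumptions, not the production architecture.

```python
import torch
import torch.nn as nn

# Minimal sketch of space-time factorized self-attention: tokens are arranged as
# (batch, time, space, channels); spatial attention mixes tokens within each frame,
# temporal attention mixes each spatial location across frames. This keeps attention
# cost near O(T*S^2 + S*T^2) instead of O((T*S)^2) for joint attention.

class FactorizedSpaceTimeBlock(nn.Module):
    def __init__(self, dim=256, heads=8):
        super().__init__()
        self.spatial_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.temporal_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, x):                          # x: (B, T, S, C)
        B, T, S, C = x.shape
        # spatial attention: fold time into the batch dimension
        h = self.norm1(x).reshape(B * T, S, C)
        h, _ = self.spatial_attn(h, h, h)
        x = x + h.reshape(B, T, S, C)
        # temporal attention: fold space into the batch dimension
        h = self.norm2(x).permute(0, 2, 1, 3).reshape(B * S, T, C)
        h, _ = self.temporal_attn(h, h, h)
        x = x + h.reshape(B, S, T, C).permute(0, 2, 1, 3)
        return x

block = FactorizedSpaceTimeBlock()
out = block(torch.randn(2, 4, 100, 256))           # 2 clips, 4 frames, 100 spatial tokens each
print(out.shape)   # torch.Size([2, 4, 100, 256])
```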

How Perception Is Embedded in the Foundation Model

In the foundation driving model, perception is not a separate module but a capability distributed across the network's layers:

  • Early layers extract low-level visual features (edges, textures, colors) from raw camera images
  • Middle layers develop increasingly abstract representations (object-like features, spatial structure, depth cues)
  • Late layers integrate spatial-temporal context for scene-level understanding
  • The motion planning head reads the final representation to produce a trajectory

The key architectural insight is that no layer boundary corresponds to a "perception/planning boundary." Features at every level contribute to both understanding the scene and deciding how to drive.

Compute Requirements for Inference

| Platform | Compute Hardware | Performance |
| --- | --- | --- |
| Early R&D (2018) | NVIDIA Drive PX2 | Real-time at 25 Hz |
| Current R&D fleet | NVIDIA GPU compute units | Real-time multi-camera processing |
| Gen 3 L4 platform | NVIDIA DRIVE AGX Thor (Blackwell, 2000 FP4 TFLOPS) | Full L3/L4 inference |
| Consumer production | Qualcomm Snapdragon Ride SoC | Energy-efficient on-device AI inference |

18. Training Pipeline {#18-training-pipeline}

Infrastructure

Wayve's training infrastructure is built on Microsoft Azure, with NVIDIA GPU hardware:

| Component | Specification |
| --- | --- |
| Cloud platform | Microsoft Azure (partnership since 2020) |
| Training GPUs (GAIA-1 era) | 64x NVIDIA A100 (world model) + 32x NVIDIA A100 (decoder) |
| Training GPUs (GAIA-2 era) | 128x NVIDIA H100 (tokenizer) + 256x NVIDIA H100 (world model) |
| Storage | Azure Blob Storage: Archive tier (raw fleet data) + Hot tier (curated training datasets) |
| Orchestration | Apache Airflow for workflow orchestration |
| Data processing | Apache Spark / Hadoop for distributed processing |
| GPU provisioning | Mix of reserved instances (base load) + spot/pre-emptible instances (burst) |
| Network | Up to 400 Gbps theoretical throughput for distributed training |

Data Pipeline

Fleet Vehicles (UK, US, Germany, Canada, Japan)
        |
        v
Raw Data Upload (Azure Blob Storage - Archive)
        |
        v
Data Processing (Apache Spark / Hadoop)
        |
        v
Active Learning Selection (identify most informative scenarios)
        |
        v
Curated Training Dataset (Azure Blob Storage - Hot)
        |
        v
Training (Azure GPU clusters, PyTorch)
        |
        v
Validation (Ghost Gym + GAIA simulation)
        |
        v
Model Deployment (to fleet vehicles via OTA)
        |
        v
Fleet Learning (real-world performance data flows back)

Training Data Scale

| Data Source | Scale |
| --- | --- |
| GAIA-1 training data | 4,700 hours of London driving data |
| Total proprietary corpus | Thousands of hours (significantly larger than the GAIA-1 subset) |
| GAIA-2 training data | Multi-country data with 25M+ sequences |
| GAIA-3 training data | 10x GAIA-2 scale, spanning 9 countries across 3 continents |
| Fleet testing coverage | 500+ cities across Europe, North America, and Japan |
| Synthetic data | Generated by GAIA models for augmentation |
| Language data | Expert drivers providing spoken commentary while driving |

Training Methodology

Driving Model Training:

  1. Multi-task learning with uncertainty-weighted losses (a minimal weighting sketch follows this list)
  2. Imitation learning from expert driving data
  3. Self-supervised objectives (depth, flow, future prediction)
  4. Active learning to prioritize challenging scenarios
  5. Continuous iteration: models trained centrally, deployed to fleet, performance data flows back
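For item 1, the uncertainty-weighted loss from Kendall et al. (CVPR 2018) learns a log-variance per task and uses it to scale that task's loss, so noisier tasks are automatically down-weighted. Below is a minimal sketch of the commonly used simplified form; the task names are illustrative.

```python
import torch
import torch.nn as nn

# Minimal sketch of uncertainty-weighted multi-task loss (after Kendall et al., CVPR 2018):
# each task i gets a learned log-variance s_i, and the combined loss is
#   sum_i  exp(-s_i) * L_i + s_i
# (the commonly used simplified form), so high-noise tasks are down-weighted automatically.

class UncertaintyWeightedLoss(nn.Module):
    def __init__(self, task_names):
        super().__init__()
        self.log_vars = nn.ParameterDict(
            {name: nn.Parameter(torch.zeros(())) for name in task_names}
        )

    def forward(self, losses):                      # losses: dict of per-task scalar losses
        total = 0.0
        for name, loss in losses.items():
            s = self.log_vars[name]
            total = total + torch.exp(-s) * loss + s
        return total

criterion = UncertaintyWeightedLoss(["depth", "segmentation", "flow"])
total = criterion({
    "depth": torch.tensor(0.8),
    "segmentation": torch.tensor(1.2),
    "flow": torch.tensor(0.5),
})
print(total)
```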

World Model Training (GAIA):

  • GAIA-1: Autoregressive next-token prediction, 15 days on 64x A100
  • GAIA-2: Flow matching with L2 velocity prediction loss, 460K steps on 256x H100 (objective sketched after this list)
  • GAIA-3: 5x more compute than GAIA-2, 10x more data
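The flow-matching objective noted for GAIA-2 can be illustrated with the standard rectified-flow-style formulation: interpolate between noise and a data latent and regress the model's predicted velocity onto the straight-line velocity. The model signature below is a placeholder for illustration, not Wayve's training code.

```python
import torch

# Schematic sketch of a flow-matching training step: sample a time t, linearly
# interpolate between Gaussian noise x0 and a data latent x1, and regress the model's
# predicted velocity onto the constant target velocity (x1 - x0) with an L2 loss.

def flow_matching_loss(model, x1, cond):
    """x1: clean latents (B, ...); cond: conditioning (actions, metadata, camera params)."""
    x0 = torch.randn_like(x1)                               # noise sample
    t = torch.rand(x1.shape[0], *([1] * (x1.dim() - 1)))     # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1                               # point on the straight path
    target_velocity = x1 - x0                                # d(xt)/dt along that path
    pred_velocity = model(xt, t.flatten(), cond)             # model predicts the velocity field
    return ((pred_velocity - target_velocity) ** 2).mean()   # L2 velocity prediction loss
```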

Perception-Specific Training:

  • Supervised perception tasks (semantics, traffic lights) use labeled subsets of fleet data
  • Self-supervised perception tasks (depth, flow, normals) use the full unlabeled corpus
  • Foundation vision model features (DINO/DINOv2) provide pre-trained semantic representations
  • Tokenizer training uses reconstruction losses (L1, L2, LPIPS) with GAN fine-tuning (a rough loss sketch follows)
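A rough sketch of such a reconstruction objective, combining L1, L2, and LPIPS terms via the public lpips package; the weights here are illustrative and the adversarial fine-tuning term is omitted.

```python
import torch
import torch.nn.functional as F
import lpips  # pip install lpips -- standard perceptual-loss package

# Rough sketch of a tokenizer reconstruction objective combining L1, L2 and LPIPS terms.
# Weights are illustrative; GAN fine-tuning (an additional adversarial term) is omitted.

perceptual = lpips.LPIPS(net="vgg")

def reconstruction_loss(reconstruction, target, w_l1=1.0, w_l2=1.0, w_lpips=0.5):
    l1 = F.l1_loss(reconstruction, target)
    l2 = F.mse_loss(reconstruction, target)
    perc = perceptual(reconstruction, target).mean()   # LPIPS expects images scaled to [-1, 1]
    return w_l1 * l1 + w_l2 * l2 + w_lpips * perc
```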

MLOps and Deployment

Wayve implements MLOps workflows for continuous model development:

  • "Convergent and predictably rewarding training cycles" through active learning
  • Continuous validation in simulation before real-world deployment
  • Automated evaluation loops
  • Customer model customization per OEM
  • Operational readiness certification

19. Calibration {#19-calibration}

Rig3R: Learned Camera Calibration

Wayve's most significant contribution to camera calibration is Rig3R (NeurIPS 2025, Spotlight), a geometric foundation model that jointly performs 3D reconstruction, camera pose estimation, and rig calibration in a single forward pass.

Architecture

Image Encoder:

  • ViT-Large processes each input image into patch tokens with 2D sine-cosine positional embeddings
  • Multiple camera views processed simultaneously

Metadata Integration:

  • Accepts optional metadata tuples: camera ID, timestamp, rig calibration (as raymaps)
  • Raymaps: per-pixel rays encoding rig-relative camera poses (a construction sketch follows this list)
  • During training, metadata fields are randomly dropped to encourage robustness when information is unavailable
  • Discrete metadata uses 1D sine-cosine embeddings; raymaps undergo linear projection
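A raymap of this kind can be constructed directly from intrinsics and a camera-to-rig transform: each pixel stores its viewing-ray direction in the rig frame together with the camera center. The sketch below is a generic construction for illustration; Rig3R's exact encoding may differ.

```python
import torch

# Rough sketch of building a raymap: for each pixel, compute its viewing-ray direction
# in the rig frame (via inverse intrinsics and the camera-to-rig rotation) and pair it
# with the camera center in the rig frame.

def make_raymap(K, T_cam_to_rig, H, W):
    ys, xs = torch.meshgrid(torch.arange(H), torch.arange(W), indexing="ij")
    pix = torch.stack([xs, ys, torch.ones_like(xs)], dim=-1).float()   # (H, W, 3) homogeneous pixels
    dirs_cam = pix @ torch.linalg.inv(K).T                              # ray directions in the camera frame
    dirs_cam = dirs_cam / dirs_cam.norm(dim=-1, keepdim=True)           # unit-length rays
    R, t = T_cam_to_rig[:3, :3], T_cam_to_rig[:3, 3]
    dirs_rig = dirs_cam @ R.T                                           # rotate rays into the rig frame
    centers = t.expand(H, W, 3)                                         # camera center, tiled per pixel
    return torch.cat([dirs_rig, centers], dim=-1)                       # (H, W, 6) raymap

K = torch.tensor([[1000.0, 0.0, 640.0], [0.0, 1000.0, 360.0], [0.0, 0.0, 1.0]])
raymap = make_raymap(K, torch.eye(4), H=720, W=1280)
print(raymap.shape)   # torch.Size([720, 1280, 6])
```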

Multi-View Decoder:

  • A second ViT-Large jointly attends across all images and timesteps
  • Fuses "spatial, temporal, and geometric cues within a shared latent space"

Three Prediction Heads

  1. Pointmap Head: Dense per-pixel 3D points with confidence scores (depth estimation)
  2. Pose Raymap Head: Per-pixel ray directions and global camera centers (camera pose estimation)
  3. Rig Raymap Head: Per-pixel rays and rig-frame camera centers (camera calibration)

Key Innovation: Rig Awareness

Rig3R is the first learned method to explicitly leverage rig constraints when available:

  • When calibration is known, it uses rig constraints to enhance 3D reconstruction accuracy
  • When calibration is unknown or incomplete, it infers rig structure and calibration from image content
  • Seamlessly handles everything from unstructured images to synchronized rigs of varying configurations

Performance

  • Outperforms traditional and learned methods by 17-45% mAA on real-world driving benchmarks
  • Evaluated on Waymo Open validation set (LiDAR ground truth) and WayveScenes101 (COLMAP ground truth)
  • On unseen rig configurations, incorporating rig constraints substantially improves accuracy
  • Robust under challenging conditions: day-night transitions, motion blur, rain, snow, glare, low-texture scenes

Practical Impact

Rig3R enables Wayve to deploy across multiple hardware configurations without bespoke calibration or brittle geometry pipelines. This is critical for their OEM licensing model, where different automakers use different camera configurations. Rather than requiring precise factory calibration for each vehicle variant, Rig3R can infer or refine calibration on-the-fly.

Camera Parameter Integration in GAIA-2

GAIA-2 also demonstrates learned calibration handling:

  • Camera intrinsics (focal lengths, principal points), extrinsics (pose), and distortion parameters are each embedded via separate learnable linear projections
  • These are summed into a unified camera encoding added to spatial tokens
  • This allows the world model to generate geometrically correct multi-view content from arbitrary camera configurations (a schematic of this pattern follows)
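Schematically, that conditioning pattern amounts to separate learnable linear projections whose outputs are summed into one camera encoding and broadcast onto the spatial tokens. The field sizes below are assumptions for illustration, not GAIA-2's actual parameterization.

```python
import torch
import torch.nn as nn

# Schematic of the camera-conditioning pattern described above: intrinsics, extrinsics,
# and distortion parameters each pass through their own learnable linear projection,
# and the results are summed into a single encoding added to the spatial tokens.

class CameraEncoding(nn.Module):
    def __init__(self, dim=512):
        super().__init__()
        self.proj_intrinsics = nn.Linear(4, dim)    # fx, fy, cx, cy
        self.proj_extrinsics = nn.Linear(12, dim)   # flattened 3x4 camera pose
        self.proj_distortion = nn.Linear(5, dim)    # e.g. k1, k2, k3, p1, p2

    def forward(self, intrinsics, extrinsics, distortion, spatial_tokens):
        cam = (
            self.proj_intrinsics(intrinsics)
            + self.proj_extrinsics(extrinsics)
            + self.proj_distortion(distortion)
        )                                            # (B, dim) unified camera encoding
        return spatial_tokens + cam.unsqueeze(1)     # broadcast-add to every spatial token

enc = CameraEncoding()
tokens = enc(torch.randn(2, 4), torch.randn(2, 12), torch.randn(2, 5), torch.randn(2, 100, 512))
print(tokens.shape)   # torch.Size([2, 100, 512])
```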

20. Key Publications {#20-key-publications}

Alex Kendall's Foundational Perception Papers

| Paper | Venue/Year | Key Perception Contribution |
| --- | --- | --- |
| PoseNet: A Convolutional Network for Real-Time 6-DOF Camera Relocalization | ICCV 2015 | First CNN for end-to-end camera pose regression from a single RGB image |
| SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image Segmentation | PAMI 2017 | Efficient encoder-decoder for real-time semantic segmentation; pooling-index upsampling |
| Bayesian SegNet: Model Uncertainty in Deep Convolutional Encoder-Decoder Architectures | BMVC 2017 | MC-Dropout uncertainty estimation for segmentation; 2-3% improvement from uncertainty |
| Modelling Uncertainty in Deep Learning for Camera Relocalization | ICRA 2016 | Bayesian uncertainty for pose estimation |
| What Uncertainties Do We Need in Bayesian Deep Learning for Computer Vision? | NeurIPS 2017 (Spotlight) | Distinguished aleatoric vs. epistemic uncertainty; framework for vision uncertainty |
| Concrete Dropout | NeurIPS 2017 | Learnable dropout rates for principled uncertainty estimation in large models |
| Geometric Loss Functions for Camera Pose Regression with Deep Learning | CVPR 2017 (Spotlight) | Geometry-based loss functions; automatic position/orientation weighting |
| End-to-End Learning of Geometry and Context for Deep Stereo Regression (GC-Net) | ICCV 2017 (Spotlight) | 3D cost volume with learned features for stereo depth; differentiable soft argmin |
| Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics | CVPR 2018 (Spotlight) | Homoscedastic uncertainty for multi-task loss weighting; joint depth + segmentation |
| Concrete Problems for Autonomous Vehicle Safety: Advantages of Bayesian Deep Learning | IJCAI 2017 | Framework for AV safety through uncertainty quantification |
| PhD Thesis: Geometry and Uncertainty in Deep Learning for Computer Vision | Cambridge 2017 | Unified framework; 2018 BMVA Prize, 2019 ELLIS Prize |

Wayve Perception Research Publications

| Paper | Venue/Year | Key Perception Contribution |
| --- | --- | --- |
| Learning to Drive in a Day | ICRA 2019 | First deep RL for autonomous driving; 10K-parameter CNN |
| Learning to Drive from Simulation without Real World Labels | ICRA 2019 | Sim-to-real transfer via unsupervised domain adaptation for perception |
| Orthographic Feature Transform for Monocular 3D Object Detection | BMVC 2019 (Oral) | BEV feature projection without explicit depth; viewpoint-independent representations |
| Learning a Spatio-Temporal Embedding for Video Instance Segmentation | 2019 | Temporal perception via spatio-temporal embeddings; causal 3D convolutions |
| Urban Driving with Conditional Imitation Learning | ICRA 2020 | Conditional imitation learning for urban driving perception-to-action |
| Probabilistic Future Prediction for Video Scene Understanding | ECCV 2020 | Joint ego-motion, static scene, dynamic agent prediction; 5-module architecture |
| FIERY: Future Instance Prediction in Bird's-Eye View from Surround Monocular Cameras | ICCV 2021 (Oral) | BEV perception from surround cameras; probabilistic future instance prediction |
| Reimagining an Autonomous Vehicle | arXiv 2021 | Manifesto for E2E driving; auxiliary self-supervised perception outputs |
| Model-Based Imitation Learning for Urban Driving (MILE) | NeurIPS 2022 | BEV perception + world model + driving policy; StyleGAN decoders |
| Driving with LLMs: Fusing Object-Level Vector Modality for Explainable Autonomous Driving | ICRA 2024 | Object-level vector input to LLM for driving; 160K QA pairs |
| GAIA-1: A Generative World Model for Autonomous Driving | arXiv 2023 | 9.1B parameter world model; perception-through-generation |
| LingoQA: Visual Question Answering for Autonomous Driving | ECCV 2024 | 419K QA pairs; Lingo-Judge evaluation metric |
| LINGO-2: Driving with Natural Language | 2024 | First closed-loop VLA model on public roads; vision-to-language-to-action |
| CarLLaVA: Vision Language Models for Camera-Only Closed-Loop Driving | 2024 | VLM for driving; CARLA Challenge 2024 winner (458% improvement) |
| WayveScenes101: A Dataset and Benchmark for Novel View Synthesis | 2024 | 101-scene driving benchmark for scene reconstruction |
| GAIA-2: A Controllable Multi-View Generative World Model | arXiv 2025 | 8.4B latent diffusion world model; multi-view generation; flow matching |
| SimLingo: Vision-Only Closed-Loop Autonomous Driving with Language-Action Alignment | CVPR 2025 (Spotlight) | VLM for driving; Action Dreaming; disentangled waypoints |
| PRISM-1: Photorealistic Reconstruction in Static and Dynamic Scenes | 2025 | 4D scene reconstruction from camera-only; 3D Gaussian Splatting |
| Rig3R: Rig-Aware Conditioning and Discovery for 3D Reconstruction | NeurIPS 2025 (Spotlight) | Geometric foundation model; learned calibration; 17-45% improvement |
| GAIA-3: Scaling World Models to Power Safety and Evaluation | 2025 | 15B parameter world model; safety evaluation; embodiment transfer |

Open-Source Code Repositories

| Repository | URL | Content |
| --- | --- | --- |
| FIERY | github.com/wayveai/fiery | BEV future prediction from surround cameras |
| LingoQA | github.com/wayveai/LingoQA | VQA benchmark for autonomous driving |
| WayveScenes101 | github.com/wayveai/wayve_scenes | Novel view synthesis dataset and benchmark |
| Driving with LLMs | github.com/wayveai/Driving-with-LLMs | Object-level LLM driving |
| SimLingo | github.com/RenzKa/simlingo | Vision-only closed-loop driving with language |

21. Key Patents {#21-key-patents}

Patent Portfolio Status

Wayve Technologies maintains a patent portfolio covering their core autonomous driving technologies. However, as a UK-headquartered company focused on commercialization rather than patent assertion, Wayve appears to emphasize trade secrets and speed of innovation over extensive public patent filings.

Expected Patent Coverage Areas

Based on Wayve's published research, product announcements, and technology disclosures, their patent portfolio likely covers:

  1. End-to-End Driving Architecture: Methods for training a single neural network from sensor inputs to driving trajectory without modular decomposition
  2. Self-Supervised Perception Training: Methods for learning depth, flow, and geometry from unlabeled driving data
  3. World Model Architecture: GAIA-series generative world models for autonomous driving (autoregressive transformer + diffusion, latent diffusion world models)
  4. Vision-Language-Action Models: LINGO-series VLA models that combine driving, language commentary, and instruction-following
  5. 4D Scene Reconstruction: PRISM-1 Gaussian Splatting-based reconstruction for driving simulation
  6. Multi-Task Uncertainty Weighting: Automatic loss weighting for joint perception task training (building on Kendall's CVPR 2018 work)
  7. Rig-Aware 3D Reconstruction: Rig3R methods for learned calibration and 3D perception across camera configurations
  8. Neural Simulation: Ghost Gym closed-loop neural simulator architecture
  9. Geographic Generalization: Methods for rapid adaptation of driving models to new countries without HD maps
  10. Sensor Fusion: Learned camera-radar fusion approaches using transformer architectures

Academic IP Foundation

Wayve's intellectual property builds on a foundation of academic research from the University of Cambridge, particularly Alex Kendall's lab. Key IP concepts include:

  • Bayesian deep learning for perception uncertainty (NeurIPS 2017)
  • Multi-task learning with homoscedastic uncertainty (CVPR 2018)
  • End-to-end learned camera pose regression (ICCV 2015, CVPR 2017)
  • Encoder-decoder architectures for segmentation (PAMI 2017)
  • Deep stereo regression with geometric cost volumes (ICCV 2017)

These academic contributions are extensively cited (52,000+ Google Scholar citations for Kendall) and form the theoretical basis for much of Wayve's commercial technology.


Sources

Public research notes collected from public sources.