Vision Transformers: First Principles

Visual: image/BEV patch tokenization through ViT attention, hierarchy/windowing, deformable attention, BEVFormer geometry, and dense output recovery.

Scope

This note explains why transformers work for images, BEV grids, and point clouds, and how those ideas translate into AV perception, SLAM, and mapping. It avoids duplicating the deeper deployment survey in sparse-attention-3d-perception.md and the world-model transformer details in transformer-world-models.md.

1. From Images to Tokens

A transformer expects a sequence of tokens. A Vision Transformer turns an image into tokens by splitting it into fixed-size patches:

```text
Image:  H x W x C
Patch:  P x P x C
Tokens: (H / P) * (W / P)
```

Each patch is flattened and linearly projected to a d_model-dimensional vector. For a 224x224 image with 16x16 patches:

```text
(224 / 16) x (224 / 16) = 14 x 14 = 196 tokens
```

The model then applies the same transformer blocks used in language. The key conceptual move is treating an image as a set of visual words.
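
As a minimal sketch of the tokenization step (assuming PyTorch; the 224/16/196 numbers match the example above), patch embedding is often implemented as a convolution whose kernel and stride both equal the patch size, which is equivalent to flattening each patch and applying one shared linear projection:

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Split an image into P x P patches and project each one to d_model."""
    def __init__(self, patch=16, in_ch=3, d_model=768):
        super().__init__()
        # kernel = stride = P: equivalent to flattening each patch and
        # applying the same linear layer to every patch.
        self.proj = nn.Conv2d(in_ch, d_model, kernel_size=patch, stride=patch)

    def forward(self, x):                      # x: (B, C, H, W)
        x = self.proj(x)                       # (B, d_model, H/P, W/P)
        return x.flatten(2).transpose(1, 2)    # (B, num_tokens, d_model)

tokens = PatchEmbed()(torch.randn(1, 3, 224, 224))
print(tokens.shape)   # torch.Size([1, 196, 768])
```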

2. What ViT Proved

The original ViT result was not that convolution is obsolete. It showed that a pure transformer over image patches can perform very well when pre-trained at sufficient scale and transferred to downstream tasks.

The first-principles tradeoff:

  • A CNN has strong local inductive bias and is data-efficient.
  • A ViT has weaker locality bias but scales well with model size and data.
  • A ViT can use the same global token-mixing mechanism for classification, detection, segmentation, BEV fusion, and video.

For AV systems with limited domain labels, this is why self-supervised pre-training matters. A ViT trained from scratch on only a small airside dataset will often underperform a CNN with stronger inductive bias. A ViT pre-trained on broad image/video data can become a strong feature extractor for novel domains.

3. Patch Size Is a Design Knob

Patch size controls token count and spatial precision.

| Patch size | Tokens for 224x224 | Strength | Weakness |
| --- | --- | --- | --- |
| 32x32 | 49 | Cheap | Loses small objects |
| 16x16 | 196 | Standard balance | Moderate detail |
| 8x8 | 784 | Better small details | Expensive |

Driving scenes contain small but safety-critical objects: cones, chocks, FOD, personnel limbs, lane markings, and far-away vehicles. Large patches can erase these signals. Dense prediction usually needs hierarchical features or smaller effective stride.
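
To make the scaling concrete, a short back-of-the-envelope calculation (plain Python, illustrative only) shows that halving the patch size quadruples the token count and multiplies the pairwise attention cost by sixteen:

```python
def vit_token_stats(img=224, patch=16):
    """Token count and pairwise attention interactions for a square image."""
    n = (img // patch) ** 2
    return n, n * n

for p in (32, 16, 8):
    tokens, pairs = vit_token_stats(patch=p)
    print(p, tokens, pairs)
# 32  49    2401
# 16  196   38416
# 8   784   614656
```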

4. Why Plain ViT Is Awkward for Dense Perception

Plain ViT produces one sequence at one scale. AV perception needs multi-scale outputs:

  • Small objects require high-resolution detail.
  • Large objects require broad context.
  • Map lanes and curbs require thin structure.
  • Aircraft and GSE can span a huge range of scales.

CNNs naturally build pyramids through downsampling. Plain ViT does not. This motivated hierarchical vision transformers.

5. Swin Transformer: Local Windows Plus Hierarchy

Swin restricts attention to local windows and shifts the windows between layers. This gives:

  • Near-linear complexity in image size.
  • Hierarchical feature maps like CNN backbones.
  • Cross-window communication through shifted windows.

The design maps well to dense tasks such as detection and segmentation. In AV pipelines, the same principle appears in BEV and point-cloud models: local attention is cheap, but the window pattern must let information move across the scene over depth.
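
A minimal sketch of those two mechanics, window partitioning and the shift between layers, assuming PyTorch and a feature map whose sides divide evenly by the window size:

```python
import torch

def window_partition(x, w):
    """(B, H, W, C) -> (B * num_windows, w*w, C); attention then runs per window."""
    B, H, W, C = x.shape
    x = x.view(B, H // w, w, W // w, w, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, w * w, C)

x = torch.randn(2, 56, 56, 96)            # stage-1 sized feature map
windows = window_partition(x, w=7)        # attention cost is per 49-token window
shifted = torch.roll(x, shifts=(-3, -3), dims=(1, 2))   # cyclic shift before next layer
shifted_windows = window_partition(shifted, w=7)        # new window boundaries
print(windows.shape, shifted_windows.shape)             # (128, 49, 96) twice
```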

6. Deformable Attention

Dense feature maps have many locations, but each query often needs only a few relevant samples. Deformable attention learns a small set of sampling offsets around a reference point:

```text
query -> reference point -> K learned offsets -> sample features -> weighted sum
```

This is useful for:

  • Small object detection, where full attention over feature maps is wasteful.
  • Multi-scale detection, where a query samples from several feature pyramid levels.
  • Camera-to-BEV lifting, where each BEV query should gather from a few projected image regions rather than all pixels.

The risk is reference failure. If the initial reference is badly placed or calibration is wrong, sparse sampling can miss the evidence entirely.
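
A single-scale, single-head sketch of the sampling pattern above (assuming PyTorch; real deformable attention adds multiple heads, multiple pyramid levels, and fused kernels, and the names here are illustrative):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DeformableSample(nn.Module):
    """Each query samples K offset locations around its reference point."""
    def __init__(self, d_model=256, k=4):
        super().__init__()
        self.offsets = nn.Linear(d_model, k * 2)   # K (dx, dy) offsets per query
        self.weights = nn.Linear(d_model, k)       # attention weight per sample
        self.k = k

    def forward(self, query, ref, feat):
        # query: (B, Q, C), ref: (B, Q, 2) in [-1, 1], feat: (B, C, H, W)
        B, Q, C = query.shape
        off = self.offsets(query).view(B, Q, self.k, 2) * 0.1   # small learned offsets
        loc = (ref.unsqueeze(2) + off).clamp(-1, 1)             # (B, Q, K, 2)
        sampled = F.grid_sample(feat, loc, align_corners=False) # (B, C, Q, K)
        w = self.weights(query).softmax(-1)                     # (B, Q, K)
        return torch.einsum("bcqk,bqk->bqc", sampled, w)        # (B, Q, C)

feat = torch.randn(2, 256, 64, 176)            # one image feature level
query = torch.randn(2, 900, 256)               # detection queries
ref = torch.rand(2, 900, 2) * 2 - 1            # reference points in [-1, 1]
out = DeformableSample()(query, ref, feat)     # (2, 900, 256)
```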

7. BEVFormer: Vision Transformer for Driving Geometry

BEVFormer introduced a clean mental model for camera-only 3D perception:

```text
Grid-shaped BEV queries
    attend to multi-camera image features for spatial evidence
    attend to previous BEV states for temporal evidence
    output a unified BEV representation
```

The important idea is that BEV is not a camera image. It is a spatial memory surface in ego or world coordinates. BEV queries are fixed locations in that surface. Camera features are sampled through cross-attention.
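
A minimal sketch of the spatial sampling step for one camera, assuming PyTorch and known calibration; the function and argument names are illustrative, and a full BEVFormer layer adds learned attention weights, several cameras, and several height samples per BEV cell:

```python
import torch
import torch.nn.functional as F

def sample_camera_features(bev_xyz, feat, K, T_cam_from_ego):
    """Project metric BEV cell centers into one camera and sample its features.

    bev_xyz: (N, 3) cell centers in the ego frame [m]
    feat:    (1, C, Hf, Wf) image feature map
    K:       (3, 3) intrinsics rescaled to the feature-map resolution
    T_cam_from_ego: (4, 4) extrinsic calibration
    Returns (N, C) sampled features and a (N,) mask of cells in front of the camera.
    """
    pts = torch.cat([bev_xyz, torch.ones(len(bev_xyz), 1)], dim=1)   # homogeneous
    cam = (T_cam_from_ego @ pts.T)[:3]                               # (3, N) camera frame
    valid = cam[2] > 0.1                                             # in front of camera
    uv = (K @ cam) / cam[2].clamp(min=0.1)                           # pixel coordinates
    Hf, Wf = feat.shape[-2:]
    grid = torch.stack([uv[0] / Wf, uv[1] / Hf], dim=-1) * 2 - 1     # normalize to [-1, 1]
    sampled = F.grid_sample(feat, grid.view(1, 1, -1, 2),
                            align_corners=False)                     # (1, C, 1, N)
    return sampled[0, :, 0].T, valid
```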

For mapping and SLAM readers:

  • BEV queries resemble map cells or local submap cells.
  • Temporal self-attention is a learned map update step.
  • Ego-motion compensation is the bridge between frame-local observations and a persistent spatial state.

8. Point Transformer V3: Point Clouds as Serialized Local Sequences

Point clouds are unordered and irregular. Full attention over points is infeasible for large LiDAR scans. PTv3 uses space-filling serialization:

```text
3D coordinates -> quantized grid -> space-filling curve index -> sorted sequence
```

Then standard window attention runs on the sorted sequence. Nearby 3D points tend to be nearby in the sequence, so local 1D windows approximate local 3D neighborhoods.
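
A minimal serialization sketch using a Z-order (Morton) code, one common space-filling curve choice; NumPy-based and illustrative, not the PTv3 implementation:

```python
import numpy as np

def morton_order(points, voxel=0.1, bits=16):
    """Sort points by a Z-order code over their quantized coordinates."""
    q = np.floor((points - points.min(0)) / voxel).astype(np.int64)
    q = np.clip(q, 0, (1 << bits) - 1)
    code = np.zeros(len(points), dtype=np.int64)
    for b in range(bits):                       # interleave x, y, z bits
        for axis in range(3):
            code |= ((q[:, axis] >> b) & 1) << (3 * b + axis)
    return np.argsort(code)                     # serialization order

pts = np.random.rand(100_000, 3) * 50.0        # synthetic LiDAR-like scan
serialized = pts[morton_order(pts)]            # nearby 3D points tend to be adjacent
```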

This is important because it avoids expensive neighbor search while using optimized attention kernels. For detailed PTv3 mechanics and benchmarks, see sparse-attention-3d-perception.md.

9. Vision Transformers in a Perception Stack

A practical AV perception stack can use transformers at several levels:

```text
Camera image backbone:
  ViT/Swin extracts image features

Camera-to-BEV:
  BEV queries cross-attend to image features

LiDAR backbone:
  PTv3 or sparse voxel attention extracts point features

Fusion:
  BEV, LiDAR, radar, and map tokens cross-attend

Temporal memory:
  BEV tokens attend to prior BEV tokens after ego-motion compensation

Task heads:
  Detection, segmentation, occupancy, flow, map elements, planning cost
```

Use transformers where adaptive long-range mixing is valuable. Use CNNs, sparse convs, or classical geometry where they are simpler, faster, and more predictable.
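
As a sketch of the fusion stage above, standard cross-attention already expresses "BEV queries gather from other modalities"; the token counts and names below are illustrative:

```python
import torch
import torch.nn as nn

d = 256
fuse = nn.MultiheadAttention(embed_dim=d, num_heads=8, batch_first=True)

bev_tokens   = torch.randn(1, 2500, d)    # queries: downsampled BEV grid cells
lidar_tokens = torch.randn(1, 5000, d)    # keys/values: sparse LiDAR features
map_tokens   = torch.randn(1, 300, d)     # keys/values: vectorized map elements

context = torch.cat([lidar_tokens, map_tokens], dim=1)
fused, _ = fuse(query=bev_tokens, key=context, value=context)
print(fused.shape)   # (1, 2500, 256): one updated feature per BEV query
```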

10. SLAM and Mapping Uses

Vision transformers are useful in learned SLAM and mapping when the task needs robust matching or semantic context:

  • Visual place recognition: global tokens summarize scene layout.
  • Local feature matching: transformer matchers resolve repeated patterns.
  • Dynamic object filtering: temporal attention separates static map evidence from movers.
  • Online semantic mapping: BEV tokens accumulate lane, curb, stand, and obstacle evidence.
  • Map change detection: current BEV/map tokens attend against stored map tiles.

However, learned features should not be the only source of metric consistency. Bundle adjustment, pose graph optimization, scan matching, and occupancy fusion still provide the checks that make a mapping system stable over long horizons.

11. AV-Specific Design Considerations

Coordinate Frames

For camera image transformers, patch positions are image-plane coordinates. For BEV and map transformers, token positions are metric world coordinates. Mixing them requires known calibration and projection geometry. If calibration drifts, attention can learn wrong associations.

Temporal Alignment

Multi-camera rigs, spinning LiDAR, radar, and vehicle state often arrive at different times. A temporal transformer can hide synchronization errors by learning average patterns, but mapping quality will suffer. Motion compensation should happen before temporal fusion when geometry matters.
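
A minimal sketch of compensating a previous BEV feature map for planar ego motion before temporal fusion (assuming PyTorch; sign conventions depend on the frame definitions and are illustrative here):

```python
import math
import torch
import torch.nn.functional as F

def warp_prev_bev(prev_bev, dx, dy, dyaw, bev_range=51.2):
    """Warp the previous BEV feature map into the current ego frame (SE(2)).

    prev_bev: (1, C, H, W) metric BEV grid covering [-bev_range, bev_range] m
    dx, dy, dyaw: ego motion between the two frames [m, m, rad]
    """
    c, s = math.cos(dyaw), math.sin(dyaw)
    # 2x3 affine matrix in normalized grid coordinates ([-1, 1] spans 2 * bev_range)
    theta = torch.tensor([[c, -s, dx / bev_range],
                          [s,  c, dy / bev_range]], dtype=torch.float32).unsqueeze(0)
    grid = F.affine_grid(theta, prev_bev.shape, align_corners=False)
    return F.grid_sample(prev_bev, grid, align_corners=False)  # zeros where unseen

aligned = warp_prev_bev(torch.randn(1, 256, 200, 200), dx=1.5, dy=0.0, dyaw=0.02)
```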

Small Objects

Patch-based models can miss small hazards. Use high-resolution paths, feature pyramids, deformable sampling, occupancy heads, or LiDAR/radar safety channels for FOD, cones, chocks, and personnel.

Edge Deployment

On Orin-class hardware, attention layers compete for GPU memory bandwidth. Favor:

  • Windowed attention.
  • BEV token budgets that match latency targets (see the token-count sketch after this list).
  • TensorRT-friendly standard attention patterns.
  • INT8 calibration with task-specific validation.
  • A fast geometric fallback for safety-critical detection.
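
To make the token-budget point from the list above concrete (plain Python, illustrative grid sizes):

```python
def bev_tokens(x_range_m=102.4, y_range_m=102.4, cell_m=0.8):
    """Token count for a metric BEV grid; halving the cell size quadruples it."""
    return round(x_range_m / cell_m) * round(y_range_m / cell_m)

print(bev_tokens(cell_m=0.8))   # 128 * 128 = 16384 tokens
print(bev_tokens(cell_m=0.4))   # 256 * 256 = 65536 tokens
```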

12. How to Choose a Vision Transformer Pattern

| Need | Prefer | Reason |
| --- | --- | --- |
| Image classification or general features | ViT or DINOv2 backbone | Scales with pre-training |
| Dense 2D detection/segmentation | Swin or hierarchical ViT | Multi-scale features |
| Camera-to-BEV perception | BEVFormer-style queries | Geometry-aware lifting |
| LiDAR segmentation | PTv3 or sparse attention | Irregular 3D tokens |
| Small object detection | Deformable attention plus high-res features | Sparse adaptive sampling |
| Real-time safety path | CNN/sparse conv plus simple head | Predictable latency |
| Long video memory | Hybrid attention plus SSM | Attention cache grows with context |

13. Common Failure Modes

  • Treating image patches as metric geometry without calibration.
  • Using global attention where local sparse attention is enough.
  • Using local windows where map-level context is required.
  • Forgetting that self-supervised image features may encode appearance, not metric occupancy.
  • Overfitting to camera rig layout and failing after sensor replacement.
  • Training with future labels or non-causal map updates in temporal perception.

14. Relationship to Other Local Docs

  • sparse-attention-3d-perception.md: the deeper deployment survey, with PTv3 mechanics and benchmarks.
  • transformer-world-models.md: transformer world-model details.

Sources

Public research notes collected from public sources.