Tokenization and Discretization: First Principles

Visual: sensor-to-token conversion map covering text, image/video patches, BEV/voxel/point tokens, quantization, FSQ, and rate-distortion artifacts.

Scope

Tokenization turns continuous or structured data into discrete units that a model can index, embed, compress, and predict. In language this means subword tokens. In AV systems it can mean image patches, BEV cells, point tokens, occupancy codes, trajectory tokens, map tokens, or VQ-VAE codebook indices.

This page broadens the narrower VQ-VAE and Discrete Tokenization note. It explains the design choices that decide what a transformer, diffusion model, JEPA encoder, or world model can represent.

Why Tokenization Matters

A model cannot recover information destroyed by its tokenizer. If tokenization erases small debris, thin cones, lane markings, aircraft gear, or low-height obstacles, the downstream model will not reason about them reliably.

Tokenization controls:

  • Vocabulary size and memory.
  • Spatial or temporal resolution.
  • Compression ratio.
  • Reconstruction quality.
  • Rare-object preservation.
  • Whether prediction is discrete classification or continuous regression.

Text Tokenization

Language models typically use subword tokenization. Byte Pair Encoding starts from characters or bytes and repeatedly merges frequent adjacent symbols. SentencePiece trains subword units directly from raw text without requiring pre-tokenized words.
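The BPE merge loop can be sketched in a few lines. This is a toy illustration of the idea, not a production tokenizer: real implementations (e.g. GPT-2-style byte-level BPE) operate on bytes with pre-tokenization, which is omitted here.

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Toy BPE: repeatedly merge the most frequent adjacent symbol pair."""
    # Start from characters; each word is a tuple of symbols.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count adjacent symbol pairs, weighted by word frequency.
        pairs = Counter()
        for symbols, freq in corpus.items():
            for a, b in zip(symbols, symbols[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_corpus = Counter()
        for symbols, freq in corpus.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            new_corpus[tuple(merged)] += freq
        corpus = new_corpus
    return merges

merges = bpe_merges(["lower", "lowest", "low", "low"], num_merges=2)
```

Rare strings seen few times in training never earn merges, which is exactly why uncommon airside vocabulary fragments into many tokens.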

Why AV engineers should care:

  • Prompted perception, VLMs, and VLA models inherit subword quirks.
  • Rare airport terms, part numbers, and local procedures can split into awkward tokens.
  • Text tokenization affects grounding for rare classes such as "towbarless tug" or "belt loader".

Image and Video Tokens

Vision transformers often split an image into patches:

```text
image H x W x C -> patches N x (P*P*C) -> linear projection -> tokens
```
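This pipeline is a reshape plus a matrix multiply. A minimal NumPy sketch, where the patch size of 16 and token dimension of 128 are illustrative choices and a real ViT learns the projection:

```python
import numpy as np

def patchify(image, P):
    """Split an H x W x C image into N = (H//P)*(W//P) flattened patches."""
    H, W, C = image.shape
    assert H % P == 0 and W % P == 0, "image dims must be divisible by P"
    patches = image.reshape(H // P, P, W // P, P, C).swapaxes(1, 2)
    return patches.reshape(-1, P * P * C)            # N x (P*P*C)

rng = np.random.default_rng(0)
image = rng.standard_normal((224, 224, 3))
P, D = 16, 128                                       # patch size, token dim
patches = patchify(image, P)                         # 196 x 768
W_proj = rng.standard_normal((P * P * 3, D)) * 0.02  # learned in practice
tokens = patches @ W_proj                            # 196 x 128 tokens
```

Note that any object smaller than one 16 x 16 patch contributes only a fraction of a single token's input, which is the mechanism behind the patch-size hazard discussed next.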

Patch size is a safety-relevant parameter. Larger patches reduce compute but can erase small hazards. Video tokenization adds time:

```text
frames x patches -> spatiotemporal tokens
```

For world models, video tokens may be predicted autoregressively, masked, or denoised. Temporal stride controls whether short events such as a pedestrian stepping out, a cone falling, or a tug starting motion remain visible.
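Temporal stride can be made concrete with a toy tubelet tokenizer (ViViT-style T x P x P tubelets; all sizes here are illustrative). Any event shorter than the gap between sampled clips simply never reaches the model:

```python
import numpy as np

def tubelet_tokens(frames, P, T, stride):
    """Group frames into clips of length T (sampled every `stride` frames),
    then split each clip into T x P x P spatiotemporal tubelets."""
    F, H, W, C = frames.shape
    tokens = []
    for t0 in range(0, F - T + 1, stride):
        clip = frames[t0:t0 + T]                        # T x H x W x C
        tub = clip.reshape(T, H // P, P, W // P, P, C)
        tub = tub.transpose(1, 3, 0, 2, 4, 5)           # group per tubelet
        tokens.append(tub.reshape(-1, T * P * P * C))
    return np.concatenate(tokens)                       # N x (T*P*P*C)

frames = np.zeros((8, 64, 64, 3))                       # 8-frame clip
tok = tubelet_tokens(frames, P=16, T=2, stride=2)       # 4 clips x 16 tubelets
```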

BEV, Voxel, and Point Tokens

AV perception often tokenizes geometry instead of pixels.

| Token type | Example | Strength | Weakness |
| --- | --- | --- | --- |
| Pillar token | PointPillars-style vertical column | Efficient BEV | Loses vertical detail |
| Voxel token | Sparse 3D voxel | Metric geometry | Large memory if dense |
| Point token | Raw or sampled point | Preserves geometry | Irregular and expensive |
| Object query | DETR/Sparse4D query | Compact instances | Can miss unknown objects |
| Occupancy token | Cell/voxel state | Planner-friendly | Resolution tradeoff |
| Map token | Lanelet, polyline, landmark | Structured prior | Map staleness risk |

Whether dynamic objects can be separated from static structure also depends on this choice. A voxel tokenizer can preserve occupancy evidence for unknown debris. An object-query tokenizer may ignore objects outside its learned query distribution.
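To make the pillar row of the table concrete, here is a toy grouping of LiDAR points into BEV columns. The grid ranges, cell size, and per-pillar features are placeholder choices, not any particular implementation:

```python
import numpy as np

def pillar_tokens(points, cell=0.5, x_range=(0, 40), y_range=(-20, 20)):
    """Group (x, y, z) points into vertical BEV columns ("pillars").
    Each pillar keeps only aggregate features, so per-point vertical
    detail is collapsed -- the weakness noted in the table."""
    nx = int((x_range[1] - x_range[0]) / cell)
    ny = int((y_range[1] - y_range[0]) / cell)
    ix = ((points[:, 0] - x_range[0]) / cell).astype(int)
    iy = ((points[:, 1] - y_range[0]) / cell).astype(int)
    valid = (ix >= 0) & (ix < nx) & (iy >= 0) & (iy < ny)
    heights = {}
    for x, y, p in zip(ix[valid], iy[valid], points[valid]):
        heights.setdefault((x, y), []).append(p[2])   # only z survives
    # Pillar feature: (point count, max height) per occupied cell.
    return {k: (len(v), max(v)) for k, v in heights.items()}

pts = np.array([[1.0, 0.2, 0.1], [1.1, 0.3, 2.0], [30.0, -5.0, 0.5]])
pillars = pillar_tokens(pts)   # two occupied pillars
```

A low obstacle and a tall pole in the same cell produce the same token footprint here, which is exactly the kind of erasure the evaluation section below targets.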

Quantization

Vector quantization maps a continuous latent vector to the nearest codebook entry:

```text
z_q = e_k*,  where k* = argmin_k ||z_e - e_k||^2
```

This creates discrete indices that can be modeled like language tokens. But the codebook becomes an information bottleneck.
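The nearest-neighbor lookup is a few lines of NumPy. Codebook size and latent dimension below are illustrative, and training-time details such as the straight-through estimator and commitment loss are omitted:

```python
import numpy as np

def vector_quantize(z_e, codebook):
    """Map each encoder latent to its nearest codebook entry.
    Returns the indices (the discrete tokens) and the quantized latents."""
    # Squared distances, shape (N, K): each latent vs. each code.
    d = ((z_e[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d.argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
codebook = rng.standard_normal((512, 64))   # K = 512 codes, dim 64
z_e = rng.standard_normal((100, 64))        # encoder latents
idx, z_q = vector_quantize(z_e, codebook)
```

The histogram of `idx` over a dataset is the direct signal for the collapse failure mode below: if most mass sits on a few codes, the effective vocabulary is far smaller than K.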

Common failure modes:

  • Codebook collapse: only a few codes are used.
  • Rare-object erasure: small hazards share codes with background.
  • Domain collapse: road-trained codes poorly represent airside objects.
  • Reconstruction bias: visually pleasing reconstructions hide metric errors.

Finite Scalar Quantization

Finite Scalar Quantization replaces learned vector codebooks with independent scalar quantization of each latent dimension. It can reduce codebook collapse and simplify training. The tradeoff is that its representation geometry differs from that of nearest-neighbor vector codebooks.

FSQ is useful to know because world-model tokenizers are moving beyond classic VQ-VAE when stability matters.
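A minimal sketch of the FSQ quantization step. The levels per dimension are illustrative, and real implementations add a straight-through estimator so gradients pass through the rounding:

```python
import numpy as np

def fsq(z, levels):
    """Finite Scalar Quantization: bounded per-dimension rounding.
    Dimension i is snapped to levels[i] grid points; the implicit
    codebook size is prod(levels), with nothing learned to collapse."""
    L = np.asarray(levels)
    half = (L - 1) / 2
    bounded = np.tanh(z) * half          # squash each dim into [-half, half]
    return np.round(bounded) / half      # quantized values in [-1, 1]

z = np.array([[0.3, -2.0, 0.05]])
zq = fsq(z, levels=[8, 5, 3])            # implicit vocabulary: 8*5*3 = 120
```

Because every code in the implicit grid is reachable by construction, code usage stays spread out, which is the stability property mentioned above.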

Rate-Distortion View

Tokenization is a rate-distortion problem:

```text
minimize distortion subject to limited bits
```
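The bit budget side is easy to make concrete: a tokenizer emitting N tokens from a vocabulary of size V spends at most N * log2(V) bits per frame, so spatial resolution and codebook size trade off directly. The grid and codebook sizes below are illustrative:

```python
import math

def tokens_rate_bits(num_tokens, vocab_size):
    """Upper bound on bits per frame for a discrete tokenizer."""
    return num_tokens * math.log2(vocab_size)

# e.g. a 200 x 200 BEV grid, one token per cell, 1024-entry codebook:
bev_bits = tokens_rate_bits(200 * 200, 1024)   # 400,000 bits per frame
```

Halving the cell size quadruples the rate at fixed vocabulary, which is why distortion must be measured on the downstream quantities listed below rather than reconstruction loss alone.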

For AV, distortion should not be measured only by reconstruction loss. It should include downstream task loss:

  • Free-space error.
  • Small-object recall.
  • Localization landmark preservation.
  • Motion-state preservation.
  • Map-change sensitivity.
  • Planner collision cost.

Tokenizer Evaluation

A tokenizer should be evaluated on its own, before any world model is built on top of it.

| Test | What it catches |
| --- | --- |
| Reconstruction by object size | Small hazards disappearing |
| Occupancy IoU by height/range | Thin or far geometry loss |
| Motion-state probe | Dynamic/static evidence erased |
| Map-change probe | Moved-object evidence compressed away |
| Domain split | Airside or indoor tokens underused |
| Code usage/perplexity | Collapse or wasteful vocabulary |
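The code usage/perplexity test in the table can be computed directly from emitted token indices. A sketch, where the two inputs are synthetic extremes:

```python
import numpy as np

def code_perplexity(indices, vocab_size):
    """Perplexity of the code usage distribution: equals vocab_size
    when usage is uniform, collapses toward 1 when a few codes dominate."""
    counts = np.bincount(indices, minlength=vocab_size).astype(float)
    p = counts / counts.sum()
    entropy = -(p[p > 0] * np.log(p[p > 0])).sum()
    return float(np.exp(entropy))

uniform = np.arange(1024).repeat(4)            # every code used equally
collapsed = np.zeros(4096, dtype=int)          # one code for everything
ppl_healthy = code_perplexity(uniform, 1024)   # ~1024: full vocabulary in use
ppl_broken = code_perplexity(collapsed, 1024)  # ~1: codebook collapse
```

Running this per domain split (road vs. airside, day vs. night) catches the underused-token failure in the same table.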

AV Implementation Guidance

  • Preserve raw sensor logs and tokenized outputs for audit.
  • Probe token embeddings for safety-relevant attributes.
  • Report tokenizer metrics separately from model metrics.
  • Tune token resolution by hazard size, not only benchmark throughput.
  • Avoid training a planner on tokenized worlds before proving the tokenizer preserves collision-relevant geometry.

Sources

Compiled from publicly available research notes.