State Space Models, S4, and Mamba: First Principles
Visual: continuous-time SSM to discretized recurrence, convolution view, S4 kernel, selective scan, and attention duality.
Why This Page Exists
State-space models are the bridge between classical dynamical systems and modern long-sequence neural networks. They matter for AV research because autonomy data is naturally streaming: LiDAR sweeps, camera frames, radar detections, tracks, occupancy maps, map-change evidence, and world-model latents arrive over time.
This page focuses on the math layer behind S4, Mamba, Mamba-2, and Mamba-3. For the broader temporal-modeling comparison, see Sequence Models: RNNs, SSMs, Attention, and Mamba. For driving-specific applied papers, see Mamba and State Space Models for Autonomous Driving.
Continuous-Time State Space
A linear state-space model has hidden state h(t), input x(t), and output y(t):
dh(t) / dt = A h(t) + B x(t)
y(t) = C h(t) + D x(t)

A controls memory dynamics, B writes the input into state, C reads state into output, and D is a direct skip from input to output.
This is already familiar to robotics engineers: Kalman filters, vehicle models, IMU propagation, and control systems all use state-space equations. Neural SSMs replace hand-designed state dynamics with trainable ones while preserving the idea of a hidden state that evolves through time.
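To make the role of A concrete, here is a minimal one-dimensional sketch (illustrative values, not from any of the cited papers): a leaky integrator whose time constant tau sets memory length, simulated with a small forward-Euler step.

```python
import numpy as np

# Toy 1-D continuous SSM: a leaky integrator.
# A = -1/tau makes the state decay toward zero, so tau sets
# how long the model "remembers" past inputs.
tau = 0.5                     # memory time constant, seconds (illustrative)
A, B, C, D = -1.0 / tau, 1.0, 1.0, 0.0

dt = 0.001                    # small Euler step for the ODE
t = np.arange(0.0, 3.0, dt)
x = (t < 0.1).astype(float)   # brief input pulse at the start

h = 0.0
y = np.empty_like(t)
for i, xi in enumerate(x):
    h = h + dt * (A * h + B * xi)   # dh/dt = A h + B x
    y[i] = C * h + D * xi           # y = C h + D x

# y rises during the pulse, then decays like exp(-t / tau):
# the eigenvalue of A directly controls memory length.
```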
Discretization
Digital systems process sequences at discrete times:
h_t = A_bar h_{t-1} + B_bar x_t
y_t = C h_t + D x_t

For a sampling interval Delta, a common zero-order-hold discretization is:
A_bar = exp(Delta A)
B_bar = A^{-1} (A_bar - I) B

The sampling interval matters. In AV systems, a model trained on fixed 10 Hz frames can fail when sensors stream at 5 Hz, 12 Hz, or event-driven intervals. Modern SSM layers often learn or condition Delta so the model can adapt memory updates to the input.
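A minimal sketch of this discretization in NumPy/SciPy, with illustrative dynamics (the helper name zoh_discretize is ours, not from any library):

```python
import numpy as np
from scipy.linalg import expm

def zoh_discretize(A, B, delta):
    """Zero-order-hold discretization for one sampling interval delta."""
    n = A.shape[0]
    A_bar = expm(delta * A)                              # A_bar = exp(Delta A)
    B_bar = np.linalg.solve(A, (A_bar - np.eye(n)) @ B)  # A^{-1}(A_bar - I) B
    return A_bar, B_bar

# Example: re-discretize the same continuous dynamics at 10 Hz and 5 Hz.
rng = np.random.default_rng(0)
A = -np.diag(rng.uniform(0.5, 2.0, size=4))   # stable diagonal dynamics
B = rng.standard_normal((4, 1))

A_10hz, B_10hz = zoh_discretize(A, B, delta=0.1)
A_5hz,  B_5hz  = zoh_discretize(A, B, delta=0.2)
# Same continuous system, different discrete parameters: this is why a
# fixed-rate model can break when the sensor rate changes, and why a
# learned or per-step Delta helps.
```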
Convolution View
Unrolling the recurrence gives:
y_t = C A_bar^0 B_bar x_t
+ C A_bar^1 B_bar x_{t-1}
+ C A_bar^2 B_bar x_{t-2}
+ ...

This is a convolution:
y = K * x
K_i = C A_bar^i B_bar

That dual view is the key SSM advantage:
- During training, use convolution-like parallelism over the whole sequence.
- During streaming inference, use recurrent state updates with constant memory.
For AV, this means one family of layers can support both offline training on long logs and online execution on a vehicle.
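A small numerical check of the dual view, using illustrative parameters: the convolution kernel K_i = C A_bar^i B_bar and the streaming recurrence should produce identical outputs.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 4, 32                                  # state size, sequence length
A_bar = 0.9 * np.eye(n) + 0.01 * rng.standard_normal((n, n))
B_bar = rng.standard_normal((n, 1))
C = rng.standard_normal((1, n))
x = rng.standard_normal(L)

# Convolution view: K_i = C A_bar^i B_bar, then y = K * x (causal).
K = np.array([(C @ np.linalg.matrix_power(A_bar, i) @ B_bar).item()
              for i in range(L)])
y_conv = np.array([sum(K[i] * x[t - i] for i in range(t + 1))
                   for t in range(L)])

# Recurrent view: constant-memory streaming update.
h = np.zeros((n, 1))
y_rec = np.empty(L)
for t in range(L):
    h = A_bar @ h + B_bar * x[t]
    y_rec[t] = (C @ h).item()

assert np.allclose(y_conv, y_rec)             # both views agree
```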
S4
S4 made long-sequence SSMs practical by using structured state matrices that preserve long memory while allowing efficient computation. Its core contribution was not just "use recurrence"; it was a parameterization and algorithmic path that made long-range state dynamics trainable and fast.
The useful intuition:
S4 learns a bank of stable filters over time.
Each filter has dynamics controlled by A, written by B, and read by C.

Strengths:
- Handles very long sequences.
- Trains efficiently with structured convolution.
- Provides a strong alternative to attention when exact token lookup is not required.
Limitations:
- Dynamics are mostly fixed after training.
- The model cannot easily choose different memory rules for different input content.
- Practical tuning can be more complex than a standard transformer block.
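To make the filter-bank intuition above concrete, here is a simplified diagonal stand-in (illustrative decay rates, not S4's actual HiPPO-based parameterization):

```python
import numpy as np

# Simplified stand-in for the filter-bank view: a bank of stable
# diagonal modes, each an exponentially decaying filter.
n, L = 8, 64
rates = np.linspace(0.5, 0.99, n)       # slow to fast decay per mode
A_bar = np.diag(rates)                  # stable: all |eigenvalues| < 1
B_bar = np.ones((n, 1))
C = np.random.default_rng(0).standard_normal((1, n))

# Mode i contributes C_i * rates[i]**k * B_i to the kernel at lag k,
# so the kernel is a learned mixture of exponential decays.
K = np.array([(C @ np.linalg.matrix_power(A_bar, k) @ B_bar).item()
              for k in range(L)])
# Modes with rates near 1 decay slowly and carry long-range memory;
# modes near 0.5 respond only to recent inputs.
```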
Selective State Spaces
Mamba adds input-dependent selection. Instead of fixed B, C, and Delta, the layer computes them from the current token:
B_t = f_B(x_t)
C_t = f_C(x_t)
Delta_t = softplus(f_Delta(x_t))

Then:
h_t = A_bar_t h_{t-1} + B_bar_t x_t
y_t = C_t h_t

This means the model can decide when to write strongly, when to forget, and when to preserve memory. In a perception stream, a sudden moving object, sensor dropout, or map-change cue can trigger a different state update than a routine static frame.
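A minimal sequential sketch of this selective update, assuming diagonal A, a single Delta_t shared across channels, and illustrative linear maps W_B, W_C, w_D standing in for f_B, f_C, f_Delta (real Mamba layers use per-channel Delta and a fused kernel):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n, L = 4, 8, 32                        # channels, state size, length
A = -np.exp(rng.standard_normal(n))       # fixed stable diagonal dynamics
W_B = 0.1 * rng.standard_normal((n, d))   # stands in for f_B
W_C = 0.1 * rng.standard_normal((n, d))   # stands in for f_C
w_D = 0.1 * rng.standard_normal(d)        # stands in for f_Delta
x = rng.standard_normal((L, d))

h = np.zeros((d, n))                      # one state vector per channel
y = np.empty((L, d))
for t in range(L):
    delta_t = np.log1p(np.exp(w_D @ x[t]))   # softplus keeps Delta_t > 0
    B_t = W_B @ x[t]                         # input-dependent write vector
    C_t = W_C @ x[t]                         # input-dependent read vector
    A_bar_t = np.exp(delta_t * A)            # per-step ZOH of diagonal A
    # Simplified B_bar_t = Delta_t * B_t (Euler-style), written per channel.
    h = A_bar_t * h + delta_t * np.outer(x[t], B_t)
    y[t] = h @ C_t                           # y_t = C_t h_t, per channel
```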
Mamba
Mamba combines selective SSMs with a hardware-aware parallel scan. Selection makes the layer more expressive, but it removes the simple fixed convolution trick. The scan algorithm restores efficient sequence processing.
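The enabling observation is that the linear recurrence h_t = a_t h_{t-1} + b_t composes associatively, so it can be evaluated as a prefix scan. A scalar sketch of the combine rule (the fused hardware kernel itself is not shown here):

```python
import numpy as np

# The recurrence h_t = a_t * h_{t-1} + b_t can be computed with an
# associative combine, which is what enables parallel scan training:
def combine(p, q):
    a1, b1 = p
    a2, b2 = q
    return (a2 * a1, a2 * b1 + b2)   # compose two recurrence steps

rng = np.random.default_rng(0)
L = 16
a = rng.uniform(0.5, 1.0, L)   # per-step decay (from A_bar_t)
b = rng.standard_normal(L)     # per-step write (from B_bar_t x_t)

# Sequential reference.
h, h_seq = 0.0, []
for t in range(L):
    h = a[t] * h + b[t]
    h_seq.append(h)

# Same result via the associative combine (folded left here; on hardware,
# associativity lets this run as a parallel prefix scan in O(log L) depth).
acc = (a[0], b[0])
h_scan = [acc[1]]
for t in range(1, L):
    acc = combine(acc, (a[t], b[t]))
    h_scan.append(acc[1])

assert np.allclose(h_seq, h_scan)
```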
Operational properties:
| Property | Meaning |
|---|---|
| Linear sequence scaling | Compute grows roughly with sequence length rather than with pairwise token interactions. |
| Constant streaming state | Inference does not require a full KV cache. |
| Content-dependent memory | Input controls how state is updated and read. |
| Less exact retrieval | A compressed state is not the same as attention over all past tokens. |
For AV, Mamba is attractive when the model must process long histories at high rate: radar streams, BEV memory, map-change evidence, fleet logs, or long occupancy sequences.
Mamba-2 and Structured State Space Duality
Mamba-2 reframes SSMs and attention through structured state space duality. The important lesson is that attention and SSMs are not unrelated species; both can be described as structured ways to move information across a sequence.
The Mamba-2 design improves parallelism and matrix-multiplication efficiency. It also makes hybrid designs more natural:
- Use attention for short-range relational reasoning and exact lookup.
- Use SSM blocks for long streaming memory.
- Use both when a model needs high-rate processing and occasional global interaction.
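A small sketch of the duality in the scalar-decay (1-semiseparable) case: the same computation can be written as a recurrence or as multiplication by a lower-triangular, attention-like mixing matrix. Parameter names here are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
n, L = 4, 12
a = rng.uniform(0.8, 1.0, L)       # scalar per-step decay
B = rng.standard_normal((L, n))    # per-step write vectors
C = rng.standard_normal((L, n))    # per-step read vectors
x = rng.standard_normal(L)

# "Attention-like" dual: materialize the lower-triangular mixing matrix
# M[t, s] = (C_t . B_s) * prod(a[s+1 .. t]), then y = M x.
M = np.zeros((L, L))
for t in range(L):
    for s in range(t + 1):
        decay = np.prod(a[s + 1 : t + 1])   # empty product = 1 when s == t
        M[t, s] = (C[t] @ B[s]) * decay
y_matrix = M @ x

# Recurrent form of the same computation.
h = np.zeros(n)
y_rec = np.empty(L)
for t in range(L):
    h = a[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

assert np.allclose(y_matrix, y_rec)   # two faces of one structured matrix
```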
Mamba-3 Direction
Mamba-3 continues the inference-first direction with richer state dynamics, including more expressive recurrent updates and multi-input/multi-output formulations. The key trend is better state tracking without giving up the linear-time inference profile.
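For orientation only, here is the classical MIMO state-space step, where matrix-valued B and C write and read several channels per update. This is the generic control-theory form, not Mamba-3's exact update; consult the paper for the actual formulation.

```python
import numpy as np

# Generic MIMO discrete SSM step: B_bar writes a p-dimensional input into
# the state and C reads out a q-dimensional output each step.
rng = np.random.default_rng(0)
n, p, q, L = 8, 3, 2, 16
A_bar = np.diag(rng.uniform(0.8, 0.99, n))
B_bar = rng.standard_normal((n, p))
C = rng.standard_normal((q, n))
x = rng.standard_normal((L, p))

h = np.zeros(n)
y = np.empty((L, q))
for t in range(L):
    h = A_bar @ h + B_bar @ x[t]   # one matrix write absorbs p input channels
    y[t] = C @ h                   # one matrix read emits q output channels
```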
For autonomy, the open question is not "will Mamba replace transformers?" The better question is where compressed streaming state is safer and cheaper than keeping full attention context.
AV Design Guidance
| Use case | SSM/Mamba fit | Caution |
|---|---|---|
| Online radar or LiDAR temporal filtering | Strong | Verify timestamp handling and sensor-rate changes. |
| Long BEV memory for occupancy | Strong | Check whether rare events survive compression. |
| Dynamic map-change evidence over time | Strong | Preserve explicit evidence logs outside hidden state. |
| Object interaction reasoning | Mixed | Attention may be better for pairwise relation lookup. |
| World-model latent rollout | Promising | Closed-loop error accumulation must be evaluated. |
| Safety-critical state estimation | Auxiliary only | Do not replace calibrated estimators without consistency evidence. |
Failure Modes
- Hidden state forgets rare but safety-critical events.
- Training sequence length differs from deployment context length.
- Sensor-rate changes alter effective discretization.
- Compressed memory hides why a decision changed.
- Long-context benchmark gains do not translate to closed-loop planning.
- Hybrid Mamba-transformer models become hard to profile on embedded hardware.
Review Checklist
- What is the sequence length during training and deployment?
- Is the model causal, bidirectional, or chunked?
- How does it handle variable sensor intervals?
- What state is retained across planner cycles?
- Can rare events be probed from the hidden state?
- Is exact retrieval required, or is compressed memory enough?
- What happens after dropped frames, time jumps, and sensor resets?

Sources
- Gu et al., "Efficiently Modeling Long Sequences with Structured State Spaces": https://arxiv.org/abs/2111.00396
- Gu and Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces": https://arxiv.org/abs/2312.00752
- Dao and Gu, "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality": https://arxiv.org/abs/2405.21060
- "Mamba-3: Enhancing Linear Sequence Modeling with MIMO and Complex-Valued Dynamics": https://arxiv.org/abs/2603.15569
- Local companion: Sequence Models: RNNs, SSMs, Attention, and Mamba