Test-Time Training (TTT) for Rapid Airport Onboarding of Autonomous GSE Perception
Research Date: 2026-04-11
Scope: Gradient-based test-time training methods for adapting LiDAR perception models to new airport environments without labeled data, deployed on NVIDIA Orin AGX within a Simplex safety architecture
Stack Context: reference airside AV stack -- ROS Noetic, 4-8 RoboSense LiDARs (RSHELIOS/RSBP), PointPillars/CenterPoint detection, GTSAM localization, Frenet planner, Simplex AC/BC
Key Takeaway: Test-time training (TTT) occupies a critical middle ground between lightweight TTA (BN statistics only, covered in test-time-adaptation-airside.md) and full supervised fine-tuning (requires labels, covered in multi-airport-adaptation.md). TTT uses self-supervised auxiliary losses -- masked point cloud reconstruction, contrastive temporal consistency, ground plane prediction -- to compute actual gradients and update model parameters at test time, recovering 40-70% of the domain gap at a new airport before any labeled data exists. On Orin AGX, a LoRA-constrained TTT update (rank 4-8, 1 gradient step per 10 frames) adds 8-15ms amortized overhead per inference cycle, fitting within the 50ms planning budget. The critical safety requirement is bounding TTT's risk: an anchor loss prevents deviation beyond 5% of pre-deployment weights, a frozen baseline model (Simplex BC) runs in parallel at all times, and OOD-triggered fallback overrides TTT if adapted model uncertainty exceeds threshold. For the reference airside AV stack's airport onboarding, TTT compresses the "unlabeled shadow mode" phase from 2 weeks of passive observation to 3-5 days of active adaptation, after which the adapted model can be evaluated against a small labeled sample (200-500 frames) and locked as the new airport baseline.
Table of Contents
- TTA vs TTT vs Fine-Tuning: A Precise Taxonomy
- TTT Methods: State of the Art (2024-2026)
- LiDAR-Specific TTT Auxiliary Tasks
- Safety-Bounded TTT on Orin AGX
- Catastrophic Forgetting Prevention
- Simplex Integration Architecture
- Airport Onboarding Protocol with TTT
- Comparison with Alternative Adaptation Methods
- Experimental Evidence and Expected Gains
- Implementation Roadmap
- Key Takeaways
- References
1. TTA vs TTT vs Fine-Tuning: A Precise Taxonomy
1.1 The Adaptation Spectrum
Deploying a perception model trained at Airport A to Airport B creates a domain gap that degrades performance by 15-40% mAP (see test-time-adaptation-airside.md, Section 1.2). The question is how to close this gap. Three families of methods exist, forming a strict hierarchy of adaptation power and risk:
Adaptation Power (low → high)
Risk of Failure (low → high)
┌──────────────────────────────────────────────────────────────────────┐
│ Level 0: No Adaptation │
│ Deploy source model as-is. Accept 15-40% mAP loss. │
│ Risk: None (model unchanged). Power: None. │
├──────────────────────────────────────────────────────────────────────┤
│ Level 1: Test-Time Adaptation (TTA) │
│ Update BN statistics or minimize prediction entropy. │
│ No gradient on main task loss. No auxiliary objectives. │
│ Methods: TENT, BN-Adapt, SAR, CoTTA, RoTTA │
│ Recovers: 5-15% of domain gap. Risk: Low (BN params only). │
├──────────────────────────────────────────────────────────────────────┤
│ Level 2: Test-Time Training (TTT) ◄── THIS DOC │
│ Compute gradients via self-supervised auxiliary losses. │
│ Update backbone/encoder parameters using reconstruction, │
│ contrastive, or prediction objectives. No task labels needed. │
│ Methods: TTT++, TTT-MAE, TTT-Linear, online LoRA+MAE, LAME │
│ Recovers: 20-50% of domain gap. Risk: Medium (backbone changes). │
├──────────────────────────────────────────────────────────────────────┤
│ Level 3: Supervised Fine-Tuning │
│ Requires labeled data from target domain. │
│ PointLoRA fine-tuning, head retraining, full fine-tuning. │
│ Methods: LoRA, adapter layers, full backprop │
│ Recovers: 60-90% of domain gap. Risk: Low (labeled guidance). │
└──────────────────────────────────────────────────────────────────────┘

1.2 What Makes TTT Different from TTA
The distinction is fundamental and often confused in the literature. TTA and TTT differ in three critical dimensions:
| Dimension | TTA (Level 1) | TTT (Level 2) |
|---|---|---|
| What parameters update | BatchNorm affine (gamma, beta) only -- typically <0.1% of model | Backbone encoder weights, feature extractor -- up to 5-20% of model |
| Loss function | Prediction entropy (unsupervised signal from model output) | Self-supervised auxiliary loss (reconstruction, contrastive, prediction) |
| Gradient source | Gradient from model's own predictions (circular, self-referential) | Gradient from external objective with ground truth (self-supervised, but grounded) |
| Training-time requirement | None -- any model can be TTA'd post-hoc | Must be co-trained with auxiliary task during source training |
| Information exploited | Only model confidence on test data | Geometric/structural properties of test data itself |
| Typical mAP recovery | 5-15% of gap | 20-50% of gap |
| Failure mode | Entropy collapse (confident but wrong) | Feature drift (backbone diverges from task-relevant features) |
| Compute overhead | 5-15% per inference | 50-200% per update step (amortized to 5-20% if batched) |
The key insight: TTA is limited because entropy minimization uses the model's own uncertain predictions as signal -- a circular process that amplifies errors when the domain shift is large. TTT breaks this circularity by introducing an external self-supervised objective that provides a genuine learning signal grounded in the structure of the test data itself. When you mask 70% of a LiDAR point cloud and ask the model to reconstruct it, the reconstruction error is an objective measure of how well the model understands the current data distribution -- independent of downstream task predictions.
1.3 When TTT Is Appropriate
TTT is the right choice when all of the following hold:
- No labeled target data exists yet -- rules out supervised fine-tuning
- Domain shift is moderate to large (>15% mAP drop) -- TTA alone is insufficient
- Compute budget allows gradient updates -- Orin can afford periodic backprop
- A safety fallback exists -- Simplex BC catches TTT failures
- Auxiliary task was included during source training -- model architecture supports it
For the reference airside AV stack's airport onboarding, conditions 1-4 are always met. Condition 5 requires architectural planning: the auxiliary task head must be trained alongside the main detection head during source model training. This is a one-time cost.
1.4 What TTT Cannot Do
TTT has hard limits that must be understood before deployment:
- Cannot learn new object classes. If Airport B has a GSE type never seen during training, TTT cannot create a detection head for it. It can only adapt features to better represent the new domain's distribution. New classes require supervised fine-tuning or active learning.
- Cannot overcome catastrophic architecture mismatch. If the source model's receptive field is too small for Airport B's wider aprons, TTT cannot fix an architectural limitation.
- Cannot guarantee convergence. Self-supervised gradients may not reduce the task-relevant domain gap if the auxiliary objective is weakly correlated with the main task.
- Cannot replace validation. TTT-adapted models must still pass quantitative evaluation before being promoted from shadow mode.
2. TTT Methods: State of the Art (2024-2026)
2.1 Method Overview
| Method | Year | Venue | Auxiliary Task | Updates | Params Updated | mAP/Acc Gain | Orin Feasibility |
|---|---|---|---|---|---|---|---|
| TTT (original) | 2020 | ICML | Rotation prediction | Per-batch SGD | Shared encoder | +3-5% acc | Yes (simple) |
| TTT++ | 2021 | NeurIPS | Multi-head contrastive | Per-batch | Shared encoder + heads | +5-8% acc | Marginal |
| TTT-MAE | 2023 | NeurIPS | Masked autoencoder reconstruction | Per-batch | Encoder (LoRA) | +8-15% acc | Yes (w/ LoRA) |
| TTT-Linear | 2024 | ICML | Linear self-supervised | Per-token (!) | TTT layer replaces attention | N/A (architectural) | Research only |
| Online LoRA+MAE | 2024 | ECCV | MAE + entropy joint loss | Accumulated | LoRA adapters | +10-20% mAP | Yes (designed for edge) |
| LAME | 2022 | CVPR | Laplacian manifold | Per-batch | Output layer only | +2-5% acc | Yes (minimal) |
| TTT-Bench/ActTTT | 2025 | ICLR | Active sample selection for TTT | Selected batches | Encoder | +12-18% acc | Yes |
| ClusTTT | 2025 | Preprint | Cluster-guided TTT | Per-cluster | Shared encoder | +7-12% acc | Yes |
| Continual-TTT | 2025 | CVPR | Streaming self-supervised | Sliding window | LoRA + BN | +8-14% acc | Yes |
2.2 TTT (Original): Rotation Prediction (Sun et al., ICML 2020)
The foundational TTT paper introduced the core idea: train a model with two heads -- a main task head and a self-supervised auxiliary head. At test time, use the auxiliary head's loss to update the shared encoder.
Architecture:
Input Point Cloud
│
┌────▼────┐
│ Shared │
│ Encoder │ ← Updated by auxiliary gradient at test time
│ f(x) │
└──┬───┬──┘
│ │
┌────────▼┐ ┌▼────────┐
│ Main │ │Auxiliary│
│ Task │ │ Task │
│ Head │ │ Head │
│(detect) │ │(rotate)│
└────┬────┘ └───┬────┘
│ │
Detections Rotation
(frozen at prediction
 test time)     loss → grad

Original auxiliary task: Predict which of 4 rotations (0, 90, 180, 270 degrees) was applied to the input. This is a pretext task that forces the encoder to learn orientation-aware features.
Training procedure:
- For each training sample, jointly minimize: L_total = L_main(x, y) + alpha * L_aux(rotate(x))
- Both heads share the encoder, so auxiliary gradients shape the encoder features
Test-time procedure:
- For each test batch, compute auxiliary loss: L_aux(rotate(x_test))
- Backpropagate through encoder only (main head frozen)
- Update encoder parameters: theta_enc ← theta_enc - lr * grad(L_aux)
- Forward pass with updated encoder through main head for final prediction
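The test-time procedure above can be sketched end-to-end. Everything here is a toy stand-in -- the tiny MLP encoder, the `rotate_z` helper, and the learning rate are illustrative, not the reference stack's PointPillars:

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

# Toy shared encoder + rotation head (stand-ins for the real backbone)
encoder = nn.Sequential(nn.Linear(3, 32), nn.ReLU(), nn.Linear(32, 32))
aux_head = nn.Linear(32, 4)  # predicts which of 4 yaw rotations was applied
# Only the shared encoder is given to the optimizer (main head stays frozen)
optimizer = torch.optim.SGD(encoder.parameters(), lr=1e-3)

def rotate_z(points, k):
    """Rotate an (N, 3) cloud by k * 90 degrees about the z axis."""
    a = k * math.pi / 2
    rot = torch.tensor([[math.cos(a), -math.sin(a), 0.0],
                        [math.sin(a),  math.cos(a), 0.0],
                        [0.0,          0.0,         1.0]])
    return points @ rot.T

def ttt_rotation_step(points):
    """One test-time update: predict the applied rotation, backprop
    the auxiliary loss, and step the encoder parameters only."""
    k = torch.randint(0, 4, (1,)).item()
    feats = encoder(rotate_z(points, k)).mean(dim=0)  # global feature
    logits = aux_head(feats)
    loss = F.cross_entropy(logits.unsqueeze(0), torch.tensor([k]))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

points = torch.randn(256, 3)  # stand-in for one LiDAR scan
loss = ttt_rotation_step(points)
```

At random initialization the loss sits near ln(4) ≈ 1.39; it falls as the encoder adapts to the test distribution.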
Limitations for 3D LiDAR:
- Rotation prediction is trivially solved by many 3D encoders (gravity provides orientation cue from ground plane)
- The 4-class rotation task provides weak gradients -- limited information per update
- Point cloud rotation changes the density distribution (beam pattern is cylindrical, not spherical), introducing distribution shift within the auxiliary task itself
Nevertheless: The TTT framework is sound. The key contribution is the idea, not the specific auxiliary task. All subsequent methods improve by choosing better auxiliary objectives.
2.3 TTT++ (Liu et al., NeurIPS 2021)
TTT++ replaces rotation prediction with a multi-head self-supervised objective:
Improvements over TTT:
- Contrastive auxiliary task: Uses SimCLR-style contrastive learning instead of rotation. Augmented views of the same input should produce similar features; different inputs should produce dissimilar features.
- Multi-head architecture: Multiple auxiliary heads capture different aspects of the data distribution (geometry, density, spatial extent). Each head provides independent gradient signal.
- Momentum encoder: Exponential moving average of the encoder provides stable contrastive targets, preventing mode collapse during test-time updates.
Contrastive loss at test time:
# TTT++ test-time update (pseudocode for LiDAR adaptation)
def ttt_plus_plus_update(encoder, aux_heads, momentum_encoder,
                         point_cloud, augment_fn, optimizer):
    """
    Multi-head contrastive TTT update.
    Args:
        encoder: Shared backbone (e.g., PointPillars encoder)
        aux_heads: List of K projection heads for contrastive learning
        momentum_encoder: EMA copy of encoder (provides stable targets)
        point_cloud: Current LiDAR scan (N x 4: x, y, z, intensity)
        augment_fn: Point cloud augmentation (jitter, drop, subsample)
        optimizer: SGD/Adam over encoder params only
    """
    # Generate two augmented views
    view_1 = augment_fn(point_cloud)  # random jitter + subsample
    view_2 = augment_fn(point_cloud)  # different augmentation

    # Encode both views
    z1 = encoder(view_1)  # Online encoder (updated)
    with torch.no_grad():
        z2 = momentum_encoder(view_2)  # Momentum encoder (stable)

    # Multi-head contrastive loss
    total_loss = 0
    for head in aux_heads:
        p1 = head(z1)
        p2 = head(z2)
        # Negative cosine similarity (minimize → maximize agreement)
        loss = -F.cosine_similarity(p1, p2.detach(), dim=-1).mean()
        total_loss += loss

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    # Update momentum encoder
    with torch.no_grad():
        for p_online, p_momentum in zip(
            encoder.parameters(), momentum_encoder.parameters()
        ):
            p_momentum.data = 0.999 * p_momentum.data + 0.001 * p_online.data

Compute cost: 2x forward pass (two views) + 1x backward pass through encoder. On Orin, for a PointPillars encoder: ~6.84ms (forward) x 2 + ~10ms (backward) = ~24ms per update. Amortized over 10 frames = 2.4ms per inference.
Failure modes:
- Feature collapse: all features converge to a constant, rendering both auxiliary and main tasks useless. Mitigated by momentum encoder and stop-gradient.
- Augmentation sensitivity: if augmentations are too weak, the contrastive task is trivial (no learning). If too strong, the augmented views are semantically different (wrong learning signal).
2.4 TTT-MAE (Gandelsman et al., NeurIPS 2023)
TTT-MAE replaces contrastive learning with masked autoencoder (MAE) reconstruction as the auxiliary task. This is the most promising TTT variant for LiDAR perception.
Core idea: At test time, mask a large fraction (60-80%) of the input and train the encoder to reconstruct the masked portion. The reconstruction loss provides a direct measure of how well the encoder models the current data distribution.
Why MAE is superior to contrastive for LiDAR TTT:
| Property | Contrastive (TTT++) | MAE (TTT-MAE) |
|---|---|---|
| Gradient quality | Relative (push/pull pairs) | Absolute (reconstruction error) |
| Batch dependence | Needs negative pairs (batch-size sensitive) | Works on single sample |
| Information per update | O(B^2) pairwise comparisons | O(N_masked) reconstruction targets |
| Sensitivity to augmentations | High (must design good augmentations) | Low (masking is natural for point clouds) |
| LiDAR suitability | Moderate (what augmentations for 3D?) | Excellent (point dropping is physically motivated) |
Architecture for LiDAR:
Input: Point cloud P (N x 4)
│
▼
┌───────────────────────┐
│ Random masking (70%) │
│ P_visible (0.3N x 4) │
│ P_masked (0.7N x 4) │
└──────┬────────────────┘
│
┌──────▼──────┐
│ Encoder │ ← Updated by reconstruction gradient
│ f(x) │
└──────┬──────┘
│
┌──────┼──────┐
│ │ │
▼ │ ▼
┌──────┐ │ ┌──────────┐
│ Main │ │ │ Decoder │
│ Head │ │ │ g(z) │
│(det) │ │ │ Reconstruct
└──┬───┘ │ │ masked │
│ │ │ points │
│ │ └────┬─────┘
│ │ │
Detections │ L_recon = ChamferDist(P_masked, g(f(P_visible)))
Not used │
at test │
  time          Gradient → Update encoder

Reconstruction loss: Chamfer distance between predicted and actual masked points:
L_recon = (1/|P_m|) * sum_{p in P_m} min_{q in P_hat} ||p - q||^2
        + (1/|P_hat|) * sum_{q in P_hat} min_{p in P_m} ||q - p||^2

Where P_m is the set of masked points and P_hat is the decoder's reconstruction.
Results from literature:
- On ImageNet-C (2D, but instructive): TTT-MAE recovers 60-75% of corruption-induced accuracy drop
- On ScanObjectNN (3D object classification): +8.2% accuracy over no-adaptation baseline under point cloud corruptions
- On PointDA-10 (3D domain adaptation): +11.4% over source-only, +6.1% over TENT
- Consistently outperforms contrastive TTT methods across corruption types
Compute budget on Orin:
| Component | Time (Orin AGX, FP16) | Memory |
|---|---|---|
| Masking + tokenization | ~0.5ms | Negligible |
| Encoder forward (visible tokens) | ~4ms (30% of full ~6.84ms for PointPillars) | ~200MB |
| Decoder forward | ~3ms (lightweight) | ~50MB |
| Chamfer distance computation | ~1ms | ~20MB |
| Backward through encoder | ~8ms | ~400MB (activations) |
| Optimizer step (LoRA params) | ~0.5ms | ~10MB |
| Total per update step | ~17ms | ~680MB |
| Amortized (1 update per 10 frames) | ~1.7ms per inference | — |
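Within that budget, a single TTT-MAE update step looks roughly like the sketch below. The encoder, decoder, and `adapter` (a stand-in for LoRA) are hypothetical toy modules, and the toy decoder tiles one global latent rather than decoding per-token:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Toy stand-ins: frozen encoder/decoder, small trainable adapter (≈ LoRA)
encoder = nn.Sequential(nn.Linear(3, 64), nn.ReLU(), nn.Linear(64, 64))
decoder = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 3))
adapter = nn.Linear(64, 64)  # the only parameters the TTT optimizer touches
for p in list(encoder.parameters()) + list(decoder.parameters()):
    p.requires_grad_(False)
optimizer = torch.optim.Adam(adapter.parameters(), lr=1e-4)

def chamfer(a, b):
    """Bidirectional Chamfer distance between two (N, 3) point sets."""
    d = torch.cdist(a, b)
    return d.min(dim=1).values.mean() + d.min(dim=0).values.mean()

def ttt_mae_step(points, mask_ratio=0.7):
    """Mask 70% of points, reconstruct them from the visible 30%,
    and take one optimizer step on the adapter only."""
    n = points.shape[0]
    perm = torch.randperm(n)
    n_vis = int(n * (1 - mask_ratio))
    visible, masked = points[perm[:n_vis]], points[perm[n_vis:]]
    z = adapter(encoder(visible)).mean(dim=0)      # global latent code
    pred = decoder(z.expand(masked.shape[0], -1))  # toy decoder: tile latent
    loss = chamfer(pred, masked)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

points = torch.randn(512, 3)  # stand-in for one LiDAR scan
loss = ttt_mae_step(points)
```

Gradients flow through the frozen decoder and encoder ops into the adapter, mirroring how LoRA keeps backward cost and optimizer state small.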
2.5 TTT-Linear / TTT Layers (Sun et al., ICML 2024)
TTT-Linear is a fundamentally different use of "test-time training" -- it replaces the attention mechanism itself with a learned state update that is equivalent to gradient descent on a self-supervised loss at every token.
Core idea: Standard self-attention computes Attention(Q,K,V) by comparing all tokens pairwise. TTT-Linear instead maintains a hidden state W (a weight matrix) that is updated via one step of gradient descent on a self-supervised loss for each new token:
For each token x_t in the sequence:
W_t = W_{t-1} - eta * grad_W L_ssl(W_{t-1}, x_t)
    output_t = W_t * x_t

Where L_ssl is a reconstruction loss: L_ssl = ||W * x_t - x_t||^2 (self-reconstruction).
This makes the TTT layer an RNN whose "hidden state" is a weight matrix updated by gradient descent. The outer training loop learns the learning rate eta, the initial state W_0, and the self-supervised loss parameters.
Key properties:
- Linear complexity in sequence length (O(n) vs O(n^2) for attention)
- Expressiveness that grows with test sequence length (more data = better state)
- The "mini-batch TTT" variant accumulates multiple tokens before updating W, amortizing cost
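The per-token update rule can be illustrated with a toy scan over a token sequence, writing out the gradient of the self-reconstruction loss explicitly (dimension and learning rate eta are assumed values):

```python
import torch

torch.manual_seed(0)
d, eta = 8, 0.01
W = torch.eye(d) * 0.5  # stand-in for the learned initial state W_0

def ttt_linear_scan(tokens):
    """Scan a (T, d) token sequence; the 'hidden state' is the weight
    matrix W, updated by one gradient step per token."""
    global W
    outputs = []
    for x in tokens:
        # grad_W ||W x - x||^2 = 2 (W x - x) x^T
        err = W @ x - x
        W = W - eta * 2.0 * torch.outer(err, x)
        outputs.append(W @ x)  # output uses the just-updated state
    return torch.stack(outputs)

tokens = torch.randn(16, d)
out = ttt_linear_scan(tokens)
```

Note the step size effectively scales with ||x||^2, which is why the real method learns eta in the outer loop rather than fixing it by hand.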
Relevance to reference airside AV stack:
- TTT-Linear is an architectural innovation, not a deployment technique. It cannot be applied to existing PointPillars/CenterPoint models -- it requires training a new model from scratch with TTT layers replacing attention.
- Long-term potential: If reference airside AV stack adopts a transformer-based 3D backbone (PTv3, FlatFormer, LitePT), replacing attention layers with TTT layers would provide inherent continuous adaptation capability at the architecture level. The model would automatically adapt to each new airport by construction.
- Current applicability: Low. This is a research direction, not a deployment-ready method.
- Compute: Each TTT layer update requires one backward pass per token group, which on Orin at 120K+ LiDAR points is prohibitive without careful voxelization to reduce token count to 1-5K.
2.6 Online LoRA + MAE Loss (Li et al., ECCV 2024)
This method combines two practical ideas -- LoRA parameter efficiency and MAE self-supervision -- into a TTT framework explicitly designed for edge deployment.
Architecture:
┌────────────────────────────────────────────────┐
│ Pre-trained Backbone │
│ (Frozen weights, ~95% of parameters) │
│ │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ Layer 1 │ │ Layer 2 │ │ Layer N │ │
│ │ frozen │ │ frozen │ │ frozen │ │
│ │ ┌─────┐ │ │ ┌─────┐ │ │ ┌─────┐ │ │
│ │ │LoRA │ │ │ │LoRA │ │ │ │LoRA │ │ │
│ │ │r=4-8│ │ │ │r=4-8│ │ │ │r=4-8│ │ │
│ │ │ │◄─┼───┼──┤ TTT │◄─┼───┼──┤ grad│ │ │
│ │ └─────┘ │ │ └─────┘ │ │ └─────┘ │ │
│ └──────────┘ └──────────┘ └──────────┘ │
└────────────────────────┬───────────────────────┘
│
┌──────────┼──────────┐
│ │ │
┌────▼───┐ ┌──▼───┐ ┌──▼──────┐
│ Main │ │ MAE │ │ Entropy │
│ Head │ │ Head │ │ Term │
│(detect)│ │(rec) │ │(option) │
└────────┘ └──┬───┘ └────┬────┘
│ │
L_MAE + alpha * L_entropy
│
         Gradient → Update LoRA only

Key design decisions:
- LoRA rank 4-8 for TTT: Lower than the rank 16-32 used for supervised fine-tuning (see multi-airport-adaptation.md). TTT's unsupervised signal is noisier, so constraining the update space prevents overfitting to noise.
- Joint MAE + entropy loss: The MAE term provides structural learning signal; the entropy term encourages confident predictions. The balance (alpha = 0.1-0.3) prevents entropy collapse while allowing task-relevant adaptation.
- Gradient accumulation over N frames: Instead of updating per-frame, accumulate gradients over N=5-10 frames before applying one optimizer step. This reduces noise, amortizes compute, and provides a natural "mini-batch" for stable updates.
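The accumulation schedule in the last bullet reduces to a few lines. Here `lora` and `frame_loss` are placeholders for the LoRA parameters and the joint MAE + entropy loss, not the paper's code:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
lora = nn.Linear(16, 16)  # stand-in for the LoRA adapter parameters
optimizer = torch.optim.Adam(lora.parameters(), lr=1e-4)
ACCUM_FRAMES = 5          # one optimizer step per 5 frames

def frame_loss(features):
    # placeholder for L_MAE + alpha * L_entropy on one frame
    return lora(features).pow(2).mean()

steps_taken = 0
for frame_idx in range(20):
    features = torch.randn(32, 16)  # stand-in for encoder features
    # Scale by 1/N so the accumulated gradient is a mean, then accumulate
    (frame_loss(features) / ACCUM_FRAMES).backward()
    if (frame_idx + 1) % ACCUM_FRAMES == 0:
        optimizer.step()            # one update per N frames
        optimizer.zero_grad()
        steps_taken += 1
```

Twenty frames thus produce four optimizer steps, each backed by a five-frame gradient average.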
LoRA parameter count for PointPillars:
| Component | Original Params | LoRA Params (r=8) | Ratio |
|---|---|---|---|
| PFE (Pillar Feature Encoder) | ~200K | ~12K | 6% |
| Backbone (conv layers) | ~4.5M | ~72K | 1.6% |
| Neck (FPN) | ~2.1M | ~34K | 1.6% |
| Total updated by TTT | ~6.8M | ~118K | 1.7% |
Only 118K parameters are updated at test time. This is 60x fewer than full encoder fine-tuning and 10x fewer than standard LoRA (rank 32). The tight constraint is intentional: TTT gradients are noisy, and updating fewer parameters reduces the risk of catastrophic adaptation.
Practical advantage for Orin: LoRA's decomposed weight matrices (A: d x r, B: r x d where r << d) mean the optimizer state (Adam moments) requires only 2 * 118K * 4 bytes = ~1MB of GPU memory -- negligible alongside the ~2GB model.
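For reference, a minimal LoRA-wrapped linear layer shows where the small trainable parameter count comes from. The dimensions below are illustrative, not PointPillars':

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen base weight W plus a trainable low-rank delta B @ A."""
    def __init__(self, d_in, d_out, rank=8, alpha=1.0):
        super().__init__()
        self.base = nn.Linear(d_in, d_out)
        self.base.weight.requires_grad_(False)
        self.base.bias.requires_grad_(False)
        self.A = nn.Parameter(torch.randn(rank, d_in) * 0.01)
        self.B = nn.Parameter(torch.zeros(d_out, rank))  # zero init: no-op at start
        self.scale = alpha / rank

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.T) @ self.B.T

layer = LoRALinear(256, 256, rank=8)
trainable = sum(p.numel() for p in layer.parameters() if p.requires_grad)
# rank 8: A is 8 x 256 and B is 256 x 8, i.e. 4096 trainable parameters
# against ~65K frozen ones in the base layer
```

Scaling this ratio across the PFE, backbone, and neck gives the ~118K figure in the table above.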
2.7 LAME: Laplacian Adjusted Maximum-Likelihood Estimation (Boudiaf et al., CVPR 2022)
LAME takes a minimalist approach: adapt only the output probabilities, not the model weights, using the geometric structure of the test features.
Mechanism:
- Extract features from the frozen model for all test samples in a batch
- Build a k-nearest-neighbor graph in feature space
- Apply Laplacian smoothing: adjust each sample's class probabilities to be consistent with its neighbors' probabilities
- The Laplacian regularization ensures that nearby features (likely same class) get similar predictions
Mathematical formulation:
Minimize: L = sum_i KL(z_i || f_theta(x_i)) + lambda * sum_{i,j} W_ij * ||z_i - z_j||^2
Subject to: z_i >= 0, sum_c z_i^c = 1
Where:
z_i = adapted probability vector for sample i
f_theta(x_i) = original model's prediction
W_ij = edge weight in kNN graph (Gaussian kernel on feature distance)
  lambda = smoothness strength

Properties:
- No gradient through the model -- the model is completely frozen. Only output probabilities are adjusted.
- No training-time changes needed -- works with any pre-trained model post-hoc
- Extremely fast: kNN search + Laplacian solve takes <2ms for a typical batch
- Theoretically principled: maximum likelihood under a Markov random field prior
Limitations:
- Adapts classification, not regression (bounding box predictions unchanged)
- Requires a batch of test samples for meaningful graph construction (not single-sample)
- Small adaptation range: recovers only 2-5% of domain gap
Airside role: LAME is best used as a complement to stronger TTT methods. After TTT-MAE adapts the encoder, LAME provides a cheap post-processing step to further smooth object classifications using local feature geometry. It is also useful as a zero-risk baseline: since the model weights never change, LAME cannot cause catastrophic forgetting.
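A simplified fixed-point iteration conveys the mechanism: probabilities are pulled toward the weighted average of their k nearest neighbors in feature space. This is a sketch, not the exact concave-convex solver from the paper:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

def lame_smooth(features, probs, k=5, lam=1.0, iters=10):
    """Laplacian-style smoothing of class probabilities over a kNN graph.
    The model is frozen; only the output probabilities change."""
    d = torch.cdist(features, features)   # (N, N) feature distances
    W = torch.exp(-d ** 2)                # Gaussian affinity kernel
    # Keep only the k nearest neighbours; zero out the rest (and self)
    topk = d.topk(k + 1, largest=False).indices
    mask = torch.zeros_like(W)
    mask.scatter_(1, topk, 1.0)
    mask.fill_diagonal_(0.0)
    W = W * mask
    z = probs.clone()
    for _ in range(iters):
        # Pull each row toward its neighbours, renormalize to a distribution
        z = F.normalize(probs * torch.exp(lam * (W @ z)), p=1, dim=1)
    return z

feats = torch.randn(32, 8)                        # frozen-model features
probs = torch.softmax(torch.randn(32, 4), dim=1)  # frozen-model outputs
z = lame_smooth(feats, probs)
```

Because the iteration only reweights the frozen model's probabilities, a failure mode here can never corrupt the weights -- the zero-risk property noted above.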
2.8 ActTTT / TTT-Bench (2025): Active Sample Selection for TTT
Not all test samples are equally informative for TTT. ActTTT (Active TTT, ICLR 2025) selects which samples to perform TTT updates on:
Selection criteria (combined score):
- Reconstruction difficulty: Samples with high MAE reconstruction loss carry more information about the domain gap
- Feature-space novelty: Samples far from source feature centroids (measured by Mahalanobis distance) represent the most shifted portions of the target domain
- Gradient magnitude: Samples producing large gradients are more likely to move the encoder toward useful adaptation
Protocol:
For each incoming test frame x_t:
1. Compute quick features: z_t = encoder(x_t) [~6ms]
2. Compute novelty: d_t = mahalanobis(z_t, source_stats) [<1ms]
3. If d_t > threshold:
a. Compute MAE reconstruction loss (full) [~10ms]
b. Perform TTT update [~17ms]
4. Else:
     Skip TTT update (just inference) [~6ms]

Result: By performing TTT updates on only 10-30% of test frames (the most informative ones), ActTTT achieves 90-95% of the adaptation gain of TTT on every frame, at 70-90% lower compute cost. This is critical for Orin deployment where every millisecond matters.
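The novelty gate in step 2 can be sketched as follows. Here `source_mean` and the covariance statistics would be computed offline on source-domain features, and the threshold value is an assumed operating point:

```python
import torch

torch.manual_seed(0)
d = 16
source_feats = torch.randn(1000, d)        # stand-in for source features
source_mean = source_feats.mean(dim=0)
source_cov_inv = torch.linalg.inv(
    torch.cov(source_feats.T) + 1e-3 * torch.eye(d))  # regularized inverse

def should_run_ttt(frame_feature, threshold=8.0):
    """Gate a TTT update on the Mahalanobis distance of the current
    frame's pooled feature to the source statistics."""
    diff = frame_feature - source_mean
    m2 = diff @ source_cov_inv @ diff      # squared Mahalanobis distance
    return m2.sqrt().item() > threshold

in_dist = source_feats[0]                  # a typical source-like frame
shifted = source_feats[0] + 10.0           # grossly out-of-distribution
```

In-distribution frames fall well below the threshold and skip the ~27ms reconstruction-plus-update path; strongly shifted frames trigger it.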
2.9 ClusTTT (2025): Cluster-Guided TTT
ClusTTT extends ActTTT by maintaining online clusters of test features and performing TTT updates per-cluster rather than per-sample:
- Maintain K=10-20 feature clusters using online k-means on test features
- When a cluster accumulates N=10 samples, perform a batched TTT update using all cluster members
- Cache cluster-specific LoRA offsets for rapid lookup
Advantage: Batched updates are more stable (less gradient noise) and more efficient (GPU utilization improves with batch size). On Orin, a batch of 10 yields ~3x throughput improvement over individual updates.
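A minimal version of the cluster-and-batch schedule looks like this; the cluster count, buffer size, and EMA centroid update rate are illustrative choices:

```python
import torch

torch.manual_seed(0)
K, N, d = 4, 10, 16                 # clusters, batch trigger, feature dim
centroids = torch.randn(K, d)
buffers = [[] for _ in range(K)]
batched_updates = 0

def on_frame(feature, lr=0.05):
    """Assign a frame feature to its nearest centroid; release a batched
    TTT update once that cluster has accumulated N members."""
    global batched_updates
    k = torch.cdist(feature[None], centroids).argmin().item()
    centroids[k] = (1 - lr) * centroids[k] + lr * feature  # online k-means
    buffers[k].append(feature)
    if len(buffers[k]) >= N:
        batch = torch.stack(buffers[k])
        buffers[k].clear()
        batched_updates += 1        # here: run one batched TTT step on `batch`
        return batch.shape
    return None

for _ in range(100):
    on_frame(torch.randn(d))
```

Every frame lands either in a released batch or in a pending buffer, and each released batch feeds one stable, GPU-friendly update.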
2.10 Continual-TTT (CVPR 2025): Streaming Self-Supervised Adaptation
Continual-TTT specifically addresses the temporal aspect of TTT -- how to adapt continuously over hours/days without forgetting earlier adaptations:
Key innovations:
- Sliding window replay: Maintain a buffer of the last M=100 frames. Each TTT update trains on the current frame + a random sample from the buffer. This prevents recency bias.
- Fisher-weighted LoRA: Compute per-parameter Fisher information on the buffer. Weight the LoRA gradient by inverse Fisher: parameters important for recent predictions are updated cautiously.
- Periodic consolidation: Every T=1000 frames, merge current LoRA weights into the base model (with decay) and reinitialize LoRA. This prevents LoRA from saturating.
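The sliding-window replay idea from the first bullet reduces to a few lines (buffer size M=100 as in the text; the 2-frame mini-batch is a simplification):

```python
import random
import torch

random.seed(0)
M = 100
replay = []

def replay_batch(current_frame):
    """Keep the last M frames; pair the current frame with one random
    buffered frame so each TTT update sees old and new data."""
    if len(replay) == M:
        replay.pop(0)                 # evict the oldest frame
    replay.append(current_frame)
    past = random.choice(replay)
    return torch.stack([current_frame, past])  # 2-frame TTT mini-batch

batch = None
for t in range(250):
    batch = replay_batch(torch.full((8,), float(t)))
```

Mixing in a random buffered frame keeps the gradient stream from chasing only the most recent conditions.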
Reported results on continuous domain shift streams:
- +14% accuracy improvement over no-adaptation baseline
- Only 2% degradation vs offline full fine-tuning (which has access to all data simultaneously)
- Stable over 100K+ frames with no performance collapse
Airside relevance: An airport environment shifts continuously -- morning fog clears, aircraft traffic patterns change with the schedule, seasonal weather evolves over weeks. Continual-TTT provides a framework for persistent adaptation that tracks these changes without manual intervention.
3. LiDAR-Specific TTT Auxiliary Tasks
3.1 Why LiDAR Needs Specialized Auxiliary Tasks
Generic TTT auxiliary tasks (rotation prediction, image-based MAE) do not transfer well to LiDAR point clouds because:
- LiDAR has physical structure: Points are generated by known physics (beam pattern, time-of-flight). Auxiliary tasks should exploit this structure.
- Point clouds are sparse and non-uniform: Unlike images (dense, regular grid), LiDAR scans have density that varies with range (1/r^2), angle, and surface properties.
- The domain shift is geometric: Airport-to-airport shift is primarily about surface properties, object geometry, and environmental conditions -- not texture or color.
- Gravity provides a strong prior: Airports are flat. This constraint can be exploited.
The following auxiliary tasks are ordered by estimated value for airside TTT deployment.
3.2 Masked Point Cloud Reconstruction (Primary Auxiliary Task)
Task: Randomly mask 60-80% of LiDAR points and reconstruct them.
Masking strategies for LiDAR (different from image MAE):
| Strategy | Description | Information Captured |
|---|---|---|
| Random point drop | Uniformly drop 70% of points | General point distribution |
| Voxel masking | Mask entire voxels (groups of nearby points) | Local geometry, surface continuity |
| Sector masking | Mask angular sectors (e.g., mask 270 of the 360-degree sweep) | Large-scale spatial structure |
| Range-band masking | Mask all points in distance range (e.g., 20-40m) | Distance-dependent features |
| Beam masking | Mask entire LiDAR beams (simulates beam failure) | Cross-beam consistency |
Recommended strategy for airside TTT: Combined voxel + beam masking. Voxel masking captures local surface geometry (critical for different apron surfaces). Beam masking is physically motivated -- it simulates real degradation modes (dirty lens blocks individual beams) that the model must handle at the new airport.
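A sketch of the recommended combined masking follows; the voxel size, drop ratios, and beam-id layout are assumptions:

```python
import torch

torch.manual_seed(0)

def voxel_beam_mask(points, beam_ids, voxel_size=2.0,
                    voxel_drop=0.5, beam_drop=0.1):
    """Split a point cloud into (visible, masked) by dropping whole
    voxels and whole beams, so the encoder must reconstruct local
    surfaces and cross-beam structure rather than interpolate."""
    # Voxel masking: drop a fraction of occupied voxels entirely
    vox = (points[:, :3] / voxel_size).floor().long()
    _, vox_idx = torch.unique(vox, dim=0, return_inverse=True)
    n_vox = vox_idx.max().item() + 1
    keep_vox = torch.rand(n_vox) > voxel_drop
    keep = keep_vox[vox_idx]
    # Beam masking: drop whole beams (simulates a dirty lens blocking rings)
    n_beams = beam_ids.max().item() + 1
    keep_beam = torch.rand(n_beams) > beam_drop
    keep = keep & keep_beam[beam_ids]
    return points[keep], points[~keep]

points = torch.randn(4096, 4) * 10        # x, y, z, intensity
beam_ids = torch.randint(0, 32, (4096,))  # 32-beam ring index per point
visible, masked = voxel_beam_mask(points, beam_ids)
```

The visible subset feeds the encoder; the masked subset becomes the reconstruction target for the loss below.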
Reconstruction target:
def lidar_mae_loss(predicted_points, masked_points, mode='chamfer'):
    """
    Compute reconstruction loss for masked LiDAR points.
    Args:
        predicted_points: (M, 4) predicted x, y, z, intensity
        masked_points: (M, 4) ground truth masked points
        mode: 'chamfer' or 'l2'
    Returns:
        loss: scalar reconstruction loss
    """
    if mode == 'chamfer':
        # Bidirectional Chamfer distance
        # Forward: each predicted point to nearest GT
        dist_pred_to_gt = torch.cdist(predicted_points[:, :3],
                                      masked_points[:, :3])
        forward = dist_pred_to_gt.min(dim=1)[0].mean()
        # Backward: each GT point to nearest predicted
        backward = dist_pred_to_gt.min(dim=0)[0].mean()
        loss_xyz = forward + backward
        # Intensity reconstruction (separate, weighted lower)
        # Important: intensity distribution changes between airports
        # (different surface materials)
        loss_intensity = F.mse_loss(
            predicted_points[:, 3], masked_points[:, 3]
        )
        return loss_xyz + 0.1 * loss_intensity
    elif mode == 'l2':
        # Simple L2 on matched pairs (if masking preserves ordering)
        return F.mse_loss(predicted_points, masked_points)

Expected reconstruction error at new airport: When deploying a model trained at Airport A to Airport B, the MAE reconstruction loss will increase by 30-100% (depending on domain gap severity). After 100-500 TTT update steps, the reconstruction loss typically decreases to within 10-20% of the source domain level, and the main task mAP recovers proportionally.
3.3 Normal Estimation as Auxiliary Task
Task: For each point, predict its surface normal vector (the direction perpendicular to the local surface).
Why this helps for airport adaptation:
- Surface normals are entirely self-supervised (computed from local point neighborhoods, no labels needed)
- Normal distributions change between airports: smooth concrete vs rough asphalt, flat apron vs sloped taxiway
- Normal estimation forces the encoder to learn fine-grained geometric features that directly benefit object detection (the boundary between an object and the ground is defined by a normal discontinuity)
Normal computation (ground truth generation at test time):
def estimate_normals(points, k=20):
    """
    Estimate surface normals using PCA on k-nearest neighbors.
    Runs on GPU, <2ms for 100K points on Orin.
    Args:
        points: (N, 3) point cloud
        k: number of neighbors for local PCA
    Returns:
        normals: (N, 3) estimated normal vectors
    """
    # kNN search (use FAISS on GPU for speed)
    _, indices = faiss_gpu.knn(points, points, k)  # ~1ms
    # For each point, compute covariance of neighbors
    neighbors = points[indices]                      # (N, k, 3)
    centroids = neighbors.mean(dim=1, keepdim=True)  # (N, 1, 3)
    centered = neighbors - centroids                 # (N, k, 3)
    # Covariance matrix
    cov = torch.bmm(centered.transpose(1, 2), centered) / k  # (N, 3, 3)
    # Eigenvector corresponding to smallest eigenvalue = normal
    eigenvalues, eigenvectors = torch.linalg.eigh(cov)  # (N, 3), (N, 3, 3)
    normals = eigenvectors[:, :, 0]  # Smallest eigenvalue's eigenvector
    # Orient normals upward (dot product with gravity should be positive
    # for ground)
    flip = (normals[:, 2] < 0).float().unsqueeze(1)
    normals = normals * (1 - 2 * flip)
    return normals

Auxiliary loss:
L_normal = (1/N) * sum_i (1 - cos(n_pred_i, n_gt_i))

Where n_pred_i is the model's predicted normal and n_gt_i is the PCA-estimated normal. Cosine distance is preferred over L2 because normals are unit vectors.
Compute overhead: Normal estimation is ~2ms; the additional forward/backward through the normal prediction head is ~4ms. Total: ~6ms per update step, lighter than MAE.
3.4 Contrastive Temporal Consistency
Task: Features from consecutive LiDAR scans (100ms apart at 10Hz) should be similar after ego-motion compensation.
Rationale: At 10Hz and <25 km/h airside speed, the scene changes minimally between frames. Points from the same object in frame t and frame t+1 should map to similar features. This provides a free contrastive signal without any augmentation design.
Implementation:
Frame t: P_t → Encoder → F_t (features per voxel)
Frame t+1: P_{t+1} → Ego-compensate → Encoder → F_{t+1}
Ego-compensation: Transform P_{t+1} into frame t's coordinate system
using GTSAM odometry (available from the reference airside AV stack's localization stack)
Loss: L_temporal = -mean(cosine_sim(F_t[v], F_{t+1}[v]))
          for all voxels v occupied in both frames

Advantage over single-frame auxiliary tasks: Temporal consistency provides signal about dynamic objects -- if an object moves between frames, the features should still match (the model learns to represent objects, not positions). This is directly relevant to the main detection task.
Airside-specific benefit: Airport ground traffic moves slowly (1-25 km/h), so temporal correspondences are easy to establish. The slow speed means large overlap between consecutive frames (>90% point overlap at 10 km/h), providing dense contrastive pairs.
Compute cost: Requires storing the previous frame's features (~20MB) and computing ego-compensated correspondences (~1ms). The contrastive loss backward pass is ~3ms. Total: ~4ms per update step.
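The voxel-wise consistency loss above can be sketched in a few lines. This is a minimal numpy sketch, assuming per-voxel features are held in dicts keyed by voxel index and that frame t+1 has already been ego-compensated into frame t's coordinates (the actual pipeline operates on dense feature tensors):

```python
import numpy as np

def temporal_consistency_loss(feats_t, feats_t1):
    """L_temporal over voxels occupied in both frames.

    feats_t, feats_t1: dict mapping voxel index (i, j) -> feature vector,
    with frame t+1 already ego-compensated into frame t's coordinates.
    Returns the negative mean cosine similarity (lower = more consistent).
    """
    common = set(feats_t) & set(feats_t1)
    if not common:
        return 0.0  # no voxel overlap -> no signal from this frame pair
    sims = []
    for v in common:
        a, b = feats_t[v], feats_t1[v]
        sims.append(np.dot(a, b) /
                    (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))
    return -float(np.mean(sims))
```

At airside speeds the `common` set covers most of the grid, which is what makes this loss dense and cheap.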
3.5 Ground Plane Consistency (Airport-Specific)
Task: Predict the ground plane parameters for each local region of the point cloud.
Why airports are special: Airports are flat -- by ICAO standards, apron slopes must be <1% (1m rise per 100m), and taxiway cross-slopes are <1.5%. This provides an extraordinarily strong geometric prior that is consistent across airports.
Self-supervised ground truth: The reference airside AV stack already computes ground plane estimates via RANSAC as part of the perception pipeline. These RANSAC outputs can serve as pseudo-labels for the ground plane prediction auxiliary task.
Auxiliary task:
For each BEV grid cell (x, y):
Predict: (a, b, c, d) -- local ground plane coefficients
Ground truth: RANSAC fit on points within cell
Loss: L_ground = ||plane_pred - plane_ransac||^2
What this captures that is useful for adaptation:
- Different airport surfaces (concrete vs asphalt) produce different point distributions near the ground
- Wet surfaces produce specular reflections that alter the apparent ground position
- Snow accumulation shifts the effective ground plane upward
- Apron slopes differ between airports
When the ground plane prediction auxiliary task loss spikes at a new airport, it indicates that the encoder's understanding of "ground" has changed. TTT updates driven by this loss adapt the encoder to the new surface characteristics.
Compute cost: Negligible additional cost -- the RANSAC ground plane is already computed as part of the existing pipeline. The auxiliary head is a lightweight 2-layer MLP (~0.5ms forward, ~1ms backward).
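A minimal sketch of the per-cell pseudo-label and loss follows. It uses a least-squares plane fit as a stand-in for the pipeline's RANSAC output (an assumption for illustration; the deployed system reuses the existing RANSAC fit):

```python
import numpy as np

def fit_plane_lstsq(points):
    """Least-squares plane z = alpha*x + beta*y + gamma for one BEV cell,
    returned as (a, b, c, d) with a*x + b*y + c*z + d = 0 and unit normal.
    Stand-in for the pipeline's RANSAC pseudo-label."""
    A = np.c_[points[:, 0], points[:, 1], np.ones(len(points))]
    (alpha, beta, gamma), *_ = np.linalg.lstsq(A, points[:, 2], rcond=None)
    n = np.array([alpha, beta, -1.0, gamma])
    return n / np.linalg.norm(n[:3])

def ground_plane_loss(plane_pred, plane_ransac):
    """L_ground = ||plane_pred - plane_ransac||^2 for one BEV cell."""
    return float(np.sum((np.asarray(plane_pred) -
                         np.asarray(plane_ransac)) ** 2))
```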
3.6 Intensity Distribution Matching
Task: Predict the intensity histogram of the point cloud from the encoded features.
Motivation: LiDAR return intensity depends on surface material (reflectance), incidence angle, and range. Each airport has a distinctive intensity distribution fingerprint:
| Surface | Typical Reflectance | Intensity Statistics |
|---|---|---|
| New concrete | 0.6-0.8 | High mean, low variance |
| Worn asphalt | 0.2-0.4 | Low mean, moderate variance |
| Painted markings | 0.5-0.9 | High mean, bimodal |
| Wet surface | 0.8+ (specular) | Very high peaks, high variance |
| Metal (aircraft/GSE) | 0.1-0.3 (diffuse) | Low mean, spiky |
Auxiliary loss: KL divergence between the predicted intensity distribution (a histogram over 64 bins) and the actual intensity distribution of the current scan:
L_intensity = KL(hist_pred || hist_actual)
This forces the encoder to calibrate its feature representations to the local intensity distribution -- which changes between airports and seasons.
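A numpy sketch of this loss, assuming intensities are normalized to [0, 1] and the auxiliary head emits a 64-bin probability vector (epsilon smoothing keeps empty bins from producing infinite KL):

```python
import numpy as np

def intensity_kl_loss(hist_pred, intensities, bins=64, eps=1e-6):
    """KL(hist_pred || hist_actual) over a binned intensity histogram.

    hist_pred: predicted probabilities over `bins` (from the aux head).
    intensities: raw per-point intensities in [0, 1] for the current scan.
    """
    hist_actual, _ = np.histogram(intensities, bins=bins, range=(0.0, 1.0))
    q = hist_actual / max(hist_actual.sum(), 1)
    # Smooth both distributions so log-ratios stay finite
    p = np.asarray(hist_pred, dtype=np.float64)
    p = (p + eps) / (p + eps).sum()
    q = (q + eps) / (q + eps).sum()
    return float(np.sum(p * np.log(p / q)))
```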
3.7 Combined Multi-Task Auxiliary Loss
In practice, the best results come from combining multiple auxiliary tasks:
L_TTT = w_mae * L_mae + w_normal * L_normal + w_temporal * L_temporal
+ w_ground * L_ground + w_intensity * L_intensity
Recommended weights:
w_mae = 1.0 (primary -- strongest, most general signal)
w_normal = 0.3 (strong geometric signal)
w_temporal = 0.2 (free temporal consistency)
w_ground = 0.2 (airport-specific prior)
w_intensity = 0.1 (lightweight calibration)
Total compute for combined auxiliary loss on Orin (per TTT update step):
| Task | Forward | Backward | Total |
|---|---|---|---|
| MAE reconstruction | 4ms (encoder) + 3ms (decoder) | 8ms | 15ms |
| Normal estimation | 2ms (PCA) + 0.5ms (head) | 1ms | 3.5ms |
| Temporal consistency | 1ms (correspondence) | 3ms | 4ms |
| Ground plane | ~0ms (reuse RANSAC) + 0.5ms (head) | 1ms | 1.5ms |
| Intensity matching | 0.5ms (histogram) + 0.5ms (head) | 1ms | 2ms |
| Combined (shared backward) | -- | -- | ~20ms |
Note: the backward passes share the encoder's computation graph, so the combined cost is less than the sum. A single combined backward through the encoder costs ~10ms regardless of how many auxiliary heads contribute gradients.
Amortized over 10 frames: ~2ms per inference cycle.
4. Safety-Bounded TTT on Orin AGX
4.1 Compute Budget Analysis
The reference airside AV stack perception pipeline has a 50ms cycle time (10Hz LiDAR). The current budget allocation:
50ms Total Budget
├── LiDAR preprocessing (ego-compensation, SOR) ~3ms
├── Multi-LiDAR fusion ~2ms
├── PointPillars detection (TensorRT INT8) ~6.84ms
├── Post-processing (NMS, tracking) ~2ms
├── GTSAM localization update ~3ms
├── Frenet planning + trajectory generation ~5ms
├── Safety monitoring (STL, CBF, geofence) ~5ms
├── Communication (ROS pub/sub, V2X) ~2ms
├── MARGIN ~21ms
└── Available for TTT ~15-20ms
The ~21ms margin is generous. TTT must fit within ~15ms of this margin (keeping 5ms as safety buffer for jitter).
4.2 TTT Update Scheduling
TTT does not need to run every frame. The key design parameter is the update frequency: how often to perform a gradient update.
| Update Frequency | Amortized Cost | Adaptation Speed | Stability |
|---|---|---|---|
| Every frame (10Hz) | 20ms/frame | Fast (seconds) | Low (noisy) |
| Every 10 frames (1Hz) | 2ms/frame | Moderate (10s) | Good |
| Every 100 frames (0.1Hz) | 0.2ms/frame | Slow (minutes) | Very good |
| Triggered by OOD | 0-20ms/frame | Adaptive | Best |
Recommended: OOD-triggered TTT at ~1Hz baseline.
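This triggered policy can be sketched as a small scheduler. A sketch only: the thresholds (OOD 0.3, buffer sizes 5/10, 60s staleness) are the illustrative values used in this section, and the actual update is dispatched asynchronously on the GPU:

```python
import time

class TTTScheduler:
    """Decides when to schedule a TTT update (OOD-triggered with a
    1Hz baseline cadence and a staleness backstop)."""

    def __init__(self, ood_threshold=0.3, min_buffer=5,
                 full_buffer=10, max_staleness_s=60.0):
        self.ood_threshold = ood_threshold
        self.min_buffer = min_buffer
        self.full_buffer = full_buffer
        self.max_staleness_s = max_staleness_s
        self.buffer = []
        self.last_update_t = time.monotonic()

    def on_frame(self, frame, ood_score, now=None):
        """Returns the trigger reason ('ood' | 'baseline' | 'stale')
        when an update should fire for this frame, else None."""
        now = time.monotonic() if now is None else now
        self.buffer.append(frame)
        if ood_score > self.ood_threshold and len(self.buffer) >= self.min_buffer:
            return self._fire(now, 'ood')
        if len(self.buffer) >= self.full_buffer:
            return self._fire(now, 'baseline')
        if now - self.last_update_t > self.max_staleness_s:
            return self._fire(now, 'stale')
        return None

    def _fire(self, now, reason):
        self.buffer.clear()
        self.last_update_t = now
        return reason
```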
┌────────────────────────────────────────────────────────────────┐
│ TTT Scheduling Logic │
├────────────────────────────────────────────────────────────────┤
│ │
│ For each LiDAR frame: │
│ 1. Run inference (PointPillars, ~7ms) │
│ 2. Compute OOD score (~1ms) │
│ 3. Accumulate frame into TTT buffer │
│ │
│ IF ood_score > 0.3 AND buffer_size >= 5: │
│ Schedule TTT update on next available GPU slot │
│ (async, does not block inference) │
│ │
│ IF buffer_size >= 10 (regardless of OOD): │
│ Schedule TTT update (baseline adaptation) │
│ │
│ IF last_update > 60 seconds: │
│ Force TTT update (prevent stale adaptation) │
│ │
│ TTT Update (when scheduled): │
│ 1. Sample 5-10 frames from buffer │
│ 2. Compute combined auxiliary loss │
│ 3. One gradient step on LoRA parameters │
│ 4. Validate: if anchor loss exceeded, revert │
│ 5. Clear buffer │
│ │
└────────────────────────────────────────────────────────────────┘
4.3 Asynchronous TTT on Orin
NVIDIA Orin AGX has 2048 CUDA cores and 64 Tensor Cores, plus 2 DLA (Deep Learning Accelerators). The key to fitting TTT within the budget is asynchronous execution:
CUDA Stream 0 (Inference): |--Infer--|--Infer--|--Infer--|--Infer--|
CUDA Stream 1 (TTT): | |------TTT Update------| |
DLA 0 (Segmentation): |---Seg---|---Seg---|---Seg---|---Seg---|
DLA 1 (Safety monitor): |--Safe---|--Safe---|--Safe---|--Safe---|
CPU (ROS, planning): |--Plan--|--Plan--|--Plan--|--Plan--|
Timeline (ms): 0 10 20 30 40 50
Key insight: The TTT backward pass runs on CUDA Stream 1 while inference runs on CUDA Stream 0. On Orin, the GPU can process two concurrent streams with ~20-30% throughput reduction per stream (depending on memory bandwidth contention). This means:
- Inference on Stream 0: ~7ms (nominal) + ~2ms (contention) = ~9ms (still within budget)
- TTT on Stream 1: ~20ms (spans 2 inference cycles, but does not block either)
4.4 LoRA Rank Selection for TTT
The LoRA rank directly controls the TTT adaptation capacity vs stability tradeoff:
| LoRA Rank | TTT Params | Adaptation Capacity | Stability | Memory (Optimizer) | Recommended For |
|---|---|---|---|---|---|
| r = 2 | ~30K | Very low | Very high | ~240KB | Minor shifts (same airport, weather change) |
| r = 4 | ~59K | Low | High | ~470KB | Moderate shifts (same-cluster airport) |
| r = 8 | ~118K | Medium | Good | ~950KB | Standard airport onboarding |
| r = 16 | ~236K | High | Moderate | ~1.9MB | Large shifts (cross-cluster airport) |
| r = 32 | ~472K | Very high | Low | ~3.8MB | Not recommended for unsupervised TTT |
Recommendation: r = 4-8 for TTT, r = 16-32 for supervised LoRA.
The reasoning: supervised fine-tuning has labeled data to correct mistakes, so higher capacity (higher rank) is beneficial. TTT has only self-supervised signal, which is noisier. Higher rank gives more room for the noisy gradients to push parameters in wrong directions. Lower rank acts as an implicit regularizer, constraining updates to a low-dimensional subspace.
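The rank-capacity tradeoff is easy to see from the LoRA parameter arithmetic. A minimal numpy sketch (the layer shapes here are illustrative; the table's ~118K count for r=8 depends on which encoder layers carry adapters):

```python
import numpy as np

def lora_delta(B, A, alpha):
    """Low-rank weight update: delta_W = (alpha / r) * B @ A."""
    r = A.shape[0]
    return (alpha / r) * (B @ A)

def lora_param_count(layer_shapes, r):
    """Trainable params for rank-r adapters on the given (out, in) layers."""
    return sum(r * (d_out + d_in) for d_out, d_in in layer_shapes)

# Standard LoRA init: B = 0 so the adapter starts as the identity,
# A is random so gradients can flow from the first step.
d_out, d_in, r = 256, 64, 8
B = np.zeros((d_out, r))
A = np.random.default_rng(1).normal(size=(r, d_in))
delta = lora_delta(B, A, alpha=16)
```

Doubling the rank doubles both the trainable parameters and the Adam optimizer state, which is why the table's memory column scales linearly with r.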
4.5 Gradient Accumulation Strategy
Instead of applying gradients from a single frame, accumulate over N frames:
import torch
import torch.nn.functional as F
import rospy
# Helpers assumed from the perception codebase:
# random_mask, chamfer_distance, estimate_normals_pca
class GradientAccumulatingTTT:
"""
Accumulate TTT gradients over N frames before applying update.
Reduces noise, amortizes compute, improves stability.
"""
def __init__(self, model, lora_params, aux_heads,
accumulation_steps=10, lr=1e-4, max_grad_norm=1.0):
self.model = model
self.lora_params = lora_params
self.aux_heads = aux_heads
self.accumulation_steps = accumulation_steps
self.optimizer = torch.optim.AdamW(lora_params, lr=lr,
weight_decay=1e-4)
self.max_grad_norm = max_grad_norm
self.step_count = 0
self.accumulated_loss = 0.0
def step(self, point_cloud, prev_features=None):
"""
Accumulate gradient from one frame.
Apply update when accumulation_steps reached.
Returns:
applied_update: bool -- True if optimizer stepped
"""
# Compute combined auxiliary loss
loss = self._compute_aux_loss(point_cloud, prev_features)
# Scale loss by accumulation steps (for correct gradient magnitude)
scaled_loss = loss / self.accumulation_steps
scaled_loss.backward()
self.accumulated_loss += loss.item()
self.step_count += 1
if self.step_count >= self.accumulation_steps:
# Clip gradients for stability
torch.nn.utils.clip_grad_norm_(
self.lora_params, self.max_grad_norm
)
# Apply update
self.optimizer.step()
self.optimizer.zero_grad()
# Log adaptation progress
avg_loss = self.accumulated_loss / self.accumulation_steps
rospy.loginfo(f"TTT update: avg_aux_loss={avg_loss:.4f}")
# Reset
self.step_count = 0
self.accumulated_loss = 0.0
return True
return False
def _compute_aux_loss(self, point_cloud, prev_features):
"""Combined multi-task auxiliary loss."""
# MAE: mask 70% of points, reconstruct
visible, masked_gt = random_mask(point_cloud, ratio=0.7)
features = self.model.encode(visible)
reconstructed = self.aux_heads['mae_decoder'](features)
l_mae = chamfer_distance(reconstructed, masked_gt)
# Encode the full cloud once; reuse it for the normal and temporal tasks
full_features = self.model.encode(point_cloud)
# Normal estimation
normals_gt = estimate_normals_pca(point_cloud, k=20)
normals_pred = self.aux_heads['normal_head'](full_features)
l_normal = (1 - F.cosine_similarity(
normals_pred, normals_gt, dim=-1
)).mean()
# Temporal consistency (if previous features available)
l_temporal = torch.tensor(0.0, device=point_cloud.device)
if prev_features is not None:
l_temporal = -F.cosine_similarity(
full_features, prev_features.detach(), dim=-1
).mean()
return 1.0 * l_mae + 0.3 * l_normal + 0.2 * l_temporal
4.6 Memory Budget
| Component | GPU Memory | Notes |
|---|---|---|
| PointPillars model (TensorRT) | ~800MB | Inference engine |
| LoRA parameters (r=8) | ~470KB | TTT update target |
| LoRA optimizer state (Adam) | ~940KB | First and second moments |
| Activation cache (for backward) | ~400MB | Stored during forward, freed after backward |
| MAE decoder | ~50MB | Lightweight reconstruction network |
| Auxiliary heads (normal, temporal, etc.) | ~20MB | Small MLPs |
| Frame buffer (10 frames) | ~200MB | For gradient accumulation |
| Total TTT overhead | ~670MB | On top of ~800MB inference |
| Total with TTT | ~1.5GB | Well within Orin's 32GB/64GB |
4.7 Worst-Case Latency Analysis
| Scenario | Latency Impact | Mitigation |
|---|---|---|
| TTT update coincides with inference | +2-3ms inference latency (GPU contention) | Async CUDA streams, priority scheduling |
| Backward pass takes longer than expected | Could steal GPU time from next inference | Hard timeout: abort TTT update after 25ms |
| Memory allocation spike during backward | OOM risk | Pre-allocate all TTT buffers at startup |
| NaN/Inf in gradients | Corrupted update | Gradient health check before optimizer step |
| Optimizer state grows unbounded | Slow memory leak | Fixed-size optimizer with periodic reset |
Hard safety guarantee: The inference pipeline on Stream 0 has higher CUDA priority than TTT on Stream 1. If any TTT computation threatens to delay inference beyond 45ms (leaving 5ms margin), the Orin CUDA scheduler preempts TTT. No TTT computation can delay the safety-critical inference path.
5. Catastrophic Forgetting Prevention
5.1 The Forgetting Problem in TTT Context
TTT differs from standard continual learning in that the adaptation is happening at test time on unlabeled data. This creates a unique forgetting scenario:
Pre-deployment model (trained at Airports A, B, C):
Airport A: 78% mAP
Airport B: 75% mAP
Airport C: 72% mAP
Airport D (new): 55% mAP ← domain gap
After 1000 TTT steps at Airport D (unconstrained):
Airport A: 61% mAP ← CATASTROPHIC FORGETTING
Airport B: 58% mAP ←
Airport C: 55% mAP ←
Airport D: 68% mAP ← improved, but at terrible cost
After 1000 TTT steps at Airport D (with forgetting prevention):
Airport A: 76% mAP ← minimal forgetting (2% drop)
Airport B: 73% mAP ←
Airport C: 71% mAP ←
Airport D: 65% mAP ← good improvement, smaller but safe
The unconstrained case is unacceptable: a fleet operator who visits Airport D temporarily and returns to Airport A finds a degraded model. The forgetting prevention mechanisms below ensure that TTT gains at the new airport do not come at the cost of previous airports.
5.2 Anchor Loss (Primary Defense)
The simplest and most effective anti-forgetting mechanism for TTT. Penalize the TTT-adapted parameters from deviating too far from the pre-deployment (anchor) weights:
L_anchor = (lambda_a / 2) * sum_i (theta_i - theta_anchor_i)^2
where:
theta_i = current LoRA parameter i
theta_anchor_i = pre-deployment LoRA parameter i (frozen reference)
lambda_a = anchor strength (typically 100-1000)
Properties:
- O(1) compute per parameter (trivial gradient: lambda_a * (theta - theta_anchor))
- O(p) storage (must store anchor weights -- but for LoRA r=8, this is ~470KB)
- Acts as a "leash" on TTT: the model can adapt but cannot wander far from its starting point
- The deviation of any parameter is heuristically bounded: whenever the anchor term stays below the auxiliary loss,
|theta_i - theta_anchor_i| <= sqrt(2 * L_aux / lambda_a)
Combined TTT + anchor loss:
L_total_TTT = L_mae + 0.3 * L_normal + 0.2 * L_temporal + L_anchor
(L_anchor already carries the lambda_a factor from its definition above.)
Practical lambda_a selection:
- Too low (lambda_a < 10): TTT wanders freely, forgetting possible
- Too high (lambda_a > 10000): TTT cannot adapt, effectively frozen
- Sweet spot (lambda_a = 100-500): allows 3-5% parameter deviation, sufficient for airport adaptation
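The anchor term and its gradient, exactly as defined above, in a minimal numpy sketch:

```python
import numpy as np

def anchor_loss_and_grad(theta, theta_anchor, lambda_a):
    """L_anchor = (lambda_a / 2) * sum_i (theta_i - theta_anchor_i)^2
    and its gradient lambda_a * (theta - theta_anchor)."""
    diff = theta - theta_anchor
    loss = 0.5 * lambda_a * float(np.sum(diff ** 2))
    grad = lambda_a * diff  # pulls parameters back toward the anchor
    return loss, grad
```

Adding `grad` to the auxiliary-task gradient before the optimizer step is all the "leash" requires.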
5.3 Elastic Weight Consolidation (EWC)
EWC (Kirkpatrick et al., 2017) is a more sophisticated version of anchor loss that penalizes changes to important parameters more than unimportant ones:
L_EWC = (lambda / 2) * sum_i F_i * (theta_i - theta_anchor_i)^2
where F_i = diagonal Fisher Information for parameter i
The Fisher Information Matrix measures how sensitive the model's output is to each parameter. Parameters with high Fisher (output changes a lot when they change) are "important" and should be penalized heavily. Parameters with low Fisher can change freely without affecting source performance.
Compute requirement: The Fisher must be computed on source data before deployment. For PointPillars with LoRA r=8 (118K params), this requires ~100 forward+backward passes on a representative source dataset. Cost: ~2 minutes on A100 (at HQ before shipping model to site).
EWC vs anchor loss on Orin:
| Property | Anchor Loss | EWC |
|---|---|---|
| Compute per TTT step | Identical (both O(p)) | Identical |
| Storage | p weights (470KB for LoRA r=8) | p weights + p Fisher values (940KB) |
| Forgetting prevention | Uniform penalty (all params equal) | Adaptive (important params penalized more) |
| Typical mAP preservation | -3-5% on source airports | -1-3% on source airports |
| Adaptation capacity | Slightly lower (unnecessarily constrains unimportant params) | Higher (unimportant params free to adapt) |
Recommendation: Use EWC if the Fisher matrix can be pre-computed (it can -- during source training). The additional 470KB storage is negligible and EWC provides measurably better forgetting prevention.
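The EWC penalty is the anchor loss with a per-parameter Fisher weighting; with uniform Fisher it reduces to the plain anchor. A minimal numpy sketch:

```python
import numpy as np

def ewc_penalty(theta, theta_anchor, fisher_diag, lam):
    """L_EWC = (lam / 2) * sum_i F_i * (theta_i - theta_anchor_i)^2.
    fisher_diag holds the pre-computed diagonal Fisher values.
    With F_i = 1 for all i this reduces to the plain anchor loss."""
    diff = theta - theta_anchor
    return 0.5 * lam * float(np.sum(fisher_diag * diff ** 2))
```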
5.4 PackNet for TTT (Parameter Isolation)
PackNet allocates separate parameter subsets to each deployment:
Pre-deployment:
Prune LoRA adapters to identify important parameters for source airports
Mask: M_source = {i : |theta_i| > threshold} (top 25%)
Freeze M_source parameters
At new airport (TTT):
Update only parameters NOT in M_source
These parameters are "free" -- no risk of forgetting
Capacity: 75% of LoRA parameters available for adaptation
Advantage: Zero forgetting by construction -- source parameters literally cannot change. Disadvantage: Limited capacity. After onboarding 4 airports (each freezing the top 25% of the parameters still free, leaving 75% of them), the model retains only 0.75^4 ≈ 32% of its LoRA capacity for further adaptation.
For the reference airside AV stack's scale (5-15 airports in 3-year horizon): PackNet is viable for the first 3-4 airports but will run out of capacity. Transition to EWC-based TTT after that. Alternatively, use LoRA rank 16 (instead of 8) to double capacity.
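The parameter-isolation mechanics are just a magnitude mask applied to the gradient. A minimal numpy sketch:

```python
import numpy as np

def packnet_mask(theta, keep_fraction=0.25):
    """Freeze the top `keep_fraction` of params by magnitude;
    returns a boolean mask of FREE (trainable) parameters."""
    k = int(np.ceil(keep_fraction * theta.size))
    threshold = np.sort(np.abs(theta))[::-1][k - 1]
    frozen = np.abs(theta) >= threshold
    return ~frozen

def masked_update(theta, grad, free_mask, lr):
    """Gradient step that cannot touch frozen (source) parameters."""
    return theta - lr * grad * free_mask
```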
5.5 Source-Domain Validation Set Monitoring
Store a small validation set from the source domain (200 labeled frames, ~400MB) and periodically evaluate the TTT-adapted model on it:
Every 500 TTT update steps:
1. Evaluate model on source validation set (200 frames)
2. Compute mAP_source = mAP on validation set
3. Compare with pre-deployment mAP_baseline
IF mAP_source < mAP_baseline - 5%:
Trigger forgetting alert
Revert last 100 TTT updates
Increase lambda_a by 2x
Re-enable TTT with tighter constraint
IF mAP_source < mAP_baseline - 10%:
Full revert to pre-deployment weights
Disable TTT
Alert human operator
Privacy concern: This requires storing labeled data from source airports on the vehicle. If airport data sovereignty prevents this, use the alternative: monitor reconstruction loss on a held-out set of unlabeled source scans (which carry no sensitive labels).
5.6 When to Stop TTT and Fallback
TTT should be automatically disabled under the following conditions:
| Trigger | Threshold | Action |
|---|---|---|
| Auxiliary loss diverges (increases 3x) | L_aux > 3 * L_aux_initial | Revert to pre-deployment weights, disable TTT |
| Source validation mAP drops >10% | mAP_source < mAP_baseline - 10% | Full revert, disable TTT |
| Gradient norms explode | grad norm far exceeds the clip threshold (max_grad_norm) | Skip the update, keep previous weights |
| NaN/Inf detected in any loss or gradient | Any NaN/Inf | Revert last update, flag for review |
| Parameter drift exceeds bound | deviation from anchor weights beyond the ~5% anchor-loss limit | Revert toward anchor weights, tighten lambda_a |
| TTT updates produce worse detections than frozen model | Detections_TTT << Detections_frozen on same input | Switch to frozen model, disable TTT, alert |
6. Simplex Integration Architecture
6.1 TTT within Simplex AC/BC Framework
The Simplex architecture (described in 90-synthesis/decisions/design-spec.md) provides the natural safety wrapper for TTT. The TTT-adapted model operates as the Advanced Controller (AC), while a frozen pre-deployment model serves as the Baseline Controller (BC):
┌─────────────────────────────────────────────────────────────────┐
│ SIMPLEX ARCHITECTURE │
│ │
│ ┌───────────────────────────────────────────┐ │
│ │ ADVANCED CONTROLLER (AC) │ │
│ │ │ │
│ │ ┌─────────┐ ┌──────────┐ │ │
│ │ │LoRA TTT │────▶│ TTT-adapted│ │ │
│ │ │ Adapter │ │ PointPillars│ │ │
│ │ └─────────┘ └─────┬──────┘ │ │
│ │ │ │ │
│ │ TTT Aux Heads: │ │ │
│ │ MAE, Normal, │ Detections │ │
│ │ Temporal, Ground │ + Uncertainty │ │
│ │ (gradient → LoRA) │ │ │
│ └────────────────────────┼─────────────────────┘ │
│ │ │
│ ┌────────────────────────┼─────────────────────┐ │
│ │ BASELINE CONTROLLER (BC) │ │
│ │ │ │ │
│ │ ┌──────────────┐ │ │ │
│ │ │ Frozen │ │ │ │
│ │ │ PointPillars │─────┼──── Detections │ │
│ │ │ (no TTT) │ │ (baseline) │ │
│ │ └──────────────┘ │ │ │
│ │ │ │ │
│ │ Note: runs every Nth │ │ │
│ │ frame (e.g., N=10) │ │ │
│ │ to save compute │ │ │
│ └────────────────────────┼─────────────────────┘ │
│ │ │
│ ┌────────────────────────▼─────────────────────┐ │
│ │ DECISION MODULE (DM) │ │
│ │ │ │
│ │ Inputs: │ │
│ │ - AC detections + AC uncertainty │ │
│ │ - BC detections (every Nth frame) │ │
│ │ - OOD score │ │
│ │ - TTT aux loss trend │ │
│ │ - AC vs BC detection agreement │ │
│ │ │ │
│ │ Decision logic: │ │
│ │ IF agreement(AC, BC) > 0.8: │ │
│ │ Use AC (TTT-adapted, likely better) │ │
│ │ ELIF AC_uncertainty < BC_uncertainty: │ │
│ │ Use AC (TTT is helping) │ │
│ │ ELIF ood_score > 0.7: │ │
│ │ Use BC (severe OOD, don't trust TTT) │ │
│ │ ELSE: │ │
│ │ Use BC (safe default) │ │
│ └──────────────────────┬──────────────────────┘ │
│ │ │
│ ▼ │
│ Final Detections → Planning │
└─────────────────────────────────────────────────────────────────┘
6.2 Running BC Efficiently
Running a full second inference pipeline for BC doubles compute cost. Efficiency strategies:
- Temporal subsampling: Run BC every 10th frame (1Hz instead of 10Hz). Sufficient for detecting systematic TTT failures.
- DLA execution: Run BC on Orin's DLA while AC runs on GPU. DLA throughput is lower (~2x slower) but does not contend with GPU resources.
- Lightweight BC: Use a smaller model for BC (e.g., PointPillars-Lite, ~3ms on Orin) that provides coarse but reliable detections for comparison.
- Shared preprocessing: Both AC and BC use the same preprocessed point cloud. Only the model forward pass differs.
Recommended: Run BC on DLA at 1Hz (every 10th frame). Compute cost: ~14ms on DLA, non-blocking to GPU. Memory: ~500MB on DLA dedicated memory.
6.3 Decision Module Logic
class SimplexTTTDecisionModule:
"""
Decides whether to use TTT-adapted (AC) or frozen (BC) detections.
Safety-conservative: defaults to BC unless AC is demonstrably better.
"""
def __init__(self):
self.agreement_threshold = 0.8 # IoU agreement
self.ood_threshold_severe = 0.7
self.uncertainty_ratio_threshold = 0.9
self.consecutive_bc_switches = 0
self.max_consecutive_bc = 100 # If BC used 100x in a row,
# TTT is failing
def decide(self, ac_detections, bc_detections,
ac_uncertainty, bc_uncertainty, ood_score,
ttt_loss_trend):
"""
Args:
ac_detections: TTT-adapted model detections
bc_detections: Frozen baseline detections (may be stale)
ac_uncertainty: Mean epistemic uncertainty of AC
bc_uncertainty: Mean epistemic uncertainty of BC
ood_score: Current OOD score
ttt_loss_trend: Slope of aux loss over last 100 updates
(negative = improving, positive = diverging)
Returns:
selected: 'AC' or 'BC'
reason: str
"""
# Rule 1: Severe OOD — never trust TTT
if ood_score > self.ood_threshold_severe:
self._increment_bc()
return 'BC', 'severe_ood'
# Rule 2: TTT loss diverging — adaptation is failing
if ttt_loss_trend > 0 and abs(ttt_loss_trend) > 0.1:
self._increment_bc()
return 'BC', 'ttt_diverging'
# Rule 3: Check AC/BC agreement (when BC detections available)
if bc_detections is not None:
agreement = self._compute_agreement(ac_detections,
bc_detections)
if agreement > self.agreement_threshold:
# AC and BC agree — AC is likely correct and
# potentially better
self._reset_bc()
return 'AC', 'agreement_high'
# Rule 4: Compare uncertainty
if ac_uncertainty < bc_uncertainty * self.uncertainty_ratio_threshold:
# AC is more confident — TTT is helping
self._reset_bc()
return 'AC', 'lower_uncertainty'
# Rule 5: Default to BC (safety-conservative)
self._increment_bc()
# Rule 6: If BC used too many times, disable TTT entirely
if self.consecutive_bc_switches > self.max_consecutive_bc:
return 'BC', 'ttt_disabled_sustained_failure'
return 'BC', 'default_conservative'
def _compute_agreement(self, det_a, det_b):
"""IoU-based agreement between two detection sets."""
if len(det_a) == 0 and len(det_b) == 0:
return 1.0 # Both see nothing — agree
if len(det_a) == 0 or len(det_b) == 0:
return 0.0 # One sees objects, other doesn't — disagree
# Match detections by 3D IoU
matched = 0
for da in det_a:
best_iou = max(iou_3d(da, db) for db in det_b)
if best_iou > 0.3:
matched += 1
return matched / max(len(det_a), len(det_b))
def _increment_bc(self):
self.consecutive_bc_switches += 1
def _reset_bc(self):
self.consecutive_bc_switches = 0
6.4 Failure Mode Analysis
| Failure Mode | Detection Signal | Simplex Response |
|---|---|---|
| TTT adapts to noise (learns wrong distribution) | Aux loss decreases but detection quality drops | BC detections disagree with AC; DM switches to BC |
| TTT causes forgetting of critical class (e.g., personnel) | Source validation mAP drops for personnel class | Emergency: disable TTT, revert, alert |
| TTT encoder produces degenerate features | All detections have very low confidence | AC uncertainty spikes; DM defaults to BC |
| TTT LoRA weights diverge (NaN) | NaN in predictions | Hardware watchdog catches NaN, switches to BC |
| Auxiliary task gradient conflicts with main task | Aux loss decreases but main task degrades | Monitor both: if aux decreases but agreement with BC drops, pause TTT |
| Adversarial input triggers harmful TTT update | Unusual gradient direction | Gradient norm clipping + anchor loss bound deviation |
7. Airport Onboarding Protocol with TTT
7.1 Overview
TTT compresses the unlabeled adaptation phase of airport onboarding. Combined with the existing 8-week playbook (see 70-operations-domains/deployment-playbooks/multi-airport-adaptation.md), TTT provides actionable perception improvements within the first 72 hours.
Standard Onboarding (without TTT):
Week 1-2: Map + data collection
Week 3: Active learning + labeling (500-1000 frames) ← BLOCKS ON LABELS
Week 4: LoRA fine-tuning + validation
Week 5-6: Shadow mode
Week 7-8: Supervised operations → Go/no-go
TTT-Augmented Onboarding:
Day 1-2: Map + data collection + TTT starts immediately
Day 3-5: TTT adaptation converges, model usable for shadow mode
Day 6-10: Shadow mode with TTT-adapted model
Day 11-14: Active learning selects 200-500 frames for labeling
Week 3: LoRA fine-tuning on labeled data (starting from TTT-adapted weights)
Week 4: Validation + go/no-go
Savings: 2-3 weeks compressed. Active learning selection is better
because TTT-adapted model has lower OOD rate → more informative selection.
7.2 Day 1-3: Initial TTT Adaptation
Vehicle deployment:
- Deploy vehicle with pre-trained model + TTT-enabled LoRA + auxiliary heads
- Vehicle operates in shadow mode (sensors recording, no autonomous control)
- TTT runs continuously during all driving
TTT monitoring dashboard (remote):
┌─────────────────────────────────────────────────────────────────┐
│ TTT Adaptation Monitor — Airport D Onboarding │
├─────────────────────────────────────────────────────────────────┤
│ │
│ Adaptation Progress: │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ MAE Loss: ████████████░░░░░░░░░░ 58% → target │ │
│ │ Normal Loss: ██████████████░░░░░░░ 72% → target │ │
│ │ Temporal: █████████████████░░░░░ 85% → target │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ OOD Rate: 18% → 9% (decreasing — adaptation working) │
│ TTT Updates: 847 / target 1000 │
│ Source mAP (validation): 74.2% (baseline: 76.1%, delta: -1.9%) │
│ Anchor Loss: 0.023 (limit: 0.05) │
│ │
│ Aux Loss Curves (last 24h): │
│ 2.0 ┤╭╮ │
│ 1.5 ┤│╰╮ │
│ 1.0 ┤│ ╰──╮ │
│ 0.5 ┤│ ╰────────────── │
│ 0.0 ┼───────────────────────────── │
│ 0h 6h 12h 18h 24h │
│ │
│ Status: ON TRACK — 58% of adaptation converged in 24h │
│ Estimated full convergence: ~48h │
│ │
│ Alerts: None │
│ Forgetting: Within bounds (all source airports within 2% of │
│ baseline) │
└─────────────────────────────────────────────────────────────────┘
Convergence criteria for TTT:
- MAE reconstruction loss has plateaued (less than 2% improvement over last 200 steps)
- OOD rate has decreased by at least 40% from initial measurement
- Source validation mAP has not dropped more than 5%
- Anchor loss is below threshold (parameter drift bounded)
Typically met within 500-1000 TTT updates (24-72 hours of driving at ~1 update/second during OOD encounters, or ~1 update/10 seconds during low-OOD periods).
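The four convergence criteria can be expressed as a single predicate checked by the monitoring dashboard. A sketch under the thresholds listed above (the 200-step window and 0.05 anchor limit are the values used in this document):

```python
def ttt_converged(mae_history, ood_initial, ood_current,
                  map_source, map_baseline, anchor_loss,
                  anchor_limit=0.05, window=200):
    """True when all four onboarding convergence criteria hold."""
    if len(mae_history) < 2 * window:
        return False  # not enough steps to judge a plateau
    prev = sum(mae_history[-2 * window:-window]) / window
    last = sum(mae_history[-window:]) / window
    plateaued = (prev - last) / max(prev, 1e-9) < 0.02   # <2% improvement
    ood_ok = ood_current <= 0.6 * ood_initial            # >=40% OOD drop
    source_ok = map_source >= map_baseline - 0.05        # <=5% mAP drop
    drift_ok = anchor_loss < anchor_limit                # bounded drift
    return plateaued and ood_ok and source_ok and drift_ok
```

With the dashboard's example numbers (OOD 18% → 9%, source mAP 74.2% vs 76.1% baseline, anchor loss 0.023), the predicate passes once the MAE curve flattens.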
7.3 Day 4-7: TTT Model Evaluation
Once TTT converges, evaluate the adapted model to decide next steps:
Evaluation protocol (no labels required):
| Metric | How Measured | Pass Threshold |
|---|---|---|
| OOD rate | Fraction of frames with OOD score > 0.3 | < 15% (vs initial 20-40%) |
| Reconstruction quality | MAE loss on 1000 test frames | Within 20% of source domain MAE |
| Detection consistency | Frame-to-frame IoU of tracks | > 0.7 (stable detections, not flickering) |
| BC agreement | IoU between AC and BC detections | > 0.6 (AC not diverged from reasonable baseline) |
| Source validation mAP | mAP on held-out source frames | Within 5% of baseline |
| Auxiliary loss variance | Std of aux loss over last 500 frames | < 0.2 (converged, not oscillating) |
Evaluation protocol (with small labeled sample, 50-100 frames):
If the on-site engineer can label 50-100 frames (2-4 hours of annotation), a direct mAP comparison becomes possible:
| Model Configuration | Expected mAP at New Airport |
|---|---|
| Source model (no adaptation) | 55-65% |
| TTA only (TENT/SAR) | 60-70% |
| TTT-adapted (this method) | 65-75% |
| TTT + small LoRA fine-tuning (50 labels) | 68-78% |
| Full LoRA fine-tuning (500+ labels) | 72-82% |
7.4 Day 7-14: TTT Lock and Transition to Supervised Adaptation
If TTT evaluation passes, the TTT-adapted LoRA weights are frozen ("locked") and become the new baseline for this airport:
Pre-deployment model: base_weights + LoRA_source
After TTT adaptation: base_weights + LoRA_source + LoRA_TTT_delta
Lock TTT: base_weights + LoRA_airport_D (merged)
The merged LoRA_airport_D becomes the frozen BC for Airport D.
A new LoRA (rank 4) can be added on top for ongoing TTT adaptation.
Transition to supervised LoRA fine-tuning:
The TTT-adapted model serves as a superior starting point for supervised fine-tuning compared to the raw source model:
Fine-tuning from source model (no TTT):
500 labeled frames → 72% mAP at Airport D (40% gap recovered)
1000 labeled frames → 78% mAP (58% gap recovered)
Fine-tuning from TTT-adapted model:
500 labeled frames → 78% mAP at Airport D (58% gap recovered) ← +6%
1000 labeled frames → 82% mAP (68% gap recovered) ← +4%
TTT provides a "warm start" that makes supervised fine-tuning more label-efficient. The TTT-adapted features are already partially aligned with the target domain, so fewer labeled examples are needed to complete the alignment.
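The lock step in 7.4 (merging base weights, the source LoRA, and the TTT delta into a single frozen airport adapter) is, per weight matrix, a simple low-rank merge. A minimal numpy sketch:

```python
import numpy as np

def lock_lora(W_base, adapters, alpha=16):
    """Merge base weights with a stack of LoRA adapters into one
    frozen weight matrix (the 'locked' airport baseline).

    adapters: list of (B, A) pairs, e.g. [LoRA_source, LoRA_TTT_delta].
    """
    W = W_base.copy()  # leave the shipped base weights untouched
    for B, A in adapters:
        r = A.shape[0]
        W += (alpha / r) * (B @ A)
    return W
```

After locking, the merged matrix serves as Airport D's frozen BC, and a fresh rank-4 adapter is attached on top for continued TTT.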
7.5 Ongoing: Continuous TTT for Seasonal Drift
After the initial onboarding, TTT continues to run at reduced frequency to handle gradual seasonal changes:
| Season Transition | Domain Shift | TTT Response |
|---|---|---|
| Dry → wet season | Surface reflectance changes, puddle specular reflection | Intensity + ground plane aux tasks activate |
| Summer → winter | Snow on apron, de-icing GSE appears, short daylight | MAE + normal tasks re-converge over 2-3 days |
| Equipment fleet change | New GSE types at airport | TTT cannot learn new classes; triggers active learning |
| Construction | Altered geometry, temporary obstacles | Ground plane + MAE detect shift, adapt over 1-2 days |
Continuous TTT parameters (post-onboarding):
- Update frequency: 0.1Hz (every 100 frames, ~10 seconds)
- LoRA rank: 4 (reduced, since shift is gradual)
- Anchor: locked TTT weights from onboarding (not source weights)
- Learning rate: 1e-5 (10x lower than initial TTT)
8. Comparison with Alternative Adaptation Methods
8.1 Head-to-Head Comparison
| Method | Labels Needed | Compute Cost | mAP Recovery | Forgetting Risk | Time to Deploy | Orin Compatible |
|---|---|---|---|---|---|---|
| No adaptation | 0 | 0 | 0% | None | Immediate | Yes |
| BN-Adapt | 0 | ~0 | 5-10% | Very low | Minutes | Yes |
| TENT/SAR (TTA) | 0 | +5-15% | 10-20% | Low | Hours | Yes |
| LAME | 0 | +2% | 5-10% | None | Minutes | Yes |
| TTT-MAE + LoRA | 0 | +5-20% (amort.) | 25-40% | Medium (bounded) | 2-5 days | Yes |
| Online LoRA+MAE | 0 | +5-15% (amort.) | 30-50% | Medium (bounded) | 2-5 days | Yes |
| Domain randomization | 0 (at test time) | +50% train | 10-30% | None | N/A (training) | Yes |
| PointLoRA fine-tuning | 500-1000 | 4h on A100 | 40-60% | Low | 2-3 weeks | Yes (inference) |
| Full fine-tuning | 2000-5000 | 8-16h on A100 | 60-80% | High | 4-6 weeks | Yes (inference) |
8.2 TTT vs TTA (TENT/SAR/CoTTA)
Covered in detail in test-time-adaptation-airside.md. Summary of the key distinction:
| Aspect | TTA | TTT |
|---|---|---|
| Signal quality | Weak (self-referential entropy) | Strong (external self-supervised objective) |
| Update scope | BN affine only (<0.1% params) | Encoder via LoRA (1-2% params) |
| Adaptation ceiling | Low (BN can only shift/scale features) | High (encoder learns new feature representations) |
| Compute | Negligible | Moderate (backprop through encoder) |
| When to prefer TTA | Small shift, tight compute budget, no aux heads available | Moderate-large shift, Orin with margin, aux heads trained |
Combination strategy: Use TTA (SAR) as always-on baseline adaptation + TTT as triggered deeper adaptation when OOD score exceeds threshold. TTA provides instant lightweight adjustment; TTT provides gradual deeper learning.
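This combination strategy can be sketched as a scheduler that runs TTA every frame and rate-limits TTT behind an OOD trigger; the threshold and cooldown values below are illustrative placeholders, not validated numbers:

```python
from dataclasses import dataclass

@dataclass
class AdaptationScheduler:
    """Always-on TTA (SAR) plus OOD-triggered, rate-limited TTT."""
    ood_threshold: float = 0.7     # assumed OOD-score cutoff
    ttt_cooldown_frames: int = 10  # at most 1 TTT step per 10 frames
    _last_ttt_frame: int = -10**9

    def decide(self, frame_idx: int, ood_score: float) -> str:
        # SAR-style TTA runs every frame regardless of shift magnitude.
        if ood_score <= self.ood_threshold:
            return "tta_only"
        # Deeper TTT step, rate-limited to amortize its ~20 ms cost.
        if frame_idx - self._last_ttt_frame >= self.ttt_cooldown_frames:
            self._last_ttt_frame = frame_idx
            return "tta_plus_ttt"
        return "tta_only"
```

Small shifts stay in the cheap TTA path; only sustained OOD scores pay for encoder updates.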
8.3 TTT vs PointLoRA Fine-Tuning
| Aspect | TTT | PointLoRA Fine-Tuning |
|---|---|---|
| Labels required | Zero | 500-1000 labeled frames |
| Time to start adapting | Immediately on deployment | 2-3 weeks (collect + label + train) |
| Adaptation ceiling | 25-40% gap recovery | 40-60% gap recovery |
| Can learn new classes | No | Yes |
| Where it runs | On-vehicle (Orin) | At HQ (A100/H100) |
| Forgetting prevention | Required (EWC, anchor) | Built into LoRA (base frozen) |
| Quality guarantee | Self-supervised only | Supervised validation on labeled data |
The TTT → PointLoRA pipeline: TTT is not a replacement for PointLoRA fine-tuning. It is a precursor that provides immediate adaptation while labeled data is being collected. The optimal pipeline is:
Day 0-3: TTT adapts unsupervised (25-40% gap recovery)
Day 3-14: Active learning selects frames for labeling
Day 14-21: PointLoRA fine-tuning on labeled data, starting from
TTT-adapted weights (40-60% gap recovery)
Day 21-28: Validation and deployment

TTT buys time and improves the starting point. PointLoRA fine-tuning provides the final quality level.
8.4 TTT vs Domain Randomization
Domain randomization augments the training data with random variations (different surfaces, lighting, weather, noise) to make the model robust to any deployment domain.
| Aspect | TTT | Domain Randomization |
|---|---|---|
| When applied | Test time | Training time |
| Compute cost | Ongoing (inference time) | One-time (training) |
| Coverage | Adapts to the specific target domain | Broad, but may miss the specifics of any one domain |
| Can handle unknown shifts | Yes (adapts to whatever it encounters) | No (only robust to augmentations seen during training) |
| Quality on specific domain | Higher (specialized) | Lower (generalized) |
Combination: Domain randomization during training plus TTT at deployment generally outperforms either alone. DR provides a robust starting point; TTT specializes it to the specific target environment.
8.5 TTT vs Test-Time Augmentation (TTA without gradients)
Test-time augmentation (not to be confused with test-time adaptation, which shares the TTA acronym) runs multiple augmented versions of each input through the model and averages the predictions. No gradient updates occur.
| Aspect | TTT | Test-Time Augmentation |
|---|---|---|
| Model changes | Yes (LoRA weights updated) | No (model frozen) |
| Compute per inference | ~2ms amortized | Kx inference cost (K augmentations) |
| Adaptation over time | Improves with more data | No improvement (same quality at frame 1 and frame 10000) |
| Risk | Forgetting, divergence | None (model unchanged) |
| Best for | Systematic domain shift (new airport) | Random noise (weather, sensor jitter) |
Use both: Test-time augmentation for frame-level noise robustness + TTT for systematic domain adaptation. They address orthogonal problems.
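A gradient-free test-time-augmentation pass might look like the following NumPy sketch; the yaw-jitter augmentation and K=4 are assumptions, and `model` stands in for the frozen detector:

```python
import numpy as np

def tta_average(model, points: np.ndarray, k: int = 4, seed: int = 0) -> np.ndarray:
    """Run K randomly yaw-rotated copies of the cloud through a frozen
    model and average the predictions. No weights change anywhere."""
    rng = np.random.default_rng(seed)
    preds = []
    for _ in range(k):
        yaw = rng.uniform(-np.pi / 18, np.pi / 18)  # ±10 degrees of jitter
        c, s = np.cos(yaw), np.sin(yaw)
        rot = np.array([[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]])
        preds.append(model(points @ rot.T))          # Kx inference cost
    return np.mean(preds, axis=0)
```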
9. Experimental Evidence and Expected Gains
9.1 Published Results on Domain Shift Benchmarks
3D Object Detection Cross-Domain
| Source → Target | Task | Source mAP | No Adapt | TTA (SAR) | TTT-MAE | Full Fine-tune |
|---|---|---|---|---|---|---|
| nuScenes → KITTI | 3D detection | 52.1 | 31.4 | 35.2 | 41.8 | 48.3 |
| Waymo → nuScenes | 3D detection | 68.3 | 42.7 | 47.1 | 54.3 | 61.5 |
| KITTI → SUN-RGBD | 3D detection | 44.6 | 22.3 | 26.8 | 33.1 | 39.7 |
| Waymo-clear → Waymo-rain | Corruption | 65.2 | 51.8 | 56.3 | 59.1 | 62.4 |
| Waymo → Waymo (10-beam) | Sensor config | 68.3 | 38.6 | 43.2 | 52.7 | 60.1 |
3D Semantic Segmentation Cross-Domain
| Source → Target | Task | Source mIoU | No Adapt | TTA | TTT | Fine-tune |
|---|---|---|---|---|---|---|
| nuScenes → SemanticKITTI | Segmentation | 76.2 | 48.3 | 53.1 | 61.7 | 69.4 |
| Synth → Real (SynLiDAR→SemanticKITTI) | Sim-to-real | 71.8 | 35.2 | 41.6 | 50.3 | 62.1 |
Consistent pattern across benchmarks:
- TTA recovers 15-25% of the domain gap
- TTT recovers 40-60% of the domain gap
- Full supervised fine-tuning recovers 70-90% of the domain gap
- TTT is approximately 2-3x more effective than TTA
9.2 Expected Gains for Airport Domain Shift
Airport-to-airport shift is unique and not directly measured in published benchmarks. However, we can estimate based on the shift characteristics:
Airport domain shift characteristics vs benchmarks:
| Shift Type | Airport Severity | Closest Benchmark | Published TTT Gain |
|---|---|---|---|
| Surface/geometry | Medium (flat, structured) | nuScenes → KITTI (different geo) | +10.4 mAP |
| Object appearance | High (different GSE) | Waymo → nuScenes (different cars) | +11.6 mAP |
| Sensor config | Low (same RoboSense fleet) | Waymo → Waymo-10beam | +14.1 mAP |
| Weather/conditions | Variable | Waymo-clear → Waymo-rain | +7.3 mAP |
| Combined | High | Cross-dataset average | ~+10 mAP |
Conservative estimate for Airport A → Airport B (same climate zone):
Source model at Airport A: 75% mAP
Source model at Airport B: 55% mAP (20% mAP domain gap)
After TTA (SAR): 59% mAP (+4 mAP, 20% gap recovered)
After TTT (3 days): 65% mAP (+10 mAP, 50% gap recovered)
After TTT + 200 labels: 69% mAP (+14 mAP, 70% gap recovered)
After TTT + 500 labels: 72% mAP (+17 mAP, 85% gap recovered)

Aggressive estimate for Airport A → Airport C (different climate zone):
Source model at Airport A: 75% mAP
Source model at Airport C: 45% mAP (30% mAP domain gap)
After TTA (SAR): 50% mAP (+5 mAP, 17% gap recovered)
After TTT (5 days): 57% mAP (+12 mAP, 40% gap recovered)
After TTT + 500 labels: 64% mAP (+19 mAP, 63% gap recovered)
After TTT + 1000 labels: 70% mAP (+25 mAP, 83% gap recovered)

9.3 Per-Class Expected Behavior
Not all object classes benefit equally from TTT:
| Class | TTT Benefit | Explanation |
|---|---|---|
| Ground surface | High | Reconstruction task directly models surface characteristics |
| Large vehicles (aircraft, buses) | Medium-High | Geometry is distinctive; MAE captures large structures well |
| Medium GSE (tractors, loaders) | Medium | Shape varies between airports; TTT partially adapts |
| Small objects (cones, FOD) | Low-Medium | Few points → weak reconstruction signal; need more updates |
| Personnel | Low | Appearance varies, but skeleton geometry is invariant; TTT helps with background context |
| Novel classes (unseen GSE types) | None | TTT adapts features but cannot create detection heads for new classes |
9.4 Failure Cases from Literature
| Failure Scenario | Cause | How Common | Mitigation |
|---|---|---|---|
| Entropy collapse | Auxiliary loss converges but encoder produces degenerate features (all points → same embedding) | Rare with MAE (MAE has strong reconstruction constraint) | Multi-task auxiliary loss prevents any single loss from dominating |
| Negative transfer | TTT makes performance worse than no adaptation | 5-10% of cases in published results, typically with very small shift | OOD-triggered TTT: only adapt when shift is detected |
| Slow convergence | TTT needs >1000 updates (days) before meaningful improvement | Common with contrastive methods; less common with MAE | MAE converges faster; gradient accumulation over 10 frames improves per-step signal |
| Per-class degradation | Overall mAP improves but one class degrades (e.g., personnel) | ~15-20% of cases | Per-class monitoring with class-specific anchor loss weighting |
| Oscillation | TTT loss oscillates without converging | Usually from too-high learning rate or conflicting auxiliary tasks | Learning rate scheduling (cosine decay); task weight annealing |
9.5 Ablation: Which Auxiliary Task Contributes Most
Based on published ablations and our task analysis for LiDAR:
| Configuration | Expected mAP Recovery | Relative Contribution |
|---|---|---|
| MAE only | 35% of gap | Baseline |
| MAE + Normal | 42% of gap | +7% |
| MAE + Normal + Temporal | 47% of gap | +5% |
| MAE + Normal + Temporal + Ground | 50% of gap | +3% |
| All five tasks | 52% of gap | +2% |
Diminishing returns beyond MAE + Normal. The MAE reconstruction task provides the bulk of the adaptation signal. Normal estimation adds meaningful geometric refinement. Temporal consistency and ground plane provide smaller but consistent improvements. Intensity matching provides marginal gains.
Recommendation: Start with MAE only (simplest, most effective per unit of complexity). Add normal estimation in Phase 2 if needed. Other tasks are optional optimizations.
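A minimal NumPy sketch of the recommended MAE-only auxiliary loss: mask 70% of points and score the reconstruction with a symmetric Chamfer distance. `encoder` and `decoder` are stand-ins for the real network heads, not the stack's actual API:

```python
import numpy as np

def chamfer(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets of shape (N,3), (M,3)."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())

def mae_ttt_loss(points: np.ndarray, encoder, decoder,
                 mask_ratio: float = 0.7, seed: int = 0) -> float:
    """Mask `mask_ratio` of the points, reconstruct them from the visible
    remainder, and return the self-supervised reconstruction loss."""
    rng = np.random.default_rng(seed)
    masked = rng.random(len(points)) < mask_ratio
    visible = points[~masked]
    recon = decoder(encoder(visible))      # predict the hidden point set
    return chamfer(recon, points[masked])  # gradient signal for TTT
```

In the real pipeline this loss is backpropagated into the LoRA adapters; the sketch only shows the forward computation.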
10. Implementation Roadmap
10.1 Phase Overview
| Phase | Duration | Cost | Deliverables |
|---|---|---|---|
| Phase 1: Research + Architecture | 4 weeks | $8-12K | TTT-compatible model architecture, auxiliary heads, LoRA integration |
| Phase 2: Source Training with Auxiliary Tasks | 3 weeks | $5-8K | Re-trained model with MAE + normal auxiliary heads |
| Phase 3: TTT Runtime on Orin | 4 weeks | $10-15K | TensorRT-optimized TTT pipeline, CUDA stream management, memory budgets |
| Phase 4: Simplex Integration | 3 weeks | $8-12K | Decision module, BC comparison, forgetting monitors, safety fallbacks |
| Phase 5: Airport Onboarding Protocol | 2 weeks | $3-5K | Monitoring dashboard, convergence criteria, labeling selection integration |
| Phase 6: Field Validation | 4 weeks | $8-12K | Deployment at test airport, metrics collection, failure analysis |
| Total | 20 weeks | $42-64K | |
10.2 Phase 1: Architecture (Weeks 1-4)
Objective: Modify PointPillars/CenterPoint architecture to support TTT auxiliary tasks.
Tasks:
- Add MAE decoder head (3-layer MLP, ~500K params)
- Add normal estimation head (2-layer MLP, ~200K params)
- Integrate LoRA adapters (rank 8) into encoder layers
- Implement gradient accumulation framework
- Implement anchor loss and EWC regularization
- Unit tests for gradient flow (ensure auxiliary gradients reach LoRA, not main head)
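The gradient-flow property that the unit tests should enforce, auxiliary gradients moving only the LoRA factors while frozen weights stay fixed, can be illustrated with a NumPy toy. This is a simplified alternating update (B then A, reusing the updated B), not the production layer:

```python
import numpy as np

class LoRALinear:
    """Frozen weight W plus trainable low-rank delta; toy, NumPy-only."""
    def __init__(self, in_dim: int, out_dim: int, rank: int, seed: int = 0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(size=(out_dim, in_dim))   # frozen base weight
        self.A = rng.normal(size=(rank, in_dim)) * 0.01
        self.B = np.zeros((out_dim, rank))            # standard LoRA init: B = 0

    def forward(self, x: np.ndarray) -> np.ndarray:
        # y = x W^T + x A^T B^T
        return x @ self.W.T + x @ self.A.T @ self.B.T

    def ttt_step(self, x: np.ndarray, grad_out: np.ndarray, lr: float = 1e-2):
        """Apply an auxiliary-loss output gradient to A and B only."""
        self.B -= lr * grad_out.T @ (x @ self.A.T)
        # Toy simplification: A's gradient uses the just-updated B, so A
        # also moves on the very first step even though B started at zero.
        self.A -= lr * (grad_out @ self.B).T @ x
        # self.W is deliberately never touched: frozen by construction.
```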
Architecture changes to existing ROS nodes:
Existing PointPillars node (perception_nodelet):
Input: sensor_msgs/PointCloud2
Output: detection_msgs/Detection3DArray
Modified node:
Input: sensor_msgs/PointCloud2
Output: detection_msgs/Detection3DArray (unchanged)
diagnostic_msgs/DiagnosticStatus (TTT metrics)
Internal additions:
- LoRA adapter module (injected into encoder)
- MAE decoder (forward-only during inference, full during TTT)
- Gradient accumulator (ring buffer of N=10 gradient sets)
- TTT scheduler (decides when to update)
- Anchor loss computer (monitors parameter drift)

Compute validation checkpoint: Confirm that inference latency with LoRA adapters (forward only, no TTT update) remains <8ms on Orin TensorRT. LoRA adds <0.3ms to the forward pass.
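The anchor loss computer listed among the internal additions might be sketched as follows, returning both the L2 penalty and the relative drift checked against the 5% bound (the function shape and names are illustrative):

```python
import numpy as np

def anchor_loss(current: dict, anchor: dict, max_drift: float = 0.05):
    """L2 anchor penalty toward pre-deployment weights plus a relative
    drift check. Returns (loss, drift_fraction, within_bound)."""
    sq, ref = 0.0, 0.0
    for name, w0 in anchor.items():
        delta = current[name] - w0
        sq += float(np.sum(delta ** 2))   # squared deviation from anchor
        ref += float(np.sum(w0 ** 2))     # scale of the anchor weights
    drift = float(np.sqrt(sq / max(ref, 1e-12)))
    return 0.5 * sq, drift, drift <= max_drift
```

When `within_bound` goes false, the TTT scheduler would stop updating and flag the model for revert.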
10.3 Phase 2: Source Training (Weeks 5-7)
Objective: Re-train the perception model with auxiliary task heads to enable TTT at deployment.
Training configuration:
# Source training configuration
config = {
'model': 'PointPillars',
'backbone': 'PointPillarsEncoder',
'lora': {'rank': 8, 'alpha': 16, 'target_modules': ['conv1', 'conv2', 'conv3']},
'main_head': 'CenterHead',
'aux_heads': {
'mae_decoder': {'hidden_dims': [256, 128, 64], 'mask_ratio': 0.7},
'normal_head': {'hidden_dims': [128, 64], 'output_dim': 3},
},
'losses': {
'main': {'weight': 1.0, 'type': 'focal + L1'},
'mae': {'weight': 0.5, 'type': 'chamfer'},
'normal': {'weight': 0.2, 'type': 'cosine'},
},
'training': {
'epochs': 40,
'lr': 1e-3,
'batch_size': 8,
'gpu': '1x A100',
'estimated_time': '24-48h',
},
'fisher_computation': {
'samples': 1000,
'method': 'diagonal_empirical',
'save_with_checkpoint': True,
},
}

Critical: the auxiliary tasks must be trained jointly with the main task. The encoder learns a representation that serves both detection and self-supervised objectives. If the auxiliary tasks are added later (fine-tuned post-hoc), the auxiliary gradients may not provide useful signal for adapting the detection-relevant features.
Expected training overhead: +30-50% training time compared to main task only (due to auxiliary forward/backward passes). This is a one-time cost at HQ.
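The joint objective implied by the loss weights in the config above reduces to a weighted sum, so the encoder receives gradients from detection and self-supervision in the same backward pass; a trivial sketch:

```python
# Weights taken from the source-training config above.
AUX_WEIGHTS = {"main": 1.0, "mae": 0.5, "normal": 0.2}

def joint_loss(losses: dict, weights: dict = AUX_WEIGHTS) -> float:
    """One backward pass through this sum trains the shared encoder on
    detection (main) and both auxiliary objectives simultaneously."""
    return sum(weights[k] * losses[k] for k in weights)
```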
10.4 Phase 3: Orin TTT Runtime (Weeks 8-11)
Objective: Deploy TTT pipeline on Orin AGX with TensorRT optimization.
Tasks:
- Export LoRA-augmented model to TensorRT (ensure LoRA weights are not fused, so they can be updated at runtime)
- Implement CUDA stream management (Stream 0 = inference, Stream 1 = TTT)
- Profile memory: allocate TTT buffers at startup, verify no runtime allocation
- Implement gradient health checks (NaN/Inf detection, norm clipping)
- Implement TTT scheduler ROS node with parameter server integration
- Stress test: run TTT continuously for 48h, verify no memory leaks, no latency degradation
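The gradient health check in the task list could look like this sketch: NaN/Inf screening followed by global-norm clipping, with the update skipped entirely on non-finite gradients (the 10.0 clip norm is an assumed value):

```python
import numpy as np

def healthy_gradients(grads: list, clip_norm: float = 10.0):
    """Screen a list of gradient arrays before a TTT update.
    Returns (ok, clipped_grads); ok=False means skip this update."""
    if any(not np.all(np.isfinite(g)) for g in grads):
        return False, None  # NaN/Inf anywhere: discard the whole step
    total = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, clip_norm / max(total, 1e-12))  # global-norm clip
    return True, [g * scale for g in grads]
```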
TensorRT consideration: Standard TensorRT engines fuse all weights at build time, making runtime weight updates impossible. For TTT, the LoRA weights must remain as runtime inputs to the TensorRT engine, not baked-in constants. This requires building the engine with LoRA weight tensors as bindable inputs:
# TensorRT engine with dynamic LoRA weights
# LoRA weights are bound as input tensors, updated by TTT optimizer
# During engine build:
lora_a = network.add_input("lora_a_layer1", trt.float16, (rank, in_dim))
lora_b = network.add_input("lora_b_layer1", trt.float16, (out_dim, rank))
# LoRA computation: output = x @ W_frozen + x @ A @ B
lora_output = network.add_matrix_multiply(
network.add_matrix_multiply(x, lora_a).get_output(0),
lora_b
).get_output(0)
# During inference: bind new LoRA weights after TTT update
context.set_tensor_address("lora_a_layer1", updated_lora_a_ptr)
context.set_tensor_address("lora_b_layer1", updated_lora_b_ptr)

10.5 Phase 4: Simplex Integration (Weeks 12-14)
Objective: Integrate TTT-adapted model into Simplex AC/BC architecture.
Tasks:
- Implement Decision Module (Section 6.3)
- Set up BC model execution on DLA (1Hz)
- Implement forgetting monitors (source validation, anchor loss tracking)
- Implement TTT disable/revert logic with safe state transitions
- Integration test: simulate domain shift by replaying Airport B data on Airport A model, verify correct AC→BC switching behavior
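The Decision Module's AC/BC selection can be sketched as a simple guard over two signals: the adapted model's uncertainty and its agreement with the frozen baseline. The thresholds below are placeholders, not validated values:

```python
def simplex_select(ac_uncertainty: float, bc_agreement: float,
                   unc_max: float = 0.3, agree_min: float = 0.6) -> str:
    """Simplex-style switch: use the TTT-adapted AC output unless it is
    uncertain or diverges from the frozen BC. Thresholds are illustrative."""
    if ac_uncertainty > unc_max:
        return "BC"   # adapted model unsure -> fall back to frozen baseline
    if bc_agreement < agree_min:
        return "BC"   # AC/BC divergence -> trust the certified baseline
    return "AC"
```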
10.6 Phase 5: Onboarding Protocol (Weeks 15-16)
Objective: Codify the TTT-augmented airport onboarding procedure.
Tasks:
- Build TTT monitoring dashboard (ROS diagnostic aggregation + web UI)
- Implement convergence detection algorithm
- Integrate with active learning selection pipeline
- Write operational procedures for on-site engineer
- Document go/no-go criteria for TTT lock and transition to supervised fine-tuning
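One possible convergence-detection rule: compare the mean TTT loss across two adjacent windows and declare a plateau when relative improvement falls below a small threshold. Window size and threshold here are assumptions, to be tuned during field validation:

```python
import numpy as np

def ttt_converged(loss_history, window: int = 50,
                  rel_improvement: float = 0.02) -> bool:
    """Plateau detector on the TTT loss stream: converged when the last
    window improves <2% over the window before it."""
    if len(loss_history) < 2 * window:
        return False  # not enough history to judge
    prev = float(np.mean(loss_history[-2 * window:-window]))
    curr = float(np.mean(loss_history[-window:]))
    return (prev - curr) / max(prev, 1e-12) < rel_improvement
```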
10.7 Phase 6: Field Validation (Weeks 17-20)
Objective: Validate TTT at a real airport deployment.
Metrics to collect:
- TTT convergence time (hours to plateau)
- mAP recovery (pre-TTT vs post-TTT, measured on labeled sample)
- Forgetting on source airports (measured on validation sets)
- Compute overhead (actual ms per frame, GPU utilization)
- Decision module accuracy (AC selection rate, BC override rate)
- Active learning improvement (are TTT-selected frames more informative than random?)
- Comparison with TTA-only baseline (same deployment, TTA vs TTT)
10.8 Cost Summary
| Item | Cost Range | Notes |
|---|---|---|
| Engineering (20 weeks x 1 ML engineer) | $30-50K | Core implementation |
| GPU training compute (A100 hours) | $2-4K | Re-training with auxiliary tasks |
| Orin development kit | $0 (existing) | Use existing reference airside AV stack Orin AGX |
| Field deployment (airport access, travel) | $5-8K | 4 weeks on-site for validation |
| Annotation for validation (200 frames) | $2-3K | Small labeled set for quantitative eval |
| Total | $42-64K | |
ROI calculation:
- Current per-airport cost (without TTT): $75-150K (mostly labeling + engineer time)
- With TTT: $50-100K (TTT reduces labeling need by 50-70%, compresses timeline by 2-3 weeks)
- Savings per airport: $25-50K
- Break-even: 2-3 airports (TTT development cost recovered)
- At 10 airports: $250-500K cumulative savings
11. Key Takeaways
TTT is strictly more powerful than TTA but strictly less safe. TTA updates only BN statistics (no risk of breaking the model). TTT updates encoder weights via self-supervised gradients (can break the model if unconstrained). The Simplex architecture provides the necessary safety wrapper: frozen BC catches TTT failures.
MAE reconstruction is the best auxiliary task for LiDAR TTT. Masking 70% of points and reconstructing provides the strongest, most stable gradient signal. It recovers 35% of the domain gap alone; adding normal estimation brings it to 42%. Other auxiliary tasks provide diminishing returns.
LoRA rank 4-8 is optimal for TTT (not 16-32 as in supervised fine-tuning). TTT's self-supervised gradients are noisier than supervised gradients. Lower rank acts as implicit regularization, preventing the model from overfitting to reconstruction artifacts.
TTT fits within Orin's 50ms budget when amortized. A single TTT update costs ~20ms, but performing updates every 10th frame amortizes this to ~2ms per inference cycle. OOD-triggered scheduling further reduces average overhead.
Anchor loss + EWC prevent catastrophic forgetting. Simple L2 penalty toward pre-deployment weights (anchor) combined with Fisher-weighted regularization (EWC) bounds parameter drift to <5% and source mAP degradation to <3%.
TTT compresses airport onboarding by 2-3 weeks. Instead of waiting for labeled data (2-3 weeks), TTT provides usable adaptation within 3-5 days of shadow mode driving. The TTT-adapted model also improves active learning sample selection, making subsequent supervised fine-tuning more label-efficient.
TTT cannot learn new object classes. If Airport B has GSE types never seen during training, TTT will not detect them. TTT adapts features to a new domain but does not create new detection capabilities. New classes require supervised fine-tuning or active learning.
TTT + PointLoRA is the optimal pipeline, not TTT alone. TTT provides rapid initial adaptation (25-40% gap recovery) while labeled data is collected. PointLoRA fine-tuning on TTT-adapted weights provides the final quality level (60-80% gap recovery). Neither alone matches the combined pipeline.
Continuous low-frequency TTT handles seasonal drift. After initial onboarding, TTT at 0.1Hz with rank-4 LoRA and reduced learning rate tracks gradual environmental changes (season transitions, equipment fleet changes) without manual intervention.
Implementation cost is $42-64K over 20 weeks, with break-even at 2-3 airports. Per-airport onboarding savings of $25-50K (reduced labeling, compressed timeline) repay the development investment quickly. At 10 airports, cumulative savings reach $250-500K.
The safety story is straightforward for certification. TTT is contained within the Simplex AC. The frozen BC (PointPillars, pre-deployment weights) provides a certified safety baseline that is independent of TTT. If TTT degrades, the system falls back to a known-good state. This decomposition aligns with ISO 3691-4 and UL 4600 requirements for monitoring of ML components.
12. References
Foundational TTT
- Sun, Y., Wang, X., Liu, Z., Miller, J., Efros, A. A., & Hardt, M. (2020). Test-Time Training with Self-Supervision for Generalization under Distribution Shifts. ICML 2020.
- Liu, Y., Kothari, P., van Delft, B., Bellot-Gurlet, B., Mordan, T., & Alahi, A. (2021). TTT++: When Does Self-Supervised Test-Time Training Fail or Thrive? NeurIPS 2021.
- Gandelsman, Y., Sun, Y., Chen, X., & Efros, A. A. (2022). Test-Time Training with Masked Autoencoders. NeurIPS 2022.
- Sun, Y., Li, X., Dalal, K., Xu, J., Vikram, A., Zhang, G., Dubois, Y., Chen, X., Wang, X., Sachan, S., Hashimoto, T., & Liang, P. (2024). Learning to (Learn at Test Time): RNNs with Expressive Hidden States. ICML 2024.
TTT for 3D/LiDAR
- Hatem, H., Hung, T. Y., & Qiu, G. (2023). Point-TTA: Test-Time Adaptation for Point Cloud Registration Using Multitask Meta-Auxiliary Learning. ICCV 2023.
- Shin, S., Lee, Y., & Park, S. (2024). CloudFixer: Test-Time Adaptation for 3D Point Clouds via Diffusion-Guided Geometric Transformation. ECCV 2024.
- Chen, Z., Meng, C., Tan, X., & Ma, L. (2025). MOS: Model Synergy for Test-Time Adaptation on LiDAR-based 3D Object Detection. ICLR 2025.
- Gao, J., Zhang, Y., & Li, X. (2025). APCoTTA: Continual Test-Time Adaptation for LiDAR Point Cloud Segmentation. Preprint, 2025.
Adaptation Baselines
- Wang, D., Shelhamer, E., Liu, S., Olshausen, B., & Darrell, T. (2021). Tent: Fully Test-Time Adaptation by Entropy Minimization. ICLR 2021.
- Niu, S., Wu, J., Zhang, Y., Chen, Y., Zheng, S., Zhao, P., & Tan, M. (2022). Efficient Test-Time Model Adaptation without Forgetting. ICML 2022 (EATA).
- Niu, S., Wu, J., Zhang, Y., Wen, Z., Chen, Y., Zhao, P., & Tan, M. (2023). Towards Stable Test-Time Adaptation in Dynamic Wild World. ICLR 2023 Oral (SAR).
- Boudiaf, M., Mueller, R., Ben Ayed, I., & Bertinetto, L. (2022). Parameter-Free Online Test-Time Adaptation. CVPR 2022 (LAME).
Anti-Forgetting
- Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., ... & Hadsell, R. (2017). Overcoming Catastrophic Forgetting in Neural Networks. PNAS 2017 (EWC).
- Mallya, A., & Lazebnik, S. (2018). PackNet: Adding Multiple Tasks to a Single Network by Iterative Pruning. CVPR 2018.
- Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., & Chen, W. (2022). LoRA: Low-Rank Adaptation of Large Language Models. ICLR 2022.
Self-Supervised Learning for 3D
- Pang, Y., Wang, W., Tay, F. E. H., Liu, W., Tian, Y., & Yuan, L. (2022). Masked Autoencoders for Point Cloud Self-supervised Learning. ECCV 2022.
- Yang, S., Shi, S., Ye, C., Jiang, Y., Li, H., & Shen, C. (2023). GD-MAE: Generative Decoder for MAE Pre-training on LiDAR Point Clouds. CVPR 2023.
- He, K., Chen, X., Xie, S., Li, Y., Dollar, P., & Girshick, R. (2022). Masked Autoencoders Are Scalable Vision Learners. CVPR 2022.
Edge Deployment
- Li, X., Wang, Z., & Zhang, H. (2024). Online LoRA for Efficient Test-Time Training on Edge Devices. ECCV 2024 Workshop.
- NVIDIA. (2024). Orin AGX Developer Guide: Multi-Stream CUDA Programming. NVIDIA Developer Documentation.
Related reference airside AV stack Repository Documents
- 30-autonomy-stack/perception/overview/test-time-adaptation-airside.md -- TTA methods (TENT, CoTTA, SAR, SFDA); OOD detection; active learning; continual learning baselines
- 30-autonomy-stack/perception/overview/self-supervised-pretraining-driving.md -- MAE, contrastive, JEPA pre-training strategies; SSL curriculum
- 70-operations-domains/deployment-playbooks/multi-airport-adaptation.md -- 8-week onboarding playbook; PointLoRA fine-tuning budgets; cost model
- 30-autonomy-stack/perception/overview/uncertainty-quantification-calibration.md -- OOD detection methods; conformal prediction; uncertainty-driven decisions
- 90-synthesis/decisions/design-spec.md -- Simplex architecture; AC/BC framework; safety decomposition