Continual Learning, Lifelong Learning, and Online Adaptation for Autonomous Vehicle Fleets
Date: 2026-03-22
Scope: Comprehensive survey of continual/lifelong learning techniques applicable to autonomous vehicle fleets, with emphasis on multi-site deployment (e.g., airport airside operations), fleet-scale data pipelines, and safety-critical adaptation.
Table of Contents
- Catastrophic Forgetting and Mitigations
- Online Adaptation from Fleet Data
- Active Learning for Driving
- Data Flywheel Architecture
- Distribution Shift Detection
- Multi-Airport / Multi-Domain Adaptation
- Curriculum Learning
- Model Versioning and Rollback for Fleets
- Federated Learning for Privacy-Preserving Fleet Learning
- World Model Online Adaptation
- Synthesis: Architecture for Continual-Learning AV Fleets
- Key Takeaways for Airport Airside Deployment
- References
1. Catastrophic Forgetting and Mitigations
1.1 The Problem
Catastrophic forgetting occurs when a neural network trained sequentially on new tasks or data distributions loses previously learned knowledge. This is the central obstacle to continual learning. For an AV fleet, this manifests when a perception model retrained on data from a new airport (or new weather conditions, new vehicle types on the ramp) degrades performance on previously mastered scenarios.
The problem is rooted in the stability-plasticity dilemma: a network must be plastic enough to learn new information but stable enough to retain old knowledge. Standard SGD-based training offers no mechanism to balance these competing demands.
1.2 Regularization-Based Methods
Elastic Weight Consolidation (EWC)
EWC adds a quadratic penalty to the loss function that discourages large changes to parameters important for previous tasks:
L_EWC(theta) = L_new(theta) + (lambda/2) * sum_i F_i * (theta_i - theta_i*)^2

where F_i is the diagonal of the Fisher Information Matrix (FIM) computed on the previous task's data, and theta_i* are the optimal parameters for the previous task. Parameters with high Fisher information (i.e., those carrying significant information about past tasks) receive strong regularization, while less important parameters are free to adapt.
Strengths: Simple to implement, low memory overhead (stores one set of previous parameters + Fisher diagonal per task).
Limitations: Recent research (2025) has identified that when the network achieves high confidence in accurate predictions, EWC produces a low-magnitude FIM due to vanishing gradients, causing it to struggle to retain crucial parameter information. This leads to poor continual learning performance in practice, especially for well-trained models. Additionally, the diagonal FIM approximation loses cross-parameter interaction information.
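The two ingredients EWC needs -- a diagonal Fisher estimate from per-sample gradients, and the quadratic anchor added to the new-task loss -- are compact enough to sketch in numpy. Function names and the choice to pass precomputed per-sample gradients are illustrative, not from any particular framework:

```python
import numpy as np

def fisher_diagonal(per_sample_grads):
    """Diagonal Fisher approximation: mean of squared per-sample gradients."""
    g = np.asarray(per_sample_grads)          # shape (n_samples, n_params)
    return (g ** 2).mean(axis=0)

def ewc_loss(new_task_loss, theta, theta_star, fisher, lam=100.0):
    """L_new + (lambda/2) * sum_i F_i * (theta_i - theta_i*)^2"""
    theta = np.asarray(theta)
    theta_star = np.asarray(theta_star)
    penalty = 0.5 * lam * np.sum(fisher * (theta - theta_star) ** 2)
    return new_task_loss + penalty
```

In practice theta_star and the Fisher diagonal are snapshotted once per task, then the penalty is added to every training step on the next task.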
Application to AV: Researchers have applied EWC to car-following models in autonomous driving, using Waymo and Lyft datasets divided into speed-stratified tasks. CL-EWC achieved 0% collision rates across all traffic conditions (vs. 0.19-5.51% for baselines) and reduced spacing MSE from 136.77 to 5.65 on the final task. This demonstrates EWC's viability for safety-critical driving behaviors.
Memory Aware Synapses (MAS)
MAS dynamically adjusts weight importance using gradient magnitudes rather than Fisher information:
Omega_i = (1/N) * sum_k ||g_i(x_k)||
L_MAS(theta) = L_new(theta) + lambda * sum_i Omega_i * (theta_i - theta_i*)^2

MAS showed comparable safety performance to EWC (0% collisions) with slightly better adaptability within similar scenarios, exhibiting "more dynamic behavior" while maintaining safety margins.
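A matching sketch for MAS, again in illustrative numpy. The caller is assumed to supply per-sample gradients of the squared output norm, which is where MAS differs from EWC's Fisher estimate:

```python
import numpy as np

def mas_importance(output_norm_grads):
    """Omega_i = (1/N) * sum_k |g_i(x_k)|, where g is the gradient of the
    squared L2 norm of the model's output -- no labels required."""
    g = np.asarray(output_norm_grads)         # shape (n_samples, n_params)
    return np.abs(g).mean(axis=0)

def mas_penalty(theta, theta_star, omega, lam=1.0):
    """lambda * sum_i Omega_i * (theta_i - theta_i*)^2"""
    diff = np.asarray(theta) - np.asarray(theta_star)
    return lam * np.sum(omega * diff ** 2)
```

Because Omega is computed from unlabeled outputs, MAS importance can be refreshed on deployment data without annotation.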
Synaptic Intelligence (SI)
SI computes an online importance measure for each parameter by accumulating the contribution of each weight to the change in loss during training. Unlike EWC, which computes importance post-hoc, SI tracks importance during training, making it more computationally efficient for online/streaming settings.
1.3 Architecture-Based Methods
Progressive Neural Networks
Progressive networks instantiate a new neural network "column" for each task, with lateral connections to features from previously learned columns. Previous columns are frozen, making catastrophic forgetting impossible by construction.
Architecture: Each new task adds a column with lateral adapters (alpha_k) connecting to hidden representations of all prior columns. This enables forward transfer while completely preventing backward interference.
Limitations: Parameter count grows linearly with the number of tasks. Analysis reveals that only a fraction of new capacity is actually utilized, suggesting significant redundancy. For a fleet deploying across many airports, the growing model size becomes a practical concern for edge deployment on vehicle hardware.
PackNet
PackNet addresses progressive networks' scalability issues through network pruning and parameter reuse:
- Train the full unallocated portion of the network on the new task (with prior-task subnets frozen)
- Prune, keeping only the top-k% of those weights by absolute value, and briefly retrain them; the kept weights become the new task's dedicated subnet
- The pruned (freed) capacity becomes available for future tasks
PackNet can be seen as "packing" multiple tasks into a single network by giving each task a dedicated subnet. The approach maintains strong performance on earlier tasks (their weights are frozen) while making efficient use of network capacity.
Relevance to AV: PackNet is attractive for edge deployment where model size is constrained. A single network can serve multiple operational domains (airports) with dedicated subnets, avoiding the need to ship multiple models to each vehicle.
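The core bookkeeping in PackNet is a pair of boolean masks per layer: which weights belong to the new task, and which remain free. A minimal single-layer numpy sketch (simplified: the post-prune retraining step is omitted, and names are illustrative):

```python
import numpy as np

def packnet_prune(weights, free_mask, keep_frac=0.5):
    """Reserve the top keep_frac of currently free weights (by magnitude)
    for the new task; return (task_mask, remaining_free_mask)."""
    competing = np.abs(weights) * free_mask   # only unallocated weights compete
    n_free = int(free_mask.sum())
    n_keep = max(1, int(keep_frac * n_free))
    flat = competing.ravel()
    keep_idx = np.argsort(flat)[-n_keep:]     # indices of the largest free weights
    task_mask = np.zeros_like(flat, dtype=bool)
    task_mask[keep_idx] = True
    task_mask = task_mask.reshape(weights.shape)
    new_free = free_mask & ~task_mask         # capacity left for future tasks
    return task_mask, new_free
```

At inference time, serving task t means applying the union of task t's mask and all earlier masks it was trained on top of.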
AdaHAT (2024)
Adaptive Hard Attention to the Task extends architecture-based approaches by allowing controlled updates to previously trained subnets. The intensity of updates is determined by heuristic indicators (parameter importance, current capacity usage), enabling the network to manage its capacity adaptively rather than rigidly partitioning it.
1.4 Replay-Based Methods
Experience Replay (ER)
ER maintains a memory buffer of representative samples from previous tasks and interleaves them with new-task data during training. Reservoir sampling provides a principled way to maintain a fixed-size buffer from a data stream of unknown length.
Replay-based methods dominate the continual learning literature, appearing in 62 of 81 approaches surveyed in a 2025 systematic review of online continual learning. Recent work (2025) demonstrates that experience replay not only prevents forgetting but also addresses the loss of plasticity problem in continual learning -- the tendency of neural networks to lose their ability to learn over time.
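Reservoir sampling is simple enough to show in full. This illustrative buffer retains each stream item with equal probability regardless of how long the stream runs:

```python
import random

class ReservoirBuffer:
    """Fixed-size replay buffer fed from a stream of unknown length.
    After n items, every item has probability capacity/n of being stored."""
    def __init__(self, capacity, seed=0):
        self.capacity = capacity
        self.items = []
        self.seen = 0
        self.rng = random.Random(seed)

    def add(self, item):
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            j = self.rng.randrange(self.seen)   # uniform in [0, seen)
            if j < self.capacity:
                self.items[j] = item            # evict a random resident

    def sample(self, k):
        """Draw a replay minibatch to interleave with new-task data."""
        return self.rng.sample(self.items, min(k, len(self.items)))
```

During incremental training, each new-task batch is concatenated with a `sample()` from the buffer so gradients reflect both distributions.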
Generative Replay
Instead of storing raw samples, generative replay trains a generative model (GAN, VAE, or diffusion model) to synthesize pseudo-samples from previous distributions. This reduces memory requirements and avoids data retention concerns.
Brain-inspired generative replay, drawing on neuroscience research about memory consolidation during sleep, has shown comparable effectiveness to veridical (exact) replay while being more memory-efficient. Recent work on Brain Generative Replay uses EEG signal representations to generate task-relevant samples.
Adaptive Memory Replay (CVPR 2024 Workshop)
This approach dynamically adjusts the replay strategy based on task characteristics and model state, using entropy-based selection to avoid static buffers and reduce overfitting while maintaining diversity.
Practical considerations for AV fleets:
- Raw replay requires storing driving clips -- privacy and storage concerns at fleet scale
- Generative replay avoids raw data retention but adds model complexity
- Coreset selection (choosing maximally informative exemplars) reduces buffer size while maintaining representativeness
- For airport airside, replay buffers could store domain-specific edge cases (near-misses, unusual vehicle configurations, weather events)
1.5 Knowledge Distillation for Continual Learning
Knowledge distillation transfers information from a previous model ("teacher") to the updated model ("student") by adding a loss term that encourages the student's outputs to match the teacher's on new data. This is a "data-focused regularization" approach that does not require storing raw data.
In autonomous driving, distillation serves a dual role:
- Continual learning: Preserving old-task knowledge in the updated model
- Model compression: Cloud-trained large models (e.g., 32B parameters) are distilled into smaller models (e.g., 4B parameters) suitable for vehicle-edge deployment
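The distillation term is typically a temperature-softened KL divergence between teacher and student outputs. A numpy sketch under that standard formulation (the T^2 scaling follows the usual convention; names are illustrative):

```python
import numpy as np

def softmax(z, T=1.0):
    """Temperature-softened softmax; higher T flattens the distribution."""
    z = np.asarray(z, dtype=float) / T
    z = z - z.max(axis=-1, keepdims=True)     # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def distillation_loss(student_logits, teacher_logits, T=2.0):
    """KL(teacher || student) on softened outputs, scaled by T^2 so its
    gradient magnitude stays comparable across temperatures."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    kl = np.sum(p * (np.log(p + 1e-12) - np.log(q + 1e-12)), axis=-1)
    return (T ** 2) * kl.mean()
```

For continual learning, `teacher_logits` come from the frozen previous model evaluated on the new data; for compression, from the large cloud model.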
1.6 Summary Comparison
| Method | Forgetting Prevention | Forward Transfer | Memory Cost | Compute Cost | Scalability |
|---|---|---|---|---|---|
| EWC | Moderate | None | O(params) per task | Low | Good |
| MAS | Moderate | None | O(params) per task | Low | Good |
| Progressive Nets | Perfect | Strong (lateral) | O(params) per task | Growing | Poor |
| PackNet | Strong | Moderate | Subnet masks | Low | Moderate |
| Experience Replay | Strong | Moderate | O(buffer size) | Moderate | Good |
| Generative Replay | Moderate | Moderate | Generator model | High | Good |
| Knowledge Distillation | Moderate | Good | Teacher model | Moderate | Good |
2. Online Adaptation from Fleet Data
2.1 The Challenge
Autonomous vehicle fleets operate in environments that change continuously: new construction, seasonal weather patterns, evolving traffic rules, fleet composition changes, surface degradation. Static models trained on historical data inevitably encounter distribution shift. Online adaptation enables models to improve continuously from the data stream generated by deployed vehicles.
2.2 Approaches to Fleet-Scale Online Learning
Incremental/Continual Training
Rather than periodic full retraining, models are incrementally updated on new data batches collected from the fleet. This requires the catastrophic forgetting mitigations described in Section 1. The practical pipeline is:
- Fleet vehicles collect data during operation (with trigger-based selection -- see Section 4)
- Selected data is uploaded to centralized training infrastructure
- Models are incrementally fine-tuned using EWC/replay/distillation
- Updated models are validated (simulation + shadow deployment)
- Models are deployed via OTA updates
Online Parameter Adaptation
Some parameters can be adapted on-device in real-time without centralized retraining. comma.ai's openpilot demonstrates this with lagd, a daemon that learns each car's lateral time delay online -- a parameter that "varies across car models and even between individual cars of the same model." This per-vehicle calibration happens at the edge, personalizing the driving policy without fleet-wide retraining.
Few-Shot Domain Adaptation
When deploying to a genuinely new environment (new airport), few-shot adaptation techniques enable rapid specialization from limited local data. Low-rank adaptation (LoRA) applied to pretrained models allows efficient domain-specific tuning with minimal new parameters.
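LoRA's core idea fits in one line: the frozen base weight plus a trainable low-rank product. An illustrative numpy sketch (shapes and the `alpha` scaling name follow common usage; this is a schematic, not a specific library's API):

```python
import numpy as np

def lora_forward(x, W, A, B, alpha=1.0):
    """y = x @ (W + alpha * A @ B).
    W is the frozen base weight (d_in, d_out); only the low-rank factors
    A (d_in, r) and B (r, d_out), with r << d_in, are trained per domain."""
    return x @ (W + alpha * (A @ B))
```

A fleet could ship one base model plus a small (A, B) pair per airport, swapping adapters instead of whole checkpoints.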
2.3 Semi-Supervised Online Continual Learning
A 2024 study on semi-supervised online continual learning for 3D object detection in mobile robotics demonstrated that models can learn new distributions from streaming unlabeled LiDAR point clouds, performing 3D object detection as each scan arrives. This is directly applicable to AV fleets where labeling every frame is infeasible.
2.4 Practical Constraints
- Bandwidth: Uploading raw driving data from hundreds of vehicles is expensive; trigger-based collection and on-device preprocessing are essential
- Staleness: Models shipped to vehicles lag behind the current data distribution; the update cadence must balance freshness against validation rigor
- Heterogeneity: Different vehicles in the fleet may encounter different distributions (different airports, different shifts), requiring the aggregation strategies discussed in Section 9
3. Active Learning for Driving
3.1 Motivation
Autonomous driving datasets exhibit extreme long-tail distributions: the vast majority of collected frames show mundane driving (straight roads, normal conditions), while safety-critical scenarios are rare. Active learning selects the most informative samples for labeling, dramatically reducing annotation cost while improving model performance on tail events.
3.2 ActiveAD: Planning-Oriented Active Learning (2024)
ActiveAD is a framework specifically designed for end-to-end autonomous driving that addresses both the cold-start problem and ongoing sample selection:
Cold-Start Phase (Diversity-Driven):
- Uses diversity metrics to select an initial batch that covers the scenario space broadly
- Addresses the bootstrap problem where no model exists yet to compute uncertainty
Iterative Phase (Uncertainty-Driven):
- Computes uncertainty metrics specific to route planning (not just perception)
- Selects samples where the model is least confident about planning decisions
- Focuses on safety-critical scenarios that are underrepresented
Key Result: ActiveAD matches state-of-the-art end-to-end AD methods while using only 30% of the nuScenes dataset, a roughly 70% reduction in labeling cost.
3.3 Uncertainty-Based Selection
Several uncertainty estimation approaches are used for active sample selection in driving:
- Softmax entropy: High entropy in perception model outputs indicates uncertain predictions
- Monte Carlo dropout: Running multiple forward passes with dropout to estimate predictive variance
- Ensemble disagreement: Training multiple models and flagging samples where predictions diverge
- Epistemic vs. aleatoric separation: Distinguishing model uncertainty (reducible by more data) from inherent noise (irreducible) -- critical because high aleatoric uncertainty does not indicate a useful training sample
Limitation: Predictive entropy does not distinguish between epistemic and aleatoric uncertainty, making high entropy ambiguous -- it may indicate either a genuinely novel scenario or simply a difficult (but well-represented) one.
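With an ensemble, the epistemic/aleatoric split can be approximated by the standard mutual-information decomposition: total predictive entropy minus the mean per-member entropy. A numpy sketch (function names illustrative):

```python
import numpy as np

def entropy(p, eps=1e-12):
    p = np.asarray(p, dtype=float)
    return -np.sum(p * np.log(p + eps), axis=-1)

def decompose_uncertainty(ensemble_probs):
    """ensemble_probs: (n_members, n_classes) predicted distributions.
    Returns (total, aleatoric, epistemic):
      total      = entropy of the averaged prediction,
      aleatoric  = mean entropy of each member,
      epistemic  = their difference (mutual information / disagreement)."""
    p = np.asarray(ensemble_probs, dtype=float)
    total = entropy(p.mean(axis=0))
    aleatoric = entropy(p).mean()
    return total, aleatoric, total - aleatoric
```

Samples with high epistemic uncertainty (members confident but disagreeing) are the valuable ones for labeling; high aleatoric uncertainty alone is not.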
3.4 Diversity-Based Selection
Uncertainty alone leads to redundant selections (many similar hard cases). Diversity criteria ensure broad coverage:
- Coreset methods: Select samples that are maximally different from existing labeled data in feature space
- Scenario clustering: Group driving situations by semantic type and ensure coverage across clusters
- Geographic/temporal stratification: Ensure representation across locations, times of day, weather conditions
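The coreset idea is commonly implemented as greedy k-center selection: repeatedly pick the unlabeled point farthest (in feature space) from everything selected so far. A small numpy sketch, brute-force and illustrative:

```python
import numpy as np

def kcenter_greedy(features, labeled_idx, budget):
    """Greedy k-center selection: each pick maximizes the minimum distance
    to the already-selected set, spreading labels over the feature space."""
    X = np.asarray(features, dtype=float)
    selected = list(labeled_idx)
    # distance of every point to its nearest already-selected point
    d = np.min(np.linalg.norm(X[:, None] - X[selected][None], axis=-1), axis=1)
    picks = []
    for _ in range(budget):
        i = int(np.argmax(d))                 # farthest-from-coverage point
        picks.append(i)
        d = np.minimum(d, np.linalg.norm(X - X[i], axis=-1))
    return picks
```

Production systems would run this on learned embeddings with approximate nearest-neighbor search rather than the O(n^2) distances shown here.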
3.5 Tesla's Trigger-Based Active Learning
Tesla implements a production-scale active learning system through trigger classifiers:
- Multiple lightweight classifiers run on top of the shared perception backbone
- Each classifier targets a specific scenario type (shopping carts, tunnel exits, unusual road markings)
- Context-aware activation: a tunnel-exit detector activates only when map data indicates a tunnel approach
- When confidence exceeds a threshold, flagged data is uploaded for human review and labeling
- Hundreds of triggers run simultaneously on each vehicle with minimal compute overhead
This is effectively active learning at fleet scale, where the "query strategy" is a set of learned classifiers that identify underrepresented or failure-prone scenarios in the wild.
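Mechanically, a trigger bank can be as simple as cheap heads over a shared embedding plus context gates. The sketch below is schematic (linear heads, a dict of triggers, and the gate functions are assumptions for illustration, not Tesla's implementation):

```python
import numpy as np

def run_triggers(embedding, triggers, context):
    """Evaluate lightweight trigger heads on a shared backbone embedding.
    triggers: {name: (weights, bias, threshold, gate_fn)}.
    Returns names of triggers whose context-gated score crosses threshold."""
    flagged = []
    for name, (w, b, threshold, gate) in triggers.items():
        if not gate(context):                 # context-aware activation
            continue
        score = 1.0 / (1.0 + np.exp(-(embedding @ w + b)))  # sigmoid head
        if score > threshold:
            flagged.append(name)              # clip would be queued for upload
    return flagged
```

Because each head is a tiny classifier on features the perception stack already computes, hundreds can run per frame at negligible cost.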
3.6 Implications for Airport Airside
Airport airside environments have their own long-tail distribution: unusual aircraft types, ground support equipment configurations, construction zones, wildlife incursions, adverse weather. An active learning pipeline should:
- Deploy trigger classifiers for known rare events (e.g., wide-body pushback conflicts, jet blast scenarios)
- Use uncertainty-based selection for genuinely novel situations
- Maintain diversity across airports, times of day, and operational conditions
- Prioritize safety-critical scenarios (near-misses, hard braking events, manual takeovers)
4. Data Flywheel Architecture
4.1 Concept
The data flywheel is a self-reinforcing cycle where deployed vehicles generate data that improves models, which improves driving, which generates more diverse data (vehicles can handle more scenarios), attracting more users/deployments, generating yet more data. The key architectural components are: collection, selection, labeling, training, validation, and deployment.
4.2 Tesla's Data Engine
Tesla operates the most widely documented data flywheel in autonomous driving:
Collection Infrastructure:
- 5M+ vehicles act as sensor platforms, collecting video and telemetry
- Shadow mode: the FSD stack runs in the background comparing its intended actions against human driver actions, flagging disagreements
- Approximately 1.5 million miles of driving data captured daily under shadow mode
- 4.25 billion annual FSD Supervised miles (2025), up from 6 million in 2021
Trigger-Based Selection:
- Lightweight trigger classifiers identify edge cases and upload flagged clips
- GPS-correlated labeling: tag an intersection once, auto-label thousands of future drives through the same location across all conditions
- Long-tail scenario mining: search the fleet's database for scenarios similar to detected failures
Auto-Labeling Pipeline:
- AI-driven auto-labeling for the majority of training data
- 3D auto-labeling using multi-trip aggregated sensor data
- Human review for edge cases and quality assurance
- Datasets on the order of 1.5 petabytes from approximately 1 million clips
Training Infrastructure:
- Cortex supercomputer at Giga Texas (tens of thousands of NVIDIA H100 GPUs)
- Cortex 2 build-out underway (2026)
- R&D spend >$5B (2024), CapEx >$11B expected (2025)
Deployment:
- OTA updates push improved models to the entire fleet
- 1.1 million active FSD users globally (end of 2025)
- FSD v12+ uses end-to-end neural networks replacing 300,000 lines of rule-based code
4.3 Waymo's Approach
Waymo operates a different flywheel with a smaller but more controlled fleet:
Collection: 1,500+ vehicles across 10 US metro areas, 14 million trips completed in 2025 (3x vs. 2024), nearly 200 million fully autonomous miles logged to date.
Adaptation Process: For each new city, Waymo:
- Compares driving performance against a proven baseline
- Identifies unique local characteristics
- Refines the Waymo Driver's AI for local nuances
- Validates through real-world driving + advanced simulation
- Deploys through regular software releases
Key Insight: Waymo reports that local nuances "are becoming fewer with every city," suggesting their models are converging toward generalizable driving competence across domains.
International Adaptation: For London, Waymo employs manual driving for months to "learn the nuances, learn about the zebra crossings" before deploying autonomous operation -- demonstrating that domain adaptation still requires significant site-specific data collection for substantially different traffic environments.
4.4 comma.ai's Open Approach
comma.ai operates a data flywheel with openpilot:
Fleet Scale: 325+ supported car models, 20,000+ users, 100M+ miles driven.
Data Collection:
- By default, openpilot uploads driving data to comma's servers
- "Firehose Mode" (v0.9.8, Feb 2025) maximizes training data upload rate
- commaCarSegments v2: 3,000 hours of CAN data from the fleet
Training Architecture (v0.10):
- Entirely new end-to-end training architecture
- World model ("Tomb Raider") used for simulation-based training
- Lossy image compression during training enables ML simulator use
- Distributed, asynchronous rollout data collection (similar to IMPALA/GORILA)
Online Adaptation:
- lagd daemon learns per-vehicle lateral time delay online
- Demonstrates that some adaptation can happen at the edge, per-vehicle
4.5 Closed-Loop Pipeline Architecture
The canonical data-centric closed-loop pipeline (based on NVIDIA MagLev and Tesla's architectures):
[Fleet Vehicles] --> [Trigger-Based Collection] --> [Data Lake]
        ^                                               |
        |                                               v
   [OTA Update]                              [Data Selection/Mining]
        ^                                               |
        |                                               v
[Safety Validation]                              [Auto-Labeling]
        ^                                               |
        |                                               v
[Simulation Testing] <------- [Model Training] <--------+

Critical bottleneck: Addressing the long-tail distribution requires approximately 1,000 billion driving miles of data. Strategic data selection (active learning) is essential because raw collection at this scale is infeasible.
5. Distribution Shift Detection
5.1 Types of Distribution Shift
| Shift Type | Description | AV Example |
|---|---|---|
| Covariate shift | Input distribution changes; P(Y\|X) stays the same | A new airport looks different, but object classes and correct behavior are unchanged |
| Prior probability shift | Class frequencies change | Airport B has more wide-body aircraft than Airport A |
| Concept drift | P(Y\|X) changes over time | A revised airside traffic rule changes the correct action in a familiar situation |
| Dataset shift | Combination of the above | Seasonal changes affect everything simultaneously |
5.2 Out-of-Distribution (OOD) Detection
OOD detection identifies inputs that fall outside the training distribution, triggering safety interventions before the model makes unreliable predictions.
Primary Methods:
Outlier Exposure (OE): Train with proxy OOD data (e.g., from COCO/ADE20K) pasted into driving scenes. Methods like RbA and Maximized Entropy use this approach. Limitation: reduces in-distribution performance by 1-2.5% mIoU and risks overfitting to seen OOD examples.
Uncertainty Estimation: Use prediction confidence metrics (softmax entropy, maximum logit scores) to flag uncertain regions. Critical limitation: does not distinguish between epistemic uncertainty (model does not know) and aleatoric uncertainty (inherently ambiguous).
Generative/Reconstruction-Based: Compare input images against model reconstructions; large reconstruction errors indicate OOD. Limitation: high inference time makes real-time deployment challenging.
Mask2Former-Based (2024-2025): Five of the top seven OOD segmentation benchmark performers leverage Mask2Former's mask-level architecture (RbA, UNO, EAM), but these are computationally heavy (~200M parameters).
Benchmarks:
- SegmentMeIfYouCan Obstacle Track (SMIYC-OT): 327 test images, 388 OOD instances, 31 obstacle categories
- LostAndFound-NoKnown (L&F): 1,043 test images, 1,709 OOD instances
Evaluation gap: Current benchmarks use threshold-free metrics (AUPRC) that do not directly translate to deployment, where threshold selection requires calibration data typically unavailable in production.
5.3 Concept Drift Detection
Concept drift -- where the relationship between inputs and correct outputs changes over time -- is particularly relevant for AV fleets operating across evolving environments.
Detection Approaches:
- Statistical tests: Monitor performance metrics over sliding windows; detect statistically significant degradation (e.g., Page-Hinkley test, ADWIN)
- Deep learning-based: Autoencoder reconstruction error as a proxy for drift; novelty-aware concept drift detection combines novelty detection with drift detection to distinguish genuinely new concepts from gradual distribution changes
- Image stream drift detection: An underexplored area due to the high-dimensional, unstructured nature of visual data; most drift detection methods were designed for tabular/time-series data
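The Page-Hinkley test is compact enough to show in full. This sketch monitors a streaming scalar metric (e.g., a per-frame error proxy) for a sustained upward shift; the parameter defaults are illustrative:

```python
class PageHinkley:
    """Page-Hinkley test for detecting an upward drift in a streaming metric.
    delta: tolerated change magnitude; threshold: alarm level (lambda)."""
    def __init__(self, delta=0.005, threshold=50.0):
        self.delta = delta
        self.threshold = threshold
        self.mean = 0.0
        self.n = 0
        self.cum = 0.0        # cumulative deviation from the running mean
        self.cum_min = 0.0    # minimum of the cumulative deviation so far

    def update(self, x):
        """Feed one observation; returns True when drift is detected."""
        self.n += 1
        self.mean += (x - self.mean) / self.n
        self.cum += x - self.mean - self.delta
        self.cum_min = min(self.cum_min, self.cum)
        return (self.cum - self.cum_min) > self.threshold
```

A symmetric instance on the negated metric catches downward shifts; ADWIN plays a similar role with an adaptive window instead of fixed parameters.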
Practical deployment monitoring:
- Track prediction confidence distributions over time; systematic drops indicate drift
- Monitor disagreement between redundant perception channels (camera vs. LiDAR)
- Compare real-world behavior against simulation predictions
- Spatio-temporal OOD detection considers both spatial anomalies within frames and temporal anomalies across sequences
5.4 Response to Detected Shift
When distribution shift or OOD inputs are detected, the system should:
- Immediate: Flag for increased caution / reduced speed / human takeover
- Short-term: Upload flagged scenarios for analysis
- Medium-term: Trigger targeted data collection and model retraining
- Long-term: Update the training distribution to include the new domain
6. Multi-Airport / Multi-Domain Adaptation
6.1 The Problem
Train a perception/planning stack at Airport A, deploy it at Airport B without forgetting how to operate at Airport A. This is domain-incremental learning: the task remains the same (navigate safely on the airside) but the domain changes (different airport layouts, markings, vehicle types, traffic patterns, weather).
6.2 Domain-Incremental Learning (DIL)
DIL is a specific continual learning setting where:
- The task structure remains constant (same classes, same objectives)
- The input distribution changes (new visual appearance, new spatial configurations)
- The model must perform well on all encountered domains simultaneously
Research on domain-incremental object detection for autonomous driving has demonstrated approaches that:
- Avoid catastrophic forgetting on previous domains
- Mitigate performance degradation in source domains
- Improve detection accuracy on new target domains
6.3 Brain-Inspired Domain-Incremental Adaptive Detection
A notable approach uses a two-stage framework:
Recall Stage: When encountering a new domain, the model recalls knowledge from the most similar previously learned domain (determined by a "domain tree" that measures inter-domain divergence).
Adapt Stage: The model adapts to the new domain using:
- Domain-Mix: combines pseudo-labels from previous domains with ground-truth source labels
- Patch-based adversarial learning for fine-grained spatial alignment
- Multi-level alignment (pixel-level and instance-level) to maintain discriminability
This approach successfully transferred knowledge across virtual-to-real and across weather conditions while maintaining performance on previously learned domains.
6.4 Unsupervised Domain Adaptation (UDA)
When labeled data is unavailable at the target domain (common when deploying to a new airport before operations begin), UDA techniques enable adaptation using only unlabeled data:
- Feature alignment: Align the feature distributions of source and target domains using adversarial training or moment matching
- Self-training: Generate pseudo-labels on target data using the source-trained model, then retrain
- Style transfer: Transform source images to look like target domain images, enabling training with source labels
Research gap: Only 19% of 2020-2024 publications focus on multi-domain adaptation for weather-robust perception, and only 8% integrate heterogeneous sensors, indicating significant room for advancement.
6.5 Multi-Airport Deployment Strategy
A practical strategy combining these techniques:
- Base Training: Train on Airport A with full supervision (labeled data)
- Pre-Deployment Survey: Collect unlabeled data at Airport B (manual/remote driving)
- Domain Gap Assessment: Measure distribution divergence between A and B
- Adaptation:
- If gap is small: fine-tune with EWC/MAS regularization using limited B labels
- If gap is moderate: apply UDA with self-training on unlabeled B data
- If gap is large: collect labeled data at B; use replay buffer to prevent A forgetting
- Validation: Test on held-out data from both A and B before deployment
- Monitoring: Deploy drift detection to catch remaining distribution gaps
- Iteration: Feed operational data from B back into the training pipeline
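The domain gap assessment in step 3 can start as simply as a linear-kernel MMD between feature embeddings collected at the two airports, with thresholds routing to the adaptation options in step 4. The threshold values below are illustrative placeholders, not calibrated figures:

```python
import numpy as np

def mmd_linear(feats_a, feats_b):
    """Linear-kernel MMD: squared distance between mean feature embeddings
    of two domains. Zero when the domains match in feature-space mean."""
    mu_a = np.asarray(feats_a, dtype=float).mean(axis=0)
    mu_b = np.asarray(feats_b, dtype=float).mean(axis=0)
    diff = mu_a - mu_b
    return float(diff @ diff)

def adaptation_strategy(gap, small=0.1, large=1.0):
    """Map the measured gap onto the tiered strategy described above."""
    if gap < small:
        return "fine-tune with EWC/MAS regularization on limited target labels"
    if gap < large:
        return "UDA with self-training on unlabeled target data"
    return "collect labeled target data and use a replay buffer for the source"
```

Richer estimates (RBF-kernel MMD, classifier two-sample tests) follow the same pattern: embed both domains, compute a divergence, threshold it.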
Waymo's empirical finding -- that local nuances decrease with each new city -- suggests that after sufficient domain diversity in training, the marginal adaptation cost for each new site decreases. This is encouraging for multi-airport deployment.
7. Curriculum Learning
7.1 Concept
Curriculum learning organizes training data from simple to complex, mimicking how humans learn. For autonomous driving, this means starting with easy scenarios (straight driving, clear weather, no traffic) and progressively introducing harder ones (dense traffic, adverse weather, complex intersections).
7.2 Benefits for AV Training
- Faster convergence: Models learn basic skills quickly, then build on them for complex scenarios
- Better generalization: Progressive exposure to harder cases produces more robust models than random sampling
- Safety alignment: Early mastery of basic safe driving behaviors provides a foundation for handling edge cases
7.3 CuRLA: Curriculum Learning with Deep RL (2025)
CuRLA combines deep reinforcement learning with curriculum learning for autonomous driving in the CARLA simulator:
- Uses Proximal Policy Optimization (PPO) with a Variational Autoencoder (VAE) for state representation
- Five-phase curriculum progressively transforms the environment from simple to complex
- Each phase introduces new challenges (traffic, weather, road complexity)
7.4 CurricuVLM: VLM-Guided Curriculum Learning (2025)
CurricuVLM represents a significant advance by using Vision-Language Models (GPT-4o) as "curriculum designers":
Safety-Critical Event Analysis:
- When unsafe situations occur, GPT-4o generates narrative descriptions capturing violation type, object positioning, AV response patterns, and contextual factors
- Multiple event descriptions are analyzed collectively to identify recurring behavioral patterns
Personalized Curriculum Generation:
- Dynamically creates training scenarios targeting identified weaknesses
- Balances three objectives: realistic background vehicle trajectories, agent response distributions matching recent history, and alignment with VLM-identified weaknesses
- The result is a personalized curriculum that adapts to each training agent's specific deficiencies
Adaptive Scheduling:
- Safety-critical scenario probability increases progressively with training progress
- Early training emphasizes fundamentals; edge cases are gradually introduced
- Bounded triggering prevents overwhelming the agent with difficult scenarios
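A bounded, progress-dependent scheduler of this kind can be sketched in a few lines; the probabilities are illustrative, not CurricuVLM's actual values:

```python
import random

def safety_critical_prob(progress, p_min=0.05, p_max=0.4):
    """Probability of drawing a safety-critical scenario grows linearly with
    training progress in [0, 1], bounded above at p_max."""
    return min(p_max, p_min + (p_max - p_min) * progress)

def sample_scenario(progress, easy_pool, hard_pool, rng=random):
    """Draw a training scenario: mostly fundamentals early, with
    safety-critical cases mixed in at an increasing (but capped) rate."""
    if rng.random() < safety_critical_prob(progress):
        return rng.choice(hard_pool)
    return rng.choice(easy_pool)
```

The cap (`p_max`) is the "bounded triggering" idea: the agent is never trained exclusively on edge cases, which would destabilize basic driving skills.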
7.5 Curriculum for Multi-Domain Deployment
Curriculum learning has natural synergy with multi-airport deployment:
- Phase 1: Train on the most representative/common airport environment
- Phase 2: Introduce variations (different lighting, weather within the same airport)
- Phase 3: Add a second airport with moderate domain gap
- Phase 4: Introduce airports with progressively larger domain gaps
- Phase 5: Add rare/extreme scenarios across all domains
This structured exposure builds robust representations before challenging them with harder domains, reducing catastrophic forgetting compared to random domain ordering.
8. Model Versioning and Rollback for Fleets
8.1 Criticality
For safety-critical AV systems, the ability to rapidly detect model degradation and revert to a known-good version is essential. A new model release that performs well in simulation may exhibit unexpected failures in production due to untested edge cases or subtle distribution mismatches.
8.2 Versioning Infrastructure
Model Registry: Track all model versions with:
- Model weights and architecture definition
- Training data snapshot (or hash/reference)
- Training hyperparameters and procedure
- Validation metrics (simulation + real-world)
- Deployment status (staging, canary, production, retired)
- Associated code version and dependencies
Tools: MLflow, DVC (Data Version Control), Weights & Biases, and cloud-specific registries (Vertex AI Model Registry, Azure ML) provide model versioning capabilities. For safety-critical applications, the registry must also capture the training data provenance and validation evidence.
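A minimal registry record and rollback helper, sketched with illustrative field names (a real system would back this with a database, signed artifacts, and the validation evidence noted above):

```python
from dataclasses import dataclass

@dataclass
class ModelVersion:
    """One registry entry, mirroring the fields listed above."""
    version: str
    weights_uri: str            # pointer to compiled artifact
    data_snapshot_hash: str     # training data provenance
    hyperparams: dict
    validation_metrics: dict
    status: str = "staging"     # staging | canary | production | retired

class ModelRegistry:
    def __init__(self):
        self._versions = {}

    def register(self, mv: ModelVersion):
        self._versions[mv.version] = mv

    def promote(self, version, status):
        self._versions[version].status = status

    def latest_production(self):
        prod = [v for v in self._versions.values() if v.status == "production"]
        return max(prod, key=lambda v: v.version) if prod else None

    def rollback(self, bad_version):
        """Retire a bad version; return the newest remaining production model."""
        self._versions[bad_version].status = "retired"
        return self.latest_production()
```

The key property for fleets is that `rollback` returns a concrete, previously validated artifact, so the OTA system always has a known-good target.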
8.3 Deployment Strategies
Blue-Green Deployment
- Maintain two identical production environments (Blue = current, Green = new)
- Route traffic to Green gradually
- If Green shows issues, instantly revert to Blue
- Near-zero downtime rollback
Canary Deployment
- Deploy new model to 1-5% of fleet first
- Monitor safety metrics (intervention rate, hard braking, trajectory deviations)
- Gradually increase to 10%, 50%, 100% if metrics are satisfactory
- For AV fleets: "canary" vehicles could be specific vehicles in specific (lower-risk) operational areas
Shadow Deployment
- Run new model in parallel with production model
- Compare outputs without affecting vehicle behavior
- Tesla's shadow mode is essentially fleet-scale shadow deployment
- Enables risk-free evaluation on real-world data
8.4 Rollback Mechanisms
Three components must function together:
Continuous Monitoring: Detect performance degradation in real-time
- Perception accuracy metrics (compared against high-confidence ensemble)
- Planning quality metrics (smoothness, safety margins, route efficiency)
- Intervention/disengagement rate
- Latency and compute utilization
State Preservation: Maintain previous model versions in deployable state
- Pre-compiled models for target hardware
- Validated and certified for deployment
- Associated calibration parameters
Rapid Restoration: Switch back to previous version with minimal disruption
- For AV fleets: OTA rollback mechanism
- Immediate fallback to rule-based safe behavior during transition
- "Behavioral rollback": revert from AI-driven behavior to simpler rule-based control when anomalies surface
8.5 Safety-Critical Considerations
| Trigger Type | Response Time | Application |
|---|---|---|
| Safety violation detected | Milliseconds | Immediate behavioral rollback to rule-based control |
| Performance degradation | Minutes-hours | Model rollback via OTA for affected subset |
| Systematic bias discovered | Hours-days | Fleet-wide model rollback and investigation |
| Regulatory directive | Hours-days | Controlled fleet-wide rollback with documentation |
Regulatory context: the EU has mandated certain ADAS features since July 2024, with driver attention monitoring required from 2026. OTA update frameworks must satisfy cybersecurity, performance-validation, and traceability requirements. UNECE/GRVA's New Assessment/Test Method (NATM) mandates standardized scenario execution in simulation, on proving grounds, and in on-road testing, plus auditability and in-service performance monitoring.
9. Federated Learning for Privacy-Preserving Fleet Learning
9.1 Motivation
AV fleets generate enormous volumes of driving data that may contain:
- Personally identifiable information (passengers, pedestrians, license plates)
- Sensitive location data (airport security areas, restricted zones)
- Proprietary operational data (customer airline ground operations)
Federated learning enables fleet-wide model improvement without centralizing raw data.
9.2 Federated Learning Fundamentals
FedAvg (Federated Averaging):
- Central server distributes global model to all vehicles
- Each vehicle trains locally on its own data for several epochs
- Vehicles send model updates (gradients or weight deltas) to server
- Server aggregates updates via weighted averaging
- Updated global model is distributed back
Limitation: FedAvg struggles with highly non-IID (non-independent and identically distributed) data -- which is the norm for AV fleets where different vehicles operate in different environments.
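The five steps above reduce to a single aggregation rule on the server. A toy NumPy sketch, assuming vehicles exchange weight deltas (not a production FL framework):

```python
import numpy as np

def fedavg_round(global_w, client_deltas, client_ns):
    """One FedAvg round: apply the size-weighted average of client weight
    deltas to the global model. `client_ns` are local dataset sizes, so
    vehicles with more data pull the average harder."""
    total = sum(client_ns)
    avg_delta = sum((n / total) * d for d, n in zip(client_deltas, client_ns))
    return global_w + avg_delta
```

The weighting by dataset size is exactly what breaks down under non-IID data, motivating the variants in the next subsection.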
9.3 Handling Non-IID Data in AV Fleets
Different vehicles at different airports see fundamentally different data distributions. Standard aggregation strategies fail because local optima diverge.
FedProx: Adds a proximal term to each client's local objective, penalizing large deviations from the global model. Converges better than FedAvg under data heterogeneity and is recommended for AV applications.
FedNova: Normalizes and scales local updates based on local iteration count before aggregation, addressing heterogeneity in both data and computation.
FedLGA (2024): Uses Taylor expansion to approximate full local gradient updates for resource-constrained clients. Improved accuracy from 60.91% (FedAvg) to 64.44% on non-IID CIFAR-10.
Personalized FL: Rather than training one global model, personalized FL produces client-specific models that share some parameters globally while maintaining local specialization. This is attractive for multi-airport deployment where each site needs some local adaptation.
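Of these, FedProx is the simplest to state: its proximal term is a one-line addition to each client's local objective. A sketch of that objective (mu is the proximal coefficient; the value is illustrative):

```python
import numpy as np

def fedprox_local_loss(local_loss, w, w_global, mu=0.01):
    """FedProx client objective: the local task loss plus a quadratic
    penalty keeping each vehicle's weights near the current global model,
    which limits client drift under non-IID data."""
    return local_loss + 0.5 * mu * float(np.sum((w - w_global) ** 2))
```

Each vehicle minimizes this instead of the raw local loss; aggregation on the server is unchanged from FedAvg.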
9.4 Privacy-Preserving Mechanisms
Differential Privacy (DP):
- Add calibrated noise to model updates before transmission
- Provides mathematical guarantee: no single training example can significantly influence the model
- Correlated differential privacy (2024) reduces noise while maintaining privacy guarantees by exploiting correlations between model parameters
- Localized differential privacy (LDP) adds noise on-device before any transmission
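The on-device clip-and-noise step that LDP performs before any transmission can be sketched as follows. `clip_norm` and `noise_mult` are illustrative placeholders, not calibrated privacy budgets:

```python
import numpy as np

def privatize_update(delta, clip_norm=1.0, noise_mult=1.1, rng=None):
    """Clip a vehicle's model update to bound any single client's influence,
    then add calibrated Gaussian noise on-device before transmission."""
    rng = rng or np.random.default_rng()
    norm = np.linalg.norm(delta)
    clipped = delta * min(1.0, clip_norm / max(norm, 1e-12))
    return clipped + rng.normal(0.0, noise_mult * clip_norm, size=delta.shape)
```

Clipping bounds the sensitivity of the aggregate to any one client; the noise scale is then set relative to that bound, which is what yields the formal guarantee.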
Secure Aggregation:
- Cryptographic protocols that allow the server to compute aggregate updates without seeing individual contributions
- Lightweight homomorphic encryption (LHE) enables secure computation on encrypted model updates
- Byzantine-robust aggregation (Krum rule) selects updates most similar to the majority, filtering poisoned or anomalous gradients
Communication Efficiency:
- Gradient compression/sparsification: send only significant updates
- LoRA-based FL: only transmit low-rank adapter updates rather than full model weights
- Reduces bandwidth requirements critical for vehicles with limited connectivity
9.5 FLAD: Federated Learning for LLM-based AD (2025)
FLAD represents the frontier of federated learning for autonomous driving:
Three-Layer Architecture:
- Vehicle layer: local model training on onboard sensor data
- Edge layer: regional aggregation and intermediate model storage
- Cloud layer: global aggregation and model distribution
Key Innovations:
- Communication scheduling mechanism optimizing training efficiency
- Knowledge distillation for personalizing LLMs to heterogeneous edge data
- Intelligent parallelized collaborative training leveraging otherwise idle edge devices
- Prototyped on NVIDIA Jetson hardware, demonstrating practical feasibility
9.6 Federated Continual Learning
The intersection of federated learning and continual learning addresses the scenario where fleet vehicles encounter new tasks/domains over time while collaborating without sharing data:
- Each vehicle performs continual learning locally (with forgetting mitigation)
- Federated aggregation combines progress across vehicles
- Challenge: different vehicles may be at different stages of learning new domains
- Solution: federated replay where vehicles share synthetic (generative replay) data rather than raw data
9.7 Relevance to Airport Operations
Airport airside data is particularly sensitive:
- Security-restricted areas visible in camera feeds
- Airline operational patterns that may be commercially sensitive
- Worker and passenger privacy concerns
Federated learning enables:
- Each airport to maintain data sovereignty
- Cross-airport model improvement without data sharing
- Compliance with data residency regulations (different countries)
- Per-airport model personalization with global knowledge sharing
10. World Model Online Adaptation
10.1 World Models for Autonomous Driving
A world model is a generative spatio-temporal neural system that compresses multi-sensor observations into a compact latent state and rolls it forward under hypothetical actions, enabling the vehicle to "rehearse futures before they occur." World models serve multiple roles:
- Planning: Evaluate candidate trajectories by simulating their outcomes
- Simulation: Generate training data for rare scenarios
- Prediction: Forecast other agents' behavior
- Safety: Detect when the world deviates from expectations (OOD detection)
10.2 Key Architectures
Transformer-Based:
- GAIA-1: Treats video, text, and control as one token stream; generates minute-long controllable clips
- DriveDreamer series: Progresses from 2D to 4D generation with LLM-based prompting
- InfinityDrive: Multi-resolution spatiotemporal modeling with memory mechanisms for long-range dependencies
Diffusion-Based:
- Drive-WM, Vista: Latent diffusion for geometry-consistent video from BEV layouts, text prompts, optical flow
- HoloDrive, BEVGen: Fuse camera and LiDAR, lifting BEV layouts to street-level frames
10.3 AdaWM: Adaptive World Model Planning (ICLR 2025)
AdaWM is the most directly relevant work for online world model adaptation:
Problem: When pretrained world models encounter distribution shift in new environments, both the dynamics model and the planning policy can become mismatched, causing performance degradation.
Approach:
Mismatch Identification: Quantifies two sources of error:
- Policy mismatch (E_pi): pretrained policy is suboptimal for new tasks (Total Variation distance between state visitation distributions)
- Dynamics model mismatch (E_P): model predictions are inaccurate in new environment (TV distance between state-action visitation distributions)
- Decision criterion: update dynamics model if E_P >= C1*E_pi - C2
Alignment-Driven Finetuning:
- For dynamics models: LoRA-style low-rank adaptation that freezes the pretrained weights and updates only the low-rank factor matrices
- For policies: decompose into weighted convex combination of sub-units, update only weight vectors
- Selective: only update whichever component has greater mismatch
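The decision criterion above reduces to a small selector. C1 and C2 are the task-dependent constants from the paper's analysis; the defaults here are placeholders:

```python
def select_component_to_update(e_p, e_pi, c1=1.0, c2=0.0):
    """AdaWM-style selective finetuning: update the dynamics model when its
    mismatch dominates (E_P >= C1 * E_pi - C2), otherwise the policy."""
    return "dynamics" if e_p >= c1 * e_pi - c2 else "policy"
```

Updating only the dominant source of mismatch is what keeps adaptation cheap and limits interference with the component that is still well-matched.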
Results (CARLA):
- Roundabout tasks: Success rate 0.82 vs. 0.40 baseline
- Left turn in dense traffic: Success rate 0.70 vs. 0.35 baseline
- Outperforms both supervised methods (VAD, UniAD) and model-based RL (DreamerV3)
10.4 Waymo World Model (February 2026)
Waymo's World Model, built on Google DeepMind's Genie 3, represents the current state-of-the-art:
Foundation: Genie 3's pre-training on an "extremely large and diverse set of videos" provides broad world knowledge. Specialized post-training converts 2D video knowledge into 3D lidar outputs matching Waymo's hardware.
Multi-Sensor Generation: Generates high-fidelity camera AND lidar data simultaneously -- critical for validating multi-modal perception pipelines.
Three Control Mechanisms:
- Driving action control: Simulate counterfactual scenarios and alternative routes
- Scene layout control: Customize road layouts, traffic signals, other road user behavior
- Language control: Adjust time-of-day, weather, or generate entirely synthetic scenarios via text prompts
Adaptation Implication: Language-controlled generation enables testing the AV stack against scenarios it has never encountered in the real world, proactively preparing for distribution shift before deployment in new cities. Waymo has logged nearly 200 million real autonomous miles but "racks up billions of miles in virtual worlds."
10.5 comma.ai's World Model (v0.10)
comma.ai's world model ("Tomb Raider") is used for simulation-based on-policy training:
Architecture:
- Compressor: Stable Diffusion's image VAE reduces states to latent representations
- Dynamics Model: Video Diffusion Transformer predicts latent transitions given history and actions
- Plan Head: Predicts curvature and acceleration trajectories trained on human driving
Future Anchoring: The model receives future states at fixed time steps ahead, enabling recovery from compounding prediction errors -- a critical innovation for long-horizon world model rollouts.
Deployment: Policies trained through this world model are live in openpilot (v0.8.15 for lateral, v0.9.0 for longitudinal, v0.10 for world model simulation).
10.6 Online Adaptation of World Models
As environments change (new construction at an airport, seasonal changes, new vehicle types), world models must adapt without catastrophic forgetting. The key challenges:
- Scarce data: Safety-critical edge cases are rare in real-world collection
- Long-horizon reliability: Prediction errors compound over multi-step rollouts
- Physics violations: World models may generate physically impossible scenarios (sudden vehicle appearances, incorrect speeds)
- Multi-sensor consistency: Adapted models must maintain consistency between camera and lidar outputs
Promising approaches:
- AdaWM's selective finetuning (update only the mismatched component)
- LoRA-based adaptation (few new parameters, fast to update)
- Uncertainty-aware ensembles for epistemic/aleatoric risk separation
- VLM-guided approaches: label logs with VLM safety scores, generate safety-aware rollouts
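Since both AdaWM and the LoRA bullet above rely on the same mechanism, here is a minimal sketch of it: the pretrained weight matrix W stays frozen while only two small low-rank factors adapt online:

```python
import numpy as np

class LoRALinear:
    """Low-rank adaptation of a frozen linear map: y = (W + B A) x.
    Only A and B -- r * (d_in + d_out) parameters -- are trained, so online
    updates touch a tiny fraction of the model and are cheap to roll back."""
    def __init__(self, w_frozen, rank=4, scale=1.0):
        d_out, d_in = w_frozen.shape
        self.w = w_frozen                 # frozen pretrained weights
        self.a = np.random.randn(rank, d_in) * 0.01
        self.b = np.zeros((d_out, rank))  # zero-init: adapter starts as a no-op
        self.scale = scale

    def __call__(self, x):
        return self.w @ x + self.scale * (self.b @ (self.a @ x))
```

The zero-initialized B guarantees the adapted model starts exactly at the pretrained behavior, and discarding A and B restores it, which is precisely the fast-rollback property the deployment pipeline in Section 8 needs.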
11. Synthesis: Architecture for Continual-Learning AV Fleets
Drawing on all sections above, a reference architecture for a continual-learning AV fleet:
+-------------------------------------------------------------------+
| CLOUD / DATA CENTER |
| |
| [Model Registry] <-> [Training Pipeline] <-> [Data Lake] |
| | | ^ |
| [Validation & [Active [Auto-Labeling] |
| Simulation] Learning] ^ |
| | | | |
| [A/B Test & [Curriculum [Data |
| Canary Mgmt] Scheduler] Selection] |
| | ^ |
+-------|--------------------------------------------|--------------+
| |
v OTA Updates | Trigger Uploads
+-------------------------------------------------------------------+
| EDGE / VEHICLE FLEET |
| |
| [Perception] -> [World Model] -> [Planning] -> [Control] |
| | | | |
| [OOD Detection] [Drift Monitor] [Safety Monitor] |
| | | | |
| [Trigger [Online [Behavioral |
| Classifiers] Adaptation] Rollback] |
| |
+-------------------------------------------------------------------+
| |
v FL Aggregation v
+-------------------------------------------------------------------+
| FEDERATED LEARNING LAYER |
| |
| [Local Training] -> [Gradient Compression] -> [Secure Agg] |
| [DP Noise Addition] -> [Personalization] -> [Global Model] |
| |
+-------------------------------------------------------------------+
Key design principles:
- Defense in depth against forgetting: Combine EWC/MAS regularization with replay buffers and architecture-based isolation for maximum robustness
- Selective data collection: Use trigger classifiers and uncertainty-based active learning to collect only informative data, not everything
- Graduated deployment: Shadow mode -> canary fleet -> staged rollout -> full deployment, with rollback at every stage
- Multi-domain awareness: Domain-incremental learning with curriculum ordering of new sites
- Privacy by design: Federated learning with differential privacy for cross-airport knowledge sharing
- World model as force multiplier: Generate synthetic scenarios for domains not yet encountered; test adaptation before physical deployment
12. Key Takeaways for Airport Airside Deployment
Start with strong base training, then adapt incrementally. A model trained on a diverse initial airport with rich scenarios provides a better foundation than training from scratch at each site. Waymo's experience suggests that each new domain surfaces fewer novel situations than the last.
EWC + replay is the practical sweet spot for preventing catastrophic forgetting during multi-airport adaptation. EWC is simple to implement and has demonstrated 0% collision rates in AV car-following experiments. Supplement with a replay buffer of domain-specific edge cases from each airport.
Deploy trigger classifiers for airside-specific edge cases: GSE conflicts, jet blast zones, wide-body pushback, wildlife, construction zones, unusual aircraft types. These enable targeted data collection from the fleet.
Use world models for proactive adaptation: Before deploying at a new airport, generate synthetic scenarios with the world model (varying airport layout, weather, traffic density). Test the model's performance in simulation before physical deployment.
Domain-incremental learning with curriculum ordering: Deploy at airports in order of increasing domain gap from the training distribution. Each deployment adds to the model's domain coverage, making subsequent deployments easier.
Federated learning enables cross-airport improvement without data sharing: Each airport retains data sovereignty while contributing to global model improvement. Personalized FL allows site-specific specialization.
Build rollback into the deployment pipeline from day one: Canary deployment at each new airport, shadow mode validation, behavioral rollback to rule-based control as the ultimate safety net. Track disengagement rates and intervention triggers per model version.
Monitor for drift continuously: Airport environments change (new construction, seasonal weather, new aircraft types, new ground procedures). Drift detection should trigger automated data collection and retraining workflows.
Budget for labeling infrastructure: Active learning reduces but does not eliminate labeling needs. Auto-labeling with GPS-correlated annotations (similar to Tesla's intersection labeling) can reduce per-frame cost dramatically.
Plan for compute at the edge: PackNet or knowledge distillation enables deploying multi-domain models on vehicle hardware. LoRA-based adaptation enables on-device fine-tuning for per-vehicle calibration.
13. References
Catastrophic Forgetting and Continual Learning
- Overcoming Catastrophic Forgetting in Neural Networks (EWC) -- Kirkpatrick et al., original EWC paper
- Elastic Weight Consolidation Done Right for Continual Learning -- 2026 analysis of EWC limitations
- On the Computation of the Fisher Information in Continual Learning -- Fisher information analysis for CL
- Continual Learning for Adaptable Car-Following in Dynamic Traffic Environments -- EWC/MAS applied to autonomous driving
- Progressive Neural Networks -- Rusu et al., DeepMind
- Architecture-Based Continual Learning Algorithms -- PackNet, AdaHAT overview
- Experience Replay Addresses Loss of Plasticity in Continual Learning -- 2025 study
- Adaptive Memory Replay for Continual Learning (CVPR 2024)
- Online Continual Learning: A Systematic Literature Review
- Catastrophic Forgetting in Neural Networks -- IBM overview
Surveys and Overviews
- Advancing Autonomy Through Lifelong Learning: A Survey -- 2024 survey of lifelong learning for AIS
- Continual Learning for Real-World Autonomous Systems -- Springer survey
- A Survey of Autonomous Driving from a Deep Learning Perspective -- ACM Computing Surveys 2025
Data Flywheel and Active Learning
- How Tesla Turned Every Driver Into a Data Source -- Tesla's data flywheel
- Tesla Data Engine: Trigger Classifiers -- Detailed pipeline analysis
- Autonomous Vehicle Training and Tesla's Data Engine Explained
- ActiveAD: Planning-Oriented Active Learning for End-to-End Autonomous Driving
- Data-Centric Evolution in Autonomous Driving: Comprehensive Survey -- Closed-loop pipeline architectures
- Tesla's FSD: The Software Flywheel
Domain Adaptation and Multi-Domain Deployment
- Brain-Inspired Domain-Incremental Adaptive Detection for Autonomous Driving
- Multi-Domain Adaptation for Autonomous Driving Perception
- Domain Adaptation for Autonomous Driving
- Cross-Domain Autonomous Driving Visual Segmentation
- Gradual Divergence for Seamless Adaptation: Domain Incremental Learning
Distribution Shift and OOD Detection
- Out-of-Distribution Segmentation in Autonomous Driving: Problems and State of the Art
- OOD Detection for Safety Assurance
- Novelty-Aware Concept Drift Detection for Neural Networks
- Concept Drift Detection in Image Data Stream: Survey
- Efficient Spatio-Temporal OOD Detection for Autonomous Systems
Curriculum Learning
- CurricuVLM: Safety-Critical Curriculum Learning with VLMs
- CuRLA: Curriculum Learning Based Deep RL for Autonomous Driving
- Decision Making for AVs: Mixed Curriculum RL
Model Versioning and Deployment
- Hitting the Undo Button: The Critical Role of Rollback in AI Systems
- Model Versioning: Top Tools and Best Practices
- Future-Proofing Automotive Software via OTA Updates
- Model Rollbacks Through Versioning
Federated Learning
- Federated Learning for Connected and Automated Vehicles: Survey
- FLAD: Federated Learning for LLM-based Autonomous Driving -- Vehicle-edge-cloud architecture
- Personalized FL for Autonomous Driving with Correlated DP
- FLAV: Federated Learning for Autonomous Vehicle Privacy Protection
- Deep Federated Learning: Systematic Review
- Federated Learning: Survey on Privacy-Preserving Collaborative Intelligence
World Models
- A Survey of World Models for Autonomous Driving -- Comprehensive 2025 survey
- The Role of World Models in Shaping Autonomous Driving
- AdaWM: Adaptive World Model Based Planning (ICLR 2025)
- The Waymo World Model (February 2026)
- Learning to Drive from a World Model -- comma.ai
- Waymo: Safe, Routine, Ready -- Autonomous Driving in New Cities
Industry and Deployment
- comma.ai openpilot -- 325+ car models, 20K+ users
- openpilot 0.10 Release -- World model training architecture
- Waymo's 2025 Year in Review
- Tesla 2026: Cortex 2 Confirmed
- Decoding Tesla's Core AI and Hardware Architecture