
Reinforcement Learning for Autonomous Driving Policy Learning

Comprehensive guide to model-free and offline reinforcement learning for learning driving policies — covering on-policy (PPO, IMPALA), off-policy (SAC, TD3, TQC, CrossQ), offline RL (CQL, IQL, EDAC, Decision Transformer), behavior cloning bootstrapping, constrained/safe RL (CPO, CMDP, Lagrangian), policy distillation for edge deployment, and offline-to-online fine-tuning. Focused on practical applicability to airport airside GSE operations with the reference ROS Noetic airside stack.

Relation to existing docs: Complements rl-with-world-models.md (model-based RL with Dreamer/TD-MPC), neural-motion-planning.md (IL-based planning), diffusion-trajectory-planning.md (diffusion-based generation), safety-critical-planning-cbf.md (CBF safety filters), causal-reasoning-counterfactual.md (causal policy evaluation). This document focuses on model-free and offline RL policy learning — the algorithms that directly optimize a policy from environment interaction or fixed datasets, without requiring a learned dynamics model.

Key Takeaway: For airside autonomous GSE, offline RL from recorded fleet data is the most realistic path to safe policy learning — online exploration on an active apron is unacceptable. CaRL (CoRL 2025) demonstrates that PPO with simple route-completion rewards scales to complex driving and is the best open-source RL planner on both CARLA Leaderboard 2.0 and nuPlan. IQL emerges as the most practical offline RL algorithm for driving (consistent across traffic densities, no need for explicit behavior policy). The recommended approach is a three-phase pipeline: (1) behavior cloning from Frenet planner demonstrations as a warm start, (2) offline RL fine-tuning on fleet data with IQL/CQL, (3) online refinement in simulation (CARLA airport env) with PPO + CBF safety filter. Policy distillation compresses the learned policy to run within the 14.8ms multi-task perception budget on Orin.


Table of Contents

  1. Why RL for Driving Policy Learning
  2. RL Fundamentals for Driving
  3. On-Policy Methods
  4. Off-Policy Methods
  5. Offline Reinforcement Learning
  6. Behavior Cloning and Bootstrapping
  7. Safe and Constrained RL
  8. Offline-to-Online Fine-Tuning
  9. Policy Distillation for Edge Deployment
  10. RL Benchmarks and Evaluation
  11. Practical Implementation for Airside
  12. Key Takeaways
  13. References

1. Why RL for Driving Policy Learning

1.1 The Limitations of Current Approaches

The reference airside AV stack's current Frenet planner generates 420 trajectory candidates per cycle and selects the lowest-cost one via hand-crafted cost functions. This works well for structured, low-speed airside operations but has fundamental limitations:

| Limitation | Impact |
|---|---|
| Hand-crafted cost functions | Cannot capture all interaction nuances; adding new behaviors requires manual engineering |
| Combinatorial explosion | 420 candidates sample sparsely in high-dimensional trajectory space |
| No learning from experience | Same behavior whether first or thousandth time at a stand |
| Poor multi-agent reasoning | Cost functions don't model other agents' reactions to ego actions |
| Conservative by default | Rule-based safety margins are uniform; can't adapt to context |

1.2 What RL Offers

Reinforcement learning directly optimizes a policy π(a|s) to maximize cumulative reward through interaction with an environment (or dataset). For driving:

  • Learns from outcome, not demonstration: Can discover behaviors better than the human/rule-based teacher
  • Handles sequential decisions: Naturally reasons about long-horizon consequences
  • Adapts to distribution shift: Online RL continuously improves from new data
  • Multimodal behavior: Stochastic policies capture multiple valid driving strategies

1.3 Why Not Just Use IL?

Imitation learning (behavior cloning, DAgger) learns from expert demonstrations. It's simpler than RL but suffers from:

| Issue | RL Advantage |
|---|---|
| Compounding error | BC drifts from expert distribution; errors compound over the trajectory. RL optimizes closed-loop performance directly |
| Distribution mismatch | IL only sees expert states; at test time, minor errors push to unseen states. RL explores and recovers |
| Bounded by teacher | IL policy is at best as good as the expert. RL can exceed the expert |
| Reward hacking vs. copying | IL copies behavior (including irrelevant correlations). RL optimizes the actual objective |

CaRL (CoRL 2025) demonstrated this concretely: RL with simple rewards outperforms all IL baselines on CARLA Leaderboard 2.0 longest6 v2 and achieves SOTA on nuPlan, while being more scalable with training compute.

1.4 Model-Free vs. Model-Based RL

This document focuses on model-free (and offline) RL. The distinction:

| Aspect | Model-Based (Dreamer, TD-MPC) | Model-Free (PPO, SAC, IQL) |
|---|---|---|
| Learns dynamics | Yes — explicit world model | No — learns policy/value directly |
| Sample efficiency | Higher (imagined rollouts) | Lower (needs more real/sim data) |
| Compounding model error | Yes — model errors accumulate | No — no model to be wrong |
| Compute at inference | Planning in model (higher) | Forward pass through policy (lower) |
| Best for | Complex dynamics, long horizon | Simple/well-understood dynamics, abundant data |

For airside at low speeds (5-25 km/h) with relatively simple dynamics, model-free RL is viable and avoids the complexity of learned dynamics models. Model-based RL (covered in rl-with-world-models.md) is better when dynamics are complex or data is scarce.


2. RL Fundamentals for Driving

2.1 MDP Formulation for Driving

Autonomous driving as a Markov Decision Process (MDP):

M = (S, A, T, R, γ)

S: State space — ego state (x, y, θ, v, κ) + perception output (detected objects, free space, map)
A: Action space — trajectory waypoints (x, y, θ, v) at future timesteps, or direct control (steering, throttle)
T: Transition dynamics — vehicle kinematics + environment evolution
R: Reward function — safety, progress, comfort, efficiency
γ: Discount factor — typically 0.99 for driving (long-horizon)

2.2 Action Space Design

The choice of action space profoundly affects learning:

| Action Space | Dimensionality | Pros | Cons |
|---|---|---|---|
| Direct control (δ, a) | 2D continuous | Simple, low-dim | Jerky, no trajectory coherence |
| Waypoint sequence (x_t, y_t)×H | 2H (e.g., 20D for H=10) | Smooth, interpretable | High-dimensional, harder to learn |
| Lateral offset + speed (d, v) | 2D continuous | Maps to Frenet frame | Limited expressivity |
| Trajectory index | Discrete (K=420) | Matches Frenet candidates | Fixed set, no interpolation |
| Residual on planner | Low-dim continuous | Refines existing planner | Coupled to planner quality |

Recommendation for reference airside AV stack: Start with lateral offset + longitudinal speed in Frenet frame. This maps directly to the existing Frenet planner's output space, enables incremental deployment (RL as a "selector" over Frenet candidates), and keeps the action space low-dimensional.
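
To make this action-space choice concrete, the sketch below maps a 2D (lateral offset, target speed) action onto a short Frenet-frame target; the route_frame.to_cartesian interface and the clamp limits are illustrative assumptions, not part of the existing stack.

python
import numpy as np

def rl_action_to_frenet_target(action, route_frame, horizon_s=3.0, dt=0.1):
    """Map a 2D RL action (lateral offset, target speed) onto a short
    Frenet-frame target the downstream planner/controller can consume."""
    lateral_offset_m = float(np.clip(action[0], -1.5, 1.5))        # assumed bound
    target_speed_mps = float(np.clip(action[1], 0.0, 25.0 / 3.6))  # 25 km/h cap

    # Constant-offset, constant-speed rollout along the route centerline.
    times = np.arange(dt, horizon_s + dt, dt)
    s_points = target_speed_mps * times
    waypoints = [route_frame.to_cartesian(s, lateral_offset_m) for s in s_points]
    return waypoints, target_speed_mps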

2.3 State Representation

What the RL agent observes:

python
from dataclasses import dataclass
from typing import List

import numpy as np


@dataclass
class DrivingState:
    """State representation for RL driving policy."""
    
    # Ego state (from GTSAM localization)
    ego_position: np.ndarray  # (x, y) in map frame
    ego_heading: float        # θ radians
    ego_speed: float          # m/s
    ego_curvature: float      # κ 1/m
    ego_acceleration: float   # m/s²
    
    # Route information (from mission planner)
    route_waypoints: np.ndarray  # (N, 2) upcoming waypoints
    distance_to_goal: float
    
    # Detected objects (from PointPillars/multi-task head)
    objects: List[DetectedObject]  # position, velocity, class, uncertainty
    
    # Occupancy/free space (from occupancy head)
    bev_occupancy: np.ndarray  # (H, W) binary/probabilistic
    
    # Map features (from Lanelet2 + semantic map)
    lane_boundaries: np.ndarray  # relative to ego
    speed_limits: np.ndarray
    right_of_way: int  # priority level (0-8 from neuro-symbolic doc)
    
    # Operational context
    weather_condition: int  # ODD state from runtime monitor
    time_of_day: int
    airport_zone: int  # apron, taxiway, service road, etc.

Encoding for RL: Flatten ego state + route into vector, encode objects via PointNet-style aggregation or attention, encode BEV as CNN features. Total state dimension: ~256-512D after encoding.

2.4 Reward Design

Reward design is the most critical and difficult aspect of RL for driving.

CaRL's insight (CoRL 2025): Complex shaped rewards (summing 10+ terms) cause PPO to fail at scale because conflicting gradients from different reward terms become harder to reconcile with larger batch sizes. A single primary reward (route completion) with multiplicative infraction penalties and episode termination scales much better.

Reward structure for airside GSE:

python
def airside_reward(state, action, next_state, info):
    """
    CaRL-inspired reward: route completion + infraction penalties.
    """
    # Primary reward: progress along route
    route_progress = info['route_completion_delta']  # fraction of route completed this step
    
    # Infraction penalties (multiplicative, not additive)
    infraction_multiplier = 1.0
    
    # Safety infractions — terminate episode
    if info['collision']:
        return -1.0  # terminal
    if info['runway_incursion']:
        return -1.0  # terminal
    if info['aircraft_proximity'] < AIRCRAFT_MIN_DISTANCE:
        return -1.0  # terminal
    if info['personnel_proximity'] < PERSONNEL_MIN_DISTANCE:
        return -1.0  # terminal
    
    # Soft infractions — reduce reward multiplicatively
    if info['speed_violation']:
        infraction_multiplier *= 0.5
    if info['wrong_zone']:
        infraction_multiplier *= 0.3
    if info['excessive_jerk']:
        infraction_multiplier *= 0.8
    if info['off_route'] > OFF_ROUTE_THRESHOLD:
        infraction_multiplier *= 0.5
    
    # Comfort bonus (small, doesn't dominate)
    comfort = -0.01 * abs(info['lateral_acceleration'])
    
    return route_progress * infraction_multiplier + comfort

2.5 Discount Factor and Horizon

| Parameter | Typical Road | Airside GSE | Rationale |
|---|---|---|---|
| γ (discount) | 0.99 | 0.995 | Longer missions (10-30 min), need long-horizon planning |
| Episode length | 20-60 s | 120-600 s | Full stand-to-stand mission |
| Decision frequency | 10 Hz | 10 Hz | Matches perception pipeline |
| Effective horizon | 1/(1-γ) = 100 steps | 200 steps | 20 s lookahead at 10 Hz |

3. On-Policy Methods

On-policy methods update the policy using data collected by the current policy. They're sample-inefficient but stable.

3.1 PPO (Proximal Policy Optimization)

PPO (Schulman et al., 2017) is the most widely used RL algorithm for driving, and the backbone of CaRL.

Why PPO dominates driving RL:

  • Clipped objective prevents destructive policy updates
  • Works with both discrete and continuous actions
  • Parallelizes well across many environments
  • Simple to implement and tune

PPO objective:

L^CLIP(θ) = E_t [min(r_t(θ) Â_t, clip(r_t(θ), 1-ε, 1+ε) Â_t)]

where r_t(θ) = π_θ(a_t|s_t) / π_θ_old(a_t|s_t)  (probability ratio)
      Â_t = advantage estimate (GAE-λ)
      ε = clip range (typically 0.2)

CaRL configuration (SOTA on CARLA + nuPlan):

| Hyperparameter | CaRL Value | Notes |
|---|---|---|
| Learning rate | 3e-4 | Adam with linear warmup |
| Clip range ε | 0.2 | Standard |
| GAE λ | 0.95 | High bias-variance tradeoff |
| Discount γ | 0.99 | |
| Mini-batch size | 2048+ | Key finding: scales with simple rewards |
| Entropy coefficient | 0.01 | Encourages exploration |
| Value function coeff | 0.5 | |
| Max gradient norm | 0.5 | Gradient clipping |
| Network | MLP (256, 256) | Privileged state input |
| Reward | Route completion | Single term + infraction penalties |

CaRL's scaling insight: With complex shaped rewards (10+ weighted terms), PPO's performance degrades when mini-batch size increases. With simple route-completion reward, performance improves with larger batches. This makes RL scalable with more compute — a fundamental requirement for production.

python
import torch
import torch.nn as nn


class PPODrivingAgent:
    """PPO agent for airside driving."""
    
    def __init__(self, state_dim, action_dim, config):
        self.actor = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim * 2),  # mean + log_std
        )
        self.critic = nn.Sequential(
            nn.Linear(state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )
        self.optimizer = torch.optim.Adam(
            list(self.actor.parameters()) + list(self.critic.parameters()),
            lr=config.lr,
        )
        self.clip_range = config.clip_range
        self.gamma = config.gamma
        self.gae_lambda = config.gae_lambda
        self.n_epochs = config.n_epochs  # PPO epochs per update, typically 10
    
    def get_action(self, state):
        """Sample action from current policy."""
        output = self.actor(state)
        mean, log_std = output.chunk(2, dim=-1)
        std = log_std.clamp(-5, 2).exp()
        dist = torch.distributions.Normal(mean, std)
        action = dist.sample()
        log_prob = dist.log_prob(action).sum(-1)
        return action, log_prob
    
    def compute_gae(self, rewards, values, dones):
        """Generalized Advantage Estimation."""
        advantages = torch.zeros_like(rewards)
        last_gae = 0
        for t in reversed(range(len(rewards))):
            if t == len(rewards) - 1:
                next_value = 0
            else:
                next_value = values[t + 1]
            delta = rewards[t] + self.gamma * next_value * (1 - dones[t]) - values[t]
            advantages[t] = delta + self.gamma * self.gae_lambda * (1 - dones[t]) * last_gae
            last_gae = advantages[t]
        returns = advantages + values
        return advantages, returns
    
    def update(self, batch):
        """PPO clipped objective update."""
        states, actions, old_log_probs, advantages, returns = batch
        
        # Normalize advantages
        advantages = (advantages - advantages.mean()) / (advantages.std() + 1e-8)
        
        for _ in range(self.n_epochs):  # typically 10
            # Get current policy distribution
            output = self.actor(states)
            mean, log_std = output.chunk(2, dim=-1)
            std = log_std.clamp(-5, 2).exp()
            dist = torch.distributions.Normal(mean, std)
            new_log_probs = dist.log_prob(actions).sum(-1)
            entropy = dist.entropy().sum(-1).mean()
            
            # Policy ratio
            ratio = (new_log_probs - old_log_probs).exp()
            
            # Clipped objective
            surr1 = ratio * advantages
            surr2 = torch.clamp(ratio, 1 - self.clip_range, 1 + self.clip_range) * advantages
            policy_loss = -torch.min(surr1, surr2).mean()
            
            # Value loss
            values = self.critic(states).squeeze(-1)
            value_loss = 0.5 * (returns - values).pow(2).mean()
            
            # Total loss
            loss = policy_loss + 0.5 * value_loss - 0.01 * entropy
            
            self.optimizer.zero_grad()
            loss.backward()
            nn.utils.clip_grad_norm_(
                list(self.actor.parameters()) + list(self.critic.parameters()), 0.5)
            self.optimizer.step()

3.2 IMPALA (Importance Weighted Actor-Learner Architecture)

IMPALA decouples acting from learning, enabling massive parallelism:

  • Actors: Many parallel environment instances collecting experience
  • Learner: Single GPU training on batched experience
  • V-trace correction: Corrects for off-policy data from slightly stale actor policies
v_s = V(x_s) + Σ_{t=s}^{s+n-1} γ^{t-s} (Π_{i=s}^{t-1} c_i) δ_t V

where c_i = min(c̄, π(a_i|x_i) / μ(a_i|x_i))  (truncated importance weight)
      δ_t V = ρ_t(r_t + γV(x_{t+1}) - V(x_t))
      ρ_t = min(ρ̄, π(a_t|x_t) / μ(a_t|x_t))

Relevance for airside: IMPALA enables training with 100+ parallel CARLA instances. At ~10 FPS per instance, 100 instances provide 1,000 environment steps/second — enough to train a driving policy in 24-48 hours.
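
A minimal, time-major sketch of the V-trace target computation above (episode-termination masking is omitted for brevity; tensor shapes and the truncation constants ρ̄ = c̄ = 1.0 are illustrative assumptions):

python
import torch

def vtrace_targets(behaviour_log_probs, target_log_probs, rewards, values,
                   bootstrap_value, gamma=0.99, rho_bar=1.0, c_bar=1.0):
    """V-trace value targets for a batch of trajectories (time-major [T, B])."""
    # Truncated importance weights between learner policy π and actor policy μ.
    rhos = torch.exp(target_log_probs - behaviour_log_probs)
    clipped_rhos = torch.clamp(rhos, max=rho_bar)
    cs = torch.clamp(rhos, max=c_bar)

    # One-step TD errors weighted by ρ_t.
    next_values = torch.cat([values[1:], bootstrap_value.unsqueeze(0)], dim=0)
    deltas = clipped_rhos * (rewards + gamma * next_values - values)

    # Backward recursion: (v_s - V(x_s)) = δ_s + γ c_s (v_{s+1} - V(x_{s+1})).
    vs_minus_v = torch.zeros_like(values)
    acc = torch.zeros_like(bootstrap_value)
    for t in reversed(range(values.shape[0])):
        acc = deltas[t] + gamma * cs[t] * acc
        vs_minus_v[t] = acc

    return values + vs_minus_v  # v_s targets for the value function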

3.3 On-Policy Performance Comparison

| Algorithm | CARLA Longest6 v2 | nuPlan CLS-R | Stability | Sample Efficiency |
|---|---|---|---|---|
| PPO (CaRL) | SOTA (open-source) | SOTA (open-source) | High | Low (needs 50M+ steps) |
| IMPALA | Good | | High | Medium (V-trace helps) |
| A3C | Moderate | | Low (async instability) | Low |
| TRPO | Good | | Very high | Very low |

4. Off-Policy Methods

Off-policy methods learn from data collected by any policy (including old policies or expert demonstrations). Much more sample-efficient than on-policy.

4.1 SAC (Soft Actor-Critic)

SAC (Haarnoja et al., 2018) adds maximum entropy to the RL objective:

π* = arg max_π E [Σ_t γ^t (R(s_t, a_t) + α H(π(·|s_t)))]

where α = entropy temperature (auto-tuned)
      H(π) = -E[log π(a|s)]

Why entropy matters for driving: The entropy bonus encourages exploration of multiple valid driving strategies (faster lane, slower but safer route) rather than collapsing to a single behavior. For airside, this helps discover alternative routes around obstacles.

SAC components:

  • Actor: Squashed Gaussian policy π_θ(a|s) — outputs mean + std, samples via reparameterization, applies tanh squashing
  • Twin critics: Two Q-networks Q_φ1, Q_φ2 — take minimum to prevent overestimation
  • Target networks: Exponential moving average for stability (τ = 0.005)
  • Auto-tuned α: Adjusts entropy weight to maintain target entropy = -dim(A)

| Hyperparameter | Driving Value | Notes |
|---|---|---|
| Learning rate | 3e-4 | Same for actor and critics |
| Replay buffer | 1M transitions | ~28 hours of 10 Hz driving |
| Batch size | 256 | |
| Target update τ | 0.005 | Soft update |
| Discount γ | 0.99 | |
| Target entropy | -dim(A) | Auto-tuned α |
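
The sketch below ties these components together in one SAC gradient step; it assumes a tanh-squashed Gaussian actor exposing sample(states) -> (action, log_prob), with networks and optimizers defined elsewhere, and target_entropy = -dim(A).

python
import torch
import torch.nn.functional as F

def sac_update(actor, q1, q2, q1_target, q2_target, log_alpha, batch,
               gamma=0.99, target_entropy=-2.0):
    """One SAC gradient step (losses only; optimizer steps happen outside)."""
    states, actions, rewards, next_states, dones = batch
    alpha = log_alpha.exp().detach()

    # Critic target: entropy-regularized Bellman backup with clipped twin critics.
    with torch.no_grad():
        next_actions, next_log_probs = actor.sample(next_states)
        target_q = torch.min(q1_target(next_states, next_actions),
                             q2_target(next_states, next_actions))
        target_q = rewards + gamma * (1 - dones) * (target_q - alpha * next_log_probs)

    q1_loss = F.mse_loss(q1(states, actions), target_q)
    q2_loss = F.mse_loss(q2(states, actions), target_q)

    # Actor: maximize Q - α log π via reparameterized samples.
    new_actions, log_probs = actor.sample(states)
    q_new = torch.min(q1(states, new_actions), q2(states, new_actions))
    actor_loss = (alpha * log_probs - q_new).mean()

    # Temperature: auto-tune α toward the target entropy -dim(A).
    alpha_loss = -(log_alpha * (log_probs + target_entropy).detach()).mean()

    return q1_loss + q2_loss, actor_loss, alpha_loss

def soft_update(target_net, online_net, tau=0.005):
    """Polyak averaging of target network parameters."""
    for p_t, p in zip(target_net.parameters(), online_net.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)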

4.2 TD3 (Twin Delayed DDPG)

TD3 (Fujimoto et al., 2018) addresses overestimation in DDPG with three tricks:

  1. Clipped double Q-learning: min(Q_φ1, Q_φ2) — same as SAC
  2. Delayed policy updates: Update actor every 2 critic updates
  3. Target policy smoothing: Add clipped noise to target actions

python
# Target Q-value computation in TD3: target policy smoothing + clipped double Q
noise = (torch.randn_like(action) * policy_noise).clamp(-noise_clip, noise_clip)
target_action = (target_actor(next_state) + noise).clamp(-max_action, max_action)
target_q1 = target_critic1(next_state, target_action)
target_q2 = target_critic2(next_state, target_action)
target_q = reward + gamma * (1 - done) * torch.min(target_q1, target_q2)

TD3 vs SAC for driving: SAC generally outperforms TD3 on driving tasks due to entropy-regularized exploration. TD3 is simpler but more brittle with reward design.

4.3 TQC (Truncated Quantile Critics)

TQC (Kuznetsov et al., 2020) extends SAC with distributional critics:

  • Uses N=5 quantile critic networks, each predicting M=25 quantiles of the return distribution
  • Drops the top d=2 atoms from the combined quantile distribution for pessimism
  • Achieves more accurate Q-value estimates and better exploration

First applied to driving in 2025: TQC outperformed SAC, TD3, and DDPG on urban CARLA scenarios, particularly in intersection navigation where value estimation is challenging.
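
A sketch of the truncation step, assuming each critic maps (state, action) to M quantile estimates; shapes and the drop-per-network count are illustrative:

python
import torch

def tqc_target_quantiles(quantile_critics, next_state, next_action,
                         reward, done, gamma=0.99, drop_per_net=2):
    """Truncated quantile Bellman target: pool atoms, drop the top, discount."""
    # Each critic returns (B, M) quantile estimates; pool to (B, N*M) and sort.
    atoms = torch.cat([critic(next_state, next_action)
                       for critic in quantile_critics], dim=1)
    atoms, _ = torch.sort(atoms, dim=1)

    # Drop the d highest atoms per network for a pessimistic return estimate.
    n_drop = drop_per_net * len(quantile_critics)
    atoms = atoms[:, : atoms.shape[1] - n_drop]

    # Distributional Bellman backup on the remaining atoms.
    return reward.unsqueeze(1) + gamma * (1 - done.unsqueeze(1)) * atoms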

4.4 CrossQ

CrossQ (Bhatt et al., 2024) simplifies SAC by removing target networks entirely:

  • Uses batch normalization in critics to stabilize training
  • Processes current and next states in the same forward pass (cross-batch normalization)
  • Achieves SAC-level performance with 50% fewer parameters and no target network updates

Advantage for Orin: Smaller critic networks mean faster training if doing on-device fine-tuning (relevant for federated RL).
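
A minimal sketch of the CrossQ idea — a batch-normalized critic with no target network, evaluating current and next state-action pairs in one forward pass so the normalization statistics cover both; plain BatchNorm1d is used here as a stand-in and the network sizes are illustrative:

python
import torch
import torch.nn as nn

class CrossQCritic(nn.Module):
    """Batch-normalized Q-network used without a target network."""

    def __init__(self, state_dim, action_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + action_dim, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.BatchNorm1d(hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state, action):
        return self.net(torch.cat([state, action], dim=-1))

def crossq_q_values(critic, states, actions, next_states, next_actions):
    """Evaluate (s, a) and (s', a') in one forward pass so BatchNorm statistics
    are computed over both, then split the result back apart."""
    joint_states = torch.cat([states, next_states], dim=0)
    joint_actions = torch.cat([actions, next_actions], dim=0)
    q_joint = critic(joint_states, joint_actions)
    q_current, q_next = torch.chunk(q_joint, 2, dim=0)
    return q_current, q_next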

4.5 Off-Policy Performance Comparison

| Algorithm | CARLA Urban | Sample Efficiency | Hyperparameter Sensitivity | Implementation Complexity |
|---|---|---|---|---|
| SAC | Strong | High | Low (auto-α) | Medium |
| TD3 | Moderate | High | Medium | Low |
| TQC | Strong+ | High | Low | Medium-High |
| CrossQ | Strong | High | Low | Medium |
| DDPG | Weak | Medium | High | Low |

5. Offline Reinforcement Learning

Offline RL learns policies entirely from a fixed dataset of previously collected transitions, without any environment interaction. This is the most relevant paradigm for airside deployment.

5.1 Why Offline RL for Airside

Online RL requires environment interaction — impossible on an active airport apron:

| Constraint | Impact |
|---|---|
| Safety | Random exploration near aircraft risks $250K+ damage per incident |
| Availability | Can't monopolize a real stand for RL training |
| Regulatory | No regulatory framework for "learning" vehicles on apron |
| Cost | Each real-world episode requires operator oversight |

Offline RL learns from:

  • Frenet planner logs: Thousands of hours of rule-based driving (state, action, reward can be computed post-hoc)
  • Human operator demonstrations: Recorded during safety-operator-present deployments
  • Simulation data: CARLA airport environment (see sim-to-real doc)
  • Fleet data: Continuously growing dataset from deployed vehicles

5.2 The Distribution Shift Problem

The fundamental challenge: the learned policy π encounters states not in the dataset D (collected by behavior policy β), causing Q-value overestimation for unseen (state, action) pairs.

Online RL: if Q is wrong → agent visits that state → gets corrected
Offline RL: if Q is wrong → agent never visits → error persists and compounds

5.3 CQL (Conservative Q-Learning)

CQL (Kumar et al., 2020) addresses overestimation by adding a regularizer that pushes down Q-values for out-of-distribution actions:

L_CQL(φ) = α * (E_{s~D, a~π} [Q_φ(s,a)] - E_{(s,a)~D} [Q_φ(s,a)]) + L_TD(φ)
                ^^^^^^^^^^^^^^^^^^^^^^^^^^^^    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
                Push down OOD Q-values          Push up in-distribution Q-values

Effect: The learned Q-function is a lower bound on the true Q-function for in-distribution actions, preventing the policy from exploiting overestimated values for unseen actions.
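
A sketch of the conservative penalty, using the logsumexp form of the push-down term as an approximation to the expectation in the equation above; the policy.sample interface and the sample count are assumptions:

python
import torch

def cql_regularizer(q_net, states, dataset_actions, policy, n_samples=10, alpha=1.0):
    """Conservative penalty added to the TD loss: push down Q on sampled
    (potentially OOD) actions, push up Q on dataset actions."""
    batch_size = states.shape[0]

    # Q-values for actions sampled from the current policy at the same states.
    repeated_states = states.repeat_interleave(n_samples, dim=0)
    sampled_actions, _ = policy.sample(repeated_states)
    q_sampled = q_net(repeated_states, sampled_actions).view(batch_size, n_samples)

    # Q-values for the actions actually present in the dataset.
    q_data = q_net(states, dataset_actions)

    # logsumexp over sampled actions approximates a soft maximum over actions.
    push_down = torch.logsumexp(q_sampled, dim=1).mean()
    push_up = q_data.mean()
    return alpha * (push_down - push_up)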

| CQL Hyperparameter | Driving Value | Notes |
|---|---|---|
| α (conservative weight) | 1.0-5.0 | Higher = more conservative. Start at 1.0 for expert data, 5.0 for mixed-quality data |
| Min Q-weight | 1.0 | For SAC-style entropy regularization |
| Network | (256, 256) MLP or ResNet-18 encoder | |
| Batch size | 256 | |
| Learning rate | 3e-4 | |

CQL for driving: AD4RL benchmark showed CQL achieves reasonable performance on highway and urban driving from offline datasets, but tends to be overly conservative — the vehicle drives slowly and hesitates at intersections.

5.4 IQL (Implicit Q-Learning)

IQL (Kostrikov et al., 2022) avoids querying OOD actions entirely by using expectile regression:

L_V(ψ) = E_{(s,a)~D} [L_τ(Q_φ(s,a) - V_ψ(s))]

where L_τ(u) = |τ - 1(u < 0)| * u²
      τ = expectile (0.5-0.9, typically 0.7)

Key insight: IQL never evaluates Q(s, a) for actions a not in the dataset. The value function V(s) is trained to estimate the τ-th expectile of Q-values in the dataset, effectively extracting the best actions without explicit maximization.

Advantages for driving:

  • No need to sample/evaluate OOD actions — more stable than CQL
  • Simpler to tune (single hyperparameter τ)
  • Consistent performance across traffic densities (2026 AEB study)
  • Works well with mixed-quality data (different drivers/planners)
python
import torch


class IQLDrivingAgent:
    """Implicit Q-Learning for offline driving policy."""
    
    def __init__(self, state_dim, action_dim, config):
        # QNetwork, VNetwork, GaussianActor: standard MLP modules defined elsewhere
        self.q1 = QNetwork(state_dim, action_dim, hidden=256)
        self.q2 = QNetwork(state_dim, action_dim, hidden=256)
        self.v = VNetwork(state_dim, hidden=256)
        self.actor = GaussianActor(state_dim, action_dim, hidden=256)
        
        self.gamma = config.gamma           # discount factor, e.g. 0.99
        self.tau = config.expectile         # 0.7 default
        self.beta = config.awr_temperature  # 3.0 for advantage-weighted regression
    
    def update_value(self, states, actions):
        """Expectile regression for V-function."""
        with torch.no_grad():
            q1 = self.q1(states, actions)
            q2 = self.q2(states, actions)
            q = torch.min(q1, q2)
        
        v = self.v(states)
        diff = q - v
        weight = torch.where(diff > 0, self.tau, 1 - self.tau)
        v_loss = (weight * diff.pow(2)).mean()
        return v_loss
    
    def update_q(self, states, actions, rewards, next_states, dones):
        """Standard Bellman backup using V-function (no max over actions)."""
        with torch.no_grad():
            next_v = self.v(next_states)
            target_q = rewards + self.gamma * (1 - dones) * next_v
        
        q1_loss = ((self.q1(states, actions) - target_q).pow(2)).mean()
        q2_loss = ((self.q2(states, actions) - target_q).pow(2)).mean()
        return q1_loss + q2_loss
    
    def update_actor(self, states, actions):
        """Advantage-weighted regression (AWR) for policy extraction."""
        with torch.no_grad():
            q = torch.min(self.q1(states, actions), self.q2(states, actions))
            v = self.v(states)
            advantage = q - v
            # Exponential advantage weighting
            weights = torch.exp(self.beta * advantage)
            weights = torch.clamp(weights, max=100.0)  # prevent explosion
        
        log_prob = self.actor.log_prob(states, actions)
        actor_loss = -(weights * log_prob).mean()
        return actor_loss

5.5 EDAC (Ensemble-Diversified Actor-Critic)

EDAC (An et al., 2021) uses a large ensemble of Q-functions (N=10-50) with a diversity regularizer:

  • Penalizes Q-functions that agree on OOD actions (forces disagreement = uncertainty)
  • Uses the mean - λ*std of ensemble Q-values as the target
  • More fine-grained uncertainty estimation than CQL's blanket pessimism

For driving: EDAC is promising for mixed-quality datasets (some expert, some novice demonstrations) because the ensemble uncertainty is higher for states visited only by poor drivers.
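
A sketch of an ensemble-pessimistic Bellman target in the mean-minus-λ·std form described above (the gradient-diversity regularizer that EDAC adds on top is omitted):

python
import torch

def ensemble_pessimistic_target(q_ensemble, next_state, next_action,
                                reward, done, gamma=0.99, lam=1.0):
    """Mean-minus-λ·std Bellman target over a Q-ensemble: disagreement on
    unfamiliar (state, action) pairs lowers the target value."""
    q_values = torch.stack([q(next_state, next_action) for q in q_ensemble], dim=0)
    pessimistic_q = q_values.mean(dim=0) - lam * q_values.std(dim=0)
    return reward + gamma * (1 - done) * pessimistic_q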

5.6 Decision Transformer and Sequence Models

Decision Transformer (Chen et al., 2021) casts offline RL as conditional sequence generation:

Input:  (R̂_1, s_1, a_1, R̂_2, s_2, a_2, ..., R̂_t, s_t)
Output: a_t

where R̂_t = desired return-to-go (sum of future rewards)

At test time: Condition on a high desired return to generate expert-quality actions.
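
A sketch of the token interleaving, assuming learned embedding modules embed_r/embed_s/embed_a and batch-first inputs; the causal Transformer that consumes these tokens is omitted:

python
import torch

def build_dt_tokens(returns_to_go, states, actions, embed_r, embed_s, embed_a):
    """Interleave (return-to-go, state, action) embeddings into the token
    sequence consumed by a Decision Transformer. Inputs are batch-first:
    returns_to_go (B, T), states (B, T, S), actions (B, T, A)."""
    r_tok = embed_r(returns_to_go.unsqueeze(-1))  # (B, T, D)
    s_tok = embed_s(states)                       # (B, T, D)
    a_tok = embed_a(actions)                      # (B, T, D)

    batch, horizon, dim = s_tok.shape
    # Per-timestep ordering R̂_t, s_t, a_t, flattened along the time axis.
    tokens = torch.stack([r_tok, s_tok, a_tok], dim=2)  # (B, T, 3, D)
    return tokens.reshape(batch, 3 * horizon, dim)      # (B, 3T, D)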

Trajectory Transformer (Janner et al., 2021) takes this further — discretizes everything into tokens and uses beam search for planning.

For airside driving:

  • Natural fit: driving is already sequential decision-making
  • Can condition on different "quality levels" — e.g., low return-to-go generates cautious driving
  • Scales with data and compute (Transformer scaling laws apply)
  • Limitation: no stitching — can only reproduce behaviors seen in the dataset, not combine sub-trajectories from different demonstrations

5.7 Offline RL Algorithm Comparison for Driving

| Algorithm | AEB Performance (2026) | Conservatism | Stability | Data Requirements | Stitching |
|---|---|---|---|---|---|
| BC (baseline) | Moderate | N/A — copies data | High | Any | No |
| CQL | Good | High (overly cautious) | Medium | Expert + mixed | Yes |
| IQL | Best (consistent) | Moderate | High | Any | Limited |
| EDAC | Good | Adaptive | Medium | Mixed quality | Yes |
| Decision Transformer | Moderate | Conditioned | High | Expert preferred | No |
| BPPO | Good | Moderate | Medium | Any | Yes |

6. Behavior Cloning and Bootstrapping

6.1 BC as Warm Start

Behavior cloning pre-trains the policy on expert demonstrations before RL fine-tuning:

L_BC(θ) = E_{(s,a)~D_expert} [-log π_θ(a|s)]

Why BC first:

  • RL from scratch requires millions of steps even in simulation
  • BC provides a reasonable initial policy in ~10K gradient steps
  • Subsequent RL fine-tuning corrects BC's compounding errors

For reference airside AV stack: The Frenet planner generates thousands of hours of (state, action) pairs. BC on this data produces a neural policy that mimics the Frenet planner, which RL then improves upon.
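
A minimal BC warm-start loop, assuming the policy exposes log_prob(states, actions) as in the Gaussian actor used elsewhere in this document and that the dataloader yields (state, action) pairs extracted from planner logs:

python
import torch

def train_bc(policy, dataloader, epochs=10, lr=3e-4):
    """Behavior-cloning warm start: maximize log-likelihood of logged actions."""
    optimizer = torch.optim.Adam(policy.parameters(), lr=lr)
    for _ in range(epochs):
        for states, actions in dataloader:
            loss = -policy.log_prob(states, actions).mean()
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return policy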

6.2 DAgger (Dataset Aggregation)

DAgger (Ross et al., 2011) iteratively corrects the distribution mismatch:

  1. Train initial policy π_0 from expert data
  2. Run π_i in the environment, collecting states s_i
  3. Query the expert for actions a* at states s_i
  4. Aggregate D = D ∪ {(s_i, a*)} and retrain

Adaptation for airside: Instead of querying a human expert, use the Frenet planner as the oracle. DAgger with the Frenet planner is safe (Frenet planner always available as fallback) and automatable.

python
def dagger_airside(frenet_planner, initial_policy, env, n_iterations=10):
    """DAgger with Frenet planner as expert oracle."""
    dataset = collect_expert_data(frenet_planner, env, n_episodes=100)
    policy = train_bc(initial_policy, dataset)
    
    for i in range(n_iterations):
        # Roll out current policy (with Frenet safety fallback)
        states = collect_states_with_policy(policy, env, n_episodes=50)
        
        # Query Frenet planner at visited states
        expert_actions = [frenet_planner.plan(s) for s in states]
        
        # Aggregate and retrain
        dataset.extend(zip(states, expert_actions))
        policy = train_bc(policy, dataset)  # or fine-tune
    
    return policy

6.3 BC → Offline RL → Online RL Pipeline

The recommended three-phase approach:

| Phase | Method | Data Source | Duration | Expected Improvement |
|---|---|---|---|---|
| Phase 1: BC | Supervised learning | Frenet planner logs (1000+ hours) | 2-4 hours training | Baseline policy ~90% of Frenet |
| Phase 2: Offline RL | IQL/CQL on fleet data | Fleet logs + simulation data | 8-16 hours training | +5-15% over BC |
| Phase 3: Online RL | PPO in CARLA airport env | Simulated interaction | 24-72 hours training | +10-20% over offline |

7. Safe and Constrained RL

7.1 Why Standard RL Is Unsafe

Standard RL maximizes expected return, which can include rare catastrophic failures offset by good average performance:

Standard: max E[Σ γ^t r_t]  — average performance
Needed:   max E[Σ γ^t r_t] s.t. P(collision) < ε  — worst-case constraints

7.2 Constrained MDP (CMDP) Formulation

CMDP augments the MDP with cost constraints:

max_π E_π [Σ γ^t R(s_t, a_t)]
s.t. E_π [Σ γ^t C_i(s_t, a_t)] ≤ d_i  for i = 1, ..., m

where C_i = cost function for constraint i
      d_i = maximum allowed cumulative cost

Airside constraints:

| Constraint | Cost Function | Threshold d |
|---|---|---|
| Collision | 1 if collision, 0 otherwise | 0 (zero tolerance) |
| Aircraft proximity | max(0, d_min - d_aircraft) | 0 |
| Speed limit | max(0, v - v_limit) | 0.1 (minor violations OK) |
| Geofence | 1 if outside permitted zone | 0 |
| Comfort | jerk | |

7.3 Lagrangian Methods

Convert CMDP to unconstrained optimization with adaptive Lagrange multipliers:

L(π, λ) = E_π [Σ γ^t R] - Σ_i λ_i (E_π [Σ γ^t C_i] - d_i)

Update policy: θ ← θ + α_θ ∇_θ L(π_θ, λ)
Update multipliers: λ_i ← max(0, λ_i + α_λ (E[C_i] - d_i))

PPO-Lagrangian simply adds Lagrangian cost terms to PPO's objective. Implemented in OmniSafe library.
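
A sketch of one adaptive multiplier, kept non-negative via a softplus parameterization (the learning rate and initialization are illustrative); the resulting λ scales the cost-advantage term subtracted from PPO's surrogate objective:

python
import torch
import torch.nn.functional as F

class LagrangeMultiplier:
    """Single adaptive Lagrange multiplier, kept non-negative via softplus."""

    def __init__(self, init=0.0, lr=5e-2):
        self.param = torch.nn.Parameter(torch.tensor(init))
        self.optimizer = torch.optim.Adam([self.param], lr=lr)

    @property
    def value(self):
        """Current λ used to weight the cost term in the policy objective."""
        return F.softplus(self.param).detach()

    def update(self, mean_episode_cost, cost_limit):
        """Gradient ascent on λ(E[C] - d): λ grows while the constraint is
        violated and decays toward zero once it is satisfied."""
        lam = F.softplus(self.param)
        loss = -lam * (mean_episode_cost - cost_limit)
        self.optimizer.zero_grad()
        loss.backward()
        self.optimizer.step()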

7.4 CPO (Constrained Policy Optimization)

CPO (Achiam et al., 2017) provides a trust-region method with hard constraint satisfaction:

  • Each policy update is projected onto the constraint-satisfying region
  • Guarantees near-constraint satisfaction at every iteration (not just convergence)
  • More conservative than Lagrangian but safer during training

7.5 Safety Layer / CBF Integration

Instead of learning safe behavior from scratch, combine RL with a safety filter (see safety-critical-planning-cbf.md):

a_safe = argmin_a ||a - a_RL||²
         s.t. ∇h(s)·(f(s) + g(s)a) ≥ -α(h(s))  (CBF constraint)

Architecture for reference airside AV stack:

RL Policy → proposed action → CBF-QP filter → safe action → vehicle
                                    ↑
                        Safety constraints from
                        runtime monitor (STL specs)

Advantages:

  • RL doesn't need to learn safety — focuses on performance
  • CBF provides formal safety guarantees (see CBF doc)
  • Matches Simplex architecture: RL as advanced controller, Frenet as fallback, CBF as filter
python
import rospy


class SafeRLController:
    """RL policy with CBF safety filter."""
    
    def __init__(self, rl_policy, cbf_filter, frenet_fallback):
        self.rl_policy = rl_policy
        self.cbf_filter = cbf_filter
        self.frenet_fallback = frenet_fallback
        self.use_rl = True
    
    def get_action(self, state):
        if not self.use_rl:
            return self.frenet_fallback.plan(state)
        
        # Get RL proposed action
        action_rl = self.rl_policy(state)
        
        # Apply CBF safety filter
        action_safe, feasible = self.cbf_filter.filter(state, action_rl)
        
        if not feasible:
            # CBF can't make RL action safe → switch to Frenet (Simplex)
            self.trigger_simplex_switch()
            return self.frenet_fallback.plan(state)
        
        return action_safe
    
    def trigger_simplex_switch(self):
        """Log intervention and switch to safe controller."""
        self.use_rl = False
        rospy.logwarn("Simplex: RL → Frenet fallback triggered")

7.6 Recovery RL

Recovery RL (Thananjeyan et al., 2021) trains two policies:

  • Task policy: Optimizes performance (may be unsafe)
  • Recovery policy: Trained to return to safe states when risk is detected

The system switches to the recovery policy when the task policy's proposed action enters a "danger zone" estimated by a learned safety critic.

For airside: The recovery policy could be trained specifically on "near-miss" scenarios — learning aggressive but safe evasive maneuvers that the conservative Frenet planner can't generate.


8. Offline-to-Online Fine-Tuning

8.1 The Problem

Offline RL policies, while safe to train, are bounded by the quality and coverage of the offline dataset. Online fine-tuning improves them but risks catastrophic forgetting and initial performance collapse.

8.2 Cal-QL (Calibrated Conservative Q-Learning)

Cal-QL (Nakamoto et al., 2024) addresses the "initial dip" problem in offline-to-online:

  • During online fine-tuning, gradually relaxes CQL's conservatism
  • Calibrates the conservative Q-function to the true Q-function as online data accumulates
  • Eliminates the performance dip when transitioning from offline to online

8.3 RLPD (Reinforcement Learning with Prior Data)

RLPD (Ball et al., 2023) simply mixes offline data with online experience in the replay buffer:

Mini-batch = 50% offline data + 50% online data
Train standard SAC on mixed mini-batches

Surprisingly effective: This simple approach matches or exceeds sophisticated offline-to-online methods on many benchmarks.
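
A sketch of the symmetric sampling step, assuming both buffers expose a sample(n) method returning dicts of equally keyed arrays:

python
import numpy as np

def sample_mixed_batch(offline_buffer, online_buffer, batch_size=256):
    """Symmetric sampling: half the mini-batch from the fixed offline dataset,
    half from the growing online replay buffer."""
    half = batch_size // 2
    offline = offline_buffer.sample(half)
    online = online_buffer.sample(batch_size - half)
    return {key: np.concatenate([offline[key], online[key]], axis=0)
            for key in offline}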

8.4 Practical Offline-to-Online for Airside

Phase 1 (Offline): Train IQL on fleet data → conservative but safe policy
Phase 2 (Sim Online): PPO fine-tuning in CARLA airport env with CBF safety filter
Phase 3 (Real Online): RLPD with fleet data (offline) + shadow mode data (online)
                        Safety: Simplex architecture, Frenet fallback always available

Shadow mode integration (see 60-safety-validation/verification-validation/shadow-mode.md):

  • RL policy runs in shadow (no vehicle control)
  • Compares RL actions to Frenet planner actions
  • When RL would have performed better, adds to online replay buffer
  • When RL would have been worse/unsafe, adds as negative example

9. Policy Distillation for Edge Deployment

9.1 Why Distillation

RL training uses large networks (256-512 hidden units, ensembles) and runs on GPU servers. Deployment on Orin requires:

| Constraint | Training | Orin Deployment |
|---|---|---|
| Latency | Not critical | <5ms per inference |
| Memory | 16-80 GB GPU | Shared with perception |
| Network size | 5-50M params | 0.5-2M params |
| Precision | FP32 | FP16/INT8 |

9.2 Knowledge Distillation

Train a small "student" policy to mimic the large "teacher":

L_distill = E_s [(1-α) * L_BC(π_student, D_expert) + α * L_KD(π_student, π_teacher)]

where L_KD = KL(π_teacher(·|s) || π_student(·|s))
      α = distillation weight (0.5-0.9)
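
A sketch of this combined loss, assuming both policies expose a dist(states) method returning a torch Normal over actions; α trades off matching the teacher against staying anchored to logged expert actions:

python
import torch

def distillation_loss(student, teacher, states, expert_actions, alpha=0.7):
    """Combined BC + KL distillation loss for compressing the teacher policy."""
    student_dist = student.dist(states)
    with torch.no_grad():
        teacher_dist = teacher.dist(states)

    # Match the teacher's action distribution.
    kd_loss = torch.distributions.kl_divergence(teacher_dist, student_dist).mean()

    # Stay anchored to logged expert actions.
    bc_loss = -student_dist.log_prob(expert_actions).sum(-1).mean()

    return (1 - alpha) * bc_loss + alpha * kd_loss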

9.3 Privileged-to-Sensor Distillation

CaRL and many driving RL methods train with privileged state (perfect object positions, ground-truth map) but deploy with sensor input (LiDAR point clouds, images):

Teacher: π_privileged(a | ground_truth_state)     ← trained with RL
Student: π_sensor(a | lidar_features, map_features) ← trained with distillation

Two-stage training:

  1. Train teacher with PPO/IQL using privileged state (fast, converges well)
  2. Distill into student that takes sensor features as input (supervised, stable)

This is the approach used by comma.ai (see 80-industry-intel/companies/comma-ai/production-world-model.md): 2B parameter DiT world model as teacher → small FastViT+Transformer policy for on-device deployment.

9.4 Distilled Policy Architecture for Orin

python
import torch.nn as nn


class DistilledDrivingPolicy(nn.Module):
    """Compact policy for Orin deployment (~500K params, <3ms FP16)."""
    
    def __init__(self, feature_dim=128, action_dim=2):
        super().__init__()
        # Input: concatenated features from perception backbone
        # (reuse BEV features from multi-task head, no extra encoder)
        self.policy_head = nn.Sequential(
            nn.Linear(feature_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 64),
            nn.ReLU(),
            nn.Linear(64, action_dim),  # (lateral_offset, target_speed)
        )
    
    def forward(self, bev_features):
        """
        bev_features: (B, 128) from shared perception backbone
        Returns: (B, 2) — (lateral_offset_m, target_speed_mps)
        """
        return self.policy_head(bev_features)

# Orin inference budget:
# Perception backbone (shared): 14.8ms (from multi-task doc)
# Policy head: ~0.5ms FP16
# CBF safety filter: ~1.0ms
# Total: ~16.3ms → 60 Hz feasible

10. RL Benchmarks and Evaluation

10.1 Simulation Benchmarks

| Benchmark | Environment | Metrics | Best RL Method |
|---|---|---|---|
| CARLA Leaderboard 2.0 | CARLA simulator | Route completion, infractions, driving score | CaRL (PPO) |
| nuPlan | Real-world replay + simulation | CLS-R, OLS, reactive metrics | CaRL (PPO) |
| AD4RL | Highway + urban offline datasets | Normalized return, collision rate | IQL |
| MetaDrive | Procedural environments | Success rate, efficiency | PPO |
| SMARTS | Multi-agent traffic | Completion, safety, comfort | SAC |
| Waymax | Waymo data, JAX-based | Log-likelihood, collision, off-road | |

10.2 nuPlan: The Gold Standard

nuPlan provides the most realistic evaluation:

  • 1282 hours of real driving from 4 cities
  • Closed-loop simulation: ego actions affect the environment
  • Reactive agents: Other vehicles respond to ego's behavior
  • CLS-R metric: Closed-loop score with reactive agents (composite of progress, safety, comfort)

| Method | CLS-R (Val14) | Type |
|---|---|---|
| PDM-Closed (rule-based) | ~92 | Rule-based |
| CaRL (PPO) | ~89 (best open-source RL) | RL |
| Diffusion-ES | ~90 | Diffusion + search |
| BC baseline | ~75 | Imitation learning |

Key insight: Rule-based PDM still leads on average, but RL methods (CaRL) outperform on the hardest interactive scenarios where rule-based logic fails.

10.3 Metrics for Airside RL Evaluation

Standard road metrics don't capture airside requirements:

| Metric | Description | Target |
|---|---|---|
| Mission completion rate | Stand-to-stand success | >99.5% |
| Aircraft proximity violation | Enters aircraft safety buffer | <0.1% of missions |
| Personnel safety distance | Min distance to ground crew | >3m 100% of time |
| Speed compliance | Within zone speed limits | >99% |
| Geofence compliance | Within permitted zones | 100% |
| Turnaround time contribution | Arrival within assigned window | >95% |
| Comfort score | Max lateral accel, jerk | <2 m/s², <5 m/s³ |
| Simplex intervention rate | How often fallback needed | <1% of decisions |
| Energy efficiency | kWh per mission vs baseline | <110% of optimal |

11. Practical Implementation for Airside

11.1 Training Infrastructure

| Component | Specification | Cost |
|---|---|---|
| Training server | 4x A100 80GB (or 4x RTX 4090) | $15-30K one-time |
| CARLA instances | 32-64 parallel on same server | Included |
| CARLA airport env | Custom airport map (see sim-to-real doc) | $10-20K development |
| Offline dataset | 1000+ hours Frenet planner logs | Already available |
| nuPlan license | Academic/commercial | Free/negotiable |

11.2 Phased Deployment Plan

Phase 0: BC Baseline (Weeks 1-4, $5-10K)

  • Extract (state, action) pairs from Frenet planner ROS bags
  • Train BC policy on fleet data
  • Evaluate in CARLA airport environment
  • Deliverable: Neural policy that mimics Frenet planner

Phase 1: Offline RL (Weeks 5-10, $10-15K)

  • Implement IQL on extracted fleet data
  • Add reward labels to fleet data post-hoc (route progress, proximity violations)
  • Train offline RL policy, evaluate closed-loop in simulation
  • Deliverable: Offline RL policy that outperforms BC by 5-15%

Phase 2: Online RL in Simulation (Weeks 11-18, $15-25K)

  • Set up CaRL-style PPO training in CARLA airport env
  • Integrate CBF safety filter during training
  • 50M+ environment steps (2-5 days wall time on 4x A100)
  • Deliverable: Online RL policy with SOTA simulation performance

Phase 3: Distillation + Deployment (Weeks 19-24, $10-15K)

  • Distill privileged RL policy to sensor-input student
  • TensorRT optimization for Orin (FP16, <3ms)
  • Shadow mode evaluation on real fleet (2-4 weeks)
  • Deliverable: Orin-deployable RL policy, shadow mode validation results

Phase 4: Closed-Loop Deployment (Weeks 25-32, $5-10K)

  • Integrate with Simplex architecture (RL as advanced controller)
  • CBF safety filter in real-time
  • Frenet planner as always-available fallback
  • A/B testing: RL vehicles vs Frenet-only vehicles
  • Deliverable: Production RL deployment with safety guarantees

Total: $45-75K over 32 weeks

11.3 ROS Integration Architecture

┌─────────────────────────────────────────────────┐
│                  Decision Layer                   │
│                                                   │
│  ┌──────────────┐  ┌──────────────┐              │
│  │  RL Policy   │  │   Frenet     │              │
│  │  (advanced)  │  │   Planner    │              │
│  │              │  │  (fallback)  │              │
│  └──────┬───────┘  └──────┬───────┘              │
│         │                  │                      │
│         ▼                  │                      │
│  ┌──────────────┐          │                      │
│  │  CBF Safety  │          │                      │
│  │   Filter     │          │                      │
│  └──────┬───────┘          │                      │
│         │                  │                      │
│         ▼                  ▼                      │
│  ┌──────────────────────────────┐                │
│  │     Simplex Decision Module   │                │
│  │  (switches RL↔Frenet based   │                │
│  │   on runtime monitor STL)    │                │
│  └──────────────┬───────────────┘                │
│                  │                                │
│                  ▼                                │
│         /cmd_vel or /trajectory                   │
└─────────────────────────────────────────────────┘

ROS topics:

yaml
# Inputs to RL policy
/perception/bev_features:        sensor_msgs/Image  # (H,W,C) BEV feature map
/localization/ego_state:         nav_msgs/Odometry   # pose + velocity
/planning/route_waypoints:       nav_msgs/Path        # upcoming route
/perception/detected_objects:    vision_msgs/Detection3DArray
/runtime_monitor/odd_state:      std_msgs/Int32      # ODD operational state

# RL policy output
/planning/rl_trajectory:         nav_msgs/Path       # proposed trajectory
/planning/rl_confidence:         std_msgs/Float32    # policy entropy (uncertainty)

# After CBF filter
/planning/safe_trajectory:       nav_msgs/Path       # filtered trajectory

# Simplex output
/planning/active_controller:     std_msgs/String     # "rl" or "frenet"
/control/cmd_vel:                geometry_msgs/Twist  # final command

11.4 Sim-to-Real Considerations

| Challenge | Mitigation |
|---|---|
| Dynamics gap | Train on randomized dynamics (mass, friction, steering delay) |
| Sensor gap | Privileged→sensor distillation (see Section 9.3) |
| Scenario gap | Adversarial scenario generation (see testing doc) |
| Reward gap | Use real-world metrics in simulation reward |
| Latency gap | Add random 10-50ms action delay during training |

11.5 Continuous Improvement Loop

Fleet Data → Offline RL Update (monthly)
        ↓
        Simulation Validation (automated)
        ↓
        Shadow Mode Testing (1-2 weeks)
        ↓
        A/B Testing (2-4 weeks, subset of fleet)
        ↓
        Full Fleet Rollout (OTA update)
        ↓
        Fleet Data → ... (loop)

This integrates with the data flywheel (see 50-cloud-fleet/mlops/data-flywheel-airside.md) and the federated learning pipeline (see 50-cloud-fleet/mlops/federated-learning-fleet.md) for multi-airport policy adaptation.


12. Key Takeaways

  1. CaRL (CoRL 2025) is SOTA for open-source RL driving: PPO with simple route-completion reward + infraction penalties, first public codebase for RL on CARLA Leaderboard 2.0 and nuPlan. Key insight: complex shaped rewards prevent PPO from scaling with batch size.

  2. IQL is the best offline RL algorithm for driving: Consistent performance across traffic densities, no need to evaluate OOD actions, single hyperparameter (expectile τ). 2026 AEB study confirms IQL outperforms CQL and BPPO for autonomous emergency braking.

  3. Offline RL is mandatory for airside initial policy learning: Online exploration on an active apron is unacceptable. The thousands of hours of Frenet planner logs provide a natural offline dataset. Rewards can be computed post-hoc from logged states.

  4. BC → Offline RL → Online RL pipeline is the recommended approach: BC warm start (90% of Frenet performance), IQL fine-tuning (+5-15%), PPO in simulation (+10-20%). Each phase builds on the previous, with decreasing risk.

  5. Safety filter (CBF-QP) decouples performance from safety: RL focuses on route completion and efficiency; CBF guarantees collision avoidance and constraint satisfaction. This matches the Simplex architecture with RL as advanced controller and Frenet as fallback.

  6. Simple rewards scale; complex rewards don't: CaRL proved that summing 10+ reward terms causes PPO to fail at large batch sizes. For airside: route completion × infraction multiplier + episode termination on collision. Resist the urge to add more reward terms.

  7. Privileged-to-sensor distillation enables Orin deployment: Train with ground-truth state (fast convergence), distill to sensor-input student (500K params, <3ms FP16 on Orin). This is the comma.ai approach at smaller scale.

  8. Policy distillation fits within the multi-task perception budget: The distilled policy head adds only ~0.5ms to the shared backbone (14.8ms). Total decision pipeline: 14.8ms perception + 0.5ms policy + 1.0ms CBF = 16.3ms → 60 Hz.

  9. SAC is the best off-policy algorithm for driving: Entropy regularization provides robust exploration, auto-tuned temperature eliminates a key hyperparameter. TQC and CrossQ are promising but less battle-tested.

  10. Lagrangian PPO is the simplest safe RL approach: Adds constraint costs as Lagrangian terms to PPO objective, auto-tunes multipliers. CPO provides stronger guarantees but is harder to implement. Both available in OmniSafe library.

  11. Shadow mode enables safe offline-to-online transition: RL policy runs in parallel with Frenet planner, decisions compared but not executed. Positive comparisons enter online replay buffer; negative ones become training signal. Zero risk during transition.

  12. Recovery RL trains an emergency maneuver policy: Separate from the task policy, activated when safety critic detects danger. For airside: aggressive but safe evasive maneuvers that the conservative Frenet planner can't generate (e.g., tight swerve around suddenly-placed obstacle).

  13. AD4RL benchmark provides airside-relevant offline RL evaluation: Highway and urban driving datasets with proper offline RL evaluation protocols. Complements nuPlan for offline algorithm selection.

  14. DAgger with Frenet planner is a free lunch: Use the Frenet planner as an always-available oracle for dataset aggregation. The neural policy runs in the loop, visits states the Frenet planner wouldn't, and gets corrected. Zero safety risk (Frenet always available as fallback).

  15. RLPD's simple 50/50 mixing matches sophisticated offline-to-online methods: No need for complex calibration (Cal-QL) or curriculum — just mix offline and online data in the replay buffer. Start here before trying anything fancier.

  16. Continuous fleet RL improvement integrates with existing pipelines: Monthly offline RL updates from fleet data → simulation validation → shadow mode → A/B testing → full rollout. Same cadence as the data flywheel retraining cycle.

  17. RL policy size is negligible compared to perception: The RL policy (0.5-2M params) is <1% of the perception backbone (5-60M params). Compression and distillation focus should be on perception, not the policy.

  18. Reward shaping for airside should encode airport zone physics: Different zones have different safety requirements — apron near aircraft (ultra-conservative), service road (moderate), remote taxiway (can be more efficient). Zone-conditioned reward or separate policies per zone type.

  19. Total cost $45-75K over 32 weeks: Phase 0-4 from BC baseline through production deployment. Incremental — each phase delivers usable artifacts and de-risks the next.

  20. No public airside RL benchmark exists: Building a CARLA airport environment + defining airside RL metrics would be a significant contribution and competitive advantage. The evaluation metrics (Section 10.3) provide the foundation.


13. References

Foundational RL

  • Schulman, J., et al. (2017). "Proximal Policy Optimization Algorithms." arXiv:1707.06347 — PPO
  • Haarnoja, T., et al. (2018). "Soft Actor-Critic: Off-Policy Maximum Entropy Deep Reinforcement Learning with a Stochastic Actor." ICML — SAC
  • Fujimoto, S., et al. (2018). "Addressing Function Approximation Error in Actor-Critic Methods." ICML — TD3
  • Lillicrap, T. P., et al. (2016). "Continuous control with deep reinforcement learning." ICLR — DDPG
  • Espeholt, L., et al. (2018). "IMPALA: Scalable Distributed Deep-RL with Importance Weighted Actor-Learner Architectures." ICML

Offline RL

  • Kumar, A., et al. (2020). "Conservative Q-Learning for Offline Reinforcement Learning." NeurIPS — CQL
  • Kostrikov, I., et al. (2022). "Offline Reinforcement Learning with Implicit Q-Learning." ICLR — IQL
  • An, G., et al. (2021). "Uncertainty-Based Offline Reinforcement Learning with Diversified Q-Ensemble." NeurIPS — EDAC
  • Chen, L., et al. (2021). "Decision Transformer: Reinforcement Learning via Sequence Modeling." NeurIPS
  • Janner, M., et al. (2021). "Offline Reinforcement Learning as One Big Sequence Modeling Problem." NeurIPS — Trajectory Transformer
  • Fujimoto, S., et al. (2019). "Off-Policy Deep Reinforcement Learning without Exploration." ICML — BCQ
  • Yu, T., et al. (2020). "MOPO: Model-based Offline Policy Optimization." NeurIPS
  • Yu, T., et al. (2021). "COMBO: Conservative Offline Model-Based Policy Optimization." NeurIPS

RL for Driving

  • Jaeger, B., et al. (2025). "CaRL: Learning Scalable Planning Policies with Simple Rewards." CoRL — SOTA open-source RL planner
  • Chen, D., et al. (2024). "AD4RL: Autonomous Driving Benchmarks for Offline Reinforcement Learning with Value-based Dataset." arXiv:2404.02429
  • Dauner, D., et al. (2024). "Towards learning-based planning: The nuPlan benchmark for real-world autonomous driving." ICRA
  • "A Comparative Study of Deep Reinforcement Learning Algorithms for Urban Autonomous Driving." Applied Sciences (2025) — TQC, CrossQ for CARLA
  • "Offline Reinforcement Learning using Human-Aligned Reward Labeling for Autonomous Emergency Braking." arXiv:2504.08704 (2025) — IQL for AEB
  • "V-Max: A Reinforcement Learning Framework for Autonomous Driving." RLC (2025)

Safe RL

  • Achiam, J., et al. (2017). "Constrained Policy Optimization." ICML — CPO
  • Thananjeyan, B., et al. (2021). "Recovery RL: Safe Reinforcement Learning with Learned Recovery Zones." RA-L
  • Chow, Y., et al. (2019). "Lyapunov-based Safe Policy Optimization for Continuous Control." ICML
  • Ray, A., et al. (2019). "Benchmarking Safe Exploration in Deep Reinforcement Learning." arXiv — Safety Gym
  • Ji, J., et al. (2024). "OmniSafe: An Infrastructure for Accelerating Safe Reinforcement Learning Research." JMLR

Offline-to-Online

  • Nakamoto, M., et al. (2024). "Cal-QL: Calibrated Offline RL Pre-Training for Efficient Online Fine-Tuning." NeurIPS
  • Ball, P. J., et al. (2023). "Efficient Online Reinforcement Learning with Offline Data." ICML — RLPD
  • Lee, S., et al. (2022). "Offline-to-Online Reinforcement Learning via Balanced Replay and Pessimistic Q-Ensemble." CoRL

Distillation and Deployment

  • Hinton, G., et al. (2015). "Distilling the Knowledge in a Neural Network." NeurIPS Workshop
  • Chen, D., et al. (2020). "Learning by Cheating." CoRL — Privileged→sensor distillation
  • Ross, S., et al. (2011). "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning." AISTATS — DAgger

Additional

  • Kuznetsov, A., et al. (2020). "Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics." ICML — TQC
  • Bhatt, A., et al. (2024). "CrossQ: Batch Normalization in Deep Reinforcement Learning." ICLR

Public research notes collected from public sources.