
Imitation Learning and Behavioral Cloning for Airside Autonomous GSE

Autonomous ground support equipment at airports currently relies on hand-crafted planning systems — the reference airside AV stack's Frenet planner samples 420 trajectory candidates per cycle and scores them with manually tuned cost functions for lane centering, obstacle avoidance, speed compliance, and comfort. These cost functions capture explicit domain knowledge but miss the implicit expertise that human operators demonstrate daily: the nuanced way a tow operator approaches a busy stand, the timing of yielding to crossing pedestrians, the subtle speed adjustments when passing near aircraft engines, and the confidence with which experienced drivers navigate congested aprons. Imitation learning (IL) offers a systematic way to extract this expertise from demonstrations — from teleoperation logs, from manually driven runs during supervised deployment, or from shadow-mode data where the human drove while the autonomous system recorded what it would have done. This document covers the three pillars of imitation learning for autonomous driving: behavioral cloning (BC), which directly maps observations to actions via supervised learning; inverse reinforcement learning (IRL), which recovers the implicit reward function behind expert behavior; and interactive imitation learning (DAgger and variants), which addresses the distribution shift problem that makes naive BC fragile. For each, we examine the mathematical foundations, SOTA methods (2024-2026), practical considerations for deployment on NVIDIA Orin with LiDAR-based perception, and airside-specific adaptations including multi-operator style handling, safety constraint enforcement, and integration with the existing Frenet planner as a safety fallback via the Simplex architecture. The core recommendation is a phased approach: start with BC from teleoperation logs to bootstrap a policy, refine with DAgger in simulation, extract cost functions via IRL for Frenet planner augmentation, and eventually deploy with Simplex safety guarantees.


Table of Contents

  1. Why Imitation Learning for Airside GSE
  2. Behavioral Cloning Fundamentals
  3. Advanced BC: Handling Multimodality
  4. Distribution Shift and the DAgger Framework
  5. Inverse Reinforcement Learning
  6. Generative Adversarial Imitation Learning
  7. Learning from Diverse Operators
  8. Safety-Constrained Imitation Learning
  9. Integration with Existing Planning Stack
  10. Data Collection and Preparation
  11. Orin Deployment and Real-Time Inference
  12. Key Takeaways

1. Why Imitation Learning for Airside GSE

1.1 The Expert Knowledge Gap

The reference airside AV stack's current autonomous pipeline is rule-based: the Frenet planner generates trajectories according to explicit mathematical cost functions. This works well for structured driving (follow lane, avoid obstacles, maintain speed) but struggles with the nuanced interactions that dominate airside operations:

| Scenario | Rule-Based Response | Expert Operator Response |
| --- | --- | --- |
| Approaching busy stand with crossing crew | Stop, wait for clear path | Slow to 2 km/h, creep through gap timed with crew movement |
| Passing aircraft with engines running | Maintain 50 m clearance (hard-coded) | Adjust clearance based on engine type, wind direction, and jet blast feel |
| Convoy following behind lead tractor | Maintain fixed following distance | Adaptively match leader's speed profile, anticipate stops |
| Navigating congested apron intersection | Stop-and-wait at each conflict | Assertive merge with communication via trajectory intent |
| De-icing spray encounter | Reduce speed by fixed percentage | Dramatically slow, shift to different sensor mode, resume quickly |

These behaviors are difficult to encode as explicit rules but easy for experienced operators to demonstrate. Imitation learning bridges this gap.

1.2 Data Sources for Imitation

| Source | Availability | Quality | Volume | Cost |
| --- | --- | --- | --- | --- |
| Teleoperation logs | Available now (Fernride-style teleop) | High (human control, full sensor data) | Low (limited teleop hours) | Low (byproduct of operations) |
| Supervised driving logs | Available during deployment phase | High (human driver, autonomous sensors) | Medium (every supervised shift) | Low (byproduct) |
| Shadow mode data | Available with software modification | Medium (human drove, no autonomous correction) | High (every human-driven shift) | Very low (passive recording) |
| Simulation demonstrations | Unlimited (with sim environment) | Variable (sim-to-real gap) | Unlimited | Medium (sim development cost) |
| Fleet natural driving | Massive (after initial deployment) | Self-referential (learning from self) | Very high | Very low |

1.3 IL vs RL for Airside

| Dimension | Imitation Learning | Reinforcement Learning |
| --- | --- | --- |
| Data requirement | Expert demonstrations | Reward function + environment |
| Safety during training | Safe (learns from safe demos) | Unsafe (explores, may cause damage) |
| Sample efficiency | High (few hundred demos) | Low (millions of episodes) |
| Captures nuance | Yes (implicit in demo behavior) | Only if reward captures it |
| Distribution shift | Yes (major challenge) | No (learns on own distribution) |
| Optimality | Bounded by expert quality | Can exceed expert performance |
| Airside fit | Excellent (can't explore unsafely) | Good after BC bootstrap |

Recommendation: IL first (BC bootstrap), then RL fine-tuning in simulation (see 30-autonomy-stack/planning/reinforcement-learning-driving-policy.md).


2. Behavioral Cloning Fundamentals

2.1 Mathematical Formulation

Behavioral Cloning casts driving as supervised learning:

Given a dataset D = {(o₁, a₁), (o₂, a₂), ..., (oₙ, aₙ)} where:

  • oₜ is the observation at time t (LiDAR BEV, ego state, map features)
  • aₜ is the expert action at time t (steering, speed, or trajectory waypoints)

Learn a policy π_θ(a|o) by minimizing:

L(θ) = E_{(o,a)∼D} [||π_θ(o) - a||²]    (MSE for continuous actions)

Or, for trajectory prediction:

L(θ) = E_{(o,τ)∼D} [Σₜ ||π_θ(o)_t - τ_t||²]    (waypoint MSE)

2.2 BC Architecture for LiDAR-Based Driving

python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BehavioralCloningPolicy(nn.Module):
    """
    BC policy for airside GSE.
    
    Input: LiDAR BEV features + ego state + route info
    Output: Trajectory waypoints (next 3 seconds at 5 Hz = 15 waypoints)
    
    Architecture follows comma.ai / VAD pattern:
    Backbone (BEV features) → Temporal aggregation → Trajectory head
    """
    
    def __init__(self, bev_channels=256, ego_dim=8, route_dim=32, 
                 num_waypoints=15, waypoint_dim=3):
        super().__init__()
        
        # BEV feature encoder (from PointPillars or CenterPoint backbone)
        self.bev_encoder = nn.Sequential(
            nn.Conv2d(bev_channels, 128, 3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.Conv2d(128, 64, 3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((8, 8)),  # Fixed spatial size
            nn.Flatten(),
            nn.Linear(64 * 8 * 8, 256),
            nn.ReLU(),
        )
        
        # Ego state encoder (position, velocity, heading, steering angle, etc.)
        self.ego_encoder = nn.Sequential(
            nn.Linear(ego_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
        )
        
        # Route encoder (next N route waypoints or goal direction)
        self.route_encoder = nn.Sequential(
            nn.Linear(route_dim, 64),
            nn.ReLU(),
            nn.Linear(64, 64),
            nn.ReLU(),
        )
        
        # Temporal aggregation (last 5 observations)
        self.temporal = nn.GRU(
            input_size=256 + 64 + 64,
            hidden_size=256,
            num_layers=2,
            batch_first=True,
        )
        
        # Trajectory prediction head
        self.trajectory_head = nn.Sequential(
            nn.Linear(256, 256),
            nn.ReLU(),
            nn.Linear(256, num_waypoints * waypoint_dim),  # x, y, heading
        )
        
        self.num_waypoints = num_waypoints
        self.waypoint_dim = waypoint_dim
    
    def forward(self, bev_features, ego_state, route_features, hidden=None):
        """
        Args:
            bev_features: [B, T, C, H, W] — BEV from last T frames
            ego_state: [B, T, ego_dim] — ego state history
            route_features: [B, route_dim] — route/goal encoding
        
        Returns:
            trajectory: [B, num_waypoints, waypoint_dim] — predicted waypoints
        """
        B, T = bev_features.shape[:2]
        
        # Encode each timestep
        frame_features = []
        for t in range(T):
            bev_feat = self.bev_encoder(bev_features[:, t])
            ego_feat = self.ego_encoder(ego_state[:, t])
            route_feat = self.route_encoder(route_features)
            combined = torch.cat([bev_feat, ego_feat, route_feat], dim=-1)
            frame_features.append(combined)
        
        frame_seq = torch.stack(frame_features, dim=1)  # [B, T, D]
        
        # Temporal aggregation
        gru_out, hidden = self.temporal(frame_seq, hidden)
        latest = gru_out[:, -1]  # [B, 256]
        
        # Predict trajectory
        traj_flat = self.trajectory_head(latest)
        trajectory = traj_flat.view(B, self.num_waypoints, self.waypoint_dim)
        
        return trajectory, hidden


def train_bc(model, dataloader, optimizer, num_epochs=100):
    """
    Standard BC training loop.
    
    Key considerations for airside:
    - Weight safety-critical scenarios higher (near aircraft, near crew)
    - Use L1 loss for robustness to outlier demonstrations
    - Apply data augmentation (noise injection on ego state)
    """
    loss_fn = nn.SmoothL1Loss(reduction='none')  # per-element loss so samples can be weighted
    
    for epoch in range(num_epochs):
        for batch in dataloader:
            bev, ego, route, expert_traj, weights = batch
            
            pred_traj, _ = model(bev, ego, route)
            
            # Weighted loss: higher weight for safety-critical scenarios
            per_element = loss_fn(pred_traj, expert_traj)          # [B, W, D]
            loss = (weights.view(-1, 1, 1) * per_element).mean()
            
            optimizer.zero_grad()
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
            optimizer.step()

2.3 Action Representations

| Representation | Pros | Cons | Best For |
| --- | --- | --- | --- |
| Steering + speed | Simple, direct control | Compounding errors, hard to evaluate | Simple vehicles |
| Trajectory waypoints | Evaluable, plannable | Needs trajectory tracker | Most AV systems (recommended) |
| Frenet coefficients | Matches existing planner | Planner-specific | Reference airside AV stack integration |
| Cost function weights | Interpretable, composable | Indirect mapping | IRL-based approaches |
| Occupancy predictions | Rich representation | Very indirect | World model approaches |

Recommendation for the reference airside AV stack: trajectory waypoints (3 s horizon, 5 Hz, in ego-centric coordinates) — this decouples learning from control and keeps the output evaluable for safety.
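For concreteness, a minimal sketch of turning a recorded expert path into that 15-waypoint, ego-centric BC label. It assumes logged global poses with timestamps; the function name and array layout are illustrative, not part of the existing stack:

python
import numpy as np

def ego_centric_waypoints(poses, timestamps, t_now, horizon_s=3.0, hz=5.0):
    """Resample a logged pose sequence into [15, 3] ego-centric (x, y, heading) targets.

    poses: [N, 3] array of (x, y, heading) in a global frame
    timestamps: [N] seconds, monotonically increasing, aligned with poses
    t_now: timestamp of the current observation (defines the ego frame)
    """
    x0 = np.interp(t_now, timestamps, poses[:, 0])
    y0 = np.interp(t_now, timestamps, poses[:, 1])
    th0 = np.interp(t_now, timestamps, poses[:, 2])

    # Query times: 0.2 s spacing over a 3 s horizon -> 15 waypoints
    t_query = t_now + np.arange(1, int(horizon_s * hz) + 1) / hz
    xs = np.interp(t_query, timestamps, poses[:, 0])
    ys = np.interp(t_query, timestamps, poses[:, 1])
    ths = np.interp(t_query, timestamps, poses[:, 2])

    # Rotate/translate the future poses into the ego frame at t_now
    c, s = np.cos(th0), np.sin(th0)
    dx, dy = xs - x0, ys - y0
    ego_x = c * dx + s * dy
    ego_y = -s * dx + c * dy
    ego_heading = ths - th0   # heading interpolation ignores wraparound for brevity
    return np.stack([ego_x, ego_y, ego_heading], axis=-1)   # [15, 3]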


3. Advanced BC: Handling Multimodality

3.1 The Multimodality Problem

Standard BC with MSE loss averages over multiple valid actions. When an operator could validly go left OR right around an obstacle, the average is going straight into it. This is the most common failure mode of naive BC.
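A toy illustration with purely hypothetical numbers: if half the demonstrations swerve 1 m left of an obstacle and half swerve 1 m right, the MSE-optimal prediction is their mean, which drives straight at the obstacle:

python
import numpy as np

# Two equally valid expert choices at the same observation: pass left or pass right.
left_traj = np.array([[2.0, +1.0], [4.0, +1.0], [6.0, 0.0]])   # (x, y) waypoints in metres
right_traj = np.array([[2.0, -1.0], [4.0, -1.0], [6.0, 0.0]])

# The minimiser of mean-squared error over both demos is their average...
mse_optimal = 0.5 * (left_traj + right_traj)
print(mse_optimal)   # [[2. 0.] [4. 0.] [6. 0.]] -> heads straight into the obstacle it should pass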

3.2 Solutions

Mixture Density Networks (MDN):

python
class MDNPolicy(nn.Module):
    """
    Mixture Density Network for multimodal behavioral cloning.
    Models output as mixture of K Gaussians.
    """
    def __init__(self, backbone, K=5, traj_dim=45):
        super().__init__()
        self.backbone = backbone
        self.K = K
        self.traj_dim = traj_dim
        
        # Mixture components (backbone is assumed to emit 256-d features)
        self.pi_head = nn.Linear(256, K)                # Mixture weights
        self.mu_head = nn.Linear(256, K * traj_dim)     # Means
        self.sigma_head = nn.Linear(256, K * traj_dim)  # Log std
    
    def forward(self, x):
        features = self.backbone(x)
        
        pi = F.softmax(self.pi_head(features), dim=-1)                        # [B, K]
        mu = self.mu_head(features).view(-1, self.K, self.traj_dim)           # [B, K, traj_dim]
        sigma = torch.exp(self.sigma_head(features)).view(-1, self.K, self.traj_dim)
        
        return pi, mu, sigma
    
    def loss(self, pi, mu, sigma, target):
        """Negative log-likelihood of mixture."""
        target = target.unsqueeze(1).expand_as(mu)  # [B, K, 45]
        
        # Log probability of each component
        log_probs = -0.5 * ((target - mu) / sigma) ** 2 - torch.log(sigma)
        log_probs = log_probs.sum(dim=-1)  # [B, K]
        
        # Log mixture probability
        log_mix = torch.log(pi + 1e-8) + log_probs
        loss = -torch.logsumexp(log_mix, dim=-1).mean()
        
        return loss
    
    def sample(self, x, select='best'):
        """Sample trajectory from learned distribution."""
        pi, mu, sigma = self.forward(x)
        
        if select == 'best':
            # Select mode with highest weight
            best_k = pi.argmax(dim=-1)
            idx = torch.arange(len(best_k), device=best_k.device)
            return mu[idx, best_k]
        elif select == 'sample':
            # Sample a component index, then sample from that Gaussian
            k = torch.multinomial(pi, 1).squeeze(-1)
            idx = torch.arange(len(k), device=k.device)
            eps = torch.randn_like(mu[:, 0])
            return mu[idx, k] + sigma[idx, k] * eps

Diffusion-Based BC (ICLR 2026 trend):

python
class DiffusionBC(nn.Module):
    """
    Diffusion Policy (Chi et al. 2024) applied to driving.
    
    Advantages:
    - Naturally multimodal (no mode collapse)
    - Handles high-dimensional trajectory outputs
    - Can condition on arbitrary observations
    
    Disadvantage:
    - Requires 5-20 denoising steps (50-200ms on Orin)
    - Mitigated by DDIM with 3-5 steps (15-50ms)
    """
    def __init__(self, obs_encoder, traj_dim=45, hidden_dim=256, 
                 num_diffusion_steps=100):
        super().__init__()
        self.obs_encoder = obs_encoder
        self.traj_dim = traj_dim
        self.noise_pred = nn.Sequential(
            nn.Linear(traj_dim + hidden_dim + 1, 512),  # +1 for timestep
            nn.ReLU(),
            nn.Linear(512, 512),
            nn.ReLU(),
            nn.Linear(512, traj_dim),
        )
        self.T = num_diffusion_steps
        
        # Linear DDPM beta schedule; alpha_bar[t] = prod_{s<=t} (1 - beta_s)
        betas = torch.linspace(1e-4, 0.02, num_diffusion_steps)
        self.register_buffer('alpha_bar', torch.cumprod(1.0 - betas, dim=0))
    
    def alpha_schedule(self, t):
        """Cumulative noise schedule alpha_bar at integer timesteps t."""
        return self.alpha_bar[t]
    
    def training_loss(self, obs, expert_traj):
        """Standard DDPM training loss (predict the injected noise)."""
        obs_feat = self.obs_encoder(obs)
        
        # Sample random timestep
        t = torch.randint(0, self.T, (len(obs),), device=obs_feat.device)
        
        # Add noise to expert trajectory
        noise = torch.randn_like(expert_traj)
        alpha_bar = self.alpha_schedule(t)
        noisy_traj = torch.sqrt(alpha_bar).unsqueeze(-1) * expert_traj + \
                     torch.sqrt(1 - alpha_bar).unsqueeze(-1) * noise
        
        # Predict noise
        t_embed = t.float() / self.T
        pred_noise = self.noise_pred(
            torch.cat([noisy_traj, obs_feat, t_embed.unsqueeze(-1)], dim=-1)
        )
        
        return F.mse_loss(pred_noise, noise)
    
    def ddim_step(self, traj, pred_noise, t, step_size):
        """Deterministic DDIM update from timestep t down to t - step_size."""
        alpha_bar_t = self.alpha_schedule(t).unsqueeze(-1)
        t_prev = torch.clamp(t - step_size, min=0)
        alpha_bar_prev = self.alpha_schedule(t_prev).unsqueeze(-1)
        
        # Reconstruct the clean-trajectory estimate, then re-noise to t_prev
        x0_pred = (traj - torch.sqrt(1 - alpha_bar_t) * pred_noise) / torch.sqrt(alpha_bar_t)
        return torch.sqrt(alpha_bar_prev) * x0_pred + \
               torch.sqrt(1 - alpha_bar_prev) * pred_noise
    
    def sample(self, obs, num_steps=5):
        """DDIM sampling for fast inference."""
        obs_feat = self.obs_encoder(obs)
        traj = torch.randn(len(obs), self.traj_dim, device=obs_feat.device)
        
        step_size = self.T // num_steps
        for i in range(num_steps, 0, -1):
            t = torch.full((len(obs),), i * step_size - 1, device=obs_feat.device)
            t_embed = t.float() / self.T
            
            pred_noise = self.noise_pred(
                torch.cat([traj, obs_feat, t_embed.unsqueeze(-1)], dim=-1)
            )
            
            # DDIM update
            traj = self.ddim_step(traj, pred_noise, t, step_size)
        
        return traj

3.3 Comparison of Multimodal BC Methods

| Method | Multimodality | Inference Time (Orin) | Training Stability | Accuracy | Recommendation |
| --- | --- | --- | --- | --- | --- |
| MSE BC | None (averages) | <1 ms | Stable | Good (unimodal) | Baseline only |
| MDN (K=5) | Discrete modes | ~1 ms | Moderate | Good | Short-term use |
| CVAE | Continuous latent | ~2 ms | Moderate | Good | Medium-term |
| Diffusion (5 steps) | Full distribution | ~30 ms | Stable | Best | If budget allows |
| Implicit BC (EBM) | Full support | ~50 ms (optimization) | Hard | Good | Research only |

4. Distribution Shift and the DAgger Framework

4.1 The Core Problem

BC trains on the expert's state distribution but deploys on the policy's own state distribution. Small errors compound: a 1-degree steering error leads to slightly off-center driving, which produces observations the policy never trained on, which produces larger errors, leading to divergence.

Compounding error bound: for a policy with per-step error ε, trajectory error after T steps grows as O(εT²) — quadratic in the horizon, not linear.
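A back-of-the-envelope illustration with hypothetical numbers: a constant 0.1-degree per-step heading bias, never corrected because the resulting off-center views are out of distribution, produces lateral drift that grows roughly quadratically with the number of steps:

python
import numpy as np

dt, v = 0.1, 3.0                  # 10 Hz control, 3 m/s apron speed (illustrative)
eps = np.deg2rad(0.1)             # 0.1-degree per-step heading error, never corrected
steps = np.arange(1, 301)         # 30 seconds of driving

# Heading error after k steps is k*eps; lateral drift integrates it over time.
lateral_drift = np.cumsum(v * dt * np.sin(steps * eps))
print(lateral_drift[99], lateral_drift[299])   # ~2.6 m after 10 s, ~23 m after 30 s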

4.2 DAgger (Dataset Aggregation)

DAgger (Ross, Gordon & Bagnell, 2011) solves distribution shift by iteratively collecting data from the learned policy's distribution but labeling it with expert actions:

python
import numpy as np

class DAggerTrainer:
    """
    DAgger for airside GSE policy training.
    
    In practice: run in simulation with expert labeling.
    Expert = Frenet planner (for initial DAgger) or human teleoperator.
    """
    
    def __init__(self, policy, expert, simulator):
        self.policy = policy
        self.expert = expert  # Frenet planner or human teleop
        self.sim = simulator
        self.dataset = []
        self.beta_schedule = lambda i: max(0.0, 1.0 - i * 0.1)  # Decay expert
    
    def train(self, num_iterations=10, episodes_per_iter=50):
        """
        DAgger training loop.
        
        Iteration 0: Collect data from expert (pure BC dataset)
        Iteration 1+: Mix policy + expert execution, label with expert
        """
        for iteration in range(num_iterations):
            beta = self.beta_schedule(iteration)
            new_data = []
            
            for episode in range(episodes_per_iter):
                obs_list, action_list = [], []
                obs = self.sim.reset()
                
                for step in range(300):  # 30 seconds at 10 Hz
                    # ALWAYS label with the expert (regardless of who executes)
                    expert_action = self.expert.act(obs)
                    
                    # Mix policy and expert execution
                    if np.random.random() < beta:
                        action = expert_action          # Expert executes
                    else:
                        action = self.policy.act(obs)   # Policy executes
                    
                    obs_list.append(obs)
                    action_list.append(expert_action)
                    
                    obs, _, done, _ = self.sim.step(action)
                    if done:
                        break
                
                new_data.extend(zip(obs_list, action_list))
            
            # Aggregate dataset
            self.dataset.extend(new_data)
            
            # Retrain policy on full aggregated dataset
            self.policy.train_on(self.dataset)
            
            # Evaluate
            success_rate = self.evaluate(num_episodes=20)
            print(f"Iter {iteration}: beta={beta:.2f}, "
                  f"dataset_size={len(self.dataset)}, "
                  f"success_rate={success_rate:.3f}")
    
    def evaluate(self, num_episodes=20):
        """Evaluate policy without expert intervention."""
        successes = 0
        for _ in range(num_episodes):
            obs = self.sim.reset()
            for step in range(300):
                action = self.policy.act(obs)
                obs, _, done, info = self.sim.step(action)
                if done:
                    if info.get('success'):
                        successes += 1
                    break
        return successes / num_episodes

4.3 DAgger Variants

| Variant | Key Innovation | Airside Applicability |
| --- | --- | --- |
| DAgger (Ross 2011) | Iterative dataset aggregation | Good baseline, needs expert labels |
| SafeDAgger (Zhang 2016) | Only query expert when policy is uncertain | Reduces expert burden |
| HG-DAgger (Kelly 2019) | Human-gated: human takes over only on errors | Natural for teleop |
| EnsembleDAgger (Menda 2019) | Use ensemble disagreement to trigger queries | Efficient expert time |
| ThriftyDAgger (Hoque 2021) | Query-efficient: learn when to ask for help | Minimal expert annotation |
| LazyDAgger (Hoque 2021) | Ask only when intervention leads to learning | Most efficient |

Recommended for the reference airside AV stack: HG-DAgger in simulation with the Frenet planner as expert. The existing Frenet planner provides unlimited, deterministic expert labels at zero cost. DAgger iterations run in CARLA or Isaac Sim with an airport environment.
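A minimal sketch of the HG-DAgger-style gated rollout, assuming the Frenet planner exposes an `act()`-like interface and a simple deviation check decides when the expert takes over; the function name, `trajectory_deviation` helper, and threshold are illustrative:

python
def hg_dagger_rollout(policy, expert, sim, max_steps=300, takeover_dist_m=1.5):
    """One HG-DAgger-style episode: the policy drives until its output deviates
    too far from the expert's, then the expert takes over; only takeover states
    are added to the dataset with expert labels."""
    dataset = []
    obs = sim.reset()
    for _ in range(max_steps):
        expert_action = expert.act(obs)       # Frenet planner label: cheap and deterministic
        policy_action = policy.act(obs)
        
        # Gate: the expert takes over when the policy deviates beyond a threshold
        # (trajectory_deviation is an illustrative helper, e.g. max waypoint distance)
        if trajectory_deviation(policy_action, expert_action) > takeover_dist_m:
            dataset.append((obs, expert_action))
            action = expert_action
        else:
            action = policy_action
        
        obs, _, done, _ = sim.step(action)
        if done:
            break
    return dataset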


5. Inverse Reinforcement Learning

5.1 Why IRL Instead of BC

BC learns a policy (what action to take). IRL learns a reward function (what makes a good action). The reward function is:

  • Transferable: Same reward works across different vehicles, different planning algorithms
  • Interpretable: Learned weights on cost features explain why the expert behaved a certain way
  • Composable: Combine with safety constraints, efficiency objectives
  • Reusable: Plug learned reward into Frenet planner as improved cost function

5.2 Maximum Entropy IRL

python
import numpy as np

class MaxEntIRL:
    """
    Maximum Entropy IRL (Ziebart 2008) for learning Frenet planner costs.
    
    Learns a reward function R(s,a) = theta^T * phi(s,a) where:
    - phi(s,a) are features (distance to lane center, speed, proximity to obstacles, etc.)
    - theta are learned weights
    
    The learned theta directly augments Frenet planner cost function.
    """
    
    def __init__(self, features, planner, learning_rate=0.01):
        self.features = features  # Feature extractor
        self.planner = planner    # Frenet planner (for forward pass)
        self.lr = learning_rate
        
        # Feature weights (what we're learning)
        self.theta = np.zeros(features.num_features)
    
    def extract_features(self, trajectory, scene):
        """
        Extract features from a trajectory in a scene.
        
        Airside-specific features:
        """
        return np.array([
            trajectory.lane_deviation_avg,          # Lane centering
            trajectory.min_obstacle_distance,       # Obstacle clearance
            trajectory.speed_deviation_from_limit,  # Speed compliance
            trajectory.lateral_acceleration_max,    # Comfort
            trajectory.longitudinal_jerk_max,       # Smoothness
            trajectory.min_aircraft_distance,       # Aircraft clearance (airside)
            trajectory.min_personnel_distance,      # Personnel clearance (airside)
            trajectory.time_to_goal,                # Efficiency
            trajectory.heading_change_total,        # Path smoothness
            trajectory.curvature_max,               # Turning sharpness
            trajectory.deceleration_max,            # Braking aggression
            trajectory.distance_to_jet_blast_zone,  # Jet blast avoidance
            trajectory.stand_approach_angle,        # Docking approach quality
        ])
    
    def compute_expert_feature_expectations(self, demonstrations):
        """Average feature values across expert demonstrations."""
        features_sum = np.zeros(self.features.num_features)
        for demo in demonstrations:
            for traj, scene in demo:
                features_sum += self.extract_features(traj, scene)
        return features_sum / sum(len(d) for d in demonstrations)
    
    def compute_policy_feature_expectations(self, scenes, num_samples=100):
        """
        Expected feature values under current reward-optimal policy.
        Uses Frenet planner with current theta as cost weights.
        """
        features_sum = np.zeros(self.features.num_features)
        count = 0
        
        for scene in scenes:
            # Set Frenet planner cost weights to current theta
            self.planner.set_cost_weights(self.theta)
            
            # Generate optimal trajectory under current reward
            traj = self.planner.plan(scene)
            features_sum += self.extract_features(traj, scene)
            count += 1
        
        return features_sum / count
    
    def train(self, demonstrations, scenes, num_iterations=200):
        """
        Gradient descent on feature matching objective.
        
        Update rule: theta += lr * (expert_features - policy_features)
        
        Intuition: increase reward for features that experts exhibit more
        than the current policy, decrease for features they exhibit less.
        """
        expert_features = self.compute_expert_feature_expectations(demonstrations)
        
        for iteration in range(num_iterations):
            policy_features = self.compute_policy_feature_expectations(scenes)
            
            # Gradient: match expert feature expectations
            gradient = expert_features - policy_features
            self.theta += self.lr * gradient
            
            # Feature matching error
            error = np.linalg.norm(gradient)
            
            if iteration % 20 == 0:
                print(f"Iter {iteration}: feature matching error = {error:.4f}")
                print(f"  Learned weights: {dict(zip(self.features.names, self.theta))}")
            
            if error < 0.01:
                print(f"Converged at iteration {iteration}")
                break
        
        return self.theta

5.3 IRL for Frenet Planner Augmentation

The key insight for the reference airside AV stack: IRL learns cost-function weights that plug directly into the existing Frenet planner. No neural network replacement is needed — the planner's cost function becomes:

cost(τ) = Σᵢ θᵢ * φᵢ(τ)

Where θᵢ are learned from expert demonstrations and φᵢ are the existing Frenet cost features plus new airside-specific features.
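A minimal sketch of that equation in code, reusing the `extract_features` interface from the MaxEntIRL sketch above; the `frenet_planner` and `irl` objects and the candidate-ranking usage are illustrative, not the production API:

python
import numpy as np

def learned_cost(trajectory, scene, theta, feature_extractor):
    """cost(tau) = sum_i theta_i * phi_i(tau), using the same feature vector
    as the MaxEntIRL training loop above."""
    phi = feature_extractor.extract_features(trajectory, scene)   # [num_features]
    return float(np.dot(theta, phi))

# Illustrative usage: rank the planner's candidates with the learned weights
candidates = frenet_planner.generate_candidates(420)
best = min(candidates, key=lambda traj: learned_cost(traj, scene, irl.theta, irl))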

| Feature | Hand-Tuned Weight | IRL-Learned Weight | Interpretation |
| --- | --- | --- | --- |
| Lane deviation | 10.0 | 7.3 | Experts care less about perfect centering |
| Obstacle distance | 15.0 | 22.1 | Experts are more cautious than hand-tuned |
| Speed compliance | 8.0 | 5.2 | Experts drive slightly faster when safe |
| Lateral accel | 5.0 | 8.7 | Experts prioritize passenger comfort more |
| Aircraft clearance | 20.0 | 31.4 | Experts maintain much more aircraft margin |
| Personnel clearance | 25.0 | 45.6 | Experts are extremely cautious near people |

6. Generative Adversarial Imitation Learning

6.1 GAIL Overview

GAIL (Ho & Ermon, 2016) trains a policy to generate behavior indistinguishable from expert demonstrations, using a GAN-style adversarial framework:

python
class GAIL:
    """
    GAIL for driving policy learning.
    
    Components:
    - Generator (policy): tries to produce expert-like trajectories
    - Discriminator: tries to distinguish expert from policy trajectories
    
    Advantage over BC: doesn't need state-action pairs, just trajectories.
    Advantage over IRL: doesn't require feature engineering.
    Disadvantage: requires interactive environment (simulation).
    
    Note: collect_rollouts, sample_expert and ppo_update are assumed to be
    implemented elsewhere (rollout collection, expert batch sampling, PPO step).
    """
    
    def __init__(self, policy_net, discriminator_net, env, 
                 expert_trajectories):
        self.policy = policy_net
        self.discriminator = discriminator_net
        self.env = env
        self.expert_data = expert_trajectories
        
        self.policy_optimizer = torch.optim.Adam(
            self.policy.parameters(), lr=3e-4)
        self.disc_optimizer = torch.optim.Adam(
            self.discriminator.parameters(), lr=3e-4)
    
    def train_step(self):
        """One GAIL training iteration."""
        # 1. Collect policy rollouts
        policy_states, policy_actions = self.collect_rollouts(
            self.policy, self.env, num_episodes=32)
        
        # 2. Sample expert data
        expert_states, expert_actions = self.sample_expert(batch_size=len(policy_states))
        
        # 3. Update discriminator
        expert_logits = self.discriminator(expert_states, expert_actions)
        policy_logits = self.discriminator(policy_states, policy_actions)
        
        disc_loss = (
            F.binary_cross_entropy_with_logits(expert_logits, torch.ones_like(expert_logits)) +
            F.binary_cross_entropy_with_logits(policy_logits, torch.zeros_like(policy_logits))
        )
        
        self.disc_optimizer.zero_grad()
        disc_loss.backward()
        self.disc_optimizer.step()
        
        # 4. Update policy with PPO using discriminator as reward
        rewards = -torch.log(1 - torch.sigmoid(
            self.discriminator(policy_states, policy_actions).detach()))
        
        self.ppo_update(policy_states, policy_actions, rewards)

6.2 When to Use Each IL Method

| Method | Data Needed | Environment Needed | Output | Best For |
| --- | --- | --- | --- | --- |
| BC | State-action pairs | No | Policy | Quick baseline, offline data |
| DAgger | Expert labeler | Sim or real | Policy | Robust policy with minimal expert |
| MaxEnt IRL | State-action pairs | Forward planner | Reward function | Frenet cost augmentation |
| GAIL | Trajectories only | Simulator | Policy | Complex behavior, no feature engineering |
| Preference Learning | Trajectory rankings | No | Reward function | When experts can rank but not demonstrate |

7. Learning from Diverse Operators

7.1 The Multi-Operator Problem

Different teleoperators and safety drivers have different styles — some are aggressive, some conservative, some take wider turns, some cut corners. Naive IL averages these styles, producing a mediocre policy.

7.2 Style-Conditioned BC

python
class StyleConditionedBC(nn.Module):
    """
    Learn different driving styles from labeled operators.
    
    Condition the policy on an operator style embedding.
    At deployment, select the style closest to desired behavior.
    """
    
    def __init__(self, backbone, num_styles=5, style_dim=16):
        super().__init__()
        self.backbone = backbone
        
        # Style embedding table
        self.style_embeddings = nn.Embedding(num_styles, style_dim)
        
        # Style-conditioned trajectory head
        self.head = nn.Sequential(
            nn.Linear(256 + style_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 45),  # 15 waypoints × 3 (x, y, heading)
        )
    
    def forward(self, obs, style_id):
        features = self.backbone(obs)
        style = self.style_embeddings(style_id)
        combined = torch.cat([features, style], dim=-1)
        return self.head(combined).view(-1, 15, 3)
    
    def deploy(self, obs, preferred_style='conservative'):
        """At deployment, use the most conservative/safe style."""
        style_map = {
            'conservative': 0,
            'moderate': 1,
            'efficient': 2,
            'aggressive': 3,  # For time-critical operations
            'docking': 4,     # Ultra-precise, very slow
        }
        style_id = torch.tensor([style_map[preferred_style]])
        return self.forward(obs, style_id)

7.3 Operator Quality Weighting

Not all demonstrations are equally valuable:

python
class QualityWeightedBC:
    """
    Weight demonstrations by quality metrics.
    
    Better operators get higher weight in training.
    """
    
    def compute_demo_quality(self, demonstration):
        """Score a demonstration based on safety and efficiency."""
        scores = {
            'safety': self.safety_score(demonstration),      # No close calls
            'smoothness': self.smoothness_score(demonstration), # Low jerk
            'efficiency': self.efficiency_score(demonstration), # Task completed quickly
            'compliance': self.compliance_score(demonstration), # Followed rules
        }
        
        # Weighted combination (safety-weighted for airside)
        quality = (
            0.4 * scores['safety'] +
            0.3 * scores['smoothness'] +
            0.2 * scores['efficiency'] +
            0.1 * scores['compliance']
        )
        
        return quality
    
    def filter_demonstrations(self, all_demos, min_quality=0.6):
        """Remove low-quality demonstrations before training."""
        filtered = []
        for demo in all_demos:
            quality = self.compute_demo_quality(demo)
            if quality >= min_quality:
                filtered.append((demo, quality))
        
        # Normalize weights
        total = sum(q for _, q in filtered)
        return [(demo, q / total) for demo, q in filtered]

8. Safety-Constrained Imitation Learning

8.1 The Safety Problem

Expert demonstrations may occasionally contain unsafe behaviors (near-misses, aggressive maneuvers, rule violations). The learned policy must not reproduce these.

8.2 Constrained BC

python
class SafeBC:
    """
    Behavioral cloning with safety constraints.
    
    Hard constraints: filter demonstrations and post-process predictions
    Soft constraints: add safety penalty to loss function
    """
    
    def __init__(self, policy, safety_checker):
        self.policy = policy
        self.safety = safety_checker  # CBF or rule-based
    
    def filter_unsafe_demos(self, demonstrations):
        """Remove demonstration segments that violate safety constraints."""
        safe_demos = []
        for demo in demonstrations:
            safe_segments = []
            for obs, action in demo:
                if self.safety.is_safe(obs, action):
                    safe_segments.append((obs, action))
                else:
                    # Log filtered segment for analysis
                    self.log_filtered(obs, action, reason=self.safety.violation_reason)
            
            if len(safe_segments) > 10:  # Minimum segment length
                safe_demos.append(safe_segments)
        
        return safe_demos
    
    def safe_training_loss(self, pred_traj, expert_traj, obs):
        """Loss with safety penalty."""
        # Standard imitation loss
        imitation_loss = F.smooth_l1_loss(pred_traj, expert_traj)
        
        # Safety constraint violation penalty
        safety_cost = 0
        for t in range(pred_traj.shape[1]):
            waypoint = pred_traj[:, t]
            
            # Aircraft clearance
            aircraft_dist = self.safety.min_aircraft_distance(obs, waypoint)
            safety_cost += F.relu(3.0 - aircraft_dist)  # 3m minimum
            
            # Personnel clearance
            person_dist = self.safety.min_personnel_distance(obs, waypoint)
            safety_cost += F.relu(2.0 - person_dist) * 5  # 2m minimum, high weight
            
            # Speed limit
            speed = self.safety.compute_speed(pred_traj[:, max(0, t-1):t+1])
            speed_limit = self.safety.get_speed_limit(obs, waypoint)
            safety_cost += F.relu(speed - speed_limit) * 2
        
        total_loss = imitation_loss + 0.1 * safety_cost.mean()  # average the per-sample penalty
        return total_loss
    
    def safe_inference(self, obs):
        """Post-process prediction through safety filter."""
        # Get raw prediction
        pred_traj = self.policy(obs)
        
        # CBF safety filter (from safety-critical-planning-cbf.md)
        safe_traj = self.safety.cbf_filter(pred_traj, obs)
        
        return safe_traj

8.3 Simplex Integration

The learned BC policy serves as the Advanced Controller (AC) in the Simplex architecture, with the existing Frenet planner as the Baseline Controller (BC — here in the Simplex sense, not behavioral cloning):

                     ┌──────────────────────┐
Observations ────────┤  Decision Module     │
                     │  (monitor safety)    │──── Control Output
                     └────┬────────┬────────┘
                          │        │
                    ┌─────┴───┐ ┌──┴──────────┐
                    │ Learned │ │ Frenet       │
                    │ BC/IL   │ │ Planner      │
                    │ Policy  │ │ (fallback)   │
                    │ (AC)    │ │ (BC)         │
                    └─────────┘ └──────────────┘

Switch to the Frenet planner when any of the following holds (a minimal decision-module sketch follows the list):

  • Learned policy output violates CBF constraints
  • Policy uncertainty (ensemble disagreement) exceeds threshold
  • Novel ODD condition detected
  • Emergency situation
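A minimal sketch of that decision module, assuming the CBF checker, uncertainty estimate, and ODD monitor exist as callables; all names and the threshold below are illustrative rather than the production interfaces:

python
def select_controller(obs, learned_policy, frenet_planner, cbf, odd_monitor,
                      uncertainty_threshold=0.3):
    """Simplex-style switch: prefer the learned policy, fall back to the
    Frenet planner whenever any safety condition fails."""
    candidate = learned_policy.act(obs)
    
    fallback_reasons = []
    if not cbf.is_safe(obs, candidate):
        fallback_reasons.append('cbf_violation')
    if learned_policy.ensemble_disagreement(obs) > uncertainty_threshold:
        fallback_reasons.append('high_uncertainty')
    if not odd_monitor.in_odd(obs):
        fallback_reasons.append('out_of_odd')
    if odd_monitor.emergency(obs):
        fallback_reasons.append('emergency')
    
    if fallback_reasons:
        return frenet_planner.plan(obs), fallback_reasons   # Baseline Controller
    return candidate, []                                    # Advanced Controller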

9. Integration with Existing Planning Stack

9.1 Three Integration Modes

Mode 1: Policy as Trajectory Generator (replace Frenet for normal ops)

python
# Learned policy generates trajectories, Frenet is fallback
if policy_confidence > THRESHOLD and cbf_safe(policy_trajectory):
    execute(policy_trajectory)
else:
    execute(frenet_planner.plan())

Mode 2: IRL Costs for Frenet Planner (augment, don't replace)

python
# IRL-learned weights improve existing Frenet planner
frenet_planner.update_cost_weights(irl_learned_theta)
trajectory = frenet_planner.plan()  # Same planner, better costs

Mode 3: Policy as Trajectory Scorer (score Frenet candidates)

python
# Frenet generates 420 candidates, learned model scores them
candidates = frenet_planner.generate_candidates(420)
for traj in candidates:
    traj.learned_score = learned_scorer.score(traj, observation)
best = max(candidates, key=lambda t: 0.5 * t.frenet_score + 0.5 * t.learned_score)

Recommendation: Start with Mode 2 (IRL + Frenet), then Mode 3 (scoring), eventually Mode 1 (full policy with Simplex).


10. Data Collection and Preparation

10.1 Demonstration Collection Protocol

| Phase | Duration | Data Source | Expected Volume | Purpose |
| --- | --- | --- | --- | --- |
| Phase 0 | Ongoing | Teleop logs (existing) | 10-50 hours | Bootstrap BC dataset |
| Phase 1 | 2 weeks | Supervised driving (dedicated collection) | 50-100 hours | High-quality labeled demos |
| Phase 2 | Ongoing | Shadow mode (automatic) | 100+ hours/month | Scale up dataset |
| Phase 3 | Ongoing | Fleet natural driving | 1000+ hours/month | Continuous improvement |

10.2 Data Requirements

| Method | Minimum Data | Recommended Data | Quality Requirement |
| --- | --- | --- | --- |
| BC (baseline) | 5-10 hours | 50-100 hours | Filtered, quality-weighted |
| DAgger | 2-5 hours + 50 sim iterations | 10-20 hours + 200 iterations | Expert available for labeling |
| MaxEnt IRL | 10-20 hours | 50-100 hours | Diverse scenarios |
| GAIL | 20-50 hours | 100+ hours | Trajectory-level only |
| Diffusion BC | 20-50 hours | 100+ hours | High diversity |
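As a concrete target for the preparation step, a minimal sketch of the per-frame training record that the collection phases above would feed into BC training; field names and shapes are illustrative, chosen to match the BC policy inputs in Section 2.2, and `label_from_log` reuses the ego-centric waypoint helper sketched in Section 2.3:

python
from dataclasses import dataclass
import numpy as np

@dataclass
class DemoFrame:
    """One training sample extracted from a teleop or supervised-driving log."""
    timestamp: float
    bev_features: np.ndarray       # [C, H, W] LiDAR BEV from the perception stack
    ego_state: np.ndarray          # [8] position, velocity, heading, steering, ...
    route_features: np.ndarray     # [32] goal / route encoding
    expert_trajectory: np.ndarray  # [15, 3] future ego-centric waypoints (the label)
    operator_id: str               # for style conditioning and quality weighting
    scenario_tags: tuple           # e.g. ('near_aircraft', 'stand_approach')
    sample_weight: float = 1.0     # raised for safety-critical scenarios

def label_from_log(poses, timestamps, idx):
    """Build the [15, 3] waypoint label for log index idx by looking 3 s ahead."""
    return ego_centric_waypoints(poses, timestamps, timestamps[idx])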

11. Orin Deployment and Real-Time Inference

11.1 Computational Budgets

| Model | FP32 Orin | FP16 Orin | INT8 Orin | Meets 50 ms? |
| --- | --- | --- | --- | --- |
| BC (MLP head) | 0.5 ms | 0.3 ms | 0.2 ms | Yes |
| BC + GRU temporal | 2 ms | 1 ms | 0.8 ms | Yes |
| MDN (K=5) | 1 ms | 0.5 ms | 0.4 ms | Yes |
| Diffusion (5 steps) | 50 ms | 25 ms | 15 ms | Marginal |
| Diffusion (3 steps) | 30 ms | 15 ms | 10 ms | Yes |
| GAIL policy | 1 ms | 0.5 ms | 0.4 ms | Yes |

Note: These are policy inference times only. Total pipeline = perception + policy + safety check. Budget for policy: <5ms.
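A quick way to sanity-check the policy's share of that budget before TensorRT conversion is a warm-started PyTorch timing loop on the target CUDA device; a rough sketch, and engine latency after conversion will typically be lower:

python
import time
import torch

@torch.no_grad()
def measure_policy_latency(model, dummy_inputs, device='cuda', iters=200, warmup=20):
    """Median per-call latency in milliseconds for the policy forward pass."""
    model = model.to(device).eval()
    inputs = tuple(x.to(device) for x in dummy_inputs)
    
    for _ in range(warmup):              # warm up kernels and the allocator
        model(*inputs)
    torch.cuda.synchronize()
    
    times = []
    for _ in range(iters):
        start = time.perf_counter()
        model(*inputs)
        torch.cuda.synchronize()         # wait for GPU work before stopping the clock
        times.append((time.perf_counter() - start) * 1000.0)
    return sorted(times)[len(times) // 2]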

11.2 TensorRT Deployment

python
# Convert trained policy to TensorRT for Orin deployment
import torch
import torch.onnx

# Export to ONNX (the model's forward returns (trajectory, hidden), so name both outputs)
model.eval()
dummy_input = (
    torch.randn(1, 5, 256, 128, 128),   # BEV features (5 frames)
    torch.randn(1, 5, 8),               # Ego state
    torch.randn(1, 32),                 # Route features
)
torch.onnx.export(model, dummy_input, 'bc_policy.onnx',
                  input_names=['bev', 'ego', 'route'],
                  output_names=['trajectory', 'hidden'],
                  dynamic_axes={'bev': {0: 'batch'}})

# Convert to TensorRT with FP16
# trtexec --onnx=bc_policy.onnx --fp16 --saveEngine=bc_policy.engine
# Expected: <2ms inference on Orin AGX

12. Key Takeaways

  1. Imitation learning bridges the rule-to-expertise gap. Hand-tuned Frenet costs capture explicit knowledge; IL captures implicit operator expertise — approach angles, timing of yielding, comfort preferences — that is hard to formalize as rules.

  2. Start with IRL for Frenet augmentation, not policy replacement. MaxEnt IRL learns cost function weights that plug directly into the existing Frenet planner. No new controller needed. Immediate improvement with zero safety risk.

  3. BC from teleoperation logs is free data. Every teleoperation session and supervised driving shift produces demonstration data. A 50-hour dataset is achievable within 2 weeks of dedicated collection.

  4. Distribution shift is the critical BC failure mode. Naive BC diverges within seconds in novel states. DAgger with the Frenet planner as expert solves this at zero labeling cost (planner generates labels automatically in simulation).

  5. Multimodal BC is essential for airside. Multiple valid paths around obstacles, different approach strategies to stands, alternative yielding behaviors. MDN or Diffusion BC avoids the averaging problem.

  6. Diffusion BC is SOTA but expensive on Orin. 3-step DDIM runs in roughly 10-15ms (INT8/FP16). Fits within budget only if the total pipeline is carefully managed. MDN (K=5) at <1ms is more practical for near-term deployment.

  7. GAIL doesn't need state-action pairs. Only trajectories. Useful when you have GPS tracks from human-driven GSE but no synchronized sensor data. Requires simulation environment.

  8. IRL-learned features reveal expert priorities. Personnel clearance weights come out roughly 2x higher than hand-tuned; aircraft clearance about 1.5x higher. Experts are more cautious than engineers expect.

  9. Quality-weighted demonstration filtering is critical. Not all operators are equally skilled. Weight demonstrations by safety score, smoothness, and efficiency. Filter out the bottom 20% of demonstrations.

  10. Style-conditioned BC handles multi-operator diversity. Learn K=5 driving styles, select the most conservative for deployment. Enables per-scenario style selection (aggressive for time-critical pushback, conservative for general transit).

  11. Simplex provides the safety net for learned policies. Learned policy as Advanced Controller, Frenet planner as Baseline Controller. CBF filter as intermediate check. Three layers of safety for certification.

  12. Safety constraints in training prevent learning unsafe expert behaviors. Filter unsafe demonstration segments, add safety penalty to loss, post-process with CBF filter. Learned policy should be safer than any individual expert.

  13. BC policy inference is negligible on Orin. MLP head <0.5ms, GRU temporal <1ms. The bottleneck is perception (15-30ms), not the policy. Even MDN and GAIL policies fit easily.

  14. DAgger with Frenet expert is the most efficient training protocol. Unlimited expert labels (planner is deterministic), simulation provides unlimited scenarios. 200 DAgger iterations in 2-3 days of compute.

  15. Three integration modes in order of risk: (1) IRL costs for Frenet (safe, immediate), (2) Learned scoring of Frenet candidates (moderate risk, 2-4 weeks), (3) Full policy with Simplex fallback (highest reward, 8-12 weeks to validate).

  16. 50-100 hours of demonstrations bootstraps a useful policy. Road→airside transfer with LoRA fine-tuning reduces this to 10-20 hours of airside-specific data.

  17. Implementation cost: $35-55K over 10-14 weeks. Phase 1 (IRL + Frenet augmentation, 3-4 weeks, $10-15K), Phase 2 (BC + DAgger in sim, 4-5 weeks, $15-20K), Phase 3 (Deployment + Simplex, 3-5 weeks, $10-20K).


Cost and Implementation Roadmap

| Phase | Scope | Duration | Cost | Deliverable |
| --- | --- | --- | --- | --- |
| Phase 1 | MaxEnt IRL from teleop logs + Frenet cost augmentation | 3-4 weeks | $10-15K | Improved Frenet planner with learned costs |
| Phase 2 | BC policy training + DAgger with Frenet expert in sim | 4-5 weeks | $15-20K | Standalone BC policy, evaluated in simulation |
| Phase 3 | Safety filtering + Simplex integration + shadow mode validation | 3-5 weeks | $10-20K | Production-ready IL pipeline |
| Total | End-to-end imitation learning system | 10-14 weeks | $35-55K | Learned driving from expert demonstrations |

References

Internal Repository

  • 30-autonomy-stack/planning/reinforcement-learning-driving-policy.md — BC→offline RL→online RL pipeline, RL post-IL
  • 30-autonomy-stack/planning/frenet-planner-augmentation.md — Frenet planner cost function structure
  • 30-autonomy-stack/planning/safety-critical-planning-cbf.md — CBF safety filter for post-processing
  • 60-safety-validation/runtime-assurance/simplex-safety-architecture.md — Simplex AC/BC architecture
  • 40-runtime-systems/monitoring-observability/teleoperation-systems.md — Teleop data sources
  • 50-cloud-fleet/mlops/data-flywheel-airside.md — Data collection and labeling pipeline
  • 30-autonomy-stack/planning/neural-motion-planning.md — Learned planning approaches

External

  • Ross, S., Gordon, G., & Bagnell, D. (2011). "A Reduction of Imitation Learning and Structured Prediction to No-Regret Online Learning." AISTATS.
  • Ziebart, B.D. et al. (2008). "Maximum Entropy Inverse Reinforcement Learning." AAAI.
  • Ho, J. & Ermon, S. (2016). "Generative Adversarial Imitation Learning." NeurIPS.
  • Chi, C. et al. (2024). "Diffusion Policy: Visuomotor Policy Learning via Action Diffusion." RSS.
  • Bishop, C.M. (1994). "Mixture Density Networks." Technical Report.
  • Hoque, R. et al. (2021). "ThriftyDAgger: Budget-Aware Novelty and Risk Gating for Interactive Imitation Learning." CoRL.
  • "Beyond Behavior Cloning in Autonomous Driving: A Survey." arXiv (2025).
  • "Behavioral Cloning Models Reality Check for Autonomous Driving." arXiv (2024).

Public research notes collected from public sources.