Multi-Task and Unified Perception Architectures for Autonomous Driving
Executive Summary
Modern AV perception is converging toward unified architectures that perform detection, segmentation, tracking, prediction, and even planning in a single model with shared representations. This document surveys the evolution from task-specific models to end-to-end multi-task systems: UniAD (first unified framework, CVPR 2023 Best Paper), SparseDrive (fully sparse, 9.0 FPS on A100), VAD (vectorized scene representation), StreamPETR (streaming temporal fusion), Far3D (long-range detection), and the emerging paradigm of foundation-model-based unified perception. For the reference airside AV stack, the key question is whether to replace the current separate PointPillars detector + GTSAM localizer + Frenet planner with a unified model, or to incrementally add multi-task heads to the existing backbone. Key finding: a shared-backbone, multi-head approach preserves the proven PointPillars detection (6.84ms) while adding segmentation, free-space estimation, and motion prediction at marginal cost (~8ms additional, ~15ms total), and avoids the risks of monolithic end-to-end systems that are harder to certify. Full end-to-end models (UniAD, SparseDrive) achieve state-of-the-art on nuScenes but require 50-100ms on A100 — too slow for Orin real-time without aggressive optimization.
Table of Contents
- The Case for Unified Perception
- UniAD: The First Unified Framework
- SparseDrive: Fully Sparse End-to-End
- VAD: Vectorized Scene Representation
- StreamPETR and Temporal Fusion
- Task Interference and Loss Balancing
- Shared-Backbone Multi-Head Architecture
- Foundation Models as Unified Backbones
- Deployment Considerations for Orin
- Airside-Specific Multi-Task Design
- Comparison and Recommendations
- Key Takeaways
1. The Case for Unified Perception
1.1 The Problem with Separate Models
The reference airside AV stack currently runs perception as independent modules:
LiDAR → PointPillars → 3D Detections
LiDAR → RANSAC → Ground Segmentation
LiDAR → GTSAM → Localization
Camera → (none currently)
Problems:
- No information sharing: Detection doesn't benefit from segmentation context, and vice versa
- Redundant computation: Each module independently processes the same point cloud
- Integration errors: Separate module outputs must be fused in planning, introducing latency and inconsistency
- No joint optimization: Modules can't learn complementary representations
1.2 The Multi-Task Advantage
| Approach | Compute | Consistency | Optimization | Certification |
|---|---|---|---|---|
| Separate models | N × backbone | Inconsistent (separate outputs) | Per-task optimal | Easier (modular) |
| Multi-task shared backbone | 1 × backbone + N heads | Consistent (shared features) | Joint optimization | Medium |
| Fully unified (UniAD) | 1 × model | Fully consistent | End-to-end optimal | Harder (monolithic) |
The multi-task approach saves compute (shared backbone amortized across tasks), improves consistency (all tasks see the same features), and can improve accuracy through positive transfer (related tasks help each other learn).
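As a back-of-the-envelope illustration using the Orin timing estimates from Section 7.3 (shared backbone ~5ms, lightweight heads ~1.5-4ms, standalone models 5-15ms each), the amortization works out as follows; this is latency arithmetic only, not a benchmark:
# Latency arithmetic using this document's Orin estimates (Section 7.3);
# illustrative only, not measured results.
standalone_ms = {"det": 6.8, "seg": 8.0, "freespace": 5.0, "pred": 15.0}  # 4 separate models
backbone_ms = 5.0                                                          # shared PillarVFE + FPN
head_ms = {"det": 1.8, "seg": 2.5, "freespace": 1.5, "pred": 4.0}          # heads on shared features

separate = sum(standalone_ms.values())          # ~34 ms: four independent models
shared = backbone_ms + sum(head_ms.values())    # ~14.8 ms: one backbone + four heads
print(f"{separate:.1f} ms -> {shared:.1f} ms "
      f"({100 * (1 - shared / separate):.0f}% saved)")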
1.3 When Multi-Task Helps vs Hurts
| Task Pair | Transfer | Mechanism | Evidence |
|---|---|---|---|
| Detection + Segmentation | Positive | Shared object boundary features | +1-2% mAP, +2-3% mIoU |
| Detection + Tracking | Positive | Temporal features aid both | +1-3% MOTA |
| Detection + Depth | Positive | Depth regularizes 3D detection | +2-4% mAP |
| Detection + Planning | Mixed | Planning needs different spatial resolution | 0 to +1% L2 |
| Segmentation + Mapping | Positive | Complementary scene understanding | +1-2% mAP on map elements |
| Tracking + Prediction | Strong positive | Prediction directly extends tracks | 5-10% lower minADE |
| Detection + Text/Language | Neutral | Different representation spaces | Minimal transfer |
2. UniAD: The First Unified Framework
2.1 Architecture (Hu et al., CVPR 2023 Best Paper)
UniAD unifies detection, tracking, mapping, motion prediction, occupancy prediction, and planning in a single model.
Pipeline:
Multi-view Images (6 cameras)
↓
Image Backbone (ResNet-101)
↓
BEV Features (BEVFormer)
↓
┌────────────────────────────────────────┐
│ Stage 1: TrackFormer (Detection+Track) │
│ → Object queries with temporal memory │
└─────────────┬──────────────────────────┘
↓
┌────────────────────────────────────────┐
│ Stage 2: MapFormer (Online Mapping) │
│ → Lane/boundary queries │
└─────────────┬──────────────────────────┘
↓
┌────────────────────────────────────────┐
│ Stage 3: MotionFormer (Prediction) │
│ → Multi-modal future trajectories │
│ → Conditioned on map + detections │
└─────────────┬──────────────────────────┘
↓
┌────────────────────────────────────────┐
│ Stage 4: OccFormer (Occupancy Pred.) │
│ → Future occupancy grids │
└─────────────┬──────────────────────────┘
↓
┌────────────────────────────────────────┐
│ Stage 5: Planner │
│ → Ego future trajectory │
│ → Conditioned on all above │
└────────────────────────────────────────┘
2.2 Key Design Choices
Query-based communication: Each stage produces queries that are passed to subsequent stages. This creates an information bottleneck that forces each stage to produce useful abstractions.
Cascaded training: Stages are trained sequentially initially (stage 1 first, then 2, etc.) and then jointly fine-tuned end-to-end. This avoids the instability of training all stages simultaneously from scratch.
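A minimal sketch of such a staged schedule, assuming the model exposes an `active_heads` argument and returns a dict of per-head losses (both are illustrative interfaces, not UniAD's published training code):
# Sketch of cascaded training: warm up the stages one at a time in pipeline
# order, then fine-tune all of them jointly end-to-end. Stage names, epoch
# counts, and the model interface are illustrative assumptions.
STAGES = ["track", "map", "motion", "occupancy", "planning"]

def train_cascaded(model, dataloader, optimizer, warmup_epochs=2, joint_epochs=20):
    def run_epoch(active):
        for batch in dataloader:
            losses = model(batch, active_heads=active)   # dict: head name -> loss
            sum(losses.values()).backward()
            optimizer.step()
            optimizer.zero_grad()

    # Phase 1: enable heads cumulatively so earlier stages stabilize first
    for i in range(len(STAGES)):
        for _ in range(warmup_epochs):
            run_epoch(STAGES[: i + 1])

    # Phase 2: joint end-to-end fine-tuning with every head active
    for _ in range(joint_epochs):
        run_epoch(STAGES)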
Loss weighting:
loss_weights = {
'detection': 2.0, # primary task
'tracking': 1.0, # secondary
'mapping': 0.5, # auxiliary
'prediction': 1.0, # important for planning
'occupancy': 0.35, # regularizer
'planning': 1.0, # final objective
}
2.3 Results on nuScenes
| Task | Metric | UniAD | Previous SOTA | Improvement |
|---|---|---|---|---|
| Detection | NDS | 49.8 | 51.6 (BEVFormer) | -1.8 (trade-off) |
| Tracking | AMOTA | 39.0 | 37.2 | +1.8 |
| Mapping | mAP | 35.0 | 33.2 | +1.8 |
| Prediction | minADE | 0.70 | 0.84 | -0.14 (17%) |
| Occ. Prediction | IoU-f | 63.4 | 57.8 | +5.6 |
| Planning | Avg L2 | 0.71 | 1.02 | -0.31 (30%) |
Key insight: UniAD's detection score is slightly below single-task SOTA, but planning (the final objective) improves dramatically because of the unified representation.
2.4 Limitations
- Latency: ~100ms on A100 — too slow for Orin real-time
- Camera-only: No LiDAR integration path
- Serial cascade: Each stage depends on previous — hard to parallelize
- Certification: Monolithic model is hard to decompose for safety analysis
3. SparseDrive: Fully Sparse End-to-End
3.1 Architecture (Sun et al., 2024)
SparseDrive addresses UniAD's density bottleneck by using fully sparse representations throughout. No dense BEV map — everything is represented as sparse queries.
Key Innovations:
- Symmetric sparse perception: Single set of queries handles detection + tracking + mapping + prediction
- Parallel task heads: Unlike UniAD's serial cascade, tasks run in parallel on shared queries
- Instance-level memory: Temporal information via per-instance memory bank, not dense BEV recurrence
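A minimal sketch of the per-instance memory idea, assuming a top-k confidence-based selection of queries to carry forward (class and field names are illustrative, not SparseDrive's actual implementation):
import torch

# Sketch: keep the top-k confident instance queries from the previous frame and
# concatenate them with fresh queries, instead of recurring over a dense BEV map.
class InstanceMemoryBank:
    def __init__(self, top_k=300):
        self.top_k = top_k
        self.queries = None   # (K, C) propagated instance features
        self.anchors = None   # (K, D) propagated box/anchor parameters

    def update(self, queries, anchors, scores):
        keep = scores.topk(min(self.top_k, scores.numel())).indices
        self.queries = queries[keep].detach()
        self.anchors = anchors[keep].detach()

    def propagate(self, new_queries, new_anchors):
        if self.queries is None:
            return new_queries, new_anchors
        return (torch.cat([self.queries, new_queries], dim=0),
                torch.cat([self.anchors, new_anchors], dim=0))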
Multi-view Images → Image Features
↓
Sparse Instance Queries (N=900)
↓
Deformable Cross-Attention
(query → image features)
↓
┌──────────┬──────────┬──────────┬───────────┐
│ Det Head │ Map Head │ Pred Head│ Plan Head │
│ (parallel)│ (parallel)│(parallel)│(parallel) │
└──────────┴──────────┴──────────┴───────────┘
3.2 Results
| Metric | UniAD | VAD-Base | SparseDrive | Improvement vs UniAD |
|---|---|---|---|---|
| NDS | 49.8 | 42.7 | 52.5 | +2.7 |
| AMOTA | 39.0 | — | 41.2 | +2.2 |
| Avg L2 (plan) | 0.71 | 0.54 | 0.48 | -32% |
| Collision Rate | 0.31 | 0.07 | 0.05 | -84% |
| FPS (A100) | ~3 | ~8 | 9.0 | 3x |
SparseDrive is both more accurate and 3x faster than UniAD, primarily due to sparse representations eliminating the dense BEV bottleneck.
3.3 Orin Feasibility
| Component | A100 Latency | Orin Estimate | Notes |
|---|---|---|---|
| Image backbone | 15ms | 30ms | ResNet-50 with TensorRT |
| Sparse query init | 2ms | 5ms | |
| Deformable attention | 30ms | 70ms | Most expensive |
| Parallel task heads | 20ms | 45ms | |
| Total | 67ms | 150ms | Exceeds the 100ms budget |
SparseDrive needs significant optimization for Orin — potentially achievable with:
- Smaller backbone (ResNet-34 or EfficientNet)
- Fewer queries (300 vs 900)
- INT8 quantization for attention
- Custom CUDA kernels for deformable attention
Estimated with aggressive optimization: 80-100ms on Orin — borderline feasible.
4. VAD: Vectorized Scene Representation
4.1 Architecture (Jiang et al., ICCV 2023)
VAD uses vectorized representations (polylines) for both map elements and agent trajectories, eliminating rasterized maps entirely.
Key idea: Represent everything as sequences of points:
- Lanes: ordered waypoints
- Boundaries: ordered boundary points
- Agents: historical + predicted trajectory points
- Ego plan: future waypoints
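A minimal sketch of what such a vectorized scene container could look like (field names and shapes are illustrative, not VAD's actual data structures):
from dataclasses import dataclass
import numpy as np

# Everything is an ordered point sequence; no rasterized map anywhere.
@dataclass
class VectorizedScene:
    lanes: np.ndarray          # (N_lane, P, 2)   ordered lane centerline points
    boundaries: np.ndarray     # (N_bound, P, 2)  ordered boundary points
    agent_history: np.ndarray  # (N_agent, T_hist, 2)   past xy per agent
    agent_future: np.ndarray   # (N_agent, K, T_fut, 2) K predicted modes
    ego_plan: np.ndarray       # (T_plan, 2)      planned ego waypoints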
4.2 VAD-Tiny for Edge Deployment
| Variant | Backbone | Queries | Planning L2 | FPS (A100) | Est. Orin |
|---|---|---|---|---|---|
| VAD-Base | ResNet-50 | 300 agent + 100 map | 0.54 | 8 | ~200ms+ |
| VAD-Tiny | ResNet-18 | 200 agent + 50 map | 0.62 | 20 | ~80ms |
VAD-Tiny is the most promising unified model for Orin deployment — 80ms estimated with TensorRT.
5. StreamPETR and Temporal Fusion
5.1 StreamPETR (Wang et al., ICCV 2023)
StreamPETR brings temporal fusion to sparse query-based detection via streaming — it propagates object queries across frames without re-processing historical images.
Architecture:
Frame t-1: Image → Features → Queries → Propagated Queries
↓
Frame t: Image → Features → Cross-Attention(new features, propagated queries)
↓
Updated Queries → Detection + Tracking
Benefit: Temporal context at minimal cost — only process the current frame's images, but retain object memory via persistent queries.
5.2 Multi-Task Extension
StreamPETR's propagated queries can be decoded into multiple tasks:
class StreamPETRMultiTask(nn.Module):
"""StreamPETR with multi-task heads."""
    def __init__(self):
        super().__init__()
        self.backbone = ResNet50()
self.bev_encoder = BEVPoolv2() # fast BEV via pool
# Shared temporal queries
self.num_queries = 600
self.query_memory = None # persistent across frames
# Task-specific heads
self.detection_head = CenterHead(256, num_classes=8)
self.segmentation_head = SegmentationHead(256, num_classes=14)
self.prediction_head = MotionHead(256, future_steps=6)
self.freespace_head = FreeSpaceHead(256)
def forward(self, images, prev_queries=None):
# Image features
features = self.backbone(images)
# BEV features (for segmentation/freespace)
bev = self.bev_encoder(features)
# Query update with temporal propagation
if prev_queries is not None:
queries = self.temporal_attention(features, prev_queries)
else:
queries = self.init_queries()
# Parallel task heads
detections = self.detection_head(queries)
segmentation = self.segmentation_head(bev)
predictions = self.prediction_head(queries, detections)
freespace = self.freespace_head(bev)
# Save queries for next frame
self.query_memory = queries.detach()
return {
'detections': detections,
'segmentation': segmentation,
'predictions': predictions,
'freespace': freespace
        }
6. Task Interference and Loss Balancing
6.1 The Negative Transfer Problem
Multi-task learning can hurt performance when tasks conflict:
| Conflict Type | Example | Impact |
|---|---|---|
| Gradient conflict | Detection wants sharp features, segmentation wants smooth | -1-2% on both tasks |
| Representation conflict | Short-range detection vs long-range prediction need different scales | -2-3% on one task |
| Learning rate conflict | Detection converges fast, prediction is slow | Unstable training |
| Data imbalance | Many detection labels, few planning labels | Dominant task overfits |
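Gradient conflict can be measured directly during training: compute each task's gradient on the shared parameters and check pairwise cosine similarity. A sketch, assuming `losses` is a dict of per-task losses and `shared_params` is the list of shared backbone parameters:
import torch

def task_gradient_cosine(losses, shared_params):
    """Pairwise cosine similarity between per-task gradients on shared params.
    Values well below zero indicate conflicting tasks (negative transfer risk)."""
    flat_grads = []
    for loss in losses.values():
        grads = torch.autograd.grad(loss, shared_params,
                                    retain_graph=True, allow_unused=True)
        flat = torch.cat([(g if g is not None else torch.zeros_like(p)).flatten()
                          for g, p in zip(grads, shared_params)])
        flat_grads.append(flat)
    names = list(losses.keys())
    sims = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            sims[(names[i], names[j])] = torch.nn.functional.cosine_similarity(
                flat_grads[i], flat_grads[j], dim=0).item()
    return sims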
6.2 Loss Balancing Methods
| Method | Mechanism | Cost | Effectiveness |
|---|---|---|---|
| Fixed weights | Manual tuning | None | Baseline |
| Uncertainty weighting (Kendall et al., 2018) | Homoscedastic uncertainty as weight | 1 param per task | Good |
| GradNorm (Chen et al., 2018) | Normalize gradient magnitudes | Gradient computation | Good |
| PCGrad (Yu et al., 2020) | Project conflicting gradients | O(T²) per step | Best for conflicts |
| CAGrad (Liu et al., 2021) | Common descent direction | O(T²) per step | Best accuracy |
| DWA (Liu et al., 2019) | Dynamic weight average | Track loss ratios | Simple, effective |
| IMTL (Liu et al., 2021) | Impartial multi-task learning | O(T²) per step | Theoretically sound |
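DWA is the easiest of these to retrofit onto an existing training loop: each task is weighted by how slowly its loss has been decreasing relative to the others. A minimal sketch following Liu et al., 2019; the temperature value and history bookkeeping are illustrative:
import math

def dwa_weights(loss_history, temperature=2.0):
    """Dynamic Weight Average: tasks whose loss decreases slowly get larger weights.
    loss_history maps task name -> [loss at t-2, loss at t-1]."""
    names = list(loss_history.keys())
    # Ratio > 1 means the loss is decreasing slowly (or increasing)
    ratios = {n: loss_history[n][1] / max(loss_history[n][0], 1e-8) for n in names}
    exp_r = {n: math.exp(r / temperature) for n, r in ratios.items()}
    z = sum(exp_r.values())
    k = len(names)
    return {n: k * e / z for n, e in exp_r.items()}   # weights sum to K

# Usage: weights = dwa_weights({'detection': [0.92, 0.88], 'segmentation': [0.40, 0.39]})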
6.3 Uncertainty Weighting (Recommended)
class MultiTaskLoss(nn.Module):
"""Uncertainty-weighted multi-task loss (Kendall et al., 2018)."""
def __init__(self, task_names):
super().__init__()
# Learnable log-variance per task (initialized to 0 = equal weight)
self.log_vars = nn.ParameterDict({
name: nn.Parameter(torch.zeros(1)) for name in task_names
})
def forward(self, losses):
"""
Args:
losses: dict of {task_name: loss_value}
Returns:
total weighted loss
"""
total = 0
weights = {}
for name, loss in losses.items():
log_var = self.log_vars[name]
            # Weight = exp(-log_var) = 1/sigma^2; the +log_var term keeps the
            # learned variance from growing unboundedly (Kendall et al., 2018)
            precision = torch.exp(-log_var)
            total += precision * loss + log_var
weights[name] = precision.item()
return total, weights
# Usage
multi_loss = MultiTaskLoss(['detection', 'segmentation', 'prediction', 'freespace'])
total, weights = multi_loss({
'detection': det_loss,
'segmentation': seg_loss,
'prediction': pred_loss,
'freespace': free_loss
})
# weights automatically learned: {detection: 2.1, segmentation: 0.8, ...}
6.4 PCGrad for Conflict Resolution
def pcgrad_update(task_losses, shared_params, optimizer):
"""Projecting Conflicting Gradients (PCGrad)."""
task_grads = []
# Compute per-task gradients
for loss in task_losses:
optimizer.zero_grad()
loss.backward(retain_graph=True)
grads = [p.grad.clone() for p in shared_params]
task_grads.append(grads)
    # Project conflicting gradients (simplified sketch: per-parameter projection
    # in a fixed task order; the original PCGrad projects full flattened
    # gradients in a random task order)
for i in range(len(task_grads)):
for j in range(len(task_grads)):
if i == j:
continue
# For each pair of tasks
for k in range(len(shared_params)):
g_i = task_grads[i][k]
g_j = task_grads[j][k]
# Check if gradients conflict (negative cosine similarity)
dot = (g_i * g_j).sum()
if dot < 0:
# Project g_i onto normal of g_j
task_grads[i][k] = g_i - (dot / (g_j.norm()**2 + 1e-8)) * g_j
# Sum projected gradients
optimizer.zero_grad()
for k, p in enumerate(shared_params):
p.grad = sum(task_grads[i][k] for i in range(len(task_grads)))
    optimizer.step()
7. Shared-Backbone Multi-Head Architecture
7.1 Recommended Architecture for the Reference Airside AV Stack
Rather than adopting a monolithic end-to-end model, the recommended approach for the reference airside AV stack is a shared-backbone, multi-head design that preserves the proven PointPillars detector while adding capabilities:
LiDAR Scans (4-8 RoboSense)
↓
Point Cloud Merge & Preprocessing
↓
┌─────────────────────────────┐
│ Shared PillarVFE Backbone │ ← existing PointPillars backbone
│ + SECOND FPN Neck │
└────────────┬────────────────┘
↓
┌────────┬────────┬──────────┬───────────┐
│ Det │ Seg │ FreeSpace│ Prediction │
│ Head │ Head │ Head │ Head │
│ (exist)│ (new) │ (new) │ (new) │
└────────┴────────┴──────────┴───────────┘
↓ ↓ ↓ ↓
Detections Semantic Free Space Motion
(boxes) Map (BEV) Grid (BEV) Forecasts
↓ ↓ ↓ ↓
┌────────────────────────────────────────┐
│ Frenet Planner (existing) │
│ + enhanced cost function using all │
│ multi-task outputs │
└────────────────────────────────────────┘
7.2 Task Head Designs
Detection Head (Existing — CenterHead):
- Already proven at 6.84ms on Orin with TensorRT
- No changes needed
Segmentation Head (New):
class BEVSegmentationHead(nn.Module):
"""Lightweight BEV segmentation from shared features."""
def __init__(self, in_channels=256, num_classes=14):
super().__init__()
self.head = nn.Sequential(
nn.Conv2d(in_channels, 128, 3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.Conv2d(128, 64, 3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.Conv2d(64, num_classes, 1)
)
def forward(self, bev_features):
        return self.head(bev_features)  # (B, C, H, W)
Estimated: 2-3ms on Orin (simple convolutions on shared BEV features).
Free Space Head (New):
class FreeSpaceHead(nn.Module):
"""Binary free/occupied classification for path planning."""
def __init__(self, in_channels=256):
super().__init__()
self.head = nn.Sequential(
nn.Conv2d(in_channels, 64, 3, padding=1),
nn.ReLU(),
nn.Conv2d(64, 1, 1),
nn.Sigmoid()
)
def forward(self, bev_features):
        return self.head(bev_features)  # (B, 1, H, W) probability of free
Estimated: 1-2ms on Orin.
Motion Prediction Head (New):
class MotionPredictionHead(nn.Module):
"""Predict future positions of detected objects."""
def __init__(self, in_channels=256, num_modes=6, future_steps=10):
super().__init__()
self.num_modes = num_modes
self.future_steps = future_steps
# Per-object features from detection queries
self.motion_encoder = nn.Sequential(
nn.Linear(in_channels, 256),
nn.ReLU(),
nn.Linear(256, 256)
)
# Multi-modal prediction
self.trajectory_decoder = nn.Linear(256, num_modes * future_steps * 2)
self.mode_probs = nn.Linear(256, num_modes)
def forward(self, object_features):
"""
Args:
object_features: (B, N, C) from detection head
Returns:
trajectories: (B, N, K, T, 2) — K modes, T timesteps, xy
mode_probs: (B, N, K) — probability per mode
"""
h = self.motion_encoder(object_features)
        traj = self.trajectory_decoder(h).reshape(
            *h.shape[:-1], self.num_modes, self.future_steps, 2
        )  # (B, N, K, T, 2), matching the docstring
        probs = F.softmax(self.mode_probs(h), dim=-1)
        return traj, probs
Estimated: 3-5ms on Orin (runs only for N detected objects, typically <50).
7.3 Total Multi-Task Timing
| Component | Standalone | Multi-Task (shared backbone) | Savings |
|---|---|---|---|
| Backbone (PillarVFE + FPN) | 5.0ms | 5.0ms (shared) | — |
| Detection head | 1.8ms | 1.8ms | 0% |
| Segmentation head | N/A (separate: 8ms) | 2.5ms | 69% |
| Free space head | N/A (separate: 5ms) | 1.5ms | 70% |
| Motion prediction | N/A (separate: 15ms) | 4.0ms | 73% |
| Total | ~34ms (4 separate models) | ~14.8ms | 56% |
The shared backbone saves ~19ms by amortizing the most expensive computation.
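Wiring the heads above onto the existing backbone, a thin wrapper module could look like the sketch below. The backbone/neck arguments stand in for the existing OpenPCDet PillarVFE and SECOND FPN modules, and the 'object_features' key is a hypothetical per-object feature output assumed to be exposed by the detection head:
import torch.nn as nn

class MultiTaskPointPillars(nn.Module):
    """Sketch: shared PointPillars backbone with parallel task heads."""
    def __init__(self, backbone, neck, det_head, num_seg_classes=14):
        super().__init__()
        self.backbone = backbone      # existing PillarVFE + scatter (placeholder)
        self.neck = neck              # existing SECOND FPN neck (placeholder)
        self.det_head = det_head      # existing CenterHead, unchanged
        self.seg_head = BEVSegmentationHead(256, num_seg_classes)
        self.free_head = FreeSpaceHead(256)
        self.motion_head = MotionPredictionHead(256)

    def forward(self, batch):
        bev = self.neck(self.backbone(batch))   # (B, 256, H, W) shared BEV features
        detections = self.det_head(bev)
        return {
            'detections': detections,
            'segmentation': self.seg_head(bev),
            'freespace': self.free_head(bev),
            # 'object_features' is a hypothetical per-object tensor pooled by the
            # detection head; the motion head consumes it
            'motion': self.motion_head(detections['object_features']),
        }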
8. Foundation Models as Unified Backbones
8.1 DINOv2 as Multi-Task Backbone
DINOv2 features contain rich semantic information usable for multiple tasks simultaneously (see 30-autonomy-stack/perception/overview/dinov2-foundation-models-driving.md):
class DINOv2MultiTaskPerception(nn.Module):
"""DINOv2 backbone with multi-task heads for camera perception."""
    def __init__(self):
        super().__init__()
        self.backbone = DINOv2ViTB14(frozen=True)
        self.adapter = nn.Sequential(
            nn.Linear(768, 256), nn.ReLU()
        )  # lightweight adapter to the driving domain (LoRA-style low-rank tuning is an alternative)
# Multi-task heads from shared DINOv2 features
self.det_head_2d = FCOS3DHead(256)
self.seg_head = SegFormerHead(256, 14)
self.depth_head = DepthHead(256)
self.lane_head = LaneDetHead(256)
def forward(self, images):
features = self.backbone(images) # (B, N, 768)
adapted = self.adapter(features) # (B, N, 256)
# Spatial reshape
H, W = images.shape[-2] // 14, images.shape[-1] // 14
spatial = adapted.reshape(-1, H, W, 256).permute(0, 3, 1, 2)
return {
'detections': self.det_head_2d(spatial),
'segmentation': self.seg_head(spatial),
'depth': self.depth_head(spatial),
'lanes': self.lane_head(spatial)
        }
8.2 PTv3 as LiDAR Multi-Task Backbone
Point Transformer v3 (PTv3) provides a unified LiDAR feature extractor (see 30-autonomy-stack/perception/overview/lidar-foundation-models.md):
| Task | Single-Task (separate backbone) | Multi-Task (shared PTv3) | Delta |
|---|---|---|---|
| 3D Detection (NDS) | 72.1 | 71.5 | -0.6 |
| Semantic Seg (mIoU) | 80.3 | 79.8 | -0.5 |
| Panoptic Seg (PQ) | 74.2 | 73.6 | -0.6 |
| Total compute | 3x backbone | 1x backbone | 67% savings |
Minimal accuracy loss (<1%) for 67% compute reduction.
9. Deployment Considerations for Orin
9.1 End-to-End Model Feasibility
| Model | A100 Latency | Orin Estimate | Feasible? | Key Bottleneck |
|---|---|---|---|---|
| UniAD | 100ms | 250ms+ | No | Dense BEV, serial cascade |
| SparseDrive | 67ms | 150ms | Borderline | Deformable attention |
| VAD-Tiny | 50ms | 80ms | Maybe | BEV encoder |
| StreamPETR | 40ms | 70ms | Yes | Image backbone |
| Multi-head PointPillars | 14.8ms | 14.8ms | Yes | Already on Orin |
9.2 Incremental Multi-Task Deployment
| Phase | Addition | Orin Latency | Benefit |
|---|---|---|---|
| Current | PointPillars detection only | 6.84ms | Baseline |
| +1 | Add segmentation head | 9.3ms | Road surface classification |
| +2 | Add free space head | 10.8ms | Explicit drivable area |
| +3 | Add motion prediction head | 14.8ms | Predictive planning |
| +4 | Add camera branch (future) | ~35ms | Redundant perception |
Each phase adds ~1.5-4ms and can be validated independently.
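One way to make the rollout explicit in code is a phase-gated head configuration; a sketch using the estimates from the table above (phase and flag names are illustrative):
# Phase-gated head enablement; latency figures are the estimates from the table
# above, not measurements.
DEPLOYMENT_PHASES = {
    'current': {'heads': ['detection'],                              'budget_ms': 6.84},
    'phase1':  {'heads': ['detection', 'segmentation'],              'budget_ms': 9.3},
    'phase2':  {'heads': ['detection', 'segmentation', 'freespace'], 'budget_ms': 10.8},
    'phase3':  {'heads': ['detection', 'segmentation', 'freespace',
                          'motion_prediction'],                      'budget_ms': 14.8},
}

def active_heads(phase: str) -> list[str]:
    return DEPLOYMENT_PHASES[phase]['heads']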
9.3 TensorRT Multi-Head Optimization
# TensorRT engine with multiple output heads
import tensorrt as trt
def build_multi_head_engine(onnx_path, engine_path):
"""Build TensorRT engine with shared backbone + multiple heads."""
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
# Parse ONNX (exported with all heads)
parser.parse_from_file(onnx_path)
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
# FP16 for backbone and heads
config.set_flag(trt.BuilderFlag.FP16)
# INT8 for backbone only (heads need FP16 for classification accuracy)
# ... calibration setup ...
# Mark multiple outputs
for i in range(network.num_outputs):
output = network.get_output(i)
print(f"Output {i}: {output.name}, shape: {output.shape}")
engine = builder.build_serialized_network(network, config)
with open(engine_path, 'wb') as f:
        f.write(engine)
10. Airside-Specific Multi-Task Design
10.1 Airside Task Taxonomy
| Task | Priority | Why | Existing? |
|---|---|---|---|
| 3D Object Detection | Critical | Core safety function | Yes (PointPillars) |
| Ground Segmentation | High | Drivable area identification | Yes (RANSAC) |
| Semantic Segmentation | Medium | Rich scene understanding | No |
| Free Space Estimation | High | Path planning input | No (implicit) |
| Motion Prediction | High | Predictive safety margins | No |
| Lane/Marking Detection | Medium | Taxiway following | No (map-based) |
| Object Classification (fine) | Medium | GSE type identification | Partial |
| Occupancy Prediction | Medium | Occluded area reasoning | No |
| Depth Estimation | Low (LiDAR primary) | Camera fallback | No |
10.2 Airside Segmentation Classes
For the segmentation head:
| ID | Class | Color | Priority | Description |
|---|---|---|---|---|
| 0 | background | black | Low | Sky, distant buildings |
| 1 | tarmac | gray | Medium | Paved apron surface |
| 2 | taxiway_marking | yellow | High | Yellow lines, guidance markings |
| 3 | aircraft_body | blue | Critical | Fuselage, wings |
| 4 | aircraft_engine | red | Critical | Jet engines (safety zone) |
| 5 | gse_vehicle | green | High | Baggage carts, tugs, trucks |
| 6 | ground_crew | orange | Critical | Personnel |
| 7 | jet_bridge | purple | Medium | Passenger boarding bridges |
| 8 | terminal | brown | Low | Terminal buildings |
| 9 | grass_area | lime | Medium | Non-drivable |
| 10 | safety_equipment | pink | Medium | Cones, barriers, chocks |
| 11 | construction | red-orange | High | Temporary obstacles |
| 12 | fod | white | Critical | Foreign object debris |
| 13 | water_puddle | cyan | Medium | Surface water |
10.3 Multi-Task Training Data Strategy
| Task | Public Data | Airside Data Needed | Labeling Cost |
|---|---|---|---|
| Detection | nuScenes pre-train | 2,000 frames | $16-30K |
| Segmentation | SemanticKITTI pre-train | 1,000 frames | $15-25K |
| Free space | Waymo pre-train | 500 frames | $5-10K |
| Motion prediction | nuScenes pre-train | 2,000 frames | $20-40K |
| Total | Free | 5,500 frames | $56-105K |
With multi-task shared labeling (annotate all tasks per frame): $40-75K (30% savings from shared annotation sessions).
11. Comparison and Recommendations
11.1 Architecture Comparison
| Architecture | Accuracy | Latency (Orin) | Certifiability | Complexity | Risk |
|---|---|---|---|---|---|
| Separate models | Baseline | 34ms (4 models) | High (modular) | Low | Low |
| Multi-head shared | +1-3% | 14.8ms | Medium | Medium | Low |
| UniAD-style | +5-10% | 250ms+ | Low (monolithic) | High | High |
| SparseDrive-style | +5-10% | 150ms | Low | High | High |
| VAD-Tiny | +3-5% | 80ms | Low | Medium | Medium |
11.2 Recommendation
Near-term (0-12 months): Multi-head shared backbone
- Preserve proven PointPillars detection
- Add segmentation, free space, motion prediction heads incrementally
- Each head validated independently before deployment
- Total: 14.8ms on Orin — 2.3x faster than the ~34ms of four separate models
Medium-term (12-24 months): Camera-LiDAR multi-task fusion
- Add camera branch with DINOv2/PTv3 features
- BEV fusion at feature level
- Enables redundant perception for certification
Long-term (24+ months): End-to-end unified (when Orin successor available)
- SparseDrive-style on NVIDIA Thor (~1000 TOPS)
- Full end-to-end training with planning objective
- Only after sufficient airside training data accumulated
12. Key Takeaways
UniAD (CVPR 2023 Best Paper) unified 6 tasks — improving planning L2 by 30% over modular baselines, but at 100ms on A100 (250ms+ on Orin — infeasible)
SparseDrive achieves 3x speedup over UniAD — fully sparse representation eliminates dense BEV bottleneck, 9 FPS on A100, but still 150ms estimated on Orin
Multi-head shared backbone saves 56% compute — 14.8ms vs 34ms for 4 separate models on Orin, by amortizing the PointPillars backbone across detection + segmentation + free space + prediction
Task interference is manageable — uncertainty-weighted loss (Kendall et al., 2018) automatically balances tasks with 1 learnable parameter per task. PCGrad resolves gradient conflicts at O(T²) per step
Positive transfer between tasks — detection + segmentation share boundary features (+1-2% mAP), tracking + prediction share temporal features (5-10% lower minADE), but detection + planning transfer is mixed
Incremental deployment is key — add heads one at a time (segmentation +2.5ms, free space +1.5ms, prediction +4ms), each validated independently before production
VAD-Tiny is the most Orin-feasible unified model — estimated 80ms with TensorRT, but still consumes most of the 100ms planning budget with less proven detection accuracy
PTv3 as shared LiDAR backbone loses <1% per task vs single-task for 67% compute reduction — the multi-task tax is small for a strong backbone
Airside segmentation needs 14 custom classes — from tarmac and taxiway markings (drivable) to aircraft engines and ground crew (critical safety zones)
Multi-task labeling saves 30% — annotating all tasks per frame in shared sessions ($40-75K) vs separate annotation campaigns ($56-105K)
Foundation models (DINOv2, PTv3) are natural multi-task backbones — their pre-trained features transfer well to multiple downstream tasks with LoRA adapters, reducing per-task fine-tuning cost
Certification favors modular over monolithic — shared-backbone multi-head architecture is decomposable for safety analysis (each head can be evaluated independently), while fully end-to-end models (UniAD) resist modular certification per ISO 3691-4 / UL 4600
Recommended timeline: Multi-head PointPillars now (14.8ms, proven) → camera-LiDAR fusion at 12-24 months → full end-to-end on Thor at 24+ months
Free space estimation is the highest-value addition — enables explicit drivable area reasoning in the Frenet planner at only 1.5ms additional cost, replacing implicit free-space inference from detection
Motion prediction from shared features achieves 70% of standalone accuracy — the multi-task prediction head at 4ms is a practical substitute for a dedicated 15ms prediction model, good enough for conservative airside speeds (≤30 km/h)
References
- Hu, Y. et al., "Planning-oriented Autonomous Driving (UniAD)," CVPR 2023 (Best Paper)
- Sun, W. et al., "SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation," 2024
- Jiang, B. et al., "VAD: Vectorized Scene Representation for Efficient Autonomous Driving," ICCV 2023
- Wang, S. et al., "Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection (StreamPETR)," ICCV 2023
- Kendall, A. et al., "Multi-Task Learning Using Uncertainty to Weigh Losses," CVPR 2018
- Chen, Z. et al., "GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks," ICML 2018
- Yu, T. et al., "Gradient Surgery for Multi-Task Learning (PCGrad)," NeurIPS 2020
- Liu, B. et al., "Conflict-Averse Gradient Descent for Multi-Task Learning (CAGrad)," NeurIPS 2021
- Li, Z. et al., "BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers," ECCV 2022
- Wu, X. et al., "Point Transformer V3: Simpler, Faster, Stronger," CVPR 2024
Document generated for reference airside AV stack industry research, April 2026. Covers multi-task and unified perception architectures — for individual task details, see specific docs: detection (perception/openpcdet-centerpoint.md), segmentation (perception/lidar-semantic-segmentation.md), prediction (planning/motion-prediction.md), occupancy (world-models/occupancy-networks-comparison.md).