Multi-Task and Unified Perception Architectures for Autonomous Driving
Executive Summary
Modern AV perception is converging toward unified architectures that perform detection, segmentation, tracking, prediction, and even planning in a single model with shared representations. This document surveys the evolution from task-specific models to end-to-end multi-task systems: UniAD (first unified framework, CVPR 2023 Best Paper), SparseDrive (fully sparse, 9.0 FPS on A100), VAD (vectorized scene representation), StreamPETR (streaming temporal fusion), Far3D (long-range detection), and the emerging paradigm of foundation-model-based unified perception. For the reference airside AV stack, the key question is whether to replace the current separate PointPillars detector + GTSAM localizer + Frenet planner with a unified model, or to incrementally add multi-task heads to the existing backbone. Key finding: a shared-backbone, multi-head approach preserves the proven PointPillars detection (6.84ms) while adding segmentation, free-space estimation, and motion prediction at marginal cost (~8ms additional, ~15ms total), and avoids the risks of monolithic end-to-end systems that are harder to certify. Full end-to-end models (UniAD, SparseDrive) achieve state-of-the-art on nuScenes but require 50-100ms on A100 — too slow for Orin real-time without aggressive optimization.
Table of Contents
- The Case for Unified Perception
- UniAD: The First Unified Framework
- SparseDrive: Fully Sparse End-to-End
- VAD: Vectorized Scene Representation
- StreamPETR and Temporal Fusion
- Task Interference and Loss Balancing
- Shared-Backbone Multi-Head Architecture
- Foundation Models as Unified Backbones
- Deployment Considerations for Orin
- Airside-Specific Multi-Task Design
- Comparison and Recommendations
- Key Takeaways
1. The Case for Unified Perception
1.1 The Problem with Separate Models
The reference airside AV stack currently runs perception as independent modules:
LiDAR → PointPillars → 3D Detections
LiDAR → RANSAC → Ground Segmentation
LiDAR → GTSAM → Localization
Camera → (none currently)
Problems:
- No information sharing: Detection doesn't benefit from segmentation context, and vice versa
- Redundant computation: Each module independently processes the same point cloud
- Integration errors: Separate module outputs must be fused in planning, introducing latency and inconsistency
- No joint optimization: Modules can't learn complementary representations
1.2 The Multi-Task Advantage
| Approach | Compute | Consistency | Optimization | Certification |
|---|---|---|---|---|
| Separate models | N × backbone | Inconsistent (separate outputs) | Per-task optimal | Easier (modular) |
| Multi-task shared backbone | 1 × backbone + N heads | Consistent (shared features) | Joint optimization | Medium |
| Fully unified (UniAD) | 1 × model | Fully consistent | End-to-end optimal | Harder (monolithic) |
The multi-task approach saves compute (shared backbone amortized across tasks), improves consistency (all tasks see the same features), and can improve accuracy through positive transfer (related tasks help each other learn).
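As a back-of-the-envelope illustration using the Orin timing estimates from Section 7.3 (shared backbone ~5ms, lightweight heads ~1.5-4ms, standalone models 5-15ms each), the amortization works out as follows; this is latency arithmetic only, not a benchmark:
# Latency arithmetic using this document's Orin estimates (Section 7.3);
# illustrative only, not measured results.
standalone_ms = {"det": 6.8, "seg": 8.0, "freespace": 5.0, "pred": 15.0}  # 4 separate models
backbone_ms = 5.0                                                          # shared PillarVFE + FPN
head_ms = {"det": 1.8, "seg": 2.5, "freespace": 1.5, "pred": 4.0}          # heads on shared features

separate = sum(standalone_ms.values())          # ~34 ms: four independent models
shared = backbone_ms + sum(head_ms.values())    # ~14.8 ms: one backbone + four heads
print(f"{separate:.1f} ms -> {shared:.1f} ms "
      f"({100 * (1 - shared / separate):.0f}% saved)")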
1.3 When Multi-Task Helps vs Hurts
| Task Pair | Transfer | Mechanism | Evidence |
|---|---|---|---|
| Detection + Segmentation | Positive | Shared object boundary features | +1-2% mAP, +2-3% mIoU |
| Detection + Tracking | Positive | Temporal features aid both | +1-3% MOTA |
| Detection + Depth | Positive | Depth regularizes 3D detection | +2-4% mAP |
| Detection + Planning | Mixed | Planning needs different spatial resolution | 0 to +1% L2 |
| Segmentation + Mapping | Positive | Complementary scene understanding | +1-2% mAP on map elements |
| Tracking + Prediction | Strong positive | Prediction directly extends tracks | 5-10% lower minADE |
| Detection + Text/Language | Neutral | Different representation spaces | Minimal transfer |
2. UniAD: The First Unified Framework
2.1 Architecture (Hu et al., CVPR 2023 Best Paper)
UniAD unifies detection, tracking, mapping, motion prediction, occupancy prediction, and planning in a single model.
Pipeline:
Multi-view Images (6 cameras)
↓
Image Backbone (ResNet-101)
↓
BEV Features (BEVFormer)
↓
┌────────────────────────────────────────┐
│ Stage 1: TrackFormer (Detection+Track) │
│ → Object queries with temporal memory │
└─────────────┬──────────────────────────┘
↓
┌────────────────────────────────────────┐
│ Stage 2: MapFormer (Online Mapping) │
│ → Lane/boundary queries │
└─────────────┬──────────────────────────┘
↓
┌────────────────────────────────────────┐
│ Stage 3: MotionFormer (Prediction) │
│ → Multi-modal future trajectories │
│ → Conditioned on map + detections │
└─────────────┬──────────────────────────┘
↓
┌────────────────────────────────────────┐
│ Stage 4: OccFormer (Occupancy Pred.) │
│ → Future occupancy grids │
└─────────────┬──────────────────────────┘
↓
┌────────────────────────────────────────┐
│ Stage 5: Planner │
│ → Ego future trajectory │
│ → Conditioned on all above │
└────────────────────────────────────────┘
2.2 Key Design Choices
Query-based communication: Each stage produces queries that are passed to subsequent stages. This creates an information bottleneck that forces each stage to produce useful abstractions.
Cascaded training: Stages are trained sequentially initially (stage 1 first, then 2, etc.) and then jointly fine-tuned end-to-end. This avoids the instability of training all stages simultaneously from scratch.
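A minimal sketch of such a staged schedule, assuming the model exposes an `active_heads` argument and returns a dict of per-head losses (both are illustrative interfaces, not UniAD's published training code):
# Sketch of cascaded training: warm up the stages one at a time in pipeline
# order, then fine-tune all of them jointly end-to-end. Stage names, epoch
# counts, and the model interface are illustrative assumptions.
STAGES = ["track", "map", "motion", "occupancy", "planning"]

def train_cascaded(model, dataloader, optimizer, warmup_epochs=2, joint_epochs=20):
    def run_epoch(active):
        for batch in dataloader:
            losses = model(batch, active_heads=active)   # dict: head name -> loss
            sum(losses.values()).backward()
            optimizer.step()
            optimizer.zero_grad()

    # Phase 1: enable heads cumulatively so earlier stages stabilize first
    for i in range(len(STAGES)):
        for _ in range(warmup_epochs):
            run_epoch(STAGES[: i + 1])

    # Phase 2: joint end-to-end fine-tuning with every head active
    for _ in range(joint_epochs):
        run_epoch(STAGES)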
Loss weighting:
loss_weights = {
'detection': 2.0, # primary task
'tracking': 1.0, # secondary
'mapping': 0.5, # auxiliary
'prediction': 1.0, # important for planning
'occupancy': 0.35, # regularizer
'planning': 1.0, # final objective
}
2.3 Results on nuScenes
| Task | Metric | UniAD | Previous SOTA | Improvement |
|---|---|---|---|---|
| Detection | NDS | 49.8 | 51.6 (BEVFormer) | -1.8 (trade-off) |
| Tracking | AMOTA | 39.0 | 37.2 | +1.8 |
| Mapping | mAP | 35.0 | 33.2 | +1.8 |
| Prediction | minADE | 0.70 | 0.84 | -0.14 (17%) |
| Occ. Prediction | IoU-f | 63.4 | 57.8 | +5.6 |
| Planning | Avg L2 | 0.71 | 1.02 | -0.31 (30%) |
Key insight: UniAD's detection score is slightly below single-task SOTA, but planning (the final objective) improves dramatically because of the unified representation.
2.4 Limitations
- Latency: ~100ms on A100 — too slow for Orin real-time
- Camera-only: No LiDAR integration path
- Serial cascade: Each stage depends on previous — hard to parallelize
- Certification: Monolithic model is hard to decompose for safety analysis
3. SparseDrive: Fully Sparse End-to-End
3.1 Architecture (Sun et al., 2024)
SparseDrive addresses UniAD's density bottleneck by using fully sparse representations throughout. No dense BEV map — everything is represented as sparse queries.
Key Innovations:
- Symmetric sparse perception: Single set of queries handles detection + tracking + mapping + prediction
- Parallel task heads: Unlike UniAD's serial cascade, tasks run in parallel on shared queries
- Instance-level memory: Temporal information via per-instance memory bank, not dense BEV recurrence
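A minimal sketch of the per-instance memory idea, assuming a top-k confidence-based selection of queries to carry forward (class and field names are illustrative, not SparseDrive's actual implementation):
import torch

# Sketch: keep the top-k confident instance queries from the previous frame and
# concatenate them with fresh queries, instead of recurring over a dense BEV map.
class InstanceMemoryBank:
    def __init__(self, top_k=300):
        self.top_k = top_k
        self.queries = None   # (K, C) propagated instance features
        self.anchors = None   # (K, D) propagated box/anchor parameters

    def update(self, queries, anchors, scores):
        keep = scores.topk(min(self.top_k, scores.numel())).indices
        self.queries = queries[keep].detach()
        self.anchors = anchors[keep].detach()

    def propagate(self, new_queries, new_anchors):
        if self.queries is None:
            return new_queries, new_anchors
        return (torch.cat([self.queries, new_queries], dim=0),
                torch.cat([self.anchors, new_anchors], dim=0))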
Multi-view Images → Image Features
↓
Sparse Instance Queries (N=900)
↓
Deformable Cross-Attention
(query → image features)
↓
┌──────────┬──────────┬──────────┬───────────┐
│ Det Head │ Map Head │ Pred Head│ Plan Head │
│ (parallel)│ (parallel)│(parallel)│(parallel) │
└──────────┴──────────┴──────────┴───────────┘
3.2 Results
| Metric | UniAD | VAD-Base | SparseDrive | Improvement vs UniAD |
|---|---|---|---|---|
| NDS | 49.8 | 42.7 | 52.5 | +2.7 |
| AMOTA | 39.0 | — | 41.2 | +2.2 |
| Avg L2 (plan) | 0.71 | 0.54 | 0.48 | -32% |
| Collision Rate | 0.31 | 0.07 | 0.05 | -84% |
| FPS (A100) | ~3 | ~8 | 9.0 | 3x |
SparseDrive is both more accurate and 3x faster than UniAD, primarily due to sparse representations eliminating the dense BEV bottleneck.
3.3 Orin Feasibility
| Component | A100 Latency | Orin Estimate | Notes |
|---|---|---|---|
| Image backbone | 15ms | 30ms | ResNet-50 with TensorRT |
| Sparse query init | 2ms | 5ms | |
| Deformable attention | 30ms | 70ms | Most expensive |
| Parallel task heads | 20ms | 45ms | |
| Total | 67ms | 150ms | Exceeds the 100ms budget |
SparseDrive needs significant optimization for Orin — potentially achievable with:
- Smaller backbone (ResNet-34 or EfficientNet)
- Fewer queries (300 vs 900)
- INT8 quantization for attention
- Custom CUDA kernels for deformable attention
Estimated with aggressive optimization: 80-100ms on Orin — borderline feasible.
4. VAD: Vectorized Scene Representation
4.1 Architecture (Jiang et al., ICCV 2023)
VAD uses vectorized representations (polylines) for both map elements and agent trajectories, eliminating rasterized maps entirely.
Key idea: Represent everything as sequences of points:
- Lanes: ordered waypoints
- Boundaries: ordered boundary points
- Agents: historical + predicted trajectory points
- Ego plan: future waypoints
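A minimal sketch of what such a vectorized scene container could look like (field names and shapes are illustrative, not VAD's actual data structures):
from dataclasses import dataclass
import numpy as np

# Everything is an ordered point sequence; no rasterized map anywhere.
@dataclass
class VectorizedScene:
    lanes: np.ndarray          # (N_lane, P, 2)   ordered lane centerline points
    boundaries: np.ndarray     # (N_bound, P, 2)  ordered boundary points
    agent_history: np.ndarray  # (N_agent, T_hist, 2)   past xy per agent
    agent_future: np.ndarray   # (N_agent, K, T_fut, 2) K predicted modes
    ego_plan: np.ndarray       # (T_plan, 2)      planned ego waypoints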
4.2 VAD-Tiny for Edge Deployment
| Variant | Backbone | Queries | Planning L2 | FPS (A100) | Est. Orin |
|---|---|---|---|---|---|
| VAD-Base | ResNet-50 | 300 agent + 100 map | 0.54 | 8 | ~200ms+ |
| VAD-Tiny | ResNet-18 | 200 agent + 50 map | 0.62 | 20 | ~80ms |
VAD-Tiny is the most promising unified model for Orin deployment — 80ms estimated with TensorRT.
5. StreamPETR and Temporal Fusion
5.1 StreamPETR (Wang et al., ICCV 2023)
StreamPETR brings temporal fusion to sparse query-based detection via streaming — it propagates object queries across frames without re-processing historical images.
Architecture:
Frame t-1: Image → Features → Queries → Propagated Queries
↓
Frame t: Image → Features → Cross-Attention(new features, propagated queries)
↓
Updated Queries → Detection + Tracking
Benefit: Temporal context at minimal cost — only process the current frame's images, but retain object memory via persistent queries.
5.2 Multi-Task Extension
StreamPETR's propagated queries can be decoded into multiple tasks:
class StreamPETRMultiTask(nn.Module):
"""StreamPETR with multi-task heads."""
    def __init__(self):
        super().__init__()
        self.backbone = ResNet50()
self.bev_encoder = BEVPoolv2() # fast BEV via pool
# Shared temporal queries
self.num_queries = 600
self.query_memory = None # persistent across frames
# Task-specific heads
self.detection_head = CenterHead(256, num_classes=8)
self.segmentation_head = SegmentationHead(256, num_classes=14)
self.prediction_head = MotionHead(256, future_steps=6)
self.freespace_head = FreeSpaceHead(256)
def forward(self, images, prev_queries=None):
# Image features
features = self.backbone(images)
# BEV features (for segmentation/freespace)
bev = self.bev_encoder(features)
# Query update with temporal propagation
if prev_queries is not None:
queries = self.temporal_attention(features, prev_queries)
else:
queries = self.init_queries()
# Parallel task heads
detections = self.detection_head(queries)
segmentation = self.segmentation_head(bev)
predictions = self.prediction_head(queries, detections)
freespace = self.freespace_head(bev)
# Save queries for next frame
self.query_memory = queries.detach()
return {
'detections': detections,
'segmentation': segmentation,
'predictions': predictions,
'freespace': freespace
        }
6. Task Interference and Loss Balancing
6.1 The Negative Transfer Problem
Multi-task learning can hurt performance when tasks conflict:
| Conflict Type | Example | Impact |
|---|---|---|
| Gradient conflict | Detection wants sharp features, segmentation wants smooth | -1-2% on both tasks |
| Representation conflict | Short-range detection vs long-range prediction need different scales | -2-3% on one task |
| Learning rate conflict | Detection converges fast, prediction is slow | Unstable training |
| Data imbalance | Many detection labels, few planning labels | Dominant task overfits |
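Gradient conflict can be measured directly during training: compute each task's gradient on the shared parameters and check pairwise cosine similarity. A sketch, assuming `losses` is a dict of per-task losses and `shared_params` is the list of shared backbone parameters:
import torch

def task_gradient_cosine(losses, shared_params):
    """Pairwise cosine similarity between per-task gradients on shared params.
    Values well below zero indicate conflicting tasks (negative transfer risk)."""
    flat_grads = []
    for loss in losses.values():
        grads = torch.autograd.grad(loss, shared_params,
                                    retain_graph=True, allow_unused=True)
        flat = torch.cat([(g if g is not None else torch.zeros_like(p)).flatten()
                          for g, p in zip(grads, shared_params)])
        flat_grads.append(flat)
    names = list(losses.keys())
    sims = {}
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            sims[(names[i], names[j])] = torch.nn.functional.cosine_similarity(
                flat_grads[i], flat_grads[j], dim=0).item()
    return sims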
6.2 Loss Balancing Methods
| Method | Mechanism | Cost | Effectiveness |
|---|---|---|---|
| Fixed weights | Manual tuning | None | Baseline |
| Uncertainty weighting (Kendall et al., 2018) | Homoscedastic uncertainty as weight | 1 param per task | Good |
| GradNorm (Chen et al., 2018) | Normalize gradient magnitudes | Gradient computation | Good |
| PCGrad (Yu et al., 2020) | Project conflicting gradients | O(T²) per step | Best for conflicts |
| CAGrad (Liu et al., 2021) | Common descent direction | O(T²) per step | Best accuracy |
| DWA (Liu et al., 2019) | Dynamic weight average | Track loss ratios | Simple, effective |
| IMTL (Liu et al., 2021) | Impartial multi-task learning | O(T²) per step | Theoretically sound |
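DWA is the easiest of these to retrofit onto an existing training loop: each task is weighted by how slowly its loss has been decreasing relative to the others. A minimal sketch following Liu et al., 2019; the temperature value and history bookkeeping are illustrative:
import math

def dwa_weights(loss_history, temperature=2.0):
    """Dynamic Weight Average: tasks whose loss decreases slowly get larger weights.
    loss_history maps task name -> [loss at t-2, loss at t-1]."""
    names = list(loss_history.keys())
    # Ratio > 1 means the loss is decreasing slowly (or increasing)
    ratios = {n: loss_history[n][1] / max(loss_history[n][0], 1e-8) for n in names}
    exp_r = {n: math.exp(r / temperature) for n, r in ratios.items()}
    z = sum(exp_r.values())
    k = len(names)
    return {n: k * e / z for n, e in exp_r.items()}   # weights sum to K

# Usage: weights = dwa_weights({'detection': [0.92, 0.88], 'segmentation': [0.40, 0.39]})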
6.3 Uncertainty Weighting (Recommended)
class MultiTaskLoss(nn.Module):
"""Uncertainty-weighted multi-task loss (Kendall et al., 2018)."""
def __init__(self, task_names):
super().__init__()
# Learnable log-variance per task (initialized to 0 = equal weight)
self.log_vars = nn.ParameterDict({
name: nn.Parameter(torch.zeros(1)) for name in task_names
})
def forward(self, losses):
"""
Args:
losses: dict of {task_name: loss_value}
Returns:
total weighted loss
"""
total = 0
weights = {}
for name, loss in losses.items():
log_var = self.log_vars[name]
            # Weight = exp(-log_var) = 1/sigma^2; the +log_var term keeps the
            # learned variance from growing unboundedly (Kendall et al., 2018)
            precision = torch.exp(-log_var)
            total += precision * loss + log_var
weights[name] = precision.item()
return total, weights
# Usage
multi_loss = MultiTaskLoss(['detection', 'segmentation', 'prediction', 'freespace'])
total, weights = multi_loss({
'detection': det_loss,
'segmentation': seg_loss,
'prediction': pred_loss,
'freespace': free_loss
})
# weights automatically learned: {detection: 2.1, segmentation: 0.8, ...}
6.4 PCGrad for Conflict Resolution
def pcgrad_update(task_losses, shared_params, optimizer):
"""Projecting Conflicting Gradients (PCGrad)."""
task_grads = []
# Compute per-task gradients
for loss in task_losses:
optimizer.zero_grad()
loss.backward(retain_graph=True)
grads = [p.grad.clone() for p in shared_params]
task_grads.append(grads)
    # Project conflicting gradients (simplified sketch: per-parameter projection
    # in a fixed task order; the original PCGrad projects full flattened
    # gradients in a random task order)
for i in range(len(task_grads)):
for j in range(len(task_grads)):
if i == j:
continue
# For each pair of tasks
for k in range(len(shared_params)):
g_i = task_grads[i][k]
g_j = task_grads[j][k]
# Check if gradients conflict (negative cosine similarity)
dot = (g_i * g_j).sum()
if dot < 0:
# Project g_i onto normal of g_j
task_grads[i][k] = g_i - (dot / (g_j.norm()**2 + 1e-8)) * g_j
# Sum projected gradients
optimizer.zero_grad()
for k, p in enumerate(shared_params):
p.grad = sum(task_grads[i][k] for i in range(len(task_grads)))
    optimizer.step()
7. Shared-Backbone Multi-Head Architecture
7.1 Recommended Architecture for the Reference Airside AV Stack
Rather than adopting a monolithic end-to-end model, the recommended approach for the reference airside AV stack is a shared-backbone, multi-head design that preserves the proven PointPillars detector while adding capabilities:
LiDAR Scans (4-8 RoboSense)
↓
Point Cloud Merge & Preprocessing
↓
┌─────────────────────────────┐
│ Shared PillarVFE Backbone │ ← existing PointPillars backbone
│ + SECOND FPN Neck │
└────────────┬────────────────┘
↓
┌────────┬────────┬──────────┬───────────┐
│ Det │ Seg │ FreeSpace│ Prediction │
│ Head │ Head │ Head │ Head │
│ (exist)│ (new) │ (new) │ (new) │
└────────┴────────┴──────────┴───────────┘
↓ ↓ ↓ ↓
Detections Semantic Free Space Motion
(boxes) Map (BEV) Grid (BEV) Forecasts
↓ ↓ ↓ ↓
┌────────────────────────────────────────┐
│ Frenet Planner (existing) │
│ + enhanced cost function using all │
│ multi-task outputs │
└────────────────────────────────────────┘
7.2 Task Head Designs
Detection Head (Existing — CenterHead):
- Already proven at 6.84ms on Orin with TensorRT
- No changes needed
Segmentation Head (New):
class BEVSegmentationHead(nn.Module):
"""Lightweight BEV segmentation from shared features."""
def __init__(self, in_channels=256, num_classes=14):
super().__init__()
self.head = nn.Sequential(
nn.Conv2d(in_channels, 128, 3, padding=1),
nn.BatchNorm2d(128),
nn.ReLU(),
nn.Conv2d(128, 64, 3, padding=1),
nn.BatchNorm2d(64),
nn.ReLU(),
nn.Conv2d(64, num_classes, 1)
)
def forward(self, bev_features):
        return self.head(bev_features)  # (B, C, H, W)
Estimated: 2-3ms on Orin (simple convolutions on shared BEV features).
Free Space Head (New):
class FreeSpaceHead(nn.Module):
"""Binary free/occupied classification for path planning."""
def __init__(self, in_channels=256):
super().__init__()
self.head = nn.Sequential(
nn.Conv2d(in_channels, 64, 3, padding=1),
nn.ReLU(),
nn.Conv2d(64, 1, 1),
nn.Sigmoid()
)
def forward(self, bev_features):
        return self.head(bev_features)  # (B, 1, H, W) probability of free
Estimated: 1-2ms on Orin.
Motion Prediction Head (New):
class MotionPredictionHead(nn.Module):
"""Predict future positions of detected objects."""
def __init__(self, in_channels=256, num_modes=6, future_steps=10):
super().__init__()
self.num_modes = num_modes
self.future_steps = future_steps
# Per-object features from detection queries
self.motion_encoder = nn.Sequential(
nn.Linear(in_channels, 256),
nn.ReLU(),
nn.Linear(256, 256)
)
# Multi-modal prediction
self.trajectory_decoder = nn.Linear(256, num_modes * future_steps * 2)
self.mode_probs = nn.Linear(256, num_modes)
def forward(self, object_features):
"""
Args:
object_features: (B, N, C) from detection head
Returns:
trajectories: (B, N, K, T, 2) — K modes, T timesteps, xy
mode_probs: (B, N, K) — probability per mode
"""
h = self.motion_encoder(object_features)
        traj = self.trajectory_decoder(h).reshape(
            *h.shape[:-1], self.num_modes, self.future_steps, 2
        )  # (B, N, K, T, 2), matching the docstring
        probs = F.softmax(self.mode_probs(h), dim=-1)
        return traj, probs
Estimated: 3-5ms on Orin (runs only for N detected objects, typically <50).
7.3 Total Multi-Task Timing
| Component | Standalone | Multi-Task (shared backbone) | Savings |
|---|---|---|---|
| Backbone (PillarVFE + FPN) | 5.0ms | 5.0ms (shared) | — |
| Detection head | 1.8ms | 1.8ms | 0% |
| Segmentation head | N/A (separate: 8ms) | 2.5ms | 69% |
| Free space head | N/A (separate: 5ms) | 1.5ms | 70% |
| Motion prediction | N/A (separate: 15ms) | 4.0ms | 73% |
| Total | ~34ms (4 separate models) | ~14.8ms | 56% |
The shared backbone saves ~19ms by amortizing the most expensive computation.
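Wiring the heads above onto the existing backbone, a thin wrapper module could look like the sketch below. The backbone/neck arguments stand in for the existing OpenPCDet PillarVFE and SECOND FPN modules, and the 'object_features' key is a hypothetical per-object feature output assumed to be exposed by the detection head:
import torch.nn as nn

class MultiTaskPointPillars(nn.Module):
    """Sketch: shared PointPillars backbone with parallel task heads."""
    def __init__(self, backbone, neck, det_head, num_seg_classes=14):
        super().__init__()
        self.backbone = backbone      # existing PillarVFE + scatter (placeholder)
        self.neck = neck              # existing SECOND FPN neck (placeholder)
        self.det_head = det_head      # existing CenterHead, unchanged
        self.seg_head = BEVSegmentationHead(256, num_seg_classes)
        self.free_head = FreeSpaceHead(256)
        self.motion_head = MotionPredictionHead(256)

    def forward(self, batch):
        bev = self.neck(self.backbone(batch))   # (B, 256, H, W) shared BEV features
        detections = self.det_head(bev)
        return {
            'detections': detections,
            'segmentation': self.seg_head(bev),
            'freespace': self.free_head(bev),
            # 'object_features' is a hypothetical per-object tensor pooled by the
            # detection head; the motion head consumes it
            'motion': self.motion_head(detections['object_features']),
        }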
8. Foundation Models as Unified Backbones
8.1 DINOv2 as Multi-Task Backbone
DINOv2 features contain rich semantic information usable for multiple tasks simultaneously (see 30-autonomy-stack/perception/overview/dinov2-foundation-models-driving.md):
class DINOv2MultiTaskPerception(nn.Module):
"""DINOv2 backbone with multi-task heads for camera perception."""
    def __init__(self):
        super().__init__()
        self.backbone = DINOv2ViTB14(frozen=True)
        self.adapter = nn.Sequential(
            nn.Linear(768, 256), nn.ReLU()
        )  # lightweight adapter to the driving domain (LoRA-style low-rank tuning is an alternative)
# Multi-task heads from shared DINOv2 features
self.det_head_2d = FCOS3DHead(256)
self.seg_head = SegFormerHead(256, 14)
self.depth_head = DepthHead(256)
self.lane_head = LaneDetHead(256)
def forward(self, images):
features = self.backbone(images) # (B, N, 768)
adapted = self.adapter(features) # (B, N, 256)
# Spatial reshape
H, W = images.shape[-2] // 14, images.shape[-1] // 14
spatial = adapted.reshape(-1, H, W, 256).permute(0, 3, 1, 2)
return {
'detections': self.det_head_2d(spatial),
'segmentation': self.seg_head(spatial),
'depth': self.depth_head(spatial),
'lanes': self.lane_head(spatial)
        }
8.2 PTv3 as LiDAR Multi-Task Backbone
Point Transformer v3 (PTv3) provides a unified LiDAR feature extractor (see 30-autonomy-stack/perception/overview/lidar-foundation-models.md):
| Task | Single-Task (separate backbone) | Multi-Task (shared PTv3) | Delta |
|---|---|---|---|
| 3D Detection (NDS) | 72.1 | 71.5 | -0.6 |
| Semantic Seg (mIoU) | 80.3 | 79.8 | -0.5 |
| Panoptic Seg (PQ) | 74.2 | 73.6 | -0.6 |
| Total compute | 3x backbone | 1x backbone | 67% savings |
Minimal accuracy loss (<1%) for 67% compute reduction.
9. Deployment Considerations for Orin
9.1 End-to-End Model Feasibility
| Model | A100 Latency | Orin Estimate | Feasible? | Key Bottleneck |
|---|---|---|---|---|
| UniAD | 100ms | 250ms+ | No | Dense BEV, serial cascade |
| SparseDrive | 67ms | 150ms | Borderline | Deformable attention |
| VAD-Tiny | 50ms | 80ms | Maybe | BEV encoder |
| StreamPETR | 40ms | 70ms | Yes | Image backbone |
| Multi-head PointPillars | 14.8ms | 14.8ms | Yes | Already on Orin |
9.2 Incremental Multi-Task Deployment
| Phase | Addition | Orin Latency | Benefit |
|---|---|---|---|
| Current | PointPillars detection only | 6.84ms | Baseline |
| +1 | Add segmentation head | 9.3ms | Road surface classification |
| +2 | Add free space head | 10.8ms | Explicit drivable area |
| +3 | Add motion prediction head | 14.8ms | Predictive planning |
| +4 | Add camera branch (future) | ~35ms | Redundant perception |
Each phase adds ~1.5-4ms and can be validated independently.
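One way to make the rollout explicit in code is a phase-gated head configuration; a sketch using the estimates from the table above (phase and flag names are illustrative):
# Phase-gated head enablement; latency figures are the estimates from the table
# above, not measurements.
DEPLOYMENT_PHASES = {
    'current': {'heads': ['detection'],                              'budget_ms': 6.84},
    'phase1':  {'heads': ['detection', 'segmentation'],              'budget_ms': 9.3},
    'phase2':  {'heads': ['detection', 'segmentation', 'freespace'], 'budget_ms': 10.8},
    'phase3':  {'heads': ['detection', 'segmentation', 'freespace',
                          'motion_prediction'],                      'budget_ms': 14.8},
}

def active_heads(phase: str) -> list[str]:
    return DEPLOYMENT_PHASES[phase]['heads']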
9.3 TensorRT Multi-Head Optimization
# TensorRT engine with multiple output heads
import tensorrt as trt
def build_multi_head_engine(onnx_path, engine_path):
"""Build TensorRT engine with shared backbone + multiple heads."""
builder = trt.Builder(TRT_LOGGER)
network = builder.create_network(1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH))
parser = trt.OnnxParser(network, TRT_LOGGER)
# Parse ONNX (exported with all heads)
parser.parse_from_file(onnx_path)
config = builder.create_builder_config()
config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 1 << 30)
# FP16 for backbone and heads
config.set_flag(trt.BuilderFlag.FP16)
# INT8 for backbone only (heads need FP16 for classification accuracy)
# ... calibration setup ...
# Mark multiple outputs
for i in range(network.num_outputs):
output = network.get_output(i)
print(f"Output {i}: {output.name}, shape: {output.shape}")
engine = builder.build_serialized_network(network, config)
with open(engine_path, 'wb') as f:
        f.write(engine)
10. Airside-Specific Multi-Task Design
10.1 Airside Task Taxonomy
| Task | Priority | Why | Existing? |
|---|---|---|---|
| 3D Object Detection | Critical | Core safety function | Yes (PointPillars) |
| Ground Segmentation | High | Drivable area identification | Yes (RANSAC) |
| Semantic Segmentation | Medium | Rich scene understanding | No |
| Free Space Estimation | High | Path planning input | No (implicit) |
| Motion Prediction | High | Predictive safety margins | No |
| Lane/Marking Detection | Medium | Taxiway following | No (map-based) |
| Object Classification (fine) | Medium | GSE type identification | Partial |
| Occupancy Prediction | Medium | Occluded area reasoning | No |
| Depth Estimation | Low (LiDAR primary) | Camera fallback | No |
10.2 Airside Segmentation Classes
For the segmentation head:
| ID | Class | Color | Priority | Description |
|---|---|---|---|---|
| 0 | background | black | Low | Sky, distant buildings |
| 1 | tarmac | gray | Medium | Paved apron surface |
| 2 | taxiway_marking | yellow | High | Yellow lines, guidance markings |
| 3 | aircraft_body | blue | Critical | Fuselage, wings |
| 4 | aircraft_engine | red | Critical | Jet engines (safety zone) |
| 5 | gse_vehicle | green | High | Baggage carts, tugs, trucks |
| 6 | ground_crew | orange | Critical | Personnel |
| 7 | jet_bridge | purple | Medium | Passenger boarding bridges |
| 8 | terminal | brown | Low | Terminal buildings |
| 9 | grass_area | lime | Medium | Non-drivable |
| 10 | safety_equipment | pink | Medium | Cones, barriers, chocks |
| 11 | construction | red-orange | High | Temporary obstacles |
| 12 | fod | white | Critical | Foreign object debris |
| 13 | water_puddle | cyan | Medium | Surface water |
10.3 Multi-Task Training Data Strategy
| Task | Public Data | Airside Data Needed | Labeling Cost |
|---|---|---|---|
| Detection | nuScenes pre-train | 2,000 frames | $16-30K |
| Segmentation | SemanticKITTI pre-train | 1,000 frames | $15-25K |
| Free space | Waymo pre-train | 500 frames | $5-10K |
| Motion prediction | nuScenes pre-train | 2,000 frames | $20-40K |
| Total | Free | 5,500 frames | $56-105K |
With multi-task shared labeling (annotate all tasks per frame): $40-75K (30% savings from shared annotation sessions).
11. Comparison and Recommendations
11.1 Architecture Comparison
| Architecture | Accuracy | Latency (Orin) | Certifiability | Complexity | Risk |
|---|---|---|---|---|---|
| Separate models | Baseline | 34ms (4 models) | High (modular) | Low | Low |
| Multi-head shared | +1-3% | 14.8ms | Medium | Medium | Low |
| UniAD-style | +5-10% | 250ms+ | Low (monolithic) | High | High |
| SparseDrive-style | +5-10% | 150ms | Low | High | High |
| VAD-Tiny | +3-5% | 80ms | Low | Medium | Medium |
11.2 Recommendation
Near-term (0-12 months): Multi-head shared backbone
- Preserve proven PointPillars detection
- Add segmentation, free space, motion prediction heads incrementally
- Each head validated independently before deployment
- Total: 14.8ms on Orin — 2.3x faster than the ~34ms of four separate models
Medium-term (12-24 months): Camera-LiDAR multi-task fusion
- Add camera branch with DINOv2/PTv3 features
- BEV fusion at feature level
- Enables redundant perception for certification
Long-term (24+ months): End-to-end unified (when Orin successor available)
- SparseDrive-style on NVIDIA Thor (~1000 TOPS)
- Full end-to-end training with planning objective
- Only after sufficient airside training data accumulated
12. Key Takeaways
UniAD (CVPR 2023 Best Paper) unified 6 tasks — improving planning L2 by 30% over modular baselines, but at 100ms on A100 (250ms+ on Orin — infeasible)
SparseDrive achieves 3x speedup over UniAD — fully sparse representation eliminates dense BEV bottleneck, 9 FPS on A100, but still 150ms estimated on Orin
Multi-head shared backbone saves 56% compute — 14.8ms vs 34ms for 4 separate models on Orin, by amortizing the PointPillars backbone across detection + segmentation + free space + prediction
Task interference is manageable — uncertainty-weighted loss (Kendall et al., 2018) automatically balances tasks with 1 learnable parameter per task. PCGrad resolves gradient conflicts at O(T²) per step
Positive transfer between tasks — detection + segmentation share boundary features (+1-2% mAP), tracking + prediction share temporal features (5-10% lower minADE), but detection + planning transfer is mixed
Incremental deployment is key — add heads one at a time (segmentation +2.5ms, free space +1.5ms, prediction +4ms), each validated independently before production
VAD-Tiny is the most Orin-feasible unified model — estimated 80ms with TensorRT, but still consumes most of the 100ms planning budget with less proven detection accuracy
PTv3 as shared LiDAR backbone loses <1% per task vs single-task for 67% compute reduction — the multi-task tax is small for a strong backbone
Airside segmentation needs 14 custom classes — from tarmac and taxiway markings (drivable) to aircraft engines and ground crew (critical safety zones)
Multi-task labeling saves 30% — annotating all tasks per frame in shared sessions ($40-75K) vs separate annotation campaigns ($56-105K)
Foundation models (DINOv2, PTv3) are natural multi-task backbones — their pre-trained features transfer well to multiple downstream tasks with LoRA adapters, reducing per-task fine-tuning cost
Certification favors modular over monolithic — shared-backbone multi-head architecture is decomposable for safety analysis (each head can be evaluated independently), while fully end-to-end models (UniAD) resist modular certification per ISO 3691-4 / UL 4600
Recommended timeline: Multi-head PointPillars now (14.8ms, proven) → camera-LiDAR fusion at 12-24 months → full end-to-end on Thor at 24+ months
Free space estimation is the highest-value addition — enables explicit drivable area reasoning in the Frenet planner at only 1.5ms additional cost, replacing implicit free-space inference from detection
Motion prediction from shared features achieves 70% of standalone accuracy — the multi-task prediction head at 4ms is a practical substitute for a dedicated 15ms prediction model, good enough for conservative airside speeds (≤30 km/h)
References
- Hu, Y. et al., "Planning-oriented Autonomous Driving (UniAD)," CVPR 2023 (Best Paper)
- Sun, W. et al., "SparseDrive: End-to-End Autonomous Driving via Sparse Scene Representation," 2024
- Jiang, B. et al., "VAD: Vectorized Scene Representation for Efficient Autonomous Driving," ICCV 2023
- Wang, S. et al., "Exploring Object-Centric Temporal Modeling for Efficient Multi-View 3D Object Detection (StreamPETR)," ICCV 2023
- Kendall, A. et al., "Multi-Task Learning Using Uncertainty to Weigh Losses," CVPR 2018
- Chen, Z. et al., "GradNorm: Gradient Normalization for Adaptive Loss Balancing in Deep Multitask Networks," ICML 2018
- Yu, T. et al., "Gradient Surgery for Multi-Task Learning (PCGrad)," NeurIPS 2020
- Liu, B. et al., "Conflict-Averse Gradient Descent for Multi-Task Learning (CAGrad)," NeurIPS 2021
- Li, Z. et al., "BEVFormer: Learning Bird's-Eye-View Representation from Multi-Camera Images via Spatiotemporal Transformers," ECCV 2022
- Wu, X. et al., "Point Transformer V3: Simpler, Faster, Stronger," CVPR 2024
Document generated for reference airside AV stack industry research, April 2026. Covers multi-task and unified perception architectures — for individual task details, see specific docs: detection (perception/openpcdet-centerpoint.md), segmentation (perception/lidar-semantic-segmentation.md), prediction (planning/motion-prediction.md), occupancy (world-models/occupancy-networks-comparison.md).