
Model Compression & Edge Deployment for Airside AV

Unified Guide: From Research Models to Real-Time Orin Inference

Last updated: 2026-04-11


Table of Contents

  1. The Edge Deployment Challenge
  2. Compression Technique Taxonomy
  3. Quantization: PTQ vs QAT
  4. Knowledge Distillation
  5. Pruning & Sparsity
  6. Architecture-Aware Compression
  7. NVIDIA Model Optimizer (ModelOpt)
  8. Model-Specific Recipes
  9. TensorRT Deployment Pipeline
  10. DLA Offloading Strategy
  11. Multi-Model Orchestration on Orin
  12. ROS Noetic Integration
  13. Validation & Safety
  14. Cost & Timeline
  15. References

1. The Edge Deployment Challenge

1.1 The Gap Between Research and Production

Model                     | Research GPU | Params | FP32 Latency | Target Orin Latency
PTv3 (segmentation)       | A100         | 46M    | 85ms         | <30ms
CenterPoint (detection)   | V100         | 9M     | 52ms         | <15ms
BEVFusion (multi-modal)   | A100         | 68M    | 120ms        | <40ms
DINOv2-L (backbone)       | A100         | 304M   | 45ms         | <20ms
FlashOcc (occupancy)      | A100         | 52M    | 28ms         | <15ms
FlatFormer (segmentation) | A100         | 18M    | 35ms         | <25ms
Alpamayo (VLA teacher)    | 8×H100       | 10B    | 2000ms       | N/A (distill only)

Research models run on A100/H100 GPUs with 40-80GB HBM and 300-700W TDP. NVIDIA Orin AGX provides 275 TOPS at 15-60W with 64GB unified memory. The 10-30x compute gap requires systematic compression.

1.2 Orin Compute Budget for the Reference Airside AV Stack

Available: 275 TOPS (INT8), 138 TFLOPS (FP16), 32GB or 64GB unified memory
Power envelope: 15W (min) to 60W (max) — airside vehicles are electric, power matters

Allocated per 100ms perception cycle (10Hz):
┌─────────────────────────────────┬──────────┬──────────┐
│ Component                        │ Latency  │ Memory   │
├─────────────────────────────────┼──────────┼──────────┤
│ LiDAR preprocessing              │ 5ms      │ 0.5 GB   │
│ 3D Segmentation (FlatFormer)     │ 25ms     │ 1.5 GB   │
│ 3D Detection (CenterPoint)       │ 12ms     │ 1.0 GB   │
│ Tracking (Kalman + association)   │ 3ms      │ 0.2 GB   │
│ Occupancy grid (nvblox)           │ 10ms     │ 1.0 GB   │
│ Localization (GTSAM)              │ 8ms      │ 0.5 GB   │
│ Planning (Frenet)                 │ 5ms      │ 0.3 GB   │
│ Safety monitoring                 │ 2ms      │ 0.1 GB   │
├─────────────────────────────────┼──────────┼──────────┤
│ TOTAL                             │ 70ms     │ 5.1 GB   │
│ Headroom                          │ 30ms     │ 58.9 GB  │
└─────────────────────────────────┴──────────┴──────────┘

Key constraint: Multiple models share the GPU. Even if one model fits in 30ms solo, concurrent execution with other models may cause contention. CUDA streams and TensorRT execution contexts enable time-multiplexing.


2. Compression Technique Taxonomy

Model Compression

├── Quantization (reduce precision)
│   ├── Post-Training Quantization (PTQ)     — no retraining
│   ├── Quantization-Aware Training (QAT)    — fine-tune with fake quantization
│   └── Mixed-Precision                      — different layers, different precision

├── Knowledge Distillation (train smaller model)
│   ├── Response distillation                — match teacher's output logits
│   ├── Feature distillation                 — match intermediate representations
│   └── Cross-modal distillation             — LiDAR ← camera teacher

├── Pruning (remove parameters)
│   ├── Structured pruning                   — remove entire channels/heads
│   ├── Unstructured pruning                 — remove individual weights
│   └── Task-aware safe pruning              — protect safety-critical paths

├── Architecture Search / Redesign
│   ├── Neural Architecture Search (NAS)     — automated architecture tuning
│   ├── Manual architecture reduction         — fewer layers/channels
│   └── Efficient operator substitution       — replace attention with linear

└── Pipeline-Level Optimization
    ├── Layer fusion (Conv+BN+ReLU)          — TensorRT automatic
    ├── Operator replacement                  — custom CUDA kernels
    └── Memory layout optimization            — NHWC vs NCHW for Orin DLA

2.1 Expected Compression by Technique

Technique                        | Speedup  | Accuracy Loss | Effort    | Combinable?
FP16 (from FP32)                 | 1.5-2x   | <0.1%         | Trivial   | Yes
INT8 PTQ                         | 2-4x     | 0.5-2%        | Low       | Yes
INT8 QAT                         | 2-4x     | 0.1-0.5%      | Medium    | Yes
Response distillation            | 3-8x     | 1-3%          | Medium    | Yes
Feature distillation             | 3-8x     | 0.5-2%        | High      | Yes
Structured pruning (30%)         | 1.3-1.5x | 0.5-1%        | Medium    | Yes
Structured pruning (50%)         | 1.8-2.2x | 1-3%          | Medium    | Yes
Layer fusion (TensorRT)          | 1.2-1.5x | 0%            | Automatic | Always
Combined (distill + INT8 + fuse) | 5-15x    | 1-3%          | High      |

3. Quantization: PTQ vs QAT

3.1 Post-Training Quantization (PTQ)

PTQ computes quantization scale factors from a calibration dataset — no retraining needed.

python
import tensorrt as trt
import numpy as np
import pycuda.driver as cuda
import pycuda.autoinit  # noqa: F401 — creates the CUDA context used below

class CalibrationDataset(trt.IInt8EntropyCalibrator2):
    """INT8 calibration dataset for LiDAR perception models.
    
    Implements the TensorRT entropy-calibrator interface; device buffers are
    handled with pycuda, as in the standard TensorRT calibration samples.
    """
    
    def __init__(self, data_dir, num_samples=200, cache_file='calibration_cache.bin'):
        trt.IInt8EntropyCalibrator2.__init__(self)
        self.samples = self._load_samples(data_dir, num_samples)
        self.current_idx = 0
        self.batch_size = 1
        self.cache_file = cache_file
        # One device buffer, sized for the largest calibration scan
        self.device_input = cuda.mem_alloc(max(s.nbytes for s in self.samples))
    
    def _load_samples(self, data_dir, num_samples):
        """Load representative LiDAR scans for calibration.
        
        IMPORTANT: Calibration data must be representative of deployment:
        - Include day AND night scans
        - Include wet AND dry conditions
        - Include empty apron AND busy turnaround
        - Include close range (5m) AND far range (100m+) objects
        """
        import glob
        files = sorted(glob.glob(f'{data_dir}/*.npy'))[:num_samples]
        return [np.ascontiguousarray(np.load(f).astype(np.float32)) for f in files]
    
    def get_batch_size(self):
        return self.batch_size
    
    def get_batch(self, names):
        if self.current_idx >= len(self.samples):
            return None  # no more batches: calibration is finished
        cuda.memcpy_htod(self.device_input, self.samples[self.current_idx])
        self.current_idx += 1
        return [int(self.device_input)]
    
    def reset(self):
        self.current_idx = 0
    
    def read_calibration_cache(self):
        # Reuse an existing cache to skip recalibration on engine rebuilds
        try:
            with open(self.cache_file, 'rb') as f:
                return f.read()
        except FileNotFoundError:
            return None
    
    def write_calibration_cache(self, cache):
        with open(self.cache_file, 'wb') as f:
            f.write(cache)


def build_int8_engine(onnx_path, calibrator, output_path):
    """Build TensorRT INT8 engine with PTQ calibration."""
    logger = trt.Logger(trt.Logger.WARNING)
    builder = trt.Builder(logger)
    network = builder.create_network(
        1 << int(trt.NetworkDefinitionCreationFlag.EXPLICIT_BATCH)
    )
    parser = trt.OnnxParser(network, logger)
    
    with open(onnx_path, 'rb') as f:
        if not parser.parse(f.read()):
            for i in range(parser.num_errors):
                print(parser.get_error(i))
            raise RuntimeError(f'ONNX parse failed: {onnx_path}')
    
    config = builder.create_builder_config()
    config.set_memory_pool_limit(trt.MemoryPoolType.WORKSPACE, 4 << 30)  # 4GB
    config.set_flag(trt.BuilderFlag.INT8)
    config.set_flag(trt.BuilderFlag.FP16)  # fallback for unsupported INT8 layers
    config.int8_calibrator = calibrator
    
    # Profile for dynamic shapes (variable point count)
    profile = builder.create_optimization_profile()
    profile.set_shape('points', 
                      min=(1, 30000, 4),    # minimum 30K points
                      opt=(1, 100000, 4),   # typical 100K points
                      max=(1, 200000, 4))   # maximum 200K points
    config.add_optimization_profile(profile)
    
    engine = builder.build_serialized_network(network, config)
    with open(output_path, 'wb') as f:
        f.write(engine)
    
    return output_path

PTQ Accuracy Impact by Model Type

Model        | FP32 → FP16 Loss | FP32 → INT8 PTQ Loss | Notes
CenterPoint  | 0.02% mAP        | 0.80% mAP            | Well-behaved, symmetric weights
PointPillars | 0.05% mAP        | 0.80% mAP            | Simple architecture, quantizes well
FlatFormer   | 0.1% mIoU        | 1.5-2.0% mIoU        | Attention layers sensitive
Cylinder3D   | 0.1% mIoU        | 2.0-3.0% mIoU        | Sparse conv quantization tricky
BEVFusion    | 0.2% mAP         | 3.0-5.0% mAP         | Multi-modal, needs careful calibration
FlashOcc     | 0.1% IoU         | 1.0-1.5% IoU         | 2D backbone quantizes well

3.2 Quantization-Aware Training (QAT)

When PTQ accuracy loss is unacceptable (>2%), QAT fine-tunes with simulated quantization:

python
import torch
from torch.quantization import QConfig, prepare_qat, convert

def qat_fine_tune(model, train_loader, calibration_loader, epochs=10, lr=1e-5):
    """
    Quantization-aware training for LiDAR perception models.
    
    Key insight: Start from a pre-trained FP32 model, insert fake quantization
    nodes, then fine-tune for a few epochs. The model learns to be robust to
    quantization noise.
    """
    # Step 1: Define quantization config
    model.qconfig = QConfig(
        activation=torch.quantization.FakeQuantize.with_args(
            observer=torch.quantization.MovingAverageMinMaxObserver,
            quant_min=-128, quant_max=127, dtype=torch.qint8
        ),
        weight=torch.quantization.FakeQuantize.with_args(
            observer=torch.quantization.MovingAveragePerChannelMinMaxObserver,
            quant_min=-128, quant_max=127, dtype=torch.qint8,
            ch_axis=0  # per-channel for weights
        )
    )
    
    # Step 2: Prepare model with fake quantization
    model.train()
    model_prepared = prepare_qat(model)
    
    # Step 3: Fine-tune (typically 5-20% of original training epochs)
    optimizer = torch.optim.Adam(model_prepared.parameters(), lr=lr)
    
    for epoch in range(epochs):
        for batch in train_loader:
            points, labels = batch
            pred = model_prepared(points)
            loss = torch.nn.functional.cross_entropy(pred, labels)
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
    
    # Step 4: Convert to actual INT8
    model_int8 = convert(model_prepared.eval())
    return model_int8

# QAT typically recovers 50-80% of PTQ accuracy loss
# Example: FlatFormer PTQ loses 2.0% mIoU → QAT recovers to 0.5% loss

3.3 Mixed-Precision Strategy

Not all layers need the same precision. TensorRT automatically finds the best per-layer precision when both INT8 and FP16 flags are set; a sketch for pinning specific layers manually follows the list below:

Layer type → Recommended precision:
  Conv/Linear (backbone):     INT8  — most parameters, biggest speedup
  Batch Normalization:        Folded into Conv (zero cost)
  Attention Q,K,V projections: FP16 — sensitive to quantization
  Attention softmax:          FP32  — numerical stability
  Sparse convolution:         FP16  — INT8 support limited in TorchSparse
  Output head (classification): FP16 — small, accuracy-sensitive
  Loss/gradient (training):   FP32  — never quantize during training
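
When TensorRT's automatic choice is not good enough, the mapping above can be enforced explicitly. Below is a minimal sketch, assuming the TensorRT 8.x Python API and assuming the exported ONNX graph uses 'softmax'/'qkv'/'attn' in its layer names; adjust the name matching to the actual graph:

python
import tensorrt as trt

def constrain_sensitive_layers(network, config):
    """Pin precision for quantization-sensitive layers; everything else
    stays eligible for INT8. Layer-name matching is an assumption about
    how the exported ONNX graph names its nodes."""
    config.set_flag(trt.BuilderFlag.INT8)
    config.set_flag(trt.BuilderFlag.FP16)
    # Without this flag, per-layer precisions are treated as hints only
    config.set_flag(trt.BuilderFlag.OBEY_PRECISION_CONSTRAINTS)

    for i in range(network.num_layers):
        layer = network.get_layer(i)
        name = layer.name.lower()
        if 'softmax' in name:
            layer.precision = trt.DataType.FLOAT   # numerical stability
        elif 'qkv' in name or 'attn' in name:
            layer.precision = trt.DataType.HALF    # quantization-sensitive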

4. Knowledge Distillation

4.1 Why Distillation for Airside

Distillation enables deploying a small model that behaves like a large model:

  • Teacher: PTv3-Large (46M params, 82.7% mIoU, runs on A100)
  • Student: FlatFormer-Small (8M params, 65% mIoU baseline → 72% after distillation, runs on Orin)

The teacher transfers its "dark knowledge" — the probability distribution over all classes, not just the argmax — to the student.

4.2 Response-Based Distillation

python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, 
                      temperature=4.0, alpha=0.7):
    """
    Combined distillation + hard label loss.
    
    Args:
        student_logits: (B, C, N) student predictions
        teacher_logits: (B, C, N) teacher predictions (detached)
        labels: (B, N) ground truth class labels
        temperature: softens probability distributions (higher = softer)
        alpha: weight for distillation loss vs hard loss
    
    Returns:
        combined loss
    """
    # Soft targets from teacher (dark knowledge)
    soft_student = F.log_softmax(student_logits / temperature, dim=1)
    soft_teacher = F.softmax(teacher_logits / temperature, dim=1)
    distill_loss = F.kl_div(soft_student, soft_teacher, reduction='batchmean')
    distill_loss *= temperature ** 2  # scale gradient magnitude
    
    # Hard targets from ground truth
    hard_loss = F.cross_entropy(student_logits, labels)
    
    return alpha * distill_loss + (1 - alpha) * hard_loss

# Training loop
for points, labels in train_loader:
    with torch.no_grad():
        teacher_logits = teacher_model(points)    # large frozen teacher
    student_logits = student_model(points)        # small student
    
    loss = distillation_loss(student_logits, teacher_logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

4.3 Feature-Based Distillation

Match intermediate feature maps, not just output logits:

python
def feature_distillation_loss(student_features, teacher_features, projectors):
    """
    Match intermediate representations from multiple layers.
    
    student_features: list of (B, C_s, N) tensors from student layers
    teacher_features: list of (B, C_t, N) tensors from teacher layers
    projectors: list of nn.Linear(C_s, C_t) alignment layers
    """
    total_loss = 0
    for s_feat, t_feat, proj in zip(student_features, teacher_features, projectors):
        # Project student features to teacher dimension
        s_aligned = proj(s_feat.permute(0, 2, 1)).permute(0, 2, 1)
        # L2 distance between normalized features
        s_norm = F.normalize(s_aligned, dim=1)
        t_norm = F.normalize(t_feat.detach(), dim=1)
        total_loss += (1 - (s_norm * t_norm).sum(dim=1)).mean()
    
    return total_loss / len(student_features)

4.4 Cross-Modal Distillation (Camera → LiDAR)

TinyBEV (ICCV 2025 Workshop) demonstrates distilling a multi-modal teacher (camera + LiDAR) into a camera-only student. For airside, the inverse is more useful (a loss sketch follows the summary below):

Teacher: Camera + LiDAR multi-modal model (BEVFusion)
         - Uses 6 cameras + 4 LiDAR for training
         - Rich texture + depth information
Student: LiDAR-only model (FlatFormer)
         - Deployed with LiDAR only (the reference airside AV stack's sensor suite)
         - Inherits camera-informed features without camera at inference

Benefit: LiDAR model learns texture-aware features from camera teacher
         e.g., distinguishing hi-vis vest (person) from orange cone (equipment)
         purely from LiDAR geometry + intensity, informed by camera knowledge
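
A minimal loss sketch for this setup, assuming both teacher and student expose a BEV feature map of shape (B, C, H, W); the 1x1-conv adapter and tensor shapes are illustrative, not part of the reference stack:

python
import torch.nn.functional as F

def cross_modal_bev_distill_loss(student_bev, teacher_bev, adapter):
    """Align LiDAR-only student BEV features with a frozen camera+LiDAR teacher.
    
    student_bev: (B, C_s, H, W) from the LiDAR-only student
    teacher_bev: (B, C_t, H, W) from the multi-modal teacher
    adapter:     nn.Conv2d(C_s, C_t, kernel_size=1) channel-alignment layer
    """
    s = F.normalize(adapter(student_bev), dim=1)
    t = F.normalize(teacher_bev.detach(), dim=1)
    # Cosine distance per BEV cell, averaged over the grid and batch
    return (1.0 - (s * t).sum(dim=1)).mean()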

4.5 Distillation Results for Perception Models

Teacher   | Student      | Task     | Teacher Acc | Student (no distill) | Student (distilled) | Speedup
PTv3-L    | FlatFormer-S | Seg      | 82.7% mIoU  | 65.0%                | 72.1%               | 5x
BEVFusion | CenterPoint  | Det      | 72.9% mAP   | 65.8%                | 69.4%               | 4x
UniAD     | TinyBEV      | Multi    |             |                      | Within 2-3%         | 5-8x
DINOv2-L  | DINOv2-S     | Backbone |             |                      | Within 1.5%         | 6x

5. Pruning & Sparsity

5.1 Structured Pruning

Remove entire channels, attention heads, or layers — this produces a directly smaller model that needs no special sparse-hardware support:

python
import torch
import torch.nn.utils.prune as prune

def structured_prune_model(model, amount=0.3):
    """
    Zero out 30% of output channels via L2-norm structured pruning.
    
    Note: torch.nn.utils.prune only zeroes the selected channels; tensor
    shapes stay the same. To actually shrink the model, physically remove
    the zeroed channels afterwards (e.g., with torch-pruning or manual layer
    surgery); the resulting smaller dense model then runs faster WITHOUT
    needing sparse tensor support.
    """
    for name, module in model.named_modules():
        if isinstance(module, (torch.nn.Conv2d, torch.nn.Linear)):
            # n=2: rank output channels (dim=0) by L2 norm, zero the lowest 30%
            prune.ln_structured(module, name='weight', amount=amount, n=2, dim=0)
            prune.remove(module, 'weight')  # bake the zeros into the weight tensor
    
    return model

5.2 Task-Aware Safe Pruning for Safety-Critical Models

Recent work (2025) shows that naive pruning can disproportionately damage rare class detection — exactly the safety-critical classes (personnel, FOD) we care about most:

python
def safety_aware_pruning(model, train_loader, safety_classes, amount=0.3):
    """
    Prune model while protecting channels most important for safety classes.
    
    Key idea: Compute channel importance separately for safety-critical classes.
    Only prune channels that have low importance for ALL classes including safety ones.
    """
    # Step 1: Compute per-channel importance for each class
    # (compute_channel_importance is a placeholder for any per-class channel
    #  saliency measure, e.g., gradient- or Taylor-expansion-based importance)
    importance = {}  # {class_id: {layer_name: importance_scores}}
    
    for cls_id in range(model.num_classes):
        importance[cls_id] = compute_channel_importance(model, train_loader, cls_id)
    
    # Step 2: For each layer, protect channels important for safety classes
    for name, module in model.named_modules():
        if not hasattr(module, 'weight'):
            continue
        
        # Aggregate importance: max across safety classes
        safety_importance = torch.stack([
            importance[cls_id][name] for cls_id in safety_classes
        ]).max(dim=0).values
        
        # Only prune channels with low safety importance
        threshold = torch.quantile(safety_importance, amount)
        keep = safety_importance > threshold            # (out_channels,) boolean
        
        # Expand the per-channel mask to the full weight shape that
        # prune.custom_from_mask expects
        mask = keep.view(-1, *([1] * (module.weight.dim() - 1))).expand_as(module.weight)
        prune.custom_from_mask(module, name='weight', mask=mask)

5.3 Pruning + Distillation Pipeline

The NVIDIA Model Optimizer recommended pipeline:

Step 1: Train full model to convergence                    → 82% mIoU
Step 2: Structured pruning (30% channels)                  → 78% mIoU (-4%)
Step 3: Knowledge distillation from full model (10 epochs)  → 80% mIoU (+2% recovered)
Step 4: INT8 QAT (5 epochs)                                → 79.5% mIoU (-0.5%)
Step 5: TensorRT layer fusion + optimization               → 0% additional loss

Final: 79.5% mIoU at 3-5x speedup over original FP32 model

6. Architecture-Aware Compression

6.1 Efficient Backbone Substitution

Instead of compressing a large model, start with an efficient architecture:

Large Model      | Efficient Alternative | Params Ratio | Accuracy Gap    | Orin Speedup
PTv3-Large       | FlatFormer            | 0.39x        | -10% mIoU       | 4-5x faster
ResNet-101 (2D)  | MobileNetV3-L         | 0.14x        | -3% mAP         | 6x faster
DINOv2-Large     | DINOv2-Small          | 0.07x        | -4% on ImageNet | 8x faster
Swin-L (BEV)     | EfficientNet-B3       | 0.12x        | -5% mAP         | 5x faster
ViT-L (backbone) | FastViT-SA36          | 0.05x        | -3% on ImageNet | 10x faster

6.2 Operator Substitution for Orin

Some operations are efficient on desktop GPUs but slow on Orin:

Expensive Op                     | Orin-Friendly Replacement             | Speedup
Multi-head attention (dense)     | Grouped attention (fewer heads)       | 1.5-2x
Deformable attention             | Standard attention + learned offsets  | 2-3x
Sparse 3D conv (MinkowskiEngine) | TorchSparse 2.x                       | 2-3x
Trilinear interpolation (3D)     | Nearest-neighbor + learned upscale    | 1.5x
Dynamic convolution              | Static conv + channel attention       | 2x
FlashAttention v2                | TensorRT fused MHA plugin             | 1.3x

7. NVIDIA Model Optimizer (ModelOpt)

7.1 Overview

NVIDIA Model Optimizer (formerly TensorRT Model Optimizer) is the unified compression toolkit:

bash
pip install nvidia-modelopt

# Supported techniques:
# - PTQ (INT8, FP8, INT4, FP4)
# - QAT
# - Structured pruning (GradNorm-based)
# - Knowledge distillation
# - Sparsity (2:4 structured)
# - NAS (Neural Architecture Search)

7.2 PTQ with ModelOpt

python
import torch
import modelopt.torch.quantization as mtq

# One-line PTQ quantization (calibration_forward_loop runs the model over the
# calibration set so ModelOpt can collect activation ranges)
model_quantized = mtq.quantize(model, mtq.INT8_DEFAULT_CFG,
                               forward_loop=calibration_forward_loop)

# Export to ONNX for TensorRT
torch.onnx.export(model_quantized, dummy_input, 'model_int8.onnx',
                  opset_version=17)

7.3 2:4 Structured Sparsity

Orin's GPU supports 2:4 sparsity natively — for every 4 consecutive weights, at most 2 are non-zero. TensorRT exploits this for ~2x speedup with minimal accuracy loss:

python
import modelopt.torch.sparsity as mts

# Apply 2:4 structured sparsity
model_sparse = mts.sparsify(model, mode='2:4')

# Fine-tune to recover accuracy (typically 5-10 epochs)
for epoch in range(10):
    train_one_epoch(model_sparse, train_loader, optimizer)

# Export — TensorRT automatically uses sparse tensor cores
torch.onnx.export(model_sparse, dummy_input, 'model_sparse.onnx')

8. Model-Specific Recipes

8.1 CenterPoint (3D Detection)

Starting point: CenterPoint-Pillar (9M params, 65.8% mAP nuScenes)
  Step 1: FP16 TensorRT conversion                         → 12ms Orin, 65.7% mAP
  Step 2: INT8 PTQ (200 calibration scans)                  → 7ms Orin, 65.0% mAP
  Step 3: Pillar size optimization (0.2m→0.25m)             → 5ms Orin, 64.5% mAP
  
  Final: 5ms, 64.5% mAP — fits comfortably in budget
  Already achieved in OpenPCDet/Lidar_AI_Solution reference code

8.2 FlatFormer (Segmentation)

Starting point: FlatFormer (18M params, 70% mIoU estimated)
  Step 1: Reduce embed_dim 128→96, num_layers 4→3           → ~12M params, 67% mIoU
  Step 2: FP16 TensorRT                                     → 35ms Orin, 67% mIoU  
  Step 3: INT8 PTQ                                           → 22ms Orin, 65.5% mIoU
  Step 4: QAT recovery (10 epochs)                           → 22ms Orin, 66.5% mIoU
  Step 5: 2:4 sparsity                                       → 18ms Orin, 66.0% mIoU
  
  Final: 18ms, 66% mIoU — within 30ms budget

8.3 FlashOcc (Camera Occupancy)

Starting point: FlashOcc-R50 (52M params, 32.0% mIoU Occ3D)
  Step 1: Replace ResNet-50 with MobileNetV3-L              → 18M params, 28% mIoU
  Step 2: FP16 TensorRT + C2H plugin                        → 5ms Orin, 28% mIoU
  Step 3: INT8 PTQ                                           → 3ms Orin, 27% mIoU
  
  Final: 3ms, 27% mIoU — extremely fast, adequate for safety grid
  Note: FlashOcc already designed for speed — compression gains are smaller

8.4 DINOv2 Backbone (Feature Extraction)

Starting point: DINOv2-Large (304M params)
  Option A: Use DINOv2-Small (22M params) + LoRA adapters
            → 22M + 0.5M LoRA = 22.5M params
            → FP16: 8ms Orin, within 1.5% of Large accuracy with task-specific adapter
  
  Option B: Distill DINOv2-Large → DINOv2-Small
            → Additional 1-2% accuracy recovery over Option A
            → Requires training pipeline (8 GPU-hours on A100)
  
  Recommendation: Option A (simpler, nearly as good)
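
A hedged sketch of Option A using the Hugging Face transformers and peft libraries; the checkpoint name and target_modules list are assumptions about the public DINOv2-Small release and should be checked against the attention-projection names in the backbone actually used:

python
from transformers import AutoModel
from peft import LoraConfig, get_peft_model

# Assumed public checkpoint; swap for the backbone actually used in the stack
backbone = AutoModel.from_pretrained('facebook/dinov2-small')

lora_cfg = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=['query', 'key', 'value'],  # ViT attention projections (assumed names)
)
backbone = get_peft_model(backbone, lora_cfg)
backbone.print_trainable_parameters()  # roughly 0.5M trainable vs ~22M frozen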

8.5 VLA Distillation (Alpamayo → Edge Policy)

Alpamayo is a 10B-parameter Vision-Language-Action model. It cannot run on Orin. The deployment strategy is distillation to a small policy:

Teacher: Alpamayo 10B (runs on cloud/datacenter)
Student: FastViT-SA24 + Transformer-S (20M params, runs on Orin)

Distillation strategy:
  1. Run Alpamayo on recorded driving logs → generate trajectory labels
  2. Train student to match Alpamayo trajectories (behavioral cloning)
  3. Add DAgger: deploy student, query Alpamayo for corrections
  4. After 100K frames: student achieves 85-90% of teacher performance
  
  Student inference on Orin: ~15ms (FP16), ~10ms (INT8)
  Accuracy: 85-90% of Alpamayo, much better than rule-based planning
  
  Note: Alpamayo has non-commercial license — distillation may inherit
  license restrictions. Use Cosmos-based alternatives for commercial deployment.
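
A minimal behavioral-cloning sketch for step 2 of the strategy above; student_policy and the dataset of (observation, teacher-trajectory) pairs are illustrative placeholders, and the teacher trajectories are assumed to have been generated offline in step 1:

python
import torch.nn.functional as F
from torch.utils.data import DataLoader

def bc_epoch(student_policy, log_dataset, optimizer, device='cuda'):
    """One epoch of behavioral cloning against teacher-labeled trajectories."""
    loader = DataLoader(log_dataset, batch_size=32, shuffle=True)
    student_policy.train()
    for obs, teacher_traj in loader:          # teacher_traj: (B, T, 2) x/y waypoints
        pred_traj = student_policy(obs.to(device))
        # Smooth L1 is less sensitive to occasional noisy teacher labels than L2
        loss = F.smooth_l1_loss(pred_traj, teacher_traj.to(device))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()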

9. TensorRT Deployment Pipeline

9.1 Complete Pipeline

bash
#!/bin/bash
# Full model deployment pipeline: PyTorch → ONNX → TensorRT → ROS

# ===== Step 1: Export PyTorch to ONNX =====
python export_model.py \
    --model flatformer \
    --weights flatformer_airside_v1.pth \
    --output flatformer.onnx \
    --opset 17 \
    --dynamic-batch \
    --simplify  # onnx-simplifier removes redundant ops

# ===== Step 2: Validate ONNX =====
python -c "
import onnx
model = onnx.load('flatformer.onnx')
onnx.checker.check_model(model)
print(f'Inputs: {[i.name for i in model.graph.input]}')
print(f'Outputs: {[o.name for o in model.graph.output]}')
"

# ===== Step 3: Build TensorRT engine (ON ORIN TARGET) =====
# IMPORTANT: Engine must be built on the target device
# Engines are NOT portable between GPU architectures

# FP16 engine (safe default)
/usr/src/tensorrt/bin/trtexec \
    --onnx=flatformer.onnx \
    --saveEngine=flatformer_fp16.engine \
    --fp16 \
    --workspace=4096 \
    --minShapes=points:1x30000x4 \
    --optShapes=points:1x100000x4 \
    --maxShapes=points:1x200000x4 \
    --verbose 2>&1 | tee build_fp16.log

# INT8 engine (fastest)
/usr/src/tensorrt/bin/trtexec \
    --onnx=flatformer.onnx \
    --saveEngine=flatformer_int8.engine \
    --int8 --fp16 \
    --calib=calibration_cache.bin \
    --workspace=4096 \
    --minShapes=points:1x30000x4 \
    --optShapes=points:1x100000x4 \
    --maxShapes=points:1x200000x4

# ===== Step 4: Benchmark =====
/usr/src/tensorrt/bin/trtexec \
    --loadEngine=flatformer_int8.engine \
    --iterations=1000 \
    --warmUp=100 \
    --avgRuns=100
# Output: min/avg/max latency, throughput, GPU utilization

# ===== Step 5: Validate accuracy =====
python validate_engine.py \
    --engine flatformer_int8.engine \
    --test-data /data/airside_test/ \
    --metrics miou,per_class_iou,latency_p99 \
    --compare-with flatformer_fp16.engine  # ensure INT8 ≈ FP16

9.2 Common TensorRT Pitfalls on Orin

Issue                      | Symptom                  | Fix
Dynamic shapes OOM         | Engine build fails       | Reduce maxShapes or increase workspace
INT8 accuracy drop >3%     | mIoU plummets            | Use QAT instead of PTQ
Sparse conv not supported  | ONNX export fails        | Use TorchSparse ONNX plugin or convert to dense
Batch norm folding failure | Accuracy mismatch        | Fold BN manually before export
DLA incompatible ops       | DLA fallback to GPU      | Check --useDLACore=0 --allowGPUFallback
Engine not portable        | Crash on different Orin  | Always build engine ON the target device
Attention mask dynamic     | Shape mismatch           | Pre-compute attention masks for each input size

10. DLA Offloading Strategy

10.1 Orin DLA Capabilities

Orin has 2 Deep Learning Accelerators (DLAs), each providing ~50 TOPS INT8 at very low power (~5W each):

DLA supported ops: Conv2D, Deconv, FC, Pool, Activation (ReLU/Sigmoid/Tanh), 
                   BatchNorm, ElementWise, Scale, Softmax, Concat
DLA NOT supported: Sparse Conv, Attention, Custom CUDA, most transformer ops

10.2 Model-to-Accelerator Assignment

GPU:   FlatFormer (segmentation) — attention ops, sparse/irregular
       CenterPoint backbone — sparse pillar conv
       
DLA 0: CenterPoint detection head — simple conv + FC layers
       Post-processing CNN refinements
       
DLA 1: SalsaNext (lightweight backup segmentation)
       Binary moving/static classifier
       Simple thermal fusion network (if thermal cameras added)

10.3 DLA Engine Build

bash
# Build engine targeting DLA core 0 with GPU fallback
/usr/src/tensorrt/bin/trtexec \
    --onnx=detection_head.onnx \
    --saveEngine=detection_head_dla0.engine \
    --int8 \
    --useDLACore=0 \
    --allowGPUFallback \
    --workspace=1024

# Check DLA utilization
# TensorRT reports which layers run on DLA vs GPU fallback
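
The same targeting can also be set when building engines from the TensorRT Python API instead of trtexec; a sketch assuming TensorRT 8.x, intended to pair with the build_int8_engine function from section 3.1:

python
import tensorrt as trt

def configure_for_dla(config, dla_core=0):
    """Place supported layers on a DLA core and fall back to GPU for the rest."""
    config.set_flag(trt.BuilderFlag.INT8)           # DLA requires INT8 or FP16
    config.set_flag(trt.BuilderFlag.FP16)
    config.default_device_type = trt.DeviceType.DLA
    config.DLA_core = dla_core                      # 0 or 1 on Orin
    config.set_flag(trt.BuilderFlag.GPU_FALLBACK)   # unsupported ops run on GPU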

11. Multi-Model Orchestration on Orin

11.1 Concurrent Execution with CUDA Streams

cpp
// Multi-model inference manager for Orin
class MultiModelInference {
public:
    MultiModelInference() {
        // Create separate CUDA streams for each model
        cudaStreamCreate(&seg_stream_);      // segmentation
        cudaStreamCreate(&det_stream_);      // detection
        cudaStreamCreate(&occ_stream_);      // occupancy
    }
    
    void runPerceptionCycle(const PointCloud& input) {
        // Shared preprocessing on default stream
        auto preprocessed = preprocess(input);
        cudaDeviceSynchronize();
        
        // Launch all models concurrently on separate streams
        seg_engine_->enqueueV2(seg_buffers_, seg_stream_, nullptr);
        det_engine_->enqueueV2(det_buffers_, det_stream_, nullptr);
        occ_engine_->enqueueV2(occ_buffers_, occ_stream_, nullptr);
        
        // Wait for all to complete
        cudaStreamSynchronize(seg_stream_);
        cudaStreamSynchronize(det_stream_);
        cudaStreamSynchronize(occ_stream_);
        
        // Post-process results (can also be parallelized)
        auto seg_result = postprocessSeg();
        auto det_result = postprocessDet();
        auto occ_result = postprocessOcc();
    }
    
private:
    cudaStream_t seg_stream_, det_stream_, occ_stream_;
    TrtEngine* seg_engine_, *det_engine_, *occ_engine_;
};

11.2 Orin Power Mode Selection

Mode     | GPU Clock | CPU Clock | DLA       | Power | Use Case
MAXN     | 1.3 GHz   | 2.2 GHz   | 2x active | 60W   | Full autonomous operation
MODE_50W | 1.0 GHz   | 1.5 GHz   | 2x active | 50W   | Normal operation
MODE_30W | 0.8 GHz   | 1.2 GHz   | 1x active | 30W   | Low-speed maneuvering
MODE_15W | 0.6 GHz   | 0.8 GHz   | 1x active | 15W   | Idle/monitoring

Recommendation: Use MODE_50W as default for balance of performance and thermal management. Switch to MAXN during complex scenarios (busy turnaround, multiple aircraft).

bash
# Set power mode (persistent across reboots)
sudo nvpmodel -m 2  # MODE_50W
sudo jetson_clocks   # lock clocks to max for consistent latency

12. ROS Noetic Integration

12.1 TensorRT Inference Node

cpp
#include <ros/ros.h>
#include <sensor_msgs/PointCloud2.h>
#include <NvInfer.h>
#include <cuda_runtime_api.h>
#include <vector>

class TrtInferenceNode {
public:
    TrtInferenceNode(ros::NodeHandle& nh) {
        // Load engine
        std::string engine_path;
        nh.param<std::string>("engine_path", engine_path, "model.engine");
        engine_ = loadEngine(engine_path);
        context_ = engine_->createExecutionContext();
        
        // Allocate GPU buffers
        allocateBuffers();
        
        // ROS interface
        sub_ = nh.subscribe("/airside_av/lidar/aggregated", 1,
                           &TrtInferenceNode::callback, this);
        pub_ = nh.advertise<perception_msgs::SemanticPointCloud>(
               "/perception/segmentation", 1);
        
        ROS_INFO("TensorRT engine loaded: %s (%.1f MB)", 
                 engine_path.c_str(), getEngineSize() / 1e6);
    }
    
private:
    void callback(const sensor_msgs::PointCloud2::ConstPtr& msg) {
        auto t0 = ros::Time::now();
        
        // 1. Copy points to GPU input buffer
        int num_points = copyPointsToGPU(msg);
        
        // 2. Set dynamic shape for this input
        context_->setInputShape("points", 
            nvinfer1::Dims3{1, num_points, 4});
        
        // 3. Execute inference
        context_->enqueueV2(buffers_.data(), stream_, nullptr);
        cudaStreamSynchronize(stream_);
        
        // 4. Copy results back to CPU
        auto result = copyResultsFromGPU(num_points);
        
        // 5. Publish
        publishResult(msg->header, result);
        
        double elapsed = (ros::Time::now() - t0).toSec() * 1000;
        ROS_DEBUG_THROTTLE(5.0, "Inference: %.1fms (%d points)", 
                          elapsed, num_points);
    }
    
    ros::Subscriber sub_;
    ros::Publisher pub_;
    nvinfer1::ICudaEngine* engine_{nullptr};
    nvinfer1::IExecutionContext* context_{nullptr};
    std::vector<void*> buffers_;   // device input/output buffer pointers
    cudaStream_t stream_{};
};

12.2 Model Hot-Swap for OTA Updates

cpp
class ModelManager {
    /**
     * Supports swapping TensorRT engines at runtime without restarting ROS.
     * Used for OTA model updates: new engine → validate → swap → old as fallback.
     */
public:
    bool swapEngine(const std::string& new_engine_path) {
        // Step 1: Load new engine in background
        auto new_engine = loadEngine(new_engine_path);
        if (!new_engine) {
            ROS_ERROR("Failed to load new engine: %s", new_engine_path.c_str());
            return false;
        }
        
        // Step 2: Run validation on test inputs
        if (!validateEngine(new_engine, test_inputs_)) {
            ROS_WARN("New engine failed validation, keeping current");
            return false;
        }
        
        // Step 3: Swap under a short mutex-protected critical section
        {
            std::lock_guard<std::mutex> lock(engine_mutex_);
            old_engine_ = std::move(active_engine_);  // keep as fallback
            active_engine_ = std::move(new_engine);
        }
        
        ROS_INFO("Engine swapped successfully: %s", new_engine_path.c_str());
        return true;
    }
};

13. Validation & Safety

13.1 Compression Validation Protocol

Before deploying any compressed model, validate against the uncompressed baseline:

python
def validate_compression(original_engine, compressed_engine, test_data):
    """
    Validate that compression hasn't degraded safety-critical metrics.
    
    GATE CRITERIA (all must pass):
    1. Overall mIoU drop < 3%
    2. Personnel class IoU drop < 1%  (CRITICAL)
    3. Aircraft class IoU drop < 2%   (CRITICAL)
    4. No new false negatives for persons within 20m
    5. Latency P99 < target (e.g., 30ms)
    6. No inference failures over 10,000 consecutive frames
    """
    results = {
        'overall_miou_drop': 0,
        'person_iou_drop': 0,
        'aircraft_iou_drop': 0,
        'new_false_negatives': 0,
        'latency_p99': 0,
        'inference_failures': 0,
    }
    
    for sample in test_data:
        orig_pred = run_engine(original_engine, sample)
        comp_pred = run_engine(compressed_engine, sample)
        
        # Check for new false negatives (compressed misses, original detects),
        # restricted to the safety-relevant range (< 20 m)
        orig_person = orig_pred == PERSON_CLASS
        comp_person = comp_pred == PERSON_CLASS
        range_mask = sample.ranges < 20.0
        results['new_false_negatives'] += (orig_person & ~comp_person & range_mask).sum()
        
        # Accumulate per-class IoU drops and per-frame latency here as well
        # (they feed the remaining gate metrics)
    
    # Gate decisions
    gates = {
        'overall_miou': results['overall_miou_drop'] < 3.0,
        'person_safety': results['person_iou_drop'] < 1.0,
        'aircraft_safety': results['aircraft_iou_drop'] < 2.0,
        'no_new_fn': results['new_false_negatives'] == 0,
        'latency': results['latency_p99'] < 30.0,
        'stability': results['inference_failures'] == 0,
    }
    
    passed = all(gates.values())
    return passed, gates, results

13.2 Simplex Safety Integration

Compressed models serve as the Advanced Controller (AC) in the Simplex architecture; the Baseline Controller (BC) is the classical RANSAC pipeline that does not depend on ML. The decision rule is sketched in code after the diagram:

┌─────────────────────────────────┐
│          Decision Module         │
│  if AC_confidence > threshold   │
│     AND AC_latency < deadline   │
│     AND no GPU errors:          │
│       → Use AC (neural model)   │
│  else:                          │
│       → Use BC (RANSAC)         │
└─────────────────────────────────┘
         │              │
    ┌────┴────┐    ┌────┴────┐
    │   AC    │    │   BC    │
    │ Neural  │    │ RANSAC  │
    │ FlatF.  │    │ Classic │
    │ INT8    │    │ No ML   │
    └─────────┘    └─────────┘
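
A minimal sketch of the decision rule shown in the diagram; the field names and thresholds are illustrative placeholders rather than the reference stack's actual interfaces:

python
from dataclasses import dataclass

@dataclass
class ControllerOutput:
    mean_confidence: float   # aggregate confidence for this perception cycle
    latency_ms: float        # measured inference latency
    gpu_error: bool          # CUDA/TensorRT error flag for this cycle

def select_controller(ac: ControllerOutput, bc: ControllerOutput,
                      min_confidence: float = 0.6,
                      deadline_ms: float = 30.0) -> ControllerOutput:
    """Use the neural AC only when it is demonstrably healthy this cycle."""
    ac_healthy = (not ac.gpu_error
                  and ac.mean_confidence > min_confidence
                  and ac.latency_ms < deadline_ms)
    return ac if ac_healthy else bc   # otherwise fall back to the RANSAC BC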

14. Cost & Timeline

14.1 Compression Engineering Effort

Task                                   | Duration  | Prerequisites
ONNX export + TensorRT FP16            | 1-2 days  | Trained model
INT8 PTQ calibration                   | 2-3 days  | 200 representative scans
QAT fine-tuning                        | 1 week    | Training pipeline, GPU access
Knowledge distillation                 | 2-3 weeks | Teacher model, training data
Structured pruning + recovery          | 1-2 weeks | Training pipeline
Full pipeline (distill + prune + INT8) | 4-6 weeks | All above
ROS integration + testing              | 2-3 weeks | TensorRT engine
Validation + safety certification      | 2-4 weeks | Test dataset, gate criteria

14.2 Compute Cost

Task                  | Hardware        | Duration | Cloud Cost
Distillation training | 4× A100         | 3 days   | $1,500
QAT fine-tuning       | 1× A100         | 1 day    | $125
Pruning + recovery    | 1× A100         | 2 days   | $250
INT8 calibration      | 1× Orin (local) | 2 hours  | $0
Validation runs       | 1× Orin (local) | 8 hours  | $0
Total compute         |                 |          | ~$2,000

14.3 Expected Outcomes

Model                     | Original (A100) | Compressed (Orin) | Speedup | Accuracy Delta
Segmentation (FlatFormer) | 35ms, 70% mIoU  | 18ms, 66% mIoU    | 2x      | -4%
Detection (CenterPoint)   | 52ms, 65.8% mAP | 5ms, 64.5% mAP    | 10x     | -1.3%
Occupancy (FlashOcc)      | 28ms, 32% mIoU  | 3ms, 27% mIoU     | 9x      | -5%
Total perception          | ~115ms          | ~26ms             | 4.4x    | Acceptable

15. References

Knowledge Distillation

  • "Compressing Multi-Task Model for Autonomous Driving via Pruning and Knowledge Distillation" (2025) — arxiv.org/abs/2511.05557
  • TinyBEV: Khan et al., "Cross-Modal Knowledge Distillation for Efficient Multi-Task BEV Perception" (ICCV 2025 Workshop)
  • "On the Road to Portability: Compressing End-to-End Motion Planner" (2024) — arxiv.org/abs/2403.01238

Architecture Efficiency

  • FlatFormer: Liu et al., "Flattened Window Attention for Efficient Point Cloud Transformer" (CVPR 2023)
  • FastViT: Vasu et al., "FastViT: A Fast Hybrid Vision Transformer using Structural Reparameterization" (ICCV 2023)
  • MobileNetV3: Howard et al., "Searching for MobileNetV3" (ICCV 2019)

Research notes collected from public sources.