Vision Foundation Models for Autonomous Driving Perception
Comprehensive Technical Report — With Focus on Airport Airside Applications
Table of Contents
- Vision Foundation Models for Driving
- Open-Vocabulary / Zero-Shot Detection for Driving
- 3D Foundation Models
- Video Foundation Models
- Domain Adaptation Strategies for Airside
- Synthesis: A Roadmap for Airport Airside Perception
1. Vision Foundation Models for Driving
1.1 Segment Anything Model (SAM / SAM 2)
SAM (Kirillov et al., 2023) introduced a promptable segmentation foundation model trained on SA-1B (1 billion masks, 11 million images). The model accepts points, boxes, or text prompts and generalizes to novel image distributions in a zero-shot manner, often matching or exceeding supervised approaches.
SAM 2 (Meta, 2024) extends SAM to video through a per-session streaming memory module. Key advances:
- Processes video frames sequentially with a memory bank that captures target object information across frames
- Tracks objects even through temporary occlusions
- Requires as little as a single click on one frame to track objects through an entire video
- When applied to images, the memory module is empty, making it functionally equivalent to SAM 1
- Trained on the SA-V dataset: ~600K masklets (spatio-temporal object masks, ~35.5M individual frame masks) across ~51K videos collected in 47 countries
- Outperforms all prior video object segmentation models, with particular strength on tracking object parts
Applications to driving perception:
| Paper | Key Contribution |
|---|---|
| VFMM3D (Ding et al., 2024) | Combines SAM + Depth Anything to convert monocular images to pseudo-LiDAR for 3D detection; SOTA on KITTI and Waymo |
| FusionSAM (Li et al., 2024) | First application of SAM to multimodal (RGB + thermal) segmentation for driving; +4.1% mIoU improvement |
| Multi-modal NeRF Self-Supervision (Timoneda et al., IROS 2024) | Uses SAM masks from camera imagery to supervise LiDAR semantic segmentation via NeRF rendering; evaluated on nuScenes, SemanticKITTI, ScribbleKITTI |
| SAM Zero-shot Robustness (Yan et al., 2024) | Demonstrates SAM achieves "acceptable" adversarial robustness for driving segmentation without any fine-tuning, attributed to scale of parameters and training data |
| Segment, Lift and Fit (Li et al., ECCV 2024) | Uses SAM for automatic 3D shape labeling from 2D prompts; achieves ~90% AP@0.5 IoU on KITTI |
| RMP-SAM (Xu et al., 2024) | Real-time multi-purpose SAM variant for driving; supports interactive, panoptic, and video instance segmentation |
| VideoSAM (Guo et al., 2024) | Extends SAM to open-world video segmentation for robotics and autonomous driving |
| Weather Robustness (Kou et al., 2024) | SAM for pseudo-label generation to train segmentation models robust to adverse weather; +88.56% mIoU improvement |
Airside relevance: SAM's zero-shot segmentation is exceptionally relevant for airport airside because it can segment novel objects (aircraft, GSE, baggage carts) without training on airside-specific data. SAM 2's video tracking enables consistent object tracking across frames as vehicles navigate the apron.
1.2 Grounding DINO / Grounded SAM
Grounding DINO (Liu et al., 2023) marries the DINO detector with grounded pre-training for open-set object detection. Architecture consists of:
- Feature enhancer for multi-scale visual features
- Language-guided query selection that uses text to initialize detection queries
- Cross-modality decoder for tight vision-language fusion
Performance: 52.5 AP on COCO zero-shot transfer; 26.1 AP mean on ODinW zero-shot benchmarks. Accepts arbitrary text prompts describing objects to detect.
Grounded SAM combines Grounding DINO with SAM in a pipeline:
- Grounding DINO detects bounding boxes from text prompts
- SAM generates precise segmentation masks within those boxes
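The two-stage pipeline can be sketched in a few lines. The example below is a minimal, illustrative version using the Hugging Face transformers ports of both models; the model IDs, thresholds, and post-processing helpers are assumptions that may differ across library versions, and the apron image path and prompt are hypothetical.

```python
# Minimal Grounded-SAM-style sketch: text-prompted boxes from Grounding DINO,
# then per-box masks from SAM. Model IDs and post-processing helpers follow the
# Hugging Face transformers ports and may differ across library versions.
import torch
from PIL import Image
from transformers import (
    AutoProcessor, AutoModelForZeroShotObjectDetection,
    SamModel, SamProcessor,
)

device = "cuda" if torch.cuda.is_available() else "cpu"
image = Image.open("apron_frame.jpg").convert("RGB")  # hypothetical airside frame

# 1) Open-vocabulary detection: categories are given as a " . "-separated prompt.
det_processor = AutoProcessor.from_pretrained("IDEA-Research/grounding-dino-base")
detector = AutoModelForZeroShotObjectDetection.from_pretrained(
    "IDEA-Research/grounding-dino-base").to(device)
prompt = "aircraft . pushback tug . fuel truck . baggage cart ."
det_inputs = det_processor(images=image, text=prompt, return_tensors="pt").to(device)
with torch.no_grad():
    det_outputs = detector(**det_inputs)
detections = det_processor.post_process_grounded_object_detection(
    det_outputs, det_inputs.input_ids,
    box_threshold=0.35, text_threshold=0.25,
    target_sizes=[image.size[::-1]],
)[0]  # dict with "boxes", "scores", "labels"

# 2) Promptable segmentation: each detected box becomes a SAM box prompt.
sam_processor = SamProcessor.from_pretrained("facebook/sam-vit-base")
sam = SamModel.from_pretrained("facebook/sam-vit-base").to(device)
boxes = [[box.tolist() for box in detections["boxes"]]]
sam_inputs = sam_processor(image, input_boxes=boxes, return_tensors="pt").to(device)
with torch.no_grad():
    sam_outputs = sam(**sam_inputs)
masks = sam_processor.image_processor.post_process_masks(
    sam_outputs.pred_masks.cpu(),
    sam_inputs["original_sizes"].cpu(),
    sam_inputs["reshaped_input_sizes"].cpu(),
)  # one boolean mask tensor per image

for label, score in zip(detections["labels"], detections["scores"]):
    print(f"{label}: {float(score):.2f}")
```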
Applications to driving:
- Segment, Lift and Fit (ECCV 2024) uses this pipeline for automatic 3D annotation in driving datasets
- ZOPP (Ma et al., NeurIPS 2024) integrates vision foundation models including Grounding DINO for zero-shot offboard panoptic perception and auto-labeling on the Waymo dataset
Airside relevance: Grounded SAM is the most immediately applicable tool for airside perception because operators can specify text prompts like "aircraft," "baggage loader," "fuel truck," "pushback tug," "ground power unit" to detect and segment objects never seen in standard driving datasets. This eliminates the cold-start problem for novel object classes.
1.3 DINOv2 as Backbone
DINOv2 (Oquab et al., 2023) produces versatile visual features through self-supervised pre-training on a curated, diverse dataset. Key properties:
- 1-billion-parameter ViT trained without supervision
- Features transfer across image distributions and tasks without fine-tuning
- Distilled smaller models surpass OpenCLIP at both image and pixel levels
- Effective for dense prediction tasks (depth, segmentation, surface normals)
Applications to driving:
- DistillNeRF (Wang et al., NeurIPS 2024): Distills DINOv2 and CLIP features into a 3D neural field representation for autonomous driving; enables zero-shot 3D semantic occupancy prediction without 3D annotations
- LargeAD (Kong et al., TPAMI 2025): Uses DINOv2-driven superpixels to align 2D semantics with 3D LiDAR point clouds across 11 driving datasets
DINOv2's dense features are particularly strong for:
- Monocular depth estimation (competitive with supervised methods)
- Semantic segmentation with linear probes
- Instance retrieval and matching
- Surface normal estimation
Airside relevance: DINOv2 features can serve as a backbone for airside perception models without large-scale airside training data. Its dense features enable depth estimation and segmentation on novel environments. Linear probes on frozen DINOv2 features can be trained with minimal airside annotations.
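A minimal linear-probe sketch follows, assuming the torch.hub entry point published by the DINOv2 repository; the file paths, class names, and few-shot label set are hypothetical placeholders.

```python
# Linear probe on frozen DINOv2 features: embed a handful of labeled airside
# crops with the frozen backbone, then fit a simple classifier on top.
import torch
from PIL import Image
from torchvision import transforms
from sklearn.linear_model import LogisticRegression

device = "cuda" if torch.cuda.is_available() else "cpu"
backbone = torch.hub.load("facebookresearch/dinov2", "dinov2_vits14").to(device).eval()

preprocess = transforms.Compose([
    transforms.Resize(256), transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=(0.485, 0.456, 0.406), std=(0.229, 0.224, 0.225)),
])

@torch.no_grad()
def embed(paths):
    """Return one frozen global embedding per image path."""
    feats = []
    for p in paths:
        x = preprocess(Image.open(p).convert("RGB")).unsqueeze(0).to(device)
        feats.append(backbone(x).squeeze(0).cpu())  # hub model returns the CLS embedding
    return torch.stack(feats).numpy()

# Hypothetical few-shot airside set: 10-50 crops per class is often enough
# for a linear probe on frozen features.
train_paths = ["crops/tug_001.jpg", "crops/loader_004.jpg"]   # ... more crops
train_labels = ["pushback_tug", "belt_loader"]                 # ... matching labels

probe = LogisticRegression(max_iter=1000).fit(embed(train_paths), train_labels)
print(probe.predict(embed(["crops/unknown_012.jpg"])))
```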
1.4 CLIP / SigLIP for Driving Scene Understanding
CLIP (Radford et al., 2021) learns joint image-text embeddings through contrastive pre-training on 400M image-text pairs. Its zero-shot classification and open-vocabulary capabilities make it foundational for driving.
SigLIP (Zhai et al., 2023) improves on CLIP by using sigmoid loss instead of softmax:
- Operates on image-text pairs without requiring global pairwise similarity computation
- Works effectively at both large and small batch sizes (32K is sufficient)
- Achieves 84.5% ImageNet zero-shot accuracy with locked-image tuning on just 4 TPUv4 chips
Applications to driving:
- DistillNeRF: Distills CLIP features into 3D neural fields, enabling zero-shot semantic understanding and 3D semantic occupancy prediction for driving scenes
- Clipomaly (Reichard et al., 2024): First CLIP-based open-world anomaly segmentation for driving; dynamically extends vocabulary at inference time to assign human-interpretable names to unknown objects without retraining
- Hazardous Object Detection (Shriram et al., 2025): Uses CLIP to match VLM-predicted hazard descriptions with bounding boxes in traffic scenes
Airside relevance: CLIP/SigLIP embeddings can bridge the vocabulary gap between road driving and airside environments. Text queries like "Boeing 737," "aircraft tow bar," or "jet bridge" can retrieve or classify objects without airside training data. SigLIP's efficiency makes it practical for real-time on-vehicle deployment.
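A minimal zero-shot classification sketch, using the Hugging Face port of CLIP; the label phrases and crop path are illustrative rather than a validated airside vocabulary.

```python
# Zero-shot classification of an airside object crop with CLIP: rank a set of
# text labels by image-text similarity, no airside training data required.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = [
    "a narrow-body passenger aircraft",
    "a pushback tug",
    "a belt loader",
    "a fuel bowser",
    "a set of mobile passenger stairs",
]
image = Image.open("crops/gse_0042.jpg").convert("RGB")  # hypothetical crop

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)
probs = outputs.logits_per_image.softmax(dim=-1).squeeze(0)

for label, p in sorted(zip(labels, probs.tolist()), key=lambda t: -t[1]):
    print(f"{p:.3f}  {label}")
```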
1.5 EVA / InternImage / InternVL
EVA (Fang et al., 2022): A 1-billion-parameter ViT trained via masked image-text aligned feature reconstruction.
- SOTA on image recognition, video action recognition, object detection, instance segmentation, semantic segmentation
- Competitive on both LVIS (1000+ categories) and COCO (80 categories) instance segmentation
- Functions as an effective CLIP vision encoder, improving training stability
EVA-02 (Fang et al., 2023): A more efficient successor.
- 304M parameters; 90.0% ImageNet fine-tuning accuracy
- CLIP variant achieves 80.4% zero-shot accuracy with ~1/6 parameters and ~1/6 data compared to prior open-source CLIP
- Released in four sizes (6M to 304M parameters)
InternImage (Wang et al., 2022): CNN-based foundation model using deformable convolutions.
- 65.4 mAP on COCO test-dev; 62.9 mIoU on ADE20K
- Achieves adaptive spatial aggregation conditioned on input and task
- Surpasses both CNN and ViT baselines across vision benchmarks
InternVL 1.5 (Chen et al., 2024): Open-source multimodal model approaching GPT-4V.
- InternViT-6B vision encoder with continuous learning
- Dynamic high-resolution processing: 1-40 tiles of 448x448 pixels, supporting up to 4K inputs
- SOTA on 8 of 18 multimodal benchmarks
DriveLM (Sima et al., ECCV 2024): Uses InternVL-style models for driving with Graph VQA.
- Models perception, prediction, and planning as structured QA
- Competitive with driving-specific architectures
- Strong zero-shot generalization to unseen scenarios
Airside relevance: InternVL's multimodal reasoning can interpret complex airside scenes where spatial relationships matter (e.g., "is the pushback tug connected to the aircraft?"). EVA and InternImage provide strong backbones for fine-tuning on limited airside data.
1.6 How Foundation Models Handle Novel/Unseen Object Classes
Foundation models address novel objects through several mechanisms:
Zero-shot transfer via language: CLIP, Grounding DINO, and SigLIP use natural language to specify arbitrary object categories at inference time, bypassing the closed-set assumption.
Promptable segmentation: SAM segments any object given a spatial prompt (point, box), regardless of whether that class was in training data.
Open-vocabulary anomaly detection: Clipomaly dynamically extends its vocabulary at inference time to detect and name unknown objects without retraining.
Emergent robustness: SAM demonstrates zero-shot adversarial robustness for driving tasks, attributed to the scale of its training.
Cross-modal knowledge transfer: Models like DetAny3D and VFMM3D transfer 2D foundation model knowledge to 3D detection of novel objects.
Critical insight for airside: The combination of Grounding DINO (detect novel objects by name) + SAM (precise segmentation) + CLIP (semantic understanding) provides a complete pipeline for perceiving airside-specific objects without any airside training data as a starting point.
2. Open-Vocabulary / Zero-Shot Detection for Driving
2.1 OWL-ViT
OWL-ViT (Minderer et al., ECCV 2022): "Simple Open-Vocabulary Object Detection with Vision Transformers."
- Combines standard ViT architecture with contrastive image-text pre-training and end-to-end detection fine-tuning
- Minimal architectural modifications needed to adapt image-level models to detection
- Supports both zero-shot text-conditioned and one-shot image-conditioned detection
- Key finding: larger models and more pre-training consistently improve detection performance
Strengths for driving: Requires only a text description or a single reference image to detect novel objects. One-shot image-conditioned detection is particularly useful when text descriptions are ambiguous (e.g., showing an image of a specific GSE type).
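A minimal text-conditioned detection sketch, using the Hugging Face port of OWL-ViT; queries, the threshold, and the frame path are illustrative. The image-conditioned one-shot mode follows a similar pattern with a query image in place of text.

```python
# Zero-shot, text-conditioned detection with OWL-ViT on a single frame.
import torch
from PIL import Image
from transformers import OwlViTProcessor, OwlViTForObjectDetection

processor = OwlViTProcessor.from_pretrained("google/owlvit-base-patch32")
model = OwlViTForObjectDetection.from_pretrained("google/owlvit-base-patch32")

image = Image.open("apron_frame.jpg").convert("RGB")  # hypothetical frame
queries = [["aircraft", "pushback tug", "baggage cart", "fuel truck"]]

inputs = processor(text=queries, images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

target_sizes = torch.tensor([image.size[::-1]])  # (height, width)
results = processor.post_process_object_detection(
    outputs, threshold=0.2, target_sizes=target_sizes)[0]

for box, score, label in zip(results["boxes"], results["scores"], results["labels"]):
    print(queries[0][int(label)], f"{float(score):.2f}",
          [round(v) for v in box.tolist()])
```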
2.2 YOLO-World
YOLO-World (Cheng et al., 2024): Real-time open-vocabulary object detection.
- RepVL-PAN: Re-parameterizable Vision-Language Path Aggregation Network that fuses visual and linguistic features
- Region-Text Contrastive Loss for aligning visual regions with text descriptions
- Performance: 35.4 AP on LVIS with 52.0 FPS on V100 -- combines accuracy with real-time speed
- Zero-shot detection across hundreds of categories without fine-tuning
- Strong downstream performance on instance segmentation
Airside relevance: YOLO-World's real-time speed (52 FPS) makes it deployable for live airside perception. The open vocabulary means it can detect "aircraft," "baggage cart," "fuel bowser," "marshaller" etc. using only text prompts, without ever training on airside imagery.
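A minimal sketch assuming the ultralytics package port of YOLO-World; the checkpoint name, class list, and confidence threshold are assumptions to adjust for the installed version.

```python
# Real-time open-vocabulary detection with YOLO-World: set airside classes as
# text, then run inference on a frame.
from ultralytics import YOLOWorld

model = YOLOWorld("yolov8s-worldv2.pt")  # assumed checkpoint name
model.set_classes([
    "aircraft", "pushback tug", "baggage cart", "belt loader",
    "fuel bowser", "ground power unit", "marshaller",
])

results = model.predict("apron_frame.jpg", conf=0.25)  # hypothetical frame
for box in results[0].boxes:
    cls_name = results[0].names[int(box.cls)]
    print(cls_name, float(box.conf), box.xyxy.tolist())
```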
2.3 Open-Vocabulary 3D Detection
A rapidly emerging field that bridges 2D foundation model capabilities with 3D perception:
| Paper | Approach | Key Results |
|---|---|---|
| Open 3D World (Cheng & Li, 2024) | Fuses BEV features with text embeddings for open-vocabulary 3D detection | Zero-shot generalization on Lyft Level 5 |
| DetAny3D (Zhang et al., 2025) | Foundation model for zero-shot monocular 3D detection using 2D-to-3D knowledge transfer | SOTA on unseen categories and novel camera configs |
| OV-SCAN (Chow et al., 2025) | Semantically consistent alignment for novel object discovery in open-vocab 3D detection | Improved robustness on nuScenes |
| BoxFusion (Lan et al., 2025) | Real-time multi-view box fusion using CLIP semantics, no dense 3D reconstruction | Real-time performance on large-scale environments |
| OpenBox (Lee et al., 2025) | Automatic 3D annotation pipeline using 2D vision foundation models | Associates 2D cues with 3D point clouds |
| VESPA (Tempfli et al., 2025) | Multimodal LiDAR+camera with VLMs for open-vocabulary 3D labeling | Strong pseudolabel performance on nuScenes |
| Monocular OV-3D (Huang et al., 2024) | RGB-only training using pseudo-LiDAR with LLM-refined labels | No LiDAR sensor required |
2.4 Language-Guided Detection and Segmentation
ZOPP (Ma et al., NeurIPS 2024): Zero-shot offboard panoptic perception framework.
- Combines vision foundation model zero-shot recognition with 3D point cloud representations
- Unified framework for detection, segmentation, and classification
- Validated extensively on Waymo Open Dataset
- Addresses data imbalance and long-tail distribution challenges
Clipomaly (Reichard et al., 2024): CLIP-based open-world anomaly segmentation.
- Dynamically extends vocabulary at inference time
- Assigns human-interpretable names to unknown objects
- Zero anomaly-specific training data required
- SOTA on anomaly segmentation benchmarks
Hazardous Object Detection (Shriram et al., 2025): Multi-agent VLM system.
- Integrates VLM reasoning with zero-shot detection via CLIP
- Detects novel hazardous objects in video streams
- Enhanced COOOL anomaly detection benchmark with natural language descriptions
2.5 Few-Shot Adaptation to New Domains
Foundation models enable efficient adaptation to new domains through several strategies:
- One-shot detection (OWL-ViT): Provide a single reference image of the target object to detect it in new scenes.
- Text prompting (Grounding DINO, YOLO-World): Describe novel objects in natural language for immediate zero-shot detection.
- Linear probing (DINOv2): Train a simple linear layer on frozen DINOv2 features with as few as 10-50 labeled examples per class.
- Pseudo-label generation: Use foundation models to generate labels on unlabeled data, then train specialized detectors.
3. 3D Foundation Models
3.1 UniPAD: Universal Pre-training for Autonomous Driving
UniPAD (Yang et al., CVPR 2024): Universal pre-training paradigm for autonomous driving.
- Uses 3D volumetric differentiable rendering to implicitly encode 3D space
- Reconstructs both continuous 3D shapes and their 2D projections
- Integrates with both 2D and 3D perception frameworks
- Results: 73.2 NDS for 3D object detection, 79.4 mIoU for 3D semantic segmentation on nuScenes validation (SOTA)
Key insight: By pre-training on the task of rendering realistic views from 3D representations, the model learns rich 3D structural priors that transfer to downstream detection and segmentation.
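The core operation behind this kind of rendering-based pre-training is differentiable volume rendering along camera rays. A minimal sketch of that compositing step is given below; the feature volume, ray sampling, and decoders of UniPAD are replaced here by random stand-in tensors.

```python
# Differentiable volume rendering: densities and values sampled along each ray
# are alpha-composited into a per-ray prediction that can be supervised with
# real pixels or depths, providing the pre-training signal.
import torch

def composite_along_rays(density: torch.Tensor, value: torch.Tensor,
                         deltas: torch.Tensor) -> torch.Tensor:
    """density, deltas: (R, S); value: (R, S, C). Returns rendered (R, C)."""
    alpha = 1.0 - torch.exp(-density * deltas)                 # per-sample opacity
    trans = torch.cumprod(1.0 - alpha + 1e-10, dim=-1)         # transmittance
    trans = torch.cat([torch.ones_like(trans[:, :1]), trans[:, :-1]], dim=-1)
    weights = alpha * trans                                    # (R, S)
    return (weights.unsqueeze(-1) * value).sum(dim=1)          # (R, C)

rays, samples, channels = 1024, 64, 3
density = torch.rand(rays, samples)            # stand-in for decoded densities
value = torch.rand(rays, samples, channels)    # stand-in for decoded colors/features
deltas = torch.full((rays, samples), 0.5)      # spacing between depth samples
rendered = composite_along_rays(density, value, deltas)
print(rendered.shape)  # torch.Size([1024, 3]), compared against the input image
```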
3.2 Point-BERT for LiDAR
Point-BERT (Yu et al., 2022): BERT-style pre-training for 3D point cloud Transformers.
- Masked Point Modeling (MPM): divides point clouds into patches, masks random patches, recovers original tokens
- Discrete VAE tokenizer generates meaningful local representations
- Results: 93.8% on ModelNet40, 83.1% on ScanObjectNN (hardest setting)
- Strong few-shot transfer learning for point cloud classification
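To make the Masked Point Modeling recipe concrete, the sketch below prepares MPM inputs only: farthest point sampling for patch centers, kNN grouping, and random patch masking. Patch counts and the mask ratio are illustrative, and Point-BERT's dVAE tokenizer and Transformer encoder are omitted.

```python
# Masked Point Modeling input preparation for a single point cloud.
import torch

def farthest_point_sample(points: torch.Tensor, n_centers: int) -> torch.Tensor:
    """Greedy FPS over an (N, 3) point cloud; returns indices of the centers."""
    n = points.shape[0]
    centers = torch.zeros(n_centers, dtype=torch.long)
    dist = torch.full((n,), float("inf"))
    farthest = torch.randint(0, n, (1,)).item()
    for i in range(n_centers):
        centers[i] = farthest
        dist = torch.minimum(dist, ((points - points[farthest]) ** 2).sum(-1))
        farthest = int(dist.argmax())
    return centers

def make_patches(points: torch.Tensor, n_patches: int = 64, k: int = 32):
    """Group points into n_patches local patches of k nearest neighbours each."""
    center_idx = farthest_point_sample(points, n_patches)
    centers = points[center_idx]                              # (P, 3)
    d = torch.cdist(centers, points)                          # (P, N)
    knn_idx = d.topk(k, largest=False).indices                # (P, k)
    patches = points[knn_idx] - centers[:, None, :]           # centre-normalised
    return centers, patches

def random_patch_mask(n_patches: int, mask_ratio: float = 0.6) -> torch.Tensor:
    """Boolean mask over patches; True = masked (to be reconstructed)."""
    n_mask = int(n_patches * mask_ratio)
    perm = torch.randperm(n_patches)
    mask = torch.zeros(n_patches, dtype=torch.bool)
    mask[perm[:n_mask]] = True
    return mask

cloud = torch.randn(2048, 3)                  # stand-in for a LiDAR object crop
centers, patches = make_patches(cloud)
mask = random_patch_mask(len(centers))
visible_patches = patches[~mask]              # fed to the encoder
print(patches.shape, visible_patches.shape)
```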
Airside relevance: Point-BERT's few-shot capabilities mean a LiDAR-based airside perception system could be pre-trained on large-scale driving LiDAR data, then adapted to airside-specific object classes with very few labeled examples.
3.3 PonderV2: 3D Foundation Model
PonderV2 (Zhu et al., 2024): Universal pre-training paradigm for 3D foundation models.
- Uses differentiable neural rendering as the pre-training objective
- Encodes rich geometry and appearance cues into 3D features
- SOTA on 11 indoor and outdoor benchmarks
- Complementary to UniPAD but with broader scope beyond driving
3.4 Cross-Modal Pretraining (2D Images <-> 3D Point Clouds)
LargeAD (Kong et al., TPAMI 2025): The most comprehensive cross-modal pretraining framework for driving.
- Extracts semantically rich superpixels from 2D images using vision foundation models (DINOv2)
- Aligns superpixels with LiDAR point clouds via contrastive learning
- Temporal consistency across sequential frames
- Pre-trained on 11 large-scale multi-sensor driving datasets
- Generalizes across different LiDAR sensor configurations
Multi-modal NeRF Self-Supervision (Timoneda et al., IROS 2024):
- Uses SAM masks from cameras to supervise LiDAR segmentation
- NeRF rendering bridges the viewpoint gap between camera and LiDAR
- Drops the NeRF head at inference (LiDAR-only)
DistillNeRF (Wang et al., NeurIPS 2024):
- Distills CLIP and DINOv2 features into 3D neural fields
- Self-supervised (no 3D annotations needed)
- Enables zero-shot 3D semantic occupancy prediction
3.5 Zero-Shot 3D Understanding
The state of the art in zero-shot 3D understanding combines:
- 2D foundation model features (CLIP, DINOv2, SAM) for semantic richness
- Cross-modal alignment to transfer knowledge to 3D representations
- Language conditioning for open-vocabulary 3D queries
DetAny3D exemplifies this: a promptable 3D detection foundation model that uses 2D Aggregator and 3D Interpreter modules with Zero-Embedding Mapping to detect novel 3D objects from monocular images, achieving SOTA on unseen categories.
4. Video Foundation Models
4.1 VideoMAE
VideoMAE (Tong et al., NeurIPS 2022): Masked autoencoders for self-supervised video pre-training.
- Applies 90-95% masking ratio (much higher than image MAE), leveraging temporal redundancy in video
- Data-efficient: strong results on datasets with only 3K-4K videos
- Key finding: data quality matters more than quantity; domain shift is critical
- Results: 87.4% on Kinetics-400, 75.4% on Something-Something V2, 91.3% on UCF101
VideoMAE's domain-sensitivity finding is directly relevant to airside work: since domain match matters more than raw data quantity, pre-training or fine-tuning on even a small airside video dataset could yield substantial gains over using road-driving video alone.
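The high masking ratio is realised through tube masking, where the same spatial patches are hidden in every frame so the model cannot simply copy content from temporally redundant neighbours. A minimal sketch with illustrative grid sizes follows; the actual VideoMAE implementation applies the mask to joint space-time cube tokens rather than raw frames.

```python
# Tube masking at a ~90% ratio: one spatial mask, repeated across all frames.
import torch

def tube_mask(num_frames: int, h_patches: int, w_patches: int,
              mask_ratio: float = 0.9) -> torch.Tensor:
    """Boolean mask of shape (num_frames, h_patches * w_patches); True = masked."""
    n_spatial = h_patches * w_patches
    n_mask = int(n_spatial * mask_ratio)
    perm = torch.randperm(n_spatial)
    frame_mask = torch.zeros(n_spatial, dtype=torch.bool)
    frame_mask[perm[:n_mask]] = True
    # Repeat the same spatial mask across all frames to form tubes.
    return frame_mask.unsqueeze(0).expand(num_frames, -1).clone()

# 16-frame clip, 224x224 input with 16x16 patches -> 14x14 spatial patch grid.
mask = tube_mask(num_frames=16, h_patches=14, w_patches=14, mask_ratio=0.9)
print(mask.shape, mask.float().mean().item())  # ~0.9 of tokens are masked
```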
4.2 InternVideo / InternVideo2
InternVideo (Wang et al., 2022): Combines masked video modeling + video-language contrastive learning.
- SOTA across 39 video datasets
- 91.1% on Kinetics-400, 77.2% on Something-Something V2
- Covers action recognition, action detection, video-language alignment, open-world video applications
InternVideo2 (Wang et al., ECCV 2024): Scaled to 6B-parameter video encoder.
- Progressive training: masked video modeling -> cross-modal contrastive learning -> next-token prediction
- Spatiotemporal consistency via semantic video segmentation and video-audio-speech captions
- SOTA on 60+ video and audio tasks
- Extended video comprehension capabilities
4.3 Video Understanding for Driving Scenarios
Neuro-Symbolic Video Understanding (Choi et al., ECCV 2024):
- Identifies that video foundation models (VideoLLaMA, ViCLIP) excel at short-term understanding but struggle with extended temporal reasoning
- Proposes decoupling semantic understanding from temporal analysis
- Uses VLMs for per-frame perception + state machines for long-term event tracking
- 9-15% F1 improvement for complex event identification on Waymo and nuScenes
MVBench (Li et al., 2023): Benchmark with 20 video understanding tasks.
- Demonstrates that current multimodal LLMs struggle with temporal reasoning
- VideoChat2 baseline outperforms existing models by >15% on temporal tasks
4.4 Action Recognition in Driving Context
Video foundation models enable:
- Trajectory prediction from temporal sequences
- Behavior recognition of other road users (pedestrians, cyclists, other vehicles)
- Anomaly detection in video streams (unusual maneuvers, near-misses)
- Scene dynamics understanding (traffic flow, construction zone behavior)
Airside relevance: Video models are critical for airside because:
- Aircraft movements follow specific taxiing protocols
- GSE has predictable but varied operational patterns (loading, unloading, towing)
- Safety-critical events (jet blast, FOD on apron) need temporal context
- Marshaller gesture recognition requires action understanding
5. Domain Adaptation Strategies for Airside
5.1 The Airside Domain Gap
Airport airside differs fundamentally from standard road driving:
| Dimension | Road Driving | Airport Airside |
|---|---|---|
| Object classes | Cars, trucks, pedestrians, cyclists | Aircraft, GSE (tugs, loaders, fuel trucks), baggage carts, marshallers |
| Scene layout | Lanes, intersections, sidewalks | Aprons, taxiways, stands, jet bridges |
| Scale variation | Moderate (cars ~4m, trucks ~12m) | Extreme (baggage cart ~2m, A380 ~73m) |
| Dynamics | High-speed, predictable lanes | Low-speed, complex multi-agent coordination |
| Markings | Lane markings, traffic signs | Stand lines, taxiway markings, safety zones |
| Regulations | Road traffic rules | ICAO/airport-specific procedures |
5.2 Adapting Road-Driving Models to Airside
Strategy 1: Zero-Shot Foundation Models (No Airside Data)
The fastest path to airside perception with zero labeled airside data:
- Grounding DINO + SAM for open-vocabulary detection and segmentation with text prompts ("aircraft," "pushback tug," "fuel bowser," "baggage loader," "ground power unit")
- YOLO-World for real-time open-vocabulary detection at 52 FPS
- Clipomaly for anomaly detection (detecting unknown objects and assigning interpretable names)
- OWL-ViT one-shot detection: provide a single reference image of each GSE type
Expected performance: Moderate. Foundation models will detect and segment objects at a general level but may confuse similar GSE types or struggle with airside-specific spatial reasoning.
Strategy 2: Pseudo-Label Generation + Specialized Model Training
Use foundation models to bootstrap training data:
- Deploy Grounded SAM on unlabeled airside footage to generate pseudo-labels
- Human annotators correct only the errors (significantly faster than annotation from scratch)
- Train a specialized detector (e.g., RT-DETR, YOLOv8) on the corrected labels
- Iterate with active learning
The ZOPP framework validates this approach: foundation model pseudo-labels on Waymo data are sufficient for training competitive downstream models.
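A minimal sketch of the bootstrap step follows: exporting foundation-model detections as YOLO-format pseudo-label files for a specialized detector to train on. The detection structure mirrors the Grounded SAM sketch in Section 1.2, and the class map, paths, and confidence cutoff are assumptions to tune per deployment.

```python
# Export open-vocabulary detections as YOLO-format pseudo-labels.
from pathlib import Path

CLASS_IDS = {"aircraft": 0, "pushback tug": 1, "fuel truck": 2, "baggage cart": 3}

def write_yolo_labels(detections, image_size, out_path, min_score=0.5):
    """detections: list of (label, score, (x0, y0, x1, y1)) in pixel coordinates."""
    w, h = image_size
    lines = []
    for label, score, (x0, y0, x1, y1) in detections:
        if label not in CLASS_IDS or score < min_score:
            continue  # unknown phrase or low-confidence box: leave for human review
        # YOLO format: class_id x_center y_center width height, all normalised.
        xc, yc = (x0 + x1) / 2 / w, (y0 + y1) / 2 / h
        bw, bh = (x1 - x0) / w, (y1 - y0) / h
        lines.append(f"{CLASS_IDS[label]} {xc:.6f} {yc:.6f} {bw:.6f} {bh:.6f}")
    out = Path(out_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text("\n".join(lines))

# Hypothetical detections for one apron frame of size 1920x1080.
dets = [("aircraft", 0.91, (400, 120, 1700, 760)),
        ("pushback tug", 0.62, (620, 700, 880, 860))]
write_yolo_labels(dets, (1920, 1080), "labels/apron_000123.txt")
```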
Strategy 3: Foundation Model Fine-Tuning with Limited Airside Data
Adapt foundation models directly using parameter-efficient methods.
5.3 Fine-Tuning Strategies (LoRA, Adapters)
LoRA (Hu et al., 2021) is the primary technique for efficient foundation model adaptation:
- Freezes pre-trained weights; injects low-rank trainable matrices into transformer layers
- Reduces trainable parameters by 10,000x compared to full fine-tuning
- No additional inference latency (weight matrices are merged at deployment)
- GPU memory reduction of ~3x
Application to airside:
| Foundation Model | LoRA Application | Expected Outcome |
|---|---|---|
| SAM | Adapt mask decoder for airside-specific object boundaries | Better segmentation of aircraft components, GSE |
| Grounding DINO | Adapt detection head for airside vocabulary | Improved detection precision for GSE subclasses |
| DINOv2 | Add LoRA to ViT backbone for airside feature extraction | Better features for airside-specific tasks |
| YOLO-World | Fine-tune with airside image-text pairs | Improved airside vocabulary understanding |
| InternVL | Adapt for airside visual question answering | Scene understanding ("Is the aircraft door open?") |
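As a concrete illustration of the DINOv2 row above, the sketch below injects LoRA adapters into a frozen ViT backbone using the peft library, with a generic Hugging Face ViT standing in for the actual backbone; the target module names, rank, and the 12-class head are assumptions.

```python
# LoRA injection into a frozen ViT backbone, plus a small task head on top.
import torch
from transformers import ViTModel
from peft import LoraConfig, get_peft_model

backbone = ViTModel.from_pretrained("google/vit-base-patch16-224-in21k")

lora_config = LoraConfig(
    r=8,                                # low-rank dimension
    lora_alpha=16,
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections in this ViT port
    bias="none",
)
model = get_peft_model(backbone, lora_config)
model.print_trainable_parameters()      # typically well under 1% of the backbone

# A small task head for, e.g., GSE sub-class classification on a few hundred
# airside labels (12 hypothetical classes here).
head = torch.nn.Linear(backbone.config.hidden_size, 12)
features = model(pixel_values=torch.randn(2, 3, 224, 224)).last_hidden_state[:, 0]
logits = head(features)
print(logits.shape)  # torch.Size([2, 12])
```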
Adapter-based approaches:
- Add small adapter modules between frozen transformer layers
- Train only adapter parameters (~2-5% of total model parameters)
- Can be swapped at inference time (e.g., switch between road and airside adapters)
Visual prompting (Bahng et al., 2022): Learns a single image perturbation that enables frozen models to perform new tasks.
- Particularly effective for CLIP; robust to distribution shift
- Could enable CLIP-based airside perception without modifying model weights
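A minimal sketch of this idea: a single learnable pixel-space perturbation optimised against a frozen CLIP, with hypothetical airside class names and random stand-in data; the original method uses a padding-shaped prompt and a proper training set rather than a full-image perturbation.

```python
# Visual prompting on a frozen CLIP: only the pixel-space prompt is trained.
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32").eval()
for p in model.parameters():
    p.requires_grad_(False)

processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
class_names = ["pushback tug", "belt loader", "fuel bowser"]  # hypothetical classes
text_inputs = processor(text=class_names, return_tensors="pt", padding=True)
text_feats = model.get_text_features(**text_inputs)
text_feats = text_feats / text_feats.norm(dim=-1, keepdim=True)

visual_prompt = torch.zeros(1, 3, 224, 224, requires_grad=True)  # only trainable tensor
optimizer = torch.optim.Adam([visual_prompt], lr=0.1)

def train_step(pixel_values: torch.Tensor, labels: torch.Tensor) -> float:
    """pixel_values: CLIP-preprocessed images (B, 3, 224, 224); labels: class indices."""
    img_feats = model.get_image_features(pixel_values=pixel_values + visual_prompt)
    img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
    logits = 100.0 * img_feats @ text_feats.t()   # CLIP-style scaled similarities
    loss = torch.nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# One toy optimisation step on random stand-in data.
print(train_step(torch.randn(4, 3, 224, 224), torch.randint(0, 3, (4,))))
```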
5.4 How Much Airside Data Is Needed?
Based on the literature, the data requirements follow a progression:
Zero-shot (0 airside examples):
- Open-vocabulary detection (Grounding DINO, YOLO-World): functional but noisy
- SAM segmentation: works well for clear object boundaries
- CLIP classification: works for well-known categories (aircraft types)
- Estimated performance: 40-60% mAP for common airside objects
Few-shot (5-50 examples per class):
- OWL-ViT one-shot detection with reference images
- DINOv2 linear probes with 10-50 labeled examples
- Point-BERT few-shot classification for LiDAR objects
- Estimated performance: 55-70% mAP
Limited annotation (100-500 examples per class):
- LoRA fine-tuning of foundation models
- Pseudo-label correction (foundation model generates, human corrects)
- Estimated performance: 65-80% mAP
Moderate annotation (1,000-5,000 examples per class):
- Full adapter training
- Specialized model training on pseudo-labeled + corrected data
- Estimated performance: 75-90% mAP
Key insight from VideoMAE research: Data quality (domain match) matters more than quantity. 3K-4K well-chosen airside video clips may outperform 100K road driving clips for airside-specific tasks.
5.5 Prompt Engineering for Zero-Shot Airside Perception
Effective prompts for airside detection with Grounding DINO / YOLO-World:
Aircraft detection:
- Generic: "aircraft . airplane . jet"
- Specific: "narrow-body aircraft . wide-body aircraft . turboprop aircraft . helicopter"
- Component-level: "aircraft engine . landing gear . aircraft wing . cockpit window"
Ground Support Equipment:
- "pushback tug . aircraft tug . towbarless tractor"
- "belt loader . container loader . high loader . cargo loader"
- "fuel truck . fuel bowser . refueling vehicle"
- "ground power unit . air start unit . hydraulic power unit"
- "baggage cart . baggage dolly . ULD dolly"
- "passenger stairs . airstairs . mobile stairway"
- "catering truck . catering high-lift"
- "deicing truck . deicing vehicle"
- "follow-me car . marshalling vehicle"
Prompt engineering best practices:
- Use multiple synonyms separated by " . " for each category
- Include both generic and specific terms
- Test with negative examples to calibrate false-positive thresholds
- For ambiguous categories, use descriptive phrases: "small yellow vehicle towing baggage carts"
- Leverage CLIP similarity scores to rank detections by confidence
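These conventions can be captured in a small configuration, sketched below: each canonical category maps to several synonym phrasings, which are joined into a single detector prompt and folded back to the canonical name after detection. The category names and synonyms are illustrative.

```python
# Prompt groups for open-vocabulary airside detection.
AIRSIDE_PROMPTS = {
    "pushback_tug":  ["pushback tug", "aircraft tug", "towbarless tractor"],
    "loader":        ["belt loader", "container loader", "high loader"],
    "fuel_vehicle":  ["fuel truck", "fuel bowser", "refueling vehicle"],
    "gpu":           ["ground power unit"],
    "baggage_cart":  ["baggage cart", "baggage dolly", "ULD dolly"],
}

def build_prompt(groups):
    """Join all synonyms with ' . ' as expected by Grounding DINO / YOLO-World prompts."""
    return " . ".join(phrase for synonyms in groups.values() for phrase in synonyms)

def canonical_label(detected_phrase, groups):
    """Map a detected phrase back to its canonical category (None if unmatched)."""
    phrase = detected_phrase.lower().strip()
    for category, synonyms in groups.items():
        if any(s in phrase or phrase in s for s in synonyms):
            return category
    return None

print(build_prompt(AIRSIDE_PROMPTS))
print(canonical_label("fuel bowser", AIRSIDE_PROMPTS))  # -> "fuel_vehicle"
```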
5.6 Active Learning with Foundation Models
Foundation-model-assisted active learning pipeline:
1. Initial deployment: Run Grounding DINO + SAM on unlabeled airside data
2. Uncertainty identification: Flag detections where:
   - Grounding DINO confidence is between 0.3 and 0.7 (uncertain)
   - Multiple text prompts produce conflicting results
   - SAM mask quality score is low
3. Human review: Annotators review only the uncertain cases (reduces annotation effort by 60-80%)
4. Model update: LoRA fine-tune on the corrected annotations
5. Iterate: The updated model surfaces new uncertain cases
This approach was validated by Multi-label Scene Classification (Li et al., 2025), which combined Knowledge Acquisition and Accumulation (KAA) with Consistency-based Active Learning (CAL) for autonomous vehicle perception, achieving significant improvements with reduced annotation.
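A minimal sketch of the uncertainty-flagging rule in step 2 above; the thresholds and detection fields are assumptions to calibrate on real airside data.

```python
# Route a detection to human review when the detector score is ambiguous,
# alternative prompts disagree, or the mask quality estimate is low.
from dataclasses import dataclass

@dataclass
class Detection:
    label: str            # canonical category from the prompt mapping
    score: float          # detector confidence
    mask_quality: float   # e.g. SAM's predicted IoU for the mask
    alt_labels: set       # categories returned by alternative prompt phrasings

def needs_review(det: Detection,
                 uncertain_band=(0.3, 0.7),
                 min_mask_quality=0.8) -> bool:
    low, high = uncertain_band
    ambiguous_score = low <= det.score <= high
    conflicting_prompts = len(det.alt_labels - {det.label}) > 0
    weak_mask = det.mask_quality < min_mask_quality
    return ambiguous_score or conflicting_prompts or weak_mask

batch = [
    Detection("pushback_tug", 0.55, 0.92, {"pushback_tug"}),        # ambiguous score
    Detection("fuel_vehicle", 0.88, 0.95, {"fuel_vehicle"}),        # confident, keep
    Detection("baggage_cart", 0.81, 0.95, {"baggage_cart", "gpu"}), # prompt conflict
]
review_queue = [d for d in batch if needs_review(d)]
print(len(review_queue), "of", len(batch), "detections flagged for human review")
```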
6. Synthesis: A Roadmap for Airport Airside Perception
6.1 Recommended Architecture Stack
Layer 4: Scene Understanding
- InternVL / GPT-4V for complex spatial reasoning ("Is the pushback complete?", "Is Stand 42 occupied?")
Layer 3: Temporal Understanding
- SAM 2 for video object tracking
- InternVideo2 features for action recognition (marshaller gesture recognition, GSE activity classification)
Layer 2: 3D Perception
- Cross-modal alignment (LargeAD approach): DINOv2 superpixels aligned with LiDAR points
- DetAny3D for zero-shot monocular 3D detection
Layer 1: 2D Perception (Foundation)
- YOLO-World for real-time open-vocabulary detection (52 FPS)
- Grounding DINO + SAM for high-quality segmentation
- DINOv2 backbone for dense features
- Clipomaly for anomaly detection
6.2 Phased Deployment Strategy
Phase 1: Zero-Shot Baseline (Week 1-2)
- Deploy YOLO-World + Grounding DINO + SAM with airside text prompts
- Evaluate on representative airside footage
- Identify gaps (which objects are missed, which are confused)
Phase 2: Few-Shot Adaptation (Week 3-4)
- Collect 50-100 reference images per critical object class
- OWL-ViT one-shot detection for rare GSE types
- DINOv2 linear probes for object classification
- Active learning to identify the most valuable annotations
Phase 3: Efficient Fine-Tuning (Month 2-3)
- LoRA fine-tuning of detection models on corrected pseudo-labels
- Adapter training for SAM mask decoder on airside boundaries
- Specialized 3D detection using foundation model knowledge transfer
Phase 4: Full System (Month 4-6)
- Integrate video understanding (SAM 2 tracking, temporal models)
- 3D perception with cross-modal alignment
- Scene-level reasoning with VLMs
- Continuous active learning pipeline
6.3 Key Research Gaps and Opportunities
No airside-specific foundation model benchmarks exist. Creating an "AirsideNet" dataset with GSE annotations would accelerate the field.
Scale variation handling: Current foundation models are not explicitly designed for the extreme scale variation on airside (2m baggage cart next to 73m A380). Multi-scale attention mechanisms may need adaptation.
Multi-agent coordination understanding: Airside operations involve complex choreography between multiple GSE, aircraft, and personnel. Current VLMs can describe individual objects but struggle with relational reasoning at this complexity.
Regulatory compliance verification: Foundation models could be trained to verify compliance with safety procedures (e.g., checking that FOD walks are performed, that vehicles maintain safe distances from active engines).
Weather and lighting robustness: Airside operations occur 24/7 in all weather conditions. Foundation models show promising robustness (SAM maintains performance under weather perturbations), but this needs systematic validation for airside conditions.
6.4 Critical References
| Reference | Year | Venue | Relevance |
|---|---|---|---|
| SAM (Kirillov et al.) | 2023 | ICCV | Zero-shot segmentation foundation |
| SAM 2 (Meta) | 2024 | - | Video segmentation + tracking |
| Grounding DINO (Liu et al.) | 2023 | ECCV | Open-set detection by text |
| YOLO-World (Cheng et al.) | 2024 | CVPR | Real-time open-vocabulary detection |
| DINOv2 (Oquab et al.) | 2023 | TMLR | Self-supervised visual features |
| CLIP (Radford et al.) | 2021 | ICML | Vision-language alignment |
| SigLIP (Zhai et al.) | 2023 | ICCV | Efficient image-text pretraining |
| OWL-ViT (Minderer et al.) | 2022 | ECCV | Open-vocabulary + one-shot detection |
| EVA (Fang et al.) | 2022 | CVPR | Scaled vision foundation model |
| EVA-02 (Fang et al.) | 2023 | - | Efficient vision representations |
| InternImage (Wang et al.) | 2022 | CVPR | CNN-based foundation model |
| InternVL 1.5 (Chen et al.) | 2024 | - | Open-source multimodal model |
| UniPAD (Yang et al.) | 2024 | CVPR | 3D pre-training for driving |
| Point-BERT (Yu et al.) | 2022 | CVPR | 3D point cloud pre-training |
| LargeAD (Kong et al.) | 2025 | TPAMI | Cross-sensor pretraining |
| LoRA (Hu et al.) | 2021 | ICLR | Parameter-efficient fine-tuning |
| VideoMAE (Tong et al.) | 2022 | NeurIPS | Video self-supervised learning |
| InternVideo2 (Wang et al.) | 2024 | ECCV | Video foundation model |
| DistillNeRF (Wang et al.) | 2024 | NeurIPS | 3D scene understanding with VFMs |
| ZOPP (Ma et al.) | 2024 | NeurIPS | Zero-shot panoptic perception |
| DetAny3D (Zhang et al.) | 2025 | - | Zero-shot 3D detection |
| Clipomaly (Reichard et al.) | 2024 | - | Open-world anomaly segmentation |
| VFMM3D (Ding et al.) | 2024 | - | SAM + Depth for monocular 3D |
| FusionSAM (Li et al.) | 2024 | - | Multimodal SAM for driving |
| DriveLM (Sima et al.) | 2024 | ECCV | VLM-based driving reasoning |
| MobileSAM (Zhang et al.) | 2023 | - | Lightweight SAM for deployment |
| OMG-Seg (Li et al.) | 2024 | CVPR | Unified segmentation model |
Report compiled March 2026. Research landscape is evolving rapidly; key papers from late 2024 through early 2026 represent the frontier of foundation model application to autonomous driving perception.