LiDAR Foundation Models & 3D Point Cloud Pre-training for Autonomous Driving
Comprehensive Technical Survey (2022-2026) for LiDAR-Primary Airside AV
Table of Contents
- Executive Summary
- 3D Point Cloud Pre-training / Self-Supervised Learning
- LiDAR Foundation Models & Efficient 3D Backbones
- Multi-modal 3D Foundation Models (Language-3D Alignment)
- Pre-training to Fine-tuning Pipeline
- Practical Deployment on NVIDIA Orin
- LiDAR World Models & Generation
- Latest 2025-2026 Advances
- Comparative Summary Table
- Recommendations for LiDAR-Primary Airside AV
- References
1. Executive Summary
LiDAR foundation models and 3D point cloud pre-training have matured rapidly from 2022-2026, evolving from object-level pre-training (Point-BERT, Point-MAE) to scene-level driving-specific methods (AD-PT, GD-MAE, BEV-MAE) and now to universal 3D encoders (Sonata, Concerto, Utonia) that span indoor, outdoor, and object domains.
Key Findings for LiDAR-Primary Airside AV
Pre-training saves 50-80% of labeled data. GD-MAE achieves comparable accuracy with only 20% of Waymo labels. GPC outperforms full-dataset training-from-scratch using just 20% of KITTI labels. PSA-SSL matches SOTA with 10x fewer labels.
The Pointcept stack (PTv3 + Sonata/Concerto) is the current SOTA. PTv3 (CVPR 2024 Oral) is 3x faster and 10x more memory-efficient than PTv2. Sonata (CVPR 2025 Highlight) provides self-supervised pre-trained weights. Concerto (NeurIPS 2025) adds 2D-3D joint learning, outperforming standalone 3D SSL by 4.8%.
FlatFormer is the path to real-time transformer-based LiDAR on Orin. 4.6x faster than SST, 1.4x faster than CenterPoint -- the first point cloud transformer achieving real-time on edge GPUs.
DSVT with TensorRT achieves 27 Hz on A100, with a pillar variant reaching 37ms latency. Community TensorRT implementations exist.
ScaLR provides the best LiDAR-only self-supervised features via DINOv2-to-LiDAR distillation, reaching 67.8% mIoU linear probing on nuScenes. Directly applicable to a multi-LiDAR stack.
No airside-specific LiDAR pre-training exists. Road driving pre-training transfers, but domain adaptation (LoRA, adapters, or DADT) is needed. PointLoRA (CVPR 2025) provides parameter-efficient fine-tuning specifically for point clouds.
LiDAR world models have emerged. Copilot4D (ICLR 2024) reduces point cloud forecasting error by 65%. LiDARCrafter (AAAI 2026 Oral) enables 4D LiDAR scene generation from language prompts.
Open-vocabulary 3D is bridged via CLIP alignment. ULIP-2 (CVPR 2024) and OpenScene (CVPR 2023) enable language-queried 3D understanding. Concerto includes a CLIP translator for open-world 3D perception.
2. 3D Point Cloud Pre-training / Self-Supervised Learning
2.1 Point-BERT (CVPR 2022)
Paper: "Pre-training 3D Point Cloud Transformers with Masked Point Modeling" Authors: Yu et al. (Tsinghua, Shanghai AI Lab) GitHub: 677 stars | MIT License
Architecture:
- Discrete VAE (dVAE) tokenizer converts local point patches into discrete tokens
- Standard Transformer backbone trained via Masked Point Modeling (MPM): masks random patches, predicts original tokens
- Two-stage training: (1) train dVAE tokenizer, (2) pre-train transformer with MPM
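The tokenization in stage (1) can be sketched as nearest-codebook assignment — a simplified, hypothetical stand-in for the learned dVAE (a real dVAE learns both the codebook and the patch encoder end-to-end):

```python
def tokenize_patch(feature, codebook):
    """Assign a patch feature vector to its nearest codebook entry.
    Masked Point Modeling then predicts these discrete token ids for the
    masked patches. Simplified sketch: a real dVAE learns the codebook."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sq_dist(feature, codebook[i]))

# toy 2-D patch features and a 3-entry codebook (illustrative values)
codebook = [[0.0, 0.0], [1.0, 1.0], [5.0, 5.0]]
token_id = tokenize_patch([0.9, 1.2], codebook)  # nearest entry: index 1
```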
Key Results:
| Benchmark | Metric | Score |
|---|---|---|
| ModelNet40 | Accuracy | 93.8% |
| ScanObjectNN (hardest) | Accuracy | 83.1% |
| Few-shot (5-way 10-shot) | Accuracy | Strong transfer |
Airside Relevance: Pioneered BERT-style pre-training for 3D. The few-shot transfer capability is relevant for learning to recognize novel airside objects (GSE, aircraft variants) with minimal labeled data. However, trained on object-level datasets (ShapeNet, ModelNet), not driving-scale point clouds.
Limitation: Object-level only. Does not handle large-scale outdoor LiDAR scenes directly.
2.2 Point-MAE (ECCV 2022)
Paper: "Masked Autoencoders for Point Cloud Self-supervised Learning" Authors: Pang et al. GitHub: 622 stars
Architecture:
- Divides point cloud into irregular patches, randomly masks at high ratio (60-80%)
- Standard Transformer autoencoder with asymmetric design (heavy encoder, light decoder)
- Shifting mask tokens operation adapted for point cloud properties
- Purely reconstruction-based objective (no tokenizer needed, unlike Point-BERT)
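The high-ratio random masking that drives the asymmetric design can be sketched as a simple index partition (illustrative helper, not the paper's code):

```python
import random

def split_visible_masked(num_patches, mask_ratio=0.6, seed=0):
    """Randomly partition patch indices into visible and masked sets.
    The asymmetric design feeds ONLY the visible patches through the heavy
    encoder; the light decoder reconstructs the masked patches."""
    rng = random.Random(seed)
    idx = list(range(num_patches))
    rng.shuffle(idx)
    n_masked = int(num_patches * mask_ratio)
    return sorted(idx[n_masked:]), sorted(idx[:n_masked])  # visible, masked

visible, masked = split_visible_masked(64, mask_ratio=0.6)
```

At a 60% ratio the encoder processes only 26 of 64 patches, which is where most of the pre-training compute savings come from.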
Key Results:
| Benchmark | Metric | Score |
|---|---|---|
| ScanObjectNN (hardest) | Accuracy | 85.18% |
| ModelNet40 | Accuracy | 94.04% |
| Few-shot (5-way 10-shot) | Accuracy | 96.3% +/- 2.5 |
| ShapeNetPart | mIoU | 86.1% |
Airside Relevance: Simpler than Point-BERT (no dVAE needed), better results. Demonstrated that a simple masked reconstruction objective learns powerful 3D features. Foundation for all subsequent MAE-based 3D methods.
Limitation: Same as Point-BERT -- object-level, not directly applicable to driving-scale scenes.
2.3 PointGPT (NeurIPS 2023)
Paper: "Auto-regressively Generative Pre-training from Point Clouds" Authors: Chen et al. (Beijing Institute of Technology) GitHub: 245 stars
Architecture:
- Partitions point cloud into irregular patches, orders via Morton (Z-order) curve
- Extractor-generator Transformer decoder with dual masking strategy
- Auto-regressive prediction of next point patches (GPT-style, not BERT-style)
- Three variants: PointGPT-S (small), PointGPT-B (base), PointGPT-L (large)
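Morton (Z-order) ordering of patch centers — the step that linearizes an unordered point cloud into a GPT-style sequence — works by bit-interleaving quantized coordinates. A minimal sketch (coordinates assumed pre-quantized to non-negative integers):

```python
def morton_key(x, y, z, bits=10):
    """Interleave the bits of quantized (x, y, z) patch centers into a
    Z-order key; sorting by this key yields the auto-regressive patch order."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (3 * i)
        key |= ((y >> i) & 1) << (3 * i + 1)
        key |= ((z >> i) & 1) << (3 * i + 2)
    return key

centers = [(3, 1, 0), (0, 0, 0), (1, 2, 3)]
ordered = sorted(centers, key=lambda c: morton_key(*c))  # (0,0,0) comes first
```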
Key Results:
| Benchmark | Metric | Score |
|---|---|---|
| ModelNet40 | Accuracy | 94.9% (PointGPT-L) |
| ScanObjectNN (hardest) | Accuracy | 93.4% (PointGPT-L) |
| Few-shot (5-way 10-shot) | Accuracy | 98.0% +/- 1.9 |
| ShapeNetPart | mIoU | 86.6% (PointGPT-L) |
Airside Relevance: Best object-level pre-training results. The auto-regressive formulation learns generative representations that better capture 3D structure than masked reconstruction. Strong few-shot capabilities. However, still object-level.
2.4 GD-MAE (CVPR 2023)
Paper: "Generative Decoder for MAE Pre-training on LiDAR Point Clouds" Authors: Yang et al. GitHub: 124 stars
Architecture:
- First MAE method designed specifically for outdoor LiDAR point clouds
- Generative decoder that hierarchically merges surrounding context to restore masked geometric knowledge
- Works with voxel-based 3D backbones (VoxelBackBone, SECOND)
- Compatible with CenterPoint and PV-RCNN detection heads
Key Results:
| Dataset | Metric | Score | Notes |
|---|---|---|---|
| Waymo (Vehicle L1) | mAPH | 80.2/79.8 | Two-stage |
| KITTI (Car) | AP | 82.01 | Moderate |
| ONCE (Vehicle) | AP | 76.79 | vs 74.10 baseline |
Critical Data Efficiency Result:
- Achieves comparable accuracy with only 20% of labeled Waymo data -- the GD-MAE_0.2 variant demonstrates that pre-training on unlabeled data followed by fine-tuning on 20% of labels matches or approaches full-label performance.
Airside Relevance: HIGH. This is directly applicable to the reference airside AV stack. GD-MAE works with the same voxel-based backbones used by CenterPoint (already documented in openpcdet-centerpoint.md). The 80% label reduction directly addresses the zero-airside-dataset problem -- pre-train on unlabeled airside LiDAR sweeps, fine-tune with minimal annotations.
2.5 AD-PT (NeurIPS 2023)
Paper: "Autonomous Driving Pre-Training with Large-scale Point Cloud Dataset" Authors: Yuan et al. (Shanghai AI Lab) GitHub: Available
Architecture:
- Formulates pre-training as semi-supervised learning: few-shot labeled + massive unlabeled data
- First to build a large-scale, diverse pre-training point cloud dataset spanning multiple driving domains
- Compatible with PV-RCNN++, SECOND, CenterPoint backbones
- Cross-dataset transfer: pre-train on combined data, fine-tune on target domain
Key Results:
- Significant improvements on Waymo, nuScenes, and KITTI across all tested backbones (PV-RCNN++, SECOND, CenterPoint)
- Key innovation: unlike prior methods that pre-train and fine-tune on the same benchmark, AD-PT enables cross-dataset generalization
- Pre-training on diverse data distribution improves downstream performance beyond single-dataset pre-training
Airside Relevance: HIGH. AD-PT's cross-dataset transfer paradigm is exactly what airside needs -- pre-train on diverse road driving LiDAR data (Waymo + nuScenes + KITTI), then fine-tune on airside data. The semi-supervised formulation means even a small number of labeled airside samples combined with unlabeled airside sweeps can be effective.
2.6 Occupancy-MAE (IEEE TIV 2023)
Paper: "Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders" Authors: Min et al. GitHub: 280 stars
Architecture:
- Designed specifically for voxel-based large-scale outdoor LiDAR
- Range-aware random masking strategy (accounts for LiDAR density variation with distance)
- Pretext task: binary occupancy prediction (does a voxel contain points?)
- Even with 90% masking ratio, learns representative features
- Compatible with SECOND, CenterPoint, PV-RCNN detectors
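The binary occupancy pretext target can be sketched as follows (a toy voxelizer, not the paper's implementation; grid bounds and voxel size are illustrative):

```python
def occupancy_targets(points, voxel_size, grid_shape):
    """Binary occupancy pretext target: a voxel is positive (1) iff it
    contains at least one LiDAR point. Returns the set of occupied voxel
    indices; every other in-grid voxel is a negative."""
    occupied = set()
    nx, ny, nz = grid_shape
    for x, y, z in points:
        v = (int(x // voxel_size), int(y // voxel_size), int(z // voxel_size))
        if 0 <= v[0] < nx and 0 <= v[1] < ny and 0 <= v[2] < nz:
            occupied.add(v)
    return occupied

points = [(0.1, 0.1, 0.1), (0.2, 0.2, 0.2), (1.5, 0.1, 0.1)]
occ = occupancy_targets(points, voxel_size=1.0, grid_shape=(4, 4, 4))
```

Because the label is derived from the raw sweep itself, the task remains well-posed even at the 90% masking ratio reported above.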
Key Results:
| Task | Dataset | Improvement |
|---|---|---|
| 3D Detection (Car) | KITTI | Reduces labeled data by 50% for car detection |
| 3D Detection (Small objects) | Waymo | ~2% AP improvement |
| 3D Segmentation | Multiple | ~2% mIoU improvement |
Airside Relevance: HIGH. The occupancy prediction pretext task is particularly well-suited for airside because occupancy awareness is fundamental for safe navigation. The 50% labeled data reduction for car detection suggests similar savings for airside objects. Works with existing OpenPCDet backbones.
2.7 BEV-MAE (AAAI 2024)
Paper: "Bird's Eye View Masked Autoencoders for Point Cloud Pre-training in Autonomous Driving" Authors: Ren et al. (Peking University) GitHub: Available
Architecture:
- Projects LiDAR point cloud onto 2D BEV grid
- BEV-guided masking: randomly masks non-empty BEV grids
- Pretext task: predicts point density per masked BEV grid (not just occupancy)
- Leverages LiDAR density-distance correlation
- Avoids complex 3D decoder design by operating in BEV space
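The density-per-grid target distinguishes BEV-MAE from plain occupancy pre-training; a minimal sketch (toy cell size, illustrative helper name):

```python
from collections import Counter

def bev_density_targets(points, cell=0.5):
    """Per-cell point counts on the 2D BEV grid. BEV-MAE's pretext target
    is this density (not merely binary occupancy) for masked non-empty
    cells, exploiting the LiDAR density-distance correlation."""
    counts = Counter()
    for x, y, _z in points:
        counts[(int(x // cell), int(y // cell))] += 1
    return counts

points = [(0.1, 0.1, 0.0), (0.2, 0.3, 1.0), (0.9, 0.1, 0.0)]
density = bev_density_targets(points, cell=0.5)
```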
Key Results:
| Setting | Metric | Improvement |
|---|---|---|
| 100% pretrain, 20% finetune (Waymo) | mAP | +1.42 over baseline |
| 100% pretrain, 20% finetune (Waymo) | APH | +1.34 over baseline |
Airside Relevance: The BEV formulation is particularly relevant because the reference airside AV stack already uses BEV representations for planning. Pre-training in BEV space means the learned features are directly aligned with the downstream planning representation.
2.8 ALSO (CVPR 2023)
Paper: "Automotive Lidar Self-supervision by Occupancy Estimation" Authors: Boulch et al. (Valeo AI) GitHub: 180 stars
Architecture:
- Self-supervised pretext task: reconstruct the surface from which 3D points were sampled
- Single-stream pipeline (unlike contrastive learning methods that require augmented pairs)
- Trains on the task of predicting whether a 3D point is on or off the LiDAR-measured surface
- Supports MinkUNet, SPVCNN backbones
- Very lightweight: trainable on limited compute resources
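The on/off-surface supervision can be sketched from a single LiDAR return: the hit lies on the surface, and any point pulled back along the ray toward the sensor is known free space (a simplified reading of the pretext task; the `offset` parameter is illustrative):

```python
import math

def surface_queries(hit, sensor=(0.0, 0.0, 0.0), offset=0.3):
    """Build (query_point, label) pairs for one LiDAR return: the hit
    itself lies ON the measured surface (label 1); a point pulled back
    `offset` meters along the ray toward the sensor is free space (label 0)."""
    d = [h - s for h, s in zip(hit, sensor)]
    dist = math.sqrt(sum(c * c for c in d))
    t = max(0.0, 1.0 - offset / dist)
    free = tuple(s + c * t for s, c in zip(sensor, d))
    return [(tuple(hit), 1), (free, 0)]

pairs = surface_queries((10.0, 0.0, 0.0), offset=0.3)  # free point at x = 9.7
```

Because both labels fall out of sensor geometry alone, this is a single-stream objective: no augmented view pairs are needed.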
Key Results:
- Consistent improvements across nuScenes, SemanticKITTI, KITTI3D, ONCE
- Works for both semantic segmentation and 3D object detection
- Single-stream design is 2-3x more resource-efficient than contrastive methods
Airside Relevance: Developed by Valeo AI (automotive Tier-1 supplier), making it industry-validated. The surface reconstruction pretext task is intuitive for LiDAR: the model learns what surfaces look like, which transfers directly to understanding object shapes. The low compute requirement is important for resource-constrained airside AV development.
2.9 GPC: Grounded Point Colorization (ICLR 2024)
Paper: "Pre-Training LiDAR-Based 3D Object Detectors Through Colorization" Authors: Pan et al. GitHub: Available
Architecture:
- Cross-modal pre-training: teaches LiDAR backbone to predict colors from point positions
- Color prediction formulated as classification over quantized RGB bins
- Ground-truth colors provided as context hints (grounded colorization)
- Balanced softmax loss handles class imbalance (especially for ground points)
- Compatible with PointRCNN, PV-RCNN, CenterPoint
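The quantized-bin classification target can be sketched like this (bin count is illustrative; GPC's actual quantization scheme may differ):

```python
def quantize_rgb(rgb, bins_per_channel=4):
    """Map an (r, g, b) triple in [0, 255] to one of bins_per_channel**3
    class indices -- framing colorization as classification over quantized
    color bins rather than direct RGB regression."""
    step = 256 // bins_per_channel
    r, g, b = (min(c // step, bins_per_channel - 1) for c in rgb)
    return (r * bins_per_channel + g) * bins_per_channel + b

num_classes = 4 ** 3                      # 64 color bins
label = quantize_rgb((30, 200, 130))      # one class id per LiDAR point
```

The balanced softmax loss then reweights these classes, since ground-surface bins dominate the label distribution.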
Key Results:
| Setting | Dataset | Result |
|---|---|---|
| 20% labeled data | KITTI | Outperforms training from scratch on 100% data |
| 5% labeled data | KITTI | +7.5% AP (55.2 -> 62.7) for PointRCNN |
| Full data | KITTI/Waymo | Significant improvements across all detectors |
Airside Relevance: VERY HIGH. The remarkable data efficiency -- outperforming full-dataset training with only 20% of labels -- is exactly what airside needs. However, requires camera-LiDAR pairs for pre-training (to provide color supervision). The reference airside AV stack has 360-degree cameras alongside LiDAR, making this directly applicable.
3. LiDAR Foundation Models & Efficient 3D Backbones
3.1 Point Transformer V3 / PTv3 (CVPR 2024 Oral)
Paper: "Point Transformer V3: Simpler, Faster, Stronger" Authors: Wu et al. (HKU, Max Planck) GitHub: 2,900 stars (via Pointcept) | Part of the Pointcept ecosystem
Architecture:
- Replaces KNN neighbor search with serialized neighbor mapping using space-filling curves (Z-order, Hilbert)
- Sparse convolution layers replace complex relative positional encodings
- Scales receptive field from 16 to 1,024 points while remaining efficient
- 3x faster and 10x more memory-efficient than PTv2
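The serialized-neighbor idea — the core substitution for KNN — can be sketched as a window over a globally sorted order (toy keys; a real implementation uses Z-order/Hilbert codes over voxelized coordinates):

```python
def serialized_neighbors(curve_keys, point_index, window=4):
    """After sorting points by a space-filling-curve key, a point's
    neighborhood is a contiguous window in the sorted order, replacing
    per-point KNN search with one global sort."""
    order = sorted(range(len(curve_keys)), key=lambda i: curve_keys[i])
    pos = order.index(point_index)
    lo = max(0, min(pos - window // 2, len(order) - window))
    return order[lo:lo + window]

keys = [5, 1, 9, 3, 7, 2]                 # toy curve keys, one per point
neighbors = serialized_neighbors(keys, point_index=0)
```

Because the window size is a constant, widening the receptive field (16 to 1,024 points) only changes a slice length, not the search cost.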
Key Results:
- SOTA on 20+ downstream tasks across indoor and outdoor scenarios
- Selected as one of 90 Oral presentations at CVPR 2024 (0.78% of submissions)
- Multi-dataset joint training further improves results
- Supported datasets: ScanNet, S3DIS, SemanticKITTI, nuScenes, Waymo, ModelNet40
Airside Relevance: VERY HIGH. PTv3 is the current best general-purpose 3D backbone. Its efficiency improvements make it viable for deployment. The Pointcept codebase provides a complete training pipeline. The multi-dataset training capability means you can pre-train across diverse LiDAR datasets before fine-tuning on airside data.
3.2 DSVT (CVPR 2023)
Paper: "Dynamic Sparse Voxel Transformer with Rotated Sets" Authors: Wang et al. GitHub: 451 stars | TensorRT implementation available
Architecture:
- Transformer-only 3D backbone (no sparse convolutions)
- Dynamic Sparse Window Attention: partitions by sparsity, not fixed windows
- Rotated set attention avoids information isolation between windows
- Attention-style 3D pooling (no custom CUDA ops -- deployment-friendly)
- Pillar and Voxel variants
Key Results:
| Variant | Waymo Accuracy | Latency (PyTorch) | Latency (TensorRT) |
|---|---|---|---|
| Pillar | 71.0 mAPH L2 | 67ms | 37ms |
| Voxel (2-stage) | 78.9 mAPH L1 | 97ms | - |
| Real-time config (TRT, A100) | - | - | 27 Hz throughput |
Airside Relevance: HIGH. The deployment-friendliness is critical -- no custom CUDA ops means easier TensorRT conversion. The pillar variant at 37ms TensorRT latency is within real-time budget for 20 km/h airside operations. Community TensorRT implementations exist (DSVT-AI-TRT).
3.3 FlatFormer (CVPR 2023)
Paper: "Flattened Window Attention for Efficient Point Cloud Transformer" Authors: Liu et al. (MIT-Han-Lab, NVIDIA) GitHub: 141 stars
Architecture:
- Trades spatial proximity for computational regularity
- Flattens point cloud via window-based sorting into equal-size groups (not equal-shape windows)
- Self-attention within groups; alternating sort axes for multi-directional feature exchange
- Shift windows for cross-group feature exchange
- First point cloud transformer to achieve real-time on edge GPUs
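The equal-size grouping trade-off can be sketched in a few lines (toy keys; a real implementation sorts by window coordinates and pads the trailing remainder):

```python
def equal_size_groups(sort_keys, group_size=3):
    """Flatten points by sorting along a window/axis key, then chop the
    sequence into equal-SIZE groups: every full group has the same point
    count (GPU-regular), at the cost of exact spatial window shapes."""
    order = sorted(range(len(sort_keys)), key=lambda i: sort_keys[i])
    return [order[i:i + group_size] for i in range(0, len(order), group_size)]

groups = equal_size_groups([4, 0, 2, 5, 1, 3, 6], group_size=3)
# attention then runs within each group; alternating the sort axis
# between blocks exchanges features across directions
```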
Key Results:
| Metric | FlatFormer | SST | CenterPoint |
|---|---|---|---|
| Waymo mAP/H_L1 | 76.1/73.4 (1-sweep) | - | - |
| Speedup vs SST | 4.6x | 1x | - |
| Speedup vs CenterPoint | 1.4x | - | 1x |
Airside Relevance: VERY HIGH for Orin deployment. FlatFormer explicitly targets edge GPU efficiency. Being faster than CenterPoint (the current reference airside AV stack detection backbone candidate) while achieving comparable or better accuracy makes it a direct upgrade path. Co-authored by NVIDIA, suggesting strong TensorRT compatibility.
3.4 SphereFormer (CVPR 2023)
Paper: "Spherical Transformer for LiDAR-based 3D Recognition" Authors: Lai et al. (CUHK, NVIDIA) GitHub: 364 stars
Architecture:
- Radial window self-attention: non-overlapping narrow, long windows extending radially from the sensor
- Overcomes disconnection between sparse distant points and dense close points
- Specifically designed for the spherical geometry of LiDAR point clouds
- Plug-and-play module that can be added to existing backbones
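Radial window assignment can be sketched by binning points in spherical angles only, so range does not affect membership (bin counts are illustrative, not the paper's configuration):

```python
import math

def radial_window_id(x, y, z, az_bins=64, incl_bins=16):
    """Points sharing an (azimuth, inclination) bin land in the same long,
    narrow radial window regardless of range, so sparse distant points
    attend together with dense nearby ones."""
    az = math.atan2(y, x)                          # [-pi, pi]
    incl = math.atan2(z, math.hypot(x, y))         # [-pi/2, pi/2]
    ai = min(int((az + math.pi) / (2 * math.pi) * az_bins), az_bins - 1)
    ii = min(int((incl + math.pi / 2) / math.pi * incl_bins), incl_bins - 1)
    return ai * incl_bins + ii

near = radial_window_id(1.0, 0.0, 0.0)
far = radial_window_id(100.0, 0.0, 0.0)            # same ray, 100x farther
```

A point 1m away and a point 100m away on the same ray share a window, which is exactly what lets distant sparse returns borrow context from dense foreground points.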
Key Results:
| Dataset | Metric | Score | Distant Point Improvement |
|---|---|---|---|
| nuScenes (val) | mIoU | 79.5% (TTA) | 13.3% -> 30.4% |
| SemanticKITTI (val) | mIoU | 69.0% (TTA) | - |
| Waymo (val) | mIoU | 70.8% (TTA) | 61.9% |
Airside Relevance: HIGH. The distant point performance improvement (13.3% -> 30.4%) is critical for airside operations where detecting distant aircraft, moving GSE, and personnel at range is essential for safe planning. The plug-and-play design means it could augment the existing reference airside AV stack perception pipeline.
3.5 LargeKernel3D (CVPR 2023)
Paper: "Scaling up Kernels in 3D Sparse CNNs" Authors: Chen et al. (CUHK, SenseTime) GitHub: 215 stars
Architecture:
- Spatial-wise partition convolution (SW-LK block) enables large 3D kernels efficiently
- Maintains sparsity while expanding the effective receptive field
- Applied on top of 3D sparse CNN backbones
Key Results:
| Dataset | Task | Metric | Score |
|---|---|---|---|
| nuScenes (test) | Detection | NDS | 72.8 (LiDAR only) |
| nuScenes (test) | Detection | NDS | 74.2 (multimodal) |
| ScanNetv2 | Segmentation | mIoU | 73.9 |
Ranked 1st on nuScenes LiDAR leaderboard at time of publication.
Airside Relevance: MODERATE. Improves 3D sparse CNN backbones that are already used in the reference airside AV stack. The large receptive field helps for detecting large objects (aircraft) at all distances. However, the CNN-based approach is being superseded by transformer methods (PTv3, DSVT).
3.6 Senna (arXiv 2024)
Paper: "Bridging Large Vision-Language Models and End-to-End Autonomous Driving" Authors: Zhao et al. (HUST VL Lab) GitHub: Available
Note: Despite initial framing as a "LiDAR foundation model," Senna is actually a VLM-based end-to-end driving system, not a LiDAR-specific foundation model. Included here for completeness as it was identified in the research scope.
Architecture:
- Senna-VLM generates planning decisions in natural language
- Senna-E2E predicts precise trajectories
- Multi-image encoding with multi-view prompts for scene understanding
- Pre-trained on DriveX (1M driving clips), fine-tuned on nuScenes
Key Results:
| Metric | Improvement |
|---|---|
| Average planning error | -27.12% (with DriveX pre-training) |
| Collision rate | -33.33% (with DriveX pre-training) |
Airside Relevance: LOW-MODERATE. Senna is primarily a camera-based VLM system, not a LiDAR foundation model. The natural language planning output is interesting for explainability but the architecture is not directly applicable to a LiDAR-primary stack. The DriveX pre-training approach (large-scale diverse data -> fine-tune) is a useful paradigm reference.
3.7 LiDARFormer (2023)
Paper: "A Unified Transformer-based Multi-task Network for LiDAR Perception" Authors: Li et al. GitHub: Available
Architecture:
- Cross-space transformer: learns attention between 2D BEV and 3D sparse voxel features
- Unified framework for detection and segmentation
- Multi-task learning improves individual task performance
Key Results:
| Dataset | Task | Metric | Score |
|---|---|---|---|
| Waymo | Detection | mAPH L2 | 76.4 |
| nuScenes | Detection | NDS | 74.3 |
Airside Relevance: MODERATE. The multi-task capability (detection + segmentation) is useful for airside where you need both object detection and drivable area segmentation. The BEV-3D cross-attention could bridge different representation layers in the reference airside AV stack.
4. Multi-modal 3D Foundation Models (Language-3D Alignment)
4.1 ULIP / ULIP-2 (CVPR 2023 / CVPR 2024)
Paper: "Learning a Unified Representation of Language, Images, and Point Clouds" Authors: Xue et al. (Salesforce Research) GitHub: 598 stars
ULIP Architecture:
- Tri-modal contrastive learning: aligns 3D point cloud features with CLIP's image and text embeddings
- Freezes CLIP image/text encoders; trains 3D encoder to align
- Supports PointNet2, PointBERT, PointMLP, PointNeXt as 3D backbones
- Model-agnostic: no extra latency at inference (only 3D encoder needed)
ULIP-2 Advances (CVPR 2024):
- Uses LLMs to auto-generate holistic language descriptions for 3D shapes (eliminates manual 3D annotation)
- Scaled to larger datasets (Objaverse, ShapeNet)
- Only needs 3D data as input -- fully scalable
Key Results:
| Benchmark | Metric | ULIP | ULIP-2 |
|---|---|---|---|
| ModelNet40 (zero-shot) | Top-1 Acc | 60.4% | 84.7% |
| Objaverse-LVIS (zero-shot) | Top-1 Acc | - | 50.6% |
| ScanObjectNN (fine-tuned) | OA | - | 91.5% (1.4M params) |
ULIP-2 outperforms PointCLIP by 28.8% on zero-shot classification.
Airside Relevance: Enables language-queried 3D understanding. For airside: "find the pushback tug" or "where is the fuel bowser?" queries on 3D point clouds. However, currently object-level (ModelNet/ShapeNet scale), not scene-level driving LiDAR.
4.2 PointCLIP / PointCLIP V2 (CVPR 2022 / ICCV 2023)
Paper: "Point Cloud Understanding by CLIP" / "Prompting CLIP and GPT for Powerful 3D Open-world Learning"
PointCLIP Architecture:
- Projects point clouds into multi-view depth maps
- Feeds depth maps to frozen CLIP image encoder
- Aggregates view-wise zero-shot predictions
PointCLIP V2 Advances:
- Realistic depth map generation via shape projection module
- GPT generates 3D-specific text prompts for CLIP's text encoder
- Unified framework for zero-shot 3D classification, segmentation, and detection
Airside Relevance: LOW-MODERATE. The multi-view projection approach is computationally expensive and loses 3D information. ULIP's direct 3D feature alignment is superior. However, PointCLIP V2's zero-shot 3D detection capability could be useful for initial airside prototyping with zero labeled data.
4.3 OpenScene (CVPR 2023)
Paper: "3D Scene Understanding with Open Vocabularies" Authors: Peng et al. GitHub: 812 stars
Architecture:
- Back-projects 3D points into multi-view images to aggregate CLIP/OpenSeg features
- Trains sparse 3D convolutional network to distill aggregated pixel features into 3D
- Enables open-vocabulary queries on 3D point clouds at inference time
- At inference: only the 3D network is needed (no images required)
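The multi-view feature aggregation that produces the distillation target can be sketched as a per-point average (toy feature dimension; real features are CLIP/OpenSeg embeddings):

```python
def fuse_multiview_features(per_view_feats):
    """Distillation target for one 3D point: average the 2D pixel features
    from every view the point projects into. The sparse 3D network is then
    trained to regress this vector, making cameras training-time-only."""
    n, dim = len(per_view_feats), len(per_view_feats[0])
    return [sum(f[d] for f in per_view_feats) / n for d in range(dim)]

# toy 2-dim features of the same 3D point seen in two camera views
target = fuse_multiview_features([[1.0, 3.0], [3.0, 1.0]])
```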
Supported Datasets: ScanNet, Matterport3D, nuScenes, Replica
Key Capabilities:
- Zero-shot 3D semantic segmentation via arbitrary text labels
- Open-vocabulary scene querying: objects, properties, materials, activities, abstract concepts
- 3D object search via image queries
- CPU inference possible after distillation
Airside Relevance: HIGH. OpenScene's open-vocabulary 3D understanding is directly applicable to airside. You could query the 3D scene with text: "aircraft engine," "ground power unit," "person in hi-vis." The distillation approach means cameras are only needed during training -- at inference, only LiDAR is required. This aligns perfectly with a LiDAR-primary stack.
Critical Insight: OpenScene's approach of distilling 2D foundation model knowledge into a 3D-only network is the ideal pattern for the reference airside AV stack: use cameras during pre-training/distillation, but deploy with LiDAR-only inference.
4.4 LiDAR-LLM (AAAI 2025)
Paper: "Exploring the Potential of Large Language Models for 3D LiDAR Understanding" Authors: Liu et al.
Architecture:
- Takes raw LiDAR point clouds as input to an LLM
- View-Aware Transformer (VAT) bridges 3D encoder and LLM
- Three-stage training: (1) LiDAR feature alignment, (2) 3D caption training, (3) 3D grounding
- Reformulates 3D scene understanding as language modeling
Key Results:
| Task | Metric | Score |
|---|---|---|
| 3D Captioning | BLEU-1 | 40.9 |
| 3D Grounding (Classification) | Accuracy | 63.1% |
| 3D Grounding (Localization) | BEV mIoU | 14.3% |
Airside Relevance: MODERATE. The ability to ask natural language questions about LiDAR scenes ("What is the large object to the left?") is useful for explainability and safety cases. However, latency may be too high for real-time perception. Better suited for offline analysis and safety auditing.
5. Pre-training to Fine-tuning Pipeline
5.1 How Much Labeled Data Does Pre-training Save?
| Method | Pre-training Type | Label Savings | Evidence |
|---|---|---|---|
| GD-MAE | MAE on LiDAR | 80% | 20% labels matches full-data performance on Waymo |
| GPC | Colorization | 80% | 20% KITTI outperforms 100% from scratch |
| GPC (extreme) | Colorization | 95% | 5% KITTI: +7.5% AP over scratch |
| Occupancy-MAE | Occupancy prediction | 50% | Halves labeled data for car detection on KITTI |
| PSA-SSL (CVPR 2025) | Pose/size-aware SSL | 90% | Matches SOTA with 10x fewer labels |
| BEV-MAE | BEV occupancy | ~60% | +1.42 mAP with 20% fine-tuning data |
| TREND | Temporal forecasting | Significant | +1.77% mAP on ONCE, +2.11% on nuScenes |
| ScaLR | Image-to-LiDAR distill | ~40-50% | 67.8% linear probe mIoU (strong frozen features) |
| ALSO | Surface reconstruction | ~30-40% | Consistent gains on SemanticKITTI, nuScenes |
Recommended Strategy for Airside:
- Pre-train backbone on diverse road LiDAR data (Waymo + nuScenes + KITTI) using AD-PT's semi-supervised paradigm
- Collect unlabeled airside LiDAR sweeps (requires only driving the vehicle)
- Continue pre-training on unlabeled airside data using GD-MAE or Occupancy-MAE
- Fine-tune with minimal labeled airside data (500-1,000 annotated frames may suffice)
5.2 Transfer from Road Driving to Airside (Domain Gap)
Key Domain Differences:
| Dimension | Road Driving | Airport Airside |
|---|---|---|
| Object sizes | Cars 4m, trucks 12m | Baggage carts 2m, A380 73m |
| Point density | Dense in 30-50m range | Variable (open areas vs. stands) |
| Dynamic objects | Cars, pedestrians | GSE, aircraft, ground crew |
| Ground surface | Asphalt with lane markings | Apron concrete, stand markings |
| Structures | Buildings, trees, signs | Jetbridges, terminal buildings |
| LiDAR patterns | Typical urban scanning | May have reflections from aircraft fuselage |
Transfer Methods (2024-2025):
Domain Adaptive Distill-Tuning (DADT):
- Specifically designed for fine-tuning pre-trained 3D models with limited target data
- Uses pseudo beam generation and BEV attention-based regularizers
- Alleviates domain shift between source (road) and target (airside) domains
Shelf-Supervised Cross-Modal Pre-Training:
- Bootstraps 3D representations using 2D image foundation models (DINOv2, CLIP)
- Yields better semi-supervised detection accuracy than self-supervised pretext tasks
- Particularly effective when labeled target data is scarce
ScaLR Distillation Pipeline:
- Distill DINOv2 features into LiDAR backbone using camera-LiDAR pairs
- Pre-train on diverse driving datasets, then fine-tune on airside
- Produces strong frozen LiDAR features that transfer across domains
5.3 Few-Shot 3D Detection with Pre-trained Models
| Approach | Setting | Result |
|---|---|---|
| GPC + PointRCNN | 5% KITTI labels | +7.5% AP (55.2 -> 62.7) |
| GD-MAE | 20% Waymo labels | Matches full-data baseline |
| PSA-SSL | 10% labels | Matches SOTA with 10x fewer labels |
| Point-BERT | 5-way 10-shot | Strong transfer on ModelNet |
| PointGPT | 5-way 10-shot | 98.0% accuracy on ModelNet |
Practical Estimate for Airside:
- Pre-trained + 500 labeled airside frames: ~65-75% mAP for common objects (tractors, baggage carts, aircraft)
- Pre-trained + 1,000 labeled frames: ~75-85% mAP
- Pre-trained + 5,000 labeled frames: ~85-90% mAP
- These estimates assume pre-training on road driving data with domain adaptation
5.4 LoRA/Adapter Approaches for 3D Models
PointLoRA (CVPR 2025):
- First LoRA method specifically designed for point cloud learning
- Multi-Scale Token Selection module captures local information
- Complements LoRA's global feature aggregation with local point cloud priors
- Integrates selected tokens at various scales via shared Prompt MLP
LoRA for LiDAR Semantic Segmentation (2026):
- 73.4% parameter reduction vs. full fine-tuning
- Greater resistance to catastrophic forgetting
- Achieves baseline accuracy with substantially fewer trainable parameters
- Suitable for resource-constrained deployment (Orin)
Adapter Strategy for Airside:
Pre-trained 3D Backbone (Frozen)
|
+-- LoRA adapters (rank 16-32) per transformer layer
| Only 2-5% parameters trained
| ~100x less GPU memory for fine-tuning
|
+-- Task-specific heads
CenterPoint head for detection
Segmentation head for drivable area

Benefits for Airside:
- Swap between road and airside LoRA adapters at deployment
- Fine-tune on consumer GPU (RTX 4090) instead of A100 cluster
- Maintain pre-trained knowledge while adapting to airside domain
- Multiple LoRA adapters: different airports, different seasons, different GSE types
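The adapter math underlying this strategy is the standard LoRA forward pass, y = Wx + alpha * B(Ax); a plain-list sketch for one feature vector (shapes and values are illustrative):

```python
def lora_forward(x, W, A, B, alpha=1.0):
    """y = W @ x + alpha * B @ (A @ x): the frozen weight W plus a
    trainable low-rank update (rank = number of rows in A). Only A and B
    receive gradients during fine-tuning; W stays frozen."""
    def matvec(M, v):
        return [sum(m * vi for m, vi in zip(row, v)) for row in M]
    base = matvec(W, x)                  # frozen pre-trained path
    delta = matvec(B, matvec(A, x))      # low-rank adapter path
    return [b + alpha * d for b, d in zip(base, delta)]

x = [1.0, 2.0]
W = [[1.0, 0.0], [0.0, 1.0]]             # frozen 2x2 weight (identity, toy)
A = [[1.0, 1.0]]                          # rank-1: 1 x d_in
B = [[1.0], [0.0]]                        # d_out x 1
y = lora_forward(x, W, A, B, alpha=0.5)   # [2.5, 2.0]
```

Swapping airport- or season-specific adapters amounts to swapping the small (A, B) pairs while the frozen backbone W is shared.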
6. Practical Deployment on NVIDIA Orin
6.1 Which Models Run on Orin (275 TOPS)?
| Model | Architecture | Latency (A100) | Estimated Latency (Orin) | TensorRT | Deployment Feasibility |
|---|---|---|---|---|---|
| PointPillars | Pillar + 2D CNN | 2ms | 6.84ms (measured) | Yes (INT8) | PRODUCTION READY |
| CenterPoint | Voxel + 2D CNN | ~15ms | ~45ms (estimated) | Yes | PRODUCTION READY |
| FlatFormer | Flat attention | ~25ms | ~75ms (estimated) | Likely | FEASIBLE with optimization |
| DSVT (Pillar) | Sparse transformer | 37ms (TRT) | ~110ms (estimated) | Yes (community) | MARGINAL (may need Thor) |
| PTv3 | Serialized attention | Varies | >100ms (estimated) | Partial | FUTURE (Thor timeline) |
| SphereFormer | Radial attention | ~50ms | >150ms (estimated) | Unknown | FUTURE |
| LargeKernel3D | Sparse CNN | ~30ms | ~90ms (estimated) | Possible | FEASIBLE with optimization |
Orin Latency Estimates: Roughly 3x A100 latency for well-optimized TensorRT models. The reference airside AV stack runs perception at ~10 Hz (100ms budget), so models under ~80ms Orin latency are viable.
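The table's feasibility calls follow from a one-line rule of thumb, sketched here (the 3x scale and ~80ms headroom threshold are the survey's heuristics, not measured values):

```python
def orin_viable(a100_latency_ms, scale=3.0, budget_ms=80.0):
    """Rule of thumb from the table above: Orin latency ~= 3x A100 latency
    for well-optimized TensorRT models; a model is viable if the estimate
    stays under ~80 ms of the 100 ms (10 Hz) perception budget."""
    return a100_latency_ms * scale <= budget_ms

orin_viable(15.0)   # CenterPoint-class: ~45 ms estimated, viable
orin_viable(37.0)   # DSVT-Pillar-class: ~111 ms estimated, marginal
```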
6.2 TensorRT Compatibility
Fully Compatible (tested):
- PointPillars: via NVIDIA Lidar_AI_Solution, INT8 PTQ, 6.84ms on Orin
- CenterPoint: via NVIDIA Lidar_AI_Solution
- DSVT (Pillar): community TensorRT implementation (DSVT-AI-TRT), 37ms on A100 FP16
Likely Compatible (standard ops):
- FlatFormer: standard attention ops, designed for edge deployment by MIT-Han-Lab/NVIDIA
- BEV-MAE pre-trained backbones: standard voxel backbone, same TRT path as CenterPoint
- GD-MAE pre-trained backbones: same as above
Challenging (custom ops):
- SphereFormer: custom radial window ops
- PTv3: serialized attention with space-filling curves (non-standard)
- PointGPT/Point-MAE/Point-BERT: designed for object-level, not optimized for deployment
6.3 Integration with Existing Detection Heads
Pre-trained backbones can drop into OpenPCDet detection heads:
Pre-trained Backbone (GD-MAE, AD-PT, BEV-MAE, Occupancy-MAE)
|
+-- VoxelBackBone8x (standard OpenPCDet backbone)
|
+-- Compatible Detection Heads:
- CenterHead (CenterPoint) -- heatmap-based, anchor-free
- AnchorHeadSingle (PointPillars) -- anchor-based
- Voxel R-CNN head -- two-stage refinement
- PV-RCNN head -- point-voxel fusion

This means the pre-training methods from Section 2 (GD-MAE, AD-PT, Occupancy-MAE, BEV-MAE) can be used to initialize the same backbones already used for CenterPoint and PointPillars in the reference airside AV stack, with no architectural changes.
6.4 Recommended Deployment Path
Phase 1 (Now): PointPillars with Pre-trained Backbone
- Use GD-MAE or Occupancy-MAE pre-trained VoxelBackBone8x
- Same PointPillars head, same TensorRT pipeline
- Expected: ~6.84ms on Orin (unchanged latency), +2-5% mAP from pre-training
Phase 2 (6-12 months): CenterPoint with Pre-trained Backbone + LoRA
- Pre-train on Waymo/nuScenes, fine-tune with LoRA on airside data
- CenterPoint head for anchor-free multi-class detection
- Expected: ~45ms on Orin, ~70-80% mAP with 1,000 labeled airside frames
Phase 3 (Thor era, 2026-2027): Transformer Backbone
- DSVT or FlatFormer backbone with pre-trained weights
- PTv3 + Sonata/Concerto pre-training when Thor hardware available
- Expected: ~30ms on Thor, >85% mAP with full pre-training pipeline
7. LiDAR World Models & Generation
7.1 Copilot4D (ICLR 2024)
Paper: "Learning Unsupervised World Models for Autonomous Driving via Discrete Diffusion" Authors: Zhang et al. (Waabi)
Architecture:
- VQVAE tokenizer converts LiDAR point clouds to discrete tokens
- Discrete diffusion model predicts future token sequences
- Parallel decoding and denoising via enhanced Masked Generative Image Transformer (MaskGIT)
Key Results:
- Reduces Chamfer distance by >65% for 1s prediction, >50% for 3s prediction
- Evaluated on nuScenes, KITTI Odometry, Argoverse2
- First LiDAR-native world model for autonomous driving
Airside Relevance: HIGH. LiDAR point cloud forecasting is directly applicable to predicting future positions of aircraft, GSE, and personnel. The unsupervised nature means no labeled data needed for world model training -- just collect LiDAR sequences while driving.
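Copilot4D's headline metric, Chamfer distance, is worth making concrete. A minimal symmetric Chamfer distance in NumPy -- brute-force O(NM) pairwise distances, fine for evaluation-sized clouds; production evaluation would use a KD-tree:

```python
import numpy as np

def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N,3) and b (M,3):
    mean nearest-neighbor distance from a to b plus from b to a."""
    d = np.linalg.norm(a[:, None, :] - b[None, :, :], axis=-1)  # (N, M) pairwise
    return float(d.min(axis=1).mean() + d.min(axis=0).mean())
```

A ">65% reduction" in this metric means the forecast clouds sit much closer to the real future sweep than prior methods' forecasts.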
7.2 LiDARCrafter (AAAI 2026 Oral)
Paper: "Dynamic 4D World Modeling from LiDAR Sequences" Authors: WorldBench GitHub: 193 stars
Architecture:
- Three-component pipeline: (1) 4D layout generation from language, (2) single-frame LiDAR synthesis, (3) temporal consistency enforcement
- Language-guided: "add a moving vehicle on the left lane" generates corresponding 4D LiDAR sequence
- Scene-level, object-level, and sequence-level evaluation
Key Results:
- Best single-frame LiDAR generation on nuScenes
- Superior foreground object quality and temporal stability
- First controllable 4D LiDAR generation model
Airside Relevance: HIGH for simulation and training data generation. Generate synthetic airside LiDAR scenarios: "aircraft pushback from Stand 42," "baggage tractor crossing apron," "FOD on taxiway." Addresses the zero-public-airside-dataset problem through generation.
7.3 LidarDM (ICRA 2025)
Paper: "Generative LiDAR Simulation in a Generated World"
Architecture:
- Layout-aware LiDAR point cloud generation
- Physically plausible and temporally coherent
- Guided by driving scenarios (map, traffic)
Airside Relevance: MODERATE. LiDAR simulation for training and testing without real airside data.
7.4 Cosmos-Transfer-LidarGen (NVIDIA, 2025)
Part of NVIDIA Cosmos ecosystem:
- Uses Cosmos-Predict as runtime engine
- LiDAR tokenizer for range map representation
- Diffusion model for multi-view RGB to LiDAR range map generation
- Post-training scripts for custom dataset fine-tuning
- Commercially licensed (NVIDIA Open Model License)
Airside Relevance: HIGH. Commercially licensed LiDAR generation within the Cosmos ecosystem. Can generate synthetic LiDAR data from camera imagery, expanding training data. NVIDIA partnership makes it a natural fit for Orin/Thor deployment.
8. Latest 2025-2026 Advances
8.1 Sonata (CVPR 2025 Highlight)
Paper: "Self-Supervised Learning of Reliable Point Representations" Authors: Wu et al. (Meta, HKU) GitHub: 711 stars (via Facebook Research)
Architecture:
- Self-distillation approach on PTv3 backbone
- Encoder-only architecture for 3D point cloud understanding
- Multi-dataset pre-training across indoor and outdoor scenarios
- SOTA on ScanNet, S3DIS semantic segmentation
License: CC-BY-NC 4.0 (restricted by NC datasets like HM3D, ArkitScenes)
Airside Relevance: HIGH. Provides pre-trained PTv3 weights that transfer across domains. The self-supervised approach means no labels needed for pre-training. However, NC license restricts commercial deployment.
8.2 Concerto (NeurIPS 2025)
Paper: "Joint 2D-3D Self-Supervised Learning Emerges Spatial Representations" Authors: Zhang et al. (Pointcept team) GitHub: 519 stars
Architecture:
- Intra-modal self-distillation on 3D point clouds (refines internal spatial representations)
- Cross-modal joint embedding prediction (aligns point features with image patch features using camera parameters)
- Simulates human multisensory synergy for spatial cognition
- Includes CLIP translator for open-world 3D perception
- Three sizes: Small (39M), Base (108M), Large (208M)
Key Results:
- Outperforms standalone SOTA 2D SSL by 14.2% in linear probing for 3D perception
- Outperforms standalone SOTA 3D SSL by 4.8% in linear probing
- 80.7% mIoU on ScanNet with full fine-tuning (new SOTA)
- Variant for video-lifted point cloud spatial understanding
Airside Relevance: VERY HIGH. The joint 2D-3D learning is ideal for the reference airside AV stack which has both cameras and LiDAR. The CLIP translator enables open-world airside perception without airside-specific labels. The 39M parameter small model is potentially Orin-viable.
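Linear probing, the evaluation behind the numbers above, trains only a linear classifier on frozen encoder features. A hedged sketch substituting closed-form ridge regression for the logistic probe typically used in these papers (function names are illustrative):

```python
import numpy as np

def linear_probe(features: np.ndarray, labels: np.ndarray, num_classes: int,
                 reg: float = 1e-3) -> np.ndarray:
    """Fit a linear classifier on frozen features via ridge regression --
    a closed-form stand-in for the logistic probe used in the papers."""
    one_hot = np.eye(num_classes)[labels]                    # (N, C) targets
    f = np.hstack([features, np.ones((len(features), 1))])   # append bias column
    w = np.linalg.solve(f.T @ f + reg * np.eye(f.shape[1]), f.T @ one_hot)
    return w  # (D+1, C); predict with argmax over f @ w

def probe_accuracy(features: np.ndarray, labels: np.ndarray,
                   w: np.ndarray) -> float:
    f = np.hstack([features, np.ones((len(features), 1))])
    return float((np.argmax(f @ w, axis=1) == labels).mean())
```

The point of the protocol: the encoder is never updated, so probe accuracy measures representation quality alone -- which is why Concerto's +4.8% over 3D-only SSL is meaningful.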
8.3 Utonia (March 2026)
Paper: "Toward One Encoder for All Point Clouds" Authors: Pointcept team GitHub: Available via Pointcept
Architecture:
- Single self-supervised encoder across heterogeneous domains: remote sensing, outdoor LiDAR, indoor RGB-D, CAD models, RGB-lifted point clouds
- Three designs: Causal Modality Blinding, Perceptual Granularity Rescale, RoPE for Cross-Domain Spatial Encoding
- Unified representation space across fundamentally different sensing geometries
- Emergent cross-domain behaviors from joint training
Key Capabilities:
- Benefits robotic manipulation when used as VLA features
- Improves VLM spatial reasoning when integrated
- Step toward true foundation model for sparse 3D data
Airside Relevance: HIGH. The universal encoder concept means pre-training on all available 3D data (driving, indoor, CAD models of aircraft/GSE) and deploying on airside LiDAR. The cross-domain capability directly addresses the domain gap problem.
8.4 NOMAE (CVPR 2025)
Paper: "Multi-Scale Neighborhood Occupancy Masked Autoencoder for Self-Supervised Learning in LiDAR" Authors: Abdelsamad et al.
Architecture:
- Masked occupancy reconstruction only in neighborhood of non-masked voxels (prevents occupancy information leakage)
- Hierarchical mask generation captures objects at multiple scales
- Separate decoders for each feature scale (multi-scale representation)
- Token upsampling module fuses multi-scale representations
Key Results:
- First SSL method to outperform strong supervised learning models on some benchmarks
- New SOTA across nuScenes and Waymo for SSL-based 3D perception
- Superior performance on both semantic segmentation and 3D object detection
Airside Relevance: VERY HIGH. The multi-scale architecture handles the extreme scale variation on airside (2m baggage carts to 73m A380). The occupancy-based pretext task directly teaches the model about 3D structure. New CVPR 2025 SOTA.
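The pretext target NOMAE reconstructs is a binary occupancy grid over voxels. A minimal voxelization sketch of that target (function name and fixed-grid layout are illustrative, not NOMAE's implementation):

```python
import numpy as np

def occupancy_grid(points: np.ndarray, voxel_size: float,
                   grid_min: np.ndarray, grid_shape: tuple) -> np.ndarray:
    """Binary occupancy target for a masked-occupancy pretext task:
    1 where at least one LiDAR point falls in the voxel, 0 elsewhere."""
    idx = np.floor((points - grid_min) / voxel_size).astype(int)
    in_bounds = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    grid = np.zeros(grid_shape, dtype=np.uint8)
    ix, iy, iz = idx[in_bounds].T
    grid[ix, iy, iz] = 1
    return grid
```

NOMAE's contribution is in where this target is predicted (only near non-masked voxels, to prevent leakage) and at which scales, not in the voxelization itself.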
8.5 PSA-SSL (CVPR 2025)
Paper: "Pose and Size-aware Self-Supervised Learning on LiDAR Point Clouds" Authors: Nisar et al.
Architecture:
- Self-supervised bounding box regression as pretext task (first to use this)
- LiDAR beam pattern augmentation for sensor-agnostic features
- Complements contrastive learning with pose/size awareness
- 33% reduced pre-training time vs. comparable methods
Key Results:
- Matches SOTA SSL methods using up to 10x fewer labels on Waymo, nuScenes, SemanticKITTI
- Superior 3D object detection performance
- Sensor-agnostic features (important for cross-sensor transfer)
Airside Relevance: VERY HIGH. The sensor-agnostic feature learning means pre-trained models can transfer between different LiDAR configurations (e.g., from 64-beam Waymo LiDAR to 4-8 RoboSense RSHELIOS/RSBP on reference airside vehicles). The 10x label reduction is the best in the field.
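Beam-pattern augmentation can be approximated by subsampling beam rings -- an illustrative sketch, not PSA-SSL's code: keep every k-th ring to mimic a sparser sensor, deriving ring indices from elevation when the point format lacks a ring channel (both helper names are assumptions):

```python
import numpy as np

def beam_dropout(points: np.ndarray, ring: np.ndarray,
                 keep_every: int = 2) -> np.ndarray:
    """Keep every k-th beam ring; keep_every=2 roughly turns a 64-beam
    sweep into a 32-beam one, simulating a different sensor."""
    return points[ring % keep_every == 0]

def elevation_to_ring(points: np.ndarray, beam_angles: np.ndarray) -> np.ndarray:
    """Assign each point to its nearest beam elevation (angles in degrees)."""
    elev = np.degrees(np.arctan2(points[:, 2],
                                 np.linalg.norm(points[:, :2], axis=1)))
    return np.abs(elev[:, None] - beam_angles[None, :]).argmin(axis=1)
```

Training on such augmented sweeps is what makes the learned features robust to the 64-beam-to-RoboSense transfer noted above.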
8.6 TREND (NeurIPS 2025)
Paper: "Unsupervised 3D Representation Learning via Temporal Forecasting for LiDAR Perception" Authors: Chen et al. GitHub: Available
Architecture:
- First temporal forecasting method for unsupervised 3D pre-training
- Recurrent Embedding scheme generates 3D embeddings across time
- Temporal LiDAR Neural Field represents 3D scenes
- Differentiable rendering for loss computation
- Exploits object motion and semantics naturally present in temporal sequences
Key Results:
| Dataset | Task | Improvement | vs. Previous SOTA |
|---|---|---|---|
| ONCE | Detection | +1.77% mAP | gain ~400% larger than prior SOTA's |
| nuScenes | Detection | +2.11% mAP | gain ~400% larger than prior SOTA's |
| SemanticKITTI | Segmentation | Consistent gains | - |
Airside Relevance: HIGH. Temporal pre-training exploits the sequential nature of LiDAR data collection. Just driving the reference airside vehicle around the airside captures temporal LiDAR sequences that can be used for unsupervised pre-training with TREND. No labels needed, and the temporal context captures object dynamics (moving GSE, taxiing aircraft).
9. Comparative Summary Table
9.1 Pre-training Methods
| Method | Venue | Year | Type | Backbone | Data Savings | GitHub Stars | Orin-Ready | Airside Fit |
|---|---|---|---|---|---|---|---|---|
| Point-BERT | CVPR | 2022 | Masked modeling | Transformer | Few-shot | 677 | No | Low |
| Point-MAE | ECCV | 2022 | Masked reconstruction | Transformer | Few-shot | 622 | No | Low |
| PointGPT | NeurIPS | 2023 | Autoregressive | Transformer | Few-shot | 245 | No | Low |
| GD-MAE | CVPR | 2023 | MAE + generative decoder | Voxel CNN | 80% | 124 | Yes | HIGH |
| AD-PT | NeurIPS | 2023 | Semi-supervised | Multi-backbone | Significant | - | Yes | HIGH |
| Occupancy-MAE | IEEE TIV | 2023 | Occupancy prediction | Voxel CNN | 50% | 280 | Yes | HIGH |
| BEV-MAE | AAAI | 2024 | BEV occupancy | Voxel CNN | ~60% | - | Yes | HIGH |
| ALSO | CVPR | 2023 | Surface reconstruction | Sparse CNN | 30-40% | 180 | Yes | HIGH |
| GPC | ICLR | 2024 | Colorization | Multi-backbone | 80-95% | - | Yes | VERY HIGH |
| UniPAD | CVPR | 2024 | Volume rendering | Multi-modal | 50-60% | 204 | Partial | HIGH |
| ScaLR | CVPR | 2024 | Image-to-LiDAR distill | WaffleIron | 40-50% | 62 | Partial | HIGH |
| Sonata | CVPR | 2025 | Self-distillation | PTv3 | Large | 711 | No (future) | HIGH |
| Concerto | NeurIPS | 2025 | 2D-3D joint SSL | PTv3 | Large | 519 | No (future) | VERY HIGH |
| NOMAE | CVPR | 2025 | Multi-scale occ MAE | Voxel | Beats supervised | - | Yes | VERY HIGH |
| PSA-SSL | CVPR | 2025 | Pose/size-aware SSL | Multi-backbone | 90% | - | Yes | VERY HIGH |
| TREND | NeurIPS | 2025 | Temporal forecasting | Multi-backbone | Significant | - | Yes | HIGH |
| Utonia | arXiv | 2026 | Universal encoder | PTv3 | Large | - | No (future) | HIGH |
9.2 Backbone Models
| Model | Venue | Year | Latency (A100) | TensorRT | Orin Latency (est.) | GitHub Stars | Airside Fit |
|---|---|---|---|---|---|---|---|
| PointPillars | CVPR | 2019 | <5ms | Yes (NVIDIA) | 6.84ms | (in OpenPCDet) | PRODUCTION |
| CenterPoint | CVPR | 2021 | ~15ms | Yes (NVIDIA) | ~45ms | (in OpenPCDet) | PRODUCTION |
| FlatFormer | CVPR | 2023 | ~25ms | Likely | ~75ms | 141 | HIGH |
| DSVT | CVPR | 2023 | 37ms (TRT) | Yes | ~110ms | 451 | MARGINAL |
| SphereFormer | CVPR | 2023 | ~50ms | Unknown | >150ms | 364 | FUTURE |
| LargeKernel3D | CVPR | 2023 | ~30ms | Possible | ~90ms | 215 | MODERATE |
| LiDARFormer | - | 2023 | ~40ms | Unknown | ~120ms | - | FUTURE |
| PTv3 | CVPR | 2024 | Varies | Partial | >100ms | 2,900 | FUTURE (Thor) |
10. Recommendations for LiDAR-Primary Airside AV
10.1 Immediate Actions (1-3 months)
Deploy GD-MAE or Occupancy-MAE pre-training on current PointPillars/CenterPoint backbone:
- No architecture change needed -- pre-training initializes the same VoxelBackBone8x
- Pre-train on unlabeled Waymo + nuScenes data
- Expected improvement: +2-5% mAP at zero additional labeling cost
- Same TensorRT deployment path, same latency on Orin
Collect unlabeled airside LiDAR data:
- Drive reference airside vehicles around the airside, recording LiDAR sweeps
- No annotation needed -- raw sweeps are used for self-supervised pre-training
- Target: 10,000+ sweeps across multiple airports, times of day, weather conditions
Evaluate GPC if camera-LiDAR pairs are available:
- If the reference airside AV stack records synchronized camera and LiDAR, GPC's colorization pre-training could yield 80-95% label savings
- Outperforms full-dataset training from scratch with only 20% of labels
10.2 Medium-term (3-12 months)
Implement ScaLR distillation pipeline:
- Distill DINOv2 features from cameras into LiDAR backbone
- Produces strong frozen LiDAR features that transfer to airside
- Requires camera-LiDAR calibration (already available in reference airside AV stack)
Add PointLoRA for parameter-efficient airside fine-tuning:
- Freeze pre-trained backbone, add LoRA adapters (rank 16-32)
- Fine-tune on labeled airside data (target: 1,000 annotated frames)
- Maintain separate LoRA adapters per airport if needed
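The LoRA update itself is small: the frozen weight W is perturbed by a low-rank product scaled by alpha/r. A minimal NumPy sketch of the effective weight (the rank 16-32 target is from the plan above; the helper name is illustrative, and real adapters would live inside the network's linear layers):

```python
import numpy as np

def lora_effective_weight(w: np.ndarray, a: np.ndarray, b: np.ndarray,
                          alpha: float) -> np.ndarray:
    """Effective weight of a LoRA-adapted linear layer: W + (alpha/r) * B @ A.
    W (d_out, d_in) stays frozen; only A (r, d_in) and B (d_out, r) train,
    so r = 16-32 keeps a per-airport adapter tiny relative to the backbone."""
    r = a.shape[0]
    return w + (alpha / r) * (b @ a)
```

Standard practice initializes B to zero, so the adapted layer starts exactly equal to the pre-trained one -- which is what makes per-airport adapter swapping safe.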
Evaluate FlatFormer as CenterPoint backbone replacement:
- 1.4x faster than CenterPoint with comparable/better accuracy
- First transformer backbone viable for Orin deployment
- Drop-in replacement for sparse convolution backbone
Explore OpenScene distillation for open-vocabulary LiDAR:
- Use camera data during training to distill CLIP features into LiDAR backbone
- Deploy LiDAR-only at inference for open-vocabulary 3D understanding
- Enables text-queried detection of novel airside objects
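The distillation objective in OpenScene-style methods is typically a cosine loss pulling each 3D point feature toward the CLIP feature of the pixel it projects to. A hedged sketch assuming point-to-pixel correspondences have already been computed from camera-LiDAR calibration (function name is illustrative):

```python
import numpy as np

def distill_loss(point_feats: np.ndarray, clip_feats: np.ndarray) -> float:
    """Mean cosine distance between per-point LiDAR features (N, D) and the
    CLIP features of their corresponding pixels (N, D). Correspondences are
    assumed precomputed from calibration; rows are paired."""
    p = point_feats / np.linalg.norm(point_feats, axis=1, keepdims=True)
    c = clip_feats / np.linalg.norm(clip_feats, axis=1, keepdims=True)
    return float(1.0 - (p * c).sum(axis=1).mean())
```

Because the loss needs cameras only to build targets, the trained LiDAR backbone runs camera-free at inference while its features remain comparable to CLIP text embeddings.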
10.3 Long-term (12-24 months, Thor timeline)
Adopt PTv3 + Sonata/Concerto pre-training:
- When Thor hardware is available (~1,000 TOPS), PTv3 becomes viable for real-time
- Sonata provides self-supervised pre-trained weights
- Concerto adds 2D-3D joint learning with CLIP translator for open-world perception
Integrate LiDAR world model (Copilot4D/LiDARCrafter):
- Use for simulation: generate synthetic airside LiDAR scenarios
- Use for prediction: forecast future LiDAR observations for planning
- Address the zero-public-airside-dataset problem through generation
Build airside LiDAR foundation model:
- Pre-train Utonia-style universal encoder on all available 3D data + airside data
- Create the first airside-specific LiDAR benchmark
- Target: single model that works across all airport types and conditions
10.4 POC Priority Order
| Priority | POC | Cost | Impact | Difficulty |
|---|---|---|---|---|
| 1 | GD-MAE pre-training for PointPillars | $500 (compute) | +2-5% mAP, zero labels | Low |
| 2 | Collect unlabeled airside LiDAR data | $0 (existing vehicles) | Enables all downstream | Operational |
| 3 | GPC colorization pre-training | $1,000 (compute) | 80%+ label savings | Medium |
| 4 | FlatFormer backbone evaluation | $500 (compute) | 1.4x faster, same accuracy | Medium |
| 5 | PointLoRA fine-tuning on airside | $500 (compute) | Domain adaptation | Medium |
| 6 | ScaLR DINOv2-to-LiDAR distillation | $2,000 (compute) | Strong frozen features | Medium-High |
| 7 | OpenScene distillation for open-vocab | $2,000 (compute) | Open-world 3D | High |
| 8 | LiDARCrafter synthetic data | $3,000 (compute) | Synthetic airside data | High |
11. References
Core Pre-training Papers
| Paper | Venue | Year | Link |
|---|---|---|---|
| Point-BERT | CVPR | 2022 | arXiv, GitHub |
| Point-MAE | ECCV | 2022 | GitHub |
| PointGPT | NeurIPS | 2023 | arXiv, GitHub |
| GD-MAE | CVPR | 2023 | Paper, GitHub |
| AD-PT | NeurIPS | 2023 | arXiv, NeurIPS |
| Occupancy-MAE | IEEE TIV | 2023 | arXiv, GitHub |
| BEV-MAE | AAAI | 2024 | arXiv, GitHub |
| ALSO | CVPR | 2023 | arXiv, GitHub |
| GPC | ICLR | 2024 | arXiv, GitHub |
| UniPAD | CVPR | 2024 | arXiv, GitHub |
| ScaLR | CVPR | 2024 | arXiv, GitHub |
| PSA-SSL | CVPR | 2025 | arXiv |
| NOMAE | CVPR | 2025 | arXiv |
| TREND | NeurIPS | 2025 | arXiv, GitHub |
Foundation Models & Backbones
| Paper | Venue | Year | Link |
|---|---|---|---|
| PTv3 | CVPR (Oral) | 2024 | arXiv, GitHub |
| Sonata | CVPR (Highlight) | 2025 | Paper, GitHub |
| Concerto | NeurIPS | 2025 | arXiv, GitHub |
| Utonia | arXiv | 2026 | arXiv, GitHub |
| Pointcept | Framework | 2022-2026 | GitHub (2,900 stars) |
| DSVT | CVPR | 2023 | arXiv, GitHub |
| FlatFormer | CVPR | 2023 | arXiv, GitHub |
| SphereFormer | CVPR | 2023 | arXiv, GitHub |
| LargeKernel3D | CVPR | 2023 | arXiv, GitHub |
Multi-modal & Language-3D
| Paper | Venue | Year | Link |
|---|---|---|---|
| ULIP | CVPR | 2023 | arXiv, GitHub |
| ULIP-2 | CVPR | 2024 | arXiv |
| PointCLIP | CVPR | 2022 | arXiv |
| PointCLIP V2 | ICCV | 2023 | arXiv |
| OpenScene | CVPR | 2023 | arXiv, GitHub |
| LiDAR-LLM | AAAI | 2025 | arXiv |
LiDAR World Models & Generation
| Paper | Venue | Year | Link |
|---|---|---|---|
| Copilot4D | ICLR | 2024 | arXiv, Website |
| LiDARCrafter | AAAI (Oral) | 2026 | GitHub |
| LidarDM | ICRA | 2025 | arXiv, GitHub |
| Cosmos-LidarGen | NVIDIA | 2025 | GitHub |
| DIO | CVPR | 2025 | Paper |
Deployment & Adaptation
| Paper | Venue | Year | Link |
|---|---|---|---|
| PointLoRA | CVPR | 2025 | Paper |
| DSVT-AI-TRT | Community | 2023 | GitHub |
| Senna | arXiv | 2024 | arXiv, GitHub |
Cross-References to Other Documents in This Repository
- PointPillars architecture details: 10-knowledge-base/geometry-3d/pointpillars.md
- CenterPoint + OpenPCDet setup: 30-autonomy-stack/perception/overview/openpcdet-centerpoint.md
- TensorRT deployment guide: 20-av-platform/compute/tensorrt-deployment-guide.md
- NVIDIA Orin specs (275 TOPS): 20-av-platform/compute/nvidia-orin-technical.md
- NVIDIA Thor specs (~1000 TOPS): 20-av-platform/compute/nvidia-drive-thor.md
- Vision foundation models (SAM, DINOv2, CLIP): 30-autonomy-stack/perception/overview/vision-foundation-models.md
- Open-vocabulary detection (Grounding DINO, YOLO-World): 30-autonomy-stack/perception/overview/open-vocab-detection.md
- DINOv2 for driving (LoRA integration): 30-autonomy-stack/perception/overview/dinov2-foundation-models-driving.md
- Occupancy world models: 30-autonomy-stack/world-models/occupancy-world-models.md
- Sensor fusion architectures: 30-autonomy-stack/perception/overview/sensor-fusion-architectures.md
- RoboSense LiDAR specs: 20-av-platform/sensors/robosense-lidar.md
- Master synthesis: 90-synthesis/master/master-synthesis.md
- POC proposals: 90-synthesis/poc-roadmaps/poc-proposals.md