LiDAR-Camera Occupancy Fusion
Executive Summary
- LiDAR-camera occupancy fusion predicts dense 3D semantic occupancy by combining LiDAR geometry with camera semantics.
- The strongest recent pattern is not one-shot concatenation; SDG-OCC, MS-Occ, and MR-Occ all add explicit cross-modal alignment, visibility handling, or staged refinement.
- LiDAR supplies metric structure, sparse depth, and free-space rays; cameras supply class evidence, texture, signs, lane context, and far-field semantics.
- Occupancy is a better planning interface than only boxes when obstacles are irregular, partially visible, or not part of a fixed detection taxonomy.
- The method family is relevant to AVs, mobile robots, warehouses, mines, ports, and airports, but only if the system preserves uncertainty, visibility state, and calibration health.
- For airport autonomy, this is a strong base layer for aircraft stands and service roads, but it must be extended with airside classes and adverse-weather validation.
Problem Fit
- Use this family when a system needs a voxel-level understanding of occupied, free, unknown, and semantic space.
- It is especially useful when box detectors miss non-box-like hazards such as cones, chocks, hoses, tow bars, small carts, pallets, vegetation, construction barriers, or aircraft ground equipment.
- It fits sensor suites that already carry both LiDAR and surround cameras and need a common representation for planning, validation, and map comparison.
- It is less appropriate as a complete replacement for tracking, because current occupancy fusion normally predicts instantaneous occupancy rather than persistent identities.
- It should not be treated as a pure segmentation task. The free/unknown/occluded distinction matters as much as semantic class.
- For safety cases, the output should be consumed as a probabilistic scene representation with sensor provenance, not as a single hard voxel label map.
Method Mechanics
- SDG-OCC uses semantic and depth-guided BEV transformation to reduce the ambiguity of projecting image features into 3D.
- Its semantic/depth guidance makes camera lifting more geometry-aware than plain view transformation from image features alone.
- MS-Occ introduces middle-stage and late-stage LiDAR-camera fusion so that geometry and semantics interact at multiple points in the pipeline.
- In MS-Occ, Gaussian-Geo renders sparse LiDAR depth with Gaussian kernels to give image features dense geometric priors (a minimal sketch of the idea follows this list), while a semantic-aware module enriches LiDAR voxels through deformable cross-attention.
- MS-Occ then uses adaptive voxel fusion and high-confidence voxel refinement to resolve semantic disagreements after separate modality processing.
- MR-Occ uses hierarchical multi-resolution voxel refinement so compute is concentrated on more informative voxels rather than spent uniformly across empty space.
- MR-Occ adds an occlusion-aware prediction state, separating visible occupied voxels from occluded occupied regions where direct sensor evidence is absent.
- A production implementation should track which modality supported each occupied voxel, because LiDAR-only, camera-only, and fused evidence have different failure modes.
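The Gaussian-splatted depth prior mentioned above can be made concrete with a small sketch. This is an illustrative re-statement of the idea, not MS-Occ's implementation; the kernel width sigma_px, the splat radius, and the array shapes are assumptions.

```python
import numpy as np

def splat_depth_prior(uv, depth, image_hw, sigma_px=2.0, radius=3):
    """Densify sparse projected LiDAR depths with Gaussian kernels.

    uv:       (N, 2) pixel coordinates of projected LiDAR points
    depth:    (N,)   metric depth per point
    image_hw: (H, W) image size
    Returns a dense depth prior and the raw weight map (both H x W).
    """
    H, W = image_hw
    num = np.zeros((H, W), dtype=np.float32)  # weighted-depth accumulator
    den = np.zeros((H, W), dtype=np.float32)  # Gaussian-weight accumulator
    for (u, v), d in zip(uv, depth):
        u0, v0 = int(round(u)), int(round(v))
        for dv in range(-radius, radius + 1):
            for du in range(-radius, radius + 1):
                x, y = u0 + du, v0 + dv
                if 0 <= x < W and 0 <= y < H:
                    w = np.exp(-(du * du + dv * dv) / (2.0 * sigma_px ** 2))
                    num[y, x] += w * d
                    den[y, x] += w
    prior = np.where(den > 0, num / np.maximum(den, 1e-6), 0.0)
    return prior, den

# Toy usage: three projected LiDAR points in an 8x8 image.
uv = np.array([[2.0, 2.0], [5.0, 3.0], [6.0, 6.0]])
depth = np.array([10.0, 22.5, 40.0])
prior, weight = splat_depth_prior(uv, depth, (8, 8))
```

The weight map indicates where the prior is actually LiDAR-supported rather than interpolated, which connects to the modality-attribution point above.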
Inputs and Outputs
- Input: LiDAR point clouds with per-point timestamp, intensity or reflectance, ring/channel when available, and ego-frame calibration.
- Input: multi-view RGB images with intrinsics, extrinsics, exposure metadata, camera timestamps, and ego pose.
- Optional input: temporal sweeps, radar tracks, map priors, wheel odometry, and weather or lens-soiling state.
- Training input: semantic occupancy labels from benchmarks such as OpenOccupancy (nuScenes-Occupancy), Occ3D, or SemanticKITTI-style voxel labels.
- Output: dense voxel grid with empty, occupied, semantic, and optionally occluded/unknown states.
- Optional output: per-voxel confidence, modality attribution, visibility state, and planner-facing free-space masks.
- Optional output: intermediate BEV or voxel features reused by detection, tracking, mapping, or flow heads.
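As a concrete reading of the output list above, here is a hedged sketch of a planner-facing occupancy record. The field names, the visibility enum, and the bit-flag modality encoding are illustrative assumptions, not an interface from any cited paper.

```python
from dataclasses import dataclass
from enum import IntEnum
import numpy as np

class Visibility(IntEnum):
    FREE = 0       # observed empty along at least one sensor ray
    OCCUPIED = 1   # occupied with direct sensor support
    OCCLUDED = 2   # predicted occupied without direct evidence
    UNKNOWN = 3    # outside sensor coverage

@dataclass
class OccupancyOutput:
    semantics: np.ndarray   # int16, (X, Y, Z), indices into the class taxonomy
    confidence: np.ndarray  # float32, (X, Y, Z), per-voxel max class probability
    visibility: np.ndarray  # uint8, (X, Y, Z), Visibility values
    modality: np.ndarray    # uint8, (X, Y, Z); bit 0 = LiDAR-supported, bit 1 = camera-supported
    voxel_size_m: float     # e.g. 0.4 for road grids, finer for indoor/industrial use
    origin_m: tuple         # ego-frame coordinates of voxel (0, 0, 0)

    def planner_free_mask(self, min_conf: float = 0.7) -> np.ndarray:
        """Free space the planner may use: observed free AND confident."""
        return (self.visibility == Visibility.FREE) & (self.confidence >= min_conf)
```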
Assumptions
- Camera and LiDAR calibration are accurate enough that cross-modal features land in the same voxel neighborhood (a projection sanity check is sketched after this list).
- Timestamp alignment and ego-motion compensation are known; small errors can look like semantic uncertainty or thickened object boundaries.
- Voxel labels, typically aggregated from multiple LiDAR sweeps, are trustworthy enough to supervise completion in occluded space, even though those labels are themselves partly produced by heuristics.
- The class taxonomy covers the objects that matter for the target ODD.
- Voxel resolution and height range are adequate for the clearance problem; road-dataset grids can be too coarse for small industrial hazards.
- The model sees enough adverse lighting, rain, fog, spray, night, and sensor-drop examples to learn fusion behavior instead of overusing the most predictive clean-weather modality.
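The calibration assumption above can be turned into an executable check: project LiDAR points into a camera with the stored extrinsic and intrinsics and verify they land in-image with positive depth. The matrix conventions here (T_cam_lidar as a 4x4 LiDAR-to-camera transform, K as a 3x3 pinhole intrinsic) are standard assumptions, not a specific dataset's API.

```python
import numpy as np

def project_lidar_to_image(points_lidar, T_cam_lidar, K, image_hw):
    """Project (N, 3) LiDAR points into pixel coordinates.

    T_cam_lidar: (4, 4) rigid transform, LiDAR frame -> camera frame
    K:           (3, 3) pinhole intrinsics
    Returns (uv, depth, in_view mask).
    """
    N = points_lidar.shape[0]
    pts_h = np.hstack([points_lidar, np.ones((N, 1))])   # homogeneous coordinates
    pts_cam = (T_cam_lidar @ pts_h.T).T[:, :3]            # points in camera frame
    depth = pts_cam[:, 2]
    valid = depth > 0.1                                    # in front of the camera
    uv = (K @ pts_cam.T).T
    uv = uv[:, :2] / np.maximum(uv[:, 2:3], 1e-6)          # perspective divide
    H, W = image_hw
    in_view = valid & (uv[:, 0] >= 0) & (uv[:, 0] < W) & (uv[:, 1] >= 0) & (uv[:, 1] < H)
    return uv, depth, in_view
```

Comparing where projected points land against image edges on static scenes is one cheap way to monitor calibration drift over time.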
Strengths
- LiDAR geometry reduces the depth hallucination and feature bleeding common in camera-only occupancy.
- Camera semantics reduce LiDAR-only confusion between geometrically similar surfaces.
- Dense occupancy supports irregular object extents that boxes represent poorly.
- Explicit occlusion or unknown states make the output more honest for planning than forced semantic completion everywhere.
- Multi-stage fusion can handle modality disagreements better than late concatenation.
- Hierarchical or sparse refinement improves efficiency by avoiding uniform dense computation across mostly empty 3D space.
- The representation can serve as a common interface for planning, map update, and offline validation.
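A toy sketch of the hierarchical-refinement idea in the efficiency point above (and in MR-Occ's coarse-to-fine design): score coarse voxels, keep only a fraction, and subdivide those at the next resolution. The keep ratio and the use of occupancy probability as the score are assumptions, not MR-Occ's actual criteria.

```python
import numpy as np

def select_voxels_for_refinement(coarse_occ_prob, keep_ratio=0.2):
    """Pick the coarse voxels most likely to be occupied for fine-level work.

    coarse_occ_prob: (X, Y, Z) occupancy probability at coarse resolution
    Returns integer indices (K, 3) of voxels to subdivide at the next level.
    """
    flat = coarse_occ_prob.ravel()
    k = max(1, int(keep_ratio * flat.size))
    top = np.argpartition(-flat, k - 1)[:k]   # unordered top-k by probability
    return np.stack(np.unravel_index(top, coarse_occ_prob.shape), axis=1)

# Toy usage: refine ~20% of a 4x4x2 coarse grid; each selected coarse voxel
# would be split into 2x2x2 children at the fine level.
coarse = np.random.default_rng(0).random((4, 4, 2))
refine_idx = select_voxels_for_refinement(coarse)
```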
Limitations and Failure Modes
- Calibration drift produces systematic voxel shifts that can be mistaken for model error.
- LiDAR sparsity at long range still leaves camera lifting responsible for distant semantics.
- Camera glare, darkness, wet lenses, or soiling can inject confident but wrong semantics.
- LiDAR rain, snow, fog, spray, and multipath can provide wrong geometric anchors.
- Occluded-region labels are partly inferred, so high mIoU can mask overconfident predictions in space the sensors never directly observed.
- Dense voxel output can be expensive in memory and latency unless sparse or multi-resolution mechanisms are used.
- Small hazards can be lost when voxel size, downsampling, or class taxonomy is tuned for road vehicles.
Evaluation Notes
- Report IoU and mIoU, but also split results by visibility (visible vs. occluded), motion (dynamic vs. static), range (near vs. far), object size, and weather condition.
- Compare camera-only, LiDAR-only, early fusion, middle fusion, late fusion, and multi-stage fusion variants.
- Include calibration perturbation tests because cross-modal alignment is a core assumption.
- Include sensor-degradation tests: dropped cameras, reduced LiDAR beams, rain/fog simulation, lens dirt, exposure failure, and delayed timestamps.
- Separate empty/free-space precision from occupied-space recall (sketched after this list); planners care about false free space differently from false occupied space.
- For airside or industrial use, build a holdout set with target-domain objects, not only road classes.
- Validate output stability frame to frame, because flickering occupancy can produce unstable planning even when per-frame IoU is acceptable.
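A minimal sketch of the free-space-precision versus occupied-recall split, optionally restricted to directly observed voxels. The label convention (0 = free) and the meaning of the visibility mask are assumptions.

```python
import numpy as np

def freespace_and_occupancy_metrics(pred, gt, visible_mask, free_label=0):
    """Planner-oriented splits of a voxel prediction.

    pred, gt:     (X, Y, Z) integer class labels; free_label marks empty voxels
    visible_mask: (X, Y, Z) bool, True where ground truth is directly observed
    """
    pred_free, gt_free = pred == free_label, gt == free_label
    # False free space (predicted free, actually occupied) is the dangerous error.
    free_precision = (pred_free & gt_free).sum() / max(pred_free.sum(), 1)
    occ_recall = (~pred_free & ~gt_free).sum() / max((~gt_free).sum(), 1)
    # Same occupied recall, restricted to directly observed voxels.
    vis = visible_mask
    occ_recall_visible = ((~pred_free & ~gt_free) & vis).sum() / max((~gt_free & vis).sum(), 1)
    return {"free_precision": float(free_precision),
            "occupied_recall": float(occ_recall),
            "occupied_recall_visible": float(occ_recall_visible)}
```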
AV and Indoor/Outdoor Relevance
- On-road AVs: strong fit for planner-facing occupancy, especially where camera-only BEV has depth ambiguity and LiDAR-only segmentation lacks class detail.
- Airport AVs: high fit around stands, service roads, baggage makeup areas, and pushback routes if aircraft parts, GSE, cones, chocks, and personnel are labeled.
- Indoor robots: useful when RGB-D or compact LiDAR plus camera rigs need semantic occupancy for shelving, pallets, doors, glass, and humans.
- Outdoor industrial robots: useful for mines, ports, depots, and yards where irregular objects and partial occlusions are common.
- Low-speed robots can often afford denser occupancy grids than highway AVs because sensing ranges are shorter, and they typically need that finer resolution for near-field clearance.
- Any deployment should define how unknown or occluded voxels affect speed, clearance, and fallback behavior.
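One illustrative way to make the last point concrete is a small state-to-speed-cap table consumed by the planner. The states mirror the voxel visibility categories used in this note; the numeric caps are placeholders, not validated limits.

```python
# Hypothetical mapping from the worst voxel state inside the clearance
# corridor to a speed cap in m/s; values are illustrative placeholders.
SPEED_CAP_MPS = {
    "free": None,      # no cap contributed by the occupancy layer
    "occluded": 2.0,   # creep until the region becomes visible
    "unknown": 1.0,    # treat as potentially occupied
    "occupied": 0.0,   # stop before the corridor intersects occupied space
}

def speed_cap_for_corridor(worst_state: str):
    """Return the occupancy-derived speed cap (m/s), or None for no cap."""
    return SPEED_CAP_MPS[worst_state]
```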
Implementation/Validation Checklist
- Define the voxel grid, range, z limits, class taxonomy, unknown/occluded policy, and planner contract before model selection.
- Version camera-LiDAR calibration, synchronization, and ego-pose interpolation with every training run.
- Preserve raw LiDAR, image frames, projected features, and fused voxel evidence for audit.
- Train with modality dropout and corrupted-input cases rather than hoping fusion generalizes.
- Add per-voxel modality attribution or confidence so downstream modules can distinguish LiDAR-supported geometry from camera-completed space.
- Benchmark against simple baselines: LiDAR voxelization, camera-only occupancy, and late-fusion occupancy.
- Run calibration-shift and timestamp-shift sweeps before any safety argument.
- For airport use, validate against aircraft fuselage reflections, jet bridges, reflective vests, cones, chocks, belt loaders, tow bars, hoses, and wet concrete.
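A hedged sketch of the calibration-shift sweep from the checklist: perturb the nominal LiDAR-to-camera extrinsic by a small rotation and translation, re-run evaluation, and log the metric drop as a function of perturbation size. The perturbation magnitudes are examples, and evaluate_model is a hypothetical stand-in for the project's evaluation harness.

```python
import numpy as np

def perturb_extrinsic(T_cam_lidar, rot_deg=0.5, trans_m=0.02, seed=0):
    """Apply a small random rotation (deg) and translation (m) to a 4x4 extrinsic."""
    rng = np.random.default_rng(seed)
    axis = rng.normal(size=3)
    axis /= np.linalg.norm(axis)
    angle = np.deg2rad(rot_deg)
    # Rodrigues formula for a rotation about a random axis.
    skew = np.array([[0, -axis[2], axis[1]],
                     [axis[2], 0, -axis[0]],
                     [-axis[1], axis[0], 0]])
    R = np.eye(3) + np.sin(angle) * skew + (1 - np.cos(angle)) * (skew @ skew)
    dT = np.eye(4)
    dT[:3, :3] = R
    dT[:3, 3] = rng.normal(scale=trans_m, size=3)
    return dT @ T_cam_lidar

# Sweep skeleton; evaluate_model stands in for the project's own eval harness.
# for rot in [0.0, 0.25, 0.5, 1.0, 2.0]:
#     T_perturbed = perturb_extrinsic(T_nominal, rot_deg=rot, trans_m=0.0)
#     miou = evaluate_model(calibration=T_perturbed)
#     print(rot, miou)
```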
Local Cross-Links
- Camera-only occupancy baselines: SurroundOcc, SparseOcc, TPVFormer, Cam4DOcc.
- Radar and all-weather occupancy: 4D Radar-Camera Occupancy, RadarPillars, K-Radar.
- Temporal and flow extensions: StreamingFlow, TrackOcc, Spatiotemporal Memory Occupancy Flow.
- Robustness and validation: MultiCorrupt, MSC-Bench, LiDAR Weather Artifact Removal.
Sources
- SDG-OCC CVPR 2025 paper: https://openaccess.thecvf.com/content/CVPR2025/papers/Duan_SDGOCC_Semantic_and_Depth-Guided_Birds-Eye_View_Transformation_for_3D_Multimodal_CVPR_2025_paper.pdf
- SDG-OCC arXiv record: https://arxiv.org/abs/2507.17083
- MS-Occ arXiv paper: https://arxiv.org/abs/2504.15888
- MR-Occ arXiv paper: https://arxiv.org/abs/2412.20480
- OpenOccupancy benchmark paper (nuScenes-Occupancy): https://arxiv.org/abs/2303.03991
- SemanticKITTI dataset: https://semantic-kitti.org/