SurroundOcc

What It Is

  • SurroundOcc is a multi-camera 3D semantic occupancy prediction method for autonomous driving.
  • It predicts dense voxel occupancy from surround-view images instead of only boxes or sparse points.
  • The method was published at ICCV 2023 and released with code and generated dense occupancy labels.
  • Its main contribution is both an architecture and a label-generation pipeline for dense occupancy supervision.
  • It is a camera occupancy method, not a detection-only BEV method.

Core Technical Idea

  • Lift multi-scale image features directly into 3D volume features using spatial 2D-3D attention.
  • Preserve 3D voxel structure instead of collapsing everything to BEV.
  • Apply 3D convolutions to progressively upsample volume features.
  • Supervise multiple volume levels with decayed weighted losses.
  • Generate dense occupancy ground truth by separately fusing multi-frame LiDAR scans of dynamic objects and static scenes.
  • Use dense labels to avoid the sparse-output limitation of methods trained only on LiDAR point labels.
  • The method makes occupancy a direct volumetric prediction problem for surround cameras.
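The 2D-3D attention above needs image-plane reference points for every voxel query. A minimal numpy sketch of that projection step, assuming a pinhole camera model and a single 4x4 ego-to-camera extrinsic; the function and argument names are illustrative, not the official implementation:

```python
import numpy as np

def project_voxels_to_image(voxel_centers, cam_from_ego, intrinsics, img_hw):
    """Project 3D voxel centers (ego frame) to pixel coordinates in one camera.

    voxel_centers: (N, 3) xyz positions in the ego frame
    cam_from_ego:  (4, 4) extrinsic matrix mapping ego -> camera frame
    intrinsics:    (3, 3) pinhole intrinsic matrix
    img_hw:        (H, W) image size, used to mask out-of-view points
    Returns (N, 2) pixel coordinates and an (N,) validity mask.
    """
    # Homogeneous ego coordinates -> camera frame.
    homo = np.concatenate([voxel_centers, np.ones((len(voxel_centers), 1))], axis=1)
    cam = (cam_from_ego @ homo.T).T[:, :3]
    # Keep only points in front of the camera.
    in_front = cam[:, 2] > 1e-3
    # Perspective projection with the intrinsic matrix.
    pix = (intrinsics @ cam.T).T
    pix = pix[:, :2] / np.clip(pix[:, 2:3], 1e-3, None)
    h, w = img_hw
    in_view = (pix[:, 0] >= 0) & (pix[:, 0] < w) & (pix[:, 1] >= 0) & (pix[:, 1] < h)
    return pix, in_front & in_view
```

Each voxel query would then attend only at its valid reference points, aggregated across however many cameras see that voxel.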

Inputs and Outputs

  • Inputs: synchronized multi-camera images plus camera intrinsics and extrinsics.
  • Training inputs: sparse LiDAR scans, semantic labels, poses, and generated dense occupancy labels.
  • Output: dense 3D voxel grid with empty/non-empty state and semantic class probabilities.
  • Typical nuScenes-derived grids cover a fixed ego-centric range with discrete height bins.
  • Intermediate output: multi-scale 3D volume features after 2D-3D attention.
  • It does not primarily output object instances, tracks, or 3D boxes.
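To make the output concrete: published nuScenes occupancy setups typically cover roughly +/-50 m in x/y and -5 to 3 m in z at 0.5 m voxels; treat those numbers as illustrative rather than the exact SurroundOcc configuration. A small sketch of the resulting grid and the point-to-voxel indexing:

```python
import numpy as np

# Illustrative ego-centric grid, roughly matching common nuScenes setups:
# x, y in [-50, 50) m, z in [-5, 3) m, 0.5 m voxels -> a 200 x 200 x 16 grid.
GRID_MIN = np.array([-50.0, -50.0, -5.0])
GRID_MAX = np.array([50.0, 50.0, 3.0])
VOXEL_SIZE = 0.5
GRID_SHAPE = tuple(((GRID_MAX - GRID_MIN) / VOXEL_SIZE).astype(int))

def point_to_voxel(xyz):
    """Map an ego-frame point to integer voxel indices, or None if outside the grid."""
    idx = np.floor((np.asarray(xyz, dtype=float) - GRID_MIN) / VOXEL_SIZE).astype(int)
    if np.any(idx < 0) or np.any(idx >= np.array(GRID_SHAPE)):
        return None
    return tuple(idx)
```

The network's final head predicts one class distribution per cell of this grid, with one class reserved for empty space.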

Architecture

  • 2D backbone extracts multi-scale feature maps from every camera.
  • Spatial 2D-3D attention maps image features to voxel queries in the 3D volume.
  • Low-resolution volume features are progressively upsampled with 3D convolutions.
  • Skip or multi-scale fusion combines high-resolution and low-resolution volume representations.
  • Occupancy heads at multiple levels produce auxiliary predictions for deep supervision.
  • The label-generation pipeline accumulates static-scene LiDAR separately from dynamic-object LiDAR to reduce motion smearing.
  • Official implementation is an MMDetection3D-style codebase with custom occupancy data preparation.
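The coarse-to-fine volume decoding can be illustrated with a toy numpy sketch. The real model uses learned 3D (de)convolutions; here upsampling is nearest-neighbor and there is no learned refinement, purely to show the multi-scale flow, and all names are illustrative:

```python
import numpy as np

def upsample3d(vol, factor=2):
    """Nearest-neighbor upsampling of a (X, Y, Z, C) feature volume.
    Stands in for a learned 3D deconvolution in the real decoder."""
    for axis in range(3):
        vol = np.repeat(vol, factor, axis=axis)
    return vol

def decode_volumes(coarse, num_levels=3):
    """Produce a pyramid of progressively finer volumes from a coarse one.
    In the real network each level feeds its own auxiliary occupancy head,
    giving the deep supervision described above."""
    levels = [coarse]
    for _ in range(num_levels - 1):
        levels.append(upsample3d(levels[-1]))
    return levels
```

Skip or multi-scale fusion would add the attention-lifted features back in at each resolution before that level's occupancy head.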

Training and Evaluation

  • Benchmarks: nuScenes-derived occupancy and SemanticKITTI semantic scene completion.
  • Metrics include scene-completion IoU for geometry and semantic-scene-completion mIoU for semantic occupancy.
  • The paper reports state-of-the-art results among vision-based methods, both on its generated dense nuScenes occupancy labels and on SemanticKITTI.
  • The paper emphasizes that dense generated labels improve results over sparse LiDAR-point supervision.
  • Training cost is higher than BEV-only methods because dense 3D volume features and 3D convolutions are used.
  • Evaluation reports should state which occupancy label pipeline was used, because different nuScenes occupancy label sets are not identical.
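The metrics above are computed over voxel grids; a common formulation (a sketch, with `empty_class` and `ignore_index` conventions assumed rather than taken from the official code) is:

```python
import numpy as np

def occupancy_metrics(pred, gt, num_classes, ignore_index=255, empty_class=0):
    """Scene-completion IoU (geometry) and semantic mIoU over a voxel grid.

    pred, gt: integer class grids of the same shape; `empty_class` marks
    free space, `ignore_index` marks unlabeled voxels excluded from scoring.
    """
    valid = gt != ignore_index
    p, g = pred[valid], gt[valid]
    # Geometry: binary occupied-vs-empty IoU.
    p_occ, g_occ = p != empty_class, g != empty_class
    union = np.logical_or(p_occ, g_occ).sum()
    sc_iou = np.logical_and(p_occ, g_occ).sum() / union if union else 0.0
    # Semantics: per-class IoU over non-empty classes, then the mean.
    ious = []
    for c in range(num_classes):
        if c == empty_class:
            continue
        union_c = np.logical_or(p == c, g == c).sum()
        if union_c:
            ious.append(np.logical_and(p == c, g == c).sum() / union_c)
    miou = float(np.mean(ious)) if ious else 0.0
    return float(sc_iou), miou
```

Scene-completion IoU rewards getting occupied-vs-free geometry right even when the class is wrong; mIoU additionally demands the correct semantic label.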

Strengths

  • Produces dense volumetric output, including occluded space, instead of only visible depth.
  • More suitable than boxes for arbitrary shapes and long-tail objects.
  • Dense supervision gives a stronger training signal than sparse point labels.
  • Multi-scale volume decoding can recover finer structure than a single low-resolution voxel grid.
  • The generated-label pipeline is valuable for bootstrapping occupancy datasets.
  • Direct 3D output is easier to connect to collision checking than image-space depth.

Failure Modes

  • Dense 3D volumes are memory- and compute-heavy.
  • Generated labels inherit LiDAR sparsity, pose error, semantic labeling mistakes, and dynamic-object fusion artifacts.
  • 3D convolution can hallucinate plausible occupancy in occluded areas without calibrated uncertainty.
  • Camera-only input remains vulnerable to lighting, weather, glare, and camera obstruction.
  • Fixed voxel grids can miss very thin structures or smear vertical details.
  • Domain shift in the label-generation pipeline can be severe outside road-driving scenes.

Airside AV Fit

  • Strong candidate for apron occupancy because dense shape matters more than bounding boxes near aircraft and equipment.
  • Useful for aircraft wings, loader booms, tow bars, cones, hoses, dollies, and partially occluded service vehicles.
  • Airside label generation can use repeated LiDAR passes around stands, but must handle parked aircraft that move between sessions.
  • Needs explicit validation for large overhangs, shiny aircraft surfaces, jet bridges, night floodlights, and wet pavement.
  • Camera-only occupancy should be fused with LiDAR/radar or conservative map priors before safety-critical use.
  • The generated labels could seed an airside occupancy dataset if QA includes manual checks around aircraft clearance zones.

Implementation Notes

  • Define the voxel grid around operational clearance requirements, not only nuScenes ranges.
  • Separate static apron structure from dynamic GSE and aircraft when fusing labels.
  • Track label provenance per voxel so safety evaluation can distinguish observed, fused, and completed occupancy.
  • Use class-balanced training for rare but safety-critical airside objects.
  • Profile 3D convolution memory before scaling range or height resolution.
  • Add uncertainty or visibility masks; dense occupancy without confidence is risky for planners.
  • Keep dataset naming explicit because "SurroundOcc-nuScenes", "Occ3D-nuScenes", and custom labels differ.

Sources

Research notes compiled from public sources.