AutoOcc
What It Is
- AutoOcc is an ICCV 2025 automatic open-ended semantic occupancy annotation pipeline.
- The full title is "AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting."
- It generates dense semantic occupancy labels from raw multi-view images, optionally with LiDAR constraints.
- It is an annotation and data-engine method, not a runtime safety perception model by itself.
- It supports open-ended semantic categories rather than a fixed closed-set taxonomy.
- The central representation is Vision-Language Guided Gaussian Splatting (VL-GS).
Core Technical Idea
- Ask a vision-language model to discover or refine a dynamic semantic query list for the scene.
- Extract semantic attention maps from VLMs and visual foundation models.
- Represent semantic scene structure with semantic-aware scalable Gaussians instead of dense voxels or raw point clouds.
- Use self-estimated 3D flow to handle dynamic objects and reduce temporal trailing artifacts.
- Optionally use LiDAR geometry as an additional constraint.
- Convert the Gaussian scene representation into semantic occupancy with cumulative Gaussian-to-voxel splatting.
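The final step above can be sketched as follows. This is a deliberately simplified, hypothetical version of cumulative Gaussian-to-voxel splatting: it assumes isotropic Gaussians and an arbitrary occupancy threshold, whereas the paper's VL-GS primitives carry full covariances and learned opacities. All function and parameter names here are illustrative, not the paper's API.

```python
import numpy as np

def splat_gaussians_to_voxels(means, scales, labels, n_classes,
                              grid_shape, voxel_size, origin):
    """Accumulate isotropic Gaussian weights into a semantic voxel grid,
    then take the per-voxel argmax class. Simplified sketch only."""
    acc = np.zeros((*grid_shape, n_classes))
    # Compute the world-space center of every voxel in the grid.
    idx = np.indices(grid_shape).reshape(3, -1).T
    centers = origin + (idx + 0.5) * voxel_size
    for mu, s, c in zip(means, scales, labels):
        d2 = np.sum((centers - mu) ** 2, axis=1)
        w = np.exp(-0.5 * d2 / s ** 2)        # isotropic Gaussian weight
        acc[..., c] += w.reshape(grid_shape)  # cumulative splatting
    occ = acc.sum(-1) > 0.5                   # occupancy threshold (assumed)
    sem = acc.argmax(-1)
    return np.where(occ, sem, -1)             # -1 marks free space
```

A Gaussian centered on a voxel contributes full weight there and decaying weight to neighbors, so nearby voxels inherit its semantic label while distant voxels stay free.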
Inputs and Outputs
- Input: multi-view camera images.
- Optional input: LiDAR point clouds for geometric constraints.
- Input metadata: camera calibration, ego pose, timestamps, and scene sequence.
- Foundation inputs: VLM semantic attention, SAM-style masks, UniDepth-style depth, and self-estimated flow.
- Output: dense semantic occupancy annotations.
- Output: open-ended category list discovered or refined by the VLM pipeline.
- Output: VL-GS scene representation that can be splatted to different voxel grids.
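The inputs listed above can be grouped into a per-frame record. The container below is a hypothetical schema, not a structure from the paper's code; field names and types are assumptions chosen to match the bullet list.

```python
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class AnnotationFrame:
    """Hypothetical per-frame input record for the annotation pipeline."""
    images: dict       # camera name -> HxWx3 image array
    intrinsics: dict   # camera name -> 3x3 K matrix
    extrinsics: dict   # camera name -> 4x4 camera-to-ego transform
    ego_pose: list     # 4x4 ego-to-world transform
    timestamp: float
    lidar_points: Optional[object] = None  # Nx3 points; AutoOcc-M only
    semantic_queries: list = field(default_factory=list)  # VLM-discovered
```

Keeping `lidar_points` optional mirrors the camera-only (AutoOcc-V) versus LiDAR-constrained (AutoOcc-M) split.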
Architecture or Pipeline
- Start with raw images and prompt a VLM to list scene objects or categories.
- Build semantic attention maps and masks from VLM/VFM outputs.
- Estimate depth and construct semantic-aware Gaussian primitives.
- Adjust Gaussian scale based on semantic attention and object geometry so large background areas and small objects are represented efficiently.
- Estimate 3D flow for dynamic objects and attach dynamic attributes to moving Gaussians.
- Add LiDAR geometric constraints when available.
- Forward-splat accumulated Gaussians into a semantic occupancy voxel grid.
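The stage ordering above can be sketched as a toy orchestrator. The stage names and the skip rule are assumptions inferred from the bullet list; real stages would transform scene state rather than log their names.

```python
def make_stage(name, needs_lidar=False):
    """Return a stub stage that records its name in the scene state."""
    def fn(state):
        state.setdefault("log", []).append(name)
        return state
    return (name, fn, needs_lidar)

STAGES = [
    make_stage("vlm_category_discovery"),
    make_stage("attention_and_masks"),
    make_stage("depth_and_gaussian_init"),
    make_stage("semantic_scale_adjustment"),
    make_stage("dynamic_flow_estimation"),
    make_stage("lidar_constraints", needs_lidar=True),
    make_stage("gaussian_to_voxel_splatting"),
]

def run_pipeline(stages, use_lidar=False):
    state = {}
    for name, fn, needs_lidar in stages:
        if needs_lidar and not use_lidar:
            continue  # camera-only mode skips LiDAR constraints
        state = fn(state)
    return state
```

Running with `use_lidar=False` models AutoOcc-V; with `use_lidar=True`, the LiDAR-constraint stage runs between flow estimation and splatting, as in AutoOcc-M.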
Training and Evaluation
- AutoOcc is evaluated on Occ3D-nuScenes and SemanticKITTI.
- On Occ3D-nuScenes, the paper reports AutoOcc-V with 83.01 IoU and 20.92 mIoU using cameras only.
- AutoOcc-M, using camera plus LiDAR, reports 88.62 IoU and 25.84 mIoU.
- In cross-dataset zero-shot evaluation on SemanticKITTI, AutoOcc-M reports 41.23 IoU and 12.76 mIoU.
- The paper reports about 30 GPU hours and 5.0 GB memory for AutoOcc annotation, compared with heavier manual or semi-automatic pipelines.
- Evaluation should include label correctness, open-ended category quality, dynamic-object consistency, and human QA acceptance rate.
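The IoU and mIoU figures above follow the standard occupancy definitions: geometric IoU over occupied-versus-free voxels, and mIoU averaged over semantic classes present in the union. A minimal sketch, assuming integer label volumes with -1 for free space:

```python
import numpy as np

def occupancy_metrics(pred, gt, n_classes, free_label=-1):
    """Geometric IoU and class-mean IoU between two semantic occupancy
    volumes. Standard definitions; the voxel layout is an assumption."""
    ious = []
    for c in range(n_classes):
        inter = np.sum((pred == c) & (gt == c))
        union = np.sum((pred == c) | (gt == c))
        if union > 0:
            ious.append(inter / union)
    geom_inter = np.sum((pred != free_label) & (gt != free_label))
    geom_union = np.sum((pred != free_label) | (gt != free_label))
    geom_iou = float(geom_inter) / geom_union if geom_union else 0.0
    miou = float(np.mean(ious)) if ious else 0.0
    return geom_iou, miou
```

Classes absent from both prediction and ground truth are excluded from the mean, which matters when open-ended discovery yields long-tail categories.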
Strengths
- Reduces manual dense occupancy annotation cost.
- Supports category discovery instead of requiring a fixed class list before labeling.
- Gaussian primitives are a more memory-efficient annotation intermediate than dense voxel grids or raw point clouds.
- Dynamic flow handling is important for moving vehicles and pedestrians.
- Can run in camera-only mode or use LiDAR constraints when available.
- Useful for bootstrapping occupancy datasets in domains with scarce 3D labels.
Failure Modes
- VLM category discovery can hallucinate, merge, or split classes inconsistently.
- Open-ended labels need taxonomy normalization before training production models.
- Gaussian reconstruction can be geometrically plausible but wrong in safety-critical clearance zones.
- Foundation-model masks and depth estimates can fail under glare, night lighting, rain, spray, and reflective surfaces.
- Dynamic-object flow errors can smear labels through time.
- Auto-label quality must be audited; it should not be treated as human-reviewed ground truth.
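One simple way to operationalize the audit requirement is stratified sampling per semantic class, so rare classes are not drowned out by road and background voxels. This is a generic QA sketch, not part of AutoOcc; the record format and `per_class` budget are assumptions.

```python
import random

def audit_sample(labels, per_class=5, seed=0):
    """Pick up to `per_class` label instances per semantic class for
    human review. Deterministic given the seed."""
    rng = random.Random(seed)
    by_class = {}
    for item in labels:
        by_class.setdefault(item["class"], []).append(item)
    picks = []
    for cls, items in sorted(by_class.items()):
        rng.shuffle(items)
        picks.extend(items[:per_class])
    return picks
```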
Airside AV Fit
- Strong data-engine fit because airside semantic occupancy labels are expensive and public datasets are sparse.
- Open-ended category discovery can reveal site-specific ground support equipment (GSE), temporary maintenance equipment, cones, chocks, hoses, covers, foreign object debris (FOD)-like objects, and aircraft parts.
- Camera-only annotation is attractive for historical video logs, while LiDAR-constrained mode should be preferred for clearance-sensitive labels.
- Human review should focus on rare safety-critical classes and adverse conditions.
- Use AutoOcc to create training labels, not as the onboard emergency-stop source.
- Airside deployment needs a controlled taxonomy layer that maps VLM phrases into operational classes.
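The controlled taxonomy layer mentioned above can be as simple as a lookup table with an explicit fallback. Every class name and phrase mapping below is illustrative; the point is that unmapped VLM phrases route to review rather than silently becoming new classes.

```python
# Hypothetical controlled taxonomy for airside operations.
OPERATIONAL_CLASSES = {"aircraft", "gse_vehicle", "person", "fod",
                       "static_infrastructure", "unknown"}

# Map open-ended VLM phrases onto operational classes (example entries).
PHRASE_MAP = {
    "baggage tractor": "gse_vehicle",
    "belt loader": "gse_vehicle",
    "pushback tug": "gse_vehicle",
    "traffic cone": "static_infrastructure",
    "wheel chock": "static_infrastructure",
    "metal debris": "fod",
    "ground crew": "person",
}

def normalize_label(vlm_phrase: str) -> str:
    """Case-insensitive lookup with an 'unknown' fallback, so unmapped
    phrases are flagged for human triage instead of dropped."""
    return PHRASE_MAP.get(vlm_phrase.strip().lower(), "unknown")
```

In practice this table would be versioned alongside the labels and extended as the VLM surfaces new site-specific phrases.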
Implementation Notes
- Store generated category names, prompts, model versions, and post-normalized taxonomy IDs.
- Keep AutoOcc-V and AutoOcc-M labels separate because their geometry evidence differs.
- Require review for labels near aircraft envelopes, stand boundaries, personnel, FOD, and planned vehicle paths.
- Add negative examples for glare, wet pavement, jet bridges, aircraft reflections, de-icing mist, and night floodlights.
- Measure downstream model performance against manually audited validation sets, not only auto-label agreement.
- Version voxel size, Gaussian parameters, depth model, VLM, SAM model, and LiDAR constraints with every label release.
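The versioning requirement above can be enforced with a release manifest whose identifier is derived from the configuration itself, so two releases with different settings can never share an ID. Field names are assumptions.

```python
import hashlib
import json

def make_release_manifest(voxel_size, gaussian_params, model_versions,
                          lidar_constrained):
    """Hypothetical manifest capturing every knob that affects label
    geometry; the release ID is a hash of the canonical config."""
    manifest = {
        "voxel_size_m": voxel_size,
        "gaussian_params": gaussian_params,
        "models": model_versions,  # e.g. {"vlm": ..., "sam": ..., "depth": ...}
        "lidar_constrained": lidar_constrained,
    }
    # Canonical serialization (sorted keys) makes the hash deterministic.
    blob = json.dumps(manifest, sort_keys=True).encode()
    manifest["release_id"] = hashlib.sha256(blob).hexdigest()[:12]
    return manifest
```

Identical configurations hash to the same `release_id`; changing any field, e.g. the voxel size, yields a different one.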
Sources
- ICCV 2025 open-access paper: https://openaccess.thecvf.com/content/ICCV2025/papers/Zhou_AutoOcc_Automatic_Open-Ended_Semantic_Occupancy_Annotation_via_Vision-Language_Guided_Gaussian_ICCV_2025_paper.pdf
- ICCV 2025 paper page: https://openaccess.thecvf.com/content/ICCV2025/html/Zhou_AutoOcc_Automatic_Open-Ended_Semantic_Occupancy_Annotation_via_Vision-Language_Guided_Gaussian_ICCV_2025_paper.html
- GroundingOcc follow-on: https://arxiv.org/abs/2508.01197
- Gaussian occupancy context: GaussianOcc