AutoOcc

What It Is

  • AutoOcc is an ICCV 2025 automatic open-ended semantic occupancy annotation pipeline.
  • The full title is "AutoOcc: Automatic Open-Ended Semantic Occupancy Annotation via Vision-Language Guided Gaussian Splatting."
  • It generates dense semantic occupancy labels from raw multi-view images, optionally with LiDAR constraints.
  • It is an annotation and data-engine method, not a runtime safety perception model by itself.
  • It supports open-ended semantic categories rather than a fixed closed-set taxonomy.
  • The central representation is Vision-Language Guided Gaussian Splatting (VL-GS).

Core Technical Idea

  • Ask a vision-language model to discover or refine a dynamic semantic query list for the scene.
  • Extract semantic attention maps from VLMs and visual foundation models.
  • Represent semantic scene structure with semantic-aware scalable Gaussians instead of dense voxels or raw point clouds.
  • Use self-estimated 3D flow to handle dynamic objects and reduce temporal trailing artifacts.
  • Optionally use LiDAR geometry as an additional constraint.
  • Convert the Gaussian scene representation into semantic occupancy with cumulative Gaussian-to-voxel splatting.
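The cumulative Gaussian-to-voxel splatting step can be sketched as follows. This is a nearest-voxel simplification under assumed array shapes, not the paper's actual forward-splatting kernel, which accumulates full Gaussian footprints rather than centers only:

```python
import numpy as np

def splat_gaussians_to_voxels(centers, opacities, class_logits,
                              grid_shape, voxel_size, origin, min_weight=0.1):
    """Accumulate per-class Gaussian evidence into a voxel grid.

    Simplifying assumption: each Gaussian contributes only to the voxel
    containing its center, weighted by opacity times class probability.
    Returns an integer label grid; -1 marks free space.
    """
    num_classes = class_logits.shape[1]
    grid = np.zeros(tuple(grid_shape) + (num_classes,), dtype=np.float32)
    idx = np.floor((centers - origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    for (i, j, k), alpha, logit in zip(idx[inside], opacities[inside],
                                       class_logits[inside]):
        probs = np.exp(logit - logit.max())
        probs /= probs.sum()
        grid[i, j, k] += alpha * probs      # cumulative semantic evidence
    total = grid.sum(axis=-1)
    return np.where(total > min_weight, grid.argmax(axis=-1), -1)
```

A real implementation would also splat each Gaussian's covariance footprint across neighboring voxels; the cumulative-evidence-then-argmax structure is the part this sketch illustrates.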

Inputs and Outputs

  • Input: multi-view camera images.
  • Optional input: LiDAR point clouds for geometric constraints.
  • Input metadata: camera calibration, ego pose, timestamps, and scene sequence.
  • Foundation inputs: VLM semantic attention, SAM-style masks, UniDepth-style depth, and self-estimated flow.
  • Output: dense semantic occupancy annotations.
  • Output: open-ended category list discovered or refined by the VLM pipeline.
  • Output: VL-GS scene representation that can be splatted to different voxel grids.
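A minimal container for the per-frame inputs listed above might look like the following; the field names and shapes are illustrative assumptions, not AutoOcc's actual data format:

```python
from dataclasses import dataclass
from typing import List, Optional
import numpy as np

@dataclass
class FrameBundle:
    """Hypothetical per-timestep input bundle for an AutoOcc-style pipeline."""
    images: List[np.ndarray]      # per-camera HxWx3 arrays
    intrinsics: List[np.ndarray]  # per-camera 3x3 K matrices
    extrinsics: List[np.ndarray]  # per-camera 4x4 camera-to-ego transforms
    ego_pose: np.ndarray          # 4x4 ego-to-world transform
    timestamp: float
    lidar_points: Optional[np.ndarray] = None  # Nx3, optional constraint
```

Keeping LiDAR optional in the schema mirrors the camera-only (AutoOcc-V) versus camera-plus-LiDAR (AutoOcc-M) modes.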

Architecture or Pipeline

  • Start with raw images and prompt a VLM to list scene objects or categories.
  • Build semantic attention maps and masks from VLM/VFM outputs.
  • Estimate depth and construct semantic-aware Gaussian primitives.
  • Adjust Gaussian scale based on semantic attention and object geometry so large background areas and small objects are represented efficiently.
  • Estimate 3D flow for dynamic objects and attach dynamic attributes to moving Gaussians.
  • Add LiDAR geometric constraints when available.
  • Forward-splat accumulated Gaussians into a semantic occupancy voxel grid.
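The semantic-aware scale adjustment step above can be illustrated with a simple heuristic: shrink Gaussians on small high-attention objects and grow them on low-attention background. The thresholds and the rule itself are assumptions for illustration, not the paper's actual adaptation:

```python
import numpy as np

def adapt_gaussian_scales(base_scales, attention, small_object_mask,
                          shrink=0.5, grow=2.0, bg_attention=0.2):
    """Illustrative scale adaptation: small objects get finer Gaussians,
    low-attention background gets coarser ones, so large flat regions
    and small objects are both represented efficiently."""
    scales = base_scales.copy()
    scales[small_object_mask] *= shrink
    background = (~small_object_mask) & (attention < bg_attention)
    scales[background] *= grow
    return scales
```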

Training and Evaluation

  • AutoOcc is evaluated on Occ3D-nuScenes and SemanticKITTI.
  • On Occ3D-nuScenes, the paper reports AutoOcc-V with 83.01 IoU and 20.92 mIoU using cameras only.
  • AutoOcc-M, using camera plus LiDAR, reports 88.62 IoU and 25.84 mIoU.
  • In cross-dataset zero-shot evaluation on SemanticKITTI, AutoOcc-M reports 41.23 IoU and 12.76 mIoU.
  • The paper reports about 30 GPU hours and 5.0 GB memory for AutoOcc annotation, compared with heavier manual or semi-automatic pipelines.
  • Evaluation should include label correctness, open-ended category quality, dynamic-object consistency, and human QA acceptance rate.
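The IoU and mIoU figures quoted above follow the standard occupancy conventions: a binary occupied/free IoU plus a mean over per-class IoUs. A sketch of that metric, assuming -1 marks free space:

```python
import numpy as np

def occupancy_iou(pred, gt, num_classes, free_label=-1):
    """Return (geometric IoU over occupancy, mIoU over present classes).

    Classes absent from both prediction and ground truth are skipped,
    a common convention that an evaluation harness should pin down.
    """
    occ_pred, occ_gt = pred != free_label, gt != free_label
    geo_iou = (occ_pred & occ_gt).sum() / max((occ_pred | occ_gt).sum(), 1)
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (gt == c)).sum()
        union = ((pred == c) | (gt == c)).sum()
        if union:
            ious.append(inter / union)
    return float(geo_iou), float(np.mean(ious)) if ious else 0.0
```

Running this against a manually audited validation set, rather than against the auto-labels themselves, is what makes the numbers meaningful.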

Strengths

  • Reduces manual dense occupancy annotation cost.
  • Supports category discovery instead of requiring a fixed class list before labeling.
  • Gaussian primitives are a more memory-efficient annotation intermediate than dense voxel grids.
  • Dynamic flow handling is important for moving vehicles and pedestrians.
  • Can run in camera-only mode or use LiDAR constraints when available.
  • Useful for bootstrapping occupancy datasets in domains with scarce 3D labels.

Failure Modes

  • VLM category discovery can hallucinate, merge, or split classes inconsistently.
  • Open-ended labels need taxonomy normalization before training production models.
  • Gaussian reconstruction can be geometrically plausible but wrong in safety-critical clearance zones.
  • Foundation-model masks and depth estimates can fail under glare, night lighting, rain, spray, and reflective surfaces.
  • Dynamic-object flow errors can smear labels through time.
  • Auto-label quality must be audited; it should not be treated as human-reviewed ground truth.

Airside AV Fit

  • Strong data-engine fit because airside semantic occupancy labels are expensive and public datasets are sparse.
  • Open-ended category discovery can reveal site-specific GSE, temporary maintenance equipment, cones, chocks, hoses, covers, FOD-like objects, and aircraft parts.
  • Camera-only annotation is attractive for historical video logs, while LiDAR-constrained mode should be preferred for clearance-sensitive labels.
  • Human review should focus on rare safety-critical classes and adverse conditions.
  • Use AutoOcc to create training labels, not as the onboard emergency-stop source.
  • Airside deployment needs a controlled taxonomy layer that maps VLM phrases into operational classes.
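The controlled taxonomy layer above could start as a simple normalization map from free-form VLM phrases to operational classes, with unmapped phrases routed to human review. The class names and phrase list here are hypothetical:

```python
# Hypothetical airside taxonomy: maps free-form VLM output phrases
# to a fixed set of operational class IDs.
AIRSIDE_TAXONOMY = {
    "baggage cart": "GSE",
    "belt loader": "GSE",
    "traffic cone": "CONE",
    "wheel chock": "CHOCK",
    "fuel hose": "HOSE",
}

def normalize_label(vlm_phrase, taxonomy=AIRSIDE_TAXONOMY,
                    fallback="UNKNOWN_REVIEW"):
    """Map a VLM phrase to an operational class; unmapped phrases go to
    a review queue instead of being silently dropped or guessed."""
    return taxonomy.get(vlm_phrase.strip().lower(), fallback)
```

Routing unknown phrases to review preserves the open-ended discovery benefit while keeping the trained model's class list closed and auditable.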

Implementation Notes

  • Store generated category names, prompts, model versions, and post-normalized taxonomy IDs.
  • Keep AutoOcc-V and AutoOcc-M labels separate because their geometry evidence differs.
  • Require review for labels near aircraft envelopes, stand boundaries, personnel, FOD, and planned vehicle paths.
  • Add negative examples for glare, wet pavement, jet bridges, aircraft reflections, de-icing mist, and night floodlights.
  • Measure downstream model performance against manually audited validation sets, not only auto-label agreement.
  • Version voxel size, Gaussian parameters, depth model, VLM, SAM model, and LiDAR constraints with every label release.
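The versioning requirement above can be enforced with an immutable metadata record whose hash becomes the label-release ID; the fields and naming are illustrative, not a prescribed schema:

```python
from dataclasses import dataclass, asdict
import hashlib
import json

@dataclass(frozen=True)
class LabelRelease:
    """Illustrative provenance record attached to every label release."""
    voxel_size_m: float
    depth_model: str
    vlm: str
    sam_model: str
    lidar_constrained: bool
    gaussian_config: str  # e.g. a hash of the Gaussian parameter file

    def release_id(self) -> str:
        """Deterministic ID: any config change produces a new release ID."""
        blob = json.dumps(asdict(self), sort_keys=True).encode()
        return hashlib.sha256(blob).hexdigest()[:12]
```

Keeping the record frozen and hashing it means AutoOcc-V and AutoOcc-M labels, or any re-run with a different depth model, can never silently share an ID.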

Sources

Research notes compiled from publicly available sources.