SparseOcc

What It Is

  • SparseOcc is a CVPR 2024 vision-based semantic occupancy network built around sparse 3D latent features.
  • It offers an alternative to dense voxel, BEV, and TPV representations for occupancy prediction.
  • The method keeps the 3D latent representation sparse in COO-style coordinates and processes only active voxels.
  • It is inspired by sparse point-cloud processing but uses camera images as input.
  • It is an occupancy method, not the separate ECCV 2024 MCG-NJU project with the same SparseOcc name.

Core Technical Idea

  • Start from camera features lifted to 3D with a Lift-Splat-Shoot style view transform.
  • Convert the resulting mostly empty 3D tensor into a sparse representation by gathering non-empty voxels.
  • Perform latent completion with sparse 3D operations rather than dense 3D convolutions.
  • Use a sparse latent diffuser to propagate information from observed non-empty regions to nearby empty regions.
  • Build a sparse feature pyramid with interpolation across scales for larger receptive fields.
  • Redesign the transformer head as a sparse head that segments occupied voxels instead of every voxel.
  • Preserve 3D geometry better than BEV/TPV while avoiding dense cubic cost.
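The gather step above can be sketched in a few lines: a mostly empty dense feature volume is reduced to COO coordinates plus per-voxel features. This is an illustrative sketch with hypothetical names (`dense_to_sparse_coo`, the activation threshold), not the paper's released API.

```python
import numpy as np

def dense_to_sparse_coo(dense, threshold=0.0):
    """Gather non-empty voxels from a dense (X, Y, Z, C) feature volume
    into COO coordinates plus per-voxel features."""
    activation = np.abs(dense).sum(axis=-1)          # per-voxel activation
    coords = np.argwhere(activation > threshold)     # (N, 3) integer coords
    feats = dense[coords[:, 0], coords[:, 1], coords[:, 2]]  # (N, C)
    return coords, feats

# Toy volume: two active voxels in a 4^3 grid with 8 feature channels.
vol = np.zeros((4, 4, 4, 8), dtype=np.float32)
vol[1, 2, 3] = 1.0
vol[0, 0, 1] = 0.5
coords, feats = dense_to_sparse_coo(vol)
print(coords.shape, feats.shape)  # (2, 3) (2, 8)
```

All downstream sparse operations then index only these `N` active voxels instead of the full cubic grid, which is where the efficiency gain comes from.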

Inputs and Outputs

  • Inputs: monocular or surround camera images, calibration, and view-transform geometry.
  • Training inputs: semantic occupancy labels on nuScenes-Occupancy or SemanticKITTI, plus depth supervision for the LSS component.
  • Output: dense semantic occupancy after scattering sparse predictions back to the voxel grid.
  • Intermediate output: sparse tensor coordinates and features for active voxels.
  • Intermediate output: coarse binary non-empty predictions used to filter voxels for the sparse transformer head.
  • It does not output object boxes or instance tracks by default.
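The final dense output in the list above is produced by a scatter: sparse per-voxel class predictions are written back into a full grid, and untouched voxels default to the empty class. A minimal sketch, assuming integer class labels and a hypothetical `scatter_to_dense` helper:

```python
import numpy as np

def scatter_to_dense(coords, labels, grid_shape, empty_class=0):
    """Scatter per-voxel class predictions at sparse COO coordinates back
    into a dense semantic grid; voxels with no prediction stay empty."""
    dense = np.full(grid_shape, empty_class, dtype=labels.dtype)
    dense[coords[:, 0], coords[:, 1], coords[:, 2]] = labels
    return dense

# Two predicted voxels in a 2x2x2 grid; the other six remain class 0.
coords = np.array([[0, 0, 0], [1, 1, 1]])
labels = np.array([3, 5])
grid = scatter_to_dense(coords, labels, (2, 2, 2))
```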

Architecture

  • 2D encoder: ResNet/FPN-style image feature extractor.
  • View transform: LSS lifts image features into 3D using predicted depth.
  • Sparse conversion: dense lifted tensor is converted to sparse COO features.
  • Sparse latent diffuser: sparse completion block plus contextual aggregation block.
  • Kernel decomposition: 3D kernels are decomposed into orthogonal kernels to improve efficiency and shape modeling.
  • Sparse feature pyramid: downsampled sparse scales are fused through sparse interpolation.
  • Sparse transformer head: Mask2Former-style query head adapted to sparse occupied voxels and a learnable empty token.
  • Scatter step reconstructs dense masks from sparse predictions for loss and output.
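The kernel-decomposition bullet can be made concrete with a parameter count. The accounting below assumes one dense k×k×k convolution is replaced by three sequential orthogonal 1D convolutions; the exact factorization in the paper may differ, so treat this as an illustration of why decomposition is cheaper rather than the method's actual layer design.

```python
def dense_kernel_params(k, c_in, c_out):
    """Weights in one dense k x k x k 3D convolution (bias ignored)."""
    return k ** 3 * c_in * c_out

def decomposed_kernel_params(k, c_in, c_out):
    """Weights in three sequential orthogonal 1D convolutions
    (k x 1 x 1, then 1 x k x 1, then 1 x 1 x k), widening to c_out
    after the first. Illustrative accounting only."""
    return k * c_in * c_out + 2 * k * c_out * c_out

# For k=3 and 64 channels the decomposition is 3x smaller:
# dense_kernel_params(3, 64, 64)      -> 110592
# decomposed_kernel_params(3, 64, 64) -> 36864
```

Beyond the parameter saving, the paper motivates decomposition as better matching anisotropic object shapes than a single cubic kernel.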

Training and Evaluation

  • Benchmarks: nuScenes-Occupancy validation and SemanticKITTI semantic scene completion.
  • Metrics: occupied IoU for geometry and mIoU for semantic occupancy.
  • The paper reports 74.9% FLOP reduction over a dense baseline while improving mIoU from 12.8% to 14.1% on nuScenes-Occupancy.
  • The arXiv HTML table reports SparseOcc at 21.8 IoU and 14.1 mIoU on nuScenes-Occupancy validation with 455G FLOPs and 13G memory.
  • The same table reports 0.19 s 3D latency and 0.25 s overall latency for the listed setting.
  • SemanticKITTI results show competitive mIoU with much lower FLOPs than dense or TPV baselines.
  • Losses include mask/class losses with Hungarian assignment, depth loss, and coarse binary segmentation loss.
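The mIoU metric in the list above follows the standard per-class definition over dense semantic grids: intersection over union per class, averaged over classes present in the union. A minimal reference implementation (the `ignore` index for unlabeled voxels is an assumption, matching common benchmark conventions):

```python
import numpy as np

def semantic_miou(pred, gt, num_classes, ignore=255):
    """Per-class IoU over semantic occupancy labels, averaged to mIoU.
    Voxels labeled `ignore` in the ground truth are excluded."""
    mask = gt != ignore
    ious = []
    for c in range(num_classes):
        inter = np.logical_and(pred == c, gt == c)[mask].sum()
        union = np.logical_or(pred == c, gt == c)[mask].sum()
        if union > 0:  # skip classes absent from both pred and gt
            ious.append(inter / union)
    return float(np.mean(ious)) if ious else 0.0

# One voxel of class 0 mispredicted as class 1: both classes score IoU 0.5.
pred = np.array([0, 1, 1])
gt = np.array([0, 1, 0])
print(semantic_miou(pred, gt, num_classes=2))  # 0.5
```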

Strengths

  • Large efficiency gain over dense 3D occupancy models.
  • Keeps real 3D coordinates instead of compressing height into BEV.
  • Reduces hallucination on empty voxels by not spending model capacity everywhere.
  • Sparse feature pyramid gives completion capacity without fully densifying the scene.
  • Sparse transformer head targets the expensive part of semantic occupancy directly.
  • Good candidate for embedded occupancy when dense 3D volumes are too costly.

Failure Modes

  • Sparse representation depends on the initial lifted features; missed active regions may never be recovered.
  • Completion is local and can fail for large occluded objects or long-range invisible space.
  • Sparse CUDA and custom ops can complicate deployment on automotive accelerators.
  • The empty token and binary filtering can suppress rare small objects if thresholds are poorly tuned.
  • LSS depth errors still determine where camera evidence enters 3D.
  • Dense output after scatter may look complete while uncertainty remains poorly calibrated.

Airside AV Fit

  • Attractive for airside occupancy because open apron space is mostly empty, so sparse compute should help.
  • Better than pure BEV for vertical structures such as stairs, loader masts, wing edges, and jet bridge components.
  • Risky for small safety objects such as chocks, cones, hoses, and FOD if initial sparse activation misses them.
  • Needs airside-specific sparse label QA around low-height and thin objects.
  • Sparse processing could support larger apron ranges than dense 3D volumes under the same compute budget.
  • Should be paired with explicit uncertainty and conservative planning masks near aircraft and personnel.

Implementation Notes

  • Do not confuse this method with other projects named SparseOcc; the work described here is the VISION-SJTU / Tang et al. version.
  • Audit coordinate order, voxel size, and sparse tensor backend before comparing results.
  • Tune sparse activation thresholds for rare airside objects, not only aggregate mIoU.
  • Test custom sparse ops on the target accelerator early.
  • Monitor active voxel count by range and class; unexpected densification removes the efficiency benefit.
  • Keep the dense scatter path deterministic for reproducible evaluation.
  • For airside, add visible/occluded split metrics and thin-object recall.
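The voxel-count monitoring note above can be implemented as a cheap runtime check: histogram active voxels by BEV range from the ego origin and alert when any bin densifies unexpectedly. This is a hypothetical helper (`active_voxels_by_range` is not part of any released codebase), sketched under the assumption of COO integer coordinates and a uniform voxel size.

```python
import numpy as np

def active_voxels_by_range(coords, voxel_size, origin, bins):
    """Histogram of active voxel counts by radial BEV distance from the
    ego origin; a cheap monitor for unexpected densification."""
    centers = coords.astype(np.float64) * voxel_size + origin
    dist = np.linalg.norm(centers[:, :2], axis=1)  # ignore height
    counts, _ = np.histogram(dist, bins=bins)
    return counts

# Two active voxels at ~1 m and ~15 m: one count in each range bin.
coords = np.array([[1, 0, 0], [15, 0, 0]])
counts = active_voxels_by_range(coords, voxel_size=1.0, origin=0.0,
                                bins=[0.0, 10.0, 20.0])
print(counts)  # [1 1]
```

Tracking these counts per class as well would catch the rare-object suppression failure mode flagged under Failure Modes.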

Sources

Notes compiled from publicly available research materials (paper, arXiv HTML, and project repository).