FlashOcc

What It Is

  • FlashOcc is a fast, memory-efficient camera-based occupancy prediction method built around a channel-to-height plugin.
  • It keeps features in BEV for most computation and only expands channels into height at the output stage.
  • The method is designed as a plug-and-play occupancy head for BEVDet-family frameworks.
  • It targets deployment constraints where dense 3D voxel networks are too slow or memory-heavy.
  • It is a semantic occupancy method, not a 3D detection method, though it can share BEVDet features.

Core Technical Idea

  • Avoid expensive dense 3D feature processing for occupancy.
  • Keep the main representation as a BEV feature map so efficient 2D convolutions do most of the work.
  • Predict BEV output channels that jointly encode height slices and per-class semantic logits.
  • Apply a channel-to-height transform to reshape BEV logits into a 3D occupancy volume.
  • Use the BEVDet/BEVDepth/BEVStereo family as the image-to-BEV front end.
  • The core insight is that height can be represented in channels until the final logits are needed.
  • This trades some 3D interaction capacity for major runtime and memory savings.
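
The channel-to-height step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the repository's implementation: the function name `channel_to_height`, the toy tensor sizes, and the height-major channel layout (channel index = height bin × number of classes + class) are assumptions made for the example.

```python
import numpy as np


def channel_to_height(bev_logits: np.ndarray, num_height_bins: int, num_classes: int) -> np.ndarray:
    """Reshape BEV logits (B, Dz*K, H, W) into voxel logits (B, H, W, Dz, K).

    Assumes a height-major channel layout: channel = z * K + k.
    """
    b, c, h, w = bev_logits.shape
    if c != num_height_bins * num_classes:
        raise ValueError(
            f"expected {num_height_bins * num_classes} channels, got {c}"
        )
    # Move channels last so each BEV cell owns one contiguous channel vector.
    channels_last = bev_logits.transpose(0, 2, 3, 1)  # (B, H, W, Dz*K)
    # Split that vector into (height bin, class) without any 3D convolution.
    return channels_last.reshape(b, h, w, num_height_bins, num_classes)


# Toy sizes chosen for readability only.
num_height_bins, num_classes = 4, 3
rng = np.random.default_rng(0)
bev = rng.standard_normal((2, num_height_bins * num_classes, 5, 6))

voxels = channel_to_height(bev, num_height_bins, num_classes)
labels = voxels.argmax(axis=-1)  # per-voxel semantic class
print(voxels.shape, labels.shape)  # (2, 5, 6, 4, 3) (2, 5, 6, 4)
```

All learned computation happens before this reshape, which is why the step itself adds no parameters and no 3D neighborhood reasoning.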

Inputs and Outputs

  • Inputs: surround camera images, camera intrinsics/extrinsics, and BEVDet-style geometry metadata.
  • Optional inputs: temporal frames if using BEVDetOCC-4D-Stereo or related backbones.
  • Training inputs: Occ3D-nuScenes or compatible semantic occupancy labels.
  • Output: 3D semantic occupancy grid after channel-to-height reshaping.
  • Intermediate output: BEV feature map refined by 2D convolutions.
  • It does not inherently produce instance IDs, object tracks, or box detections unless paired with other heads.
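
To make the output grid concrete, the sketch below derives a voxel grid shape from a metric range and maps a point to its voxel index. The class `OccGrid` and its method names are hypothetical. The default values (a range of -40 m to 40 m in x and y, -1 m to 5.4 m in z, and 0.4 m voxels, in the ego frame) are the commonly cited Occ3D-nuScenes settings and are treated here as assumptions; check them against the dataset config in use.

```python
import math
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class OccGrid:
    """Hypothetical helper describing an axis-aligned occupancy grid."""

    x_range: Tuple[float, float] = (-40.0, 40.0)
    y_range: Tuple[float, float] = (-40.0, 40.0)
    z_range: Tuple[float, float] = (-1.0, 5.4)
    voxel_size: float = 0.4  # metres, assumed equal on all axes

    def _ranges(self):
        return (self.x_range, self.y_range, self.z_range)

    def shape(self) -> Tuple[int, int, int]:
        # round() guards against floating-point noise such as 16.000000000000004.
        return tuple(round((hi - lo) / self.voxel_size) for lo, hi in self._ranges())

    def point_to_voxel(self, x: float, y: float, z: float) -> Optional[Tuple[int, int, int]]:
        """Return (ix, iy, iz) for a metric point, or None if it lies outside the grid."""
        index = []
        for value, (lo, _), size in zip((x, y, z), self._ranges(), self.shape()):
            i = math.floor((value - lo) / self.voxel_size)
            if not 0 <= i < size:
                return None
            index.append(i)
        return tuple(index)


grid = OccGrid()
print(grid.shape())                        # (200, 200, 16)
print(grid.point_to_voxel(0.1, 0.1, 0.0))  # (100, 100, 2)
print(grid.point_to_voxel(0.1, 0.1, 6.0))  # None: above the grid's z range
```

The last call shows a practical consequence of the grid definition: anything above the top of the z range is simply not represented in the output.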

Architecture

  • Front end: BEVDet-style image backbone, depth/view transform, and BEV pooling.
  • BEV encoder: 2D convolutional feature extraction in top-down space.
  • Occupancy head: predicts a channel tensor whose channels correspond to height and class combinations.
  • Channel-to-height plugin: reshapes or maps channel groups into vertical voxel logits.
  • Temporal/stereo variants can use BEVDetOCC-4D-Stereo or stronger image backbones.
  • Panoptic-FlashOcc adds panoptic occupancy through instance-center modeling, but base FlashOcc is semantic occupancy.
  • The official repository includes training code, TensorRT testing code, and later FlashOccV2 and Panoptic-FlashOcc updates.

Training and Evaluation

  • Primary benchmark: Occ3D-nuScenes semantic occupancy.
  • Metrics include mIoU, FPS, FLOPs, and parameter count.
  • The arXiv paper reports higher precision, faster runtime, and lower memory cost than previous occupancy baselines under its settings.
  • Official README reports FlashOCC R50 256x704 variants at about 31.95 to 32.08 mIoU with 152.7 to 197.6 FPS in TensorRT FP16 on RTX 3090.
  • The README reports FlashOCC-4D-Stereo variants improving BEVDetOCC-4D-Stereo, including a Swin-B setting at 43.52 mIoU.
  • The repo notes a 2024 technical report in which FlashOcc is inserted into BEVDet at a cost of about 1.1 ms of inference time.
  • Evaluations must state the inference backend and precision, because PyTorch FPS and TensorRT FP16 FPS are not directly comparable.
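
For readers unfamiliar with the mIoU metric listed above, here is a minimal NumPy sketch based on a confusion matrix. The function name `mean_iou` and its parameters are hypothetical. Whether unobserved voxels are masked out and whether a free-space class is excluded from the mean depends on the benchmark protocol, so both are exposed as arguments rather than hard-coded.

```python
import numpy as np


def mean_iou(pred, gt, num_classes, mask=None, ignore_classes=()):
    """Mean intersection-over-union across classes from flat label arrays.

    mask: optional boolean array; only voxels where it is True are scored.
    ignore_classes: class ids left out of the mean (e.g. a free-space class).
    """
    pred = np.asarray(pred).ravel().astype(np.int64)
    gt = np.asarray(gt).ravel().astype(np.int64)
    if mask is not None:
        keep = np.asarray(mask, dtype=bool).ravel()
        pred, gt = pred[keep], gt[keep]

    # Confusion matrix: rows are ground-truth classes, columns are predictions.
    conf = np.bincount(gt * num_classes + pred, minlength=num_classes**2)
    conf = conf.reshape(num_classes, num_classes)

    tp = np.diag(conf)
    union = conf.sum(axis=0) + conf.sum(axis=1) - tp
    ious = [
        tp[c] / union[c]
        for c in range(num_classes)
        if c not in ignore_classes and union[c] > 0  # skip classes absent from both
    ]
    return float(np.mean(ious)) if ious else float("nan")


gt = [0, 0, 1, 1, 2, 2]
pred = [0, 1, 1, 1, 2, 0]
print(mean_iou(pred, gt, num_classes=3))                      # per-class IoU 1/3, 2/3, 1/2
print(mean_iou(pred, gt, num_classes=3, ignore_classes=(0,)))
```

When reporting numbers, the masking and class-exclusion choices belong next to the score, in the same way the backend belongs next to an FPS figure.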

Strengths

  • Very deployment-oriented compared with dense 3D occupancy networks.
  • Uses mature BEVDet-family components and avoids many 3D custom kernels.
  • 2D convolutional BEV processing maps well to common accelerators.
  • Can improve occupancy without fully replacing an existing BEV detector stack.
  • Strong option when latency and memory are first-order constraints.
  • Simple conceptual interface: BEV features in, voxel logits out.

Failure Modes

  • The channel-to-height design performs little explicit 3D neighborhood reasoning before the output stage.
  • Vertical ambiguity is folded into channels and may be harder to resolve than with true 3D convolutions.
  • Thin or overhanging objects may suffer if BEV features do not preserve enough height evidence.
  • Occupancy quality depends heavily on the front-end depth/view transform.
  • Reported FPS may rely on TensorRT FP16 and specific hardware, so deployment on other backends or hardware may be slower.
  • The method can overfit to Occ3D-nuScenes height/range conventions if not retuned.

Airside AV Fit

  • Strong practical fit for airside vehicles that need occupancy-like output on embedded compute.
  • BEV-first processing matches apron planning grids and low-speed vehicle control.
  • Needs careful validation for vertical hazards: wings, loader booms, stairs, jet bridges, and service equipment.
  • Channel-to-height may be sufficient for ground-level GSE but risky for aircraft overhang clearance without LiDAR validation.
  • Good candidate for a camera fallback occupancy stream when LiDAR is degraded.
  • Must include night, glare, wet pavement, and camera contamination tests before operational use.

Implementation Notes

  • Start from an existing BEVDetOCC or BEVDepth-compatible config to avoid geometry plumbing errors.
  • Treat the channel-to-height mapping as part of the model contract; changing height bins requires head changes.
  • Benchmark both PyTorch and TensorRT on target hardware.
  • Check memory bandwidth, not only FLOPs, because reshaping and writing the dense logit tensor can dominate.
  • For airside, increase or retune vertical bins around aircraft and equipment height ranges.
  • Add visibility and uncertainty masks if the planner consumes the dense occupancy grid directly.
  • Keep base FlashOcc separate from Panoptic-FlashOcc in documentation and metrics.
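
The notes on height bins and memory can be made concrete with some back-of-the-envelope arithmetic. The sketch below computes how many output channels the head needs for a given vertical range and how large the resulting logit tensor is. The helper names are hypothetical, the baseline numbers (16 bins, 18 classes, a 200×200 BEV grid) are assumed Occ3D-nuScenes-style settings, and the taller range is an invented illustration, not a recommended airside configuration.

```python
def head_channels(z_min: float, z_max: float, voxel_height: float, num_classes: int):
    """Return (height_bins, output_channels) for a channel-to-height head."""
    bins_exact = (z_max - z_min) / voxel_height
    bins = round(bins_exact)
    if bins <= 0 or abs(bins_exact - bins) > 1e-6:
        raise ValueError("z range must be a positive multiple of voxel_height")
    return bins, bins * num_classes


def logit_bytes(bev_h: int, bev_w: int, channels: int, bytes_per_value: int) -> int:
    """Size of one dense logit tensor of shape (channels, bev_h, bev_w)."""
    return bev_h * bev_w * channels * bytes_per_value


# Assumed baseline: -1 m to 5.4 m at 0.4 m per bin, 18 classes.
base_bins, base_channels = head_channels(-1.0, 5.4, 0.4, 18)
# Hypothetical taller range to cover overhead structure: -1 m to 11.8 m.
tall_bins, tall_channels = head_channels(-1.0, 11.8, 0.4, 18)

for name, bins, channels in (
    ("baseline", base_bins, base_channels),
    ("taller", tall_bins, tall_channels),
):
    size_mb = logit_bytes(200, 200, channels, 2) / 1e6  # FP16: 2 bytes per value
    print(f"{name}: {bins} bins, {channels} channels, {size_mb:.2f} MB per frame")
```

Doubling the number of height bins doubles the head's output channels and the logit tensor size, which is why changing the vertical range is a head change and a bandwidth question rather than a config tweak.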

Sources

Public research notes collected from public sources.