
TPVFormer

What It Is

  • TPVFormer is a vision-based 3D semantic occupancy method built around a tri-perspective-view representation.
  • It was introduced in "Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction" at CVPR 2023.
  • The method addresses BEV's loss of vertical structure by adding two orthogonal planes to the BEV plane.
  • It predicts semantic occupancy from camera inputs while using much less memory than dense 3D voxel features.
  • It is an occupancy representation and encoder method, not a 3D box detector.

Core Technical Idea

  • Represent the 3D scene with three feature planes: top, front, and side.
  • Model a 3D point by projecting it onto all three planes and summing the corresponding features (see the sketch after this list).
  • Use image cross-attention to lift multi-camera 2D image features into TPV grid queries.
  • Use cross-view hybrid attention so the three planes exchange information.
  • Decode point or voxel semantics from the summed TPV features.
  • The design seeks a middle ground between BEV efficiency and dense voxel expressiveness.
  • It makes vertical and side structure visible to the model without materializing a full dense 3D tensor.
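A minimal sketch of the point-feature aggregation step, assuming PyTorch, an illustrative axis convention (H for y, W for x, Z for z), and hypothetical plane shapes and helper name; this is not the official TPVFormer API:

```python
import torch
import torch.nn.functional as F

def tpv_point_features(tpv_hw, tpv_zh, tpv_wz, pts_norm):
    """Sum TPV plane features for query points.

    Assumed layout (illustrative): tpv_hw (1, C, H, W) top plane,
    tpv_zh (1, C, Z, H) side plane, tpv_wz (1, C, W, Z) front plane,
    with H<->y, W<->x, Z<->z. pts_norm is (N, 3) in [-1, 1] as (x, y, z).
    Returns (N, C) per-point features.
    """
    x, y, z = pts_norm[:, 0], pts_norm[:, 1], pts_norm[:, 2]

    def sample(plane, u, v):
        # grid_sample reads grid[..., 0] along the last (width) axis and
        # grid[..., 1] along the second-to-last (height) axis.
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)
        feat = F.grid_sample(plane, grid, mode='bilinear', align_corners=False)
        return feat.view(plane.shape[1], -1).t()  # (N, C)

    # Project each point onto the three planes (dropping one axis each time)
    # and sum the bilinearly sampled features.
    f_hw = sample(tpv_hw, x, y)  # drops z
    f_zh = sample(tpv_zh, y, z)  # drops x
    f_wz = sample(tpv_wz, z, x)  # drops y
    return f_hw + f_zh + f_wz
```

A lightweight classifier head can then map the (N, C) per-point features to semantic logits.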

Inputs and Outputs

  • Inputs: multi-camera RGB images, camera intrinsics, camera extrinsics, and ego-frame geometry metadata (an illustrative batch layout follows this list).
  • Training supervision in the original occupancy task: sparse semantic LiDAR labels rather than dense hand-labeled voxels.
  • Output: semantic occupancy labels for voxels or queried 3D points.
  • Output classes follow the task dataset, such as nuScenes lidar segmentation semantics or SemanticKITTI SSC classes.
  • Intermediate output: three orthogonal TPV feature planes.
  • It does not output 3D boxes, instance tracks, or explicit per-object shapes unless paired with another head.
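An illustrative per-sample batch layout; the field names, camera count, image size, and point count are placeholders, not the official repo's data pipeline:

```python
import torch

# Hypothetical per-sample batch for a TPVFormer-style model.
batch = {
    "imgs":         torch.zeros(6, 3, 900, 1600),  # 6 surround-view RGB cameras
    "intrinsics":   torch.zeros(6, 3, 3),          # per-camera K matrices
    "extrinsics":   torch.zeros(6, 4, 4),          # camera-to-ego transforms
    # sparse supervision: LiDAR points in the ego frame with per-point semantics
    "points":       torch.zeros(30000, 3),
    "point_labels": torch.zeros(30000, dtype=torch.long),
}
# Expected output: per-point or per-voxel class logits, e.g. (num_classes, X, Y, Z)
# for a dense voxel grid decoded from the three TPV planes.
```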

Architecture

  • Image backbone extracts multi-scale features from each camera image.
  • TPV queries are initialized on three orthogonal planes.
  • Image cross-attention samples relevant 2D image features for each TPV query.
  • Cross-view hybrid attention exchanges context across top, side, and front planes.
  • A lightweight prediction head maps the sum of projected plane features to point or voxel labels.
  • For voxel features, TPV planes can be broadcast along their orthogonal axes and summed (see the sketch after this list).
  • The official code is based on BEVFormer and Cylinder3D components.
  • The repo includes configurations for nuScenes LiDAR segmentation, 3D semantic occupancy, and SemanticKITTI semantic scene completion.
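A minimal sketch of the broadcast-and-sum expansion to voxel features, under the same illustrative plane layout as the earlier sketch (not the official implementation):

```python
import torch

def tpv_to_voxel_features(tpv_hw, tpv_zh, tpv_wz):
    """Expand three TPV planes into a dense voxel volume by broadcasting each
    plane along its orthogonal axis and summing.

    Assumed layout (illustrative): tpv_hw (C, H, W), tpv_zh (C, Z, H),
    tpv_wz (C, W, Z) with H<->y, W<->x, Z<->z. Returns (C, H, W, Z).
    """
    vox = tpv_hw[:, :, :, None]                          # (C, H, W, 1): constant along z
    vox = vox + tpv_zh.permute(0, 2, 1)[:, :, None, :]   # (C, H, 1, Z): constant along x
    vox = vox + tpv_wz[:, None, :, :]                    # (C, 1, W, Z): constant along y
    return vox
```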

Training and Evaluation

  • Benchmarks include nuScenes LiDAR segmentation (Panoptic nuScenes-style evaluation) and SemanticKITTI semantic scene completion.
  • The paper formulates vision-based 3D semantic occupancy with sparse LiDAR semantic labels during training.
  • The CVPR paper reports that camera-only TPVFormer can be comparable with LiDAR-based methods on the nuScenes LiDAR segmentation task.
  • The official README lists 6 camera images, 16 semantic classes, sparse LiDAR semantic labels, and roughly 290 ms inference on a single A100 for its Tesla Occupancy Network comparison.
  • The repo provides a lower-memory 3090 configuration for occupancy training and separate SemanticKITTI support.
  • Metrics depend on the task: mIoU for LiDAR segmentation and semantic scene completion (a minimal mIoU sketch follows this list), plus qualitative dense occupancy visualizations for the original sparse-supervision setup.
  • Results should be interpreted with the supervision type stated clearly: sparse labels are not the same as dense occupancy labels.
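A standard mIoU sketch over flattened label tensors, for orientation only; it is not the benchmarks' official evaluation script, and the `ignore_index` handling is an assumption:

```python
import torch

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Per-class IoU averaged over classes that appear in the union.
    pred and target are integer label tensors of the same shape."""
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (target == c)).sum().float()
        union = ((pred == c) | (target == c)).sum().float()
        if union > 0:
            ious.append((inter / union).item())
    return sum(ious) / max(len(ious), 1)
```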

Strengths

  • Preserves more vertical structure than a pure BEV plane.
  • Much cheaper than dense 3D voxel attention or 3D convolution over the full volume.
  • Works naturally with multi-camera image features and calibrated projection.
  • Flexible enough to produce point features or dense voxel features.
  • Good conceptual bridge between BEV detection and full semantic occupancy.
  • Official implementation and project page make it easy to audit architecture choices.

Failure Modes

  • TPV is still a compressed representation; complex geometry can alias across planes.
  • Sparse LiDAR supervision can produce plausible but unverified dense predictions in unseen voxels.
  • Inference cost can still be high for embedded deployment.
  • The original comparison uses no temporal context, which limits handling of occlusion and motion.
  • Calibration and camera coverage errors directly affect image cross-attention.
  • Thin structures, overhangs, and rare classes can be missed if sparse labels do not cover them.

Airside AV Fit

  • Strong fit for representing airside vertical structure better than flat BEV: aircraft tails, loader masts, jet bridges, and service equipment.
  • Useful for camera occupancy research where dense airside voxel labels are not yet available.
  • Sparse LiDAR supervision is attractive if LiDAR survey vehicles can collect training labels.
  • Needs validation on aircraft overhangs, tow bars, hoses, chocks, cones, personnel, and stand equipment.
  • The original non-temporal setup is insufficient for safety-critical occlusion handling around parked aircraft.
  • Best used as a representation baseline for airside occupancy, then extended with temporal context and uncertainty.

Implementation Notes

  • Keep top/front/side plane resolutions tied to the physical voxel grid so projected features align.
  • Validate coordinate conventions carefully; plane projection bugs can silently corrupt training (a shared-helper sketch follows this list).
  • If using dense labels from another pipeline, document that the supervision differs from the original TPVFormer setup.
  • Profile attention memory before choosing range and resolution for apron-scale scenes.
  • Add temporal fusion externally if operating around occluding aircraft or moving GSE.
  • Use class-balanced losses or sampling for rare airside objects such as cones, chocks, and pedestrians.
  • Compare against BEV-only and dense-voxel baselines to prove TPV's value in the target domain.
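One way to reduce coordinate-convention bugs is to keep the ego-to-normalized-grid mapping in a single shared helper used by all three planes. The helper below and the apron-scale range values are illustrative assumptions, not the official config:

```python
import torch

def ego_to_norm(pts_ego, pc_range):
    """Map ego-frame points (N, 3) in meters into [-1, 1] normalized grid
    coordinates, given pc_range = (x_min, y_min, z_min, x_max, y_max, z_max).
    Keeping this mapping in one place makes projection-convention bugs easier
    to catch than re-deriving it separately per plane."""
    lo = torch.tensor(pc_range[:3], dtype=pts_ego.dtype)
    hi = torch.tensor(pc_range[3:], dtype=pts_ego.dtype)
    pts_norm = (pts_ego - lo) / (hi - lo) * 2.0 - 1.0
    # Mask points outside the grid instead of silently clamping them.
    in_range = ((pts_norm >= -1.0) & (pts_norm <= 1.0)).all(dim=-1)
    return pts_norm, in_range

# Hypothetical apron-scale range: 100 m x 100 m around ego, -2 m to 6 m vertically.
pts_norm, mask = ego_to_norm(torch.zeros(1, 3), (-50.0, -50.0, -2.0, 50.0, 50.0, 6.0))
```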

Sources

Research notes compiled from public sources.