
TPVFormer

What It Is

  • TPVFormer is a vision-based 3D semantic occupancy method built around a tri-perspective-view representation.
  • It was introduced in "Tri-Perspective View for Vision-Based 3D Semantic Occupancy Prediction" at CVPR 2023.
  • The method addresses BEV's loss of vertical structure by adding two orthogonal planes to the BEV plane.
  • It predicts semantic occupancy from camera inputs while using much less memory than dense 3D voxel features.
  • It is an occupancy representation and encoder method, not a 3D box detector.

Core Technical Idea

  • Represent the 3D scene with three feature planes: top, front, and side.
  • Model a 3D point by projecting it onto all three planes and summing the corresponding features (see the sketch after this list).
  • Use image cross-attention to lift multi-camera 2D image features into TPV grid queries.
  • Use cross-view hybrid attention so the three planes exchange information.
  • Decode point or voxel semantics from the summed TPV features.
  • The design seeks a middle ground between BEV efficiency and dense voxel expressiveness.
  • It makes vertical and side structure visible to the model without materializing a full dense 3D tensor.
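A minimal sketch of the point-feature aggregation step, assuming PyTorch, an illustrative axis convention (H for y, W for x, Z for z), and hypothetical plane shapes and helper name; this is not the official TPVFormer API:

```python
import torch
import torch.nn.functional as F

def tpv_point_features(tpv_hw, tpv_zh, tpv_wz, pts_norm):
    """Sum TPV plane features for query points.

    Assumed layout (illustrative): tpv_hw (1, C, H, W) top plane,
    tpv_zh (1, C, Z, H) side plane, tpv_wz (1, C, W, Z) front plane,
    with H<->y, W<->x, Z<->z. pts_norm is (N, 3) in [-1, 1] as (x, y, z).
    Returns (N, C) per-point features.
    """
    x, y, z = pts_norm[:, 0], pts_norm[:, 1], pts_norm[:, 2]

    def sample(plane, u, v):
        # grid_sample reads grid[..., 0] along the last (width) axis and
        # grid[..., 1] along the second-to-last (height) axis.
        grid = torch.stack([u, v], dim=-1).view(1, -1, 1, 2)
        feat = F.grid_sample(plane, grid, mode='bilinear', align_corners=False)
        return feat.view(plane.shape[1], -1).t()  # (N, C)

    # Project each point onto the three planes (dropping one axis each time)
    # and sum the bilinearly sampled features.
    f_hw = sample(tpv_hw, x, y)  # drops z
    f_zh = sample(tpv_zh, y, z)  # drops x
    f_wz = sample(tpv_wz, z, x)  # drops y
    return f_hw + f_zh + f_wz
```

A lightweight classifier head can then map the (N, C) per-point features to semantic logits.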

Inputs and Outputs

  • Inputs: multi-camera RGB images, camera intrinsics, camera extrinsics, and ego-frame geometry metadata (an illustrative batch layout follows this list).
  • Training supervision in the original occupancy task: sparse semantic LiDAR labels rather than dense hand-labeled voxels.
  • Output: semantic occupancy labels for voxels or queried 3D points.
  • Output classes follow the task dataset, such as nuScenes lidar segmentation semantics or SemanticKITTI SSC classes.
  • Intermediate output: three orthogonal TPV feature planes.
  • It does not output 3D boxes, instance tracks, or explicit per-object shapes unless paired with another head.
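An illustrative per-sample batch layout; the field names, camera count, image size, and point count are placeholders, not the official repo's data pipeline:

```python
import torch

# Hypothetical per-sample batch for a TPVFormer-style model.
batch = {
    "imgs":         torch.zeros(6, 3, 900, 1600),  # 6 surround-view RGB cameras
    "intrinsics":   torch.zeros(6, 3, 3),          # per-camera K matrices
    "extrinsics":   torch.zeros(6, 4, 4),          # camera-to-ego transforms
    # sparse supervision: LiDAR points in the ego frame with per-point semantics
    "points":       torch.zeros(30000, 3),
    "point_labels": torch.zeros(30000, dtype=torch.long),
}
# Expected output: per-point or per-voxel class logits, e.g. (num_classes, X, Y, Z)
# for a dense voxel grid decoded from the three TPV planes.
```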

Architecture

  • Image backbone extracts multi-scale features from each camera image.
  • TPV queries are initialized on three orthogonal planes.
  • Image cross-attention samples relevant 2D image features for each TPV query.
  • Cross-view hybrid attention exchanges context across top, side, and front planes.
  • A lightweight prediction head maps the sum of projected plane features to point or voxel labels.
  • For voxel features, TPV planes can be broadcast along their orthogonal axes and summed (see the sketch after this list).
  • The official code is based on BEVFormer and Cylinder3D components.
  • The repo includes configurations for nuScenes LiDAR segmentation, 3D semantic occupancy, and SemanticKITTI semantic scene completion.
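A minimal sketch of the broadcast-and-sum expansion to voxel features, under the same illustrative plane layout as the earlier sketch (not the official implementation):

```python
import torch

def tpv_to_voxel_features(tpv_hw, tpv_zh, tpv_wz):
    """Expand three TPV planes into a dense voxel volume by broadcasting each
    plane along its orthogonal axis and summing.

    Assumed layout (illustrative): tpv_hw (C, H, W), tpv_zh (C, Z, H),
    tpv_wz (C, W, Z) with H<->y, W<->x, Z<->z. Returns (C, H, W, Z).
    """
    vox = tpv_hw[:, :, :, None]                          # (C, H, W, 1): constant along z
    vox = vox + tpv_zh.permute(0, 2, 1)[:, :, None, :]   # (C, H, 1, Z): constant along x
    vox = vox + tpv_wz[:, None, :, :]                    # (C, 1, W, Z): constant along y
    return vox
```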

Training and Evaluation

  • Benchmarks include nuScenes LiDAR segmentation (Panoptic nuScenes-style evaluation) and SemanticKITTI semantic scene completion.
  • The paper formulates vision-based 3D semantic occupancy with sparse LiDAR semantic labels during training.
  • The CVPR paper reports that camera-only TPVFormer can be comparable with LiDAR-based methods on the nuScenes LiDAR segmentation task.
  • The official README lists 6 camera images, 16 semantic classes, sparse LiDAR semantic labels, and roughly 290 ms inference on a single A100 for its Tesla Occupancy Network comparison.
  • The repo provides a lower-memory 3090 configuration for occupancy training and separate SemanticKITTI support.
  • Metrics depend on the task: mIoU for LiDAR segmentation and semantic scene completion (a minimal mIoU sketch follows this list), plus qualitative dense occupancy visualizations for the original sparse-supervision setup.
  • Results should be interpreted with the supervision type stated clearly: sparse labels are not the same as dense occupancy labels.
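A standard mIoU sketch over flattened label tensors, for orientation only; it is not the benchmarks' official evaluation script, and the `ignore_index` handling is an assumption:

```python
import torch

def mean_iou(pred, target, num_classes, ignore_index=255):
    """Per-class IoU averaged over classes that appear in the union.
    pred and target are integer label tensors of the same shape."""
    valid = target != ignore_index
    pred, target = pred[valid], target[valid]
    ious = []
    for c in range(num_classes):
        inter = ((pred == c) & (target == c)).sum().float()
        union = ((pred == c) | (target == c)).sum().float()
        if union > 0:
            ious.append((inter / union).item())
    return sum(ious) / max(len(ious), 1)
```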

Strengths

  • Preserves more vertical structure than a pure BEV plane.
  • Much cheaper than dense 3D voxel attention or 3D convolution over the full volume.
  • Works naturally with multi-camera image features and calibrated projection.
  • Flexible enough to produce point features or dense voxel features.
  • Good conceptual bridge between BEV detection and full semantic occupancy.
  • Official implementation and project page make it easy to audit architecture choices.

Failure Modes

  • TPV is still a compressed representation; complex geometry can alias across planes.
  • Sparse LiDAR supervision can produce plausible but unverified dense predictions in unseen voxels.
  • Inference cost can still be high for embedded deployment.
  • The original comparison uses no temporal context, which limits handling of occlusion and motion.
  • Calibration and camera coverage errors directly affect image cross-attention.
  • Thin structures, overhangs, and rare classes can be missed if sparse labels do not cover them.

Airside AV Fit

  • Strong fit for representing airside vertical structure better than flat BEV: aircraft tails, loader masts, jet bridges, and service equipment.
  • Useful for camera occupancy research where dense airside voxel labels are not yet available.
  • Sparse LiDAR supervision is attractive if LiDAR survey vehicles can collect training labels.
  • Needs validation on aircraft overhangs, tow bars, hoses, chocks, cones, personnel, and stand equipment.
  • The original non-temporal setup is insufficient for safety-critical occlusion handling around parked aircraft.
  • Best used as a representation baseline for airside occupancy, then extended with temporal context and uncertainty.

Implementation Notes

  • Keep top/front/side plane resolutions tied to the physical voxel grid so projected features align.
  • Validate coordinate conventions carefully; plane projection bugs can silently corrupt training (a shared-helper sketch follows this list).
  • If using dense labels from another pipeline, document that the supervision differs from the original TPVFormer setup.
  • Profile attention memory before choosing range and resolution for apron-scale scenes.
  • Add temporal fusion externally if operating around occluding aircraft or moving GSE.
  • Use class-balanced losses or sampling for rare airside objects such as cones, chocks, and pedestrians.
  • Compare against BEV-only and dense-voxel baselines to prove TPV's value in the target domain.
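One way to reduce coordinate-convention bugs is to keep the ego-to-normalized-grid mapping in a single shared helper used by all three planes. The helper below and the apron-scale range values are illustrative assumptions, not the official config:

```python
import torch

def ego_to_norm(pts_ego, pc_range):
    """Map ego-frame points (N, 3) in meters into [-1, 1] normalized grid
    coordinates, given pc_range = (x_min, y_min, z_min, x_max, y_max, z_max).
    Keeping this mapping in one place makes projection-convention bugs easier
    to catch than re-deriving it separately per plane."""
    lo = torch.tensor(pc_range[:3], dtype=pts_ego.dtype)
    hi = torch.tensor(pc_range[3:], dtype=pts_ego.dtype)
    pts_norm = (pts_ego - lo) / (hi - lo) * 2.0 - 1.0
    # Mask points outside the grid instead of silently clamping them.
    in_range = ((pts_norm >= -1.0) & (pts_norm <= 1.0)).all(dim=-1)
    return pts_norm, in_range

# Hypothetical apron-scale range: 100 m x 100 m around ego, -2 m to 6 m vertically.
pts_norm, mask = ego_to_norm(torch.zeros(1, 3), (-50.0, -50.0, -2.0, 50.0, 50.0, 6.0))
```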

Sources

Research notes compiled from public sources.