GaussianFormer
What It Is
- GaussianFormer is an ECCV 2024 camera-based 3D semantic occupancy method that represents a driving scene as sparse semantic 3D Gaussians.
- GaussianFormer-2 is the CVPR 2025 follow-up that improves the representation with probabilistic Gaussian superposition.
- The family targets vision-centric 3D occupancy, not object-box detection or offline photorealistic reconstruction.
- Its key move is to replace dense voxel features with object-centric Gaussian primitives, then splat them back to voxels only for occupancy output and benchmark evaluation.
- It is a method-level complement to 3D Gaussian Splatting for Driving and Occupancy Network Architectures.
Core Technical Idea
- Dense voxel grids allocate memory and computation to empty 3D space; road scenes are sparse and have object scales ranging from pedestrians to buses.
- GaussianFormer predicts a set of sparse 3D semantic Gaussians, where each Gaussian carries position, scale, rotation, and semantic logits.
- Image features are aggregated into Gaussian queries through cross-attention and refined over several blocks.
- Sparse convolution over Gaussian means models local interactions without processing a full voxel tensor.
- Gaussian-to-voxel splatting converts the Gaussian set into dense semantic occupancy predictions by aggregating nearby Gaussians for each queried voxel.
- GaussianFormer-2 changes the aggregation semantics: each Gaussian is treated as a probability distribution over occupied space, and geometry is combined by probabilistic multiplication.
- GaussianFormer-2 also uses an exact Gaussian mixture formulation for semantics and a distribution-based initialization module that places Gaussians around likely non-empty regions rather than relying only on an estimated surface depth.
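The difference between additive splatting and probabilistic superposition can be illustrated with a minimal per-point sketch. This is a simplification of both papers' formulations (no normalization constants, dict-based Gaussians, and the field names `mean`, `cov_inv`, `opacity` are ours), but it shows the key change: GaussianFormer-2 combines per-Gaussian occupancy probabilities as independent events instead of summing densities.

```python
import numpy as np

def gaussian_density(x, mean, cov_inv):
    """Unnormalized 3D Gaussian density at point x."""
    d = x - mean
    return np.exp(-0.5 * d @ cov_inv @ d)

def occupancy_additive(x, gaussians):
    """GaussianFormer-style aggregation: sum opacity-weighted
    densities of nearby Gaussians, clamped to [0, 1]."""
    total = sum(g["opacity"] * gaussian_density(x, g["mean"], g["cov_inv"])
                for g in gaussians)
    return min(total, 1.0)

def occupancy_superposition(x, gaussians):
    """GaussianFormer-2-style aggregation: treat each Gaussian as an
    independent occupancy probability and combine by probabilistic
    multiplication, P(occupied) = 1 - prod_i (1 - p_i(x))."""
    free = 1.0
    for g in gaussians:
        p = g["opacity"] * gaussian_density(x, g["mean"], g["cov_inv"])
        free *= 1.0 - p
    return 1.0 - free
```

The superposition form stays in [0, 1] by construction and saturates gracefully when many Gaussians overlap, whereas the additive form needs an explicit clamp.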
Inputs and Outputs
- Input: surround-view camera images for nuScenes-style evaluation.
- Input: camera intrinsics and extrinsics for cross-view feature sampling and 3D query projection.
- Optional input variant: monocular camera input for KITTI-360 evaluation.
- Training input: semantic occupancy labels for supervised occupancy learning.
- Output: 3D semantic occupancy grid over the evaluation volume.
- Intermediate output: sparse semantic Gaussians with explicit mean, scale, rotation, opacity or occupancy probability, and class distribution.
- Non-output: no native LiDAR point cloud, no object tracks, and no persistent map unless wrapped in a temporal model such as GaussianWorld.
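The intermediate Gaussian output listed above can be made concrete as a small container type. This is an illustrative sketch, not the repository's actual tensor layout; the field names and the quaternion convention are assumptions. It also shows the standard covariance parameterization Cov = R diag(scale²) Rᵀ that makes scale and rotation explicit.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class SemanticGaussian:
    """One sparse scene primitive (illustrative field layout)."""
    mean: np.ndarray      # (3,) position in the ego/world frame, metres
    scale: np.ndarray     # (3,) per-axis extent
    rotation: np.ndarray  # (4,) unit quaternion (w, x, y, z), assumed convention
    opacity: float        # opacity / occupancy probability in [0, 1]
    logits: np.ndarray    # (C,) unnormalized semantic class scores

    def covariance(self) -> np.ndarray:
        """Cov = R diag(scale^2) R^T, the usual anisotropic 3D
        Gaussian parameterization."""
        w, x, y, z = self.rotation
        R = np.array([
            [1 - 2*(y*y + z*z), 2*(x*y - w*z),     2*(x*z + w*y)],
            [2*(x*y + w*z),     1 - 2*(x*x + z*z), 2*(y*z - w*x)],
            [2*(x*z - w*y),     2*(y*z + w*x),     1 - 2*(x*x + y*y)],
        ])
        return R @ np.diag(self.scale**2) @ R.T
```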
Architecture or Pipeline
- Extract multi-scale image features with an image backbone.
- Initialize learnable Gaussian queries and initial Gaussian properties.
- Run self-encoding or sparse-convolution blocks on the Gaussian set.
- Use image cross-attention to sample camera features for each Gaussian query.
- Iteratively refine Gaussian position, covariance parameters, and semantics.
- Decode final Gaussian attributes into a sparse Gaussian scene representation.
- Convert Gaussians to voxel occupancy with local Gaussian-to-voxel splatting.
- In GaussianFormer-2, add probabilistic geometry aggregation, Gaussian-mixture semantic aggregation, and distribution-based initialization before refinement.
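The last pipeline step, local Gaussian-to-voxel splatting, can be sketched as a plain Python loop. The official implementation is a custom CUDA kernel and differs in detail; this sketch assumes dict-based Gaussians with `mean`, `cov_inv`, `opacity`, and `logits` fields, and uses simple additive aggregation. The point it illustrates is the locality: each Gaussian writes only to voxels within a fixed radius of its mean, which is what keeps the operation sparse.

```python
import numpy as np

def splat_to_voxels(gaussians, grid_min, voxel_size, grid_shape, radius=2.0):
    """Toy local Gaussian-to-voxel splatting.
    Returns (occupancy grid, argmax semantic-label grid)."""
    C = len(gaussians[0]["logits"])
    occ = np.zeros(grid_shape)
    sem = np.zeros(grid_shape + (C,))
    for g in gaussians:
        # Only visit voxels inside a `radius`-metre box around the mean.
        lo = np.maximum(((g["mean"] - radius - grid_min) / voxel_size)
                        .astype(int), 0)
        hi = np.minimum(((g["mean"] + radius - grid_min) / voxel_size)
                        .astype(int) + 1, grid_shape)
        for i in range(lo[0], hi[0]):
            for j in range(lo[1], hi[1]):
                for k in range(lo[2], hi[2]):
                    center = grid_min + (np.array([i, j, k]) + 0.5) * voxel_size
                    d = center - g["mean"]
                    w = g["opacity"] * np.exp(-0.5 * d @ g["cov_inv"] @ d)
                    occ[i, j, k] += w              # additive aggregation
                    sem[i, j, k] += w * g["logits"]
    return np.clip(occ, 0.0, 1.0), sem.argmax(-1)
```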
Training and Evaluation
- GaussianFormer is evaluated on nuScenes for surround-view 3D semantic occupancy and KITTI-360 for monocular 3D semantic occupancy.
- The ECCV paper reports comparable performance to state-of-the-art methods while using only 17.8% to 24.8% of their memory consumption.
- The official repository reports GaussianFormer checkpoints on the SurroundOcc-style nuScenes setup with 144,000 Gaussians and a 25,600-Gaussian non-empty variant.
- GaussianFormer-2 is evaluated on nuScenes and KITTI-360 and is reported as state of the art with improved efficiency.
- GaussianFormer-2 ablations show that 12,800 Gaussians outperform the 144,000-Gaussian GaussianFormer baseline in the reported setup.
- Reported metrics include IoU and mIoU; comparisons must disclose dataset label source because SurroundOcc, Occ3D, and KITTI-360 settings are not interchangeable.
- Inference includes a custom Gaussian-to-voxel splatting CUDA kernel, so throughput depends on more than the image backbone.
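The IoU and mIoU metrics mentioned above reduce to simple set arithmetic on label grids. The sketch below gives the bare formulas only; the benchmark protocols (SurroundOcc, Occ3D, KITTI-360) add their own conventions such as camera-visibility masks and class lists, so numbers computed this way are not directly comparable across setups. The function name and `empty_class` argument are ours.

```python
import numpy as np

def occupancy_metrics(pred, gt, num_classes, empty_class=0):
    """IoU over occupied-vs-empty geometry, and mIoU over semantic
    classes. `pred` and `gt` are integer label grids of equal shape;
    `empty_class` marks free space."""
    occ_p, occ_g = pred != empty_class, gt != empty_class
    iou = (occ_p & occ_g).sum() / max((occ_p | occ_g).sum(), 1)
    ious = []
    for c in range(num_classes):
        if c == empty_class:
            continue
        p, g = pred == c, gt == c
        union = (p | g).sum()
        if union > 0:  # skip classes absent from both grids
            ious.append((p & g).sum() / union)
    miou = float(np.mean(ious)) if ious else 0.0
    return iou, miou
```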
Strengths
- Avoids the largest waste in dense 3D occupancy: processing empty voxels as if they were equally important.
- Adapts spatial support through Gaussian scale and rotation instead of forcing a fixed grid everywhere.
- Provides an explicit intermediate representation that can be inspected, visualized, and potentially carried through time.
- GaussianFormer-2 improves small-object and occupied-region utilization by discouraging Gaussians from covering empty space.
- The final voxel output remains compatible with standard occupancy metrics and planner-facing occupancy grids.
- The official code and checkpoints make the family practical for reproduction and adaptation.
Failure Modes
- Camera-only depth ambiguity can misplace Gaussians, especially under poor texture, glare, night lighting, or calibration drift.
- Gaussian-to-voxel splatting can hide uncertainty because the final grid looks dense even when geometry came from sparse camera evidence.
- Supervised training still depends on occupancy labels, which are scarce and expensive outside road datasets.
- Sparse Gaussian allocation can miss rare small objects if initialization or attention does not place capacity near them.
- nuScenes and KITTI-360 class sets do not cover aircraft, ground support equipment (GSE), chocks, cones in airport configurations, or jet-bridge states.
- Performance numbers from SurroundOcc labels should not be compared directly with Occ3D results without noting the label protocol.
Airside AV Fit
- Good research fit for airside semantic occupancy once camera coverage is available.
- The sparse representation is attractive for aprons because most of a large 3D volume is empty pavement or sky.
- Needs new semantic taxonomy for aircraft, wing, engine, tug, belt loader, baggage cart, fuel truck, personnel, cone, chock, FOD, pavement, and terminal structure.
- Camera-only occupancy should be fused with LiDAR or radar for clearance-critical operation near wings, engines, and aircraft stands.
- GaussianFormer-2 is the more interesting starting point because it allocates Gaussians more explicitly to non-empty regions.
- For planner use, export occupancy with confidence and unknown-space handling rather than only hard semantic labels.
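The planner-export suggestion in the last bullet can be sketched as a small post-processing step. Everything here is an assumption layered on top of GaussianFormer's outputs, not part of the method: the function name, the 0/1/2 state encoding, the `observed_mask` input (voxels seen by at least one camera), and the `tau_occ` threshold are all illustrative.

```python
import numpy as np

def export_for_planner(occ_prob, logits, observed_mask, tau_occ=0.5):
    """Export occupancy with confidence and an explicit UNKNOWN state
    instead of hard semantic labels only.
    occ_prob: (N,) per-voxel occupancy probability.
    logits:   (N, C) semantic logits.
    observed_mask: (N,) bool, True where any camera observed the voxel."""
    # Softmax over classes for calibrated-looking semantic confidence.
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    state = np.where(~observed_mask, 2,                  # 2 = unknown
             np.where(occ_prob >= tau_occ, 1, 0))        # 1 = occupied, 0 = free
    return {"state": state,
            "label": probs.argmax(-1),
            "occ_prob": occ_prob,
            "semantic_conf": probs.max(-1)}
```

Keeping the unknown state separate from "free" matters for clearance-critical apron maneuvers: unobserved space behind a wing should block motion, not be treated as drivable.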
Implementation Notes
- Start from the official GaussianFormer repository rather than reimplementing the Gaussian-to-voxel kernels.
- Reproduce the published nuScenes configuration before changing range, voxel size, class set, or Gaussian count.
- Treat Gaussian count as a deployment knob: fewer Gaussians reduce cost but can remove small-object detail.
- Calibrate camera extrinsics carefully; Gaussian means are explicit 3D quantities and will expose rig errors.
- Keep label-protocol metadata with every experiment: SurroundOcc labels, Occ3D labels, KITTI-360, or custom airside labels.
- Log Gaussian utilization metrics such as percent of centers in occupied space, distance to nearest occupied voxel, overlap, and per-class recall.
- Consider temporal wrapping with Streaming Temporal Perception or GaussianWorld-style state if frame-to-frame consistency is required.
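Two of the utilization metrics listed above (fraction of Gaussian centers in occupied space, distance to nearest occupied voxel) can be logged with a few lines of NumPy. The function name and argument layout are ours; `gt_occ` is assumed to be a boolean ground-truth occupancy grid with at least one occupied voxel, and the pairwise-distance step is brute force, fine for logging but not for very large grids.

```python
import numpy as np

def gaussian_utilization(means, gt_occ, grid_min, voxel_size):
    """Diagnostics for sparse-Gaussian allocation.
    means:  (N, 3) Gaussian centers in metres.
    gt_occ: boolean occupancy grid.
    Returns (fraction of centers in occupied voxels,
             per-center distance to nearest occupied voxel center)."""
    idx = np.floor((means - grid_min) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < gt_occ.shape), axis=1)
    in_occ = np.zeros(len(means), dtype=bool)
    in_occ[inside] = gt_occ[tuple(idx[inside].T)]
    # Brute-force nearest occupied voxel center per Gaussian.
    occ_centers = (np.argwhere(gt_occ) + 0.5) * voxel_size + grid_min
    d = np.linalg.norm(means[:, None, :] - occ_centers[None, :, :], axis=-1)
    return in_occ.mean(), d.min(axis=1)
```

Tracking these per epoch makes it easy to spot the failure mode above where capacity drifts into empty space.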
Sources
- GaussianFormer paper: https://arxiv.org/abs/2405.17429
- GaussianFormer ECCV 2024 PDF: https://www.ecva.net/papers/eccv_2024/papers_ECCV/papers/03958.pdf
- GaussianFormer-2 CVPR 2025 paper page: https://openaccess.thecvf.com/content/CVPR2025/html/Huang_GaussianFormer-2_Probabilistic_Gaussian_Superposition_for_Efficient_3D_Occupancy_Prediction_CVPR_2025_paper.html
- GaussianFormer-2 CVPR 2025 PDF: https://openaccess.thecvf.com/content/CVPR2025/papers/Huang_GaussianFormer-2_Probabilistic_Gaussian_Superposition_for_Efficient_3D_Occupancy_Prediction_CVPR_2025_paper.pdf
- Official repository: https://github.com/huang-yh/GaussianFormer