GaussianOcc

What It Is

  • GaussianOcc is an ICCV 2025 method for fully self-supervised 3D occupancy estimation with Gaussian splatting.
  • It targets surround-view camera occupancy without requiring ground-truth 3D occupancy labels for training.
  • It also removes the need for ground-truth 6D ego pose during training by learning pose through a Gaussian Splatting for Projection stage.
  • The method is important because dense 3D occupancy labels are expensive, dataset-specific, and often unavailable in new domains such as airport airside operations.
  • It is closely related to 3D Gaussian Splatting for Driving, Self-Supervised Pretraining, and Occupancy World Models.

Core Technical Idea

  • Train occupancy from 2D supervision signals and adjacent-frame consistency instead of dense 3D voxel labels.
  • Use Gaussian Splatting for Projection (GSP) to learn scale-aware 6D pose for adjacent-view projection.
  • Use Gaussian Splatting from Voxel space (GSV) to render signals from a 3D voxel representation faster than conventional volume rendering.
  • Lift image features into a 3D voxel space, then use Gaussian rendering to project predicted geometry back into 2D.
  • Supervise with 2D depth and semantic maps while reserving 3D occupancy labels mainly for validation (a loss sketch follows this list).
  • The label-scarcity argument is central: manual voxel labels are costly in road scenes and almost nonexistent for airside object taxonomies.
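A minimal sketch of this supervision style, assuming plain PyTorch; the loss names, weights, and ignore index are illustrative placeholders, not the paper's exact loss terms. The point is that depth and semantics rendered back into 2D are compared against 2D targets, with no 3D voxel labels in the loop:

```python
import torch
import torch.nn.functional as F

def rendered_2d_loss(pred_depth, pred_sem, gt_depth, gt_sem_labels,
                     depth_weight=1.0, sem_weight=1.0):
    """Hypothetical 2D supervision for one rendered view.

    pred_depth:    (B, H, W) depth rendered from the Gaussian/voxel scene
    pred_sem:      (B, C, H, W) semantic logits rendered the same way
    gt_depth:      (B, H, W) depth supervision signal
    gt_sem_labels: (B, H, W) integer class maps (e.g. generated masks)
    """
    valid = gt_depth > 0  # skip pixels without a depth signal
    depth_loss = F.l1_loss(pred_depth[valid], gt_depth[valid])
    sem_loss = F.cross_entropy(pred_sem, gt_sem_labels, ignore_index=255)
    return depth_weight * depth_loss + sem_weight * sem_loss
```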

Inputs and Outputs

  • Input at inference: surround camera images and calibrated camera intrinsics/extrinsics (illustrative tensor shapes follow this list).
  • Training input: image sequences, the known spatial extrinsics between the surround cameras, 2D semantic annotations or generated semantic maps, and depth-related supervision signals.
  • Training does not require ground-truth occupancy labels.
  • Training does not require ground-truth 6D ego pose in the method's fully self-supervised setting.
  • Output: semantic 3D occupancy prediction over a voxel volume.
  • Intermediate output: learned 6D pose estimates from Stage 1 and rendered depth or semantic views from the Gaussian renderer.
  • Validation input in the official setup can include Occ3D-nuScenes occupancy labels for evaluation only.
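To make the interface concrete, a shape sketch under assumed dimensions; camera count, image size, voxel resolution, and class count here are placeholders, not the official configuration:

```python
import torch

# Illustrative shapes only: all sizes below are assumptions,
# not the repository's configuration.
N_CAMS, H, W = 6, 352, 640            # nuScenes-style 6-camera rig
X, Y, Z, N_CLASSES = 200, 200, 16, 18

images     = torch.zeros(1, N_CAMS, 3, H, W)  # inference-time input
intrinsics = torch.zeros(1, N_CAMS, 3, 3)     # calibrated per camera
extrinsics = torch.zeros(1, N_CAMS, 4, 4)     # camera-to-ego transforms

# Output: per-voxel semantic logits over the ego-centered volume.
occupancy_logits = torch.zeros(1, N_CLASSES, X, Y, Z)
```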

Architecture or Pipeline

  • Stage 1 trains a scale-aware 6D pose network.
  • Stage 1 uses a U-Net-style network to predict Gaussian attributes in the 2D image grid for cross-view Gaussian splatting.
  • The GSP module supplies scale information for adjacent-view projection and avoids relying on ground-truth ego pose.
  • Stage 2 performs self-supervised 3D occupancy estimation.
  • The model lifts 2D features into 3D voxel space.
  • The GSV module renders from voxel space using Gaussian splatting rather than slower volume rendering.
  • Rendered depth and semantic outputs are compared with 2D signals to train the 3D occupancy representation; a structural sketch of the two-stage flow follows this list.
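A structural sketch of the pipeline, assuming PyTorch. Every submodule name here is a placeholder standing in for the real networks, Gaussian attribute parameterization, and CUDA rasterizer in the official repository:

```python
import torch.nn as nn

class TwoStageSketch(nn.Module):
    """Structural sketch of the two-stage flow described above."""

    def __init__(self, pose_net, lift_net, voxel_head, renderer):
        super().__init__()
        self.pose_net = pose_net      # Stage 1 (GSP): scale-aware 6D pose
        self.lift_net = lift_net      # lift 2D features into voxel space
        self.voxel_head = voxel_head  # voxel features -> Gaussian attributes
        self.renderer = renderer      # GSV-style splatting to 2D views

    def forward(self, images, intrinsics, extrinsics, adjacent_images):
        # Stage 1: learned pose replaces ground-truth ego pose.
        pose = self.pose_net(images, adjacent_images)

        # Stage 2: build the voxel representation from image features.
        voxel_feats = self.lift_net(images, intrinsics, extrinsics)
        gaussians = self.voxel_head(voxel_feats)

        # Splat back to 2D; outputs feed the 2D supervision losses.
        depth, semantics = self.renderer(gaussians, pose,
                                         intrinsics, extrinsics)
        return depth, semantics
```

Keeping the Stage 1 pose path and the Stage 2 occupancy path behind separate modules also matches the diagnostic advice under Implementation Notes.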

Training and Evaluation

  • The project evaluates on nuScenes and DDAD.
  • The official repository uses nuScenes v1.0, with Occ3D labels and generated ground-truth depth maps for validation, and generated 2D semantic labels for training supervision.
  • The paper reports competitive self-supervised occupancy performance with lower computational cost.
  • Reported efficiency gains are 2.7 times faster training and 5 times faster rendering relative to volume-rendering-based alternatives in the authors' setting.
  • Metrics include IoU and mIoU for 3D occupancy, plus depth-estimation comparisons in scale-aware settings (an mIoU sketch follows this list).
  • Results should be separated from fully supervised occupancy methods because the supervision budget is different.
  • The codebase uses custom differentiable Gaussian rasterization submodules, so reproducing results requires matching CUDA, PyTorch, and data-preparation details.
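For reference, the standard per-class IoU/mIoU computation over a voxel grid; this sketch is for clarity only, and the benchmark scripts define the authoritative protocol (e.g. how camera-visibility masks are applied in Occ3D evaluation):

```python
import torch

def occupancy_miou(pred, gt, n_classes, ignore_index=255):
    """Per-class IoU averaged into mIoU over a voxel grid.

    pred, gt: integer class tensors of identical shape, e.g. (X, Y, Z).
    """
    valid = gt != ignore_index
    ious = []
    for c in range(n_classes):
        p = pred[valid] == c
        g = gt[valid] == c
        union = (p | g).sum()
        if union > 0:  # skip classes absent from both pred and gt
            ious.append((p & g).sum().float() / union.float())
    return torch.stack(ious).mean() if ious else torch.tensor(float("nan"))
```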

Strengths

  • Directly attacks the main blocker for domain transfer: lack of dense 3D occupancy labels.
  • Reduces dependence on expensive pose supervision during training.
  • Gaussian rendering is faster than NeRF-style volume rendering for the self-supervised projection loss.
  • The two-stage design separates pose-scale learning from occupancy learning, making failure analysis cleaner.
  • Suitable for bootstrapping occupancy in domains where only camera logs and 2D labels or foundation-model masks are initially available.
  • The repository includes code, dataset instructions, and pretrained checkpoint references.

Failure Modes

  • Self-supervised geometry can converge to plausible but wrong 3D structure when 2D projections are ambiguous.
  • 2D semantic maps from foundation models or generated labels can inject systematic class errors into 3D occupancy.
  • Learned pose or scale drift can contaminate Stage 2 occupancy training.
  • Camera-only supervision struggles with invisible backsides, occluded ground contact, black surfaces, glass, and specular aircraft fuselage.
  • Validation on road datasets does not prove reliability for airside long-tail objects or unusual camera viewpoints.
  • The output is still a voxel occupancy grid, so final resolution and range choices can erase small FOD or thin tow bars.

Airside AV Fit

  • High research fit for airside because dense airside occupancy labels are unlikely to exist at useful scale.
  • Useful for bootstrapping camera-based occupancy from operational logs before a full annotation program exists.
  • Needs strong quality gates around generated 2D semantic labels for aircraft, GSE, personnel, cones, chocks, pavement markings, and FOD.
  • Collect synchronized LiDAR alongside the training data where possible, even if LiDAR is used only for validation or auxiliary depth checks.
  • Not sufficient alone for safety-critical clearance near aircraft; use as a semantic occupancy layer fused with LiDAR/radar and maps.
  • Most useful early deliverable is an annotation-efficient occupancy prior for planners, simulation, and anomaly review.

Implementation Notes

  • Reproduce official nuScenes or DDAD runs before changing the domain.
  • Keep Stage 1 pose-scale diagnostics separate from Stage 2 occupancy metrics.
  • Store the source and version of every 2D semantic label generator; changes in masks will change 3D occupancy behavior (a provenance sketch follows this list).
  • Use held-out LiDAR and manual airside spot labels to audit self-supervised geometry, especially at object boundaries.
  • Add airside validation slices for reflective aircraft, wet pavement, floodlights, night operations, snow or de-icing residue, and apron clutter.
  • Avoid treating lack of 3D labels as lack of validation; self-supervised methods need stronger independent audits.
  • Compare against simpler label-efficient baselines such as RenderOcc and SelfOcc before adopting GaussianOcc as the default.
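One way to make the label-generator provenance advice actionable, as a minimal sketch; all field names here are illustrative assumptions, not part of any existing tooling:

```python
import hashlib
import json
from dataclasses import dataclass

@dataclass
class SemanticLabelProvenance:
    """Record to attach to every generated 2D semantic label set.

    The goal is that any change to the mask generator stays visible
    next to the occupancy training logs.
    """
    generator_name: str      # e.g. the foundation model producing masks
    generator_version: str   # checkpoint or release tag
    taxonomy: str            # class list / prompting scheme used
    config_hash: str         # digest of the full generation config

def provenance_for(config: dict, name: str, version: str, taxonomy: str):
    digest = hashlib.sha256(
        json.dumps(config, sort_keys=True).encode()).hexdigest()
    return SemanticLabelProvenance(name, version, taxonomy, digest[:12])
```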

Sources

Research notes compiled from publicly available sources.