Open-Vocabulary Panoptic Occupancy
Last updated: 2026-05-09
Open-vocabulary panoptic occupancy combines three trends: dense 3D occupancy prediction, panoptic instance-aware scene representation, and language-aligned open-vocabulary semantics. PanoOcc, LangOcc, OpenOcc, and newer instance-centric occupancy benchmarks point toward AV perception systems that represent not only boxes and lanes, but the occupied 3D world with semantics, instances, and queryable language labels.
Related pages: occupancy flow and 4D occupancy benchmarks, vision foundation models, BEV encoding architectures, SparseOcc, Streaming Gaussian Occupancy, 4D radar-camera occupancy
What It Is
| Component | Role |
|---|---|
| Occupancy prediction | Predict whether each 3D voxel is occupied, free, or unknown, often with semantic labels. |
| Panoptic occupancy | Adds instance identity for "thing" objects while preserving "stuff" classes such as road and vegetation. |
| Open-vocabulary occupancy | Aligns voxel features with language so queries can include labels not fixed at training time. |
| PanoOcc | Camera-based 3D panoptic segmentation via a unified occupancy representation. |
| LangOcc | Self-supervised open-vocabulary occupancy estimation via volume rendering and vision-language alignment. |
| OpenOcc | PyTorch codebase supporting multiple 3D occupancy benchmarks and extensible occupancy training/evaluation. |
The combined direction is attractive for autonomy because bounding boxes capture irregular, partially visible, or unknown obstacles poorly. Occupancy gives dense geometry; panoptic identity gives object coherence; language alignment gives a path to open-world labels.
Task Definition
| Task | Input | Output | Metrics |
|---|---|---|---|
| Semantic occupancy | Multi-view images, LiDAR, or fused sensors | Voxel occupancy with semantic class | Occupied IoU, mIoU. |
| Panoptic occupancy | Multi-frame/multi-view images or LiDAR-camera features | Voxel semantics plus instance IDs | PQ, SQ, RQ, semantic mIoU. |
| Open-vocabulary occupancy | Images and language-aligned supervision/features | Voxel language features or query scores | Open-vocabulary IoU, query accuracy, text-class mIoU. |
| Dense occupancy benchmark implementation | Standard datasets such as nuScenes/Occ3D/OpenOccupancy | Train/eval pipeline outputs | Benchmark-specific IoU/mIoU and challenge metrics. |
Open-vocabulary panoptic occupancy is still a research category rather than a single standard benchmark. The practical architecture usually needs a closed-set safety layer in parallel until open-vocabulary confidence and temporal consistency are proven.
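The open-vocabulary query step in the table reduces, at inference time, to comparing language-aligned voxel features against a text embedding. A minimal sketch, assuming voxel features have already been distilled from a CLIP-style encoder (the feature source, dimensions, and threshold here are illustrative, not from any specific system):

```python
import numpy as np

def query_voxels(voxel_feats, text_emb, threshold=0.5):
    """Score every voxel against a single text query embedding.

    voxel_feats: (N, D) language-aligned voxel features (assumed to be
                 distilled from a CLIP-style image encoder).
    text_emb:    (D,) embedding of the text query.
    Returns per-voxel cosine similarities and a boolean match mask.
    """
    v = voxel_feats / (np.linalg.norm(voxel_feats, axis=1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    scores = v @ t                      # cosine similarity in [-1, 1]
    return scores, scores > threshold

# Toy example: 3 voxels with 4-dim features; only the first matches the query.
feats = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.0, 0.0, 1.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
scores, mask = query_voxels(feats, query)
```

Real systems typically batch many class prompts and take a softmax or argmax over text embeddings rather than thresholding a single query, but the cosine-similarity core is the same.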
Sensors And Labels
| System / resource | Sensors | Labels / supervision |
|---|---|---|
| PanoOcc | Multi-view, multi-frame camera images | nuScenes 3D semantic/panoptic outputs and Occ3D-style dense occupancy extension. |
| LangOcc | Camera images for self-supervised training | Volume-rendered language-aligned voxel features; avoids requiring dense 3D labels for all semantics. |
| OpenOcc | Dataset-dependent; supports nuScenes LiDAR segmentation, SurroundOcc, OpenOccupancy, and 3D occupancy challenge formats | Sparse LiDAR supervision or dense occupancy annotations depending on benchmark. |
| Occ3D / OpenOccupancy family | Typically nuScenes/Waymo-derived sensor data | Dense voxel occupancy and semantic labels. |
| CarlaOcc / ADMesh direction | Synthetic CARLA data and curated 3D assets | High-resolution instance-level panoptic occupancy ground truth. |
For airport autonomy, the missing ingredient is not only labels but geometry fidelity: aircraft wings, engine nacelles, ULD contours, tow bars, hoses, and FOD are precisely the shapes that box-centric labels simplify away.
Method Pattern
- Encode multi-view images or fused sensor inputs into BEV/voxel features.
- Lift 2D features into 3D using depth, attention, projection, or sparse voxel queries.
- Predict occupancy at voxel resolution, often with semantic logits.
- Add instance grouping or mask decoding for foreground objects to obtain panoptic occupancy.
- For open vocabulary, align voxel features with image/text embeddings or language-supervised rendering losses.
- Post-process into freespace, object instances, unknown regions, and planner-consumable occupancy.
PanoOcc's important contribution is unifying camera-based 3D segmentation and occupancy through voxel queries and coarse-to-fine spatiotemporal aggregation. LangOcc's contribution is reducing dependence on dense 3D labels by aligning a 3D occupancy field with language through self-supervised rendering. OpenOcc is useful because it makes occupancy experiments more reproducible across benchmark formats.
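The 2D-to-3D lifting step above can be sketched in its simplest form: project each voxel center into the image with the camera model and sample the 2D feature map. This is a nearest-neighbour, single-camera toy, not the attention- or depth-based lifting used by PanoOcc or LangOcc; the matrix names and conventions are assumptions for illustration:

```python
import numpy as np

def lift_features_to_voxels(feat_map, K, T_cam_from_ego, voxel_centers):
    """Assign each 3D voxel center the 2D feature it projects onto.

    feat_map:       (H, W, D) image feature map from a 2D backbone.
    K:              (3, 3) pinhole intrinsics.
    T_cam_from_ego: (4, 4) transform from ego frame to camera frame.
    voxel_centers:  (N, 3) voxel centers in the ego frame.
    Voxels behind the camera or outside the image receive zeros.
    """
    H, W, D = feat_map.shape
    pts = np.hstack([voxel_centers, np.ones((len(voxel_centers), 1))])
    cam = (T_cam_from_ego @ pts.T).T[:, :3]           # ego -> camera frame
    valid = cam[:, 2] > 1e-3                          # in front of the camera
    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-3, None)  # perspective divide
    u = np.round(uv[:, 0]).astype(int)
    v = np.round(uv[:, 1]).astype(int)
    valid &= (u >= 0) & (u < W) & (v >= 0) & (v < H)
    out = np.zeros((len(voxel_centers), D), dtype=feat_map.dtype)
    out[valid] = feat_map[v[valid], u[valid]]         # nearest-neighbour sample
    return out

# Toy example: 3x3 single-channel feature map, identity extrinsics.
feat_map = np.arange(9, dtype=float).reshape(3, 3, 1)
K = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [0.0, 0.0, 1.0]])
T = np.eye(4)
centers = np.array([[0.0, 0.0, 1.0],    # projects to pixel (1, 1)
                    [0.0, 0.0, -1.0]])  # behind the camera
voxel_feats = lift_features_to_voxels(feat_map, K, T, centers)
```

Production pipelines replace the nearest-neighbour sample with bilinear interpolation and aggregate over multiple views and frames.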
Metrics
| Metric | Interpretation |
|---|---|
| Occupied IoU | Geometry quality for occupied vs empty space. |
| Semantic mIoU | Class quality over occupied voxels. |
| PQ / SQ / RQ | Panoptic quality, mask quality, and recognition quality for voxel instances. |
| Thing/stuff split | Separates movable foreground actors from background surfaces. |
| Open-vocabulary query IoU | Measures whether a text query localizes the right 3D region. |
| Free-space false negative rate | Safety metric for occupied space incorrectly predicted free. |
| Unknown/uncertain occupancy rate | Runtime assurance metric for areas where the model should abstain. |
For safety validation, add a planner-facing metric: whether the final occupancy grid would block, slow, reroute, or allow the vehicle in the correct cases. Voxel mIoU alone does not prove safe behavior.
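The geometric metrics in the table are straightforward to compute on voxel grids. A minimal sketch, assuming boolean occupancy grids and integer semantic labels with an ignore value (conventions vary by benchmark):

```python
import numpy as np

def occupied_iou(pred_occ, gt_occ):
    """IoU of the occupied class over two boolean voxel grids."""
    inter = np.logical_and(pred_occ, gt_occ).sum()
    union = np.logical_or(pred_occ, gt_occ).sum()
    return inter / union if union else 1.0

def semantic_miou(pred_sem, gt_sem, num_classes, ignore=-1):
    """Mean IoU over semantic classes, skipping ignore labels and
    classes absent from both prediction and ground truth."""
    mask = gt_sem != ignore
    ious = []
    for c in range(num_classes):
        p, g = pred_sem[mask] == c, gt_sem[mask] == c
        union = np.logical_or(p, g).sum()
        if union:
            ious.append(np.logical_and(p, g).sum() / union)
    return float(np.mean(ious)) if ious else 0.0

# Toy example: 4 voxels, one true positive, one false positive, one miss.
pred = np.array([True, True, False, False])
gt   = np.array([True, False, True, False])
iou = occupied_iou(pred, gt)   # intersection 1, union 3
```

Benchmark implementations add details this sketch omits, such as visibility masks (evaluating only camera-visible voxels) and per-class weighting.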
Failure Modes
- Camera-only occupancy can hallucinate geometry in occluded regions or miss low-contrast objects.
- Voxel resolution can erase thin or small hazards such as cables, straps, chocks, cones, and FOD.
- Panoptic grouping can split one large object into multiple instances or merge adjacent objects in clutter.
- Open-vocabulary labels can be semantically plausible but geometrically misplaced.
- Language alignment can overfit to image texture and ignore 3D evidence.
- Dense occupancy labels derived from existing datasets may inherit annotation gaps and class-taxonomy limits.
- Synthetic occupancy benchmarks may have clean geometry that overstates real-world performance under calibration, motion blur, rolling shutter, rain, or LiDAR sparsity.
AV, Indoor, Outdoor, And Airside Relevance
| Environment | Fit | Notes |
|---|---|---|
| Public-road AV | Strong research fit | Occupancy is already central to camera-only and fused driving perception. |
| Airport apron | High potential | Dense geometry helps with aircraft clearances and irregular GSE, but public airport labels are missing. |
| Indoor robots | Strong conceptually | Occupancy and open-vocabulary querying are useful, though benchmarks differ. |
| Outdoor industrial sites | Strong | Handles irregular obstacles and open-world equipment better than boxes alone. |
| Runtime planning | Strong if calibrated | Planner needs conservative occupied/unknown/free states more than class names. |
For airside use, the best role is a conservative occupancy layer covering aircraft envelopes, wings, engines, stands, cones, personnel, dollies, and unknown occupied voxels. Open-vocabulary labels should aid operator interpretation, not override geometry-based stopping rules.
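The conservative planner-facing contract can be sketched as a tri-state post-processing step: map per-voxel occupancy probabilities to free/unknown/occupied, and treat unknown as blocking. The threshold values below are illustrative, not tuned:

```python
import numpy as np

FREE, UNKNOWN, OCCUPIED = 0, 1, 2

def tristate_grid(p_occ, occ_thresh=0.7, free_thresh=0.2):
    """Map per-voxel occupancy probabilities to {FREE, UNKNOWN, OCCUPIED}.

    Everything between the two thresholds stays UNKNOWN; the thresholds
    here are placeholders, not validated operating points.
    """
    grid = np.full(p_occ.shape, UNKNOWN, dtype=np.int8)
    grid[p_occ >= occ_thresh] = OCCUPIED
    grid[p_occ <= free_thresh] = FREE
    return grid

def blocks_path(grid):
    """Conservative rule: a path is blocked unless every voxel along it
    is confidently FREE, so UNKNOWN blocks just like OCCUPIED."""
    return bool(np.any(grid != FREE))

# Toy example: one confident-free, one ambiguous, one confident-occupied voxel.
grid = tristate_grid(np.array([0.05, 0.5, 0.9]))
```

Class names and open-vocabulary scores can annotate the OCCUPIED and UNKNOWN voxels for operators, but in this design they never flip a voxel to FREE.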
Validation And Data-Engine Use
- Validate geometry first: occupied/free/unknown errors matter more than language labels near safety envelopes.
- Slice by voxel height and size; many airside hazards are low-profile.
- Compare camera-only, LiDAR-only, radar-assisted, and fused occupancy where sensors are available.
- Treat unknown voxels as operationally meaningful, not as ignored background.
- Log text queries, prompt templates, and embedding versions for reproducible open-vocabulary evaluation.
- Use language queries to mine rare objects from logs, then convert reviewed findings into closed safety labels where needed.
- Add local acceptance scenes for aircraft pushback, ULD train crossing, belt loader under wing, cone line, chock left in path, hose/cable across stand, and FOD on wet pavement.
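The query-logging and mining steps above can be combined: embed a text query, rank logged scenes by similarity, and persist the query metadata alongside the hits so results stay reproducible. The scene features, query text, prompt template, and embedder version below are all hypothetical placeholders:

```python
import numpy as np

def mine_scenes(scene_feats, query_emb, query_meta, top_k=2):
    """Rank logged scenes against a text query and emit a review record.

    scene_feats: dict of scene_id -> pooled language-aligned feature
                 vector per logged scene (an assumed log format).
    query_emb:   embedding of the mining query text.
    query_meta:  dict recording the query string, prompt template, and
                 embedding model version, for reproducible evaluation.
    Returns the top-k (scene_id, score) pairs and a log record.
    """
    ids = list(scene_feats)
    feats = np.stack([scene_feats[i] for i in ids])
    feats = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    scores = feats @ q
    order = np.argsort(-scores)[:top_k]
    hits = [(ids[i], float(scores[i])) for i in order]
    record = {**query_meta, "hits": hits}   # persist with the query itself
    return hits, record

# Toy example with 2-dim features and hypothetical metadata values.
scenes = {"log_a": np.array([1.0, 0.0]),
          "log_b": np.array([0.0, 1.0]),
          "log_c": np.array([0.7, 0.7])}
meta = {"query": "chock left in path",   # hypothetical query text
        "template": "a photo of {}",     # hypothetical prompt template
        "embedder": "clip-vit-b32@v1"}   # hypothetical embedder version
hits, record = mine_scenes(scenes, np.array([1.0, 0.0]), meta)
```

Mined hits then go to human review, and confirmed findings feed the closed-set safety labels rather than the open-vocabulary head directly.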