Open-Vocabulary Panoptic Occupancy

Last updated: 2026-05-09

Open-vocabulary panoptic occupancy combines three trends: dense 3D occupancy prediction, panoptic instance-aware scene representation, and language-aligned open-vocabulary semantics. PanoOcc, LangOcc, OpenOcc, and newer instance-centric occupancy benchmarks point toward AV perception systems that represent not only boxes and lanes, but the occupied 3D world with semantics, instances, and queryable language labels.

Related pages: occupancy flow and 4D occupancy benchmarks, vision foundation models, BEV encoding architectures, SparseOcc, Streaming Gaussian Occupancy, 4D radar-camera occupancy


What It Is

| Component | Role |
| --- | --- |
| Occupancy prediction | Predict whether each 3D voxel is occupied, free, or unknown, often with semantic labels. |
| Panoptic occupancy | Adds instance identity for "thing" objects while preserving "stuff" classes such as road and vegetation. |
| Open-vocabulary occupancy | Aligns voxel features with language so queries can include labels not fixed at training time. |
| PanoOcc | Camera-based 3D panoptic segmentation via a unified occupancy representation. |
| LangOcc | Self-supervised open-vocabulary occupancy estimation via volume rendering and vision-language alignment. |
| OpenOcc | PyTorch codebase supporting multiple 3D occupancy benchmarks and extensible occupancy training/evaluation. |

The combined direction is attractive for autonomy because boxes are weak for irregular, partially visible, or unknown obstacles. Occupancy gives dense geometry; panoptic identity gives object coherence; language alignment gives a path to open-world labels.


Task Definition

| Task | Input | Output | Metrics |
| --- | --- | --- | --- |
| Semantic occupancy | Multi-view images, LiDAR, or fused sensors | Voxel occupancy with semantic class | Occupied IoU, mIoU |
| Panoptic occupancy | Multi-frame/multi-view images or LiDAR-camera features | Voxel semantics plus instance IDs | PQ, SQ, RQ, semantic mIoU |
| Open-vocabulary occupancy | Images and language-aligned supervision/features | Voxel language features or query scores | Open-vocabulary IoU, query accuracy, text-class mIoU |
| Dense occupancy benchmark implementation | Standard datasets such as nuScenes/Occ3D/OpenOccupancy | Train/eval pipeline outputs | Benchmark-specific IoU/mIoU and challenge metrics |

Open-vocabulary panoptic occupancy is still a research category rather than a single standard benchmark. The practical architecture usually needs a closed-set safety layer in parallel until open-vocabulary confidence and temporal consistency are proven.
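The "voxel language features or query scores" output above can be made concrete with a minimal sketch of open-vocabulary querying: score each voxel feature against a text embedding by cosine similarity and threshold the scores into a 3D hit mask. This assumes voxel features have already been aligned to a CLIP-like joint embedding space; the function name, threshold value, and toy features are illustrative, not from any of the cited systems.

```python
import numpy as np

def query_voxels(voxel_feats, text_emb, threshold=0.3):
    """Score language-aligned voxel features against a text query.

    voxel_feats: (N, D) voxel features in the joint embedding space.
    text_emb:    (D,) embedding of the text query.
    Returns per-voxel cosine scores and a boolean hit mask.
    """
    v = voxel_feats / (np.linalg.norm(voxel_feats, axis=1, keepdims=True) + 1e-8)
    t = text_emb / (np.linalg.norm(text_emb) + 1e-8)
    scores = v @ t                      # (N,) cosine similarities
    return scores, scores > threshold

# Toy example: 3 voxels in a 4-D embedding space; the query direction
# matches voxel 0 exactly and voxel 2 partially.
feats = np.array([[1.0, 0.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0, 0.0],
                  [0.5, 0.5, 0.0, 0.0]])
query = np.array([1.0, 0.0, 0.0, 0.0])
scores, mask = query_voxels(feats, query)
```

The threshold is where the closed-set safety layer matters: a low threshold recalls more open-world objects but admits geometrically misplaced labels, so the mask should inform interpretation rather than gate stopping decisions.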


Sensors And Labels

| System / resource | Sensors | Labels / supervision |
| --- | --- | --- |
| PanoOcc | Multi-view, multi-frame camera images | nuScenes 3D semantic/panoptic outputs and Occ3D-style dense occupancy extension |
| LangOcc | Camera images for self-supervised training | Volume-rendered language-aligned voxel features; avoids requiring dense 3D labels for all semantics |
| OpenOcc | Dataset-dependent; supports nuScenes LiDAR segmentation, SurroundOcc, OpenOccupancy, and 3D occupancy challenge formats | Sparse LiDAR supervision or dense occupancy annotations depending on benchmark |
| Occ3D / OpenOccupancy family | Typically nuScenes/Waymo-derived sensor data | Dense voxel occupancy and semantic labels |
| CarlaOcc / ADMesh direction | Synthetic CARLA data and curated 3D assets | High-resolution instance-level panoptic occupancy ground truth |

For airport autonomy, the missing ingredient is not only labels but geometry fidelity: aircraft wings, engine nacelles, ULD contours, tow bars, hoses, and FOD are precisely the shapes that box-centric labels simplify away.


Method Pattern

  1. Encode multi-view images or fused sensor inputs into BEV/voxel features.
  2. Lift 2D features into 3D using depth, attention, projection, or sparse voxel queries.
  3. Predict occupancy at voxel resolution, often with semantic logits.
  4. Add instance grouping or mask decoding for foreground objects to obtain panoptic occupancy.
  5. For open vocabulary, align voxel features with image/text embeddings or language-supervised rendering losses.
  6. Post-process into freespace, object instances, unknown regions, and planner-consumable occupancy.

PanoOcc's important contribution is unifying camera-based 3D segmentation and occupancy through voxel queries and coarse-to-fine spatiotemporal aggregation. LangOcc's contribution is reducing dependence on dense 3D labels by aligning a 3D occupancy field with language through self-supervised rendering. OpenOcc is useful because it makes occupancy experiments more reproducible across benchmark formats.


Metrics

| Metric | Interpretation |
| --- | --- |
| Occupied IoU | Geometry quality for occupied vs empty space. |
| Semantic mIoU | Class quality over occupied voxels. |
| PQ / SQ / RQ | Panoptic quality, mask quality, and recognition quality for voxel instances. |
| Thing/stuff split | Separates movable foreground actors from background surfaces. |
| Open-vocabulary query IoU | Measures whether a text query localizes the right 3D region. |
| Free-space false negative rate | Safety metric for occupied space incorrectly predicted free. |
| Unknown/uncertain occupancy rate | Runtime assurance metric for areas where the model should abstain. |

For safety validation, add a planner-facing metric: whether the final occupancy grid would block, slow, reroute, or allow the vehicle in the correct cases. Voxel mIoU alone does not prove safe behavior.
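The PQ / SQ / RQ row can be unpacked with a toy per-class computation over voxel instance masks, using greedy matching at the standard 0.5 IoU threshold (at which each instance can match at most one counterpart). The function and toy data are a sketch, not a reference implementation of any benchmark's evaluator:

```python
import numpy as np

def panoptic_quality(pred_masks, gt_masks, iou_thresh=0.5):
    """PQ, SQ, RQ for one class over flattened voxel instance masks.

    pred_masks, gt_masks: lists of boolean arrays, one per instance.
    A prediction matches a ground-truth instance when IoU > iou_thresh.
    """
    matched_ious, matched_pred = [], set()
    for g in gt_masks:
        for pi, p in enumerate(pred_masks):
            if pi in matched_pred:
                continue
            inter = np.logical_and(g, p).sum()
            union = np.logical_or(g, p).sum()
            iou = inter / union if union else 0.0
            if iou > iou_thresh:
                matched_ious.append(iou)
                matched_pred.add(pi)
                break
    tp = len(matched_ious)
    fp = len(pred_masks) - tp          # unmatched predictions
    fn = len(gt_masks) - tp            # unmatched ground truth
    denom = tp + 0.5 * fp + 0.5 * fn
    sq = sum(matched_ious) / tp if tp else 0.0   # mean IoU of matches
    rq = tp / denom if denom else 0.0            # detection F-score
    return sq * rq, sq, rq             # PQ = SQ * RQ

# Toy example: one perfect match, one missed ground-truth instance.
gt = [np.array([1, 1, 0, 0], bool), np.array([0, 0, 1, 1], bool)]
pred = [np.array([1, 1, 0, 0], bool)]
pq, sq, rq = panoptic_quality(pred, gt)
```

Here SQ is perfect (the one match has IoU 1.0) while RQ drops to 2/3 for the missed instance, which is exactly the decomposition that makes PQ diagnostically useful for voxel instances.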


Failure Modes

  • Camera-only occupancy can hallucinate geometry in occluded regions or miss low-contrast objects.
  • Voxel resolution can erase thin or small hazards such as cables, straps, chocks, cones, and FOD.
  • Panoptic grouping can split one large object into multiple instances or merge adjacent objects in clutter.
  • Open-vocabulary labels can be semantically plausible but geometrically misplaced.
  • Language alignment can overfit to image texture and ignore 3D evidence.
  • Dense occupancy labels derived from existing datasets may inherit annotation gaps and class-taxonomy limits.
  • Synthetic occupancy benchmarks may have clean geometry that overstates real-world performance under calibration, motion blur, rolling shutter, rain, or LiDAR sparsity.

AV, Indoor, Outdoor, And Airside Relevance

| Environment | Fit | Notes |
| --- | --- | --- |
| Public-road AV | Strong research fit | Occupancy is already central to camera-only and fused driving perception. |
| Airport apron | High potential | Dense geometry helps with aircraft clearances and irregular GSE, but public airport labels are missing. |
| Indoor robots | Strong conceptually | Occupancy and open-vocabulary querying are useful, though benchmarks differ. |
| Outdoor industrial sites | Strong | Handles irregular obstacles and open-world equipment better than boxes alone. |
| Runtime planning | Strong if calibrated | Planner needs conservative occupied/unknown/free states more than class names. |

For airside use, the best role is a conservative occupancy layer: aircraft envelopes, wings, engines, stands, cones, personnel, dollies, and unknown occupied voxels. Open-vocabulary labels should aid operator interpretation, not override geometry-based stopping rules.


Validation And Data-Engine Use

  1. Validate geometry first: occupied/free/unknown errors matter more than language labels near safety envelopes.
  2. Slice by voxel height and size; many airside hazards are low-profile.
  3. Compare camera-only, LiDAR-only, radar-assisted, and fused occupancy where sensors are available.
  4. Treat unknown voxels as operationally meaningful, not as ignored background.
  5. Log text queries, prompt templates, and embedding versions for reproducible open-vocabulary evaluation.
  6. Use language queries to mine rare objects from logs, then convert reviewed findings into closed safety labels where needed.
  7. Add local acceptance scenes for aircraft pushback, ULD train crossing, belt loader under wing, cone line, chock left in path, hose/cable across stand, and FOD on wet pavement.
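Steps 1 and 2 above combine naturally into one sliced safety metric: the free-space false-negative rate per height band, which surfaces the low-profile hazards (chocks, hoses, FOD) that an aggregate rate averages away. A minimal sketch over flattened voxel arrays; the function name and toy data are illustrative:

```python
import numpy as np

def freespace_fn_rate_by_height(pred_free, gt_occupied, voxel_z):
    """Fraction of truly occupied voxels predicted free, per z slice.

    pred_free:   (N,) bool, model predicts the voxel is free.
    gt_occupied: (N,) bool, ground truth marks the voxel occupied.
    voxel_z:     (N,) int height index of each voxel.
    """
    rates = {}
    for z in np.unique(voxel_z):
        occ = (voxel_z == z) & gt_occupied
        if occ.any():
            rates[int(z)] = float((pred_free & occ).sum() / occ.sum())
    return rates

# Toy example: the single occupied voxel in the lowest slice (z=0)
# is wrongly predicted free; the higher slice is handled correctly.
pred_free = np.array([True, False, False])
gt_occ    = np.array([True, True, False])
z         = np.array([0, 1, 1])
rates = freespace_fn_rate_by_height(pred_free, gt_occ, z)  # {0: 1.0, 1: 0.0}
```

A high rate concentrated in the lowest slices is precisely the failure signature to gate on before trusting the occupancy layer near safety envelopes.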

Sources

Compiled from publicly available research papers, code repositories, and benchmark documentation.