OpenVox

What It Is

  • OpenVox is an instance-level open-vocabulary probabilistic voxel representation for robotics.
  • The paper title is "OpenVox: Real-time Instance-level Open-vocabulary Probabilistic Voxel Representation".
  • It was accepted to IROS 2025.
  • The method builds a live 3D map whose voxels carry instance and open-vocabulary semantic information.
  • It is a mapping and representation method, not just a frame-by-frame segmenter.
  • The authors emphasize real-time incremental operation.
  • The official project page and GitHub repository provide code and examples.

Core Technical Idea

  • Use a front-end instance segmentation and understanding pipeline on RGB frames.
  • Attach open-vocabulary semantic descriptions to detected 2D instances through caption encoding.
  • Project 2D instance masks into a 3D voxel map using camera pose and depth geometry.
  • Represent each voxel probabilistically over instance membership rather than assigning a single hard label immediately.
  • Split incremental map fusion into instance association and live map evolution.
  • Solve association with maximum likelihood estimation.
  • Solve voxel map updating with maximum a posteriori estimation.
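
The projection step above can be sketched with a pinhole camera model. This is a minimal illustration, not the released OpenVox code: the function name `mask_to_voxels`, the intrinsics matrix `K`, the camera-to-world transform `T_world_cam`, and the metric depth image are all assumptions made for the example.

```python
import numpy as np

def mask_to_voxels(depth, mask, K, T_world_cam, voxel_size):
    """Back-project masked depth pixels and return unique voxel indices (N, 3).

    Assumptions (hypothetical, for illustration): depth is in meters,
    K is a 3x3 pinhole intrinsics matrix, T_world_cam is a 4x4
    camera-to-world transform.
    """
    # Pixel rows (v) and columns (u) inside the mask with valid depth.
    v, u = np.nonzero(mask.astype(bool) & (depth > 0))
    if v.size == 0:
        return np.empty((0, 3), dtype=np.int64)
    z = depth[v, u].astype(float)
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    # Pinhole back-projection into the camera frame.
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    pts_cam = np.stack([x, y, z, np.ones_like(z)], axis=1)  # (N, 4) homogeneous
    pts_world = (T_world_cam @ pts_cam.T).T[:, :3]
    # Quantize world points to integer voxel coordinates and deduplicate.
    idx = np.floor(pts_world / voxel_size).astype(np.int64)
    return np.unique(idx, axis=0)

# Toy usage: one masked pixel at the principal point, 1 m away.
depth = np.ones((4, 4))
mask = np.zeros((4, 4), dtype=bool)
mask[2, 2] = True
K = np.array([[2.0, 0.0, 2.0], [0.0, 2.0, 2.0], [0.0, 0.0, 1.0]])
voxels = mask_to_voxels(depth, mask, K, np.eye(4), voxel_size=0.5)
```

Each 2D instance mask then contributes a set of voxel indices, which is the input the association and update steps would work on.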

Inputs and Outputs

  • Inputs are RGB frames, depth or RGB-D geometry, and camera poses.
  • The released code path supports Replica and ScanNet-style datasets.
  • The front end consumes color images and produces 2D instance masks with semantic annotations.
  • The back end projects observations into a 3D voxel representation.
  • Outputs include an instance-level voxel map and open-vocabulary retrieval over mapped objects.
  • Visualization tools can color voxels by RGB, instance ID, or text-query similarity.
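
Text-query retrieval over mapped instances can be illustrated as a cosine-similarity ranking. The sketch below assumes each map instance already has an embedding vector and that the query has been embedded in the same space; the toy 3-D vectors stand in for real sentence embeddings, and `rank_instances` is a hypothetical helper, not part of the OpenVox API.

```python
import numpy as np

def rank_instances(query_vec, instance_vecs):
    """Rank instance IDs by cosine similarity to the query embedding.

    instance_vecs: dict mapping instance ID -> embedding (hypothetical layout).
    Returns a list of (instance_id, score), best match first.
    """
    ids = list(instance_vecs)
    mat = np.stack([np.asarray(instance_vecs[i], dtype=float) for i in ids])
    mat = mat / np.linalg.norm(mat, axis=1, keepdims=True)  # unit-normalize rows
    q = np.asarray(query_vec, dtype=float)
    q = q / np.linalg.norm(q)
    scores = mat @ q                      # cosine similarity per instance
    order = np.argsort(-scores)           # descending
    return [(ids[i], float(scores[i])) for i in order]

# Toy embeddings in place of real sentence-embedding vectors.
instances = {3: [1.0, 0.0, 0.0], 7: [0.0, 1.0, 0.0], 9: [0.9, 0.1, 0.0]}
ranking = rank_instances([1.0, 0.0, 0.0], instances)
```

The per-instance scores could also drive the similarity coloring mentioned above by painting every voxel with the score of the instance it belongs to.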

Architecture and Evaluation Protocol

  • The front end is an Instance Segmentation and Understanding module.
  • The official README uses YOLO-World for open-vocabulary instance detection.
  • It also uses TAP and sentence-transformer embeddings for the instance-understanding and text-encoding components.
  • The back end maintains probabilistic instance voxels.
  • Instance association links newly observed 2D masks to existing map instances.
  • Live map evolution updates voxel probabilities as new frames arrive.
  • Evaluation covers zero-shot instance segmentation, semantic segmentation, and open-vocabulary retrieval.
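
The split between instance association and live map evolution can be illustrated with a toy count-based model. This is a simplified stand-in and not the paper's formulation: each voxel keeps smoothed counts per instance, association picks the hypothesis (an existing instance or a new one) with the highest likelihood for the observed voxels, and the voxel label is the most probable instance under the accumulated counts. The class `ProbVoxelMap` and its `prior` pseudo-count are hypothetical.

```python
import math
from collections import defaultdict

class ProbVoxelMap:
    """Toy probabilistic instance voxel map (illustrative, not OpenVox code)."""

    def __init__(self, prior=1.0):
        self.prior = prior                                    # smoothing pseudo-count (assumption)
        self.counts = defaultdict(lambda: defaultdict(float)) # voxel -> instance -> count
        self.next_id = 0

    def _prob(self, voxel, inst):
        # Smoothed probability that `voxel` belongs to `inst`; one extra slot
        # is reserved for a not-yet-created instance.
        c = self.counts.get(voxel, {})
        n_slots = self.next_id + 1
        total = sum(c.values()) + self.prior * n_slots
        return (c.get(inst, 0.0) + self.prior) / total

    def associate(self, voxels):
        """Pick the maximum-likelihood instance for the observed voxels."""
        new_id = self.next_id
        best = new_id
        best_ll = sum(math.log(self._prob(v, new_id)) for v in voxels)
        for k in range(self.next_id):
            ll = sum(math.log(self._prob(v, k)) for v in voxels)
            if ll > best_ll:              # ties fall back to a new instance
                best, best_ll = k, ll
        if best == new_id:
            self.next_id += 1
        return best

    def update(self, voxels, inst, weight=1.0):
        """Fold one observation into the per-voxel counts."""
        for v in voxels:
            self.counts[v][inst] += weight

    def label(self, voxel):
        """Most probable instance for a voxel, or None if never observed."""
        c = self.counts.get(voxel)
        return max(c, key=c.get) if c else None

m = ProbVoxelMap()
a = m.associate([(0, 0, 0), (0, 0, 1)]); m.update([(0, 0, 0), (0, 0, 1)], a)
b = m.associate([(0, 0, 0), (0, 0, 1), (0, 0, 2)]); m.update([(0, 0, 0), (0, 0, 1), (0, 0, 2)], b)
c = m.associate([(5, 5, 5), (5, 5, 6)]); m.update([(5, 5, 5), (5, 5, 6)], c)
m.update([(0, 0, 0)], c)   # one wrong observation; the earlier evidence still wins
```

In this sketch the overlapping second observation is merged into the first instance, the distant third observation creates a new one, and a single mislabeled update does not flip a voxel that has been seen consistently.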

Training and Evaluation

  • OpenVox is primarily an online mapping framework built on pretrained components.
  • The project page reports evaluations across multiple datasets and real-world robotics experiments.
  • The GitHub README says validation has been completed on Replica and ScanNet.
  • Experiments compare against open-vocabulary mapping baselines such as ConceptGraphs and Open-Fusion-style systems.
  • The released repository includes environment setup, dataset preparation, main execution scripts, and visualization.
  • Real-time behavior depends on the chosen detector, captioning model, embedding model, voxel resolution, and GPU.

Strengths

  • Maintains instance-level semantics instead of only point-wise CLIP features.
  • Probabilistic voxel updates help absorb sensor noise and segmentation noise over time.
  • Incremental mapping is a better fit for robots than offline scene reconstruction alone.
  • Open-vocabulary retrieval makes the map queryable by text after it is built.
  • The method has a concrete released codebase with dataset instructions.
  • It explicitly addresses stable online operation, which many 3D open-vocabulary methods leave to future work.

Failure Modes

  • It depends on accurate camera pose, depth, and calibration.
  • 2D segmentation or captioning errors can be fused into the map and persist.
  • Indoor RGB-D validation does not directly prove outdoor or long-range LiDAR performance.
  • Dynamic objects can corrupt instance voxels if association assumes persistence.
  • Text-query retrieval quality depends on caption encoding and language embedding choices.
  • Memory and update latency can grow with voxel resolution and scene scale.
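
The memory growth noted above is easy to quantify for a dense grid. The sketch assumes a dense axis-aligned grid and a hypothetical fixed payload per voxel; a sparse or hashed layout, which a real system may use, would need less memory, but the cubic scaling with resolution still applies to occupied space.

```python
import math

def dense_voxel_bytes(extent_m, voxel_size_m, bytes_per_voxel):
    """Bytes for a dense grid covering extent_m = (x, y, z) in meters."""
    dims = [math.ceil(e / voxel_size_m) for e in extent_m]
    return math.prod(dims) * bytes_per_voxel

# Hypothetical numbers: an 8 m x 8 m x 4 m room, 16 bytes stored per voxel.
coarse = dense_voxel_bytes((8, 8, 4), 0.125, 16)
fine = dense_voxel_bytes((8, 8, 4), 0.0625, 16)
ratio = fine / coarse   # halving the voxel size multiplies memory by 8
```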

Airside AV Fit

  • OpenVox is relevant for semantic mapping of fixed stands, service corridors, baggage areas, and maintenance spaces.
  • Instance-level voxels could support queries such as "find cones near stand boundary" or "locate carts by gate".
  • The online probabilistic update model is useful for repeatedly observed static or semi-static apron equipment.
  • Direct ramp-vehicle runtime use is less certain because OpenVox is RGB-D and indoor-robotics oriented.
  • Airside adaptation would need LiDAR/camera fusion, moving-object handling, and large outdoor map scaling.
  • It is strongest as a semantic map enrichment layer, not as the primary emergency-stop obstacle detector.

Implementation Notes

  • Start with offline mapping runs before attempting onboard real-time use.
  • Use a SLAM or localization source with explicit pose covariance and reject frames with poor pose quality.
  • Keep map update logs so incorrect instance associations can be inspected and repaired.
  • Add dynamic-object filtering for vehicles, people, aircraft, and mobile ground support equipment.
  • Validate text retrieval with airport-specific synonyms and abbreviations.
  • Choose voxel resolution based on the smallest object that must be mapped, such as chocks or cones.
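
The pose-quality and dynamic-object notes above can be combined into a simple gate that runs before fusion. This is a sketch under stated assumptions: the 6x6 covariance ordering (translation first, then rotation), the thresholds, and the class list are all hypothetical and would need tuning for a real localization stack and site.

```python
import numpy as np

# Hypothetical class names treated as dynamic and kept out of the static map.
DYNAMIC_CLASSES = {"person", "vehicle", "aircraft", "baggage cart", "tug"}

def pose_is_usable(cov6, max_trans_std_m=0.05, max_rot_std_rad=0.02):
    """Accept a frame only if the worst-axis pose uncertainty is small.

    Assumes cov6 is ordered [x, y, z, roll, pitch, yaw] (an assumption;
    check the convention of the pose source in use).
    """
    diag = np.diag(np.asarray(cov6, dtype=float))
    trans_std = np.sqrt(np.max(diag[:3]))
    rot_std = np.sqrt(np.max(diag[3:]))
    return bool(trans_std <= max_trans_std_m and rot_std <= max_rot_std_rad)

def keep_detection(label):
    """Drop detections whose class is in the dynamic list."""
    return label.strip().lower() not in DYNAMIC_CLASSES

good_cov = np.diag([1e-4, 1e-4, 1e-4, 1e-5, 1e-5, 1e-5])   # 1 cm, ~0.003 rad
bad_cov = np.diag([4e-2, 1e-4, 1e-4, 1e-5, 1e-5, 1e-5])    # 20 cm on x
frames_ok = [pose_is_usable(good_cov), pose_is_usable(bad_cov)]
```

Logging which frames and detections the gate rejects fits the note on keeping map update logs, since it makes later inspection of bad associations easier.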

Sources

Public research notes collected from public sources.