OpenVox
What It Is
- OpenVox is an instance-level open-vocabulary probabilistic voxel representation for robotics.
- The paper title is "OpenVox: Real-time Instance-level Open-vocabulary Probabilistic Voxel Representation".
- It was accepted to IROS 2025.
- The method builds a live 3D map whose voxels carry instance and open-vocabulary semantic information.
- It is a mapping and representation method, not just a frame-by-frame segmenter.
- The authors emphasize real-time incremental operation.
- The official project page and GitHub repository provide code and examples.
Core Technical Idea
- Use a front-end instance segmentation and understanding pipeline on RGB frames.
- Attach open-vocabulary semantic descriptions to detected 2D instances through caption encoding.
- Project 2D instance masks into a 3D voxel map using camera pose and depth geometry.
- Represent each voxel probabilistically over instance membership rather than assigning a single hard label immediately.
- Split incremental map fusion into instance association and live map evolution.
- Solve association with maximum likelihood estimation.
- Solve voxel map updating with maximum a posteriori estimation.
Inputs and Outputs
- Inputs are RGB frames, depth or RGB-D geometry, and camera poses.
- The released code path supports Replica and ScanNet-style datasets.
- The front end consumes color images and produces 2D instance masks with semantic annotations.
- The back end projects observations into a 3D voxel representation.
- Outputs include an instance-level voxel map and open-vocabulary retrieval over mapped objects.
- Visualization tools can color voxels by RGB, instance ID, or text-query similarity.
Architecture or Evaluation Protocol
- The front end is an Instance Segmentation and Understanding module.
- The official README uses YOLO-World for open-vocabulary instance detection.
- It also uses TAP and sentence-transformer embeddings for text and instance understanding components.
- The back end maintains probabilistic instance voxels.
- Instance association links newly observed 2D masks to existing map instances.
- Live map evolution updates voxel probabilities as new frames arrive.
- Evaluation covers zero-shot instance segmentation, semantic segmentation, and open-vocabulary retrieval.
Training and Evaluation
- OpenVox is primarily an online mapping framework built on pretrained components.
- The project page reports evaluations across multiple datasets and real-world robotics experiments.
- The GitHub README says validation has been completed on Replica and ScanNet.
- Experiments compare against open-vocabulary mapping baselines such as ConceptGraphs and Open-Fusion-style systems.
- The released repository includes environment setup, dataset preparation, main execution scripts, and visualization.
- Real-time behavior depends on the chosen detector, captioning, embedding, voxel resolution, and GPU.
Strengths
- Maintains instance-level semantics instead of only point-wise CLIP features.
- Probabilistic voxel updates help absorb sensor noise and segmentation noise over time.
- Incremental mapping is a better fit for robots than offline scene reconstruction alone.
- Open-vocabulary retrieval makes the map queryable by text after it is built.
- The method has a concrete released codebase with dataset instructions.
- It explicitly addresses stable online operation, which many 3D open-vocabulary methods leave to future work.
Failure Modes
- It depends on accurate camera pose, depth, and calibration.
- 2D segmentation or captioning errors can be fused into the map and persist.
- Indoor RGB-D validation does not directly prove outdoor or long-range LiDAR performance.
- Dynamic objects can corrupt instance voxels if association assumes persistence.
- Text-query retrieval quality depends on caption encoding and language embedding choices.
- Memory and update latency can grow with voxel resolution and scene scale.
Airside AV Fit
- OpenVox is relevant for semantic mapping of fixed stands, service corridors, baggage areas, and maintenance spaces.
- Instance-level voxels could support queries such as "find cones near stand boundary" or "locate carts by gate".
- The online probabilistic update model is useful for repeatedly observed static or semi-static apron equipment.
- Direct ramp-vehicle runtime use is less certain because OpenVox is RGB-D and indoor-robotics oriented.
- Airside adaptation would need LiDAR/camera fusion, moving-object handling, and large outdoor map scaling.
- It is strongest as a semantic map enrichment layer, not as the primary emergency-stop obstacle detector.
Implementation Notes
- Start with offline mapping runs before attempting onboard real-time use.
- Use a SLAM or localization source with explicit pose covariance and reject frames with poor pose quality.
- Keep map update logs so incorrect instance associations can be inspected and repaired.
- Add dynamic-object filtering for vehicles, people, aircraft, and mobile ground support equipment.
- Validate text retrieval with airport-specific synonyms and abbreviations.
- Choose voxel resolution based on the smallest object that must be mapped, such as chocks or cones.
Sources
- Official project page: https://open-vox.github.io/
- arXiv paper: https://arxiv.org/abs/2502.16528
- Official GitHub repository: https://github.com/BIT-DYN/OpenVox
- ar5iv rendered paper: https://ar5iv.org/abs/2502.16528