OpenVox

What It Is

OpenVox is an instance-level open-vocabulary probabilistic voxel representation for robotics.
The paper title is "OpenVox: Real-time Instance-level Open-vocabulary Probabilistic Voxel Representation".
It was accepted to IROS 2025.
The method builds a live 3D map whose voxels carry instance and open-vocabulary semantic information.
It is a mapping and representation method, not just a frame-by-frame segmenter.
The authors emphasize real-time incremental operation.
The official project page and GitHub repository provide code and examples.

The method runs a front-end instance segmentation and understanding pipeline on RGB frames.
It attaches open-vocabulary semantic descriptions to detected 2D instances through caption encoding.
It projects 2D instance masks into a 3D voxel map using camera pose and depth geometry.
Each voxel is represented probabilistically over instance membership rather than being assigned a single hard label immediately.
Incremental map fusion is split into two subproblems: instance association and live map evolution.
Association is solved with maximum likelihood estimation.
Voxel map updating is solved with maximum a posteriori estimation.
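
The projection step can be illustrated with a minimal sketch, assuming a pinhole camera model, a metric depth image, and a 4x4 camera-to-world pose. The function name, argument layout, and hit-count output are hypothetical and are not taken from the OpenVox codebase.

```python
import math
from typing import Dict, Sequence, Tuple

Voxel = Tuple[int, int, int]


def backproject_mask_to_voxels(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[bool]],
    intrinsics: Tuple[float, float, float, float],
    cam_to_world: Sequence[Sequence[float]],
    voxel_size: float,
) -> Dict[Voxel, int]:
    """Count how many masked depth pixels land in each world-frame voxel.

    depth is in metres, intrinsics is (fx, fy, cx, cy), and cam_to_world is a
    4x4 row-major homogeneous transform (all assumed conventions).
    """
    fx, fy, cx, cy = intrinsics
    hits: Dict[Voxel, int] = {}
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, inside) in enumerate(zip(depth_row, mask_row)):
            if not inside or z <= 0.0:
                continue  # skip pixels outside the mask or with invalid depth
            # Pinhole back-projection into the camera frame.
            point_cam = ((u - cx) * z / fx, (v - cy) * z / fy, z, 1.0)
            # Rigid transform into the world frame.
            world = [
                sum(cam_to_world[r][c] * point_cam[c] for c in range(4))
                for r in range(3)
            ]
            # Quantise the world point to an integer voxel index.
            key = tuple(math.floor(coord / voxel_size) for coord in world)
            hits[key] = hits.get(key, 0) + 1
    return hits
```

The per-voxel hit counts are one possible form of evidence for a later probabilistic update; a vectorised implementation would normally replace the Python loops.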

Inputs are RGB frames, depth or RGB-D geometry, and camera poses.
The released code path supports Replica and ScanNet-style datasets.
The front end consumes color images and produces 2D instance masks with semantic annotations.
The back end projects observations into a 3D voxel representation.
Outputs include an instance-level voxel map and open-vocabulary retrieval over mapped objects.
Visualization tools can color voxels by RGB, instance ID, or text-query similarity.
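
Coloring by text-query similarity can be sketched as a cosine-similarity ranking over per-instance embeddings. This is a minimal illustration that assumes each mapped instance already has an embedding vector and that the query has been embedded by the same language model; the function names and the blue-to-red color ramp are hypothetical, not the repository's visualization code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity clamped to [-1, 1]; returns 0.0 for a zero vector."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    if norm == 0.0:
        return 0.0
    return max(-1.0, min(1.0, dot / norm))


def rank_instances(
    query_embedding: Sequence[float],
    instance_embeddings: Dict[int, Sequence[float]],
) -> List[Tuple[int, float]]:
    """Return (instance_id, similarity) pairs, most similar first."""
    scored = [
        (instance_id, cosine_similarity(query_embedding, embedding))
        for instance_id, embedding in instance_embeddings.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def similarity_to_rgb(similarity: float) -> Tuple[int, int, int]:
    """Map similarity in [-1, 1] to a blue (low) to red (high) color."""
    t = (similarity + 1.0) / 2.0
    return (round(255 * t), 0, round(255 * (1.0 - t)))
```

Every voxel of an instance would then be drawn with the color of that instance's similarity score.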

The front end is an Instance Segmentation and Understanding module.
The official README uses YOLO-World for open-vocabulary instance detection.
It also uses TAP and sentence-transformer embeddings for text and instance understanding components.
The back end maintains probabilistic instance voxels.
Instance association links newly observed 2D masks to existing map instances.
Live map evolution updates voxel probabilities as new frames arrive.
Evaluation covers zero-shot instance segmentation, semantic segmentation, and open-vocabulary retrieval.
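
The interplay between instance association and live map evolution can be illustrated with a simplified stand-in. The sketch below keeps per-voxel vote counts, picks the association hypothesis with the highest mean log-likelihood, and reads out the most probable instance per voxel. The class, its parameters (`unseen_prob`, `new_instance_prob`, `eps`), and the likelihood model are assumptions made for illustration; the paper's actual MLE and MAP formulations are not reproduced here.

```python
import math
from collections import defaultdict
from typing import Dict, Iterable, Optional, Tuple

Voxel = Tuple[int, int, int]


class InstanceVoxelMap:
    """Toy probabilistic instance map: voxel -> {instance_id: accumulated weight}."""

    def __init__(self, unseen_prob: float = 0.5,
                 new_instance_prob: float = 0.4, eps: float = 1e-3) -> None:
        self.counts: Dict[Voxel, Dict[int, float]] = defaultdict(dict)
        self.unseen_prob = unseen_prob              # likelihood for never-observed voxels
        self.new_instance_prob = new_instance_prob  # per-voxel likelihood of "new instance"
        self.eps = eps                              # floor that keeps log() finite
        self.next_id = 0

    def instance_prob(self, voxel: Voxel, instance_id: int) -> float:
        """Normalised vote share of instance_id in this voxel."""
        votes = self.counts.get(voxel)
        if not votes:
            return self.unseen_prob
        return votes.get(instance_id, 0.0) / sum(votes.values())

    def associate(self, observed: Iterable[Voxel]) -> int:
        """Pick the hypothesis (existing ID or a new one) with the highest likelihood."""
        voxels = list(observed)
        candidates = {k for v in voxels for k in self.counts.get(v, {})}
        best_id: Optional[int] = None
        best_score = math.log(self.new_instance_prob)
        for k in sorted(candidates):
            score = sum(
                math.log(max(self.instance_prob(v, k), self.eps)) for v in voxels
            ) / len(voxels)
            if score > best_score:
                best_id, best_score = k, score
        if best_id is None:  # the "new instance" hypothesis won
            best_id = self.next_id
            self.next_id += 1
        return best_id

    def update(self, observed: Iterable[Voxel], instance_id: int,
               weight: float = 1.0) -> None:
        """Accumulate evidence that the observed voxels belong to instance_id."""
        for v in observed:
            votes = self.counts[v]
            votes[instance_id] = votes.get(instance_id, 0.0) + weight

    def label(self, voxel: Voxel) -> Optional[int]:
        """Most probable instance for a voxel, or None if it was never observed."""
        votes = self.counts.get(voxel)
        if not votes:
            return None
        return max(votes, key=votes.get)
```

Because labels are read from accumulated evidence rather than from the latest frame, a single conflicting observation does not flip a voxel that has been seen consistently several times.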

OpenVox is primarily an online mapping framework built on pretrained components.
The project page reports evaluations across multiple datasets and real-world robotics experiments.
The GitHub README says validation has been completed on Replica and ScanNet.
Experiments compare against open-vocabulary mapping baselines such as ConceptGraphs and Open-Fusion-style systems.
The released repository includes environment setup, dataset preparation, main execution scripts, and visualization.
Real-time behavior depends on the chosen detector, the captioning and embedding models, the voxel resolution, and the GPU.

It maintains instance-level semantics instead of only point-wise CLIP features.
Probabilistic voxel updates help absorb sensor noise and segmentation noise over time.
Incremental mapping is a better fit for robots than offline scene reconstruction alone.
Open-vocabulary retrieval makes the map queryable by text after it is built.
The method has a concrete released codebase with dataset instructions.
It explicitly addresses stable online operation, which many 3D open-vocabulary methods leave to future work.

It depends on accurate camera pose, depth, and calibration.
2D segmentation or captioning errors can be fused into the map and persist.
Indoor RGB-D validation does not directly prove outdoor or long-range LiDAR performance.
Dynamic objects can corrupt instance voxels if association assumes persistence.
Text-query retrieval quality depends on caption encoding and language embedding choices.
Memory and update latency can grow with voxel resolution and scene scale.
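
The scaling with voxel resolution and scene size can be made concrete with a rough upper bound for a dense grid. This is a back-of-the-envelope sketch: the function names and the 32 bytes per voxel are assumed values for illustration, and a sparse or hashed voxel structure would store far fewer voxels than this bound.

```python
import math
from typing import Sequence


def dense_voxel_count(extent_m: Sequence[float], voxel_size_m: float) -> int:
    """Number of voxels in a dense grid covering a box of the given extent."""
    if voxel_size_m <= 0.0:
        raise ValueError("voxel_size_m must be positive")
    count = 1
    for length in extent_m:
        # Small tolerance guards against floating-point round-up in the division.
        count *= max(1, math.ceil(length / voxel_size_m - 1e-9))
    return count


def dense_memory_mib(extent_m: Sequence[float], voxel_size_m: float,
                     bytes_per_voxel: int = 32) -> float:
    """Dense-grid memory in MiB; bytes_per_voxel is an assumed payload size."""
    return dense_voxel_count(extent_m, voxel_size_m) * bytes_per_voxel / (1024 ** 2)
```

Halving the voxel size multiplies the dense voxel count by eight, which is why resolution tends to dominate the memory budget before scene extent does.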

In an airport airside setting, OpenVox is relevant for semantic mapping of fixed stands, service corridors, baggage areas, and maintenance spaces.
Instance-level voxels could support queries such as "find cones near stand boundary" or "locate carts by gate".
The online probabilistic update model is useful for repeatedly observed static or semi-static apron equipment.
Direct ramp-vehicle runtime use is less certain because OpenVox is RGB-D and indoor-robotics oriented.
Airside adaptation would need LiDAR/camera fusion, moving-object handling, and large outdoor map scaling.
It is strongest as a semantic map enrichment layer, not as the primary emergency-stop obstacle detector.

Start with offline mapping runs before attempting onboard real-time use.
Use a SLAM or localization source with explicit pose covariance and reject frames with poor pose quality.
Keep map update logs so incorrect instance associations can be inspected and repaired.
Add dynamic-object filtering for vehicles, people, aircraft, and mobile ground support equipment.
Validate text retrieval with airport-specific synonyms and abbreviations.
Choose voxel resolution based on the smallest object that must be mapped, such as chocks or cones.
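
Two of the recommendations above, pose-quality gating and dynamic-object filtering, can be combined into a single pre-fusion check. The sketch below assumes a 6x6 pose covariance ordered as (x, y, z, roll, pitch, yaw); the thresholds, the label set, and the `Detection` type are hypothetical placeholders that would need tuning for a real deployment.

```python
import math
from dataclasses import dataclass
from typing import FrozenSet, List, Sequence

# Assumed label set for objects that should not be fused into a static map.
DYNAMIC_LABELS: FrozenSet[str] = frozenset(
    {"vehicle", "person", "aircraft", "ground support equipment"}
)


@dataclass
class Detection:
    label: str
    mask_id: int


def pose_is_reliable(covariance: Sequence[Sequence[float]],
                     max_pos_std_m: float = 0.05,
                     max_rot_std_rad: float = 0.02) -> bool:
    """Accept a pose only if its worst-axis standard deviations are small."""
    pos_std = math.sqrt(max(covariance[i][i] for i in range(3)))
    rot_std = math.sqrt(max(covariance[i][i] for i in range(3, 6)))
    return pos_std <= max_pos_std_m and rot_std <= max_rot_std_rad


def detections_to_fuse(detections: Sequence[Detection],
                       covariance: Sequence[Sequence[float]],
                       dynamic_labels: FrozenSet[str] = DYNAMIC_LABELS) -> List[Detection]:
    """Drop the whole frame on poor pose; otherwise drop dynamic-class detections."""
    if not pose_is_reliable(covariance):
        return []
    return [d for d in detections if d.label.lower() not in dynamic_labels]
```

Logging which frames and detections were rejected, alongside the map update logs, makes it easier to trace an incorrect association back to its cause.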