OpenVox: Open-Vocabulary 3D Mapping
- OpenVox is a real-time, open-vocabulary 3D mapping framework that integrates instance segmentation and natural language captioning with probabilistic volumetric fusion.
- It employs a two-stage Bayesian update combining geometric and caption feature similarities to robustly associate instances and evolve semantic voxel maps.
- The framework achieves state-of-the-art performance on benchmarks like Replica and ScanNet and operates efficiently in real-world robotic deployments.
OpenVox is a real-time, incremental open-vocabulary probabilistic instance-level voxel mapping framework designed for 3D environmental reconstruction with advanced semantic understanding. It unifies vision-LLM-driven open-vocabulary segmentation with fully probabilistic volumetric fusion to yield robust, zero-shot instance- and semantic-level 3D scene maps. The architecture enables fine-grained and function-based semantic reasoning, efficient instance association, and resilience to segmentation and sensor noise across streaming observations (Deng et al., 23 Feb 2025).
1. System Architecture and Front-End Processing
OpenVox processes a live stream of RGB–D data (color image, depth, and camera pose). The front end focuses on instance-level semantic segmentation and natural language captioning through a multi-stage pipeline:
- Open-vocabulary detection: Employs YOLO-World to extract 2D bounding boxes from the RGB input.
- Mask and caption generation: TAP then generates a per-instance mask and a concise textual caption for each detected box.
- Caption encoding: SBERT encodes each caption into a dense 384-dimensional feature vector.
This yields a set of mask–caption–feature triplets for each incoming frame. The design allows free-form, open-vocabulary descriptions, empowering the system to operate in a zero-shot capacity across previously unseen categories.
The output is delivered to the voxel-based fusion back end, with each encoded mask-caption pair supporting subsequent probabilistic data association and map evolution steps (Deng et al., 23 Feb 2025).
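The per-frame pipeline above can be sketched as follows. This is a minimal illustration of the wiring only: `detect_boxes`, `segment_and_caption`, and `encode_caption` are hypothetical stubs standing in for YOLO-World, TAP, and SBERT, whose real models are not reproduced here.

```python
import hashlib

def detect_boxes(rgb):
    # Stub for YOLO-World: returns 2D bounding boxes (x1, y1, x2, y2).
    return [(10, 10, 50, 50)]

def segment_and_caption(rgb, box):
    # Stub for TAP: returns a per-instance binary mask and a caption string.
    return [[1]], "a red leather sofa"

def encode_caption(caption, dim=384):
    # Stub for SBERT: a deterministic 384-d pseudo-embedding derived
    # from a hash, so the pipeline is runnable without the real model.
    h = hashlib.sha256(caption.encode()).digest()
    return [h[i % len(h)] / 255.0 for i in range(dim)]

def process_frame(rgb):
    """Yield (mask, caption, feature) triplets for one RGB frame."""
    triplets = []
    for box in detect_boxes(rgb):
        mask, caption = segment_and_caption(rgb, box)
        triplets.append((mask, caption, encode_caption(caption)))
    return triplets
```

Each triplet produced here is what the voxel-based back end consumes for association and fusion.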
2. Probabilistic Voxel Representation
Each voxel $v$ in the scene maintains a categorical probability vector $\theta_v = (\theta_{v,1}, \dots, \theta_{v,K})$, where $k \in \{1, \dots, K\}$ indexes the current set of mapped instances and $\sum_k \theta_{v,k} = 1$. Associated with each instance $k$, the codebook stores a fused 384-dimensional caption feature $f_k$.
Key Bayesian Updates
OpenVox's fusion framework comprises two major probabilistic routines at each timestep:
- Instance association (MLE): Determines whether a new detection corresponds to an existing map instance using a weighted mixture of geometric overlap and caption-feature similarity. The combined score guides association decisions.
- Live map evolution (MAP): Integrates new mask evidence across observed voxels using Dirichlet–Categorical conjugacy. Posterior voxel parameters evolve per
  $\alpha_{v,k}^{(t)} = \alpha_{v,k}^{(t-1)} + \mathbb{1}[z_v^{(t)} = k]$,
  and expected probabilities update via
  $\mathbb{E}[\theta_{v,k}] = \alpha_{v,k}^{(t)} \big/ \sum_j \alpha_{v,j}^{(t)}$.
Key assumptions include independence of mask observations given the map, and Law of Large Numbers approximations for voxel-level evidence aggregation (Deng et al., 23 Feb 2025).
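The Dirichlet–Categorical update above is a standard conjugate step and can be sketched concretely. The function names below are illustrative, not the paper's API; the math is exactly the count increment and posterior mean described above.

```python
def update_voxel(alpha, k):
    """Conjugate Dirichlet-Categorical update: one observation of instance k
    increments that instance's pseudo-count by 1."""
    alpha = dict(alpha)  # copy so the update is non-destructive
    alpha[k] = alpha.get(k, 0.0) + 1.0
    return alpha

def expected_probs(alpha):
    """Posterior mean E[theta_k] = alpha_k / sum_j alpha_j."""
    total = sum(alpha.values())
    return {k: a / total for k, a in alpha.items()}
```

Because the update only touches counts, a spurious label shifts the posterior slightly rather than overwriting it, which is the basis of the noise robustness discussed below.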
3. Cross-Frame Incremental Fusion and Noise Robustness
Incremental fusion operates per frame in two stages:
3.1 Instance Association
Pixels from each mask are projected to a set of voxel indices. If these indices lack prior observations, a new instance is initialized in the map and codebook. Otherwise, each detection is scored against every existing instance via geometric and caption-based similarity, and the detection is assigned to the highest-scoring instance provided that score exceeds an association threshold.
3.2 Map Evolution
For each observed voxel, the Dirichlet parameters are incremented according to the new one-hot instance label. The associated instance feature is then recomputed through weighted averaging, where the credibility weight combines the association score with the visibility ratio (the fraction of an instance's voxels observed in the current frame).
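The credibility-weighted feature fusion can be sketched as a running weighted average. Taking credibility as the product of association score and visibility ratio is an illustrative reading of the description above; the function name is not the paper's API.

```python
def fuse_feature(old_feat, old_weight, new_feat, assoc_score, visibility):
    """Credibility-weighted running average of an instance's caption feature.

    old_weight is the accumulated credibility mass of past observations;
    the new observation contributes c = assoc_score * visibility, so
    partial or uncertain views move the fused feature only slightly.
    """
    c = assoc_score * visibility
    total = old_weight + c
    fused = [(old_weight * o + c * n) / total for o, n in zip(old_feat, new_feat)]
    return fused, total
```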
3.3 Noise Handling
The Dirichlet model preserves uncertainty in each voxel's instance distribution, allowing gradual correction of spurious or noisy labels. Codebook fusion further mitigates erroneous updates by down-weighting partial, noisy, or brief observations, which is particularly effective against segmentation jitter and sensor artifacts (Deng et al., 23 Feb 2025).
4. Open-Vocabulary Reasoning and Evaluation
OpenVox enables open-vocabulary, fine-grained scene queries by leveraging caption embeddings aggregated per instance in the codebook. This supports both ontology-based lookup and compositional language queries (e.g., "red leather sofa").
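Query resolution then reduces to ranking codebook entries by similarity between the encoded query and each fused caption feature. The sketch below assumes cosine similarity and a plain top-k ranking; the function names are illustrative.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def query(codebook, query_feat, top_k=1):
    """Rank instance ids by similarity of their fused caption feature to
    the encoded language query (e.g. the SBERT embedding of
    'red leather sofa')."""
    ranked = sorted(codebook.items(), key=lambda kv: -cosine(kv[1], query_feat))
    return [k for k, _ in ranked[:top_k]]
```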
Benchmarks
Empirical evaluations on Replica and ScanNet datasets report:
| Task | Metric | OpenVox | Best Baseline |
|---|---|---|---|
| 3D zero-shot instance seg. | AP (Replica, 8 scenes) | 11.73 | 6.01 |
| | AP50 (Replica) | 27.29 | 12.37 |
| | AP25 (Replica) | 38.46 | 24.71 |
| | AP (ScanNet, 6 scenes) | 3.70 | 3.27 |
| | AP50 (ScanNet) | 11.90 | 9.77 |
| | AP25 (ScanNet) | 31.20 | 27.89 |
| 3D zero-shot semantic seg. | mIoU / mAcc (Replica) | 27.30 / 43.42 | 22.53 / 41.71 |
| | mIoU / mAcc (ScanNet) | 22.84 / 54.23 | 20.91 / 49.04 |
| Open-vocabulary retrieval | Top-1 recall (Ontology) | 0.905 | 0.810 |
| | Top-1 recall (Relevance) | 0.762 | 0.429 |
| | Top-1 recall (Functionality) | 0.714 | 0.476 |
These results establish state-of-the-art performance in zero-shot instance segmentation, semantic segmentation, and open-vocabulary retrieval (Deng et al., 23 Feb 2025).
5. Real-World Deployment and System Characteristics
OpenVox has been validated on physical mobile robotic platforms (Autolabor M1 with Azure Kinect RGB–D and Livox MID-360 LiDAR for pose estimation), operating fully online at a voxel resolution of 0.04 m. The instance segmentation and captioning front end achieves 10–15 Hz, with back-end fusion completing within milliseconds per frame. The architecture maintains stability in the presence of reflective surfaces, glass, partial occlusion, and segmentation jitter. The memory requirement for the sparse voxel map scales linearly with the physical scene volume.
Noted limitations include occasional transient mis-associations for briefly visible instances—in practice quickly corrected—and a reliance on predominantly static environments. Query latency and codebook memory can become significant in extremely large-scale deployments (Deng et al., 23 Feb 2025).
6. Extensions, Limitations, and Future Work
OpenVox’s main contributions are its unified, real-time open-vocabulary mapping pipeline and robust probabilistic voxel representation, supported by LLM-based caption features and two-stage Bayesian fusion. The system is currently limited by assumptions of static scenes, with open research directions encompassing:
- Dynamic object modeling for moving-object identification and online tracking.
- Lifelong map enhancement via interactive, language-based human feedback.
- Compression strategies for codebook scalability in outdoor and large-scale deployments.
A plausible implication is that integrating dynamic tracking and interactive feedback could further augment OpenVox’s generality and applicability to more complex, real-world environments (Deng et al., 23 Feb 2025).