OpenVox: Open-Vocabulary 3D Mapping

Updated 16 February 2026
  • OpenVox is a real-time, open-vocabulary 3D mapping framework that integrates instance segmentation and natural language captioning with probabilistic volumetric fusion.
  • It employs a two-stage Bayesian update combining geometric and caption feature similarities to robustly associate instances and evolve semantic voxel maps.
  • The framework achieves state-of-the-art performance on benchmarks like Replica and ScanNet and operates efficiently in real-world robotic deployments.

OpenVox is a real-time, incremental open-vocabulary probabilistic instance-level voxel mapping framework designed for 3D environmental reconstruction with advanced semantic understanding. It unifies vision-LLM-driven open-vocabulary segmentation with fully probabilistic volumetric fusion to yield robust, zero-shot instance- and semantic-level 3D scene maps. The architecture enables fine-grained and function-based semantic reasoning, efficient instance association, and resilience to segmentation and sensor noise across streaming observations (Deng et al., 23 Feb 2025).

1. System Architecture and Front-End Processing

OpenVox processes a live stream of RGB–D data $\{C_t, D_t, P_t\}$ (color image, depth, and camera pose). The front end focuses on instance-level semantic segmentation and natural language captioning through a multi-stage pipeline:

  • Open-vocabulary detection: YOLO-World (denoted $\mathrm{Det}(\cdot)$) extracts 2D bounding boxes from the RGB input.
  • Mask and caption generation: TAP ($\mathrm{SegCap}(\cdot)$) then generates a per-instance mask $m^i_t$ and a concise textual caption.
  • Caption encoding: SBERT ($\mathrm{Enc}(\cdot)$) encodes each caption into a dense 384-dimensional feature vector $f^i_t$.

This yields $\{m^i_t, f^i_t\}_{i=1}^N$ for each incoming frame. The design allows free-form, open-vocabulary descriptions, enabling the system to operate zero-shot on previously unseen categories.

The output is delivered to the voxel-based fusion back end, with each encoded mask-caption pair supporting subsequent probabilistic data association and map evolution steps (Deng et al., 23 Feb 2025).
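The per-frame front end can be sketched as follows. Here `detect`, `seg_cap`, and `encode` are stand-ins for YOLO-World, TAP, and SBERT respectively (the real model APIs differ), so this is an illustrative pipeline rather than the authors' code.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class Detection:
    mask: np.ndarray      # boolean per-pixel mask m_t^i
    feature: np.ndarray   # 384-d caption embedding f_t^i

def front_end(rgb, detect, seg_cap, encode):
    """One front-end pass: detect -> mask + caption -> encode.

    `detect`, `seg_cap`, `encode` are placeholders for Yolo-World
    (Det), TAP (SegCap), and SBERT (Enc); not the real APIs.
    """
    detections = []
    for box in detect(rgb):                 # Det(.): 2D bounding boxes
        mask, caption = seg_cap(rgb, box)   # SegCap(.): mask + caption
        feature = encode(caption)           # Enc(.): 384-d caption vector
        detections.append(Detection(mask, feature))
    return detections
```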

2. Probabilistic Voxel Representation

Each voxel $v^j$ in the scene maintains a categorical probability vector $\theta^j = \{\theta^{j,\gamma} : \gamma \in \Gamma\}$, where $\Gamma$ indexes the current set of mapped instances and $\sum_\gamma \theta^{j,\gamma} = 1$. For each instance $\gamma$, the codebook $\mathcal{B}$ stores a fused 384-dimensional caption feature $f^\gamma$.
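A minimal data structure matching this representation might look like the following sketch; the class layout and method names are illustrative, not the paper's implementation.

```python
import numpy as np

class VoxelMap:
    """Sparse voxel map: each voxel j holds Dirichlet parameters
    alpha^{j,gamma} over the instance set Gamma, and a codebook maps
    each instance to a fused caption feature (illustrative layout)."""

    def __init__(self, feat_dim=384):
        self.alpha = {}          # voxel index j -> {instance gamma: pseudo-count}
        self.codebook = {}       # instance gamma -> fused 384-d caption feature
        self.feat_dim = feat_dim

    def theta(self, j):
        """Expected categorical vector theta^j = alpha^j / sum(alpha^j)."""
        a = self.alpha.get(j, {})
        total = sum(a.values())
        return {g: v / total for g, v in a.items()} if total else {}
```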

Key Bayesian Updates

OpenVox's fusion framework comprises two major probabilistic routines at each timestep:

  • Instance association (MLE): Determines whether a new detection corresponds to an existing map instance using a mixture of geometric ($S^{\mathrm{geo}}_\gamma$) and caption-feature ($S^{\mathrm{fea}}_\gamma$) similarities. The combined score $A_\gamma = \lambda S^{\mathrm{geo}}_\gamma + (1-\lambda)\,S^{\mathrm{fea}}_\gamma$ guides association decisions.
  • Live map evolution (MAP): Integrates new mask evidence across observed voxels using Dirichlet–Categorical conjugacy. Posterior voxel parameters evolve per

$$\alpha^{j,\gamma}_t = \alpha^{j,\gamma}_{t-1} + y^{j,\gamma}_t$$

and expected probabilities update via

$$\theta^{j,\gamma}_t = \alpha^{j,\gamma}_t \Big/ \sum_\tau \alpha^{j,\tau}_t.$$

Key assumptions include independence of mask observations given the map, and Law of Large Numbers approximations for voxel-level evidence aggregation (Deng et al., 23 Feb 2025).
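Thanks to Dirichlet–Categorical conjugacy, the posterior update reduces to incrementing pseudo-counts and renormalizing; a minimal sketch:

```python
import numpy as np

def dirichlet_update(alpha, y):
    """Conjugate Dirichlet-Categorical update over the instance set:
    alpha_t = alpha_{t-1} + y_t, then theta_t = alpha_t / sum(alpha_t).
    `alpha` holds pseudo-counts and `y` the (one-hot) evidence vector."""
    alpha = alpha + y
    theta = alpha / alpha.sum()
    return alpha, theta
```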

3. Cross-Frame Incremental Fusion and Noise Robustness

Incremental fusion operates per frame in two stages:

3.1 Instance Association

Pixels from each mask $m^i_t$ are projected to voxel indices $V_{m^i_t}$. If these indices lack prior observations, a new instance is initialized in the map and codebook. Otherwise, association scores between $m^i_t$ and each $\gamma \in \Gamma$ are computed from geometric and caption-based similarity, with assignment to the instance of maximum $A_\gamma$ provided it exceeds a threshold $\tau_a$.
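The association rule can be sketched as below; the values of $\lambda$ and $\tau_a$ here are placeholders, not the paper's tuned parameters.

```python
import numpy as np

def associate(s_geo, s_fea, lam=0.5, tau_a=0.3):
    """Return the index of the instance maximizing
    A = lam * S_geo + (1 - lam) * S_fea, or None (spawn a new
    instance) when no score exceeds tau_a. lam and tau_a are
    illustrative defaults, not the paper's values."""
    scores = lam * np.asarray(s_geo) + (1 - lam) * np.asarray(s_fea)
    best = int(np.argmax(scores))
    return best if scores[best] > tau_a else None
```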

3.2 Map Evolution

For voxels $v^j \in V_{m^i_t}$, Dirichlet parameters $\alpha^{j,\gamma}_t$ are incremented according to new one-hot instance labels $y^{j,\gamma}_t$. The instance feature $f^\gamma$ is then recomputed through weighted averaging, where the credibility $w^i_t$ combines the association score $A_\gamma$ with the visibility ratio $R^i_t$ (the fraction of voxels observed for instance $\gamma$).
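The feature fusion step can be sketched as a running weighted average with credibility $w^i_t = A_\gamma \cdot R^i_t$; the exact normalization used here is an assumption, not the paper's formula.

```python
import numpy as np

def fuse_feature(f_old, n_old, f_new, a_score, r_vis):
    """Credibility-weighted running average of an instance's caption
    feature. w = A_gamma * R_t (association score times visibility
    ratio); the accumulation scheme is a plausible sketch only."""
    w = a_score * r_vis
    f = (n_old * f_old + w * f_new) / (n_old + w)
    f = f / (np.linalg.norm(f) + 1e-8)   # keep unit norm for cosine queries
    return f, n_old + w
```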

3.3 Noise Handling

The Dirichlet model preserves uncertainty in $\theta$, allowing gradual correction of spurious or noisy labels. Codebook fusion further mitigates erroneous updates by down-weighting partial, noisy, or brief observations—particularly effective against segmentation jitter and sensor artifacts (Deng et al., 23 Feb 2025).
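The self-correcting behavior can be illustrated with the pseudo-count update alone: a single spurious label is gradually out-voted as consistent evidence accumulates (a self-contained toy example, not the system's code).

```python
import numpy as np

# Toy illustration: instance 0 gets one noisy vote, instance 1 gets
# nine consistent votes; the expected probabilities recover.
alpha = np.array([1.0, 1.0])          # uniform Dirichlet prior
observations = [0] + [1] * 9          # one spurious label, then correct ones
for gamma in observations:
    y = np.zeros(2)
    y[gamma] = 1.0
    alpha += y                        # alpha_t = alpha_{t-1} + y_t
theta = alpha / alpha.sum()           # expected categorical probabilities
# theta[1] now dominates despite the initial error
```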

4. Open-Vocabulary Reasoning and Evaluation

OpenVox enables open-vocabulary, fine-grained scene queries by leveraging caption embeddings aggregated per instance in the codebook. This supports both ontology-based lookup and compositional language queries (e.g., "red leather sofa").
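Query-time retrieval over the codebook can be sketched as cosine-similarity ranking between an encoded query and the fused caption features; the function below is illustrative, not the authors' implementation.

```python
import numpy as np

def query(codebook, q_feat, top_k=1):
    """Rank instances by cosine similarity between an encoded language
    query and the fused caption features in the codebook (sketch)."""
    ids, feats = zip(*codebook.items())
    F = np.stack(feats)
    F = F / np.linalg.norm(F, axis=1, keepdims=True)
    q = q_feat / np.linalg.norm(q_feat)
    sims = F @ q
    order = np.argsort(-sims)[:top_k]
    return [(ids[i], float(sims[i])) for i in order]
```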

Benchmarks

Empirical evaluations on Replica and ScanNet datasets report:

| Task | Metric | OpenVox | Best Baseline |
|---|---|---|---|
| 3D zero-shot instance seg. | AP (Replica, 8 scenes) | 11.73 | 6.01 |
| | AP50 | 27.29 | 12.37 |
| | AP25 | 38.46 | 24.71 |
| | AP (ScanNet, 6 scenes) | 3.70 | 3.27 |
| | AP50 | 11.90 | 9.77 |
| | AP25 | 31.20 | 27.89 |
| 3D zero-shot semantic seg. | mIoU / mAcc (Replica) | 27.30 / 43.42 | 22.53 / 41.71 |
| | mIoU / mAcc (ScanNet) | 22.84 / 54.23 | 20.91 / 49.04 |
| Open-vocabulary retrieval | Top-1 recall (Ontology) | 0.905 | 0.810 |
| | Top-1 recall (Relevance) | 0.762 | 0.429 |
| | Top-1 recall (Functionality) | 0.714 | 0.476 |

These results establish state-of-the-art performance in zero-shot instance segmentation, semantic segmentation, and open-vocabulary retrieval (Deng et al., 23 Feb 2025).

5. Real-World Deployment and System Characteristics

OpenVox has been validated on physical mobile robotic platforms (Autolabor M1 with Azure Kinect RGB–D and Livox MID-360 LiDAR for pose estimation), operating fully online at a voxel resolution of 0.04 m. The instance segmentation and captioning front end achieves 10–15 Hz, with back-end fusion completing within milliseconds per frame. The architecture maintains stability in the presence of reflective surfaces, glass, partial occlusion, and segmentation jitter. The memory requirement for the sparse voxel map scales linearly with the physical scene volume.

Noted limitations include occasional transient mis-associations for briefly visible instances—in practice quickly corrected—and a reliance on predominantly static environments. Query latency and codebook memory can become significant in extremely large-scale deployments (Deng et al., 23 Feb 2025).

6. Extensions, Limitations, and Future Work

OpenVox’s main contributions are its unified, real-time open-vocabulary mapping pipeline and robust probabilistic voxel representation, supported by LLM-based caption features and two-stage Bayesian fusion. The system is currently limited by assumptions of static scenes, with open research directions encompassing:

  • Dynamic object modeling for moving-object identification and online tracking.
  • Lifelong map enhancement via interactive, language-based human feedback.
  • Compression strategies for codebook scalability in outdoor and large-scale deployments.

A plausible implication is that integrating dynamic tracking and interactive feedback could further augment OpenVox’s generality and applicability to more complex, real-world environments (Deng et al., 23 Feb 2025).

References

  1. Deng et al., 23 Feb 2025.
