OpenVox: Open-Vocabulary 3D Mapping
- OpenVox is a real-time, open-vocabulary 3D mapping framework that integrates instance segmentation and natural language captioning with probabilistic volumetric fusion.
- It employs a two-stage Bayesian update combining geometric and caption feature similarities to robustly associate instances and evolve semantic voxel maps.
- The framework achieves state-of-the-art performance on benchmarks like Replica and ScanNet and operates efficiently in real-world robotic deployments.
OpenVox is a real-time, incremental open-vocabulary probabilistic instance-level voxel mapping framework designed for 3D environmental reconstruction with advanced semantic understanding. It unifies vision-LLM-driven open-vocabulary segmentation with fully probabilistic volumetric fusion to yield robust, zero-shot instance- and semantic-level 3D scene maps. The architecture enables fine-grained and function-based semantic reasoning, efficient instance association, and resilience to segmentation and sensor noise across streaming observations (Deng et al., 23 Feb 2025).
1. System Architecture and Front-End Processing
OpenVox processes a live stream of RGB–D data (color image, depth, and camera pose). The front end focuses on instance-level semantic segmentation and natural language captioning through a multi-stage pipeline:
- Open-vocabulary detection: Employs YOLO-World to extract 2D bounding boxes from the RGB input.
- Mask and caption generation: TAP then generates a per-instance mask and a concise textual caption for each detected box.
- Caption encoding: SBERT encodes each caption into a dense 384-dimensional feature vector.
This yields a set of mask–caption–feature triplets for each incoming frame. The design allows free-form, open-vocabulary descriptions, empowering the system to operate in a zero-shot capacity across previously unseen categories.
The output is delivered to the voxel-based fusion back end, with each encoded mask-caption pair supporting subsequent probabilistic data association and map evolution steps (Deng et al., 23 Feb 2025).
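The per-frame pipeline above can be sketched as follows. This is a minimal illustration of the wiring only: `detect_boxes`, `segment_and_caption`, and `encode_caption` are hypothetical stubs standing in for YOLO-World, TAP, and SBERT, whose real models are not reproduced here.

```python
import hashlib

def detect_boxes(rgb):
    # Stub for YOLO-World: returns 2D bounding boxes (x1, y1, x2, y2).
    return [(10, 10, 50, 50)]

def segment_and_caption(rgb, box):
    # Stub for TAP: returns a per-instance binary mask and a caption string.
    return [[1]], "a red leather sofa"

def encode_caption(caption, dim=384):
    # Stub for SBERT: a deterministic 384-d pseudo-embedding derived
    # from a hash, so the pipeline is runnable without the real model.
    h = hashlib.sha256(caption.encode()).digest()
    return [h[i % len(h)] / 255.0 for i in range(dim)]

def process_frame(rgb):
    """Yield (mask, caption, feature) triplets for one RGB frame."""
    triplets = []
    for box in detect_boxes(rgb):
        mask, caption = segment_and_caption(rgb, box)
        triplets.append((mask, caption, encode_caption(caption)))
    return triplets
```

Each triplet produced here is what the voxel-based back end consumes for association and fusion.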
2. Probabilistic Voxel Representation
Each voxel $v$ in the scene maintains a categorical probability vector $\theta_v = (\theta_{v,1}, \dots, \theta_{v,K})$, where $k \in \{1, \dots, K\}$ indexes the current set of mapped instances and $\sum_k \theta_{v,k} = 1$. Associated with each instance $k$, the codebook stores a fused 384-dimensional caption feature $f_k$.
Key Bayesian Updates
OpenVox's fusion framework comprises two major probabilistic routines at each timestep:
- Instance association (MLE): Determines whether a new detection corresponds to an existing map instance using a weighted mixture of geometric overlap and caption-feature similarity. The combined score guides association decisions.
- Live map evolution (MAP): Integrates new mask evidence across observed voxels using Dirichlet–Categorical conjugacy. Posterior voxel parameters evolve per
  $\alpha_{v,k}^{(t)} = \alpha_{v,k}^{(t-1)} + \mathbb{1}[z_v^{(t)} = k]$,
  and expected probabilities update via
  $\mathbb{E}[\theta_{v,k}] = \alpha_{v,k}^{(t)} \big/ \sum_j \alpha_{v,j}^{(t)}$.
Key assumptions include independence of mask observations given the map, and Law of Large Numbers approximations for voxel-level evidence aggregation (Deng et al., 23 Feb 2025).
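The Dirichlet–Categorical update above is a standard conjugate step and can be sketched concretely. The function names below are illustrative, not the paper's API; the math is exactly the count increment and posterior mean described above.

```python
def update_voxel(alpha, k):
    """Conjugate Dirichlet-Categorical update: one observation of instance k
    increments that instance's pseudo-count by 1."""
    alpha = dict(alpha)  # copy so the update is non-destructive
    alpha[k] = alpha.get(k, 0.0) + 1.0
    return alpha

def expected_probs(alpha):
    """Posterior mean E[theta_k] = alpha_k / sum_j alpha_j."""
    total = sum(alpha.values())
    return {k: a / total for k, a in alpha.items()}
```

Because the update only touches counts, a spurious label shifts the posterior slightly rather than overwriting it, which is the basis of the noise robustness discussed below.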
3. Cross-Frame Incremental Fusion and Noise Robustness
Incremental fusion operates per frame in two stages:
3.1 Instance Association
Pixels from each mask are projected to a set of voxel indices. If these indices lack prior observations, a new instance is initialized in the map and codebook. Otherwise, each detection is scored against every existing instance via geometric and caption-based similarity, and the detection is assigned to the highest-scoring instance provided that score exceeds an association threshold.
3.2 Map Evolution
For each observed voxel, the Dirichlet parameters are incremented according to the new one-hot instance label. The associated instance feature is then recomputed through weighted averaging, where the credibility weight combines the association score with the visibility ratio (the fraction of an instance's voxels observed in the current frame).
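The credibility-weighted feature fusion can be sketched as a running weighted average. Taking credibility as the product of association score and visibility ratio is an illustrative reading of the description above; the function name is not the paper's API.

```python
def fuse_feature(old_feat, old_weight, new_feat, assoc_score, visibility):
    """Credibility-weighted running average of an instance's caption feature.

    old_weight is the accumulated credibility mass of past observations;
    the new observation contributes c = assoc_score * visibility, so
    partial or uncertain views move the fused feature only slightly.
    """
    c = assoc_score * visibility
    total = old_weight + c
    fused = [(old_weight * o + c * n) / total for o, n in zip(old_feat, new_feat)]
    return fused, total
```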
3.3 Noise Handling
The Dirichlet model preserves uncertainty in each voxel's instance distribution, allowing gradual correction of spurious or noisy labels. Codebook fusion further mitigates erroneous updates by down-weighting partial, noisy, or brief observations, which is particularly effective against segmentation jitter and sensor artifacts (Deng et al., 23 Feb 2025).
4. Open-Vocabulary Reasoning and Evaluation
OpenVox enables open-vocabulary, fine-grained scene queries by leveraging caption embeddings aggregated per instance in the codebook. This supports both ontology-based lookup and compositional language queries (e.g., "red leather sofa").
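Query resolution then reduces to ranking codebook entries by similarity between the encoded query and each fused caption feature. The sketch below assumes cosine similarity and a plain top-k ranking; the function names are illustrative.

```python
import math

def cosine(u, v):
    # Cosine similarity between two feature vectors.
    num = sum(a * b for a, b in zip(u, v))
    den = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return num / den if den else 0.0

def query(codebook, query_feat, top_k=1):
    """Rank instance ids by similarity of their fused caption feature to
    the encoded language query (e.g. the SBERT embedding of
    'red leather sofa')."""
    ranked = sorted(codebook.items(), key=lambda kv: -cosine(kv[1], query_feat))
    return [k for k, _ in ranked[:top_k]]
```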
Benchmarks
Empirical evaluations on Replica and ScanNet datasets report:
| Task | Metric | OpenVox | Best Baseline |
|---|---|---|---|
| 3D zero-shot instance seg. | AP (Replica, 8 scenes) | 11.73 | 6.01 |
| | AP50 (Replica) | 27.29 | 12.37 |
| | AP25 (Replica) | 38.46 | 24.71 |
| | AP (ScanNet, 6 scenes) | 3.70 | 3.27 |
| | AP50 (ScanNet) | 11.90 | 9.77 |
| | AP25 (ScanNet) | 31.20 | 27.89 |
| 3D zero-shot semantic seg. | mIoU / mAcc (Replica) | 27.30 / 43.42 | 22.53 / 41.71 |
| | mIoU / mAcc (ScanNet) | 22.84 / 54.23 | 20.91 / 49.04 |
| Open-vocabulary retrieval | Top-1 recall (Ontology) | 0.905 | 0.810 |
| | Top-1 recall (Relevance) | 0.762 | 0.429 |
| | Top-1 recall (Functionality) | 0.714 | 0.476 |
These results establish state-of-the-art performance in zero-shot instance segmentation, semantic segmentation, and open-vocabulary retrieval (Deng et al., 23 Feb 2025).
5. Real-World Deployment and System Characteristics
OpenVox has been validated on physical mobile robotic platforms (Autolabor M1 with Azure Kinect RGB–D and Livox MID-360 LiDAR for pose estimation), operating fully online at a voxel resolution of 0.04 m. The instance segmentation and captioning front end achieves 10–15 Hz, with back-end fusion completing within milliseconds per frame. The architecture maintains stability in the presence of reflective surfaces, glass, partial occlusion, and segmentation jitter. The memory requirement for the sparse voxel map scales linearly with the physical scene volume.
Noted limitations include occasional transient mis-associations for briefly visible instances—in practice quickly corrected—and a reliance on predominantly static environments. Query latency and codebook memory can become significant in extremely large-scale deployments (Deng et al., 23 Feb 2025).
6. Extensions, Limitations, and Future Work
OpenVox’s main contributions are its unified, real-time open-vocabulary mapping pipeline and robust probabilistic voxel representation, supported by LLM-based caption features and two-stage Bayesian fusion. The system is currently limited by assumptions of static scenes, with open research directions encompassing:
- Dynamic object modeling for moving-object identification and online tracking.
- Lifelong map enhancement via interactive, language-based human feedback.
- Compression strategies for codebook scalability in outdoor and large-scale deployments.
A plausible implication is that integrating dynamic tracking and interactive feedback could further augment OpenVox’s generality and applicability to more complex, real-world environments (Deng et al., 23 Feb 2025).