
BoxOVIS: Open-Vocabulary 3D Segmentation

Updated 29 December 2025
  • BoxOVIS is a system for open-vocabulary 3D instance segmentation that decomposes the task into proposal generation and mask classification using both point-based and RGBD box-guided methods.
  • It combines pretrained Mask3D and YOLO-World detectors to generate and classify segmentation proposals, achieving state-of-the-art generalization to rare object classes with practical inference speeds.
  • The approach avoids heavy image foundation models such as SAM and CLIP by relying on frozen 2D detectors and 3D mask aggregation, balancing computational efficiency with improved object retrieval.

BoxOVIS (Box-Guided Open-Vocabulary Instance Segmentation) is a system for open-vocabulary 3D instance segmentation, targeting the retrieval of objects from scene-scale point clouds using arbitrary user-supplied text queries. It decomposes the problem into proposal generation using both point cloud- and RGBD box-guided workflows, and mask classification using 2D open-vocabulary detectors. BoxOVIS attains state-of-the-art generalization to rare object classes while achieving practical inference speed by avoiding heavy image foundation models such as SAM and CLIP (Nguyen et al., 22 Dec 2025).

1. Problem Formulation and Motivation

Open-vocabulary 3D instance segmentation (OV-3DIS) seeks, from a 3D scene point cloud $P \in \mathbb{R}^{N \times 3}$, a set of RGBD frames $\{\mathcal{I}_i\}_{i=1}^{K_f}$ (each with intrinsics $I_i$ and extrinsics $E_i$), and a prompt comprising an unbounded list of class names, to produce a set of binary instance masks $\{M_j\}_{j=1}^{K}$, with each mask $M_j \in \{0,1\}^N$ assigned a prompt label. A salient requirement is that at test time, masks may correspond to categories never observed during 3D training.
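To fix the interface, here is a minimal Python sketch of the task signature; the type and field names are illustrative assumptions, not from the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RGBDFrame:
    rgb: np.ndarray         # (H, W, 3) color image
    depth: np.ndarray       # (H, W) depth map in meters
    intrinsics: np.ndarray  # (3, 3) camera matrix I_i
    extrinsics: np.ndarray  # (4, 4) camera pose E_i

def ov_3dis(points: np.ndarray,       # P: (N, 3) scene point cloud
            frames: list[RGBDFrame],  # RGBD frames {I_i}, i = 1..K_f
            prompt: list[str],        # unbounded list of class names
            ) -> list[tuple[np.ndarray, str]]:
    """Return K binary masks M_j in {0,1}^N, each paired with a prompt label."""
    ...
```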

BoxOVIS structures OV-3DIS as two tightly integrated subproblems:

  • Proposal generation: Class-agnostic 3D masks are extracted both from a 3D pretrained segmenter (Mask3D) and by “lifting” YOLO-World 2D open-vocabulary detections into 3D using superpoint grouping.
  • Mask classification: Each mask is assigned a class from the user prompt by projecting the 3D mask back into 2D, intersecting with each frame’s YOLO-World label map, and aggregating the most frequent class across top-visibility frames.

This approach addresses limitations in contemporary systems: those relying on Mask3D alone fail to generalize to unseen or rare categories, while pipelines using CLIP or SAM alongside 3D segmenters are computationally infeasible for real-time use. BoxOVIS achieves practical speeds (<1 min/scene) and strong rare-category retrieval by inheriting the recognition abilities of powerful 2D open-vocabulary detectors.

2. Architecture and Data Processing Pipeline

BoxOVIS is organized into dual branches—point-based proposals and RGBD box-guided proposals—followed by pooled classification. The flow is depicted in the original work’s Figure 1; its steps are described as follows:

  • 3D Branch (Point-based Masks):
    • The input point cloud $P$ is partitioned via Felzenszwalb–Huttenlocher graph-based superpoint segmentation.
    • Superpoints are segmented by pretrained Mask3D, producing $K_{\rm point}$ candidate masks $M^{\rm point}$.
  • 2D Branch (Box-guided Masks; a sketch of the lifting and grouping steps follows this list):
    • Each RGB frame $\mathcal{I}_i$ is passed to YOLO-World to extract bounding boxes $b_{ij}$ with class labels $c_{ij}$.
    • Each 2D box is lifted into a 3D oriented box $b^{3D}_{ij}$ by projecting the pixels it contains to 3D using the frame's depth and calibration.
    • Superpoints with a fraction $\geq \tau_{\rm spp}$ of their points inside $b^{3D}_{ij}$ are grouped into a coarse mask $S_{ij}$.
    • Across frames, masks $S_{ij}$ are merged when their 3D IoU is $\geq \tau_{\rm merge}$ and their class labels match, yielding $M^{\rm RGBD}$.
    • Masks overlapping a point-based $M^{\rm point}$ with IoU $\geq \tau_{\rm filter}$ are filtered out.
  • Pooling and Classification:
    • All remaining masks, $\{M^{\rm point}\} \cup \{M^{\rm RGBD}\}$, are pooled.
    • 2D label maps $\mathcal{L}_i$ are built per frame by painting YOLO-World detections (prioritizing larger boxes).
    • Visible points of each mask are projected into the frames; for each mask, the distribution $\mathcal{D}_j$ of label assignments is computed, and the majority label becomes $M_j$'s class.
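The two geometric steps of the 2D branch, box lifting and superpoint grouping, might look as follows. This is a hedged sketch using NumPy and Open3D; the function names are illustrative, and it assumes the extrinsics map world to camera coordinates:

```python
import numpy as np
import open3d as o3d

def lift_box_to_3d(box_xyxy, depth, K, E):
    """Back-project the pixels inside a 2D detection box and fit a 3D
    oriented box to them (a sketch; not the authors' exact code)."""
    x0, y0, x1, y1 = (int(v) for v in box_xyxy)
    us, vs = np.meshgrid(np.arange(x0, x1), np.arange(y0, y1))
    z = depth[vs, us]
    us, vs, z = us[z > 0], vs[z > 0], z[z > 0]   # keep pixels with valid depth
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pts_cam = np.stack([(us - cx) * z / fx, (vs - cy) * z / fy, z], axis=1)
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    pts_world = (np.linalg.inv(E) @ pts_h.T).T[:, :3]  # assumes E: world -> camera
    # Oriented-box fitting in Open3D runs on the CPU (the noted bottleneck).
    return o3d.geometry.OrientedBoundingBox.create_from_points(
        o3d.utility.Vector3dVector(pts_world))

def group_superpoints(points, superpoint_ids, box3d, tau_spp=0.6):
    """Union the superpoints that have >= tau_spp of their points inside
    the lifted 3D box into one coarse mask S_ij."""
    inside = np.zeros(len(points), dtype=bool)
    inside[box3d.get_point_indices_within_bounding_box(
        o3d.utility.Vector3dVector(points))] = True
    mask = np.zeros(len(points), dtype=bool)
    for sp in np.unique(superpoint_ids):
        sel = superpoint_ids == sp
        if inside[sel].mean() >= tau_spp:
            mask |= sel
    return mask
```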

A concise summary of the mask proposal and classification steps is provided below.

| Branch | Proposal Method | Classification Mechanism |
|---|---|---|
| 3D (point) | Mask3D on superpoints | 2D label-map projection |
| 2D (RGBD box) | Lifted YOLO-World detections | 2D label-map projection + merging |

Parameter defaults include $\tau_{\rm box}=0.7$, $\tau_{\rm spp}=0.6$, $\tau_{\rm merge}=0.4$, $\tau_{\rm filter}=0.8$, and $\tau_{\rm depth}=0.1~\mathrm{m}$.
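Collected in one place, these defaults could be expressed as a small configuration object (a sketch; the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoxOVISThresholds:
    tau_box: float = 0.7     # drop 3D boxes overlapping Mask3D proposals
    tau_spp: float = 0.6     # fraction of a superpoint inside a 3D box
    tau_merge: float = 0.4   # 3D IoU for cross-frame mask merging
    tau_filter: float = 0.8  # IoU for removing RGBD duplicates of point masks
    tau_depth: float = 0.1   # depth-consistency tolerance, in meters
```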

3. Key Algorithms and Quantitative Formulas

Central computational steps are formally expressed as follows:

  • Mask IoU for deduplication and merging (see the sketch after this list):

$$\mathrm{IoU}(M_a, M_b) = \frac{\sum_{p=1}^{N} M_{a,p} \wedge M_{b,p}}{\sum_{p=1}^{N} M_{a,p} \vee M_{b,p}}$$

  • Visibility filtering for each projected point:
    • Frustum: $V^f_{i,p} = \mathbf{1}(0 < x_{i,p} < W) \wedge \mathbf{1}(0 < y_{i,p} < H)$
    • Depth: $V^d_{i,p} = \mathbf{1}(|z_{i,p} - D_{i,p}| < \tau_{\rm depth})$
    • Full per-frame mask: $M_j^{(i)} = V_i^f \odot V_i^d \odot M_j$
  • Class aggregation draws from the "prompt distribution" over projected visible points in the top-$k$ most visible frames.
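Both computations reduce to a few lines of NumPy. The sketch below implements the mask IoU and the per-frame visibility test; it assumes the extrinsics map world to camera coordinates, and the names are illustrative:

```python
import numpy as np

def mask_iou(ma: np.ndarray, mb: np.ndarray) -> float:
    """IoU of two binary masks defined over the N scene points."""
    union = np.logical_or(ma, mb).sum()
    return np.logical_and(ma, mb).sum() / union if union else 0.0

def visible_mask(points, mask, K, E, depth, tau_depth=0.1):
    """Per-frame visible mask M_j^(i): frustum test AND depth test."""
    H, W = depth.shape
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (E @ pts_h.T).T[:, :3]          # assumes E: world -> camera
    z = cam[:, 2]
    valid = z > 1e-6
    u = np.zeros_like(z)
    v = np.zeros_like(z)
    u[valid] = K[0, 0] * cam[valid, 0] / z[valid] + K[0, 2]
    v[valid] = K[1, 1] * cam[valid, 1] / z[valid] + K[1, 2]
    in_frustum = valid & (u > 0) & (u < W) & (v > 0) & (v < H)   # V^f
    ui = np.clip(u, 0, W - 1).astype(int)
    vi = np.clip(v, 0, H - 1).astype(int)
    depth_ok = np.abs(z - depth[vi, ui]) < tau_depth             # V^d
    return mask & in_frustum & depth_ok   # elementwise product with M_j
```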

A distinguishing feature is that BoxOVIS reuses Mask3D (pretrained, class-agnostic) and YOLO-World (extra-large) as frozen, off-the-shelf modules. The system introduces no new end-to-end segmentation or classification losses.

4. Detailed Inference Workflow

The BoxOVIS inference pipeline is executed as follows (a code sketch of steps 4 and 8 appears after the list):

  1. Run YOLO-World on sampled frames to extract 2D bounding boxes and classes.
  2. For each box, project the enclosed pixels with depth to 3D, fit an oriented box with Open3D, and discard boxes that overlap pre-existing Mask3D proposals above $\tau_{\rm box}$.
  3. Assign superpoints with a fraction $\geq \tau_{\rm spp}$ of their points inside each box; build coarse masks.
  4. Merge same-class coarse masks with 3D IoU $\geq \tau_{\rm merge}$ across frames into RGBD instance proposals.
  5. Remove RGBD masks with IoU $\geq \tau_{\rm filter}$ against point-based masks.
  6. Pool all proposals into the candidate mask set.
  7. Build dense per-frame 2D label maps from YOLO-World detections.
  8. For each 3D mask, aggregate projected 2D labels and assign the majority class from the prompt.
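Two of these steps reduce to short routines. The sketch below shows one way to implement the cross-frame merge of step 4 and the majority vote of step 8, reusing the mask_iou helper from the Section 3 sketch; the function names are illustrative, not the authors' API:

```python
import numpy as np

def merge_same_class(coarse, tau_merge=0.4):
    """Step 4: greedily merge coarse (mask, label) pairs across frames
    when their 3D IoU reaches tau_merge and their labels match."""
    merged = []
    for mask, label in coarse:
        for k, (m, l) in enumerate(merged):
            if l == label and mask_iou(mask, m) >= tau_merge:
                merged[k] = (m | mask, l)   # union the agreeing masks
                break
        else:
            merged.append((mask, label))
    return merged

def majority_label(projected_labels):
    """Step 8: majority vote over the label-map entries hit by a mask's
    visible points in its top-k most visible frames."""
    labels, counts = np.unique(np.asarray(projected_labels), return_counts=True)
    return labels[np.argmax(counts)]
```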

This enables robust retrieval of rare and unseen objects by leveraging YOLO-World’s open-vocabulary understanding while aligning mask geometry to 3D structure.

5. Experimental Protocols and Results

Experiments use ScanNet200 (1,201 training and 312 validation scenes; 198 classes with a head/common/tail split) and Replica (8 synthetic scenes, 48 classes). Evaluation relies on mAP@[.50:.05:.95], mAP₅₀, and mAP₂₅.
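The headline metric averages AP over ten IoU thresholds; a generic sketch of the sweep (not the paper's evaluation code) is:

```python
import numpy as np

def map_over_thresholds(ap_at, lo=0.50, hi=0.95, step=0.05):
    """mAP@[.50:.05:.95]: mean of AP evaluated at each IoU threshold.
    `ap_at` is any callable returning AP at a given threshold."""
    thresholds = np.arange(lo, hi + 1e-9, step)
    return float(np.mean([ap_at(t) for t in thresholds]))
```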

| Dataset | System | mAP | mAP₅₀ | Tail mAP | Inference Time (s/scene) |
|---|---|---|---|---|---|
| ScanNet200 | Open-YOLO 3D | 24.7 | 31.7 | 21.6 | 21.8 |
| ScanNet200 | BoxOVIS | 24.9 | 32.1 | 22.4 | 55.9 |
| Replica | Open-YOLO 3D | 23.7 | 28.6 | N/A | 16.6 |
| Replica | BoxOVIS | 24.0 | 31.8 | N/A | 43.7 |

On ScanNet200, BoxOVIS gains +0.2 mAP (overall) and +0.8 tail-class mAP over Open-YOLO 3D, indicating improved rare object recovery. Replica results show a larger mAP₅₀ improvement (+3.2).

SAM+CLIP methods require 300–550 s/scene; Open-YOLO 3D and BoxOVIS are at least 5–10× faster. This suggests BoxOVIS is appropriate for time-sensitive applications, striking a balance between speed and rare-class performance.

6. Component Analysis, Limitations, and Efficiency

Ablation shows the addition of RGBD box-guided proposals yields 0.8–1.0 mAP improvements for tail categories in ScanNet200, supporting the claim that the approach is effective for rare or novel object segmentation. Replica experiments confirm greater gains at lower IoU thresholds, indicating that box-guided masks recover more rare classes, albeit sometimes with coarser geometry.

The principal computational bottleneck is CPU-based oriented box fitting in Open3D; a GPU implementation could halve runtime. RGBD masks derived from superpoint grouping can be noisy. Further mask refinement (e.g., a lightweight SAM pass on final candidates) is proposed for future exploration, as is improved back-end classification using open-world 3D models like OpenShape or DuoMamba.
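As one illustration of the GPU route, an oriented box can be fit via a PCA-style eigendecomposition in PyTorch. This is a hedged sketch of a plausible replacement, not the authors' implementation; note it yields a PCA-aligned box rather than Open3D's minimum-volume fit:

```python
import torch

def fit_obb_gpu(pts: torch.Tensor):
    """PCA-based oriented bounding box; pts is an (M, 3) CUDA tensor."""
    center = pts.mean(dim=0)
    centered = pts - center
    # Principal axes from the 3x3 covariance eigendecomposition.
    cov = centered.T @ centered / max(len(pts) - 1, 1)
    _, axes = torch.linalg.eigh(cov)        # columns are eigenvectors
    local = centered @ axes                 # points in the box frame
    lo, hi = local.min(dim=0).values, local.max(dim=0).values
    box_center = center + axes @ ((lo + hi) / 2)
    return box_center, axes, hi - lo        # pose (center, rotation) and extent
```

Fitting many detections' point sets in parallel on the GPU, rather than one CPU call per box, is where the anticipated runtime reduction would come from.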

BoxOVIS also inherits the limitation that heavily occluded instances may be incorrectly labeled, since current 2D label map aggregation struggles with strong occlusions.

7. Practical Impact and Future Directions

BoxOVIS achieves a compromise between rapid, generalizable 3D segmentation and minimal computational overhead. The system’s reliance on frozen 2D and 3D foundation models enables strong recognition of rare and unseen objects, while its efficiency (sub–60 s/scene) supports applications in real-time robotics and augmented reality settings. Proposed extensions include GPU-based 3D box fitting, enhanced mask refinement, and replacement of the 2D labeling stage with open-world 3D back-ends for improved occlusion robustness and semantic coverage (Nguyen et al., 22 Dec 2025).

References (1)

  • Nguyen et al., 22 Dec 2025.
