
BoxOVIS: Open-Vocabulary 3D Segmentation

Updated 29 December 2025
  • BoxOVIS is a system for open-vocabulary 3D instance segmentation that decomposes the task into proposal generation and mask classification using both point-based and RGBD box-guided methods.
  • It combines pretrained Mask3D and YOLO-World detectors to generate and classify segmentation proposals, achieving state-of-the-art generalization to rare object classes with practical inference speeds.
  • The approach avoids heavy image foundation models such as SAM and CLIP by relying on frozen 2D detectors and 3D mask aggregation, balancing computational efficiency with improved object retrieval.

BoxOVIS (Box-Guided Open-Vocabulary Instance Segmentation) is a system for open-vocabulary 3D instance segmentation, targeting the retrieval of objects from scene-scale point clouds using arbitrary user-supplied text queries. It decomposes the problem into proposal generation using both point cloud- and RGBD box-guided workflows, and mask classification using 2D open-vocabulary detectors. BoxOVIS attains state-of-the-art generalization to rare object classes while achieving practical inference speed by avoiding heavy image foundation models such as SAM and CLIP (Nguyen et al., 22 Dec 2025).

1. Problem Formulation and Motivation

Open-vocabulary 3D instance segmentation (OV-3DIS) seeks, from a 3D scene point cloud $P \in \mathbb{R}^{N \times 3}$, a set of RGBD frames $\{\mathcal{I}_i\}_{i=1}^{K_f}$ (each with intrinsics $I_i$ and extrinsics $E_i$), and a prompt comprising an unbounded list of class names, to produce a set of binary instance masks $\{M_j\}_{j=1}^{K}$, with each mask $M_j \in \{0,1\}^N$ assigned a prompt label. A salient requirement is that at test time, masks may correspond to categories never observed during 3D training.
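To fix the interface, here is a minimal Python sketch of the task signature; the type and field names are illustrative assumptions, not from the paper:

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class RGBDFrame:
    rgb: np.ndarray         # (H, W, 3) color image
    depth: np.ndarray       # (H, W) depth map in meters
    intrinsics: np.ndarray  # (3, 3) camera matrix I_i
    extrinsics: np.ndarray  # (4, 4) camera pose E_i

def ov_3dis(points: np.ndarray,       # P: (N, 3) scene point cloud
            frames: list[RGBDFrame],  # RGBD frames {I_i}, i = 1..K_f
            prompt: list[str],        # unbounded list of class names
            ) -> list[tuple[np.ndarray, str]]:
    """Return K binary masks M_j in {0,1}^N, each paired with a prompt label."""
    ...
```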

BoxOVIS structures OV-3DIS as two tightly integrated subproblems:

  • Proposal generation: Class-agnostic 3D masks are extracted both from a 3D pretrained segmenter (Mask3D) and by “lifting” YOLO-World 2D open-vocabulary detections into 3D using superpoint grouping.
  • Mask classification: Each mask is assigned a class from the user prompt by projecting the 3D mask back into 2D, intersecting with each frame’s YOLO-World label map, and aggregating the most frequent class across top-visibility frames.

This approach addresses limitations in contemporary systems: those relying on Mask3D alone fail to generalize to unseen or rare categories, while pipelines using CLIP or SAM alongside 3D segmenters are computationally infeasible for real-time use. BoxOVIS achieves practical speeds (<1 min/scene) and strong rare-category retrieval by inheriting the recognition abilities of powerful 2D open-vocabulary detectors.

2. Architecture and Data Processing Pipeline

BoxOVIS is organized into dual branches—point-based proposals and RGBD box-guided proposals—followed by pooled classification. The flow is depicted in the original work’s Figure 1; its steps are described as follows:

  • 3D Branch (Point-based Masks):
    • The input point cloud $P$ is partitioned via Felzenszwalb–Huttenlocher graph-based superpoint segmentation.
    • Superpoints are segmented by pretrained Mask3D, producing $K_{\rm point}$ candidate masks $M^{\rm point}$.
  • 2D Branch (Box-guided Masks; a sketch of the lifting and grouping steps follows this list):
    • Each RGB frame $\mathcal{I}_i$ is passed to YOLO-World to extract bounding boxes $b_{ij}$ with class labels $c_{ij}$.
    • Each 2D box is lifted into a 3D oriented box $b^{3D}_{ij}$ by projecting the pixels it contains to 3D using the frame's depth and calibration.
    • Superpoints with a fraction $\geq \tau_{\rm spp}$ of their points inside $b^{3D}_{ij}$ are grouped into a coarse mask $S_{ij}$.
    • Across frames, masks $S_{ij}$ are merged when their 3D IoU is $\geq \tau_{\rm merge}$ and their class labels match, yielding $M^{\rm RGBD}$.
    • Masks overlapping a point-based $M^{\rm point}$ with IoU $\geq \tau_{\rm filter}$ are filtered out.
  • Pooling and Classification:
    • All remaining masks, $\{M^{\rm point}\} \cup \{M^{\rm RGBD}\}$, are pooled.
    • 2D label maps $\mathcal{L}_i$ are built per frame by painting YOLO-World detections (prioritizing larger boxes).
    • Visible points of each mask are projected into the frames; for each mask, the distribution $\mathcal{D}_j$ of label assignments is computed, and the majority label becomes $M_j$'s class.
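The two geometric steps of the 2D branch, box lifting and superpoint grouping, might look as follows. This is a hedged sketch using NumPy and Open3D; the function names are illustrative, and it assumes the extrinsics map world to camera coordinates:

```python
import numpy as np
import open3d as o3d

def lift_box_to_3d(box_xyxy, depth, K, E):
    """Back-project the pixels inside a 2D detection box and fit a 3D
    oriented box to them (a sketch; not the authors' exact code)."""
    x0, y0, x1, y1 = (int(v) for v in box_xyxy)
    us, vs = np.meshgrid(np.arange(x0, x1), np.arange(y0, y1))
    z = depth[vs, us]
    us, vs, z = us[z > 0], vs[z > 0], z[z > 0]   # keep pixels with valid depth
    fx, fy, cx, cy = K[0, 0], K[1, 1], K[0, 2], K[1, 2]
    pts_cam = np.stack([(us - cx) * z / fx, (vs - cy) * z / fy, z], axis=1)
    pts_h = np.concatenate([pts_cam, np.ones((len(pts_cam), 1))], axis=1)
    pts_world = (np.linalg.inv(E) @ pts_h.T).T[:, :3]  # assumes E: world -> camera
    # Oriented-box fitting in Open3D runs on the CPU (the noted bottleneck).
    return o3d.geometry.OrientedBoundingBox.create_from_points(
        o3d.utility.Vector3dVector(pts_world))

def group_superpoints(points, superpoint_ids, box3d, tau_spp=0.6):
    """Union the superpoints that have >= tau_spp of their points inside
    the lifted 3D box into one coarse mask S_ij."""
    inside = np.zeros(len(points), dtype=bool)
    inside[box3d.get_point_indices_within_bounding_box(
        o3d.utility.Vector3dVector(points))] = True
    mask = np.zeros(len(points), dtype=bool)
    for sp in np.unique(superpoint_ids):
        sel = superpoint_ids == sp
        if inside[sel].mean() >= tau_spp:
            mask |= sel
    return mask
```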

A concise summary of the mask proposal and classification steps is provided below.

| Branch | Proposal Method | Classification Mechanism |
|---|---|---|
| 3D (point) | Mask3D on superpoints | 2D label-map projection |
| 2D (RGBD box) | Lifted YOLO-World detections | 2D label-map projection + merging |

Parameter defaults include $\tau_{\rm box}=0.7$, $\tau_{\rm spp}=0.6$, $\tau_{\rm merge}=0.4$, $\tau_{\rm filter}=0.8$, and $\tau_{\rm depth}=0.1~\mathrm{m}$.
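Collected in one place, these defaults could be expressed as a small configuration object (a sketch; the field names are illustrative):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class BoxOVISThresholds:
    tau_box: float = 0.7     # drop 3D boxes overlapping Mask3D proposals
    tau_spp: float = 0.6     # fraction of a superpoint inside a 3D box
    tau_merge: float = 0.4   # 3D IoU for cross-frame mask merging
    tau_filter: float = 0.8  # IoU for removing RGBD duplicates of point masks
    tau_depth: float = 0.1   # depth-consistency tolerance, in meters
```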

3. Key Algorithms and Quantitative Formulas

Central computational steps are formally expressed as follows:

  • Mask IoU for deduplication and merging (see the sketch after this list):

$$\mathrm{IoU}(M_a, M_b) = \frac{\sum_{p=1}^{N} M_{a,p} \wedge M_{b,p}}{\sum_{p=1}^{N} M_{a,p} \vee M_{b,p}}$$

  • Visibility filtering for each projected point:
    • Frustum: $V^f_{i,p} = \mathbf{1}(0 < x_{i,p} < W) \wedge \mathbf{1}(0 < y_{i,p} < H)$
    • Depth: $V^d_{i,p} = \mathbf{1}(|z_{i,p} - D_{i,p}| < \tau_{\rm depth})$
    • Full per-frame mask: $M_j^{(i)} = V_i^f \odot V_i^d \odot M_j$
  • Class aggregation draws from the "prompt distribution" over projected visible points in the top-$k$ most visible frames.
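Both computations reduce to a few lines of NumPy. The sketch below implements the mask IoU and the per-frame visibility test; it assumes the extrinsics map world to camera coordinates, and the names are illustrative:

```python
import numpy as np

def mask_iou(ma: np.ndarray, mb: np.ndarray) -> float:
    """IoU of two binary masks defined over the N scene points."""
    union = np.logical_or(ma, mb).sum()
    return np.logical_and(ma, mb).sum() / union if union else 0.0

def visible_mask(points, mask, K, E, depth, tau_depth=0.1):
    """Per-frame visible mask M_j^(i): frustum test AND depth test."""
    H, W = depth.shape
    pts_h = np.concatenate([points, np.ones((len(points), 1))], axis=1)
    cam = (E @ pts_h.T).T[:, :3]          # assumes E: world -> camera
    z = cam[:, 2]
    valid = z > 1e-6
    u = np.zeros_like(z)
    v = np.zeros_like(z)
    u[valid] = K[0, 0] * cam[valid, 0] / z[valid] + K[0, 2]
    v[valid] = K[1, 1] * cam[valid, 1] / z[valid] + K[1, 2]
    in_frustum = valid & (u > 0) & (u < W) & (v > 0) & (v < H)   # V^f
    ui = np.clip(u, 0, W - 1).astype(int)
    vi = np.clip(v, 0, H - 1).astype(int)
    depth_ok = np.abs(z - depth[vi, ui]) < tau_depth             # V^d
    return mask & in_frustum & depth_ok   # elementwise product with M_j
```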

A distinguishing feature is that BoxOVIS reuses Mask3D (pretrained, class-agnostic) and YOLO-World (extra-large) as frozen, off-the-shelf modules. The system introduces no new end-to-end segmentation or classification losses.

4. Detailed Inference Workflow

The BoxOVIS inference pipeline is executed as follows (a code sketch of steps 4 and 8 appears after the list):

  1. Run YOLO-World on sampled frames to extract 2D bounding boxes and classes.
  2. For each box, project the enclosed pixels with depth to 3D, fit an oriented box with Open3D, and discard boxes that overlap pre-existing Mask3D proposals above $\tau_{\rm box}$.
  3. Assign superpoints with a fraction $\geq \tau_{\rm spp}$ of their points inside each box; build coarse masks.
  4. Merge same-class coarse masks with 3D IoU $\geq \tau_{\rm merge}$ across frames into RGBD instance proposals.
  5. Remove RGBD masks with IoU $\geq \tau_{\rm filter}$ against point-based masks.
  6. Pool all proposals into the candidate mask set.
  7. Build dense per-frame 2D label maps from YOLO-World detections.
  8. For each 3D mask, aggregate projected 2D labels and assign the majority class from the prompt.
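Two of these steps reduce to short routines. The sketch below shows one way to implement the cross-frame merge of step 4 and the majority vote of step 8, reusing the mask_iou helper from the Section 3 sketch; the function names are illustrative, not the authors' API:

```python
import numpy as np

def merge_same_class(coarse, tau_merge=0.4):
    """Step 4: greedily merge coarse (mask, label) pairs across frames
    when their 3D IoU reaches tau_merge and their labels match."""
    merged = []
    for mask, label in coarse:
        for k, (m, l) in enumerate(merged):
            if l == label and mask_iou(mask, m) >= tau_merge:
                merged[k] = (m | mask, l)   # union the agreeing masks
                break
        else:
            merged.append((mask, label))
    return merged

def majority_label(projected_labels):
    """Step 8: majority vote over the label-map entries hit by a mask's
    visible points in its top-k most visible frames."""
    labels, counts = np.unique(np.asarray(projected_labels), return_counts=True)
    return labels[np.argmax(counts)]
```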

This enables robust retrieval of rare and unseen objects by leveraging YOLO-World’s open-vocabulary understanding while aligning mask geometry to 3D structure.

5. Experimental Protocols and Results

Experiments use ScanNet200 (1,201 training and 312 validation scenes; 198 classes with a head/common/tail split) and Replica (8 synthetic scenes, 48 classes). Evaluation relies on mAP@[.50:.05:.95], mAP₅₀, and mAP₂₅.
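The headline metric averages AP over ten IoU thresholds; a generic sketch of the sweep (not the paper's evaluation code) is:

```python
import numpy as np

def map_over_thresholds(ap_at, lo=0.50, hi=0.95, step=0.05):
    """mAP@[.50:.05:.95]: mean of AP evaluated at each IoU threshold.
    `ap_at` is any callable returning AP at a given threshold."""
    thresholds = np.arange(lo, hi + 1e-9, step)
    return float(np.mean([ap_at(t) for t in thresholds]))
```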

| Dataset | System | mAP | mAP₅₀ | Tail mAP | Inference Time (s/scene) |
|---|---|---|---|---|---|
| ScanNet200 | Open-YOLO 3D | 24.7 | 31.7 | 21.6 | 21.8 |
| ScanNet200 | BoxOVIS | 24.9 | 32.1 | 22.4 | 55.9 |
| Replica | Open-YOLO 3D | 23.7 | 28.6 | N/A | 16.6 |
| Replica | BoxOVIS | 24.0 | 31.8 | N/A | 43.7 |

On ScanNet200, BoxOVIS gains +0.2 mAP (overall) and +0.8 tail-class mAP over Open-YOLO 3D, indicating improved rare object recovery. Replica results show a larger mAP₅₀ improvement (+3.2).

SAM+CLIP methods require 300–550 s/scene; Open-YOLO 3D and BoxOVIS are at least 5–10× faster. This suggests BoxOVIS is appropriate for time-sensitive applications, striking a balance between speed and rare-class performance.

6. Component Analysis, Limitations, and Efficiency

Ablation shows the addition of RGBD box-guided proposals yields 0.8–1.0 mAP improvements for tail categories in ScanNet200, supporting the claim that the approach is effective for rare or novel object segmentation. Replica experiments confirm greater gains at lower IoU thresholds, indicating that box-guided masks recover more rare classes, albeit sometimes with coarser geometry.

The principal computational bottleneck is CPU-based oriented box fitting in Open3D; a GPU implementation could halve runtime. RGBD masks derived from superpoint grouping can be noisy. Further mask refinement (e.g., a lightweight SAM pass on final candidates) is proposed for future exploration, as is improved back-end classification using open-world 3D models like OpenShape or DuoMamba.
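As one illustration of the GPU route, an oriented box can be fit via a PCA-style eigendecomposition in PyTorch. This is a hedged sketch of a plausible replacement, not the authors' implementation; note it yields a PCA-aligned box rather than Open3D's minimum-volume fit:

```python
import torch

def fit_obb_gpu(pts: torch.Tensor):
    """PCA-based oriented bounding box; pts is an (M, 3) CUDA tensor."""
    center = pts.mean(dim=0)
    centered = pts - center
    # Principal axes from the 3x3 covariance eigendecomposition.
    cov = centered.T @ centered / max(len(pts) - 1, 1)
    _, axes = torch.linalg.eigh(cov)        # columns are eigenvectors
    local = centered @ axes                 # points in the box frame
    lo, hi = local.min(dim=0).values, local.max(dim=0).values
    box_center = center + axes @ ((lo + hi) / 2)
    return box_center, axes, hi - lo        # pose (center, rotation) and extent
```

Fitting many detections' point sets in parallel on the GPU, rather than one CPU call per box, is where the anticipated runtime reduction would come from.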

BoxOVIS also inherits the limitation that heavily occluded instances may be incorrectly labeled, since current 2D label map aggregation struggles with strong occlusions.

7. Practical Impact and Future Directions

BoxOVIS achieves a compromise between rapid, generalizable 3D segmentation and minimal computational overhead. The system’s reliance on frozen 2D and 3D foundation models enables strong recognition of rare and unseen objects, while its efficiency (sub–60 s/scene) supports applications in real-time robotics and augmented reality settings. Proposed extensions include GPU-based 3D box fitting, enhanced mask refinement, and replacement of the 2D labeling stage with open-world 3D back-ends for improved occlusion robustness and semantic coverage (Nguyen et al., 22 Dec 2025).

References (1)

  • Nguyen et al., 22 Dec 2025.
