MV3DIS: Multi-View Mask Matching via 3D Guides for Zero-Shot 3D Instance Segmentation

Published 10 Apr 2026 in cs.CV | (2604.08916v1)

Abstract: Conventional 3D instance segmentation methods rely on labor-intensive 3D annotations for supervised training, which limits their scalability and generalization to novel objects. Recent approaches leverage multi-view 2D masks from the Segment Anything Model (SAM) to guide the merging of 3D geometric primitives, thereby enabling zero-shot 3D instance segmentation. However, these methods typically process each frame independently and rely solely on 2D metrics, such as SAM prediction scores, to produce segmentation maps. This design overlooks multi-view correlations and inherent 3D priors, leading to inconsistent 2D masks across views and ultimately fragmented 3D segmentation. In this paper, we propose MV3DIS, a coarse-to-fine framework for zero-shot 3D instance segmentation that explicitly incorporates 3D priors. Specifically, we introduce a 3D-guided mask matching strategy that uses coarse 3D segments as a common reference to match 2D masks across views and consolidates multi-view mask consistency via 3D coverage distributions. Guided by these view-consistent 2D masks, the coarse 3D segments are further refined into precise 3D instances. Additionally, we introduce a depth consistency weighting scheme that quantifies projection reliability to suppress ambiguities from inter-object occlusions, thereby improving the robustness of 3D-to-2D correspondence. Extensive experiments on the ScanNetV2, ScanNet200, ScanNet++, Replica, and Matterport3D datasets demonstrate the effectiveness of MV3DIS, which achieves superior performance over previous methods.

Authors (3)

Summary

The paper introduces a coarse-to-fine 3D-guided framework that enforces multi-view mask consistency to overcome the limitations of 2D-only segmentation methods.
It employs superpoint decomposition, SAM-based 2D mask acquisition, and depth consistency weighting to reliably lift 2D segmentation masks into coherent 3D segments.
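Empirical results on ScanNetV2, ScanNet200, and ScanNet++ show MV3DIS outperforming prior zero-shot methods and, on ScanNet200, exceeding a supervised Mask3D baseline in AP$_{50}$ (54.7 versus 51.2), while remaining robust with sparse views.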

Introduction and Motivation

Zero-shot 3D instance segmentation remains a central challenge in computer vision due to the cost-prohibitive nature of large-scale 3D annotation. Prevailing approaches attempt to leverage multi-view 2D segmentation masks, especially those generated by foundation models such as SAM, aggregating these cues into 3D via geometric reasoning. However, existing methods process 2D views independently and depend exclusively on per-frame, 2D-centric metrics—resulting in view-inconsistent masks and consequently fragmented 3D segmentation outputs. Such fragmentation reveals a key shortfall: the absence of explicit 3D priors and mechanisms to enforce cross-view mask agreement.

MV3DIS addresses these limitations. It introduces a coarse-to-fine, 3D-guided framework designed to enforce view consistency in mask assignment and improve the reliability of the 2D-to-3D lifting process. By using coarse 3D segments as common reference anchors for mask matching and incorporating a depth consistency weighting scheme, MV3DIS produces more coherent 3D segmentations, as demonstrated empirically on ScanNetV2, ScanNet200, and ScanNet++. Notably, the paper reports that MV3DIS holds up in open-vocabulary regimes, where supervised methods typically fail to generalize.

Figure 1: View inconsistency in existing methods versus view consistency with MV3DIS. Existing methods lead to significant mask fragmentation, while MV3DIS enforces multi-view alignment and produces coherent 3D instance segments.

Methodology

Coarse-to-Fine 3D Instance Segmentation Pipeline

MV3DIS proceeds in two main stages: a coarse, SAM-guided 3D segmentation phase and a subsequent 3D instance refinement phase. The framework's high-level strategy involves:

Superpoint Decomposition and SAM Mask Acquisition: The point cloud is over-segmented into geometrically coherent superpoints using a graph-cut algorithm. For each RGB-D frame, SAM (or other 2D foundation models) predicts candidate masks, which are disambiguated and merged into per-frame 2D segmentation maps.
Superpoint Merging for Coarse 3D Segmentation: A superpoint affinity graph is constructed using the 2D segmentation maps. Region growing—weighted by local affinity, spatial proximity, and superpoint size—merges superpoints into initial coarse 3D segments.
Figure 2: MV3DIS's two-stage pipeline: (a) Coarse segmentation via SAM-guided superpoint merging; (b) 3D-guided mask matching and iterative refinement via multi-view mask consistency.
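
The superpoint merging step can be illustrated with a minimal sketch. It is not the authors' implementation: the affinity definition (fraction of co-visible frames in which two superpoints share a 2D mask), the fixed `threshold`, and the function names are assumptions made for illustration, and the spatial proximity and superpoint size weights mentioned above are omitted. Merging is reduced to connected components over adjacent superpoint pairs whose affinity passes the threshold.

```python
from collections import defaultdict
from itertools import combinations


def superpoint_affinity(frame_labels):
    """Affinity between superpoint pairs from per-frame 2D mask labels.

    frame_labels: list of dicts {superpoint_id: mask_id}, one per frame,
    listing only the superpoints visible in that frame.
    Returns {(a, b): affinity} with a < b, where affinity is the fraction
    of co-visible frames in which a and b fall in the same 2D mask.
    """
    same = defaultdict(int)
    covisible = defaultdict(int)
    for labels in frame_labels:
        for a, b in combinations(sorted(labels), 2):
            covisible[(a, b)] += 1
            if labels[a] == labels[b]:
                same[(a, b)] += 1
    return {pair: same[pair] / n for pair, n in covisible.items()}


def merge_superpoints(num_superpoints, affinity, adjacency, threshold=0.5):
    """Merge adjacent superpoints whose affinity is at least `threshold`.

    adjacency: set of (a, b) pairs with a < b that touch in 3D.
    Returns one coarse segment label per superpoint (smallest member id).
    """
    parent = list(range(num_superpoints))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in adjacency:
        if affinity.get((a, b), 0.0) >= threshold:
            ra, rb = find(a), find(b)
            if ra != rb:
                parent[max(ra, rb)] = min(ra, rb)
    return [find(i) for i in range(num_superpoints)]


# Toy scene: four superpoints seen in three frames.
frames = [
    {0: "a", 1: "a", 2: "b"},
    {0: "x", 1: "x", 2: "y", 3: "y"},
    {1: "m", 2: "n", 3: "n"},
]
adjacency = {(0, 1), (1, 2), (2, 3)}
coarse_labels = merge_superpoints(4, superpoint_affinity(frames), adjacency)
print(coarse_labels)  # [0, 0, 2, 2]
```

In the toy scene, superpoints 0 and 1 always share a mask, as do 2 and 3, while 1 and 2 never do, so two coarse segments result.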

3D-Guided Multi-View Mask Matching

To counteract view inconsistency, MV3DIS introduces a 3D-guided mechanism for mask matching:

Projection and Visibility Estimation: Coarse 3D segments are projected onto every view using intrinsic and extrinsic calibration; visibility conditions are strictly enforced to filter out occluded or ambiguous correspondences.
Depth Consistency Weighting: For each projected 3D point, a depth consistency weight quantifies the reliability of that correspondence. Points whose projected depth sits near the occlusion threshold are penalized, which suppresses spurious 2D-3D associations.
Figure 3: Depth consistency and visibility weights. Misidentifications due to occlusions are down-weighted, and object observability is robustly quantified.
Consistency Score and Mask Selection: For every segment, candidate 2D masks are shortlisted based on visibility. Consistent masks are selected by maximizing consensus in their 3D coverage distribution, measured by the cosine similarity of subsegment occupancy vectors weighted by depth consistency.
Affinity-Based Region Refinement: The globally consistent masks refine 3D instance assignments: boundary superpoints are iteratively reassigned to maximize affinity with updated, view-consistent regions.
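
The projection, depth consistency weighting, and coverage-based matching steps listed above can be illustrated with a toy sketch. The summary does not give the paper's formulas, so everything here is an assumption: the linear falloff in `depth_consistency_weight`, the threshold `tau`, the tuple layout passed to `coverage_vector`, and all function names are hypothetical. Only the pinhole projection and the cosine similarity are standard.

```python
import math


def project_point(point_cam, fx, fy, cx, cy):
    """Pinhole projection of a camera-frame point (x, y, z) to (u, v, depth)."""
    x, y, z = point_cam
    return fx * x / z + cx, fy * y / z + cy, z


def depth_consistency_weight(projected_depth, observed_depth, tau=0.1):
    """Assumed form: 1 when the depths agree, falling linearly to 0 at gap tau.

    tau stands in for the visibility threshold; points with a larger gap
    are treated as occluded and receive weight 0.
    """
    gap = abs(projected_depth - observed_depth)
    return max(0.0, 1.0 - gap / tau)


def coverage_vector(points, num_subsegments):
    """Sum depth consistency weights per subsegment for points inside a mask.

    points: iterable of (subsegment_id, weight, inside_mask) tuples.
    """
    vec = [0.0] * num_subsegments
    for sub_id, weight, inside in points:
        if inside:
            vec[sub_id] += weight
    return vec


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


# One point projected into a 640x480 view, compared with the sensor depth.
u, v, depth = project_point((0.5, -0.25, 2.0), fx=500.0, fy=500.0, cx=320.0, cy=240.0)
weight = depth_consistency_weight(depth, observed_depth=2.02)

# A coarse segment with three subsegments: one reference mask and two
# candidate masks from another view.
ref = coverage_vector([(0, 1.0, True), (0, 0.9, True), (1, 0.8, True), (2, 1.0, False)], 3)
cand_a = coverage_vector([(0, 1.0, True), (1, 0.5, True), (1, 0.4, True), (2, 0.2, False)], 3)
cand_b = coverage_vector([(0, 1.0, False), (1, 0.5, False), (1, 0.4, False), (2, 0.2, True)], 3)
best = max([("cand_a", cand_a), ("cand_b", cand_b)],
           key=lambda c: cosine_similarity(ref, c[1]))
print(best[0])  # cand_a
```

Candidate `cand_a` covers the same subsegments as the reference mask and is selected, whereas `cand_b` covers a disjoint subsegment and has zero similarity.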

Experimental Evaluation

MV3DIS is benchmarked against open-vocabulary and supervised closed-vocabulary baselines on ScanNetV2, ScanNet200, and ScanNet++. The evaluation considers class-agnostic and semantic instance segmentation, measuring AP at various IoU thresholds.
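
To make the metric concrete, the following is a simplified sketch of class-agnostic AP at a given IoU threshold for a single scene. It is not the official ScanNet evaluation script: it uses un-interpolated precision, greedy matching in order of confidence, and represents each instance as a set of point indices. The function names and toy data are illustrative assumptions. Averaging over thresholds from 0.50 to 0.95 in steps of 0.05 follows the usual ScanNet mAP convention.

```python
def iou(pred, gt):
    """Intersection over union of two instances given as sets of point ids."""
    union = len(pred | gt)
    return len(pred & gt) / union if union else 0.0


def average_precision(preds, gts, threshold):
    """Un-interpolated AP for one scene at a single IoU threshold.

    preds: list of (confidence, set_of_point_ids); gts: list of sets.
    Each ground-truth instance can be matched by at most one prediction.
    """
    matched = set()
    ap, true_positives = 0.0, 0
    ranked = sorted(preds, key=lambda p: p[0], reverse=True)
    for rank, (_, points) in enumerate(ranked, start=1):
        best_iou, best_j = 0.0, None
        for j, gt in enumerate(gts):
            if j not in matched:
                value = iou(points, gt)
                if value > best_iou:
                    best_iou, best_j = value, j
        if best_j is not None and best_iou >= threshold:
            matched.add(best_j)
            true_positives += 1
            ap += true_positives / rank  # precision at this recall step
    return ap / len(gts) if gts else 0.0


gts = [{0, 1, 2, 3}, {4, 5, 6, 7}]
preds = [(0.9, {0, 1, 2, 3}), (0.8, {4, 5, 6}), (0.3, {8, 9})]

ap50 = average_precision(preds, gts, 0.5)
ap80 = average_precision(preds, gts, 0.8)
thresholds = [t / 100 for t in range(50, 100, 5)]
mean_ap = sum(average_precision(preds, gts, t) for t in thresholds) / len(thresholds)
print(ap50, ap80)  # 1.0 0.5
```

The second prediction overlaps its ground-truth instance with IoU 0.75, so it counts as a true positive at the 0.5 threshold but not at 0.8, which is why stricter thresholds lower the score.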

Key empirical findings:

MV3DIS surpasses all prior zero-shot open-vocabulary methods on every benchmark, often by non-trivial margins in both mAP and AP$_{50}$.
On ScanNet200, MV3DIS achieves 54.7 AP$_{50}$, exceeding the supervised Mask3D model trained directly on the benchmark (51.2).
Robustness is observed on challenging datasets (e.g., ScanNet++), where supervised models’ generalization collapses, but MV3DIS maintains strong performance.
Figure 4: Qualitative results comparing MV3DIS to SAM3D. MV3DIS produces cleaner, less fragmented segments, corresponding more closely to object boundaries.

Ablation studies isolate the contribution of each system component (region refinement, 3D mask matching, depth consistency weighting), and each component yields a measurable mAP increase. The depth consistency weighting and 3D-guided mask matching are especially influential for view consistency and occlusion handling.

Scalability Analysis:

MV3DIS demonstrates favorable scaling with respect to the number of images: performance gains saturate at lower view counts compared to competitors, evidencing efficient multi-view aggregation.
Figure 5: Correlation between segmentation performance and number of input images. MV3DIS remains robust under sparse view scenarios.

Theoretical and Practical Implications

MV3DIS’s explicit enforcement of view-consistent mask matching via 3D priors and geometric reasoning signals a robust direction for zero-shot 3D instance segmentation. The method demonstrates that integrating geometric projection reliability can substantially close the gap between zero-shot and supervised approaches, particularly in open-vocabulary or long-tail class regimes. Furthermore, the absence of video-dependent tracking or temporal constraints renders the approach well-suited to static, arbitrarily captured multi-view datasets—broadening its practical deployment potential.

On the theoretical front, the results suggest that multi-view consistency is not simply a matter of effective 2D mask aggregation but requires consistent 3D referencing and error-aware projection models. The depth consistency weighting mechanism, in particular, may be extensible to other vision tasks involving noisy or uncertain geometric transformations.

Future Directions

Potential future research avenues include:

Generalizing the 3D-guided mask matching paradigm to address category-level semantic segmentation or panoptic segmentation in zero-shot settings.
Extending projection reliability modeling to dynamic scenes or outdoor environments exhibiting complex occlusion patterns and lighting/capture variance.
Integrating language or multimodal cues into the mask matching process, enabling tighter coupling between 3D geometry and high-level semantic understanding.

Conclusion

MV3DIS advances zero-shot 3D instance segmentation by combining 3D-guided mask matching with projection reliability weighting. Its empirical results, including an AP$_{50}$ on ScanNet200 above that of a supervised Mask3D baseline and robust performance under sparse views, support the role of 3D priors and multi-view consistency. These findings underscore the value of principled 3D-2D aggregation in designing scalable, annotation-efficient 3D scene understanding systems.
