Open-Vocabulary Instance Segmentation
- Open-Vocabulary Instance Segmentation is a paradigm that enables pixel- or point-wise extraction of object instances using free-text queries and vision-language models.
- It spans multi-modal designs, including two-stage mask-proposal pipelines, single-stage joint segmentation, and hybrid 2D/3D approaches, to overcome traditional closed-set limitations.
- Empirical benchmarks on datasets like ScanNet200 and BURST demonstrate robust gains in zero-shot recognition and efficiency improvements over conventional methods.
Open-vocabulary instance segmentation is a rapidly advancing paradigm in computer vision and 3D scene understanding, enabling the segmentation and categorization of object instances specified by arbitrary text queries, including categories unseen during training. By leveraging vision-language models (VLMs), caption corpora, dual-modality fusion, and instance proposal mechanisms, these systems transcend the rigid closed-set taxonomies of traditional segmentation. The following sections systematically survey core methodologies, algorithmic frameworks, and empirical benchmarks that define the state of the art in open-vocabulary instance segmentation across 2D, 3D, and video domains.
1. Problem Definition and Distinctions
Open-vocabulary instance segmentation (OVIS) requires the system to localize and extract pixel-wise (in 2D) or point/voxel-wise (in 3D) masks for all instances in an input image or scene, using a label space defined by arbitrary free-text descriptions. At test time, instances may be of categories not present in the training annotation set, necessitating zero-shot identification. This contrasts with:
- Closed-vocabulary instance segmentation: models trained to output masks for a fixed, finite set of category labels (e.g., Mask R-CNN on COCO or Mask3D on ScanNet200), failing to generalize to novel terms or composite natural language queries (Takmaz et al., 2023).
- Open-vocabulary semantic segmentation: prior methods could produce per-point or per-pixel heatmaps aligned to text queries (Takmaz et al., 2023), but lacked the ability to output distinct instance masks.
- Referring segmentation: segments the object referred to by a natural language phrase, but typically produces a single mask per input (Warner et al., 2023).
The open-vocabulary instance segmentation paradigm subsumes the above, providing instance-level masks—and, frequently, fine-grained attributes or compositional cues—in response to both known and novel class prompts (Takmaz et al., 2023, Nguyen et al., 2023).
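To make the task contract above concrete, the following minimal sketch formalizes the inputs and outputs of OVIS. All names (`InstancePrediction`, `segment_open_vocabulary`) are hypothetical placeholders introduced for illustration; they do not correspond to any specific system from the cited papers.

```python
# Illustrative task contract for open-vocabulary instance segmentation.
from dataclasses import dataclass
from typing import List
import numpy as np


@dataclass
class InstancePrediction:
    mask: np.ndarray      # boolean mask: (H, W) for 2D pixels or (N_points,) for 3D points
    query: str            # the free-text query this instance was matched to
    score: float          # confidence of the query-instance assignment


def segment_open_vocabulary(scene: np.ndarray, text_queries: List[str]) -> List[InstancePrediction]:
    """Return one mask per detected instance for an arbitrary, test-time label space.

    Unlike closed-vocabulary models, `text_queries` is not fixed at training
    time and may contain novel or compositional phrases.
    """
    raise NotImplementedError  # placeholder: realized by the architectures surveyed in Section 2
```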
2. Fundamental Architectural Strategies
Several architectural templates have emerged:
(A) Two-stage: Mask Proposal + Open-Vocabulary Classification
This approach decouples spatial localization and semantic assignment:
- Class-agnostic mask proposal modules generate candidate instance masks using either 2D image features (Mask2Former or Mask R-CNN) or 3D backbones (sparse UNet, Mask3D, Gaussian splatting) (Takmaz et al., 2023, Nguyen et al., 2023, Jung et al., 30 Jul 2025, Piekenbrinck et al., 9 Jun 2025, Huang et al., 21 Oct 2025).
- Language-driven classification assigns free-text labels to proposals using embedding-based similarity (CLIP, Alpha-CLIP), or by voting/aggregation across multiple views and modalities (Takmaz et al., 2023, Jung et al., 30 Jul 2025, Huang et al., 21 Oct 2025).
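A minimal sketch of the second stage follows. It assumes class-agnostic proposals and a CLIP-style encoder are already available; the random arrays below merely stand in for mask-pooled proposal embeddings and text embeddings.

```python
# Stage 2 of a two-stage pipeline: assign free-text queries to proposals
# by cosine similarity between mask-pooled visual features and text features.
import numpy as np

rng = np.random.default_rng(0)

num_proposals, num_queries, dim = 5, 3, 512
proposal_feats = rng.normal(size=(num_proposals, dim))   # placeholder mask-pooled visual embeddings
text_feats = rng.normal(size=(num_queries, dim))         # placeholder embeddings of free-text queries

# L2-normalize so dot products become cosine similarities.
proposal_feats /= np.linalg.norm(proposal_feats, axis=1, keepdims=True)
text_feats /= np.linalg.norm(text_feats, axis=1, keepdims=True)

similarity = proposal_feats @ text_feats.T               # (num_proposals, num_queries)
assigned_query = similarity.argmax(axis=1)               # best-matching text query per proposal
assignment_score = similarity.max(axis=1)                # used for ranking / thresholding

for i, (q, s) in enumerate(zip(assigned_query, assignment_score)):
    print(f"proposal {i}: query index {q}, cosine similarity {s:.3f}")
```

In practice the proposal embeddings are aggregated over multiple views or modalities before this scoring step, as discussed in Section 4.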
(B) Single-stage: Joint Visual-Language Segmentation
These methods integrate region, pixel, and text processing in a single end-to-end network:
- Query-based mask decoders operate on object queries conditioned on image features and language embeddings (Zhang et al., 2023, Xiao et al., 7 Jul 2025).
- Contrastive pretraining and proposal-free matchers align visual features with text across both seen and unseen classes (Wu et al., 2023, Huynh et al., 2021).
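The sketch below illustrates the single-stage read-out in a query-based decoder: each object query yields a mask via dot products with per-pixel embeddings (Mask2Former-style) and an open-vocabulary class score via dot products with text embeddings. Shapes and the sigmoid/argmax read-out are illustrative assumptions, not a specific published architecture.

```python
# Single-stage read-out: object queries -> masks and open-vocabulary labels.
import numpy as np

rng = np.random.default_rng(1)

num_obj_queries, dim, H, W = 4, 256, 32, 32
num_text = 6

obj_queries = rng.normal(size=(num_obj_queries, dim))    # object queries after the transformer decoder
pixel_embed = rng.normal(size=(dim, H, W))               # per-pixel embeddings from the pixel decoder
text_embed = rng.normal(size=(num_text, dim))            # language embeddings of the query vocabulary

# Mask branch: one soft mask per object query.
mask_logits = np.einsum("qd,dhw->qhw", obj_queries, pixel_embed)
masks = 1.0 / (1.0 + np.exp(-mask_logits)) > 0.5         # (num_obj_queries, H, W) boolean masks

# Classification branch: similarity of each query to each text embedding.
class_logits = obj_queries @ text_embed.T                 # (num_obj_queries, num_text)
labels = class_logits.argmax(axis=1)

print(masks.shape, labels)
```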
(C) Dual-Modality and Hybrid Designs
Emergent systems unify multimodal 2D and 3D reasoning or exploit synthetic view synthesis:
- Dual-pathway fusion leverages 3D point clouds and 2D multi-view images, using each modality’s proposals to cover limitations of the other (Nguyen et al., 2023, Ton et al., 2024).
- Synthetic rendering plus lookup generates scene-level images from a 3D model with a virtual camera and runs open-vocabulary 2D detectors, mapping 2D mask outputs back to the 3D domain (Huang et al., 2023).
A summary of model ingredients is given below.
| Core Component | Description | Typical Methods/Backbones |
|---|---|---|
| Proposal generation | Class-agnostic 2D/3D mask heads, multi-view aggregation | Mask3D, ISBNet, SAM, Mask2Former |
| Language embedding | Encode text queries and class names to joint space | CLIP ViT, Alpha-CLIP, BEiT-3, BERT |
| Cross-modal alignment | Similarity scoring, contrastive losses, mask/region pooling | Dot/cosine product, InfoNCE, SMS filtering |
| Multi-modality fusion | Project and aggregate 2D/3D, context injection | Dual-path, feature splatting, region-aware |
| Supervision | Manual masks, pseudo-masks (VLM, captions), synthetic render | MS-COCO, OpenImages, ScanNet200, S3DIS |
3. Training Protocols and Supervision
Open-vocabulary systems are trained via combinations of supervised and weakly-supervised objectives:
- Supervised segmentation uses available base class mask annotations for query-based or proposal-based segmentation (Zhang et al., 2023, Piekenbrinck et al., 9 Jun 2025).
- Weak supervision via captions/corpora leverages image–caption pairs for pseudo-label generation and grounding; the resulting ambiguity is often mitigated by restricting supervision to noun tokens or by generating masks from VLM activation maps (Wu et al., 2023, VS et al., 2023, Huynh et al., 2021).
- Pseudo-mask annotation replaces manual mask annotation with activations or region proposals obtained from a VLM, e.g., ALBEF cross-attention, GradCAM, iterative masking, and weakly-supervised proposal networks (WSPN) (VS et al., 2023, Huynh et al., 2021).
- Feature-level alignment and contrastive loss is used for both 2D and 3D: embedding region/mask features and text in a joint space, minimizing contrastive distances between matching region–token pairs (Takmaz et al., 2023, Nguyen et al., 2023, Ma et al., 16 Jan 2025, Huang et al., 17 Jul 2025).
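The following is a hedged sketch of a symmetric InfoNCE-style region–text contrastive loss of the kind used for feature-level alignment; matched region/text pairs sit on the diagonal of the similarity matrix. This is a generic formulation, not the exact loss of any cited work.

```python
# Symmetric region-text InfoNCE loss on a batch of matched pairs.
import numpy as np

def info_nce(region_feats: np.ndarray, text_feats: np.ndarray, temperature: float = 0.07) -> float:
    """region_feats, text_feats: (B, D) where row i of each forms a positive pair."""
    r = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = text_feats / np.linalg.norm(text_feats, axis=1, keepdims=True)
    logits = (r @ t.T) / temperature                      # (B, B) similarity matrix
    labels = np.arange(len(r))

    def cross_entropy(lg):
        lg = lg - lg.max(axis=1, keepdims=True)           # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()          # -log p(correct pair) on the diagonal

    # Symmetric loss: region-to-text and text-to-region directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))

rng = np.random.default_rng(2)
print(info_nce(rng.normal(size=(8, 256)), rng.normal(size=(8, 256))))
```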
End-to-end pipelines may involve pre-training on base mask classes, then joint training or distillation using captions or synthetic pseudo-masks (Huynh et al., 2021, Zhang et al., 2023, VS et al., 2023, Xiao et al., 7 Jul 2025).
4. 3D and Multiview Methods
Open-vocabulary 3D instance segmentation extends 2D paradigms by incorporating point clouds, reconstructed meshes, or Gaussian splatting representations.
- Multi-view mask aggregation: 2D instance masks are generated per view (Grounding-DINO+SAM, YOLO-World), back-projected into 3D, and fused with point clouds using superpoint clustering or region-growing; this addresses small-object or geometry-challenged cases missed by pure 3D pipelines (Nguyen et al., 2023, Takmaz et al., 2023). A projection-and-voting sketch follows this list.
- CLIP-based 3D feature fusion: Instance segment proposals are associated with multi-view CLIP features through average or attention-based aggregation, yielding a robust descriptor for each 3D object (Nguyen et al., 2023, Takmaz et al., 2023, Huang et al., 21 Oct 2025, Piekenbrinck et al., 9 Jun 2025).
- Proposal fusion and SMS filtering: Combining multiple proposal sources (e.g., 2D-guided and 3D proposals), removing duplicates via NMS, and normalizing text–proposal similarities (Standardized Maximum Similarity) to suppress false positives (Jung et al., 30 Jul 2025).
- Gaussian Splatting with contrastive and feature losses: Feature splatting allows per-Gaussian feature learning via contrastive objectives w.r.t. 2D masks, supporting cluster-based instance formation and open-vocabulary language assignment (Piekenbrinck et al., 9 Jun 2025, Huang et al., 21 Oct 2025).
- Synthetic snapshot–lookup frameworks: Render the 3D scene at multiple virtual viewpoints, run open-vocabulary 2D detection and look up mask correspondence for label assignment in 3D-only input settings (Huang et al., 2023).
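Referring back to the multi-view mask aggregation bullet above, the simplified sketch below lifts a 2D instance mask into 3D: scene points are projected into a camera view with a pinhole model, tested for membership in the 2D mask, and the resulting per-point votes are accumulated across views. The camera conventions and the voting rule are illustrative assumptions, not a specific published method.

```python
# Lift a per-view 2D instance mask onto 3D scene points by projection.
import numpy as np

def lift_mask_to_points(points_world, mask_2d, K, world_to_cam):
    """points_world: (N, 3); mask_2d: (H, W) bool; K: (3, 3) intrinsics;
    world_to_cam: (4, 4) extrinsics. Returns a (N,) boolean vote for this view."""
    N = points_world.shape[0]
    homog = np.concatenate([points_world, np.ones((N, 1))], axis=1)   # (N, 4) homogeneous coords
    cam = (world_to_cam @ homog.T).T[:, :3]                           # points in camera frame
    in_front = cam[:, 2] > 1e-6                                       # discard points behind the camera

    uv = (K @ cam.T).T
    uv = uv[:, :2] / np.clip(uv[:, 2:3], 1e-6, None)                  # perspective divide
    u, v = np.round(uv[:, 0]).astype(int), np.round(uv[:, 1]).astype(int)

    H, W = mask_2d.shape
    valid = in_front & (u >= 0) & (u < W) & (v >= 0) & (v < H)
    votes = np.zeros(N, dtype=bool)
    votes[valid] = mask_2d[v[valid], u[valid]]                        # mask membership per projected point
    return votes

# Votes from several views can then be summed; points exceeding a vote threshold
# form the lifted 3D instance, optionally refined by superpoints. Per-instance
# CLIP features are typically aggregated over the same set of views before
# text-query scoring.
```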
5. Video Instance Segmentation and Temporal Techniques
The video setting extends open-vocabulary segmentation with requirements for spatiotemporal consistency and object tracking:
- Frame-to-text vs. temporal alignment: Traditional OV-VIS models independently align each frame’s instances to text (Cheng et al., 2024), whereas advanced systems link instance embeddings across time, e.g., via Brownian bridge dynamics (Cheng et al., 2024), temporal instance resamplers, and temporal contrastive objectives (Fang et al., 2024, Zhu et al., 2024).
- Unified embedding alignment: To address domain gaps between VLM and instance features, modules such as Unified Embedding Alignment fuse segmentor queries with CLIP embedding space before text similarity—substantially improving generalization to novel categories (Fang et al., 2024).
- Mask tracking: Tracking modules based on rollout token prediction or TopK-enhanced association provide robust identity continuity for open-set objects (Guo et al., 2023, Zhu et al., 2024); a generic association sketch follows this list.
- Zero-shot transfer: Pretraining on closed-set categories enables substantial zero-shot gains on open-vocabulary benchmarks like BURST, LV-VIS, and YouTube-VIS (Cheng et al., 2024, Zhu et al., 2024, Fang et al., 2024).
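As referenced in the mask tracking bullet above, the sketch below shows a common association baseline for linking instances across frames: detections are matched to existing tracks by cosine similarity of instance embeddings using the Hungarian algorithm. This is a generic baseline for illustration, not the tracking module of any specific cited method.

```python
# Cross-frame instance association via embedding similarity + Hungarian matching.
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_embeds: np.ndarray, det_embeds: np.ndarray, min_sim: float = 0.3):
    """track_embeds: (T, D), det_embeds: (M, D). Returns matched (track, detection) index pairs."""
    t = track_embeds / np.linalg.norm(track_embeds, axis=1, keepdims=True)
    d = det_embeds / np.linalg.norm(det_embeds, axis=1, keepdims=True)
    sim = t @ d.T                                         # (T, M) cosine similarities
    rows, cols = linear_sum_assignment(-sim)              # maximize total similarity
    return [(r, c) for r, c in zip(rows, cols) if sim[r, c] >= min_sim]

rng = np.random.default_rng(3)
pairs = associate(rng.normal(size=(3, 128)), rng.normal(size=(4, 128)))
print(pairs)  # unmatched detections start new tracks; unmatched tracks may be kept alive briefly
```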
6. Empirical Results and Analysis
State-of-the-art systems demonstrate robust open-vocabulary instance segmentation capabilities across diverse benchmarks:
- 2D image segmentation: Mask-free approaches relying solely on VLM-based pseudo-masks and weak supervision match or exceed previous methods trained on large numbers of human-annotated masks (VS et al., 2023). Methods such as OpenSeeD provide unified handling of segmentation and detection with strong transfer to ADE20K, COCO LVIS splits, and more (Zhang et al., 2023).
- 3D instance segmentation: Fusing multi-view 2D masks and CLIP-embedded features with 3D proposals yields significant gains on ScanNet200, S3DIS, and Replica datasets—especially in the long-tail and open-set regimes (Nguyen et al., 2023, Takmaz et al., 2023, Jung et al., 30 Jul 2025, Huang et al., 21 Oct 2025). The integration of Alpha-CLIP and SMS filtering further boosts precision and mitigates background noise (Jung et al., 30 Jul 2025).
- Efficiency: Inference speed has been improved by circumventing expensive 2D foundation models (SAM, CLIP) through 2D bounding box detectors plus label-mapping, delivering up to 16x speed gains without accuracy loss (Boudjoghra et al., 2024).
- Benchmark performance:
- Open-YOLO 3D: 24.7% mAP in 22s/scene on ScanNet200; surpasses prior open-vocab methods (Boudjoghra et al., 2024).
- Open3DIS: 23.7% mAP on ScanNet200, outperforming prior 2D and 3D guidance methods (Nguyen et al., 2023).
- Video: BriVIS achieves a 49.5% performance gain over OV2Seg for open-vocabulary video segmentation on BURST (Cheng et al., 2024).
A selection of characteristic results is tabulated:
| Method | Domain | Key Benchmarks | Notable Result | Reference |
|---|---|---|---|---|
| CGG | 2D, weakly sup | COCO OVIS, COCO OSPS | +6.8% novel mAP, +15 PQ on novel | (Wu et al., 2023) |
| OpenMask3D | 3D, instance | ScanNet200, Replica | 15.4% mAP, robust tail class perf | (Takmaz et al., 2023) |
| Open3DIS | 3D, hybrid | ScanNet200, S3DIS, Replica | 23.7% mAP (ScanNet200, 2D+3D combo) | (Nguyen et al., 2023) |
| Open-YOLO 3D | 3D, fast inf. | ScanNet200, Replica | 24.7% mAP in 22s/scene | (Boudjoghra et al., 2024) |
| BriVIS | Video | BURST, YouTube-VIS, OVIS | 7.43 mAP on BURST, +49.5% over OV2Seg | (Cheng et al., 2024) |
| OVFormer | Video | LV-VIS, YTVIS, OVIS | +7.7 mAP over prior SOTA | (Fang et al., 2024) |
| Details Matter OV3DIS | 3D, tracking | ScanNet200, S3DIS, Replica | 32.7 mAP (ScanNet200, Top-K proto) | (Jung et al., 30 Jul 2025) |
| OpenSplat3D | 3DGS | LERF, ScanNet++ | 84.0% mIoU (LERF-mask), 24.5 AP | (Piekenbrinck et al., 9 Jun 2025) |
7. Key Insights, Challenges, and Future Directions
Several recurrent challenges and open research directions emerge:
- Quality of proposals: 3D mask quality (over- or under-segmentation) directly determines the performance upper bound across methods (Jung et al., 30 Jul 2025, Takmaz et al., 2023).
- Multi-modal fusion: Effective cross-view aggregation of 2D/3D features, alignment of context, and avoidance of per-view inconsistencies are crucial (Huang et al., 21 Oct 2025).
- Vocabulary generalization: Reliance on VLMs like CLIP or Alpha-CLIP enhances ability to recognize rare and unseen classes, but is still limited by pretraining distribution (Takmaz et al., 2023, Huang et al., 17 Jul 2025).
- Precision–recall tradeoffs: Aggressive filtering, e.g., via SMS normalization, boosts precision at minimal recall cost (Jung et al., 30 Jul 2025).
- Speed and scalability: Eliminating computationally intensive 2D foundation models (SAM, per-mask CLIP calls) in practical applications is an ongoing focus (Boudjoghra et al., 2024).
- Synthetic data and simulation: Rendering-based methods (OpenIns3D, Open-YOLO 3D) facilitate 3D open-vocabulary labeling even in the absence of real image data (Huang et al., 2023).
- Generalizable benchmarks and annotation sparsity: The field continues to require more diverse, challenging datasets, and more robust evaluation frameworks for fully open-vocabulary scenarios.
Future work emphasizes end-to-end joint optimization, stronger contextual reasoning, more complex attribute and relational queries, and efficient architectures capable of both high precision and broad class transfer (Wu et al., 2023, Xiao et al., 7 Jul 2025, Jung et al., 30 Jul 2025). The demonstrated frameworks offer robust foundations for downstream AR/VR, robotics, and large-scale geospatial analysis.