Zero-Shot Instance Segmentation
- Zero-shot instance segmentation detects and segments object instances from unseen classes using visual-semantic alignment and transfer learning.
- Methodologies integrate generative models, semantic-promoted debiasing, and decoupled grouping to address semantic gaps and background confusion.
- Recent methods demonstrate significant gains in mAP and instance-level accuracy on benchmarks spanning diverse domains, including robotics, 3D mapping, and remote sensing.
Zero-shot instance segmentation refers to the task of detecting and segmenting object instances that belong to categories not seen during training, i.e., classes for which no annotated instance labels are available in the training set. This field encompasses methodologies that transfer knowledge from seen to unseen classes by leveraging visual, semantic, or generative relationships—enabling models to generalize beyond the closed sets of traditional segmentation benchmarks. The following review details the fundamental principles, methodological advancements, benchmarking protocols, and emerging trends in zero-shot instance segmentation, emphasizing developments as of 2025.
1. Fundamental Problem Formulation and Challenges
Zero-shot instance segmentation (ZSI) formalizes the setting in which the training set comprises only a subset of possible semantic categories ("seen" classes), while inference must localize and segment object instances from both seen and "unseen" classes (Zheng et al., 2021). This requires the model not only to detect objects never observed during training but also to delineate each object’s spatial extent at the instance level.
Crucial challenges in ZSI arise from:
- Semantic transfer gap: bridging seen and unseen classes through semantic relationships among categories, typically encoded by word embeddings or vision-language models, so that knowledge learned on seen classes transfers to unseen ones.
- Instance-level granularity: Unlike semantic segmentation, ZSI must provide separate masks for each object, even when multiple instances are present and possibly belonging to unseen categories (Ding et al., 2021).
- Background confusion: Deciding whether to assign a region to an unseen object or to background, particularly when training treats all out-of-vocabulary pixels as background (He et al., 2023).
- Domain and modality gap: Transferring from modalities or data distributions used during pre-training (e.g., web images, ground-level or open-world images) to the target domain, such as remote sensing, medical, or cluttered indoor robotics (Huang et al., 17 Dec 2024, 2212.11922).
2. Core Methodological Approaches and Model Architectures
Architectures for ZSI have evolved from direct extensions of closed-set detectors to new forms incorporating generative models, semantic alignment, and vision-language coupling.
2.1 Visual-Semantic Alignment and Embedding Transfer
Early ZSI frameworks adapt architectures like Faster R-CNN or Mask R-CNN by projecting region-wise visual embeddings into a semantic space defined by word- or sentence-level embeddings (e.g., Word2Vec, GloVe, CLIP). Classification then operates in this shared space using dot-product or cosine similarity between visual and text-derived category vectors (Zheng et al., 2021, Khandelwal et al., 2023). Some works employ auxiliary projection layers learned in a second training stage, with the entire backbone frozen, to mitigate overfitting to the seen set and improve knowledge transfer robustness.
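The shared-space classification step described above can be illustrated with a minimal NumPy sketch. The function name, the temperature value, and the softmax scoring are illustrative assumptions, not any specific paper's implementation:

```python
import numpy as np

def zero_shot_classify(region_feats, class_embeds, temperature=0.07):
    """Classify region features against text-derived class vectors by
    cosine similarity, as in embedding-transfer ZSI heads.
    region_feats: (N, D) visual embeddings projected into the semantic space.
    class_embeds: (C, D) word/sentence embeddings for seen + unseen classes.
    Returns (N, C) per-region class probabilities."""
    v = region_feats / np.linalg.norm(region_feats, axis=1, keepdims=True)
    t = class_embeds / np.linalg.norm(class_embeds, axis=1, keepdims=True)
    logits = v @ t.T / temperature               # temperature-scaled cosine similarity
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability for softmax
    probs = np.exp(logits)
    return probs / probs.sum(axis=1, keepdims=True)
```

Because classification happens in the shared space, adding an unseen class at inference time only requires appending its text embedding to `class_embeds`.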
2.2 Semantic-Promoted Debiasing and Background Disambiguation
State-of-the-art architectures such as D²Zero introduce mechanisms to reduce the bias toward seen classes (semantic-promoted debiasing) and to resolve ambiguities between background and unseen object regions (background disambiguation) (He et al., 2023). This is accomplished by:
- Supervising the feature extractor with soft pseudo-labels based on the semantic similarity between seen and unseen categories.
- Employing a lightweight transformer to generate input-conditional (image-adaptive) class centers for dynamic classification.
- Constructing image-adaptive background prototypes using masked average pooling over class-agnostic mask proposals.
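The last step above, building an image-adaptive background prototype by masked average pooling, can be sketched as follows (a simplified NumPy illustration; the function name and the fallback for fully covered images are assumptions):

```python
import numpy as np

def background_prototype(feature_map, proposal_masks):
    """Image-adaptive background prototype: masked average pooling of
    features over pixels not covered by any class-agnostic mask proposal.
    feature_map: (H, W, D) dense features; proposal_masks: (K, H, W) boolean."""
    foreground = proposal_masks.any(axis=0)
    background = ~foreground
    if not background.any():                     # degenerate case: everything covered
        return feature_map.mean(axis=(0, 1))
    return feature_map[background].mean(axis=0)  # (D,) prototype for this image
```

Comparing region features against this prototype, rather than a single learned background vector, lets the background model adapt to each image's content.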
2.3 Generative and Primitive-based Models
Generative models synthesize features or mask images for unseen classes by exploiting semantic-visual relationships. Generative Moment Matching Networks and primitive-based transformers have been used to bridge the semantic gap, where learnable primitives represent fine-grained attributes and are assembled based on class embeddings (He et al., 2023). Instance coloring losses in generative segmentation recast the problem as image-to-image translation, training diffusion models or autoencoders to "paint" instances with unique colors (Khangaonkar et al., 21 May 2025).
2.4 Decoupled Grouping and Classification
Recent decoupling strategies advocate separating grouping (class-agnostic instance proposals) from zero-shot classification. Grouping is achieved with detector- or transformer-based models that are independent of class semantics; classification at the instance level leverages semantic similarity with text or language-model embeddings (Ding et al., 2021). This modularity allows the grouping and classification modules to be improved and integrated independently.
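The decoupled pipeline can be sketched as follows. This is a minimal NumPy illustration: `propose_masks` stands in for any class-agnostic proposer, and the pooling and scoring choices are assumptions for clarity:

```python
import numpy as np

def decoupled_zsi(image_feats, propose_masks, class_embeds):
    """Decoupled ZSI: (1) group pixels into class-agnostic instance masks,
    (2) classify each mask by semantic similarity in embedding space.
    image_feats: (H, W, D); propose_masks: callable returning (K, H, W) bool;
    class_embeds: (C, D) text embeddings for all candidate classes."""
    masks = propose_masks(image_feats)                 # grouping, class-agnostic
    t = class_embeds / np.linalg.norm(class_embeds, axis=1, keepdims=True)
    results = []
    for m in masks:
        feat = image_feats[m].mean(axis=0)             # masked average pooling
        feat = feat / (np.linalg.norm(feat) + 1e-8)
        scores = t @ feat                              # cosine similarity per class
        results.append((m, int(scores.argmax()), float(scores.max())))
    return results                                     # (mask, class_id, score) triples
```

Because the two stages share only the mask interface, either one can be swapped out (e.g., a stronger proposer or richer text embeddings) without retraining the other.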
2.5 Template and Patch-based Matching
Zero-shot referring segmentation frameworks (e.g., TAS) and template-based inference (e.g., NOCTIS) combine strong mask proposal generators (such as Grounded-SAM 2 and DINOv2) with fine-grained descriptor matching between query and template images. Patch-level roundtrip (cyclic) consistency and confidence-weighted matching scores enhance robustness over both global and local appearance cues (Gandyra et al., 2 Jul 2025).
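The patch-level roundtrip (cyclic) consistency check can be sketched as follows, assuming L2-normalized patch descriptors; the function name and return convention are illustrative, not the published formulation:

```python
import numpy as np

def roundtrip_consistent_matches(query_patches, template_patches):
    """Keep a query->template patch match only if the matched template
    patch's own nearest query patch cycles back to the original query.
    Inputs: L2-normalized descriptors of shapes (Nq, D) and (Nt, D)."""
    sim = query_patches @ template_patches.T     # cosine similarity matrix
    q2t = sim.argmax(axis=1)                     # best template per query patch
    t2q = sim.argmax(axis=0)                     # best query per template patch
    keep = t2q[q2t] == np.arange(len(query_patches))   # cycle closes?
    scores = sim[np.arange(len(query_patches)), q2t]   # match confidence
    return q2t, keep, scores
```

The surviving matches and their scores can then be aggregated into a confidence-weighted score for each mask proposal.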
3. Key Loss Functions and Training Strategies
Several custom loss functions have been introduced targeting ZSI’s unique supervision constraints:
- Zero-Shot Maximum Mean Discrepancy (ZS-MMD) Loss: Weighs MMD contributions by the confidence in pseudo-labels, enabling generative feature transfer when no direct supervision is available for unseen classes (Wang et al., 2021).
- Instance Coloring Loss: Enforces consistent color assignment within instances and maximizes color distance between different instances, supporting generative models’ ability for fine-grained grouping (Khangaonkar et al., 21 May 2025).
- Semantic Alignment Losses: Force the pairwise similarity structure in the visual space (via cosine distances of feature vectors) to match that in the semantic embedding space, often via KL-divergence or contrastive alignment (He et al., 2023).
- Cross-entropy, Dice, and Bipartite Matching Losses: Used for proposal assignment and mask prediction in transformer-based architectures, ensuring coverage of all object instances (Shin et al., 2023).
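To make the instance coloring idea concrete, here is a simplified NumPy sketch of such an objective: pixels within an instance are pulled toward their instance's mean color, while instance mean colors are pushed apart with a hinge. The margin value and exact form are illustrative assumptions, not the published loss:

```python
import numpy as np

def instance_coloring_loss(pred_colors, instance_ids, margin=0.5):
    """Sketch of an instance coloring objective.
    pred_colors: (P, 3) predicted per-pixel colors in [0, 1];
    instance_ids: (P,) integer instance labels for the same pixels."""
    ids = np.unique(instance_ids)
    means = np.stack([pred_colors[instance_ids == i].mean(axis=0) for i in ids])
    # Intra-instance consistency: distance of each pixel to its instance mean.
    intra = np.mean([np.linalg.norm(pred_colors[instance_ids == i] - m, axis=1).mean()
                     for i, m in zip(ids, means)])
    # Inter-instance separation: hinge on pairwise distances between mean colors.
    inter = 0.0
    if len(ids) > 1:
        d = np.linalg.norm(means[:, None] - means[None, :], axis=-1)
        off_diag = d[~np.eye(len(ids), dtype=bool)]
        inter = np.maximum(0.0, margin - off_diag).mean()
    return float(intra + inter)
```

A perfect coloring (uniform within instances, well-separated across them) drives both terms to zero, which is what lets a generative "painter" double as a grouping model.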
Crucially, end-to-end recursive training and pseudo-label bootstrapping are employed for self-improvement, with high-confidence predictions guiding further iterations (Wang et al., 2021).
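The confidence gating behind such bootstrapping can be sketched as follows (a generic illustration; the threshold value and function name are assumptions):

```python
import numpy as np

def select_pseudo_labels(probs, threshold=0.9):
    """Confidence-based pseudo-label selection for self-training rounds:
    keep only predictions whose max class probability exceeds a threshold.
    probs: (N, C) predicted class distributions for N candidate regions.
    Returns (labels, kept_indices, confidences) for the retained regions."""
    conf = probs.max(axis=1)
    labels = probs.argmax(axis=1)
    keep = conf >= threshold
    return labels[keep], np.where(keep)[0], conf[keep]
```

Only the retained high-confidence predictions feed the next training iteration, which limits the propagation of early mistakes on unseen classes.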
4. Benchmarks, Evaluation Protocols, and Results
ZSI research has established several benchmarks and analytical criteria:
- Standard Splits: The MS-COCO and Pascal VOC datasets are split into seen and unseen categories (e.g., 48/17, 65/15 splits), with unseen classes absent from training data (Zheng et al., 2021, Khandelwal et al., 2023).
- Specialized and Domain-centric Benchmarks: Datasets for remote sensing (iSAID, NWPU-VHR-10), robotics (OCID, HIOD), and 3D mapping (ScanNet200, Replica) assess robustness across domains (Huang et al., 17 Dec 2024, Zhang et al., 5 Feb 2025, Kim et al., 13 Jun 2025).
- Metrics: Instance segmentation assessment focuses on mean Average Precision (mAP), Recall@100 (typically at IoU=0.5), harmonic mean mAP (HM-mAP), and task-specific metrics such as Intersection-over-Union (IoU) and F-measure for both seen and unseen categories.
- Empirical Findings: D²Zero demonstrates a 16.86% improvement in HM-mAP over previous methods on COCO (He et al., 2023). Gen2seg achieves performance competitive with supervised SAM, especially in recovering fine structures on iShape and COCO-exc (Khangaonkar et al., 21 May 2025). OC-DiT's diffusion-based approach achieves up to 71.7 AP on YCB-V, establishing a new state-of-the-art for template-conditioned instance segmentation (Ulmer et al., 6 Aug 2025). Lightweight models such as SupeRGB-D achieve competitive or superior zero-shot F-measure and boundary precision in hierarchical and cluttered indoor settings (2212.11922, Zhang et al., 5 Feb 2025).
5. Expansions: 3D, Robotics, and Open-vocabulary Settings
Contemporary ZSI extends beyond 2D images:
- 3D Zero-Shot Mapping and Panoptic Reconstruction: Frameworks integrate 2D open-vocabulary segmenters with 3D point cloud or mesh association, resolving feature spillover using mask voting and instance graph algorithms (Yu et al., 1 Jul 2024, Kim et al., 13 Jun 2025). OV-MAP and PanopticRecon demonstrate robust zero-shot generalization over diverse indoor and outdoor environments.
- Robotic Manipulation and Service Robots: ZISVFM exploits depth-enhanced proposals and attention-based ViT filters for reliable robotic grasping without requiring category-level supervision, validated in real-world hierarchical environments (Zhang et al., 5 Feb 2025).
- Remote Sensing: ZoRI copes with high intra-class variance and domain gap by refining text embeddings, selectively fine-tuning visual-specific channels (knowledge-maintained adaptation), and introducing cache banks of aerial prototypes (Huang et al., 17 Dec 2024).
6. Practical Considerations and Future Directions
ZSI research encompasses several practical and methodological trends:
- Scalability and Model Complexity: Simpler two-stage transfer schemes (frozen base detectors with learned projection heads) are shown to outperform more complex pipelines, provided high-quality semantic embeddings are used (Khandelwal et al., 2023).
- Generative Model Regularization: Recent work indicates that generative objective functions (e.g., instance coloring in diffusion models) lead to more natural grouping abilities that generalize across categories, suggesting a new avenue distinct from discriminative mask prediction (Khangaonkar et al., 21 May 2025, Ulmer et al., 6 Aug 2025).
- Mask Proposals and Prompt Integration: Advances in open-vocabulary segmentation (Grounded-SAM, Segment Anything) and template-proposal matching offer modularity and strong zero-shot generalization, but their practical effectiveness is bounded by the quality of initial instance proposals (Gandyra et al., 2 Jul 2025, Suo et al., 2023).
- Domain Adaptation: Methods such as knowledge-maintained adaptation (with frozen semantic-aligned channels and trainable visual-specific channels) and cache-bank fusion are essential when applying vision-language models to novel domains (e.g., remote sensing) (Huang et al., 17 Dec 2024).
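The two-stage transfer scheme noted above, a frozen base detector with a learned projection head, can be sketched in a few lines. This is a hypothetical NumPy illustration with plain SGD on an L2 regression toward class embeddings; the class name and update rule are assumptions, not a specific paper's recipe:

```python
import numpy as np

class ProjectionHead:
    """Second-stage transfer head: the detector backbone stays frozen and
    only a linear map from visual features to the semantic space is learned."""

    def __init__(self, d_vis, d_sem, lr=0.05, seed=0):
        rng = np.random.default_rng(seed)
        self.W = rng.normal(scale=0.01, size=(d_vis, d_sem))
        self.lr = lr

    def step(self, feats, targets):
        """One SGD step pulling projected features toward the class
        embeddings of their ground-truth (seen) categories.
        feats: (N, d_vis) frozen backbone features; targets: (N, d_sem)."""
        pred = feats @ self.W
        grad = feats.T @ (pred - targets) / len(feats)
        self.W -= self.lr * grad
        return float(np.mean((pred - targets) ** 2))
```

Keeping the backbone frozen means only `W` can overfit to the seen classes, which is one intuition for why such simple heads transfer well.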
Emerging research points toward a deeper exploitation of generative priors for perceptual grouping, advanced fusion of 2D and 3D open-vocabulary instances, and more sophisticated adaptation (e.g., uncertainty estimation, multimodal fusion, or transformer-based dynamic heads). Future work is likely to focus on fine-grained open-world understanding, domain-agnostic modularity, and resource-efficient designs for deployment in computationally constrained settings.