
Zero-Shot Segmentation Overview

Updated 16 October 2025
  • Zero-shot segmentation is a technique that transfers learned knowledge from labeled seen classes to classify pixels for unseen object categories using semantic embeddings.
  • Approaches involve projection, generative synthesis, self-training, and contextual modeling to achieve dense predictions in applications like medical imaging and remote sensing.
  • Key challenges include bridging the visual-semantic gap, mitigating bias toward seen classes, and ensuring accurate mask localization for robust performance.

Zero-shot segmentation is a class of approaches in computer vision enabling dense prediction of semantic or instance masks for object categories that lack any annotated training samples. The central premise is to move beyond the purely label-driven paradigm by transferring knowledge from seen categories (those with available training masks or images) to unseen categories, typically by leveraging auxiliary semantic information such as word embeddings or vision-language priors. The problem formulation requires that, at inference, pixels (or points, segments, or instances) are classified into both seen and unseen categories without any labeled mask data having been observed for the unseen set.

1. Problem Definition and Technical Scope

A zero-shot segmentation approach learns to assign pixel-level (or region-level) labels to previously unseen categories by transferring learned knowledge from annotated "seen" categories. This transfer is typically mediated via a semantic embedding space, such as those constructed from pre-trained word embeddings (e.g., word2vec), sentence encoders, or vision-language models. Let $\mathcal{S}$ denote the set of seen classes and $\mathcal{U}$ the disjoint set of unseen classes, with $\mathcal{C} = \mathcal{S} \cup \mathcal{U}$ and $\mathcal{S} \cap \mathcal{U} = \emptyset$. Training is performed on images annotated only for $\mathcal{S}$, whereas evaluation requires prediction over all of $\mathcal{C}$.

Mathematically, if an image $x$ has feature representations $f_{ij} = F(x)_{ij}$ at pixel location $(i, j)$, the zero-shot segmentation task requires learning a mapping $f_{ij} \mapsto y_{ij}$ for $y_{ij} \in \mathcal{C}$, with no supervision for $y_{ij} \in \mathcal{U}$ during training. Instead, a class embedding $a[c] \in \mathbb{R}^{d}$ is provided for each class $c \in \mathcal{C}$, bridging the visual and semantic modalities.

This paradigm generalizes across semantic segmentation, instance-level segmentation, and even to 3D and video domains, as evidenced by applications in remote sensing (Huang et al., 17 Dec 2024), medical imaging (Towle et al., 2 Jun 2024), plant and agricultural settings (Ravé et al., 14 Oct 2025), and video tasks (Wang et al., 27 May 2024, Guo et al., 10 Apr 2025).

2. Methodological Taxonomy

Zero-shot segmentation frameworks can be broadly categorized along the following methodological axes:

a. Projection and Metric Learning

These approaches project pixel-wise or region-based visual features into the semantic embedding space, enabling classification based on nearest neighbor or similarity metrics. The mapping is typically learned with supervision from seen classes, then generalized to unseen classes via their semantic embeddings (Ren et al., 2022).

Example architecture:

  • DeepLabV3+ or other backbone extracts visual features.
  • For each class $c$, obtain its semantic embedding $a[c]$ (e.g., from word2vec).
  • Predict the class of each pixel by similarity: $y_{ij} = \arg\max_{c \in \mathcal{C}} S(f_{ij}, a[c])$, with $S(\cdot, \cdot)$ a similarity function (often a dot product).
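
As a concrete illustration of this projection scheme, the following minimal PyTorch sketch (not drawn from any particular cited implementation) scores per-pixel features against class embeddings by dot product; the tensor shapes and random placeholders are assumptions for illustration only.

```python
import torch

def zero_shot_pixel_labels(features, class_embeddings):
    """Assign each pixel the class whose semantic embedding is most similar.

    features:          (H, W, d) per-pixel visual features projected into the
                       semantic embedding space (e.g., from a DeepLabV3+ head).
    class_embeddings:  (C, d) embeddings for all seen + unseen classes
                       (e.g., word2vec vectors), one row per class.
    Returns:           (H, W) tensor of predicted class indices.
    """
    # Dot-product similarity between every pixel and every class embedding.
    scores = torch.einsum("hwd,cd->hwc", features, class_embeddings)
    return scores.argmax(dim=-1)

# Hypothetical usage with random tensors standing in for real features/embeddings.
feats = torch.randn(64, 64, 300)              # e.g., 300-d word2vec-aligned features
embeds = torch.randn(21, 300)                 # 21 classes (seen + unseen)
pred = zero_shot_pixel_labels(feats, embeds)  # (64, 64) label map
```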

b. Generative Feature Synthesis

Generative models attempt to directly synthesize (pseudo) visual features for unseen classes from their semantic descriptions. A common instantiation is to train a generator $G(a, z; w)$, often a GMMN or an adversarial network, so that the statistics of generated features $\hat{x} = G(a, z; w)$ match those of real features on the seen classes, typically enforced via a loss such as Maximum Mean Discrepancy (MMD) (Bucher et al., 2019). These synthetic features are then used to retrain or fine-tune the segmentation classifier so that it can classify both seen and unseen classes.
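
A minimal sketch of such a conditional feature generator is shown below, assuming a simple MLP architecture and illustrative dimensions (300-d class embeddings, 100-d noise, 256-d visual features); the cited methods use more elaborate generators and training objectives.

```python
import torch
import torch.nn as nn

class FeatureGenerator(nn.Module):
    """Maps a class embedding a and noise z to a synthetic visual feature."""
    def __init__(self, embed_dim=300, noise_dim=100, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim + noise_dim, 512),
            nn.LeakyReLU(0.2),
            nn.Linear(512, feat_dim),
        )

    def forward(self, a, z):
        # Condition on the class embedding by concatenating it with the noise.
        return self.net(torch.cat([a, z], dim=-1))

# Hypothetical synthesis of pseudo-features for unseen classes, which would
# then be mixed with real seen-class features to (re)train the classifier.
gen = FeatureGenerator()
unseen_embeds = torch.randn(5, 300)               # 5 unseen-class embeddings
a = unseen_embeds.repeat_interleave(100, dim=0)   # 100 samples per class
z = torch.randn(a.shape[0], 100)
fake_feats = gen(a, z)                            # (500, 256) synthetic features
```

In practice, the synthetic features are appended to the pool of real seen-class features before retraining the final classification layer of the segmentation network.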

c. Self-Training and Transductive Approaches

Methods such as ZS5Net (Bucher et al., 2019) apply self-training by generating pseudo-labels for pixels in unlabeled images that are predicted as unseen classes with high confidence, then retraining the classifier on these automatic labels to reduce bias toward seen categories. Transductive variants directly regularize the network to increase the likelihood of target (unseen) classes for unlabeled target images using a bias rectification loss, avoiding the need for explicit pseudo-labels (Liu et al., 2020).
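
A hedged sketch of the pseudo-labeling step is given below; the confidence threshold, class-index convention, and ignore label are illustrative assumptions rather than values from the cited papers.

```python
import torch

def pseudo_label_unseen(logits, unseen_ids, threshold=0.9, ignore_index=255):
    """Keep only confident unseen-class predictions as pseudo-labels.

    logits:     (C, H, W) per-pixel class scores on an unlabeled image.
    unseen_ids: list of unseen-class indices eligible for pseudo-labeling.
    Returns:    (H, W) label map where low-confidence or seen-class pixels
                are set to ignore_index.
    """
    probs = logits.softmax(dim=0)
    conf, pred = probs.max(dim=0)
    keep = (conf > threshold) & torch.isin(pred, torch.tensor(unseen_ids))
    return torch.where(keep, pred, torch.full_like(pred, ignore_index))

# Hypothetical usage on a 21-class model where classes 16-20 are unseen.
logits = torch.randn(21, 64, 64)
labels = pseudo_label_unseen(logits, unseen_ids=[16, 17, 18, 19, 20])
```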

d. Contextual and Graph-based Modeling

Context-aware extensions condition feature generation not only on the semantic class embedding but also on pixel-wise or global context features, enhancing the diversity and quality of synthetic features (Gu et al., 2020). Graph-based approaches encode spatial relationships and scene priors via adjacency graphs and apply graph convolutional layers to provide context-aware pixel or segment classification (Bucher et al., 2019, Shen et al., 2021).
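
The toy sketch below illustrates the basic graph-convolution step over region nodes that such approaches build on; the layer sizes and the row-normalized adjacency are illustrative assumptions, and real systems additionally condition the graph on semantics and scene priors.

```python
import torch
import torch.nn as nn

class SimpleGraphConv(nn.Module):
    """One graph-convolution step over region nodes: aggregate neighbor
    features via a row-normalized adjacency matrix, then project."""
    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, node_feats, adj):
        # node_feats: (N, in_dim) features of N connected regions/segments
        # adj:        (N, N) adjacency encoding spatial relations, rows sum to 1
        return torch.relu(self.proj(adj @ node_feats))

# Hypothetical usage with 10 region nodes and a toy normalized adjacency.
gcn = SimpleGraphConv(256, 256)
nodes = torch.randn(10, 256)
adj = torch.softmax(torch.randn(10, 10), dim=-1)
context_aware = gcn(nodes, adj)   # (10, 256) context-refined node features
```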

e. Decoupled Grouping and Classification

Segmentation can be decoupled into a class-agnostic grouping task (segmenting regions without class information) followed by segment-level classification using large vision-language models (e.g., CLIP) (Ding et al., 2021). This formulation is more closely aligned with human perception and leverages segment-to-text matching rather than pixel-level classification, mitigating confusion among visually similar classes.
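
A minimal sketch of the segment-classification stage is shown below, assuming class-agnostic masks have already been produced and pooled through a CLIP-style image encoder; the function and tensor names are placeholders, not the ZegFormer API.

```python
import torch

def classify_segments(mask_pooled_feats, text_feats):
    """Segment-level zero-shot classification via image-text similarity.

    mask_pooled_feats: (M, d) image features pooled over M class-agnostic masks,
                       encoded by a CLIP-style image encoder (assumed given).
    text_feats:        (C, d) text embeddings of class prompts, e.g.
                       "a photo of a {class}", from the matching text encoder.
    Returns:           (M,) predicted class index per segment.
    """
    img = mask_pooled_feats / mask_pooled_feats.norm(dim=-1, keepdim=True)
    txt = text_feats / text_feats.norm(dim=-1, keepdim=True)
    return (img @ txt.T).argmax(dim=-1)   # cosine similarity, best class per mask

# Hypothetical usage: 8 class-agnostic segments, 20 class prompts, 512-d space.
seg_feats = torch.randn(8, 512)
prompt_feats = torch.randn(20, 512)
labels = classify_segments(seg_feats, prompt_feats)   # (8,) class per segment
```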

f. Foundation Model Prompting and Zero-Shot Engines

With the advent of promptable models such as SAM and domain-specific backbones such as PlantNet, prompt-based zero-shot segmentation via interactive or simulated clicks has become possible (Towle et al., 2 Jun 2024, Ravé et al., 14 Oct 2025). These techniques generate candidate masks through prompt-driven segmentation backbones, often leveraging domain-specific representations or simulated user interactions.
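
As an illustration of prompt-driven segmentation with a simulated click, the sketch below uses the public segment-anything package; the checkpoint path and the all-zeros stand-in image are placeholders, and this is not the specific SimSAM or PlantNet pipeline described in the cited papers.

```python
# Requires: pip install segment-anything, plus a downloaded SAM checkpoint.
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

sam = sam_model_registry["vit_b"](checkpoint="sam_vit_b.pth")  # placeholder path
predictor = SamPredictor(sam)

image = np.zeros((512, 512, 3), dtype=np.uint8)   # stand-in for a real RGB image
predictor.set_image(image)

# A single simulated foreground click at pixel (x=256, y=256).
masks, scores, _ = predictor.predict(
    point_coords=np.array([[256, 256]]),
    point_labels=np.array([1]),          # 1 = foreground, 0 = background
    multimask_output=True,               # return several candidate masks
)
best_mask = masks[scores.argmax()]       # keep the highest-scoring candidate
```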

g. Video and 3D Zero-Shot Segmentation

Zero-shot segmentation in spatiotemporal and 3D data requires further strategies, such as integrating motion cues (e.g., via optical flow in video for camouflaged object detection) (Guo et al., 10 Apr 2025), multi-modal fusion (point clouds and images) (Lu et al., 2023), and context models for temporal consistency (Wang et al., 27 May 2024).
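
A simple motion-cue sketch using OpenCV dense optical flow is shown below; how the resulting magnitude map is converted into prompts or priors for a segmenter is method-specific and not shown here.

```python
import cv2
import numpy as np

def flow_magnitude(prev_rgb, next_rgb):
    """Dense optical-flow magnitude between two frames, a simple motion cue
    that can be turned into a prompt or prior for a promptable segmenter."""
    prev_gray = cv2.cvtColor(prev_rgb, cv2.COLOR_RGB2GRAY)
    next_gray = cv2.cvtColor(next_rgb, cv2.COLOR_RGB2GRAY)
    flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    return np.linalg.norm(flow, axis=-1)   # (H, W) per-pixel motion strength

# Hypothetical usage with two synthetic frames.
prev_f = np.zeros((240, 320, 3), dtype=np.uint8)
next_f = np.zeros((240, 320, 3), dtype=np.uint8)
motion = flow_magnitude(prev_f, next_f)    # (240, 320) flow magnitudes
```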

3. Representative Architectures and Loss Functions

Approaches unify traditional segmentation backbones (e.g., DeepLabv3+, Mask2Former) with the following components:

  • Semantic Embeddings: Extracted from pre-trained word2vec, fastText, or vision-language models, and used as class anchors for unseen objects (Bucher et al., 2019, Zhang et al., 13 Mar 2024).
  • Generator Loss (MMD): For a real feature set $X(a)$ of class $a$ and a generated set $\hat{X}(a; w)$ (a small implementation sketch follows this list),

$$L_{\mathrm{GMMN}}(a) = \sum_{x, x' \in X(a)} k(x, x') + \sum_{\hat{x}, \hat{x}' \in \hat{X}(a; w)} k(\hat{x}, \hat{x}') - 2 \sum_{x \in X(a)} \sum_{\hat{x} \in \hat{X}(a; w)} k(x, \hat{x})$$

with Gaussian kernel $k(x, x') = \exp\left(-\frac{1}{2\sigma^2} \|x - x'\|^2\right)$.

  • Recursive Supervision: High-confidence (pseudo) outputs from the classifier are weighted and fed back to the generator, with a confidence-weighted ZS-MMD loss (Wang et al., 2021).
  • Contrastive and Cross-Entropy Losses: Ensure alignment in the joint visual-semantic space and, in the transductive case, encourage the assignment of target image pixels to unseen semantic regions (Liu et al., 2020).
  • Graph-Context Encoding: Nodes represent connected regions, and features are generated with GCN conditioned on both semantic meaning and local context (Bucher et al., 2019, Shen et al., 2021).
  • Shape- and Boundary-Aware Losses: Enforce shape consistency via boundary prediction or spectral cues, augmenting standard semantic alignment (Liu et al., 2023).
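
A compact implementation sketch of the GMMN/MMD loss above is given below, using unnormalized pairwise sums so as to mirror the formula; the kernel bandwidth is an illustrative assumption.

```python
import torch

def gaussian_kernel(x, y, sigma=1.0):
    # k(x, x') = exp(-||x - x'||^2 / (2 * sigma^2)), computed for all pairs.
    d2 = torch.cdist(x, y) ** 2
    return torch.exp(-d2 / (2 * sigma ** 2))

def gmmn_mmd_loss(real_feats, fake_feats, sigma=1.0):
    """Maximum Mean Discrepancy between real and generated feature sets,
    mirroring L_GMMN above (sums over pairwise kernel values)."""
    k_rr = gaussian_kernel(real_feats, real_feats, sigma).sum()
    k_ff = gaussian_kernel(fake_feats, fake_feats, sigma).sum()
    k_rf = gaussian_kernel(real_feats, fake_feats, sigma).sum()
    return k_rr + k_ff - 2 * k_rf

# Hypothetical usage: compare 64 real and 64 generated 256-d features.
real = torch.randn(64, 256)
fake = torch.randn(64, 256)
loss = gmmn_mmd_loss(real, fake)
```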

4. Performance Evaluation and Experimental Benchmarks

Standard benchmarks include PASCAL VOC, PASCAL Context, COCO-Stuff, iSAID, NWPU-VHR-10, SemanticKITTI, and nuScenes. Key metrics:

  • Mean Intersection-over-Union (mIoU): Computed separately for seen and unseen classes, together with their harmonic mean for balanced reporting (see the sketch after this list).
  • Pixel/Mean Accuracy: Fraction of correctly classified pixels.
  • mAP, Recall@100: Used for instance segmentation tasks.
  • Domain-Specific Indices: Jaccard index in plant segmentation (Ravé et al., 14 Oct 2025), Dice similarity and normalised surface distance (NSD) for medical imaging (Towle et al., 2 Jun 2024), specialized detection success measures for camouflaged objects (Guo et al., 10 Apr 2025).
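
For reference, the harmonic-mean reporting mentioned above reduces to the following small helper; the example values are arbitrary.

```python
def harmonic_mean_iou(miou_seen, miou_unseen):
    """Harmonic mean of seen/unseen mIoU, the balanced metric used in
    generalized zero-shot segmentation evaluation."""
    if miou_seen + miou_unseen == 0:
        return 0.0
    return 2 * miou_seen * miou_unseen / (miou_seen + miou_unseen)

# e.g., harmonic_mean_iou(0.75, 0.30) ≈ 0.4286
```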

Reported gains include, for example:

  • mIoU improvements of 4.5 and 3.6 points on PASCAL VOC 2012 and COCO-Stuff 164K, respectively, for LDVC (Zhang et al., 13 Mar 2024).
  • Up to 15.5% improvement in contour segmentation accuracy for SimSAM over zero-shot SAM (Towle et al., 2 Jun 2024).
  • Substantial gains on zero-shot instance benchmarks with advanced pipelines (e.g., ZoRI improving HM-mAP by 14–21% over competitors in remote sensing (Huang et al., 17 Dec 2024)).
  • Zero-shot approaches occasionally outperforming supervised methods, as observed with ZS-VCOS on MoCA-Mask (Guo et al., 10 Apr 2025).

5. Contextual Extensions and Applications

Zero-shot segmentation approaches have been systematically extended beyond natural image benchmarks:

  • Medical Imaging: Simulated user interactions and mask aggregation in SimSAM yield improved segmentation for ambiguous anatomical boundaries (Towle et al., 2 Jun 2024).
  • Remote Sensing: Discrimination-enhanced classifiers and knowledge-maintained adaptation (KMA) overcome domain gaps and large intra-class variance (Huang et al., 17 Dec 2024).
  • Plant and Agricultural Images: PlantNet’s domain-aware features, integrated with DinoV2 and SAM, deliver improved segmentation in conditions with limited labeled data (Ravé et al., 14 Oct 2025).
  • Video and Camouflage Detection: Integration of optical flow, open-vocabulary detection, and promptable segmentation models address the challenges posed by indistinguishable (camouflaged) foregrounds (Guo et al., 10 Apr 2025).
  • 3D Point Cloud Data: Multi-modal fusion modules exploiting both 3D LiDAR and RGB images via semantic-guided visual fusion strongly improve mIoU for unseen 3D classes (Lu et al., 2023).

6. Challenges, Open Problems, and Future Directions

Despite measurable advances, zero-shot segmentation approaches face several persistent challenges:

  • Cross-Modal Gap: Bridging distributional disparities between visual features and semantic embeddings remains a limiting factor, especially in domains with little linguistic support or for classes poorly represented in the underlying language models.
  • Bias Toward Seen Classes: Most methods require explicit mechanisms (self-training, bias rectification losses, or counterfactual reasoning) to mitigate over-prediction for seen categories (Liu et al., 2020, Shen et al., 2021).
  • Domain Shift and Robustness: Robustness to domain transfer, context variation, or adversarial settings is an open research question, as highlighted by the performance in remote sensing and agricultural applications (Huang et al., 17 Dec 2024, Ravé et al., 14 Oct 2025).
  • Localization and Mask Quality: Fundamental limitations of vision-language models in pixel-level localization often necessitate explicit inclusion of shape, context, or spectral decomposition techniques (Liu et al., 2023).
  • Scalability and Efficient Training: Methods such as SimZSS (Stegmüller et al., 23 Jun 2024) advocate decoupled training—freezing the vision backbone and training only the text encoder—to enable rapid adaptation on large-scale datasets with minimal compute.
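
A generic sketch of such decoupled training is shown below; the module names and optimizer settings are assumptions for illustration, not the SimZSS implementation.

```python
import torch
import torch.nn as nn

def build_decoupled_optimizer(vision_backbone: nn.Module,
                              text_encoder: nn.Module,
                              lr: float = 1e-4):
    """Freeze the vision backbone and optimize only the text encoder,
    in the spirit of decoupled training schemes such as SimZSS."""
    for p in vision_backbone.parameters():
        p.requires_grad_(False)
    vision_backbone.eval()                       # also freeze normalization stats
    return torch.optim.AdamW(text_encoder.parameters(), lr=lr)

# Hypothetical usage with toy modules standing in for the real encoders.
backbone = nn.Linear(512, 512)
text_enc = nn.Linear(512, 512)
optimizer = build_decoupled_optimizer(backbone, text_enc)
```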

A plausible implication is that future research will further combine domain-specific priors, advanced modality fusion strategies, prompt engineering for universal segmentation backbones, and unsupervised/self-supervised spectral or affinity-based methods to address the remaining limitations.

7. Summary Table: Key Paradigms and Exemplary Methods

| Paradigm | Representative Method(s) | Key Component(s) |
| --- | --- | --- |
| Projection/Metric | SPNet, Transductive ZSS (Liu et al., 2020) | Visual-semantic mapping, bias loss |
| Generative Synthesis | ZS3Net (Bucher et al., 2019), ZS-MMD (Wang et al., 2021) | Feature generator, MMD/ZS-MMD losses |
| Self-Training | ZS5Net (Bucher et al., 2019) | Pseudo-labels, confidence sampling |
| Contextual/Graph-based | CaGNet (Gu et al., 2020), GCN+GC encoding (Bucher et al., 2019) | Context modules, graph convolutions |
| Decoupled Group+Classify | ZegFormer (Ding et al., 2021) | Segment-level CLIP classification |
| Prompt-based/Foundation Model | SimSAM (Towle et al., 2 Jun 2024), PlantNet+SAM (Ravé et al., 14 Oct 2025) | Simulated interaction, prompt-driven segmentation |
| Video/3D Fusion | ZS-VCOS (Guo et al., 10 Apr 2025), 3D Multi-modal (Lu et al., 2023) | Motion cues, LiDAR-image fusion |

This survey of zero-shot segmentation approaches demonstrates the confluence of semantic embedding transfer, generative modeling, contextual reasoning, and promptable dense prediction, with continuing research focused on overcoming cross-domain gaps, improving mask granularity, and minimizing bias toward seen classes.
