Zero-Shot Segmentation
- Zero-shot segmentation is a computer vision approach that segments unseen object categories by utilizing semantic embeddings, language descriptions, and contextual cues.
- It extends zero-shot learning to dense prediction tasks, enabling pixel-wise classification of both seen and unseen classes while reducing annotation efforts.
- Recent methods incorporate generative models, vision-language frameworks like CLIP, and bias rectification techniques to improve segmentation accuracy and open-world adaptability.
Zero-shot segmentation is a computer vision paradigm in which a model is required to perform semantic, instance, or panoptic segmentation for object categories or semantic concepts for which there are no corresponding pixel-level training annotations. Instead, the model must leverage auxiliary information—typically in the form of semantic embeddings, language descriptions, or structural knowledge—extracted from seen classes, in order to generalize to never-seen target categories. This capability is of significant theoretical and practical interest, as it reduces the annotation burden for dense prediction tasks, and enables models to operate in continually evolving open-world environments.
1. Foundations and Formulation
Zero-shot segmentation (ZSS) builds upon the foundations of zero-shot learning (ZSL) established in image classification but extends them to the dense, structured output regime of segmentation. In the canonical setup, the set of all categories is partitioned into seen classes $\mathcal{C}_s$ (with pixel-wise labeled examples) and unseen classes $\mathcal{C}_u$ (with none). The most widely studied problem is generalized zero-shot segmentation (GZSS), where both $\mathcal{C}_s$ and $\mathcal{C}_u$ may appear at test time, and the model must not only assign pixels of novel classes correctly, but also preserve performance on seen classes (Bucher et al., 2019).
Key elements include:
- Semantic Embeddings: Semantic categories are represented by continuous vectors, typically extracted from word embedding models such as word2vec, GloVe, or fastText, or from vision-language models such as CLIP.
- Pixel-wise Prediction: For an input image $x$, the model predicts a pixel-wise labeling $\hat{y}$, where each pixel label $\hat{y}_i \in \mathcal{C}_s \cup \mathcal{C}_u$, conditioned only on semantic knowledge for unseen classes.
- Auxiliary Knowledge Transfer: Knowledge is transferred from $\mathcal{C}_s$ to $\mathcal{C}_u$ via architectural or optimization constraints so that semantic structure, visual features, or contextual relationships are preserved.
If the semantic embedding of class $c$ is denoted $a_c$ and the visual feature of pixel $i$ as $f_i$, almost all methods rely on a compatibility (similarity) function $s(f_i, a_c)$, which forms the basis for the zero-shot assignment $\hat{y}_i = \arg\max_{c} s(f_i, a_c)$.
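As a concrete illustration, the following minimal sketch implements this assignment rule with cosine similarity as the compatibility function, assuming per-pixel features have already been projected to the dimensionality of the class embeddings; the temperature value and tensor layout are illustrative and not tied to any particular method.

```python
import torch
import torch.nn.functional as F

def zero_shot_assign(pixel_feats: torch.Tensor,
                     class_embeds: torch.Tensor,
                     temperature: float = 0.07) -> torch.Tensor:
    """Assign each pixel to the class whose semantic embedding is most compatible.

    pixel_feats:  (H, W, D) per-pixel visual features f_i, projected to the same
                  dimensionality D as the class embeddings.
    class_embeds: (C, D) semantic embeddings a_c for all seen and unseen classes.
    Returns:      (H, W) map of predicted class indices.
    """
    f = F.normalize(pixel_feats, dim=-1)               # unit-norm pixel features
    a = F.normalize(class_embeds, dim=-1)              # unit-norm class embeddings
    logits = torch.einsum("hwd,cd->hwc", f, a) / temperature  # s(f_i, a_c)
    return logits.argmax(dim=-1)                       # hard zero-shot assignment
```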
2. Generative and Embedding-Based Approaches
Early approaches introduced a two-branch mechanism: a deep segmentation network for extracting visual features, and a semantic embedding model for category transfer (Bucher et al., 2019). The ZS3Net architecture exemplifies this class:
- Pixel Feature Synthesis via Word Embeddings: A generative model $G$, such as a Generative Moment Matching Network (GMMN), is trained to generate visual pixel features $\hat{f} = G(a_c, z)$ conditioned on the word embedding $a_c$ and a random noise vector $z$. The GMMN objective matches the distribution of generated features to that of real pixel features from seen classes using the maximum mean discrepancy (MMD), $\mathcal{L}_{\mathrm{MMD}} = \big\| \tfrac{1}{N}\sum_{i}\phi(f_i) - \tfrac{1}{M}\sum_{j}\phi(\hat{f}_j) \big\|_{\mathcal{H}}^2$, computed in practice with Gaussian kernels (see the sketch after this list).
- Classifier Hybridization: The final classifier in ZS3Net is retrained on a hybrid feature set: true features from seen classes and synthetic features for unseen classes, yielding a unified pixel classifier over $\mathcal{C}_s \cup \mathcal{C}_u$.
- Extension with Contextual Graphs: Generators may be augmented with spatial graph-context encoding, where segmentation masks are represented as adjacency graphs and additional graph convolution layers enforce spatial priors, as in complex scenes in Pascal-Context.
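A minimal sketch of the kernel form of the MMD objective is given below, assuming real and generated features are provided as (N, D) and (M, D) tensors; the multi-bandwidth kernel set is an illustrative choice and not necessarily the one used in ZS3Net.

```python
import torch

def gaussian_kernel(x: torch.Tensor, y: torch.Tensor, sigma: float) -> torch.Tensor:
    """RBF kernel matrix between feature sets of shape (N, D) and (M, D)."""
    dist2 = torch.cdist(x, y) ** 2
    return torch.exp(-dist2 / (2.0 * sigma ** 2))

def mmd_loss(real_feats: torch.Tensor, fake_feats: torch.Tensor,
             sigmas=(2.0, 5.0, 10.0, 20.0, 40.0)) -> torch.Tensor:
    """Biased (V-statistic) estimate of MMD^2 between real pixel features from
    seen classes and generated features G(a_c, z), summed over kernel bandwidths.
    Minimizing it drives the two feature distributions to match."""
    loss = real_feats.new_zeros(())
    for s in sigmas:
        loss = loss + gaussian_kernel(real_feats, real_feats, s).mean()
        loss = loss + gaussian_kernel(fake_feats, fake_feats, s).mean()
        loss = loss - 2.0 * gaussian_kernel(real_feats, fake_feats, s).mean()
    return loss
```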
Patch-based and context-aware feature generation further enhance generalization. In CaGNet (Gu et al., 2020), features are synthesized in a context-aware manner, where the generator receives both word embeddings and contextual latent codes (sampled via a contextual module) to match pixel-level context. Patch-wise extensions use PixelCNN to account for inter-pixel relationships, enabling realistic feature patches for classifier fine-tuning.
3. Bias Rectification and Transductive Learning
A persistent problem in ZSS is bias toward seen classes, particularly in the GZSS regime. Supervised training on only seen pixels strongly anchors the network's feature space around $\mathcal{C}_s$, impeding transfer. Several strategies address this:
- Bias Rectification Loss: In transductive ZSS (Liu et al., 2020), unlabeled target images (containing unseen classes) are used during training. The network is regularized by an auxiliary loss that maximizes the summed probability over all target (unseen) classes for each pixel in target images,
$$\mathcal{L}_{\mathrm{rect}} = -\frac{1}{|\mathcal{P}_t|} \sum_{i \in \mathcal{P}_t} \log \sum_{c \in \mathcal{C}_u} p_i(c),$$
where $\mathcal{P}_t$ denotes the set of pixels in target images and $p_i(c)$ the predicted class posterior. This encourages pixels in target images to place high probability mass on $\mathcal{C}_u$, explicitly rectifying the seen-class bias (a code sketch follows this list).
- Self-Training with Pseudo-Labeling: Models such as ZS5Net (Bucher et al., 2019) and recursive schemes (Wang et al., 2021) iteratively select high-confidence pseudo-labels for unseen classes and retrain, gradually mitigating class bias.
- Counterfactual Deconfounding: Shen et al. (2021) introduce a counterfactual rationale: the causal path from real features to fake (generated) features can induce spurious seen-class bias. By explicitly removing this bias with counterfactual interventions (computed via Total Effect, Natural Direct Effect, and Natural Indirect Effect), a deconfounded output is fused with the standard model and further improved by semantic message passing via GCNs.
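A minimal sketch of the bias rectification term described above, assuming the segmentation head outputs per-pixel logits over the union of seen and unseen classes; the exact weighting and scheduling used in the transductive setting may differ.

```python
import torch
import torch.nn.functional as F

def bias_rectification_loss(logits: torch.Tensor, unseen_idx: torch.Tensor) -> torch.Tensor:
    """Auxiliary loss on unlabeled target images: maximize, per pixel, the total
    probability assigned to the unseen classes.

    logits:     (B, C, H, W) segmentation scores over all classes, C = |Cs| + |Cu|.
    unseen_idx: 1-D LongTensor holding the indices of the unseen classes.
    """
    probs = F.softmax(logits, dim=1)                        # per-pixel class posterior
    unseen_mass = probs.index_select(1, unseen_idx).sum(1)  # (B, H, W), mass on Cu
    return -(unseen_mass.clamp_min(1e-8).log()).mean()      # negative log, averaged over pixels
```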
4. Vision-Language Models and Decoupled Grouping
The advent of large-scale vision-language models (e.g., CLIP) enabled direct exploitation of joint visual and semantic embeddings:
- Segment-Level Decoupling: ZegFormer (Ding et al., 2021) decouples segmentation into class-agnostic grouping (e.g., transformer-based mask generation) and segment-level zero-shot classification. Each segment is labeled by matching a learned segment embedding $e_s$ with the CLIP text embeddings $t_c$ via cosine similarity and temperature scaling:
$$p(c \mid s) = \frac{\exp\!\big(\cos(e_s, t_c)/\tau\big)}{\sum_{c' \in \mathcal{C}} \exp\!\big(\cos(e_s, t_{c'})/\tau\big)}.$$
Here, segment queries and prompt-based text encoding enable efficient transfer of CLIP's semantic understanding to dense prediction, overcoming the weak pixel-wise alignment of CLIP features (see the sketch after this list).
- Optimal Transport Methods: ZegOT (Kim et al., 2023) leverages multiple trainable text prompts for each class. The mapping between pixel-level features and text embeddings is performed via optimal transport (Sinkhorn) for maximal one-to-many flexibility, aligning multi-prompt text features to image features from frozen CLIP. This achieves state-of-the-art unseen class segmentation without costly retraining.
- Prompt Tuning and Route Attention: Recent work implements joint deep prompt tuning for both the vision and language encoders (Zhang et al., 2024), together with local-consensus, route-based self-attention that enforces regional consistency and combats mask fragmentation, using learnable prompt tokens to enhance segmentation reliability.
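The segment-level classification step can be summarized by the short sketch below, assuming segment embeddings have already been projected into CLIP's joint embedding space; the prompt template and temperature are illustrative rather than ZegFormer's exact settings.

```python
import torch
import torch.nn.functional as F

def classify_segments(segment_embeds: torch.Tensor,
                      text_embeds: torch.Tensor,
                      tau: float = 0.01) -> torch.Tensor:
    """Segment-level zero-shot classification via CLIP text matching.

    segment_embeds: (S, D) learned embeddings, one per class-agnostic segment query.
    text_embeds:    (C, D) CLIP text embeddings of class-name prompts
                    (e.g., "a photo of a {class name}").
    Returns:        (S, C) class probabilities p(c | segment).
    """
    e = F.normalize(segment_embeds, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = e @ t.T / tau              # cosine similarity with temperature scaling
    return logits.softmax(dim=-1)
```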
5. Beyond 2D: Open-Vocabulary, 3D, and Video Zero-Shot Segmentation
ZSS has been extended beyond conventional 2D semantic segmentation:
- Open-Vocabulary Segmentation: SimZSS (Stegmüller et al., 2024) demonstrates that open-vocabulary zero-shot segmentation can be efficiently achieved by freezing a spatially aware vision model (e.g., DINOv2) and training a text encoder to align phrase-level (noun-phrase) representations from captions with local visual embeddings. The cross-modality, concept-level loss provides accurate spatial grounding from only paired image-caption data.
- 3D and Point Cloud Segmentation: SATR (Abdelreheem et al., 2023) and MeshSegmenter (Zhong et al., 2024) apply zero-shot “lifting” approaches to 3D segmentation: they render meshes from multiple views, apply 2D zero-shot detectors/segmenters (e.g., Segment Anything) to the renderings, and aggregate predictions across faces via topology-aware voting or confidence re-voting, using additional text-to-texture synthesis where geometry is non-prominent (a simple vote-aggregation sketch follows this list).
- Video Semantic Segmentation: Zero-shot methods for video (Wang et al., 2024) build on pre-trained diffusion models, leveraging temporally consistent scene-context modeling and correspondence-based refinement to generate per-frame coarse segmentations, which are then upsampled to high quality using masked spatial modulation, all without explicit video supervision.
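As a simplified view of the multi-view lifting step, the sketch below aggregates per-face class votes across rendered views by majority voting, assuming each view provides a per-pixel face-index map from the renderer and a 2D zero-shot segmentation of the rendering; the topology-aware and confidence-based weighting schemes of SATR and MeshSegmenter are omitted.

```python
import numpy as np

def aggregate_face_votes(face_ids_per_view, class_maps_per_view,
                         num_faces: int, num_classes: int) -> np.ndarray:
    """Lift 2D zero-shot segmentations onto a mesh by majority voting over views.

    face_ids_per_view:   list of (H, W) int arrays; entry [y, x] is the mesh-face
                         index visible at that pixel in the rendered view, or -1.
    class_maps_per_view: list of (H, W) int arrays with the 2D zero-shot labels
                         predicted for the same renderings.
    Returns: (num_faces,) array of winning class indices (-1 if a face is never visible).
    """
    votes = np.zeros((num_faces, num_classes), dtype=np.int64)
    for face_ids, classes in zip(face_ids_per_view, class_maps_per_view):
        visible = face_ids >= 0
        np.add.at(votes, (face_ids[visible], classes[visible]), 1)  # accumulate votes
    labels = votes.argmax(axis=1)
    labels[votes.sum(axis=1) == 0] = -1   # faces not covered by any view
    return labels
```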
The open-vocabulary regime is enabled by vision-language contrastive models and prompt-aware adapters, providing strong generalization to arbitrary categories described in free-text.
6. Additional Developments: Objective Misalignment and Robustness
Recent research has identified obstacles and explored robustness improvements:
- Objective Misalignment: AlignZeg (Ge et al., 2024) addresses the problem wherein standard training objectives favor seen-class recognition due to dense supervision, yielding feature spaces over-occupied by seen prototypes. AlignZeg introduces mutually refined proposal extraction, synthetic feature expansion, multiple learned background prototypes, and a predictive bias correction module that actively suppresses seen-class bias in the inference logits, thereby improving hIoU by 3.8% on COCO-Stuff and unseen-class mIoU by 7.1%.
- Shape-Awareness and Spectral Cues: SAZS (Liu et al., 2023) combines vision-language pixel embedding alignment with auxiliary shape constraints (boundary head predicting edges) and spectral eigensegment fusion, showing that embedding locality and mask compactness govern the utility of shape priors for zero-shot generalization.
- Partial CLIP and Knowledge Distillation: Chimera-Seg (Chen et al., 2025) fuses spatially precise segmentation models with partial, frozen CLIP visual encoder components, using selective global distillation (SGD) and a semantic alignment module (SAM) to bridge the gap between local pixel-level features and global vision-language representations, producing consistent hIoU improvements.
7. Benchmarks, Evaluation, and Future Directions
Prominent benchmarks include Pascal-VOC, Pascal-Context, COCO-Stuff, Cityscapes, ADE20K, SemanticKITTI, nuScenes, and ShapeNetPart, using zero-shot splits with varying numbers of unseen classes. Key metrics are:
- Pixel Accuracy (PA), Mean Accuracy (MA), mean Intersection-over-Union (mIoU), the harmonic mean of seen- and unseen-class mIoU (hIoU), and task-specific recall/mAP for instance or 3D settings (hIoU is computed as sketched below).
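For reference, hIoU is the standard GZSS summary metric; the snippet below shows the computation, with the example numbers purely illustrative.

```python
def harmonic_iou(miou_seen: float, miou_unseen: float) -> float:
    """hIoU: harmonic mean of mIoU over seen and unseen classes, rewarding
    models that perform well on both partitions simultaneously."""
    if miou_seen + miou_unseen == 0.0:
        return 0.0
    return 2.0 * miou_seen * miou_unseen / (miou_seen + miou_unseen)

# Illustrative example: 40.1 mIoU on seen classes, 20.3 mIoU on unseen classes.
print(harmonic_iou(40.1, 20.3))  # ≈ 26.96
```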
Current challenges include class imbalance, segmentation of small or context-entangled regions, and temporal consistency in video. Directions for future research as outlined in the literature include:
- Broader exploitation of large-scale vision-language models and prompt engineering.
- Improved spatial and temporal feature alignment, possibly with more sophisticated attention mechanisms or causal modeling.
- Efficient open-vocabulary and continual learning techniques for open-world deployment.
- Application of spectral, causal, and graph-based priors to further boost generalization.
- Reducing computational overhead and optimizing parameter-efficient prompt tuning strategies.
Zero-shot segmentation has become a crucial subfield of computer vision, providing methods that extend dense prediction to open, realistic environments where annotation for every category or scene configuration is infeasible.