Zero-Shot Learning: Techniques & Challenges

Updated 7 April 2026
  • Zero-shot learning is a paradigm where models use auxiliary semantic information such as attributes and word embeddings to recognize unseen classes.
  • It employs methods including embedding techniques, generative models, and graph-based approaches to align visual and semantic spaces and mitigate domain shift.
  • Recent advances integrating meta-learning and knowledge-enhanced embeddings have achieved superior performance on benchmarks like CUB and ImageNet.

Zero-shot learning (ZSL) is a paradigm in machine perception and natural language understanding in which a model must recognize or assign correct semantic labels to instances from classes that were entirely unseen during training. Unlike traditional supervised learning, ZSL relies on auxiliary semantic information—such as human-defined attributes, word embeddings, or knowledge-graph representations—to transfer knowledge from seen to unseen categories. The field encompasses inductive, transductive, generative, meta-learning, and knowledge-regularized frameworks, and has achieved significant empirical success on challenging benchmarks in vision, language, and multimodal domains.

1. Problem Formulation and Core Principles

The canonical ZSL setting decomposes the label set into seen classes $\mathcal{Y}_s$ (training) and unseen classes $\mathcal{Y}_u$ (test), with $\mathcal{Y}_s \cap \mathcal{Y}_u = \emptyset$ (Cacheux et al., 2021, Saad et al., 2022). Training data take the form $\mathcal{D}_s = \{(x_i, y_i) \mid y_i \in \mathcal{Y}_s\}$, where $x_i \in \mathbb{R}^D$ is a deep feature or text representation. Each class $y \in \mathcal{Y}_s \cup \mathcal{Y}_u$ is associated with a semantic prototype $a_y \in \mathbb{R}^p$, typically binary/real attributes, word embeddings, or graph-based vectors. The objective is to learn a scoring function $f : \mathbb{R}^D \times \mathbb{R}^p \to \mathbb{R}$ such that at test time the model predicts $\hat{y} = \arg\max_{y \in \mathcal{Y}_u} f(x, a_y)$, without access to any $x$ labeled with a class in $\mathcal{Y}_u$ during training (Cacheux et al., 2021, Li et al., 2017).

Generalized ZSL (GZSL) expands the test-time label set to $\mathcal{Y}_s \cup \mathcal{Y}_u$, creating a strong bias toward seen classes and emphasizing the need for calibration or generative modeling (Verma et al., 2019, Badirli et al., 2019).
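As a minimal illustration of the formulation above, the sketch below scores a feature against class prototypes with a bilinear compatibility $f(x, a_y) = x^\top W a_y$ and, for GZSL, applies calibrated stacking (subtracting a margin from seen-class scores), a common calibration heuristic. The matrix `W`, the prototypes, and the margin `gamma` are random or arbitrary stand-ins, not a trained model:

```python
import numpy as np

rng = np.random.default_rng(0)
D, p = 6, 4                      # feature / attribute dims (toy sizes)
W = rng.normal(size=(D, p))      # bilinear compatibility matrix (random stand-in)

def score(x, a, W):
    """Bilinear compatibility f(x, a) = x^T W a."""
    return x @ W @ a

# Class prototypes: rows are attribute vectors a_y.
A_seen = rng.normal(size=(3, p))
A_unseen = rng.normal(size=(2, p))

x = rng.normal(size=D)

# Conventional ZSL: argmax over unseen classes only.
zsl_pred = int(np.argmax([score(x, a, W) for a in A_unseen]))

# GZSL with calibrated stacking: penalize seen-class scores by a margin gamma
# to counter the bias toward seen classes (gamma is a tunable assumption here).
gamma = 0.5
all_scores = np.concatenate([
    np.array([score(x, a, W) for a in A_seen]) - gamma,
    np.array([score(x, a, W) for a in A_unseen]),
])
gzsl_pred = int(np.argmax(all_scores))   # index < 3 => seen, >= 3 => unseen
```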

2. Methodological Taxonomy

2.1 Compatibility-Based and Embedding Methods

These methods define a compatibility function (bilinear, linear, or deep) between input features and class prototypes. Seminal approaches include DeViSE, SJE, ALE, ESZSL, and SAE, with optimization objectives ranging from margin-based ranking losses to closed-form regression (Saad et al., 2022, Cacheux et al., 2021). Embedding-based methods may map visual features into the semantic space (or semantic prototypes into the visual space) via direct regression, or optimize a distance in a learned joint space.
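As one concrete instance, ESZSL admits a closed-form ridge-regularized solution for the bilinear map. The sketch below follows that form on synthetic data; all matrices and the regularizers `gamma`, `lam` are illustrative stand-ins:

```python
import numpy as np

rng = np.random.default_rng(1)
D, p, N, C = 8, 5, 40, 4         # dims: feature, attribute, samples, seen classes

X = rng.normal(size=(D, N))                # column-wise visual features
labels = rng.integers(0, C, size=N)
Y = -np.ones((N, C)); Y[np.arange(N), labels] = 1.0   # +/-1 label matrix
S = rng.normal(size=(p, C))                # attribute signatures of seen classes

gamma, lam = 1.0, 1.0                      # regularizers (hyperparameters)

# ESZSL-style closed-form solution for the bilinear map V:
#   V = (X X^T + gamma I)^{-1} X Y S^T (S S^T + lam I)^{-1}
V = np.linalg.solve(X @ X.T + gamma * np.eye(D),
                    X @ Y @ S.T) @ np.linalg.inv(S @ S.T + lam * np.eye(p))

# Prediction for a test feature x against (possibly unseen) signatures:
S_unseen = rng.normal(size=(p, 2))
x = rng.normal(size=D)
pred = int(np.argmax(x @ V @ S_unseen))
```

Because the solution is closed-form, no iterative optimization is needed; only two matrix inversions at training time.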

2.2 Generative Models: VAE and GAN-Based ZSL

Generative models leverage a conditional generative network—conditional VAE or GAN—capable of synthesizing (pseudo) instances for unseen classes from semantic embeddings (Wang et al., 2017, Yu et al., 2019, Elhoseiny et al., 2019, Ting et al., 2021). Notable generative ZSLs:

  • VAE-based: Each class is associated with an attribute-conditioned Gaussian latent prior $p(z \mid a_y)$, enabling sampling of unseen-class features via the decoder $p(x \mid z)$; classification assigns the class whose prior best matches the posterior inferred from $x$ (Wang et al., 2017, Yu et al., 2019).
  • GAN-based: A generator $G(a_y, z)$ synthesizes visual features conditioned on the semantic vector $a_y$ and noise $z$; a discriminator $D$ ensures generated features are plausible and class-discriminative (Ting et al., 2021, Elhoseiny et al., 2019).
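A stripped-down sketch of the generative recipe, assuming only a linear attribute-to-feature-mean map `M` (a stand-in for a trained CVAE/GAN generator): sample pseudo-features for each unseen class, then train any supervised classifier (nearest centroid here) on the synthetic set:

```python
import numpy as np

rng = np.random.default_rng(2)
D, p = 6, 4

# Hypothetical learned attribute -> feature-mean map (random stand-in here).
M = rng.normal(size=(D, p))
sigma = 0.1

A_unseen = rng.normal(size=(3, p))      # unseen-class semantic vectors

# "Generate" pseudo visual features per unseen class from a conditional
# Gaussian N(M a_y, sigma^2 I), a stripped-down stand-in for a CVAE/GAN.
n_per_class = 50
feats, labs = [], []
for y, a in enumerate(A_unseen):
    mu = M @ a
    feats.append(mu + sigma * rng.normal(size=(n_per_class, D)))
    labs.append(np.full(n_per_class, y))
feats, labs = np.vstack(feats), np.concatenate(labs)

# Train any supervised classifier on the synthetic set; nearest centroid here.
centroids = np.stack([feats[labs == y].mean(axis=0) for y in range(3)])

def classify(x):
    return int(np.argmin(np.linalg.norm(centroids - x, axis=1)))

# A feature drawn at class 1's conditional mean should be recovered.
pred = classify(M @ A_unseen[1])
```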

Creative regularization and meta-learning (MAML-style adaptation) strategies have further advanced generative ZSL, directly addressing seen/unseen domain shift and improving performance in the few-shot regime (Verma et al., 2019, Elhoseiny et al., 2019).

2.3 Structured and Graph-Based ZSL

Graph-based ZSL exploits inter-class relationships within a knowledge graph or shared reconstruction graph (SRG). For example, SRG learns a shared reconstruction coefficient matrix allowing both semantic and feature prototypes to be written as sparse combinations of the others, reducing geometric “space-shift” between modalities (Zhao et al., 2017). ZSL-KG learns class embeddings from ConceptNet with a Transformer Graph Convolutional Network (TrGCN), capturing non-linear semantic neighborhood structure (Nayak et al., 2020).
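The shared-reconstruction idea can be sketched as follows: express each class prototype as a combination of the others, and reuse the resulting coefficient matrix across modalities. Ridge least squares stands in here for the sparse coding actually used by SRG-style methods:

```python
import numpy as np

rng = np.random.default_rng(3)
p, C = 5, 6
A = rng.normal(size=(C, p))             # rows: class semantic prototypes

# Reconstruct each prototype from the others (ridge least squares as a
# stand-in for sparse coding); diagonal is zero so no self-reconstruction.
W = np.zeros((C, C))
eps = 1e-3
for i in range(C):
    others = np.delete(A, i, axis=0)    # (C-1, p)
    # solve min_w ||A_i - w^T others||^2 + eps ||w||^2
    w = np.linalg.solve(others @ others.T + eps * np.eye(C - 1), others @ A[i])
    W[i, np.arange(C) != i] = w

# The same coefficients can then be applied to feature-space prototypes so
# both modalities share one reconstruction graph.
recon = W @ A
err = np.linalg.norm(recon - A) / np.linalg.norm(A)
```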

2.4 Latent Space and Joint Projection Models

Methods such as Latent Space Encoding (LSE) (Yu et al., 2017) and Joint Concept Matching (JCMSPL) (Tang et al., 2019) project both visual and semantic modalities into a common latent space, optimizing bidirectional reconstruction losses and enforcing class-specific anchoring, which robustly generalizes to unseen classes and mitigates projection domain shift. These formulations extend naturally to additional modalities.
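As an example of a bidirectional-reconstruction objective with a closed form, an SAE-style linear model reduces to a Sylvester equation. The sketch below (synthetic data; `lam` is an arbitrary weight) solves the stationarity condition of minimizing $\|X - W^\top S\|^2 + \lambda \|W X - S\|^2$:

```python
import numpy as np
from scipy.linalg import solve_sylvester

rng = np.random.default_rng(4)
d, k, N = 8, 3, 30               # visual dim, semantic/latent dim, samples

X = rng.normal(size=(d, N))      # visual features (columns)
S = rng.normal(size=(k, N))      # per-sample semantic codes (columns)
lam = 1.0

# SAE-style objective: min_W ||X - W^T S||^2 + lam ||W X - S||^2.
# Its stationarity condition is the Sylvester equation
#   (S S^T) W + W (lam X X^T) = (1 + lam) S X^T
W = solve_sylvester(S @ S.T, lam * (X @ X.T), (1 + lam) * S @ X.T)

# W encodes visual -> semantic; W^T decodes semantic -> visual.
proj = W @ X                     # (k, N) projected codes
recon = W.T @ S                  # (d, N) decoded features
```

Tying encoder and decoder to one matrix `W` is what makes the reconstruction bidirectional with a single closed-form solve.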

2.5 Bayesian and Meta-Class Models

Bayesian ZSL proposes a hierarchical generative model, pooling information from meta-classes constructed by semantic similarity. Posterior predictive distributions for seen and unseen classes are derived analytically, with hyperparameters directly controlling the trade-off between seen and unseen accuracy (Badirli et al., 2019).
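A heavily simplified stand-in for this pooling idea: estimate an unseen class's feature mean as a semantic-similarity-weighted blend of seen-class means, then classify by nearest estimated mean. The data, similarity kernel, and temperature `tau` are all illustrative assumptions, not the analytic posterior of the cited model:

```python
import numpy as np

rng = np.random.default_rng(5)
D, p = 6, 4

# Seen classes: empirical feature means and attribute vectors.
mu_seen = rng.normal(size=(3, D))
a_seen = rng.normal(size=(3, p))
a_unseen = rng.normal(size=(2, p))

# Pool information across a "meta-class" of semantically similar seen
# classes: the unseen mean is a similarity-weighted blend of seen means.
def unseen_mean(a, tau=1.0):
    sims = a_seen @ a                     # dot-product semantic similarity
    w = np.exp(sims / tau); w /= w.sum()  # softmax weights
    return w @ mu_seen

mu_unseen = np.stack([unseen_mean(a) for a in a_unseen])

# Predict by nearest estimated class mean over all classes (GZSL-style).
means = np.vstack([mu_seen, mu_unseen])
x = mu_seen[0].copy()                     # a query at seen class 0's mean
pred = int(np.argmin(np.linalg.norm(means - x, axis=1)))
```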

3. Challenges: Domain Shift, Hubness, and Semantic Quality

The “domain shift” phenomenon describes the failure of mapping functions learned on seen classes to generalize to unseen classes, arising from disjoint visual/semantic distributions or a semantic gap (Cacheux et al., 2021, Zhao et al., 2017). Hubness, where certain semantic prototypes act as universal nearest neighbors, impairs nearest-neighbor ZSL; mitigation strategies include normalization, hubness-reduction re-ranking, or mapping from semantic to visual space (Cacheux et al., 2021).
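Hubness can be quantified directly, for example by how skewed the distribution of nearest-neighbor occurrence counts is across prototypes. The sketch below measures $N_1$ counts on random data and shows L2 normalization as one mitigation (dimensions and data are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(6)
n_queries, n_protos, d = 200, 20, 10

Q = rng.normal(size=(n_queries, d))      # test features
P = rng.normal(size=(n_protos, d))       # class prototypes

# N_1 hub counts: how often each prototype is some query's nearest neighbor.
nn = np.argmin(np.linalg.norm(Q[:, None, :] - P[None, :, :], axis=2), axis=1)
counts = np.bincount(nn, minlength=n_protos)

# A skewed count distribution signals hubness: a few prototypes dominate.
skew = ((counts - counts.mean()) ** 3).mean() / (counts.std() ** 3 + 1e-12)

# Common mitigations: L2-normalize both spaces, or re-rank with a
# hubness-aware similarity before taking the argmin.
Qn = Q / np.linalg.norm(Q, axis=1, keepdims=True)
Pn = P / np.linalg.norm(P, axis=1, keepdims=True)
nn_norm = np.argmax(Qn @ Pn.T, axis=1)   # cosine nearest neighbor
```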

Semantic prototype quality is critical: hand-crafted attributes provide fine control but are costly and limit scalability, while distributional word-vectors lack alignment with visual similarity (Cacheux et al., 2021, Sikka et al., 2020). Knowledge-sharing approaches enrich semantic features via aggregation from similar classes or external resources, and knowledge-graph embeddings further enhance transferability (Ting et al., 2021, Nayak et al., 2020).

4. Training Paradigms: Inductive, Transductive, and Meta-Learning

The classical inductive setting restricts training to labeled seen-class instances and auxiliary class prototypes (Cacheux et al., 2021). Transductive ZSL lifts this constraint by allowing access to unlabeled unseen-class data, enabling manifold regularization, pseudo-labeling, or entropy minimization to align distributions (Wang et al., 2017). Meta-learning frameworks cast ZSL as the problem of rapid adaptation to novel classes, training the model on episodic tasks mimicking the zero-shot condition (Verma et al., 2019).
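A minimal transductive sketch, assuming initial unseen-class prototypes produced by an inductive model (random stand-ins here): alternate pseudo-labeling of the unlabeled pool with prototype updates, a k-means-style self-training loop that partially absorbs the projection domain shift:

```python
import numpy as np

rng = np.random.default_rng(7)
D = 5

# Initial unseen-class prototypes in feature space (e.g. mapped from
# attributes by an inductively trained regressor; random stand-in here).
protos = rng.normal(size=(2, D)) * 3.0

# Unlabeled unseen-class features, drawn around shifted versions of the
# prototypes to mimic projection domain shift.
shift = rng.normal(size=(2, D)) * 0.5
U = np.vstack([protos[y] + shift[y] + 0.2 * rng.normal(size=(40, D))
               for y in range(2)])

# Transductive refinement: alternate pseudo-labeling and prototype updates.
for _ in range(5):
    d2 = ((U[:, None, :] - protos[None, :, :]) ** 2).sum(axis=2)
    pseudo = d2.argmin(axis=1)               # pseudo-labels for the pool
    for y in range(2):
        if (pseudo == y).any():
            protos[y] = U[pseudo == y].mean(axis=0)
```

After refinement the prototypes sit at the empirical cluster centers of the unlabeled pool rather than at their shifted inductive estimates.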

Sparse attribute propagation (SAP) further considers annotation-richness as a spectrum, and uses graph-based approaches to propagate sparse labels or attributes to unannotated instances via structured sparsity constraints, reducing manual annotation cost and permitting augmentation with web-mined data (Fei et al., 2018).
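The propagation step can be sketched with a dense kNN graph standing in for SAP's structured-sparse formulation: annotated instances are clamped while attributes diffuse to unannotated ones (graph construction and iteration count are illustrative choices):

```python
import numpy as np

rng = np.random.default_rng(8)
n, d, p = 30, 6, 4

X = rng.normal(size=(n, d))            # instance features
A = np.zeros((n, p))                   # attribute annotations (mostly missing)
annotated = np.arange(10)              # only the first 10 instances labeled
A[annotated] = rng.normal(size=(10, p))

# Build a row-normalized kNN affinity graph over instances.
k = 5
d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=2)
np.fill_diagonal(d2, np.inf)           # exclude self-edges
Wg = np.zeros((n, n))
for i in range(n):
    nbrs = np.argsort(d2[i])[:k]
    Wg[i, nbrs] = np.exp(-d2[i, nbrs])
Wg /= Wg.sum(axis=1, keepdims=True)

# Propagate attributes along the graph, clamping the annotated rows.
A_prop = A.copy()
for _ in range(20):
    A_prop = Wg @ A_prop
    A_prop[annotated] = A[annotated]
```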

5. Empirical Benchmarks and Comparative Findings

The field has standardized on a suite of ZSL/GZSL benchmarks, such as CUB, AwA1/2, SUN, aPY, and large-scale ImageNet (Saad et al., 2022, Cacheux et al., 2021, Yu et al., 2017). SOTA methods are typically evaluated by per-class Top-1 accuracy on unseen classes or the harmonic mean between seen and unseen class accuracy (GZSL). Recent frameworks—GAN-based ZSL with creative regularization (Elhoseiny et al., 2019), meta-learned GANs (Verma et al., 2019), knowledge-enhanced embeddings (Sikka et al., 2020), and SRG (Zhao et al., 2017)—reliably outperform compatibility or regression-only ZSL on both image and text datasets.
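The two standard metrics are straightforward to compute; the helpers below implement per-class top-1 accuracy and the GZSL harmonic mean (H-score):

```python
import numpy as np

def per_class_top1(y_true, y_pred, classes):
    """Mean of per-class accuracies (the standard ZSL/GZSL metric)."""
    accs = [(y_pred[y_true == c] == c).mean() for c in classes
            if (y_true == c).any()]
    return float(np.mean(accs))

def harmonic_mean(acc_seen, acc_unseen):
    """GZSL H-score: 2 * As * Au / (As + Au)."""
    if acc_seen + acc_unseen == 0:
        return 0.0
    return 2 * acc_seen * acc_unseen / (acc_seen + acc_unseen)

# Toy check: 75% on seen vs 25% on unseen gives H = 0.375.
H = harmonic_mean(0.75, 0.25)
```

Averaging per class (rather than per instance) prevents frequent classes from dominating the score; the harmonic mean penalizes models that sacrifice unseen-class accuracy for seen-class accuracy.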

Meta-classifier ensembles, such as voting classifiers over multiple ZSL models, can deliver superior or more robust performance than any individual base model. However, the effectiveness of such ensembles is highly dataset- and metric-dependent (Saad et al., 2022).
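A hard-voting ensemble over base-model predictions can be sketched in a few lines (the model outputs below are hypothetical):

```python
import numpy as np

def majority_vote(preds):
    """Hard-voting ensemble over per-model prediction arrays.

    preds: (n_models, n_samples) integer class predictions.
    Ties resolve to the smallest class index (np.bincount/argmax behavior).
    """
    preds = np.asarray(preds)
    return np.array([np.bincount(col).argmax() for col in preds.T])

# Three hypothetical ZSL base models disagreeing on sample 1:
votes = majority_vote([[0, 1, 2],
                       [0, 2, 2],
                       [1, 2, 2]])
# votes -> [0, 2, 2]
```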

6. Open Problems and Future Directions

Persistent challenges include:

  • Semantic gap and representation quality: Integrating richer and more visually oriented class semantics, progressing beyond word-embeddings and shallow attributes.
  • Domain shift: Advanced generative and structure-transfer models continue to address, but not fully resolve, domain misalignment.
  • Open-world and incremental ZSL: Enabling continual assimilation of novel classes and semantic prototyping without retraining, possibly via meta- or lifelong learning (Cacheux et al., 2021).
  • Knowledge integration: Tightening the coupling of structured external knowledge, such as commonsense graphs, logical rules, and text, with inductive or generative ZSL remains a promising line (Nayak et al., 2020, Sikka et al., 2020).
  • Unbiased GZSL evaluation: Developing evaluation protocols and loss formulations that avoid biasing toward seen or unseen classes (Badirli et al., 2019, Verma et al., 2019).

The field has demonstrated that techniques from generative modeling, meta-learning, structured regularization, and hybrid knowledge-augmented embeddings, when tuned for semantic fidelity and robust transfer, collectively push the boundaries of zero-shot inference well beyond traditional deterministic mappings. Open avenues include robust multimodal ZSL, synthetic data augmentation, adaptive meta-class construction, and the principled integration of symbolic and neural semantic encodings (Badirli et al., 2019, Zhao et al., 2017, Sikka et al., 2020).
