Zero-Shot Classification
- Zero-shot classification is a paradigm that uses semantic attributes and language embeddings to assign inputs to unseen classes without direct training examples.
- It employs methods like deep embedding, generative models, and graph-based reasoning to map inputs and class descriptions into a shared semantic space.
- Recent advances improve transfer learning accuracy in vision, language, and audio tasks, with state-of-the-art methods achieving robust performance under diverse evaluation protocols.
Zero-shot classification refers to the task of assigning inputs (such as images, utterances, or documents) to classes for which no labeled training examples are available. This paradigm enables generalization beyond the set of classes exposed to the model during training by leveraging auxiliary information, such as semantic attributes, natural language descriptions, or structural category representations. Zero-shot classification is a central topic in transfer learning, addressing scenarios where collecting labeled data for all possible categories is infeasible. Recent years have seen effective methodologies developed for vision, language, audio, and cross-modal domains, reflecting both foundational principles and domain-specific adaptations.
1. Foundational Principles and General Framework
In zero-shot classification (ZSC), training is performed exclusively on a set of “seen” categories, while during inference, the model is expected to correctly classify instances from “unseen” classes. A crucial ingredient is semantic side-information that describes all classes, including those unavailable during training. This is typically achieved using semantic attributes, word embeddings, or textual descriptions.
The general workflow comprises three stages:
- Embedding both inputs and class descriptions into a shared semantic or feature space (e.g., an embedding φ(x) for input x, with class prototypes s_c for each class c),
- Computing similarities or distances between an input’s embedding and each class embedding,
- Assigning the class with maximal similarity or minimal distance.
For example, the probabilistic assignment in (Dauphin et al., 2013) takes the form P(c | x) = exp(−d(φ(x), s_c)) / Z, where Z = Σ_{c′} exp(−d(φ(x), s_{c′})) normalizes over all candidate classes, and d is a suitable distance metric. This general approach—embedding alignment plus distance-based assignment—is ubiquitously adopted, although instantiations vary across domains and architectures.
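The three-stage workflow above can be sketched in a few lines; this is a minimal toy illustration, where the class names, prototype vectors, Euclidean distance, and softmax normalization are all illustrative assumptions rather than any specific published model:

```python
import numpy as np

# Illustrative class prototypes in a toy 2-D semantic space
# (in practice these come from attributes or word embeddings).
prototypes = {
    "zebra": np.array([1.0, 0.0]),
    "horse": np.array([0.6, 0.4]),
    "whale": np.array([0.0, 1.0]),
}

def zero_shot_assign(x_embedding, prototypes):
    """Softmax over negative distances to each class prototype."""
    classes = list(prototypes)
    dists = np.array([np.linalg.norm(x_embedding - prototypes[c])
                      for c in classes])
    probs = np.exp(-dists)
    probs /= probs.sum()  # normalizer over all candidate classes
    return classes[int(np.argmax(probs))], dict(zip(classes, probs))

label, probs = zero_shot_assign(np.array([0.95, 0.05]), prototypes)
```

Since the softmax is monotone in the negative distance, the maximal-probability class here is exactly the minimal-distance class, matching the third workflow stage.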
2. Semantic Embedding Construction and Transfer Mechanisms
A critical challenge in zero-shot classification is constructing informative and discriminative semantic embeddings for both inputs and classes. Several strategies have been proposed:
- Attribute- and Language-based Embeddings: Class names or short descriptions are embedded using word vectors, attribute sets, or learned language representations (e.g., word2vec, GloVe, LSTM representations) (Fu et al., 2014, Olson et al., 2020). These serve as class “signatures” for unseen categories.
- Deep Semantic Representation Learning: Deep neural networks are trained to map the input (utterance, image, document) to a feature space that aligns with class prototypes (Dauphin et al., 2013, Li et al., 2017). The semantic space may be learned in a supervised or self-supervised fashion.
- Graph- and Structure-based Semantic Models: Relations among classes (seen and unseen) can be modeled explicitly using semantic graphs, where nodes represent classes and edges encode similarities (as in word embedding cosine similarity) (Fu et al., 2014, Li et al., 2017). Graph-based propagation algorithms leverage these structures for transductive or inductive label inference.
For fine-grained recognition, hierarchical semantic structures (e.g., a family → genus → species taxonomy) and graph constructions support label propagation and semantic smoothing across related classes.
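A semantic graph of the kind described above can be built directly from class embeddings; the following sketch uses hand-picked toy vectors (a real system would use word2vec/GloVe embeddings) and cosine similarity as the edge weight:

```python
import numpy as np

# Hypothetical class embeddings; names and values are illustrative.
emb = {
    "cat":   np.array([0.9, 0.1, 0.0]),
    "tiger": np.array([0.8, 0.3, 0.1]),
    "truck": np.array([0.0, 0.2, 0.9]),
}

def semantic_graph(emb):
    """Build a graph whose edge weights are cosine similarities
    between class embeddings (nodes = classes)."""
    names = list(emb)
    M = np.stack([emb[n] for n in names])
    M = M / np.linalg.norm(M, axis=1, keepdims=True)  # unit-normalize rows
    W = M @ M.T                                       # pairwise cosine sims
    np.fill_diagonal(W, 0.0)                          # drop self-loops
    return names, W

names, W = semantic_graph(emb)
```

Semantically close classes (cat/tiger) receive a stronger edge than distant ones (cat/truck), which is what propagation algorithms exploit to share labels across related categories.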
3. Training Paradigms and Classifiers
Methodologies for zero-shot classification can be grouped as follows:
- Embedding-based Approaches: Models are trained to directly project inputs and class descriptions into a joint semantic space. Classification is performed by nearest-neighbor search or compatibility scoring, ŷ = argmax_c f(x, s_c) (Bucher et al., 2017), where f is a parameterized similarity function.
- Conditional Generative Models: To overcome limitations of embedding methods (including the inability to use strong discriminative models and biases in generalized settings), conditional generators are trained to synthesize feature vectors for unseen classes from their semantic descriptions. These features turn ZSC into a standard supervised problem (Bucher et al., 2017).
- Label Propagation and Structural Alignment: For settings where semantic graphs are available, label propagation over a constructed similarity graph refines initial predictions and encourages coherence among semantically related classes (Li et al., 2017).
- Dictionary and Sparse Coding Approaches: Visual features and semantic attributes are modeled via coupled dictionaries that share a sparse code, enabling cross-modal transfer from seen to unseen classes (Rostami et al., 2019).
Adaptive training objectives further regularize the semantic embedding (e.g., with entropy minimization (Dauphin et al., 2013), domain adaptation losses (Li et al., 2017), or attribute-aware regularization (Rostami et al., 2019)) to achieve more discriminative, domain-invariant, and semantically aligned feature spaces.
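The conditional-generative idea above can be caricatured in a few lines. In the sketch below the "generator" is simply the class semantics plus Gaussian noise, standing in for a trained GMMN/GAN/autoencoder; the class names, semantic vectors, and the nearest-centroid classifier are all illustrative assumptions, not the actual pipeline of (Bucher et al., 2017):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical semantic descriptions for two unseen classes.
unseen_semantics = {"okapi": np.array([1.0, 0.0]),
                    "narwhal": np.array([0.0, 1.0])}

def synthesize_features(semantics, n_per_class=50, noise=0.1):
    """Stand-in generator: sample features around each class's semantics.
    A real conditional generator would be trained on seen-class pairs."""
    X, y = [], []
    for label, s in semantics.items():
        X.append(s + noise * rng.standard_normal((n_per_class, s.size)))
        y += [label] * n_per_class
    return np.vstack(X), y

def nearest_centroid_predict(X, y, queries):
    """Standard supervised step: fit centroids on synthetic features."""
    labels = sorted(set(y))
    cent = {c: X[[i for i, t in enumerate(y) if t == c]].mean(axis=0)
            for c in labels}
    return [min(cent, key=lambda c: np.linalg.norm(q - cent[c]))
            for q in queries]

X, y = synthesize_features(unseen_semantics)
pred = nearest_centroid_predict(X, y, [np.array([0.9, 0.1])])
```

The point of the construction is the last two lines: once features for unseen classes exist, any off-the-shelf discriminative classifier can be trained, which is exactly how these methods turn ZSC into a standard supervised problem.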
4. Evaluation Protocols and Applications
Standard evaluation of zero-shot classification occurs on benchmark datasets divided into disjoint seen and unseen classes. Protocols include:
- Transductive vs. Inductive: Transductive methods may access unlabeled test data during training (e.g., for label propagation), while inductive methods use only labeled data from seen classes.
- Generalized Zero-Shot Classification (GZSC): Models must classify test samples from both seen and unseen classes, which requires mitigating the typical bias toward seen categories (Bucher et al., 2017).
- Risk-Coverage Analysis: Certain work (e.g., selective ZSL (Song et al., 2018)) evaluates how models can abstain from low-confidence predictions, balancing risk and coverage especially for safety-critical uses.
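The risk-coverage protocol can be sketched with a simple thresholded abstention rule; the threshold value and toy probability matrix below are illustrative assumptions, not the selective-ZSL mechanism of (Song et al., 2018):

```python
import numpy as np

def selective_predict(probs, threshold=0.7):
    """Abstain (return None) when top-class confidence is below threshold."""
    top = int(np.argmax(probs))
    return top if probs[top] >= threshold else None

def risk_coverage(prob_matrix, labels, threshold):
    """Risk = error rate on answered samples; coverage = fraction answered."""
    preds = [selective_predict(p, threshold) for p in prob_matrix]
    answered = [(pr, y) for pr, y in zip(preds, labels) if pr is not None]
    coverage = len(answered) / len(labels)
    risk = (sum(pr != y for pr, y in answered) / len(answered)
            if answered else 0.0)
    return risk, coverage

# Toy predictions: the middle sample is too uncertain and is abstained on.
P = np.array([[0.9, 0.1], [0.55, 0.45], [0.2, 0.8]])
risk, cov = risk_coverage(P, [0, 1, 1], threshold=0.7)
```

Sweeping the threshold traces out the risk-coverage curve: higher thresholds lower risk but shrink coverage, which is the trade-off evaluated in safety-critical settings.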
Applications and domain-specific adaptations include:
- Vision: Object classification (Fu et al., 2014, Li et al., 2017), fine-grained recognition, material and texture categorization (Olson et al., 2020).
- Language: Semantic utterance classification (Dauphin et al., 2013), document and job classification with evolving taxonomies (Lake, 2022).
- Audio: Audio event classification using cross-modal embeddings (including image-based semantics) (Dogan et al., 2022).
- Relation Extraction: Zero-shot relation classification leveraging side information such as entity hypernyms and label synonyms (Gong et al., 2020).
5. Technical Innovations and Representative Methodological Trends
- Graph-based Reasoning: Absorbing Markov chains on semantic graphs provide principled, closed-form solutions for knowledge transfer and prediction, achieving both computational efficiency and conceptual simplicity (Fu et al., 2014).
- Hierarchical and Compositional Embeddings: Integrating hierarchical semantic structures and compositional part-based architectures produces representations highly effective for fine-grained and transfer tasks (Li et al., 2017, Sylvain et al., 2020).
- Coupled and Augmented Attribute Spaces: Jointly learning defined and residual (automatically discovered) attributes addresses under-completeness in human annotations and improves selective classification (Song et al., 2018).
- Generative and Hybrid Approaches: Conditional generators (e.g., GMMN, GANs, autoencoders) allow artificial feature synthesis, effectively transforming zero-shot learning into a supervised setting and improving performance in both standard and generalized tasks (Bucher et al., 2017).
- Efficient Optimization and Scaling: Methods using closed-form solutions, confidence-based risk/coverage frameworks, and scalable association structures facilitate deployment in real-time and large-scale operational settings.
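The closed-form flavor of graph-based propagation can be illustrated with the classic Zhou et al.-style update F = (I − αS)⁻¹ Y on a symmetrically normalized graph; note this is a generic label-propagation sketch, not the specific absorbing-Markov-chain formulation of (Fu et al., 2014), and the toy chain graph and α value are illustrative:

```python
import numpy as np

def propagate_labels(W, Y, alpha=0.5):
    """Closed-form label propagation: solve (I - alpha*S) F = Y, where
    S = D^{-1/2} W D^{-1/2} is the symmetrically normalized graph."""
    d = W.sum(axis=1)
    S = W / np.sqrt(np.outer(d, d))
    n = W.shape[0]
    return np.linalg.solve(np.eye(n) - alpha * S, Y)

# Toy chain graph: node 0 is labeled class A, node 2 class B,
# node 1 is unlabeled and sits between them.
W = np.array([[0., 1., 0.],
              [1., 0., 1.],
              [0., 1., 0.]])
Y = np.array([[1., 0.],
              [0., 0.],
              [0., 1.]])
F = propagate_labels(W, Y)
```

Because the solution is a single linear solve (or a precomputed matrix inverse), no iterative training is needed at inference time, which is the computational appeal of closed-form propagation methods.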
6. Empirical Outcomes and Performance Insights
Zero-shot classification techniques have demonstrated substantial improvements in transfer scenarios across domains:
- Methods combining deep semantic embedding with auxiliary regularizers (entropy, domain adaptation) achieve state-of-the-art results in semantic utterance classification and fine-grained recognition (Dauphin et al., 2013, Li et al., 2017).
- Graph- and label propagation-based algorithms outperform naive embedding-based nearest neighbors by leveraging structural relationships among classes, with efficient linear scaling in the number of test instances (Fu et al., 2014, Li et al., 2017).
- Conditional generative models deliver 10–30% accuracy improvements in standard and generalized zero-shot learning settings, especially when synthetic features are used for unseen classes (Bucher et al., 2017).
- Selective zero-shot classifiers with augmented or residual attribute confidence outperform traditional attribute-based ZSC approaches on the risk-coverage trade-off (Song et al., 2018).
- Dictionary learning frameworks achieve hit@1 accuracies up to 91.0% (SUN), demonstrating that attribute-aware regularization mitigates both domain shift and high-dimensional “hubness” (Rostami et al., 2019).
- In non-visual domains, embedding-based zero-shot document and relation classification rivals or exceeds performance of supervised baselines under dynamic or expanding taxonomies (Lake, 2022, Gong et al., 2020).
7. Open Challenges and Future Directions
Despite significant progress, key challenges remain for zero-shot classification:
- Semantic Gap and Modality Bias: Ensuring that learned (or constructed) class prototypes for unseen categories are as discriminative as those for seen classes remains central, especially in multimodal or generalized settings (Bucher et al., 2017, Rostami et al., 2019).
- Representation Completeness: Overcoming the limitations of under-complete attribute vocabularies necessitates the discovery or learning of complementary features and residual attributes (Song et al., 2018).
- Compositional and Local Representation: Evidence underscores the necessity of models that capture compositional, part-based or local representations, particularly when eschewing external pretraining (Sylvain et al., 2020).
- Selective and Reliable Decision-making: Incorporating calibrated confidence measures is essential for practical deployment where abstention from uncertain decisions is preferable to erroneous predictions (Song et al., 2018).
- Scalability and Adaptation: Real-time and large-vocabulary scenarios require methods with closed-form or scalable solutions, and the ability to update or expand class vocabularies without retraining.
- Domain-Specific and Multi-Modal Transfer: There is an ongoing need to further develop approaches that handle audio, video, and cross-modal cases (e.g., using image embeddings for audio classification (Dogan et al., 2022)) and exploit knowledge from LLMs and ontologies.
Further research is directed at refining semantic embedding spaces (potentially through self-supervision or active learning), developing dynamic and compositional representations, investigating hybrid and generative models for richer synthetic data, and exploring integration with knowledge graphs, retrieval systems, or broader open-domain learning frameworks.