Zero-shot Image Classification

Updated 9 April 2026
  • Zero-shot image classification is a paradigm where models predict unseen classes by aligning image features with auxiliary semantic representations.
  • It employs compatibility learning, generative modeling, and vision-language pretraining to overcome the limitations of conventional supervised approaches.
  • Techniques like prompt engineering, guided cropping, and hypernetwork mapping enable robust performance across diverse real-world and fine-grained classification tasks.

Zero-shot image classification is the paradigm in which a model is required to recognize classes at test time for which it has seen no visual training examples, relying instead on auxiliary class information such as attributes, descriptions, or semantic embeddings. This setting tests a system's ability to generalize to novel concepts beyond its training distribution and has motivated a suite of algorithmic frameworks grounded in compatibility learning, generative modeling, cross-modal alignment, and large-scale vision-language pretraining.

1. Foundations and Problem Definition

In zero-shot image classification, the dataset is split into disjoint sets of "seen" classes (with labeled training images) and "unseen" classes (with only side information, no images) (Ruffino et al., 2024). The goal is to build a model that—given auxiliary information describing unseen classes—accurately predicts the correct label for images of those classes at test time.

Let S and U denote the seen and unseen class sets, respectively. For each class c, an auxiliary descriptor φ(c) is available (e.g., an attribute vector, semantic embedding, or textual description). The dominant architectures cast zero-shot classification as a compatibility problem: an image x and a class embedding φ(c) receive a compatibility score F(x, φ(c)), and the predicted label is ĉ = argmax_{c ∈ U} F(x, φ(c)) (Karessli et al., 2016). In generalized zero-shot classification (GZSC), both seen and unseen classes are candidates at test time, further exposing models to domain shift and bias phenomena (Bucher et al., 2017).
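Concretely, the compatibility formulation reduces inference to an argmax over per-class scores. The sketch below is illustrative NumPy: the projection matrix P and the cosine compatibility are stand-ins for whatever learned F a given method uses.

```python
import numpy as np

def zero_shot_predict(x, class_descriptors, compat):
    """Return the index of the unseen class whose descriptor is most
    compatible with image feature x, i.e. argmax_c F(x, phi(c))."""
    scores = [compat(x, phi_c) for phi_c in class_descriptors]
    return int(np.argmax(scores))

# Toy compatibility: cosine similarity between an image feature and a
# (hypothetical) projection of the class descriptor into image-feature space.
rng = np.random.default_rng(0)
P = rng.normal(size=(8, 16))                      # semantic -> visual projection
cosine = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
compat = lambda x, phi_c: cosine(x, phi_c @ P)

descriptors = rng.normal(size=(5, 8))             # one descriptor per unseen class
x = descriptors[3] @ P + 0.01 * rng.normal(size=16)  # image near class 3's prototype
print(zero_shot_predict(x, descriptors, compat))
```

Any of the methods below can be read as a particular choice of compat: a learned bilinear form, a hypervector similarity, or a vision-language embedding match.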

2. Cross-Modal Compatibility Models

An early and influential stream of methods learns a compatibility function between image features and auxiliary class representations. Approaches include:

  • Structured Joint Embedding (SJE) and bilinear models: F(x, φ(c)) = θ(x)ᵀ W φ(c), where W is optimized to assign higher scores to correct (image, class) pairs (Karessli et al., 2016, Naeem et al., 2022).
  • Attribute-based encoders: In HDC-ZSC, stationary binary codebooks for attribute groups/values are fixed, and class attributes are composed into high-dimensional hypervectors. A trainable image encoder projects images into this space, and classification proceeds via temperature-scaled cosine similarity K(x, φ(A_c)) (Ruffino et al., 2024).
  • Relational and pairwise losses: Some models incorporate explicit alignment of intra-class structure. A neural mapping from semantic descriptor space to image-feature space is trained to minimize both per-sample regression error and the discrepancy between pairwise inter-class distance matrices in the two spaces (Das et al., 2019).
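The hypervector composition used by attribute-based encoders such as HDC-ZSC can be sketched with bipolar codebooks. The attribute names, dimensionality, and the specific binding/bundling operators below are illustrative, not the paper's exact configuration:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 4096  # hypervector dimensionality

# Fixed (non-trainable) bipolar codebooks: one hypervector per attribute
# group and per attribute value. Names are illustrative.
groups = {g: rng.choice([-1, 1], size=D) for g in ["color", "shape", "texture"]}
values = {v: rng.choice([-1, 1], size=D) for v in ["red", "round", "smooth",
                                                  "blue", "square", "rough"]}

def class_hypervector(attributes):
    """Compose a class prototype by binding each group with its value
    (elementwise product) and bundling the results (sign of the sum)."""
    bound = [groups[g] * values[v] for g, v in attributes.items()]
    return np.sign(np.sum(bound, axis=0))

proto_a = class_hypervector({"color": "red", "shape": "round", "texture": "smooth"})
proto_b = class_hypervector({"color": "blue", "shape": "square", "texture": "rough"})

cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
# Prototypes of classes with disjoint attributes are nearly orthogonal,
# so cosine similarity cleanly separates them.
print(cos(proto_a, proto_a), cos(proto_a, proto_b))
```

A trained image encoder would map images into this same D-dimensional space, and classification then reduces to the cosine comparison shown at the end.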

Regularization against "hubness" and domain shift is critical in high-dimensional compatibility methods; entropic penalties and transductive approaches (e.g., label propagation on graphs of class prototypes) are used to mitigate these effects (Rostami et al., 2019).
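A minimal SJE-style training loop makes the bilinear compatibility above concrete. This is a toy sketch on synthetic data (the shapes, learning rate, and margin are illustrative), not a published training recipe:

```python
import numpy as np

def train_bilinear_W(X, Y, Phi, lr=0.01, margin=0.1, epochs=20, seed=0):
    """Learn W so that theta(x)^T W phi(y) ranks the true class above
    others (SJE-style pairwise ranking updates, plain SGD)."""
    rng = np.random.default_rng(seed)
    W = 0.01 * rng.normal(size=(X.shape[1], Phi.shape[1]))
    for _ in range(epochs):
        for x, y in zip(X, Y):
            scores = x @ W @ Phi.T
            # Most-violating wrong class under the margin
            wrong = np.argmax(scores + margin * (np.arange(len(Phi)) != y))
            if wrong != y and scores[wrong] + margin > scores[y]:
                W += lr * np.outer(x, Phi[y] - Phi[wrong])
    return W

# Synthetic seen-class data: one-hot "attributes" mapped through a
# hypothetical class-to-feature matrix M, plus noise.
rng = np.random.default_rng(1)
Phi = np.eye(4)                       # 4 classes, one-hot descriptors
M = rng.normal(size=(4, 12))
Y = rng.integers(0, 4, size=200)
X = Phi[Y] @ M + 0.1 * rng.normal(size=(200, 12))

W = train_bilinear_W(X, Y, Phi)
preds = np.argmax(X @ W @ Phi.T, axis=1)
print((preds == Y).mean())            # should approach 1.0 on this toy data
```

At test time the same W scores unseen-class descriptors, which is what makes the bilinear form zero-shot capable.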

3. Generative and Synthetic Feature Approaches

Generative models address two main ZSL burdens: the inability to train discriminative visual classifiers on unseen classes and strong bias towards seen classes in the GZSC regime.

  • Conditional feature generation: A generator G synthesizes image features G(φ(c), z) conditioned on a class descriptor φ(c) and a noise vector z. These synthetic examples (for unseen classes) are then used to train a standard supervised classifier (e.g., a softmax over seen + generated features) (Bucher et al., 2017). Several architectures are deployed: Generative Moment Matching Networks (GMMN), (Auxiliary-Classifier) GANs, Denoising Autoencoders, and Adversarial Autoencoders. Explicit moment matching (GMMN) is empirically the most stable.
  • Text-to-model and hypernetworks: Text2Model instantiates, at inference time, a task-specific classifier by mapping class textual descriptions via a permutation-equivariant hypernetwork to classifier parameters. This supports rich class descriptions, including negative constraints, and yields non-linear classifiers tailored to the current task's label set (Amosy et al., 2022).
  • Domain-specific synthetic image generation: The AttrSyn pipeline uses LLMs to create diverse attributed prompts (e.g., background, pose, style), which text-to-image models (e.g., Stable Diffusion XL) turn into synthetic training images. Logistic-regression (linear-probe) classifiers trained on the resulting synthetic CLIP features consistently outperform simple prompt or direct zero-shot CLIP methods (Wang et al., 6 Apr 2025).
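The generate-then-classify recipe these methods share can be sketched end to end. Here the "generator" is just a fixed random linear map standing in for a trained GMMN/GAN, and the downstream classifier is a plain softmax fit by gradient descent:

```python
import numpy as np

rng = np.random.default_rng(0)
d_feat, d_sem = 12, 6

# Stand-in for a trained conditional generator: in practice this mapping
# is learned on seen classes; here it is a fixed random linear map.
G_map = rng.normal(size=(d_sem, d_feat))
def generate_features(phi_c, n, noise=0.2):
    return phi_c @ G_map + noise * rng.normal(size=(n, d_feat))

# Descriptors for two unseen classes; synthesize pseudo-labelled features.
phi_unseen = rng.normal(size=(2, d_sem))
X = np.vstack([generate_features(phi_unseen[c], 50) for c in (0, 1)])
y = np.repeat([0, 1], 50)

# Train an ordinary softmax classifier on the synthetic features.
W = np.zeros((d_feat, 2))
for _ in range(200):
    logits = X @ W
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    p[np.arange(len(y)), y] -= 1.0       # softmax cross-entropy gradient
    W -= 0.1 * X.T @ p / len(y)

test = generate_features(phi_unseen[1], 1)
print(int(np.argmax(test @ W)))          # classifies a fresh class-1 feature
```

The key point is the decoupling: once features for unseen classes exist, any standard supervised pipeline applies unchanged.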

4. Pretrained Vision-Language Models and Prompt-Based Retrieval

Large-scale vision-language pretraining—most notably CLIP—has dramatically advanced zero-shot classification:

  • Textual prompt retrieval: CLIP encodes images and class descriptions into a joint embedding space. Zero-shot classification is performed by matching a query image's embedding against the textual embeddings of all candidate classes (e.g., "an image of a {class}"), and returning the class with the highest cosine similarity (Sammani et al., 2024, Abdelhamed et al., 2024).
  • Prompt engineering and multi-modal fusion: Methods have investigated prompt manipulation (e.g., adding class attributes, mutual concepts, or data-driven prompts). Incorporation of multimodal LLMs (e.g., Gemini Pro) to produce detailed captions and initial predictions from the test image itself—fused in the CLIP space—yields substantial accuracy gains over vanilla CLIP (Abdelhamed et al., 2024).
  • Guided cropping and self-localization: When target objects are small or backgrounds confound CLIP's global features, integrating object detectors (e.g., OWL-ViT) to crop and focus the input can boost zero-shot performance, particularly for small-object images (Saranrittichai et al., 2023).
  • Collaborative Self-Learning: Methods combine VLMs for high-confidence semantic pseudo-labeling with strong visual encoders (e.g., ViT-G-14), initializing and iteratively improving a lightweight classifier on test data via a self-learning loop, without additional annotation or VLM fine-tuning (Todescato et al., 23 Sep 2025).
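The CLIP-style retrieval step itself is simple once embeddings exist. The sketch below uses random placeholder vectors in place of CLIP's image and text encoders, so only the similarity-and-argmax logic is meaningful:

```python
import numpy as np

def zero_shot_classify(image_emb, text_embs, temperature=0.01):
    """CLIP-style retrieval: cosine similarity between an L2-normalised
    image embedding and one text embedding per class prompt, softmaxed
    into a probability distribution over classes."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    sims = txt @ img
    probs = np.exp(sims / temperature)
    probs /= probs.sum()
    return int(np.argmax(sims)), probs

# Placeholder embeddings standing in for CLIP encoders applied to an image
# and to prompts like "an image of a {class}".
rng = np.random.default_rng(0)
text_embs = rng.normal(size=(3, 32))
image_emb = text_embs[2] + 0.1 * rng.normal(size=32)
pred, probs = zero_shot_classify(image_emb, text_embs)
print(pred)
```

Prompt engineering, caption fusion, and guided cropping all act before this step, by changing which embeddings are compared rather than the comparison itself.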

5. Alternative Side Information and Real-World Extensions

Beyond attributes and pure text embeddings, zero-shot models have leveraged alternative class-side information, enabling applications in settings with weak supervision or non-standard modalities.

  • Gaze-based embeddings: Human gaze data collected during discriminative tasks can be encoded into spatial and temporal embeddings that—when coupled with image features—yield competitive or superior performance to expert-annotated attributes on fine-grained datasets (Karessli et al., 2016).
  • Document and web-text alignment: I2DFormer jointly encodes images and entire class-level documents (e.g., Wikipedia articles), with cross-modal attention learning to align image patches to discriminative words, producing highly interpretable decisions without manual attribute annotation (Naeem et al., 2022).
  • Image-free classifier injection: ICIS learns mappings from semantic class descriptions to classifier weight vectors, enabling the "injection" of new zero-shot classes into arbitrary pre-trained models post-hoc, with no access to image data, by optimizing cross-reconstruction and alignment losses (Christensen et al., 2023).
  • Hyperspectral and domain-specific scenarios: Interpolation of hyperspectral data into pseudo-RGB permits CLIP-based pseudo-labeling, and subsequent spectral refinement via Gaussian Mixture Models enables fully zero-annotation classification pipelines in remote sensing (Pang et al., 27 Jan 2025). Retrieval-based pipelines with foundation models (e.g., DINOv2+FAISS) allow generic species classification in camera trap images, matching large supervised models with no location-specific retraining (Vyskočil et al., 2024).
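The retrieval-based pipeline reduces to nearest-neighbour search in a frozen embedding space. The sketch below brute-forces what FAISS would accelerate, with random vectors standing in for DINOv2 features:

```python
import numpy as np

def retrieval_classify(query, gallery, labels, k=5):
    """Nearest-neighbour zero-shot labelling in a frozen embedding space:
    top-k cosine search over a labelled gallery, then majority vote."""
    gallery_n = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    idx = np.argsort(-(gallery_n @ q))[:k]   # top-k cosine matches
    vals, counts = np.unique(labels[idx], return_counts=True)
    return int(vals[np.argmax(counts)])

# Toy gallery: three "species" clusters in a 64-dim embedding space.
rng = np.random.default_rng(0)
centers = rng.normal(size=(3, 64))
labels = np.repeat([0, 1, 2], 20)
gallery = centers[labels] + 0.3 * rng.normal(size=(60, 64))
query = centers[1] + 0.3 * rng.normal(size=64)
print(retrieval_classify(query, gallery, labels))
```

Because nothing is trained, adding a new class is just adding labelled embeddings to the gallery, which is why such pipelines need no location-specific retraining.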

6. Interpretation, Evaluation, and Emerging Directions

Interpretability and confidence estimation have become core aspects of modern zero-shot pipelines:

  • Mutual knowledge analysis: The alignment ("mutual information") between concepts detected by vision and language encoders predicts zero-shot performance and robustness across CLIP architectures; AUC of MI-drop curves correlates tightly with accuracy. Injecting mutual concepts into prompts improves performance (Sammani et al., 2024).
  • Predicting zero-shot performance: Synthetic image generation—using models such as SDXL-Lightning—conditions on class descriptions to forecast how well a VLM is expected to perform on arbitrary class queries, providing class-level and dataset-wide predictive metrics for VLM selection and task planning (Robbins et al., 24 Jan 2026).

Fine-grained zero-shot tasks have also received renewed attention, with LVLMs re-cast into iterative VQA frameworks combined with "attention intervention" techniques to compensate for reliance on language priors and shallow visual reasoning. These methods achieve substantial improvements over prior SOTA on fine-grained bird, car, aircraft, and food benchmarks (Atabuzzaman et al., 4 Oct 2025).

A notable trend is the transition from single-modality attribute or embedding matching to fully multimodal, generative, and collaborative architectures that exploit advances in conditioned generation, cross-modal LLMs, and strong unsupervised visual backbones.


Key References:

(Ruffino et al., 2024, Karessli et al., 2016, Naeem et al., 2022, Bucher et al., 2017, Amosy et al., 2022, Sammani et al., 2024, Yin et al., 2024, Saranrittichai et al., 2023, Pang et al., 27 Jan 2025, Vyskočil et al., 2024, Atabuzzaman et al., 4 Oct 2025, Das et al., 2019, Rostami et al., 2019, Wang et al., 6 Apr 2025, Abdelhamed et al., 2024, Christensen et al., 2023, Todescato et al., 23 Sep 2025, Robbins et al., 24 Jan 2026)
