Zero-Shot and Few-Shot Learning
- Zero-shot and few-shot learning are paradigms that generalize models to novel classes using semantic transfer and minimal annotated data.
- They leverage techniques like metric learning, generative models, and meta-learning to adapt prototypes for new or unseen classes.
- Unified methods integrate multimodal data and instruction-based strategies to balance bias, improve calibration, and enhance scalability.
Zero-shot and few-shot learning address the problem of generalizing to novel classes or tasks with little to no annotated training data. These regimes are essential for extending recognition, classification, and generation systems to long-tail or evolving domains where exhaustive labeling is intractable. Both approaches have catalyzed methodological innovation across metric learning, generative modeling, meta-learning, and vision–language foundation models, and have recently converged toward unified frameworks for open-domain adaptation.
1. Definitions, Scope, and Problem Formalization
Zero-shot learning (ZSL) refers to the setting where a model must recognize or describe instances from classes that are entirely absent from the training set. To enable inference, ZSL methods bridge seen and unseen classes via shared semantic structures (commonly attributes, word vectors, or textual descriptions), so that knowledge is transferred by associating new inputs with their semantic representations. The canonical ZSL formulation assumes a disjoint partition of class labels into a seen (training) set $\mathcal{Y}^s$ and an unseen (test) set $\mathcal{Y}^u$, with $\mathcal{Y}^s \cap \mathcal{Y}^u = \emptyset$, together with a semantic embedding function $\phi: \mathcal{Y}^s \cup \mathcal{Y}^u \to \mathbb{R}^d$. Classification is achieved by matching an input $x$ to the semantic prototype $\phi(y)$ of each label $y$ through a learned or fixed compatibility function $F(x, y)$, often parameterized as a bilinear form such as $F(x, y) = \theta(x)^\top W \phi(y)$, with prediction $\hat{y} = \arg\max_{y \in \mathcal{Y}^u} F(x, y)$ (Saad et al., 2022).
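As a minimal sketch of compatibility-based zero-shot inference, the snippet below scores an input against the semantic vectors of candidate unseen classes and returns the best match. It assumes a bilinear compatibility function (a common choice, but not the only one); the function name `zero_shot_predict`, the matrix `W`, and the toy values are illustrative and not taken from any cited work.

```python
import numpy as np

def zero_shot_predict(x, W, class_embeddings):
    """Return the best-matching class index and all compatibility scores.

    x:                (d_x,) input feature vector
    W:                (d_x, d_s) compatibility matrix, assumed already
                      learned on seen classes (hypothetical here)
    class_embeddings: (n_classes, d_s) semantic vectors of candidate classes
    """
    # Bilinear compatibility F(x, y) = x^T W phi(y), computed for every class y.
    scores = class_embeddings @ (W.T @ x)
    return int(np.argmax(scores)), scores

# Toy example: 3-dim features, 2-dim attribute space, two unseen classes.
W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
unseen_attrs = np.array([[1.0, 0.0],   # attribute vector of unseen class 0
                         [0.0, 1.0]])  # attribute vector of unseen class 1
x = np.array([0.2, 0.9, 0.5])

pred, scores = zero_shot_predict(x, W, unseen_attrs)
print(pred, scores)
```

No labeled example of either unseen class is used at inference; only their attribute vectors enter the decision.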
Few-shot learning (FSL) extends this paradigm by providing a small support set of $K$ labeled examples (often $K \le 5$) for each novel class at test time. The learning goal is rapid adaptation: a model must leverage experience on related base classes to generalize quickly to these novel low-data classes.
Both ZSL and FSL are widely used in vision (Snell et al., 2017, Sung et al., 2017, Zhang et al., 2019), language, tabular prediction (Wang et al., 12 Aug 2025), and multimodal evaluation (Spinaci et al., 23 Sep 2025), under inductive, transductive, and generalized (GZSL/GFSL) regimes.
2. Key Methodological Frameworks
Prototype- and Metric-Based Models
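Prototypical networks (Snell et al., 2017) provide a foundational FSL approach, learning an embedding $f_\theta$ such that each class $k$ is represented by the mean ("prototype") of its embedded support examples $S_k$:
$$c_k = \frac{1}{|S_k|} \sum_{(x_i, y_i) \in S_k} f_\theta(x_i).$$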
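Query points are classified via a softmax over negative distances to all prototypes, $p(y = k \mid x) \propto \exp(-d(f_\theta(x), c_k))$, where $f_\theta$ is the learned embedding and $c_k$ is the prototype of class $k$. Squared Euclidean distance is the usual choice for $d$: it is a Bregman divergence, for which the class mean is the optimal cluster representative, and Snell et al. report that it outperforms cosine similarity empirically. This metric-based classifier extends directly to ZSL by replacing sample means with semantic prototypes $c_k = g(a_k)$, where $a_k$ is the attribute vector for class $k$ and $g$ is a learned mapping into the embedding space.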
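The prototype computation and distance-based softmax described above can be sketched in a few lines of NumPy. The embeddings are assumed to be precomputed by some encoder; the helper names `prototypes` and `classify` are illustrative.

```python
import numpy as np

def prototypes(support_emb, support_labels, n_classes):
    """Mean embedding of the support examples of each class: shape (n_classes, d)."""
    return np.stack([support_emb[support_labels == k].mean(axis=0)
                     for k in range(n_classes)])

def classify(query_emb, protos):
    """Softmax over negative squared Euclidean distances to each prototype."""
    # Pairwise squared distances, shape (n_query, n_classes).
    d2 = ((query_emb[:, None, :] - protos[None, :, :]) ** 2).sum(axis=-1)
    logits = -d2
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    p = np.exp(logits)
    return p / p.sum(axis=1, keepdims=True)

# Toy 2-way, 2-shot episode with 2-dim embeddings.
support = np.array([[0.0, 0.0], [0.0, 2.0],   # class 0
                    [4.0, 4.0], [6.0, 4.0]])  # class 1
labels = np.array([0, 0, 1, 1])

protos = prototypes(support, labels, n_classes=2)
probs = classify(np.array([[1.0, 1.0]]), protos)
print(protos, probs)
```

For the zero-shot variant, `protos` would instead be produced by mapping class attribute vectors into the embedding space; `classify` stays unchanged.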
Relation networks (Sung et al., 2017), in contrast, learn a non-linear deep metric: a relation module is applied to concatenated support and query embeddings and trained end-to-end in an episodic (meta-learning) regime, regressing relation scores toward 1 for matching pairs and 0 for non-matching pairs with a mean-squared-error loss.
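Written out, the relation-network objective takes roughly the following form. The symbols are notation chosen here for illustration: $f_\theta$ is the embedding module, $g_\psi$ the relation module, $[\cdot\,;\cdot]$ denotes concatenation, and $x_i$, $x_j$ range over the support and query examples of an episode.

$$r_{i,j} = g_\psi\big([\,f_\theta(x_i)\,;\,f_\theta(x_j)\,]\big), \qquad \min_{\theta,\psi} \sum_{i}\sum_{j} \big(r_{i,j} - \mathbf{1}[y_i = y_j]\big)^2 .$$

At test time a query is assigned to the class whose support example (or pooled class representation) yields the highest relation score.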
Class Adapting Principal Directions (CAPD) (Rahman et al., 2017) generalize metric transfer by learning per-class projection matrices mapping image features into a semantic space. For zero-shot classes, unseen prototypes are synthesized as linear combinations of seen-class CAPDs, weighted by semantic similarity, and further updated when few-shot samples are available. This framework explicitly supports automatic selection of relevant base classes for each novel class and bias correction in GZSL.
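The idea of building an unseen-class prototype from seen-class ones can be illustrated with a deliberately simplified stand-in. The sketch below does not reproduce CAPD: the combination weights here are a temperature-scaled softmax over cosine similarities in semantic space, which is an assumption made for illustration, and all names and values are hypothetical.

```python
import numpy as np

def synthesize_unseen(seen_protos, seen_sem, unseen_sem, temperature=0.1):
    """Combine seen prototypes into unseen ones, weighted by semantic similarity.

    seen_protos: (S, d) prototypes (or projection directions) of seen classes
    seen_sem:    (S, m) semantic vectors of seen classes
    unseen_sem:  (U, m) semantic vectors of unseen classes
    Returns (U, d) synthesized prototypes and the (U, S) mixing weights.
    """
    def l2norm(a):
        return a / np.linalg.norm(a, axis=1, keepdims=True)

    sim = l2norm(unseen_sem) @ l2norm(seen_sem).T           # cosine similarity
    # Softmax over seen classes; the max is subtracted for numerical stability.
    w = np.exp((sim - sim.max(axis=1, keepdims=True)) / temperature)
    w = w / w.sum(axis=1, keepdims=True)
    return w @ seen_protos, w

seen_protos = np.array([[1.0, 0.0], [0.0, 1.0]])
seen_sem = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
unseen_sem = np.array([[0.9, 0.1, 0.0]])  # semantically close to seen class 0

unseen_protos, weights = synthesize_unseen(seen_protos, seen_sem, unseen_sem)
print(unseen_protos, weights)
```

A low temperature concentrates the weight on the few most similar seen classes, loosely mirroring the selection of relevant base classes; when few-shot samples arrive, the synthesized prototype could be blended with their empirical mean.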
Generative and Semantic Embedding Approaches
Generative ZSL methods model each class as a distribution in feature space, with parameters (means, covariances) predicted from class-level attributes via learned mappings (Mishra et al., 2018). In inductive settings, these mappings are fit to seen classes and extrapolated to unseen ones; in transductive regimes, unlabeled data from unseen classes can further refine these parameters, e.g., via EM updates on Gaussian mixture models.
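A minimal inductive sketch of this idea: fit a ridge-regularized linear map from class attributes to class means on seen classes, extrapolate it to unseen attributes, and classify by Gaussian log-likelihood. The linear map, the shared isotropic unit covariance, and all names and numbers are simplifying assumptions for illustration, not the cited method's exact parameterization.

```python
import numpy as np

def fit_attr_to_mean(seen_attrs, seen_means, reg=1e-3):
    """Ridge regression from attribute vectors (S, m) to class means (S, d)."""
    A = seen_attrs
    gram = A.T @ A + reg * np.eye(A.shape[1])
    return np.linalg.solve(gram, A.T @ seen_means)  # (m, d) mapping

def predict(x, unseen_attrs, M):
    """Pick the unseen class whose predicted Gaussian best explains x."""
    means = unseen_attrs @ M                        # extrapolated class means
    # Log-likelihood under unit isotropic covariance, constants dropped.
    log_lik = -0.5 * ((x - means) ** 2).sum(axis=1)
    return int(np.argmax(log_lik))

# Seen classes: attributes and empirical feature means.
seen_attrs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
seen_means = np.array([[2.0, 0.0], [0.0, 2.0], [2.0, 2.0]])
M = fit_attr_to_mean(seen_attrs, seen_means)

# Unseen classes are described only by their attributes.
unseen_attrs = np.array([[1.0, 0.5], [0.5, 1.0]])
pred = predict(np.array([1.9, 1.1]), unseen_attrs, M)
print(M, pred)
```

In a transductive variant, the extrapolated means would serve as initial values that are then refined on unlabeled unseen-class features, for example with EM.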
Recent frameworks integrate generative adversarial networks (GANs) and variational autoencoders (VAEs) both for direct feature synthesis (CVAE, f-CLSWGAN, etc.) and for meta-learned generation (Verma et al., 2020, Chochlakis et al., 2021). For example, Z2FSL (Chochlakis et al., 2021) reduces ZSL to FSL by generating synthetic features for unseen classes using a conditional generative model and directly applying few-shot learners such as prototypical networks to these support sets; the entire system is trained end-to-end, allowing FSL loss gradients to influence the generative model, thus improving the utility of synthesized data.
Knowledge-sharing methods (Ting et al., 2021) further enrich semantic representations by augmenting textual descriptors with information aggregated from semantically similar classes; GAN-based synthesis on these enriched vectors addresses semantic incompleteness and domain drift.
Semi-supervised scenarios (Fluss et al., 2023) introduce a KL-divergence penalty into the loss to align empirical prediction frequencies with known class distributions, thereby calibrating confidence between seen (labeled) and unseen (zero-shot) classes even when only a subset of classes is labeled.
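A penalty of this kind can be sketched as the KL divergence between the batch-averaged predicted class distribution and a known class prior. The direction of the divergence, the function name, and the toy numbers are assumptions made here; the source does not specify them.

```python
import numpy as np

def kl_prior_penalty(probs, class_prior, eps=1e-12):
    """KL(q || p), where q is the mean predicted distribution over a batch
    and p is the known class prior. probs has shape (n_samples, n_classes)."""
    q = probs.mean(axis=0)
    return float(np.sum(q * (np.log(q + eps) - np.log(class_prior + eps))))

prior = np.array([0.5, 0.5])

# Predictions skewed toward class 0 (e.g., a seen class): large penalty.
biased = np.array([[0.9, 0.1], [0.8, 0.2]])
# Predictions whose average matches the prior: penalty near zero.
balanced = np.array([[0.9, 0.1], [0.1, 0.9]])

penalty_biased = kl_prior_penalty(biased, prior)
penalty_balanced = kl_prior_penalty(balanced, prior)
print(penalty_biased, penalty_balanced)
```

In training, such a term would be added to the supervised loss with a weighting coefficient, so that individual predictions can stay confident while their aggregate frequencies track the prior.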
Unified and Multimodal Systems
Unified approaches such as BinBin (Xu et al., 6 Mar 2024) recast multi-class tasks as instruction-following binary decisions, combining indirect supervision (pretraining on diverse instruction tasks) and weak supervision (using LLMs to generate labeled examples for zero-shot classes). Such binary-inference transformation enables handling mixed frequent-/few-/zero-shot ("X-shot") label scenarios without architectural changes.
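The binary-inference transformation can be illustrated as follows: each instance is expanded into one yes/no example per candidate label, so frequent, few-shot, and zero-shot labels share the same format. The instruction template, field names, and scores below are hypothetical and are not taken from BinBin.

```python
def to_binary_examples(text, gold_labels, label_space):
    """Expand one instance into one yes/no example per candidate label."""
    examples = []
    for label in label_space:
        examples.append({
            # Hypothetical instruction template, not the paper's wording.
            "instruction": f'Does the label "{label}" apply to the input?',
            "input": text,
            "label": label,
            "answer": "yes" if label in gold_labels else "no",
        })
    return examples

def decode(binary_scores, threshold=0.5):
    """Keep labels whose 'yes' score (from any binary model) clears the threshold."""
    return [label for label, score in binary_scores.items() if score >= threshold]

# The label space may mix frequent, few-shot, and zero-shot labels.
label_space = ["acquisition", "lawsuit", "earthquake"]
examples = to_binary_examples("Company A bought Company B.",
                              {"acquisition"}, label_space)
predicted = decode({"acquisition": 0.91, "lawsuit": 0.12, "earthquake": 0.03})
print(examples[0], predicted)
```

Because a new label only adds new (input, label) pairs, the label space can grow without changing the model's output layer.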
In the multimodal domain, vision–language models (CLIP, SigLIP) and multimodal LLMs have demonstrated strong zero- and few-shot performance on both curated and heterogeneous datasets (Spinaci et al., 23 Sep 2025). Prompt enrichment and in-context exemplars may further improve zero-shot accuracy, though few-shot gains are dataset- and prompt-sensitive. Prompt-enhanced aggregation, as in PEVA-Net (Lin et al., 30 Apr 2024), leverages text-guided fusion to produce strong descriptors in 3D shape recognition; self-distillation from zero-shot to few-shot descriptors robustly boosts accuracy.
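Zero-shot classification with a contrastive vision–language model typically reduces to comparing an image embedding with the embeddings of one text prompt per class. The sketch below shows CLIP-style scoring (cosine similarity followed by a temperature-scaled softmax) on placeholder vectors; no real encoder is called, and the prompts, temperature, and embedding values are assumptions for illustration.

```python
import numpy as np

def zero_shot_probs(image_emb, text_embs, temperature=0.01):
    """Softmax over temperature-scaled cosine similarities between one image
    embedding (d,) and one text embedding per class (n_classes, d)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    logits = (txt @ img) / temperature
    logits = logits - logits.max()   # numerical stability
    p = np.exp(logits)
    return p / p.sum()

# One prompt per class; in practice these come from the text encoder.
prompts = ["a photo of a cat", "a photo of a dog", "a photo of a bird"]
text_embs = np.array([[1.0, 0.0, 0.0],
                      [0.0, 1.0, 0.0],
                      [0.0, 0.0, 1.0]])   # placeholder text embeddings
image_emb = np.array([1.0, 0.1, 0.0])     # placeholder image embedding

probs = zero_shot_probs(image_emb, text_embs)
best_prompt = prompts[int(np.argmax(probs))]
print(best_prompt, probs)
```

Prompt enrichment changes only the text side: richer class descriptions yield different text embeddings, which is one reason zero-shot accuracy is sensitive to prompt wording.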
For tabular data, ProtoLLM (Wang et al., 12 Aug 2025) demonstrates that LLMs can be used to generate feature values for prototype construction via example-free prompts, sidestepping data leakage and token-length constraints, and supporting rapid adaptation in zero-/few-shot settings by fusing LLM priors with empirical support values.
Data-Augmentation and Optimization-Based Meta-Learning
A detailed taxonomy (Bendre et al., 2020) summarizes methods for few- and zero-shot learning:
- Data augmentation (e.g., feature hallucination, set operations) enlarges the effective support set.
- Embedding/metric methods (as above) facilitate clustering and comparison.
- Optimization/meta-learning methods (MAML, LSTM meta-learners) rapidly adapt model parameters to new classes.
- Semantics-based techniques incorporate hierarchical or attribute information, variational approaches, and knowledge transfer networks.
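To make the optimization-based entry in the list above concrete, the following sketch shows the inner-loop adaptation step used by MAML-style methods: starting from shared meta-parameters, a few gradient steps on a task's support set produce task-specific parameters. It covers only the inner loop on a linear regression model (the outer meta-update is omitted), and the function names, learning rate, and data are illustrative.

```python
import numpy as np

def mse_loss_and_grad(w, X, y):
    """Mean-squared error of a linear model and its gradient with respect to w."""
    err = X @ w - y
    return float((err ** 2).mean()), 2.0 * X.T @ err / len(y)

def inner_adapt(w_meta, X_support, y_support, lr=0.1, steps=1):
    """Adapt meta-parameters to one task with a few gradient steps."""
    w = w_meta.copy()
    for _ in range(steps):
        _, grad = mse_loss_and_grad(w, X_support, y_support)
        w = w - lr * grad
    return w

w_meta = np.zeros(2)                        # shared initialization
X_s = np.array([[1.0, 0.0], [0.0, 1.0]])    # 2-example support set
y_s = np.array([1.0, -1.0])

w_task = inner_adapt(w_meta, X_s, y_s, lr=0.5, steps=1)
loss_before, _ = mse_loss_and_grad(w_meta, X_s, y_s)
loss_after, _ = mse_loss_and_grad(w_task, X_s, y_s)
print(w_task, loss_before, loss_after)
```

The meta-learning objective then optimizes `w_meta` so that the post-adaptation loss on each task's query set is low, which requires differentiating through `inner_adapt` or using a first-order approximation.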
3. Design Principles and Training Strategies
A recurring principle is episodic/meta-training: models are trained on a sequence of tasks that mimic few-/zero-shot test conditions, which encourages task-adaptive embeddings and calibrates metrics or generative processes for generalization (Snell et al., 2017, Sung et al., 2017). Matching the training "shot" to the test "shot" generally helps, and Snell et al. additionally report that training with a higher "way" than is used at test time can improve accuracy. For generalized settings (GZSL/GFSL), explicit mechanisms to balance class priors or regularize seen/unseen representation gaps are essential to mitigate seen-class bias (Rahman et al., 2017).
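An episode sampler is the basic building block of episodic training. The sketch below draws one N-way, K-shot episode with disjoint support and query sets from a labeled pool; the function name and the toy label list are illustrative.

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way, k_shot, n_query, rng):
    """Sample (index, class) pairs for the support and query sets of one episode."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    # Only classes with enough examples for both support and query are eligible.
    eligible = [c for c, idxs in by_class.items() if len(idxs) >= k_shot + n_query]
    classes = rng.sample(eligible, n_way)

    support, query = [], []
    for c in classes:
        picked = rng.sample(by_class[c], k_shot + n_query)  # without replacement
        support += [(i, c) for i in picked[:k_shot]]
        query += [(i, c) for i in picked[k_shot:]]
    return support, query

# Toy pool: 5 base classes with 10 examples each.
labels = [i // 10 for i in range(50)]
support, query = sample_episode(labels, n_way=3, k_shot=2, n_query=4,
                                rng=random.Random(0))
print(support, query)
```

Setting `k_shot` to the value expected at test time, and optionally `n_way` higher than at test time, reproduces the training configurations discussed above.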
Semantic regularization strategies such as dual-purpose semantic regularization (DPSR) (Shohag et al., 18 Jun 2025) enable classifiers to leverage semantic similarities between classes, reducing overfitting to sparse synthetic prototypes and improving decision boundary robustness.
4. Practical Performance and Benchmarking
Benchmark comparisons on datasets such as CUB, AwA, SUN, Omniglot, MiniImageNet, CIFAR-100, and ModelNet40 exhibit distinct trends:
- On fine-grained benchmarks (e.g., CUB birds, ICONCLASS iconography (Spinaci et al., 23 Sep 2025)), multimodal LLMs (Gemini-2.5 Pro, GPT-4o) may outperform both vision-only and conventional contrastive VLMs under zero-shot configurations, especially with prompt enrichment.
- In action recognition, generative models achieve substantial accuracy gains in both inductive (no unseen data) and transductive (unlabeled unseen data) settings, and few labeled samples rapidly bridge the performance gap (Mishra et al., 2018).
- Meta-classifier ensembles (e.g., majority voting, DNNs) can produce modest improvements over single ZSL baselines, but no "overall winner" emerges across datasets or metrics (Saad et al., 2022). Performance remains bounded, with top-1 accuracies frequently below 70% even for leading algorithms.
- On tabular tasks, training-free LLM-based prototype construction matches or surpasses traditional models, particularly in low-shot regimes, with substantial advantages in scalability and privacy (Wang et al., 12 Aug 2025).
- Real-world open-domain applications require handling label frequency variability and dataset heterogeneity; unified, instruction-following approaches (Xu et al., 6 Mar 2024) and robust prompt engineering are pivotal.
The table below summarizes typical characteristics and strategies in representative scenarios:
Scenario | Required Labeled Data | Key Mechanism |
---|---|---|
Zero-shot | 0 per class | Semantic transfer via attributes, GANs, or LLMs |
Few-shot (k-shot) | k per class (k small) | Metric, meta-learning, prototypes |
Generalized ZSL | mixed seen/unseen | Bias correction, feature synthesis, balancing losses |
Unified X-shot | 0 to many per class | Instruction-based binary inference, indirect/weak supervision (Xu et al., 6 Mar 2024) |
5. Major Challenges and Future Directions
Persistent challenges include:
- Semantic gap and domain shift: Misalignment between visual features and semantic embeddings leads to prediction errors, especially in settings with high intra-class variability or out-of-distribution context (Ting et al., 2021, Badawi et al., 23 Jun 2024).
- Calibration and bias: Class imbalance (seen/unseen, synthetic/real) necessitates explicit regularization or balancing techniques (e.g., DPSR, KL-penalty (Fluss et al., 2023)) and careful classifier design.
- Scalability and privacy: Large-scale feature synthesis is computationally costly and may violate ZSL assumptions of "no examples"; strategies that reduce synthetic data (e.g., MSAS in FSIGenZ (Shohag et al., 18 Jun 2025), privacy-aware mixing in DPS-MOZO (Flemings et al., 31 Jan 2025)) balance efficiency and robustness.
- Prompt sensitivity: In multimodal/LLM models, prompt quality and alignment with metadata or class descriptions are decisive for zero-shot performance (Spinaci et al., 23 Sep 2025).
Areas identified for further research include automated prompt optimization, more effective domain-specific adaptations (particularly for medical (Badawi et al., 23 Jun 2024) or cultural (Spinaci et al., 23 Sep 2025) data), open technical problems (e.g., hubness, instability in generative models), and meta-ensemble strategies for robust inference in realistic open-world tasks (Saad et al., 2022, Shohag et al., 18 Jun 2025).
6. Application Domains and Impact
Zero-shot and few-shot approaches have been validated in a spectrum of domains:
- Computer vision: object detection/classification (Rahman et al., 2017, Snell et al., 2017), 3D shape recognition (Lin et al., 30 Apr 2024), fine-grained iconography (Spinaci et al., 23 Sep 2025), and medical imaging (Badawi et al., 23 Jun 2024).
- Natural language processing: relation extraction and event detection in "X-shot" settings (Xu et al., 6 Mar 2024), instruction following, and tabular prediction (Wang et al., 12 Aug 2025).
- Semi-supervised and open-world scenarios: leveraging uncurated data (inductive ZSL, ZSL with auxiliary data (Bhatt et al., 2021), or limited supervision (Fluss et al., 2023)) to close the gap between experimental settings and practical deployment.
A plausible implication is that the continued synthesis of generative, metric, and instruction-following architectures—combined with more principled prompt engineering and explicit semantic regularization—will remain central in scaling zero-shot and few-shot learning to broader, real-world tasks. The measured progress toward efficient, data-frugal, and robust systems underscores the growing maturity of the field, while simultaneously revealing the practical trade-offs and research questions that persist across domains.