Few-Shot Learning Approach
- Few-shot learning is a paradigm in which models rapidly adapt to novel tasks from only a few labeled examples, employing episodic training and robust metric-based methods.
- It leverages meta-learning, auxiliary supervision, and semantic fusion to boost generalization and performance across domains such as vision and NLP.
- Recent advancements include rank-based metrics and adversarial training techniques that enhance accuracy and resilience to noisy or scarce data.
Few-shot learning (FSL) refers to the paradigm in which a model must quickly adapt to novel classes or tasks using only a small number of labeled examples per class, typically far fewer than required in conventional supervised deep learning. FSL approaches pursue robustness, generalization, and efficient transfer by leveraging prior knowledge, meta-learning, geometry-aware metrics, auxiliary modalities, unlabeled data, and sometimes explicit robustness constraints. Research in FSL spans image recognition, natural language processing, malware detection, continual learning, and other domains.
1. Problem Formalizations and Principles
FSL is typically cast in the $N$-way, $K$-shot episodic setting: each episode samples $N$ classes (ways), $K$ labeled support examples per class (shots), and a set of queries for evaluation or adaptation. Formally, episodes are drawn from a task distribution $p(\mathcal{T})$, where each task $\mathcal{T}$ provides a support set and a query set. The central challenge is generalizing to new classes or domains with only a few labeled supports per class.
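As a concrete illustration, a minimal episode sampler over a flat list of per-example labels might look as follows; the function name, defaults, and return format are illustrative choices rather than a fixed API:

```python
import random
from collections import defaultdict

def sample_episode(labels, n_way=5, k_shot=1, n_query=15, rng=random):
    """Build one N-way K-shot episode from per-example labels.
    Returns (dataset index, episode-local label) pairs for support/query."""
    by_class = defaultdict(list)
    for idx, y in enumerate(labels):
        by_class[y].append(idx)
    ways = rng.sample(sorted(by_class), n_way)  # the N sampled classes
    support, query = [], []
    for local_y, cls in enumerate(ways):
        picks = rng.sample(by_class[cls], k_shot + n_query)
        support += [(i, local_y) for i in picks[:k_shot]]
        query += [(i, local_y) for i in picks[k_shot:]]
    return support, query
```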
Canonical FSL methods are characterized by:
- Metric-based meta-learning: Directly construct class prototypes (e.g., mean embeddings) and perform inference using learned distances in embedding space (e.g., ProtoNet, RelationNet) (Cheng et al., 2019); a minimal prototype-inference sketch follows this list.
- Optimization-based meta-learning: Meta-learn an initialization and/or update rule such that a few gradient steps on the small support set lead to rapid adaptation (e.g., MAML, Meta-SGD) (Wang et al., 2019).
- Hybrid approaches: Integrate both metric and optimization-based strategies, typically enabling variable-way/shot adaptation and learning task-specific metrics (Wang et al., 2019).
- Auxiliary supervision: Employ self-supervision, per-sample attribute signals, or auxiliary data sources to improve representation quality, especially under data scarcity (Gidaris et al., 2019, Visotsky et al., 2019).
- Semantic or multi-modal transfer: Incorporate textual/LLM features or supplementary domain signals to boost generalization to rarely observed or unseen categories (Zhou et al., 2024).
- Unlabeled data utilization: Transductive and semi-supervised FSL introduces unlabeled instance pools and applies regularizers, pseudo-labeling, or dependency maximization (Hou et al., 2021, Wei et al., 2022).
- Robustness mechanisms: Address vulnerability to label noise or adversarial perturbations by prototype refinement (Mazumder et al., 2020) or adversarial query-based meta-training (Goldblum et al., 2019).
- Continual and lifelong learning: Tackle class-incremental scenarios where new few-shot classes appear sequentially, maintaining knowledge and minimizing catastrophic forgetting (Mazumder et al., 2021, Wang et al., 2021).
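A minimal sketch of metric-based inference in the ProtoNet style, assuming support and query embeddings are already computed; the negative squared Euclidean distance is the canonical choice, but any learned metric fits the same template:

```python
import torch

def prototypical_logits(support, support_y, queries, n_way):
    """Class prototypes are mean support embeddings; queries are scored by
    negative squared Euclidean distance to each prototype."""
    prototypes = torch.stack(
        [support[support_y == c].mean(dim=0) for c in range(n_way)])  # (N, d)
    dists = torch.cdist(queries, prototypes) ** 2                     # (Q, N)
    return -dists  # softmax over these gives class probabilities
```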
2. Distance Metrics, Correlation, and Similarity Measures
Prototypical and metric-based few-shot learners depend critically on class similarity measures. Traditionally, geometric metrics such as cosine similarity and (negative) Euclidean distance are applied to representations. Advances targeting feature-space regularity and channel-scale invariance have led to the use of rank-based measures:
- Scale-sensitive geometric metrics: Dominated by large-valued channels; suboptimal when features are highly clustered or informative cues are in small-value channels (Zheng et al., 2023).
- Kendall’s rank correlation: A channel-ranking-based metric insensitive to feature scaling. For $d$-dimensional feature vectors $\mathbf{x}$ and $\mathbf{y}$, Kendall's $\tau$ measures the agreement of pairwise channel orderings,
$$\tau(\mathbf{x}, \mathbf{y}) = \frac{2}{d(d-1)} \sum_{i<j} \operatorname{sign}(x_i - x_j)\,\operatorname{sign}(y_i - y_j).$$
The DiffKendall method replaces geometric similarities with a differentiable approximation of Kendall's $\tau$ (smoothing the non-differentiable sign function; see the sketch after this list). This enables effective integration into episodic meta-learning pipelines and yields 1–4% accuracy gains on multiple FSL benchmarks (Zheng et al., 2023).
- Impact and ablations: Replacing geometric similarity with rank correlation, even only at test time, increases robustness to channel masking or domain shift, and provides greater accuracy when fine-grained discriminative cues are subtle.
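A minimal sketch of such a differentiable surrogate, assuming a tanh soft-sign relaxation with temperature `alpha`; the exact relaxation used by DiffKendall may differ in detail:

```python
import torch

def soft_kendall(x, y, alpha=10.0):
    """Differentiable surrogate for Kendall's tau between feature vectors
    x, y of shape (d,), using tanh as a smooth stand-in for sign."""
    dx = x.unsqueeze(0) - x.unsqueeze(1)      # (d, d) pairwise channel diffs
    dy = y.unsqueeze(0) - y.unsqueeze(1)
    agree = torch.tanh(alpha * dx) * torch.tanh(alpha * dy)
    d = x.numel()
    iu = torch.triu_indices(d, d, offset=1)   # channel pairs i < j
    return agree[iu[0], iu[1]].sum() * 2.0 / (d * (d - 1))
```

As `alpha` grows, the surrogate approaches the exact (non-differentiable) Kendall's $\tau$.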
3. Optimization, Meta-Learning, and Task-Specific Adaptation
Meta-learning frameworks aim to learn how to learn: they optimize an initialization, an update procedure, or both, so that a few updates on a novel few-shot task suffice for adaptation. Two major streams are:
A. Meta-Initialization and Adaptation
- Meta-SGD: Learns both an initialization and per-parameter learning rates for a metric-based base learner, enabling a small number of gradient steps on each task's support set to yield task-specific parameters (Wang et al., 2019); a minimal inner-update sketch follows this list.
- Meta-Metric-Learner: Couples a metric learner (e.g., Matching Networks) with an LSTM-based or learned update rule that adapts the metric to each task, supporting variable-way scenarios, unbalanced classes, and multi-domain transfer (Cheng et al., 2019).
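A minimal sketch of the Meta-SGD-style inner update, assuming the meta-learned initialization and per-parameter learning rates are given as lists of tensors; the helper names are illustrative:

```python
import torch

def meta_sgd_inner_step(params, alphas, support_loss_fn):
    """One adaptation step where both `params` (initialization) and `alphas`
    (per-parameter learning rates) are meta-learned. `support_loss_fn` maps
    the parameter list to the support-set loss."""
    loss = support_loss_fn(params)
    grads = torch.autograd.grad(loss, params, create_graph=True)
    # create_graph=True keeps the step differentiable so the outer (meta)
    # loop can backpropagate through the adaptation.
    return [p - a * g for p, a, g in zip(params, alphas, grads)]
```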
B. Strong Teacher Supervision
- Knowledge distillation meta-objectives: LastShot introduces a strong "teacher" classifier (e.g., nearest centroid over all data) for each meta-training episode's classes and supplements sparsely-labeled query supervision with KL divergence between the meta-learner and teacher predictions (Ye et al., 2021). This approach increases the effective supervision, especially for higher-shot tasks, and outperforms pure meta-learners in the many-shot regime.
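A minimal sketch of such a distillation-augmented query objective, assuming per-episode teacher logits are available; the mixing weight and temperature below are illustrative hyperparameters rather than the paper's settings:

```python
import torch.nn.functional as F

def distilled_query_loss(student_logits, teacher_logits, query_y,
                         lam=0.5, T=4.0):
    """Combine hard query labels with KL divergence to a strong episode
    teacher, LastShot-style."""
    ce = F.cross_entropy(student_logits, query_y)
    kl = F.kl_div(F.log_softmax(student_logits / T, dim=1),
                  F.softmax(teacher_logits / T, dim=1),
                  reduction="batchmean") * T * T  # standard T^2 rescaling
    return (1 - lam) * ce + lam * kl
```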
4. Auxiliary Modalities, Unlabeled Data, and Self-Supervision
A. Self-Supervised Feature Augmentation
- Rotation prediction, relative patch location (jigsaw tasks), and unsupervised objectives are integrated into base FSL pipelines to yield richer, more invariant visual representations (Gidaris et al., 2019). These auxiliary losses improve transfer, and the benefit is amplified when external, unlabeled data are included in the self-supervised objective.
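A minimal sketch of the rotation-prediction auxiliary loss, assuming an image batch in `(B, C, H, W)` layout; `encoder` and `rot_head` stand in for whatever backbone and auxiliary head the pipeline already uses:

```python
import torch
import torch.nn.functional as F

def rotation_aux_loss(encoder, rot_head, images):
    """Rotate each image by 0/90/180/270 degrees and train a small head
    to predict which rotation was applied."""
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)])
    targets = torch.arange(4, device=images.device).repeat_interleave(len(images))
    return F.cross_entropy(rot_head(encoder(rotated)), targets)
```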
B. Semantic-Driven Few-Shot Learning
- Large language model (LLM) features for class names are exploited via prompt-tuned textual embeddings, with learned adapters and simple additive fusion with visual features. SimpleFSL demonstrates that CLIP's zero-shot capability, augmented with meta-learnable prompts and self-ensembling, matches or surpasses prior semantic-fusion models by 2–3% in 1-shot accuracy (Zhou et al., 2024).
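A minimal sketch of additive visual-textual fusion, assuming matched feature dimensions; the adapter shape and the learnable gate are illustrative assumptions rather than SimpleFSL's exact design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class AdditiveFusion(nn.Module):
    """Fuse visual class prototypes with adapted text embeddings via a
    learnable convex mixing weight."""
    def __init__(self, dim):
        super().__init__()
        self.adapter = nn.Linear(dim, dim)        # adapts text embeddings
        self.g = nn.Parameter(torch.tensor(0.5))  # visual/text mixing gate

    def forward(self, visual_protos, text_feats):
        fused = self.g * visual_protos + (1 - self.g) * self.adapter(text_feats)
        return F.normalize(fused, dim=-1)
```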
C. Pseudo-Labeling and Negative Learning
- Negative pseudo-labeling selects which class each unlabeled point almost certainly does not belong to, performing iterative exclusion and entropy regularization to robustly expand the support set (Wei et al., 2022). This approach ("MUSIC") provides state-of-the-art semi-supervised FSL accuracy with minimal implementation overhead.
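A minimal sketch of one negative-learning step, assuming softmax probabilities over the unlabeled pool; the threshold and the promotion rule are illustrative simplifications of the iterative procedure:

```python
import torch

def negative_pseudo_labels(probs, threshold=0.05):
    """For each unlabeled point, mark classes it almost certainly does NOT
    belong to (probability under `threshold`). Points whose exclusions
    leave exactly one candidate class can be promoted to the support set."""
    negative_mask = probs < threshold        # (U, N) boolean exclusions
    candidates = (~negative_mask).sum(dim=1)
    confident = candidates == 1              # single surviving class
    pseudo_y = probs.argmax(dim=1)           # that class is the argmax
    return negative_mask, confident, pseudo_y
```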
D. Dependency Maximization and Instance Discriminant Analysis
- Maximizing the Hilbert–Schmidt norm between embedded features and predicted labels over the unlabeled pool, followed by Fisher-criterion–based selection of the most discriminative pseudo-labeled instances, allows iterative support set augmentation and achieves leading results in transductive/semi-supervised benchmarks (Hou et al., 2021).
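A minimal sketch of the biased empirical HSIC estimator over two kernel matrices, which is the dependency measure underlying this approach; applying it with linear kernels on embeddings and predicted label scores is an illustrative choice:

```python
import torch

def hsic(K, L):
    """Biased empirical HSIC between kernel matrices K, L of shape (n, n),
    e.g. K from embedded features and L from predicted label scores."""
    n = K.size(0)
    H = torch.eye(n, device=K.device) - 1.0 / n   # centering matrix I - J/n
    return torch.trace(K @ H @ L @ H) / (n - 1) ** 2
```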
5. Robustness to Adversity and Continual Adaptation
A. Label Noise in Supports
- Robust prototype refinement employs the RNNP procedure: synthesize hybrid (interpolated) features, apply soft K-means clustering across supports/hybrids/queries, and iteratively update prototypes to mitigate the effect of corrupted support labels. This simple plug-in raises accuracy by 5.9% at 40% noise over vanilla nearest-neighbor prototypes (Mazumder et al., 2020).
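A minimal sketch of the hybrid-feature synthesis stage, with illustrative interpolation weight and hybrid count; the subsequent soft K-means refinement over supports, hybrids, and queries is omitted for brevity:

```python
import torch

def rnnp_hybrids(support, support_y, n_way, lam=0.7, n_mix=10):
    """Synthesize interpolated ("hybrid") features within each class; these
    dilute the influence of any single mislabeled support example."""
    hybrids, hybrid_y = [], []
    for c in range(n_way):
        feats = support[support_y == c]
        i = torch.randint(len(feats), (n_mix,))
        j = torch.randint(len(feats), (n_mix,))
        hybrids.append(lam * feats[i] + (1 - lam) * feats[j])
        hybrid_y.append(torch.full((n_mix,), c))
    return torch.cat(hybrids), torch.cat(hybrid_y)
```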
B. Adversarial Querying
- The AQ meta-learning framework appends adversarial query perturbation to each meta-training episode's query set, meta-optimizing the feature extractor such that post-adaptation performance remains high even under strong PGD or MI-FGSM attacks (Goldblum et al., 2019). AQ achieves up to 44.8% robust accuracy (vs. near-zero for many transfer-based or pre-processing defenses) and has minimal compute overhead.
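A minimal sketch of the query-side perturbation using standard PGD, assuming inputs in $[0, 1]$; the budget, step size, and iteration count are illustrative settings:

```python
import torch
import torch.nn.functional as F

def pgd_queries(model, queries, query_y, eps=8/255, step=2/255, iters=7):
    """L_inf PGD on the episode's query set, as used during meta-training
    with adversarial querying."""
    adv = queries.clone().detach()
    for _ in range(iters):
        adv.requires_grad_(True)
        loss = F.cross_entropy(model(adv), query_y)
        grad, = torch.autograd.grad(loss, adv)
        adv = adv.detach() + step * grad.sign()
        adv = queries + (adv - queries).clamp(-eps, eps)  # project to ball
        adv = adv.clamp(0, 1)
    return adv.detach()
```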
C. Continual and Lifelong Learning
- Few-Shot Lifelong Learning (FSLL) and two-step consolidation (TSC) architectures explicitly select a small, low-importance subset of weights to train for each new class batch ("parameter isolation"), regularize for prototype separation, and interleave self-supervision to avoid overfitting and catastrophic forgetting (Mazumder et al., 2021, Wang et al., 2021).
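A minimal sketch of the parameter-isolation idea, using weight magnitude as a stand-in importance proxy; the actual selection criteria in FSLL/TSC differ, so this is illustrative only:

```python
import torch

def low_importance_mask(param, keep_frac=0.1):
    """Select the fraction of weights with smallest importance (magnitude
    as a simple proxy) to remain trainable for a new session."""
    k = max(1, int(keep_frac * param.numel()))
    thresh = param.abs().flatten().kthvalue(k).values
    return param.abs() <= thresh  # True where the weight may be updated

def mask_gradients(model, masks):
    """After backward(), zero gradients of frozen (high-importance) weights;
    `masks` is aligned with model.parameters() ordering."""
    for p, m in zip(model.parameters(), masks):
        if p.grad is not None:
            p.grad.mul_(m.to(p.grad.dtype))
```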
6. Generalization Bounds, Task-Weighting, and Theoretical Guarantees
A. Per-Sample Rich Supervision
- Explicit per-sample feature relevance, used to construct uncertainty ellipsoids in feature space, enables the definition of ellipsoid-margin losses that penalize misclassification on confident dimensions but allow slack on uncertain ones. This yields improved sample complexity, scaling with the number of relevant features rather than the full feature dimension when only a small subset of features is relevant, and tighter generalization bounds (Visotsky et al., 2019).
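A minimal sketch of an ellipsoid-margin hinge loss for a linear scorer, where the worst case over a per-sample axis-aligned ellipsoid shrinks the margin by the norm of the radii-weighted weights; an illustrative rendering of the idea, not the paper's exact loss:

```python
import torch

def ellipsoid_margin_loss(w, b, x, y, u):
    """Hinge loss evaluated at the worst point of each sample's uncertainty
    ellipsoid: x has shape (B, d), y in {-1, +1}, u holds per-feature radii."""
    margin = y * (x @ w + b)         # nominal signed margin, shape (B,)
    slack = (u * w).norm(dim=1)      # worst-case margin shift per sample
    return torch.clamp(1 - margin + slack, min=0).mean()
```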
B. Uncertainty-Based Task Weighting
- Simultaneous Perturbation Stochastic Approximation (SPSA) adapts the task weighting coefficients in multi-task FSL via random perturbation and stochastic gradient estimates, optimizing a task-uncertainty weighted objective and demonstrating convergence under general conditions (Boiarov et al., 2020). SPSA improves accuracy by up to +1.8% over unweighted baselines in Omniglot 1-shot 20-way tasks.
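A minimal sketch of one SPSA update of the task-weighting coefficients; the fixed gain constants are illustrative (SPSA typically decays them over iterations):

```python
import torch

def spsa_step(w, objective, a=0.1, c=0.05):
    """Perturb all coordinates of `w` simultaneously with a Rademacher
    vector and form a two-point finite-difference gradient estimate."""
    delta = torch.randint(0, 2, w.shape, dtype=w.dtype) * 2 - 1  # entries +/-1
    g_hat = (objective(w + c * delta) - objective(w - c * delta)) / (2 * c) * delta
    # multiplying by delta equals dividing by it, since each entry is +/-1
    return w - a * g_hat
```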
7. Emerging Architectures and Extensions
A. Hypernetwork Parameterization
- HyperShot eschews gradient-based adaptation by employing a hypernetwork that synthesizes the parameters of a task-specific classifier directly from a kernelized summary of the support set (Sendera et al., 2022). By focusing on inter-support similarities rather than absolute embedding values, HyperShot attains competitive performance in one-shot and five-shot image classification.
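A minimal sketch of a hypernetwork head in this spirit, assuming a flattened support-support kernel as the task summary and a linear classifier as the generated target; layer sizes are illustrative:

```python
import torch
import torch.nn as nn

class HyperShotHead(nn.Module):
    """Map a kernelized support summary to classifier weights, then score
    queries with the generated classifier."""
    def __init__(self, n_support, n_way, feat_dim, hidden=256):
        super().__init__()
        self.hyper = nn.Sequential(
            nn.Linear(n_support * n_support, hidden), nn.ReLU(),
            nn.Linear(hidden, n_way * feat_dim))
        self.n_way, self.feat_dim = n_way, feat_dim

    def forward(self, support_emb, query_emb):
        K = support_emb @ support_emb.t()             # support-support kernel
        W = self.hyper(K.flatten()).view(self.n_way, self.feat_dim)
        return query_emb @ W.t()                      # query logits
```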
B. Information Retrieval Perspective
- Directly optimizing mean Average Precision (mAP) as a batch-wise, retrieval-oriented loss, rather than cross-entropy or triplet losses, globally aligns episodic representations and facilitates strong performance in classification and cross-instance retrieval (Triantafillou et al., 2017).
C. Malware Recognition and Structured Input
- Pretrained transformer-based encoders (e.g., byte-level LLMs) can be used with prototype-based FSL on sequential or structured non-visual data, enabling accurate identification of novel malware types from minimal labels, generalizing beyond the vision domain (Stein et al., 2024).
FSL research integrates geometric, statistical, self-supervised, continual, and robustification strategies, achieving notable performance advances across domains. Fundamental methodological innovations—such as rank-based metrics, prompt-tuned semantic fusion, dependency maximization, and meta-learned adaptation—have demonstrably widened the operational range of FSL models, enabling few-shot learning under noisy labels, scarce supports, unlabeled pools, long-tailed class distributions, and sequential task arrival. Empirical ablations and theoretical results collectively underscore the importance of robust feature construction, explicit handling of uncertainty, multi-modal alignment, and adaptive supervision in effective and generalizable few-shot learning (Zheng et al., 2023, Mazumder et al., 2020, Zhou et al., 2024, Hou et al., 2021, Cheng et al., 2019, Wang et al., 2019, Ye et al., 2021, Boiarov et al., 2020).