Few-Shot Learning Approaches
- Few-shot learning is a paradigm that enables models to generalize to new tasks using only a few labeled examples, often employing meta-learning and metric-based techniques.
- Key challenges include severe data scarcity, task variability, and overfitting, which drive the development of advanced adaptation methodologies.
- Recent methods leverage contrastive self-supervised learning, transductive inference, and multimodal extensions to enhance performance across diverse domains.
Few-shot learning (FSL) is a paradigm in machine learning that seeks to enable models to generalize to new concepts given only a handful of labeled examples per class. This stands in contrast to standard supervised learning regimes, which rely on large labeled datasets. The few-shot paradigm is especially relevant in domains where data is scarce, labeling is expensive (such as in medical imaging or rare event detection), data distribution shifts are common, or rapid adaptation to novel classes is necessary. Core challenges include severe data scarcity, task and domain variability, and mitigating overfitting when adapting to a new task with only small support sets. The field has produced an array of algorithmic families—meta-learning, transfer learning, semi-supervised/transductive, active, continual, and causal/contrastive approaches—each bringing distinctive solutions to the sample-efficiency bottleneck and the task-specific adaptation problem (Parnami et al., 2022).
1. Mathematical Formulation and Core Protocols
In the canonical FSL setting, a few-shot episode is an $M$-way $K$-shot classification problem. Each episode consists of a support set $\mathcal{S} = \{(x_i, y_i)\}_{i=1}^{M \times K}$, with $K$ labeled samples for each of $M$ classes, and a query set $\mathcal{Q}$ for evaluation. The model must rapidly infer a classification function $f: \mathcal{X} \to \{1, \dots, M\}$ from $\mathcal{S}$, often with non-overlapping classes between meta-train and meta-test splits, under very small $K$.
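To make the episode protocol concrete, the following minimal sketch samples a single $M$-way $K$-shot episode from a class-indexed pool of examples; the data layout and function names are illustrative assumptions, not taken from any cited codebase.

```python
import random

def sample_episode(data_by_class, m_way=5, k_shot=1, n_query=15):
    """Draw one M-way K-shot episode: a labeled support set and a query set.

    data_by_class: dict mapping each class label to a list of examples
    (an assumed layout for illustration).
    """
    classes = random.sample(sorted(data_by_class), m_way)
    support, query = [], []
    for episode_label, cls in enumerate(classes):
        examples = random.sample(data_by_class[cls], k_shot + n_query)
        # The first k_shot examples form the labeled support set; the rest are queries.
        support += [(x, episode_label) for x in examples[:k_shot]]
        query += [(x, episode_label) for x in examples[k_shot:]]
    return support, query
```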
Key challenges include:
- Data scarcity: $K$ is often in the range 1–16, so empirical risk minimization is ill-posed.
- Task variability and distribution shift: Classes and data distributions at test time typically differ significantly from those encountered during training.
- Domain and modality generalization: The model must often handle new types of input (new modalities, cross-lingual data, etc.).
Major approaches fall into the following algorithmic families:
| Approach Type | Principle | Example Algorithms |
|---|---|---|
| Metric-based Meta-Learning | Learn embeddings and task-adaptive metrics | ProtoNet, RelationNet, MatchingNet, TADAM |
| Optimization-based Meta-Learning | Learn initializations or update procedures | MAML, Meta-SGD, LEO, CAVIA |
| Model-based Meta-Learning | Learn memory or fast-weight modules | MANN, MetaNet, MM-Net |
| Transfer Learning | Pretrain feature extractor, then adapt to new tasks | Fine-tuning, SimpleShot, Frozen |
| Contrastive/Self-Supervised Paradigms | Leverage invariances and augmentations | CSSL-FSL, MICM, COLA |
| Transductive Inference | Utilize unlabeled queries for task structure | TEAM, LaplacianShot, ICI, Dependency Max |
| Semi-supervised/Active | Iteratively pseudo-label or actively query labels | ICI, AFSC, IDA |
| Causal/Interventional | Remove pretraining biases via SCM and adjustment | IFSL |
| Continual/Lifelong | Adapt in streaming or environment-shifting regimes | OAP, CPM, Proto-OML |
The $M$-way $K$-shot episode protocol and its transductive, semi-supervised, active, online, and cross-domain variants form the backbone of empirical evaluation in the field (Parnami et al., 2022, Lunayach et al., 2022, Abdali et al., 2022).
2. Meta-Learning, Metric-Based, and Optimization-Based Approaches
Meta-learning seeks to “learn to learn”: it casts FSL as a bi-level optimization in which tasks $\mathcal{T}_i$ are drawn from a task distribution $p(\mathcal{T})$ and the meta-learner is trained across many sampled tasks to acquire an inductive bias that facilitates rapid adaptation.
- Metric-based approaches (e.g., Prototypical Networks, Matching Networks, Relation Networks) learn an embedding $f_\phi$ and a similarity or distance function $d(\cdot,\cdot)$, so that classification reduces to nearest-prototype or nearest-neighbor search in the embedding space. ProtoNet, for instance, computes class prototypes as $c_k = \frac{1}{|S_k|}\sum_{(x_i, y_i) \in S_k} f_\phi(x_i)$ and classifies queries via $p(y = k \mid x) \propto \exp\!\big(-d(f_\phi(x), c_k)\big)$ (Parnami et al., 2022); a minimal sketch follows this list.
- Optimization-based approaches (e.g., MAML, LEO, Meta-SGD, CAVIA) meta-learn an initialization and/or update rule, such that adaptation on the support set via a few gradient steps achieves low loss on the query set. Classic MAML forms the meta-objective $\min_\theta \sum_{\mathcal{T}_i \sim p(\mathcal{T})} \mathcal{L}^{\mathrm{query}}_{\mathcal{T}_i}\big(\theta - \alpha \nabla_\theta \mathcal{L}^{\mathrm{support}}_{\mathcal{T}_i}(\theta)\big)$; a toy sketch of this bi-level loop appears at the end of this section.
Task-agnostic meta-learning (TAML) extends this with entropy- or inequality-based regularization to avoid overfitting the meta-initialization to the training-task distribution (Jamal et al., 2018).
- Model-based approaches integrate external memory or dynamic fast-weight modules (Parnami et al., 2022).
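As a concrete illustration of the metric-based family, the sketch below implements the nearest-prototype rule described above in plain NumPy; the embedding network is assumed to be given, and this is an illustrative sketch rather than the reference ProtoNet implementation.

```python
import numpy as np

def prototypical_predict(support_emb, support_labels, query_emb, n_way):
    """Nearest-prototype classification in an embedding space (ProtoNet-style).

    support_emb: (N*K, d) NumPy array of support embeddings
    support_labels: (N*K,) integer class indices in [0, n_way)
    query_emb: (Q, d) NumPy array of query embeddings
    Returns (Q, n_way) class probabilities from a softmax over negative
    squared Euclidean distances to the class prototypes.
    """
    # Class prototypes: mean embedding of each class's support examples.
    prototypes = np.stack([support_emb[support_labels == k].mean(axis=0)
                           for k in range(n_way)])                 # (n_way, d)
    # Squared Euclidean distance from every query to every prototype.
    d2 = ((query_emb[:, None, :] - prototypes[None, :, :]) ** 2).sum(-1)  # (Q, n_way)
    logits = -d2
    logits -= logits.max(axis=1, keepdims=True)                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
    return probs
```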
Meta-learning algorithms deliver excellent performance and flexibility but may entail high computational cost at test time or sensitivity to hyperparameters. Metric learners have gained renewed value via deep features pretrained under self-supervised, contrastive, or transfer paradigms.
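The bi-level structure behind that cost can be made concrete on a toy one-parameter regression problem. The sketch below uses the first-order (FOMAML) approximation, so the second-order terms of the exact MAML meta-gradient are dropped; the task distribution, step sizes, and names are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def loss_and_grad(w, x, y):
    """Squared error of the toy model f(x) = w * x, and its gradient in w."""
    err = w * x - y
    return np.mean(err ** 2), 2.0 * np.mean(err * x)

w = 0.0                      # meta-initialization (a single scalar parameter)
alpha, beta = 0.05, 0.01     # inner- and outer-loop step sizes

for meta_step in range(2000):
    a = rng.uniform(-2.0, 2.0)                 # sample a task: y = a * x
    x_s, x_q = rng.normal(size=10), rng.normal(size=10)
    y_s, y_q = a * x_s, a * x_q
    # Inner loop: one gradient step on the support set.
    _, g_s = loss_and_grad(w, x_s, y_s)
    w_task = w - alpha * g_s
    # Outer loop: first-order meta-update using the query-set gradient
    # evaluated at the adapted parameters (FOMAML approximation).
    _, g_q = loss_and_grad(w_task, x_q, y_q)
    w = w - beta * g_q
```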
3. Transfer Learning, Contrastive, and Self-Supervised Paradigms
Transfer learning pretrains a feature extractor on large-scale data (often base classes disjoint from the target FSL classes) and then adapts (or reuses) these representations for few-shot inference, via nearest-neighbor classifiers, prototypical classifiers, or shallow fine-tuning (Parnami et al., 2022).
Recent innovations include:
- Contrastive self-supervised learning: Feature representations are learned by maximizing agreement between augmented views of the same sample (the InfoNCE loss; a minimal sketch follows this list), improving base-class generalization and downstream few-shot accuracy. Methods such as CSSL-FSL (Li et al., 2020) and MICM (Zhang et al., 23 Aug 2024) formalize two-phase pretraining (representation learning, then a per-task classifier) or hybridize contrastive and masked image modeling (MIM) objectives to balance generalization and discriminability.
- Graph-based contrastive and meta-learning: In node-level few-shot classification, contrastive learning with graph augmentations is uniquely potent. The COLA paradigm leverages all nodes for label-free meta-task construction, aligning the strengths of contrastive and meta-episodic learning to outperform conventional methods (Liu et al., 2023).
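A minimal sketch of the InfoNCE objective mentioned above, using one augmented view as anchors and the rest of the batch as negatives (a simplified, one-directional variant; not the exact loss of CSSL-FSL, MICM, or COLA):

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """InfoNCE loss for a batch of paired augmented views.

    z1, z2: (B, d) embeddings of two augmentations of the same B samples.
    Positive pairs are (z1[i], z2[i]); all other rows of z2 act as negatives.
    """
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature             # (B, B) cosine similarities
    logits -= logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Cross-entropy with the diagonal (matching view) as the target.
    return -np.mean(np.diag(log_probs))
```

Representative accuracies reported by the cited works: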
| Method | Benchmark (Setting) | 1-shot Acc. | 5-shot Acc. |
|---|---|---|---|
| ProtoNet [SimCLR] | mini-ImageNet (inductive) | 54.3% | — |
| MICM+OpTA | mini-ImageNet (transductive) | 81.05% | — |
| COLA | Cora (graph 2-way) | 84.58% | 94.03% |
Contrastive paradigms scale naturally to arbitrary $M$-way $K$-shot settings and can be decoupled from a fixed $M$ and $K$ at meta-training. Self-supervised and contrastive objectives have also been shown to be critical for cross-modal, semi-/unlabeled, and cross-domain FSL generalization (Zhang et al., 23 Aug 2024, Chadha et al., 2023).
4. Transductive, Semi-Supervised, and Active Learning Paradigms
Traditional FSL classifiers perform inductive inference using only the support set. Transductive FSL admits access to the entire set of queries (or auxiliary unlabeled samples) at inference, exploiting their joint structure for metric or label propagation.
- Transductive metric learning (e.g., TEAM (Qiao et al., 2019), LaplacianShot, Dependency Maximization (Hou et al., 2021)) adapts the metric space per episode using both labeled and unlabeled points, often by means of closed-form SDP solutions, instance credibility analysis, or dependency maximization over RKHS embeddings.
- Semi-supervised / iterative self-labeling methods extend the labeled support set with the most credible or class-separating pseudo-labeled instances from the unlabeled pool, as in ICI (Wang et al., 2020) and Instance Discriminant Analysis (Hou et al., 2021); a generic sketch of this loop follows the list.
- Active Few-Shot Classification (AFSC) introduces a selection mechanism: given a labeling budget and an unlabeled pool, the algorithm selects which examples to label to maximize global accuracy via information-theoretic or uncertainty criteria (Abdali et al., 2022).
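A generic iterative self-labeling loop in the spirit of these methods (illustrative only; it uses a simple nearest-prototype classifier and distance-based confidence rather than the credibility or discriminant criteria of ICI/IDA):

```python
import numpy as np

def iterative_pseudo_label(support_emb, support_labels, unlabeled_emb,
                           n_way, n_rounds=5, per_round=2):
    """Grow the support set with confident pseudo-labels from an unlabeled pool.

    support_emb: (N*K, d) array, support_labels: (N*K,) int array,
    unlabeled_emb: (U, d) array. Returns the augmented support set.
    """
    sup_x, sup_y = support_emb.copy(), support_labels.copy()
    pool = unlabeled_emb.copy()
    for _ in range(n_rounds):
        if len(pool) == 0:
            break
        # Nearest-prototype predictions over the remaining pool.
        protos = np.stack([sup_x[sup_y == k].mean(0) for k in range(n_way)])
        d2 = ((pool[:, None, :] - protos[None, :, :]) ** 2).sum(-1)
        pred = d2.argmin(1)
        conf = -d2.min(1)                      # higher = closer to a prototype
        # Promote the globally most confident pseudo-labels this round.
        take = np.argsort(conf)[::-1][:per_round * n_way]
        sup_x = np.concatenate([sup_x, pool[take]])
        sup_y = np.concatenate([sup_y, pred[take]])
        pool = np.delete(pool, take, axis=0)
    return sup_x, sup_y
```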
Transductive and active paradigms are especially suited for data-scarce or heavily imbalanced class regimes, yielding up to 10% absolute gains over fixed-support transductive baselines with the same labeling budget (Abdali et al., 2022). Active strategies bridge the few-shot and active learning fields.
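In the AFSC setting above, the selection step can be as simple as ranking the unlabeled pool by predictive entropy and spending the labeling budget on the most uncertain points. The sketch below shows this baseline criterion, which is one of several possibilities in the active few-shot literature and not the specific criterion of Abdali et al.

```python
import numpy as np

def select_queries_to_label(probs, budget):
    """Choose which unlabeled examples to annotate under a fixed budget.

    probs: (U, n_way) predicted class probabilities for the unlabeled pool.
    Returns the indices of the `budget` highest-entropy (most uncertain) examples.
    """
    entropy = -(probs * np.log(probs + 1e-12)).sum(axis=1)
    return np.argsort(entropy)[::-1][:budget]
```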
5. Multimodal, Multitask, and Domain-Specific Extensions
Few-shot learning now spans vision, audio, graph data, language, and multimodal tasks:
- Multimodal/multitask multilingual FSL (e.g., FM3) deploys frozen vision and language encoders, uses task-adaptive hypernetworks with contrastive fine-tuning, and supports joint tasks (VQA, NER, natural language understanding) with strong parameter efficiency and accuracy (Chadha et al., 2023).
- Prompt-based multimodal models (PM²) inject class semantics via engineered or learned prompts for text–image classification, combining linear and covariance-based heads over visual tokens (a generic covariance-pooling sketch follows this list). This is critical in domains such as medical imaging, where rich prior context and second-order visual statistics are vital (Wang et al., 13 Apr 2024).
- Specialized biomedical paradigms (FAST) address annotation scarcity in whole slide image (WSI) classification by combining dual-level annotation (slide and patch), attention-based label cache for patch-level knowledge transfer, and adaptive prompt priors, yielding near-supervised performance at 0.22% annotation cost (Fu et al., 29 Sep 2024).
- Few-shot audio classification protocols integrate metric learners with supervised contrastive losses (including angular margins), producing state-of-the-art results in low-shot audio settings (MetaAudio) (Sgouropoulos et al., 12 Sep 2025).
- Graph contrastive and meta-learning approaches synergistically combine full-graph invariance with episodic discriminative bias for few-shot node classification (Liu et al., 2023).
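For the covariance-based heads mentioned in the PM² item above, second-order pooling of visual tokens can be sketched as follows; this is a generic computation under assumed token shapes and naming, not the paper's implementation.

```python
import numpy as np

def covariance_pool(tokens):
    """Second-order (covariance) pooling of a sequence of visual tokens.

    tokens: (T, d) token embeddings from a vision backbone.
    Returns a (d, d) covariance descriptor; PM²-style models pair such
    second-order heads with ordinary linear heads on the same tokens.
    """
    centered = tokens - tokens.mean(axis=0, keepdims=True)
    return centered.T @ centered / max(tokens.shape[0] - 1, 1)
```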
This domain expansion brings with it new annotation paradigms, modalities, architectural building blocks (e.g., prompt engineering, hypernetwork adapters, covariance pooling), and evaluation metrics.
6. Continual, Online, and Causal Paradigms
- Few-shot online continual learning evaluates models under non-stationary distributions and emerging classes, imposing stability-plasticity tradeoffs. Prototype averaging (OAP) balances memory retention and adaptation, while contextual memory (CPM), memory replay, and regularization-based methods modulate forgetting and plasticity (Lunayach et al., 2022).
- Causal/interventional FSL paradigms model pretraining bias as a confounder and invoke backdoor adjustment to recover the unbiased interventional distribution $P(y \mid \mathrm{do}(x))$ (a generic adjustment computation is sketched after this list). These adjustments (feature-wise, class-wise, combined) can be wrapped around any baseline classifier or meta-learner and yield robust improvements, especially on hard queries and for deeper backbones (Yue et al., 2020).
- Information-retrieval (IR) views of FSL treat all batch points as both queries and retrieval “documents,” optimizing ranking-based objectives such as mean average precision to fully exploit the scarce data in each episode (Triantafillou et al., 2017).
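The backdoor adjustment referenced above amounts to averaging stratified predictions over a confounder prior, $P(y \mid \mathrm{do}(x)) = \sum_c P(y \mid x, c)\,P(c)$. The sketch below shows this generic computation; how the confounder strata are constructed from the pretrained model is method-specific, and this is not the IFSL implementation.

```python
import numpy as np

def backdoor_adjusted_probs(cond_probs, confounder_prior):
    """Generic backdoor adjustment: P(y | do(x)) = sum_c P(y | x, c) * P(c).

    cond_probs: (C, n_way) predictions P(y | x, c) for a single query x,
        computed under each of C confounder strata (e.g., partitions of the
        pretrained feature space or base classes; an assumed setup).
    confounder_prior: (C,) prior P(c) over the strata.
    Returns the (n_way,) adjusted class distribution.
    """
    return confounder_prior @ cond_probs
```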
Continual, active, and causal paradigms highlight the limits of conventional FSL and propose fundamentally new directions.
7. Future Directions, Limitations, and Impact
Current frontiers and open directions in FSL research include:
- Expanding beyond fixed $M$-way $K$-shot episodes: Handling variable-way and variable-shot episodes, or generalizing to joint classification over base and novel classes (generalized FSL).
- Robust cross-domain adaptation: Approaches that maintain sample-efficiency under severe distribution shift, dataset bias, or domain transfer.
- Integrating unsupervised, active, and semi-supervised components: Especially relevant for domains with large pools of unlabeled data and limited annotation budgets.
- Scalability and efficiency: Reducing computational and memory demands of episodic/meta, contrastive, or graph-based methods, and exploring lighter architectures (e.g., MobileViT).
- Domain-specific architecture innovations: Leveraging vision-language prompts, covariance pooling, or hypernetworked adapters for task- and modality-specific challenges.
- Furthering theoretical understanding: Clarifying the relationships between contrastive learning and meta-learning, causality and confounding, and sample complexity.
Limitations persist in balancing flexibility and adaptation, addressing the stability–plasticity tradeoff, reducing sensitivity to hyperparameters and support-set selection, and scaling to truly low-resource or cross-modal settings. However, recent work demonstrates substantial advances; e.g., FM3 matches or surpasses fully fine-tuned models on multimodal and multilingual tasks (Chadha et al., 2023), and MICM outperforms prior state-of-the-art unsupervised pretraining in both in-domain and cross-domain FSL (Zhang et al., 23 Aug 2024).
Few-shot learning, as a paradigm, continues to expand its theoretical, algorithmic, and empirical scope across tasks, modalities, and settings, remaining a central concern for generalization in data-scarce environments.