Few-Shot and In-Context Learning (ICL)

Updated 20 March 2026
  • Few-shot and in-context learning is a paradigm where large pretrained transformer models adapt to new tasks by conditioning on a handful of labeled demonstrations without updating model parameters.
  • Its performance hinges on precise prompt design and demonstration selection that guide the model’s attention and output formatting for improved accuracy.
  • Recent advancements reveal mechanistic insights and extend ICL to multimodal, cross-lingual, and vector settings while offering competitive alternatives to fine-tuning.

Few-Shot and In-Context Learning (ICL) formalizes a paradigm in which large pretrained models, typically transformer-based, adapt to new tasks at inference time by conditioning on prompts containing a small number of labeled demonstrations, but without updating model parameters. ICL enables rapid, task-flexible adaptation across domains, modalities, and languages—evoking classical meta-learning, but realized entirely through context-driven manipulation of attention and representation in the forward pass. Significant advances in both methodology and mechanistic understanding have led to state-of-the-art results across structured prediction, reasoning, generation, and low-resource language settings.

1. Formal Definition and Core Principles

In-context learning for few-shot problems considers a frozen model parameterization $\theta$ and a pretraining corpus that exposes the model to numerous diverse supervision episodes. At inference, the model is presented with a prompt typically structured as:

  • A "task instruction" in natural language
  • A set of $k$ demonstration examples $(x_i, y_i)$
  • A query $x^*$ for which the prediction $\hat y$ is generated

The model predicts $\hat y$ by computing $p_\theta(y \mid \text{prompt}, x^*)$, thus implicitly performing conditional task adaptation within the prompt context. For classification, the prediction is $\hat y = \arg\max_{c \in C} f_\theta(\text{prompt}, x^*)_c$, where $C$ is the label set and $f_\theta$ is the model’s softmax output (Long et al., 2024, Cahyawijaya et al., 2024).
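As a concrete sketch of this prediction rule, the snippet below takes the argmax over candidate-label scores, with a hypothetical `label_logprob` callable standing in for a frozen LM's conditional log-probability $p_\theta(c \mid \text{prompt}, x^*)$; the toy word-overlap scorer is purely illustrative.

```python
# Sketch of the ICL prediction rule: y_hat = argmax_c p_theta(c | prompt, x*).
# `label_logprob` is a hypothetical stand-in for querying a frozen LM;
# no model parameters are updated at any point.

def icl_predict(label_logprob, prompt, query, label_set):
    """Pick the label the frozen model scores highest given the prompt."""
    context = prompt + f"\nInput: {query}\nLabel:"
    return max(label_set, key=lambda c: label_logprob(context, c))

# Demonstrations are serialized into the prompt context.
demos = [("the movie was great", "positive"),
         ("the movie was awful", "negative")]
prompt = "\n".join(f"Input: {x}\nLabel: {y}" for x, y in demos)

# Toy stand-in scorer: rates a label by word overlap between the query
# and the demonstrations carrying that label (a real system calls an LLM).
def toy_logprob(context, label):
    query_words = set(context.rsplit("Input:", 1)[1].split())
    return sum(len(query_words & set(x.split()))
               for x, y in demos if y == label)

pred = icl_predict(toy_logprob, prompt, "a great film", ["positive", "negative"])
```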

Contemporary ICL research frames this as meta-learning instantiated in attention-based sequence models, where adaptation occurs in the forward pass, not through explicit weight updates (Lampinen et al., 2024). Formal criteria for ICL require that conditioning on the full context reduces expected prediction loss compared to shallow or single-instance conditioning: $\ell_{\text{full}} < \ell_{\text{shallow}}$, where $\ell_{\text{full}} = \mathbb{E}[\mathcal{L}(\theta; x_1, \ldots, x_t)]$ and $\ell_{\text{shallow}} = \mathbb{E}[\mathcal{L}(\theta; x_t)]$.

2. Mechanisms and Empirical Decomposition

Evidence from large-scale analytic studies demonstrates that few-shot ICL operates via several orthogonal mechanisms (Long et al., 2024):

  • Label Space Regulation: Demonstrations primarily serve to constrain the model’s output vocabulary to valid task labels, substantially increasing the fraction of predictions in the allowable label set.
  • Format Regulation: The prompt fixes the response format (verbalizer structure, answer phrasing), aligning outputs to evaluation scripts and expectations.
  • Marginal Discrimination: The extent to which ICL propagates real task discrimination (wrong→right flips within the in-format region) is comparatively minor and unstable, often offset by right→wrong flips, especially with randomly selected demonstrations.

Quantitatively, the decomposition

$\Delta \text{Acc}_{\text{ICL}} = P_{\text{space}} + P_{\text{format}} + P_{\text{disc}}$

reveals that $P_{\text{space}}$ and $P_{\text{format}}$ account for the overwhelming majority of ICL's benefit in general-purpose LLMs, with $P_{\text{disc}}$ contributing minimally except under semantic retrieval (Long et al., 2024). Providing explicit instructions about the label set and answer format nearly closes the gap to standard $k$-shot ICL, highlighting ICL’s function as an implicit instruction-following mechanism rather than true discrimination learning in many settings.
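The decomposition can be operationalized with a simple flip-counting scheme over paired zero-shot and ICL predictions. The bucketing rules below (`bucket`, `decompose`) are an illustrative simplification, not the exact protocol of Long et al. (2024): each accuracy flip is attributed to the category of the zero-shot prediction it corrected (or broke).

```python
# Simplified sketch of Delta_Acc = P_space + P_format + P_disc.
# Assumed operationalization: attribute each wrong->right (or right->wrong)
# flip to the state of the zero-shot prediction being replaced.

def bucket(pred, label_set):
    if pred.strip() in label_set:
        return "well_formed"          # clean, valid label
    if any(lbl in pred for lbl in label_set):
        return "in_space_malformed"   # valid label buried in extra text
    return "out_of_space"             # no valid label at all

def decompose(zero_preds, icl_preds, gold, label_set):
    n = len(gold)
    p_space = p_format = p_disc = 0.0
    for z, i, g in zip(zero_preds, icl_preds, gold):
        delta = (i.strip() == g) - (z.strip() == g)   # -1, 0, or +1
        if delta == 0:
            continue
        b = bucket(z, label_set)
        if b == "out_of_space":
            p_space += delta / n      # gain from entering the label space
        elif b == "in_space_malformed":
            p_format += delta / n     # gain from fixing the output format
        else:
            p_disc += delta / n       # genuine discrimination flips
    return p_space, p_format, p_disc

gold = ["pos", "neg", "pos", "neg"]
zero = ["I think pos", "banana", "neg", "neg"]   # toy zero-shot outputs
icl  = ["pos", "neg", "pos", "pos"]              # toy few-shot outputs
p_space, p_format, p_disc = decompose(zero, icl, gold, {"pos", "neg"})
```

By construction the three terms sum exactly to the observed accuracy gain, which is what makes the decomposition useful as a diagnostic.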

3. Demonstration Selection and Retrieval Strategies

ICL performance is acutely sensitive to the quality and structure of demonstration selection:

  • Similarity Retrieval: Semantic retrieval of demonstrations, e.g., using SimCSE, enables accurate selection of relevant examples, significantly improving $P_{\text{disc}}$ (true task discrimination) compared to random selection. However, retrieval sets may reduce label diversity, potentially hurting output format regularization (Long et al., 2024).
  • Multi-Facet Retrieval: For structurally complex tasks such as nested Named Entity Recognition (NER), multi-dimensional retrieval incorporating semantic, boundary, and label similarity via contrastive learning yields large gains in F1 (Zhang et al., 2024).
  • Automated and RL-Based Selection: Parameter-efficient retrieval heads and reinforcement learning policies enable models to learn to self-select and sequence demonstrations, achieving optimal trade-offs between representativeness, diversity, and relevance. Reward models leveraging the LM’s own log-probabilities are central to such approaches (Long et al., 2024).

The following table encapsulates the benefit of various demonstration selection mechanisms across representative studies:

| Selection Method | Domain/Task | Incremental Gain |
| --- | --- | --- |
| Random | General classification | Baseline |
| Semantic retrieval | NER, text classification | +2–7 F1 / +1–5 accuracy |
| Multi-facet contrast | Nested NER | +4–7 F1 vs. best prior |
| RL-based retrieval | Text & code generation | +2–8 points (task metric or BLEU) |

Ablations consistently show the largest F1/accuracy drops when semantic similarity is disabled, while structurally complex tasks additionally require explicit boundary or label cues for optimal generalization (Zhang et al., 2024).
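A minimal similarity-retrieval loop looks like the following. The bag-of-words vectors and cosine scoring are a deliberately crude stand-in for sentence embeddings such as SimCSE; only the retrieve-top-$k$ pattern is the point.

```python
# Sketch of similarity-based demonstration retrieval: embed the query,
# score every pooled demonstration, keep the k most similar.
from collections import Counter
import math

def embed(text):
    # Crude bag-of-words stand-in for a sentence encoder like SimCSE.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve_demos(pool, query, k=2):
    """Return the k (input, label) demonstrations most similar to the query."""
    q = embed(query)
    return sorted(pool, key=lambda d: cosine(embed(d[0]), q), reverse=True)[:k]

pool = [("the plot was thrilling", "positive"),
        ("tickets refunded, terrible show", "negative"),
        ("a thrilling plot and great cast", "positive"),
        ("the weather is cold", "neutral")]
demos = retrieve_demos(pool, "what a thrilling plot", k=2)
```

Note that, exactly as the ablations above suggest, retrieval on semantic similarity alone can return a label-homogeneous set (both retrieved demos here are "positive"), which is the diversity trade-off discussed in Long et al. (2024).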

4. Prompt Design, Modularity, and Scaling Regimes

Prompt modularity is essential for extensibility and clarity. Empirically effective prompts generally concatenate: task instructions, selected demonstrations, an explicit label set (if applicable), and the test query. Delineating roles (e.g., demonstration vs. label set vs. query) allows systematic experimentation and supports plug-and-play extension to new entity types, richer features, or output formats (Zhang et al., 2024).
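A modular prompt assembler in this spirit can be sketched as below; the section templates and field names are illustrative choices, not taken from any specific paper.

```python
# Sketch of a modular prompt: each component (instruction, demonstrations,
# label set, query) is rendered independently, so any part can be swapped
# (new entity types, richer features, different formats) without touching
# the others.

def render_prompt(instruction, demos, label_set, query):
    sections = [instruction]
    if label_set:                                   # optional component
        sections.append("Valid labels: " + ", ".join(label_set))
    for x, y in demos:                              # demonstration block
        sections.append(f"Input: {x}\nLabel: {y}")
    sections.append(f"Input: {query}\nLabel:")      # test query last
    return "\n\n".join(sections)

prompt = render_prompt(
    instruction="Tag each entity mention.",
    demos=[("Paris is lovely", "LOC: Paris")],
    label_set=["LOC", "PER", "ORG"],
    query="Alan Turing was born in London",
)
```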

Shot scaling exhibits consistent patterns:

  • Sharp gains at low $k$ ($k \leq 10$): The largest performance increases occur in the transition from zero-shot to few-shot; benefits rapidly plateau beyond $k = 20$–$32$ (Zhang et al., 2024, Wang et al., 2024).
  • Diminishing or negative returns at larger $k$: Both context-length bottlenecks and interference among excessive examples may decrease accuracy or fluency (Li et al., 2024).
  • Quality over quantity: Sample diversity and representativeness trump raw example count; random or poorly matched demos dilute label and format cues (Wang et al., 2024).

5. Mechanistic Interpretations and Theoretical Models

Recent mechanistic analyses have deepened the understanding of how few-shot ICL operates within neural architectures:

  • Task-Oriented Information Removal: Few-shot demonstrations implicitly drive the model to filter out task-irrelevant representation components, compressing query hidden states toward a low-dimensional task-verbalization subspace (TVS). This is accomplished by a small set of specialized “Denoising Heads” in transformer mid-to-late layers—ablation of which sharply degrades performance, especially on unseen-label tasks (Cho et al., 25 Sep 2025).
  • Implicit vs. Explicit ICL: Explicit prompt-based ICL can be “compiled” into concise attention-logit interventions (“in-context routing”) using reusable, input-conditioned routing directions extracted by PCA across tasks. This reduces inference costs and supports broader task generalization (Li et al., 26 Sep 2025).

Moreover, ICL phenomena encompass a wide spectrum—ranging from classical few-shot with explicit demonstrations, to instruction/meta-ICL, coreference, and analogical role binding, unifying meta-learning, goal-conditioned policy, and long-context sequence modeling under one theoretical umbrella (Lampinen et al., 2024).

6. Extensions: Multimodal, Cross-Lingual, and Vector ICL

Few-shot ICL has been systematically extended to modalities and settings beyond text:

  • Multimodal ICL: Vision-LLMs (VLMs, VLLMs, LVLMs) support image/text inputs and outputs; performance on benchmarks (e.g., VL-ICL Bench) improves in 0→2-shot regimes but is highly sensitive to context length, example order, and demonstration fusion policy. Hierarchical architectures and curriculum pretraining are promising directions (Zong et al., 2024, Chen et al., 11 Jun 2025).
  • Continuous Vector ICL: With lightweight projector modules, transformer LMs can ingest continuous vector embeddings from black-box encoders (e.g., for time series, molecules, fMRI), achieving in-context learning with vector-valued “pseudo-token” embeddings. Pretraining the projector is essential, and performance matches or exceeds both textual ICL and domain-tuned baselines (Zhuang et al., 2024).
  • Cross-Lingual and Low-Resource ICL: For low-resource and non-English languages, semantic retrieval across languages (e.g., via SBERT embedding) and in-context query alignment dramatically improve generalization, closing gaps with high-resource models (Cahyawijaya et al., 2024, Štefánik et al., 2023). Multilingual instruction tuning consistently outperforms English-only fine-tuning for Slavic and low-resource settings.
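The vector-ICL idea above can be sketched as a lightweight linear projector from a black-box encoder's embedding space into the LM's token-embedding space, so that continuous vectors act as pseudo-tokens. The dimensions and the random projector below are assumptions for illustration; in practice the projector is pretrained, as Zhuang et al. (2024) emphasize.

```python
# Illustrative vector-ICL projector: map encoder outputs (e.g. time series,
# molecule, or fMRI embeddings) into LM-embedding-sized "pseudo-tokens".
import numpy as np

rng = np.random.default_rng(0)
enc_dim, lm_dim = 16, 32                 # assumed encoder / LM widths

# Random weights stand in for a projector that would be pretrained.
W = rng.normal(scale=0.02, size=(enc_dim, lm_dim))

def project(vectors):
    """Map (n, enc_dim) encoder outputs to (n, lm_dim) pseudo-token embeddings."""
    return vectors @ W

series_embeddings = rng.normal(size=(5, enc_dim))   # e.g. 5 encoded steps
pseudo_tokens = project(series_embeddings)           # ready to prepend to a prompt
```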

7. Comparisons to Alternative and Complementary Paradigms

  • Few-Shot Fine-Tuning (FT): When strictly controlled for model size, number of examples, and evaluation regime, few-shot FT and ICL are comparable in both in-domain and out-of-domain performance. Large models ($\geq$6.7B) fine-tuned on as few as 16 examples can match or surpass ICL, with both paradigms subject to high run-to-run variance, spurious correlations, and pattern sensitivity (Mosbach et al., 2023).
  • Parameter-Efficient Fine-Tuning (PEFT): Adapter- and scaling-based PEFT variants (e.g., (IA)$^3$) can consistently outperform ICL in accuracy on standard few-shot benchmarks, while being orders of magnitude more efficient in FLOPs, memory, and storage. For example, (IA)$^3$-based T-Few on T0 is 1000$\times$ cheaper in inference FLOPs than GPT-3 ICL, achieving super-human accuracy on RAFT (Liu et al., 2022).
  • Instruction Tuning: In extremely low-shot CSS classification tasks, ICL robustly outperforms instruction tuning (e.g., via LoRA or full supervision), particularly in settings prone to overfitting or query-distribution shift (Wang et al., 2024).

8. Practical Considerations and Best Practices

  • Prompt Design: Explicitly separate instruction, demonstration, and label list; keep constraints concise and output formats tightly controlled (Zhang et al., 2024, Wang et al., 2024).
  • Demonstration Selection: Prefer semantically and structurally relevant, diverse demonstration sets. RL-based retrieval or iterative selection (e.g., IDS) offers automatic balancing of similarity and diversity (Long et al., 2024, Qin et al., 2023).
  • Batching and Scaling: For large $k$, parallel batching (ParaICL) avoids context-window overflow by semantic clustering and weighted aggregation over multiple short prompts, yielding consistent gains without degradation from input truncation (Li et al., 2024).
  • Privacy: Differentially private ICL is feasible through structured randomized sampling and mixture decoding, imposing negligible utility loss for $\epsilon = 1$–$2$ in ROUGE-L, even on open-ended generation (Flemings et al., 31 Jan 2025).
  • RL and Exploration: Exploration-exploitation policies in multi-modal or compositional ICL (e.g., for LVLMs) can effectively fuse information across supporting evidence, optimizing demonstration selection under reward supervision (Chen et al., 11 Jun 2025).
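The batching idea can be sketched as follows: split a large demonstration pool into several short prompts, score each independently, and aggregate by weighted soft vote. The per-batch score normalization used as the weighting rule here is an illustrative choice, not the exact aggregation of ParaICL, and `toy_score` stands in for a real LM scorer.

```python
# Sketch of parallel batched ICL: several short prompts instead of one
# long prompt, with predictions combined by weighted vote.
from collections import defaultdict

def batched(pool, batch_size):
    for i in range(0, len(pool), batch_size):
        yield pool[i:i + batch_size]

def para_icl(score_fn, pool, query, label_set, batch_size=2):
    votes = defaultdict(float)
    for batch in batched(pool, batch_size):
        scores = {c: score_fn(batch, query, c) for c in label_set}
        total = sum(scores.values()) or 1.0
        for c, s in scores.items():
            votes[c] += s / total        # each batch casts a normalized soft vote
    return max(votes, key=votes.get)

# Toy scorer: word overlap between the query and same-label demos in the batch.
def toy_score(batch, query, label):
    qw = set(query.split())
    return sum(len(qw & set(x.split())) for x, y in batch if y == label)

pool = [("great acting", "pos"), ("boring plot", "neg"),
        ("great soundtrack", "pos"), ("dull and boring", "neg")]
pred = para_icl(toy_score, pool, "great film great pacing", ["pos", "neg"])
```

Because each prompt only carries `batch_size` demonstrations, no single call risks exceeding the context window, which is the property motivating the approach.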

9. Open Problems and Future Directions

Emergent research questions include: the mechanistic origins of ICL in sequence modeling, cross-modal and interactive extension, dynamic adaptation of routing/intervention layers, principled causality and interpretability metrics, and the synthesis of explicit and implicit ICL schemes for true zero-shot generalizability (Lampinen et al., 2024, Li et al., 26 Sep 2025, Cho et al., 25 Sep 2025). The spectrum of ICL continues to expand, connecting classical meta-learning, instruction following, and resource-lean adaptation in a unified framework.
