
Few-Shot In-Context Learning

Updated 25 March 2026
  • Few-shot in-context learning is a paradigm that enables models to generalize to new tasks using a limited set of input-output examples without parameter updates.
  • Demonstration selection and prompt engineering are critical, employing strategies like semantic similarity retrieval, skill-oriented rewriting, and contrastive selection.
  • The approach shows broad applicability across modalities and languages, with recent advances enhancing computational efficiency, distillation, and robustness.

Few-shot in-context learning is a paradigm within machine learning—most notably in LLMs and other sequence models—where a model adapts to new tasks or domains at inference time by conditioning on a small set (the "few shots") of input–output demonstrations, without any parameter updates. This mechanism allows a single pretrained model to generalize to new tasks, modalities, or domains by simply observing a handful of explicit example pairs within a prompt or context window.

1. Formal Definitions and Paradigm

Few-shot in-context learning (ICL) operates by concatenating $k$ demonstrations of input–output pairs $(x_1, y_1), \dots, (x_k, y_k)$ to form a context, followed by a query input $x_{k+1}$. The model is then tasked with predicting the corresponding output $\hat{y}$, often by maximizing the conditional likelihood:

$$\hat{y} = \arg\max_{y \in \mathcal{C}} \; P\left(y \mid [x_1, y_1; \dots; x_k, y_k; x_{k+1}]\right)$$

where $[\,\cdot\,]$ denotes concatenation in the prompt, and $\mathcal{C}$ is the candidate space (e.g., the label set).

No gradient updates or fine-tuning of model parameters occur during adaptation; all learning occurs "in context" through prompt conditioning. This setting extends to cross-modal and cross-lingual scenarios by including demonstrations from divergent modalities or languages within the few-shot context (Hee et al., 2024, Cahyawijaya et al., 2024).
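The prediction rule above can be sketched in a few lines. Here `score` is a toy word-overlap heuristic standing in for an LM's conditional likelihood $P(y \mid \text{prompt})$; the prompt template and function names are illustrative, not from any particular system.

```python
# Minimal sketch of few-shot ICL: build a prompt from k demonstrations,
# then pick the candidate label with the highest (stand-in) score.

def build_prompt(demos, query):
    """Concatenate (input, output) demonstrations followed by the query."""
    lines = [f"Input: {x}\nOutput: {y}" for x, y in demos]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

def score(query, candidate, demos):
    """Toy scorer: word overlap between the query and demos carrying
    `candidate` as their label (NOT a real LM likelihood)."""
    q_words = set(query.lower().split())
    return sum(len(q_words & set(x.lower().split()))
               for x, y in demos if y == candidate)

def icl_predict(demos, query, candidates):
    prompt = build_prompt(demos, query)  # the context fed to the model
    return max(candidates, key=lambda y: score(query, y, demos))

demos = [("the movie was great", "positive"),
         ("the plot was dull", "negative"),
         ("great acting overall", "positive")]
print(icl_predict(demos, "a great film", ["positive", "negative"]))
```

In a real system the `max` over candidates is computed from the model's conditional probabilities given the concatenated prompt; only the scoring function changes, not the overall argmax structure.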

2. Demonstration Selection and Prompt Engineering

Performance in few-shot ICL depends critically on the choice and structure of demonstration examples. Strategies include:

  • Semantic Similarity Retrieval: Retrieve the $k$ demonstrations most semantically similar to the test query using learned embedding spaces, as in multilingual NLU and semantic parsing (Winata et al., 2023, An et al., 2023).
  • Skill-Oriented Rewriting: Generate "skill-based" descriptions via LLMs to eliminate spurious surface similarities and focus on procedural or semantic overlap (An et al., 2023).
  • Contrastive Selection: Employ contrastive learning to encode multiple aspects relevant for the task (e.g., semantic, boundary, and label similarity for nested NER) and select demonstrations that optimize an aggregate score (Zhang et al., 2024).
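The first strategy, similarity-based retrieval, reduces to a nearest-neighbor search in embedding space. A minimal sketch, using a toy bag-of-words embedding as a stand-in for a learned encoder (the pool contents and `retrieve` signature are illustrative):

```python
# Semantic-similarity demonstration retrieval: embed the pool and the
# query, then keep the k nearest demonstrations by cosine similarity.
import math
from collections import Counter

def embed(text):
    """Toy bag-of-words embedding; real systems use a learned encoder."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(pool, query, k):
    q = embed(query)
    ranked = sorted(pool, key=lambda d: cosine(embed(d[0]), q), reverse=True)
    return ranked[:k]  # the k most similar (input, output) pairs

pool = [("book a flight to paris", "travel"),
        ("what is the capital of france", "geography"),
        ("reserve a flight for tomorrow", "travel")]
print(retrieve(pool, "book me a flight", k=2))
```

Skill-oriented and contrastive variants keep the same top-$k$ structure but change what `embed` encodes (procedural skill descriptions, or multiple contrastively learned aspects aggregated into one score).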

Prompt construction is often tightly designed to match either pretraining objectives (e.g., span corruption in T5) or downstream classification/generation formats, with elements such as explicit label verbalization, rationale prompting, and specialized templates for structured data (e.g., SQL-formatted dialogue state tracking) (Hee et al., 2024, Hu et al., 2022, Lee et al., 2023, Zhang et al., 2024).

Best practices include limiting prompt size (often $k \leq 16$ due to context-window constraints), using explicit label rationales, and aligning prompt structure with model pretraining distributions (Hee et al., 2024, Lee et al., 2023).
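Explicit label verbalization and rationale prompting can be captured in a simple template. The verbalizer mapping and field names below are illustrative choices, not a fixed standard from the cited papers:

```python
# Sketch of prompt construction with label verbalization and an
# optional rationale slot per demonstration.

VERBALIZER = {0: "negative", 1: "positive"}  # label id -> natural-language token

def format_demo(text, label, rationale=None):
    out = f"Review: {text}\n"
    if rationale:
        out += f"Reasoning: {rationale}\n"  # rationale prompting, when available
    out += f"Sentiment: {VERBALIZER[label]}"
    return out

def format_prompt(demos, query):
    parts = [format_demo(*d) for d in demos]
    parts.append(f"Review: {query}\nSentiment:")  # model completes the label
    return "\n\n".join(parts)

demos = [("loved every minute", 1, "strong positive wording"),
         ("a waste of time", 0, None)]
print(format_prompt(demos, "surprisingly fun"))
```

Aligning this template with the model's pretraining format (e.g., span-corruption-style slots for T5, or SQL-shaped outputs for dialogue state tracking) is what the objective-alignment recommendations above refer to.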

3. Cross-Modal, Cross-Lingual, and Cross-Domain Generalization

Few-shot ICL generalizes not only across tasks within a modality but also across modalities and languages:

  • Cross-Modal Transfer: Textual demonstrations significantly improve vision-language hate speech detection, outperforming visual-language demonstrations due to richer linguistic pattern coverage and better transfer to tasks where visual-to-text translation (e.g., OFA captions) underspecifies semantic content (Hee et al., 2024).
  • Cross-Lingual Transfer: In low-resource languages, "in-context query alignment"—mapping semantically similar source-target pairs via parallel corpora within the prompt—outperforms label alignment, yielding consistent F1 gains and robustness to label-shift artifacts (Cahyawijaya et al., 2024).
  • Non-linguistic Domains: Learned in-context sequence models (e.g., for molecular property prediction or 6-DoF robotic alignment) can be trained from scratch without language pretraining, using permutation-equivariant and graph-structured encoders, and still match or surpass classic meta-learners at small $k$ (Fifty et al., 2023, Vosylius et al., 2023).

4. Computational and Algorithmic Enhancements

Recent research has advanced few-shot ICL via several algorithmic augmentations:

  • Parallel Batching and Weighted Decoding: ParaICL executes batches of demonstrations in parallel (rather than concatenating all into a single context), weights batch predictions by semantic similarity, and applies an adaptive plausibility constraint for robust token selection. This yields consistent improvements in accuracy, especially as $k$ increases (Li et al., 2024).
  • Negative Sample Leverage: Incorporating negative samples (i.e., error modes from zero-shot CoT) into demonstration selection, and then retrieving additional positive exemplars most similar to these negatives, reduces performance sensitivity and enhances test accuracy (Liang et al., 2025).
  • Optimization-based Contextualization: Context Tuning initializes trainable prompts or cache states with embeddings of real demonstration examples and directly optimizes these prompt representations at inference time on the new task, achieving fast adaptation with low memory overhead, outperforming conventional prompt tuning and approaching test-time weight-update approaches (Lu et al., 2025).
  • Fusion and Objective-Aligned Prompting in Seq2Seq Models: Early/late fusion of encoded demonstrations (as in FiD or RAG) and prompt alignment to pretraining objectives enable seq2seq LMs to outperform much larger decoder-only models in true few-shot settings (Lee et al., 2023).
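The parallel-batching idea in the first bullet can be sketched as a weighted mixture of per-batch next-token distributions. The distributions and similarity weights below are mocked, and the plausibility threshold `alpha` is an illustrative parameter, not ParaICL's exact formulation:

```python
# ParaICL-style combination: each demonstration batch yields a
# next-token distribution over candidates; distributions are mixed with
# weights proportional to each batch's similarity to the query, and a
# plausibility cut drops tokens far below the best-scoring one.

def combine(batch_dists, weights, alpha=0.1):
    """Weighted mixture of per-batch distributions with a plausibility cut."""
    total = sum(weights)
    mixed = {}
    for dist, w in zip(batch_dists, weights):
        for tok, p in dist.items():
            mixed[tok] = mixed.get(tok, 0.0) + (w / total) * p
    best = max(mixed.values())
    # keep only tokens within a fraction alpha of the best probability
    plausible = {t: p for t, p in mixed.items() if p >= alpha * best}
    return max(plausible, key=plausible.get)

# Two demonstration batches voting over candidate answers, weighted by
# hypothetical similarity scores 0.8 and 0.2.
dists = [{"yes": 0.7, "no": 0.3}, {"yes": 0.2, "no": 0.8}]
print(combine(dists, weights=[0.8, 0.2]))
```

Because batches run in parallel, each stays within the context window even when the total number of demonstrations is large, which is what lifts the $k$ ceiling noted later in the limitations.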

5. Knowledge Distillation and Student Model Compression

In-context learning ability can be distilled into smaller student models:

  • Context Distillation: Soft labels produced by a large model under few-shot ICL prompting are used as distillation targets. This allows a much smaller student (e.g., OPT-125M) to internalize the context sensitivity of a 1.3B model, achieving a nearly 50% improvement in out-of-domain accuracy with up to 60% reduction in memory consumption (Duan et al., 2024).
  • Meta-ICT and Multitask-ICT: Distillation protocols combining in-context learning objectives and standard language modeling further improve student generalization in both meta-training and multitask regimes, with the best results when both loss components are included (Huang et al., 2022).
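The core of context distillation is matching the student's output distribution to the teacher's few-shot soft labels, typically with a KL-divergence loss. A pure-Python sketch under that assumption (a real pipeline would use an autodiff framework and batch over examples):

```python
# Context distillation loss: the teacher's soft labels, produced under a
# few-shot ICL prompt, are the student's training targets.
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kl_divergence(teacher_probs, student_logits):
    """KL(teacher || student): the distillation loss for one example."""
    student_probs = softmax(student_logits)
    return sum(p * math.log(p / q)
               for p, q in zip(teacher_probs, student_probs) if p > 0)

teacher = softmax([2.0, 0.5, -1.0])      # soft labels from the prompted teacher
loss_far = kl_divergence(teacher, [0.0, 0.0, 0.0])
loss_near = kl_divergence(teacher, [2.0, 0.5, -1.0])
print(loss_near < loss_far)  # matching the teacher lowers the loss
```

The Meta-ICT/Multitask-ICT results above correspond to adding a standard language-modeling term to this distillation loss, with both components weighted in the student's total objective.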

6. Limitations and Open Challenges

Despite strong empirical advances, several challenges persist:

  • Concept Exploitation: Most LLMs (except T0 variants) fail to leverage demonstrations that share latent reasoning concepts with the query, deferring instead to superficial distributional cues or spurious correlations (Štefánik et al., 2022).
  • Context-length Bottlenecks: Vanilla ICL is limited by the available prompt window; performance can plateau or degrade as $k$ increases beyond 16–32, motivating parallelization and compression techniques (Li et al., 2024).
  • Demonstration Noise and Retrieval Bias: Naive retrieval based on raw input similarity induces surface-level biases; best practices involve semantic/skill-based or contrastive representations to avoid overfitting to spurious patterns (An et al., 2023, Zhang et al., 2024).
  • Fine-grained Diversity Handling: Demonstration diversity (in shape, syntax, or reasoning chain) is more effective than mere count increase—diverse examples can dramatically lower prediction error, particularly in settings such as few-shot 6-DoF object alignment (Vosylius et al., 2023).

7. Applications and Empirical Performance

Few-shot in-context learning has demonstrated strong empirical performance across a wide range of tasks, modalities, and resource settings.

Results consistently indicate that with appropriate demonstration selection, prompt structuring, and retrieval policy, few-shot ICL can match or exceed traditional fine-tuned or supervised systems, even in structured, low-resource, and cross-modal scenarios. Best-practice guidelines stress the use of semantically relevant demonstrations (ideally matching the task’s underlying skill or concept), prompt alignment to the pretraining or task format, and adoption of computational enhancements such as context tuning or batch-wise fusion to maximize sample efficiency and transferability.

