In-Context Learning Capabilities
- In-context learning (ICL) is the capability of large language models to infer new task mappings from few-shot demonstrations in the prompt, without any update to model parameters.
- Quantitative analyses reveal that ICL performance depends on prompt context length, model scale, and signal-to-noise ratios, often exhibiting sharp phase transitions.
- Optimized prompt construction and retrieval strategies, including n-gram induction and schema activation, significantly enhance ICL performance across diverse applications.
In-context learning (ICL) is a distinctive capability of LLMs wherein they can adapt to new tasks on the fly by conditioning on a prompt containing a handful of input–output demonstrations, without updating their model parameters. This emergent property underpins a wide range of state-of-the-art results in both language and multimodal AI. Recent research has produced both deep formal analyses and extensive empirical benchmarks, revealing that ICL is not a monolithic phenomenon but instead constitutes a complex interplay of mechanisms including task recognition, task learning, retrieval, prompt optimization, data-centric effects, architectural constraints, and curriculum dynamics. This article surveys the precise definitions, theoretical formalisms, experimental protocols, phase transitions, representational structures, and practical implications of in-context learning, as established by contemporary research.
1. Formal Characterization and Dual Modes of ICL
At its core, ICL with LLMs involves prepending a prompt context—typically $k$ input–output demonstrations $(x_1, y_1), \ldots, (x_k, y_k)$—to a novel input $x_{\text{query}}$ and predicting its label by evaluating:

$$\hat{y} = \arg\max_{y} \; p_\theta\left(y \mid x_1, y_1, \ldots, x_k, y_k, x_{\text{query}}\right),$$

where $\theta$ are the frozen LLM parameters and the demonstrations implicitly specify a task mapping (Pan et al., 2023, Dong et al., 2022).
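In code, this amounts to concatenating the demonstrations ahead of the query and scoring each candidate label under the frozen model. A minimal sketch follows; the `label_logprob` scorer is a toy stand-in for a real LLM's conditional log-probability, and the helper names are our own:

```python
# Minimal sketch of few-shot ICL prompting: demonstrations are concatenated
# ahead of the query, and each candidate label is scored under a frozen model.
# `label_logprob` is a toy stand-in for a real LLM call.

def build_prompt(demos, query):
    """Format (input, label) demonstrations followed by the unlabeled query."""
    lines = [f"Input: {x}\nLabel: {y}" for x, y in demos]
    lines.append(f"Input: {query}\nLabel:")
    return "\n\n".join(lines)

def label_logprob(prompt, label):
    """Toy scorer: counts how often `label` appears in the prompt.
    A real system would query p_theta(label | prompt) from the frozen LLM."""
    return prompt.count(f"Label: {label}")

def icl_predict(demos, query, label_space):
    """argmax_y p_theta(y | x_1, y_1, ..., x_k, y_k, x_query)."""
    prompt = build_prompt(demos, query)
    return max(label_space, key=lambda y: label_logprob(prompt, y))

demos = [("great movie", "positive"), ("terrible plot", "negative"),
         ("loved it", "positive")]
print(icl_predict(demos, "wonderful film", ["positive", "negative"]))
```

No parameters are updated at any point; the "learning" lives entirely in the conditioning context.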
ICL capabilities factor into two fundamental operating modes (Pan et al., 2023, Lin et al., 2024):
- Task Recognition (TR): The model identifies the underlying task from the marginal distributions of inputs and label words, regardless of whether the input–label pairings are meaningful. Formally, the model’s prediction is approximately invariant to permuting the labels across demonstrations: for any permutation $\sigma$,

$$p_\theta\left(y \mid x_1, y_{\sigma(1)}, \ldots, x_k, y_{\sigma(k)}, x_{\text{query}}\right) \approx p_\theta\left(y \mid x_1, y_1, \ldots, x_k, y_k, x_{\text{query}}\right).$$

TR leverages patterns and priors acquired during pretraining.
- Task Learning (TL): The model infers a new or arbitrary mapping from the specific pairing of demonstration inputs and outputs, exhibiting a capacity for on-the-fly function learning (e.g., learning a new input–label mapping not seen in pretraining).
Extensive controlled experiments confirm that TR saturates quickly with small model sizes and few demonstrations, whereas TL emerges only at scale and with larger numbers of demonstrations $k$. For example, in GPT-3, TR accuracy remains flat from 350M to 175B parameters, while TL (evaluated via abstract labels not present in pretraining) rises from chance to approach TR+TL performance only for the largest models with large $k$ (Pan et al., 2023).
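The TR/TL probe can be made concrete by rendering the same demonstrations three ways: gold labels (TR+TL together), shuffled labels (isolating TR, since pairings are destroyed but label marginals survive), and abstract labels such as "foo"/"bar" (isolating TL, since the mapping cannot come from pretraining priors). A sketch in the spirit of this protocol, with helper names of our own:

```python
import random

# Three renderings of the same demonstrations, used to disentangle task
# recognition (TR) from task learning (TL):
#   gold     -> measures TR+TL together
#   shuffled -> labels re-paired at random; only marginals survive (TR)
#   abstract -> labels replaced by pretraining-free symbols (TL)

def render(pairs, query):
    body = "\n".join(f"{x} -> {y}" for x, y in pairs)
    return f"{body}\n{query} ->"

def probe_prompts(demos, query, abstract_map, seed=0):
    rng = random.Random(seed)
    gold = render(demos, query)
    labels = [y for _, y in demos]
    rng.shuffle(labels)                       # destroy input-label pairings
    shuffled = render(list(zip([x for x, _ in demos], labels)), query)
    abstract = render([(x, abstract_map[y]) for x, y in demos], query)
    return {"gold": gold, "shuffled": shuffled, "abstract": abstract}

demos = [("great", "positive"), ("awful", "negative"), ("superb", "positive")]
p = probe_prompts(demos, "dreadful", {"positive": "foo", "negative": "bar"})
print(p["abstract"])
```

Comparing model accuracy across the three variants then attributes performance to TR, TL, or both.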
2. Quantitative Phase Diagrams and Scaling Laws
The scaling behavior of ICL in both synthetic and realistic settings is governed by geometric and statistical tradeoffs among the input dimension $d$, the context length $k$, the pretraining task diversity, and the signal-to-noise ratio (SNR) (Chandrupatla et al., 28 Apr 2026).
Key scaling laws for linear Gaussian-mixture classification tasks:
- ICL accuracy is near-perfect when the context length $k$ exceeds a threshold that grows with the input dimension $d$ and shrinks with the SNR.
- A critical “phase transition” in generalization occurs at this threshold, with accuracy improving exponentially fast above it.
- Margin and concentration bounds guarantee that, for fixed task diversity and cluster separation, the context length $k$ must scale with $d$ and inversely with the signal strength to maintain low test error.
- Benign overfitting arises: memorizing noisy in-context labels while preserving strong test accuracy is possible when $k$ is sufficiently large and the label noise is moderately bounded (Chandrupatla et al., 28 Apr 2026, He et al., 2024).
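The context-length tradeoff can be illustrated with a toy simulation, our own simplified construction rather than the cited paper's exact setting: an "in-context" nearest-centroid classifier on a two-cluster Gaussian mixture, whose test accuracy rises with the number of labeled context points $k$:

```python
import random

# Toy illustration of the context-length tradeoff: estimate class centroids
# from k labeled context points per class, then label the query by the nearer
# centroid. Accuracy rises with k as centroid noise shrinks like sigma/sqrt(k).
# Our own simplified construction, not the cited paper's exact model.

def trial(k, d=10, sigma=3.0, rng=None):
    means = {+1: [1.0] * d, -1: [-1.0] * d}
    ctx = {c: [[m + rng.gauss(0, sigma) for m in means[c]] for _ in range(k)]
           for c in (+1, -1)}
    centroids = {c: [sum(col) / k for col in zip(*ctx[c])] for c in (+1, -1)}
    true_c = rng.choice([+1, -1])
    query = [m + rng.gauss(0, sigma) for m in means[true_c]]
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    pred = min((+1, -1), key=lambda c: dist2(query, centroids[c]))
    return pred == true_c

def accuracy(k, trials=400, seed=0):
    rng = random.Random(seed)
    return sum(trial(k, rng=rng) for _ in range(trials)) / trials

acc_small, acc_large = accuracy(k=1), accuracy(k=64)
print(acc_small, acc_large)
```

With long contexts the classifier approaches the Bayes-optimal accuracy for this separation and noise level; with $k=1$ the noisy centroids substantially degrade the decision boundary.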
A Bayesian perspective formalizes ICL as posterior inference over task parameters $\theta_{\text{task}}$:

$$p(y \mid x_{\text{query}}, \mathcal{D}_k) = \int p(y \mid x_{\text{query}}, \theta_{\text{task}}) \, p(\theta_{\text{task}} \mid \mathcal{D}_k) \, d\theta_{\text{task}}, \qquad \mathcal{D}_k = \{(x_i, y_i)\}_{i=1}^{k},$$

yielding an explicit trade-off: the context examples $\mathcal{D}_k$ push the model’s output distribution from the pretraining prior $p(\theta_{\text{task}})$ toward the query task posterior, with the necessary context length set by the KL-divergences between task distributions (Song et al., 26 Oct 2025).
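A discrete worked example makes the trade-off tangible: with two candidate tasks that explain each demonstration with different likelihoods, the posterior moves from the (pretraining) prior toward the true task as $k$ grows, at a rate set by the per-example log-likelihood gap. All numbers below are illustrative assumptions:

```python
import math

# Discrete toy version of the Bayesian view of ICL: two candidate tasks, one
# true. Each consistent demonstration multiplies the odds by the likelihood
# ratio, so the posterior log-odds grow linearly in k (a KL-divergence rate).
# Prior and likelihoods are illustrative assumptions.

def posterior_true_task(k, prior_true=0.1, lik_true=0.9, lik_other=0.5):
    """P(true task | k demonstrations all consistent with the true task)."""
    log_odds = math.log(prior_true / (1 - prior_true)) \
               + k * math.log(lik_true / lik_other)
    return 1 / (1 + math.exp(-log_odds))

for k in (0, 2, 8, 32):
    print(k, round(posterior_true_task(k), 4))
```

At $k=0$ the model outputs its prior (0.1 here); the posterior crosses 0.9 once $k \log(\text{lik}_\text{true}/\text{lik}_\text{other})$ overcomes the prior log-odds, which is exactly the "necessary context length" in the trade-off above.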
3. Mechanistic Insights and Representational Evidence
ICL arises from the alignment between pretraining data structure and the inference-time prompt (Han et al., 2023, Wibisono et al., 2024):
- In tasks where the correct output can be deduced from co-occurrence statistics alone (e.g., word analogies, country–capital pairs), even non-attentional models (CBOW) can exhibit ICL given sufficiently diverse, intermingled training data (Wibisono et al., 2024).
- For logical and pattern-based tasks (e.g., repeat-first, function regression), positional information and model depth become essential. Failure to include positional cues or training diversity leads to ICL breakdown: the model either collapses to majority-class guessing or cannot generalize to new patterns (Wibisono et al., 2024).
- Pretraining on rare-token, long-context, informationally sparse data disproportionately improves ICL ability, yielding gains of up to 18 percentage points on downstream tasks (Han et al., 2023).
At the architectural level, transformer models uniquely develop n-gram “induction heads” in attention layers, which support efficient in-context matching and generalization to regular language patterns; adding explicit n-gram computation blocks to other sequence models yields substantial ICL gains (Akyürek et al., 2024). In-context learning can be functionally simulated by explicit mean-pooling operations in a linear transformer and is recoverable by hard-wiring such structures (Akyürek et al., 2024, Song et al., 26 Oct 2025).
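The induction-head mechanism can be sketched directly: to predict the next token, find the most recent earlier occurrence of the current token and copy the token that followed it. The function below is a hard, non-attentional rendering of what induction heads compute softly inside attention layers:

```python
# Simplified, non-attentional rendering of an induction head: to predict the
# token after the current position, scan backwards for the most recent earlier
# occurrence of the current token and copy its successor. Real induction heads
# implement a soft version of this match-and-copy via attention.

def induction_predict(tokens):
    """Predict the next token after tokens[-1] by match-and-copy."""
    last = tokens[-1]
    for i in range(len(tokens) - 2, -1, -1):  # scan right-to-left
        if tokens[i] == last:
            return tokens[i + 1]              # copy the earlier successor
    return None                               # no match: abstain

seq = list("abcabcab")
print(induction_predict(seq))  # -> 'c'
```

On periodic sequences like `abcabc...` this rule alone achieves perfect continuation, which is why induction heads suffice for in-context matching of regular patterns.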
4. Evaluation Regimes and Decomposition of Effects
Recent benchmarks and analytical work decompose ICL success into orthogonal dimensions:
- Label-space regulation: Demos constrain the model’s outputs to target vocabulary.
- Format regulation: Demos enforce output structure (verbalizer, answer format).
- Discriminative improvement: Demos increase correct class prediction among outputs that are already in the label space and correctly formatted (Long et al., 2024).
Empirically, label-space and format regulation account for most ICL accuracy gains, while random demonstrations have little effect on discrimination. Retrieval of semantically similar examples—especially those both similar and label-diverse—amplifies discriminative correction, but exclusive retrieval of a single class may damage label-space coverage (Long et al., 2024, Zhao et al., 2024).
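The three effects can be measured with simple orthogonal counters over model outputs. A sketch of the decomposition; the metric names and the lowercase format check are our own simplifications:

```python
# Sketch of the three-way decomposition of ICL gains: for each model output,
# check (1) whether it lies in the target label space, (2) whether it obeys
# the expected format, and (3) whether it is correct among in-space,
# in-format outputs. Metric names and the format check are simplifications.

def decompose(outputs, golds, label_space, fmt=str.islower):
    in_space = [o in label_space for o in outputs]
    in_format = [fmt(o) for o in outputs]
    valid = [s and f for s, f in zip(in_space, in_format)]
    n_valid = sum(valid)
    return {
        "label_space_rate": sum(in_space) / len(outputs),
        "format_rate": sum(in_format) / len(outputs),
        "discrimination": (sum(o == g for o, g, v in zip(outputs, golds, valid) if v)
                           / n_valid) if n_valid else 0.0,
    }

outs = ["positive", "negative", "POSITIVE", "maybe"]
gold = ["positive", "positive", "positive", "negative"]
print(decompose(outs, gold, {"positive", "negative"}))
```

Ablating demonstrations while tracking these three rates separately reveals which effect a given prompt change actually moves.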
A coordinate system with “perception” (similarity to demonstrations) and “cognition” (task recognizability by the model) captures the operational regime: only in the quadrant with task recognition and similar exemplars are both memory and inference fully leveraged, while in other regions the model defaults to copying, guessing, or positional bias (Zhao et al., 2024).
5. Prompt and Retrieval Optimization
Optimizing the selection, ordering, and diversity of in-context demonstrations is critical for robust and performant ICL:
- Self-optimizing retrieval heads integrated into the LLM allow sequential, RL-trained demo selection that maximizes task-relevant context diversity and label coverage, outperforming both BM25 and dense retrieval baselines (Long et al., 2024).
- RL-ICL achieves accuracy improvements of 2–14 points over SimCSE and random selection, while using only a small fraction of the candidate pool (1–8%) and preserving label and source diversity (Long et al., 2024).
- The agentic multimodal ContextNav framework combines resource-aware embedding, agentic retrieval, structural alignment, and a closed-loop graph-based planning grammar, achieving ICL gains of 16.8%, compared with 7.6% for previous best practices (Fu et al., 6 Oct 2025).
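A greedy sketch of retrieval that trades off similarity against label coverage, in the spirit of the diversity-aware selectors above; this is our own simplification of the RL-trained retrieval heads, and `similarity` is a toy token-overlap score:

```python
# Greedy sketch of label-diverse demonstration retrieval: at each step, pick
# the candidate most similar to the query, but penalize labels already
# covered, so the selected set stays label-diverse. A toy stand-in for the
# RL-trained retrieval heads; `similarity` is Jaccard overlap on tokens.

def similarity(a, b):
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / max(1, len(ta | tb))

def select_demos(query, candidates, k, diversity_penalty=0.5):
    """candidates: list of (text, label). Returns k (text, label) demos."""
    chosen, covered = [], set()
    pool = list(candidates)
    for _ in range(k):
        def score(c):
            text, label = c
            penalty = diversity_penalty if label in covered else 0.0
            return similarity(query, text) - penalty
        best = max(pool, key=score)
        pool.remove(best)
        chosen.append(best)
        covered.add(best[1])
    return chosen

cands = [("great fun movie", "pos"), ("fun great film", "pos"),
         ("boring movie", "neg"), ("dull plot", "neg")]
print(select_demos("great movie fun", cands, k=2))
```

With the penalty active, the second pick skips the near-duplicate positive example in favor of covering the negative class, mirroring the similar-but-label-diverse criterion.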
6. Extensions: Schema Activation, Continuous Representations, and Multimodal ICL
- Schema-Activated ICL (SA-ICL): Conditioning the model on explicitly abstracted schema templates (inspired by cognitive schema theory) leads to pronounced gains: up to 36.19% in chemistry and physics QA, with high interpretability and task transfer even with a single demonstration (Chen et al., 14 Oct 2025).
- In-Context Vectors (ICV): Summarizing demonstrations into a latent vector that can steer inference-time LLM behavior enhances adherence to demonstration intent and efficiency, supports vector arithmetic for compositionality, and consistently outperforms standard ICL and LoRA fine-tuning across safety, style, and role-play tasks (Liu et al., 2023).
- Vector-ICL: LLMs can perform ICL on continuous projected embeddings from arbitrary pretrained encoders, achieving gains in text, graph, time-series, molecular, and fMRI tasks—often surpassing token-based ICL and domain-tuned baselines (Zhuang et al., 2024).
- Multimodal ICL: Vision–language LLMs (VLLMs) display ICL when prompted with image–text demonstration contexts; the scaling of few-shot multimodal ICL is currently bounded by context-length and vision encoder limitations, with performance peaking at 2–4 shots (Zong et al., 2024, Fu et al., 6 Oct 2025).
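The vector-arithmetic flavor of ICV can be shown with a deliberately toy analogue: summarize demonstrations as the mean embedding difference between outputs and inputs, add that vector to a new input, and decode by nearest neighbor. The character-count "embedding" below is purely illustrative, not a real LLM hidden state:

```python
from collections import Counter

# Toy analogue of In-Context Vectors (ICV): the "task vector" is the mean
# embedding difference between demonstration outputs and inputs; applying it
# to a new input and decoding by nearest neighbor transfers the task. The
# character-count embedding is illustrative only, not an LLM hidden state.

def embed(s):
    return Counter(s)

def task_vector(demos):
    vec = Counter()
    for x, y in demos:
        vec.update(embed(y))      # + output embedding
        vec.subtract(embed(x))    # - input embedding
    return Counter({c: n / len(demos) for c, n in vec.items()})

def apply_vector(x, vec, candidates):
    target = embed(x)
    target.update({c: n for c, n in vec.items()})  # steer the embedding
    def dist(cand):
        e = embed(cand)
        keys = set(target) | set(e)
        return sum((target[c] - e[c]) ** 2 for c in keys)
    return min(candidates, key=dist)

demos = [("cat", "cats"), ("dog", "dogs")]
vec = task_vector(demos)  # roughly: "add one 's'"
print(apply_vector("bird", vec, ["bird", "birds", "cars"]))
```

The extracted vector encodes "pluralize" independently of any particular demonstration, which is the compositionality property ICV exploits at the hidden-state level.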
7. Practical Recommendations and Future Perspectives
- Prompt construction: Ensure label diversity, high semantic similarity, and explicit output formatting. For code ICL, semantically meaningful identifier naming is crucial—removing identifier information reduces accuracy by up to 30 percentage points, whereas format and implementation have secondary effects (Li et al., 8 Aug 2025).
- Pretraining corpus design: Maximize rare-token, long-context, and pattern-varied examples to enhance downstream ICL capacity (Han et al., 2023, Wibisono et al., 2024).
- Architectural modifications: Incorporate explicit n-gram and schema representations for greater interpretability, efficiency, and human-like reasoning (Akyürek et al., 2024, Chen et al., 14 Oct 2025).
- Benchmarks and controls: Disentangle TR and TL by evaluating random-label vs. abstract-label performance; measure the marginal utility of retrieval and prompt engineering by orthogonal ablation studies (Pan et al., 2023, Long et al., 2024, Fang et al., 28 Apr 2025).
The current scientific consensus is that ICL in foundation models is an emergent, multifaceted capability, quantifiably driven by both pretraining priors and in-context Bayesian-style statistical inference over observed demonstrations. True task learning, especially with arbitrary or abstract label associations, manifests only at the highest model scales and with sufficiently long contexts, while task recognition from distributional cues is robust even in small and medium LLMs. Ongoing research continues to refine our understanding of representation, retrieval, compositionality, and modality in ICL, with substantial open questions regarding scaling limits, failure modes, and new forms of dynamic, schema-driven context construction (Pan et al., 2023, Long et al., 2024, Zhao et al., 2024, Akyürek et al., 2024, Han et al., 2023, Zhuang et al., 2024, Fu et al., 6 Oct 2025).