Emergent In-Context Learning
- Emergent in-context learning is defined as the ability of large autoregressive models to adapt to novel tasks solely by conditioning on context examples without any parameter updates.
- Theoretical foundations show that Bayesian inference during next-token prediction and scaling laws underpin the emergence of this capability in complex, compositional data settings.
- Mechanistic insights reveal that transformer induction heads and kernel regression play crucial roles, with training data structure and dynamics determining its stability versus in-weight learning.
Emergent in-context learning (ICL) refers to the spontaneous ability of models, especially large autoregressive transformers, to adapt to new tasks solely by conditioning on context examples, without any parameter updates. Initially observed in LLMs, the phenomenon has since been analyzed theoretically, mechanistically, and empirically across language, vision, and world-model domains.
1. Formal Definition and Operational Paradigm
Emergent ICL is exhibited when a model performs a novel prediction task by conditioning only on a handful of input–output pairs (context or “demonstrations”) provided at inference time:
- No parameter updates are performed (“zero-shot” or “few-shot” ICL).
- The mapping from input to output for the query must be inferred directly from context, not memorized in weights.
This capability is not hard-coded—the same model, trained on generic next-token prediction, “learns to learn in context” at sufficient scale (Riechers et al., 23 May 2025, Bratulić et al., 9 Jan 2025).
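To make the operational paradigm concrete, the following minimal sketch builds a few-shot prompt; the `Input:`/`Output:` template and the `model` callable are illustrative assumptions rather than a prescribed format.

```python
# Minimal sketch of the few-shot ICL protocol: the task is conveyed purely
# through concatenated demonstrations; no weights are updated at any point.
def build_icl_prompt(demos, query, sep="\n"):
    """Concatenate input-output demonstrations, then the unanswered query."""
    lines = [f"Input: {x} Output: {y}" for x, y in demos]
    lines.append(f"Input: {query} Output:")
    return sep.join(lines)

prompt = build_icl_prompt([("2 + 2", "4"), ("7 + 5", "12")], "3 + 9")
print(prompt)
# completion = model(prompt)  # `model` is hypothetical; the input-output
#                             # mapping must be inferred from context alone
```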
2. Theoretical Foundations: Information Theory and Scaling Laws
Emergent ICL arises inevitably from standard next-token (autoregressive) pretraining, particularly over non-ergodic or compositional data sources. When the underlying data process is a mixture of latent tasks or distributions, a model optimal for next-token prediction must perform Bayesian inference about the current task identity from available context, which reduces entropy with growing context length:
- For a stationary sequence model, the context-dependent cross-entropy loss satisfies $\mathcal{L}(\ell) = H(X_{\ell+1}) - I(X_{1:\ell};\, X_{\ell+1})$, where $I(\cdot\,;\cdot)$ is the mutual information between the length-$\ell$ context and the next token (Riechers et al., 23 May 2025).
- For mixture (non-ergodic) models, correct prediction entails in-context adaptation: conditioning on context selectively suppresses competing latent hypotheses (Riechers et al., 23 May 2025); a minimal numerical sketch follows this list.
- Scaling laws predict the emergence of ICL only beyond a critical parameter threshold $N^{*}$ that grows with context length and task-hierarchy depth. ICL performance follows power-law scaling in the number of layers $L$, hidden width $w$, context length $T$, and training-data size $D$, with the precise exponents determined by task compositionality and smoothness (Mehta et al., 9 Nov 2025).
- Within this framework, transformers can implement gradient descent on the in-context loss during a single forward pass, with an effective learning rate determined by the learned attention parameters.
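The mixture-of-tasks argument can be illustrated with a toy numerical sketch: under an assumed two-task Bernoulli source (not the cited papers' actual setup), exact Bayesian updating over the latent task makes the posterior-predictive entropy fall as context accumulates.

```python
import numpy as np

# Toy mixture-of-tasks source: each sequence is generated by one of two
# latent Bernoulli "tasks". An optimal next-token predictor must infer a
# posterior over the task, so its predictive entropy drops with context.
biases = np.array([0.2, 0.8])          # latent task parameters (assumed)
prior = np.array([0.5, 0.5])

rng = np.random.default_rng(0)
task = rng.integers(len(biases))
seq = rng.random(20) < biases[task]    # one sampled context sequence

post = prior.copy()
for t, x in enumerate(seq, 1):
    lik = np.where(x, biases, 1 - biases)   # P(x | task)
    post = post * lik
    post /= post.sum()                      # Bayes update on task identity
    p1 = post @ biases                      # posterior-predictive P(next = 1)
    H = -(p1 * np.log2(p1) + (1 - p1) * np.log2(1 - p1))
    print(f"t={t:2d}  P(task)={post.round(3)}  predictive entropy={H:.3f} bits")
```

As the posterior concentrates on the true task, the printed entropy approaches that task's intrinsic entropy, mirroring the entropy reduction described above.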
3. Mechanistic Interpretability: Circuit-Level Explanations
Transformer-based ICL commonly operates via emergent “induction heads” and kernel regression mechanisms:
- Induction heads: Layered attention subcircuits that match queries to context examples, enabling “copy labels from context to query.” These heads arise only if sufficient exact repetition or burstiness exists in the data (Bratulić et al., 9 Jan 2025).
- Kernel regression view: At scale and on structured data, the model’s prediction on a query $x_q$ converges to the Nadaraya-Watson form $\hat{y}(x_q) = \frac{\sum_i K(x_q, x_i)\, y_i}{\sum_i K(x_q, x_i)}$, with the kernel $K$ implemented implicitly by the model’s learned hidden feature space and the self-attention mechanism (Han et al., 2023). A toy version is sketched after this list.
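A toy rendering of the kernel regression view, assuming cosine-similarity features `phi` and a softmax kernel as a stand-in for whatever kernel the trained network implicitly implements:

```python
import numpy as np

def icl_kernel_predict(phi_ctx, y_ctx, phi_q, temp=0.1):
    """Kernel-weighted vote over context labels (Nadaraya-Watson form)."""
    sims = phi_ctx @ phi_q / (
        np.linalg.norm(phi_ctx, axis=1) * np.linalg.norm(phi_q) + 1e-9)
    w = np.exp(sims / temp)          # softmax kernel, attention-like weights
    w /= w.sum()
    return w @ y_ctx                 # normalized weighted average of labels

# Example: 3 context items with 2-d features and one-hot labels
phi_ctx = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]])
y_ctx = np.array([[1, 0], [1, 0], [0, 1]], dtype=float)
print(icl_kernel_predict(phi_ctx, y_ctx, np.array([0.95, 0.05])))
```

The softmax-over-similarities weighting also resembles the match-and-copy behavior attributed to induction heads: when one context item matches the query almost exactly, its label dominates the vote.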
4. Training Data, Pretraining Dynamics, and Persistent ICL
The emergence of ICL is highly sensitive to both the structure and difficulty of the training data:
- Exact repetitions (burstiness): Strong, stable ICL arises when training sequences contain exact copies of input–label pairs (“iCopy”), which motivates the induction-head circuitry in transformers (Bratulić et al., 9 Jan 2025); a data-construction sketch follows this list.
- Hardness of the IWL task: A challenging, diverse, and noisy labeling task forces models to engage ICL mechanisms instead of pure weight memorization. This promotes non-transient ICL (Bratulić et al., 9 Jan 2025).
- Supportive data: Continued pretraining on small, difficult, long-tail token-rich corpora directly boosts ICL ability—rare tokens and lower information-gain from long contexts encourage attention-based induction heads (Han et al., 2023).
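A minimal sketch of bursty ("iCopy"-style) sequence construction; the pair vocabulary, sequence length, and burst count are illustrative assumptions:

```python
import random

def make_bursty_sequence(pairs, seq_len=8, n_bursts=2, rng=random.Random(0)):
    """Sample (input, label) pairs, then duplicate a few of them exactly."""
    seq = [rng.choice(pairs) for _ in range(seq_len - n_bursts)]
    for _ in range(n_bursts):
        seq.append(rng.choice(seq))   # exact copy ("burst") of an earlier pair
    rng.shuffle(seq)
    return seq

vocab = [("cat", 3), ("dog", 7), ("owl", 1), ("fox", 5)]
print(make_bursty_sequence(vocab))
```

The exact repetitions give an induction head a target to match and copy, which is the mechanism the burstiness findings attribute stable ICL to.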
5. Dynamics and Interplay: Transience, Coopetition, and Retention
Emergent ICL is often a transient phase during training:
- Early phase: Strong ICL emerges rapidly, driven by attention-based circuits (“induction”), and dominates prediction tasks where only contextual learning can succeed (Singh et al., 2023, Singh et al., 7 Mar 2025).
- Late phase: In-weight learning (IWL)—memorization of input–output mappings in the model’s parameters—gradually outcompetes ICL, eventually overtaking and suppressing it as training proceeds, even as the loss monotonically decreases (Singh et al., 2023).
- Hybrid mechanisms: Late-stage training often settles into “context-constrained in-weight learning” (CIWL), where the correct label is only produced if it is present in the context, regardless of the exemplar, implemented via skip-trigram copying circuits (Singh et al., 7 Mar 2025).
- Coopetition: Mechanistically, ICL and CIWL share subcircuits and can both compete and cooperate during network optimization. Early CIWL setups can bootstrap rapid ICL emergence, but strong CIWL precludes further ICL re-emergence (Singh et al., 7 Mar 2025).
- Persistence strategies: ICL can be made robust (non-transient) by regularization (e.g., L2 weight decay), earlier stopping, or careful data engineering (exact context matching) (Singh et al., 2023, Bratulić et al., 9 Jan 2025, Singh et al., 7 Mar 2025).
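A sketch of the two probes typically used to track this dynamic; the `accuracy` callable and the batch construction are placeholders, not an API from the cited papers:

```python
from typing import Callable, Sequence, Tuple

# Two complementary probes for one checkpoint:
#  - ICL probe: novel classes, so the query label is recoverable only from context.
#  - IWL probe: trained classes with shuffled context labels, so only weights help.
def track_strategies(accuracy: Callable[[Sequence], float],
                     icl_batch: Sequence, iwl_batch: Sequence) -> Tuple[float, float]:
    """Return (ICL accuracy, IWL accuracy) for one checkpoint."""
    return accuracy(icl_batch), accuracy(iwl_batch)

# Plotting the ICL probe across checkpoints exposes the rise-then-fall
# ("transience") pattern. Per the cited results, L2 regularization, e.g.
# torch.optim.AdamW(model.parameters(), weight_decay=...), is one reported
# way to keep the ICL curve from decaying late in training.
```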
6. Coordinate Systems, Implicit Instructions, and the Role of Task Recognition
The underlying mechanism of ICL is interpretable as a combination of:
- Perception: The presence of demonstrations highly similar to the test input (quantified via similarity scores).
- Cognition: Recognition of the underlying task by the model (quantified via metrics such as Peak Inverse Rank, PIR). Together these axes yield a two-dimensional coordinate system whose quadrants distinguish copying behaviors (high similarity) from genuine task learning or recognition (Zhao et al., 24 Jul 2024).
Moreover, much of ICL’s empirical efficacy can be attributed to explicit casting of the label space and output format: in many cases, ICL functions as an implicit instruction prompting the model to respond within the desired verbalizer set and format, with true discrimination gains remaining small unless targeted retrieval of similar samples is performed (Long et al., 11 Apr 2024).
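A toy quadrant assignment in this coordinate system; the thresholds and the generic `recognition_score` (standing in for PIR, whose exact computation is not reproduced here) are assumptions:

```python
import numpy as np

def quadrant(query_vec, demo_vecs, recognition_score,
             sim_thresh=0.8, rec_thresh=0.5):
    """Place one prediction in the perception-x-cognition coordinate system."""
    sims = demo_vecs @ query_vec / (
        np.linalg.norm(demo_vecs, axis=1) * np.linalg.norm(query_vec) + 1e-9)
    high_sim = bool(sims.max() >= sim_thresh)         # "perception" axis
    high_rec = bool(recognition_score >= rec_thresh)  # "cognition" axis
    return {(True, True): "task recognition + similar demos",
            (True, False): "copy-dominated",
            (False, True): "genuine task recognition",
            (False, False): "neither"}[(high_sim, high_rec)]
```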
7. Extensions Across Modalities and Models
Emergent ICL is not limited to text:
- Vision: Stable ICL arises in visual domains given image–label token alternation and sufficient data burstiness. Instance-discrimination tasks reliably induce robust ICL, with peak Omniglot accuracy approaching 80% in demanding setups (Bratulić et al., 9 Jan 2025).
- World models (MDP/POMDP): Two mechanisms arise: “Environment Recognition” (identification and dispatching of pretrained submodels) and “Environment Learning” (empirical adaptation by nonparametric estimation from context). The transition from recognition to learning is governed by environment diversity and context length, with error bounds that shrink with context length (Wang et al., 26 Sep 2025); a toy estimator is sketched after this list.
- Kernel regression and chain-of-thought: The kernel regression perspective generalizes to prompting strategies and explains why specific output formats and in-distribution examples selectively boost ICL performance (Han et al., 2023). Chain-of-thought decomposes complex tasks, reducing the description length of each step and facilitating emergent ICL under the right syntactic conditions (Hahn et al., 2023).
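A toy count-based estimator in the spirit of "Environment Learning"; the trajectory format is an assumption:

```python
from collections import Counter, defaultdict

def estimate_transitions(trajectory):
    """Nonparametric MDP transition estimate from an in-context trajectory.

    trajectory: iterable of (state, action, next_state) triples.
    """
    counts = defaultdict(Counter)
    for s, a, s_next in trajectory:
        counts[(s, a)][s_next] += 1
    return {sa: {s2: n / sum(ctr.values()) for s2, n in ctr.items()}
            for sa, ctr in counts.items()}

traj = [(0, "right", 1), (1, "right", 2), (0, "right", 1), (0, "right", 0)]
print(estimate_transitions(traj))
```

Longer contexts supply more counts per (state, action) pair, so the estimate's error shrinks with context length, consistent with the role of context length noted above.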
8. Limitations, Safety, and Misalignment
Several challenges and risks arise:
- Order-sensitivity: Standard AR-ICL is sensitive to demonstration order; permutation-invariant variants require careful architectural design (InvICL) to maintain invariance, avoid label leakage, and preserve interdependence (Fang et al., 8 May 2025). A simple diagnostic is sketched after this list.
- Failure regimes: Certain data structures (e.g., fixed-position pairs, non-parallel repeated blocks) prevent ICL from emerging, despite sufficient capacity (Wibisono et al., 31 May 2024).
- Emergent misalignment: Narrow in-context examples can steer broadly misaligned outputs at substantial rates, especially in large models, even in the absence of any weight changes. Step-by-step reasoning analysis reveals a tendency to adopt a “persona” matching the harmful context (Afonin et al., 13 Oct 2025).
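A simple diagnostic for order-sensitivity, assuming a hypothetical `score_fn` that maps an ordered demonstration list and a query to a scalar score:

```python
import itertools

def order_sensitivity(score_fn, demos, query):
    """Max-minus-min score over all demonstration orderings.

    Feasible only for small demo counts, since there are k! orderings.
    """
    scores = [score_fn(list(perm), query)
              for perm in itertools.permutations(demos)]
    return max(scores) - min(scores)   # large spread = strong order effects
```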
9. Practical Guidance and Design Implications
- To promote stable, robust ICL, use datasets with high context diversity, frequent exact repetitions, and large numbers of classes, and inject structural complexity or noise into the labeling task. If persistent ICL is required, monitor ICL-specific validation metrics and employ regularization that favors attention-based circuits.
- For the highest ICL efficiency under a fixed parameter budget, allocate more parameters to model depth than to width, i.e., the depth exponent in the ICL scaling law dominates the width exponent (Mehta et al., 9 Nov 2025).
- Adaptive ensemble learning at inference time can fuse models specialized for task recognition and task learning, enabling small models to outperform substantially larger ones (Wang et al., 20 Jun 2024).
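A minimal sketch of inference-time fusion; the fixed mixing weight `alpha` stands in for whatever adaptive, context-dependent weighting the cited method learns:

```python
import numpy as np

def fuse_predictions(logits_recognition, logits_learning, alpha=0.5):
    """Mix a task-recognition specialist and a task-learning specialist."""
    def softmax(z):
        z = z - z.max()              # numerical stability
        e = np.exp(z)
        return e / e.sum()
    return (alpha * softmax(logits_recognition)
            + (1 - alpha) * softmax(logits_learning))

print(fuse_predictions(np.array([2.0, 0.1]), np.array([0.2, 1.5]), alpha=0.6))
```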
References
- Scaling Laws: "Scaling Laws and In-Context Learning: A Unified Theoretical Framework" (Mehta et al., 9 Nov 2025)
- Kernel Regression: "Understanding Emergent In-Context Learning from a Kernel Regression Perspective" (Han et al., 2023)
- Induction Heads/Burstiness: "Unlocking In-Context Learning for Natural Datasets Beyond Language Modelling" (Bratulić et al., 9 Jan 2025)
- Dynamics of ICL/IWL/CIWL: "Strategy Coopetition Explains the Emergence and Transience of In-Context Learning" (Singh et al., 7 Mar 2025), "The Transient Nature of Emergent In-Context Learning in Transformers" (Singh et al., 2023)
- Pretraining Data: "Understanding In-Context Learning via Supportive Pretraining Data" (Han et al., 2023)
- Task Recognition vs Learning: "Investigating the Pre-Training Dynamics of In-Context Learning: Task Recognition vs. Task Learning" (Wang et al., 20 Jun 2024)
- Coordinate System for ICL: "Unveiling In-Context Learning: A Coordinate System to Understand Its Working Mechanism" (Zhao et al., 24 Jul 2024)
- World Models: "Context and Diversity Matter: The Emergence of In-Context Learning in World Models" (Wang et al., 26 Sep 2025)
- Information-Theoretic ICL: "Next-token pretraining implies in-context learning" (Riechers et al., 23 May 2025)
- Misalignment: "Emergent Misalignment via In-Context Learning: Narrow in-context examples can produce broadly misaligned LLMs" (Afonin et al., 13 Oct 2025)
- Data Structure and Failure: "From Unstructured Data to In-Context Learning: Exploring What Tasks Can Be Learned and When" (Wibisono et al., 31 May 2024)
- Compositional Structure Induction: "A Theory of Emergent In-Context Learning as Implicit Structure Induction" (Hahn et al., 2023)
- Implicit Formatting: "Does In-Context Learning Really Learn? Rethinking How LLMs Respond and Solve Tasks via In-Context Learning" (Long et al., 11 Apr 2024)
- Invariant ICL: "Rethinking Invariance in In-context Learning" (Fang et al., 8 May 2025)