
Interleaved Concept Learning (ICL)

Updated 29 March 2026
  • Interleaved Concept Learning (ICL) is a framework where LLMs use in-context demonstrations to perform both skill recognition and novel function learning without gradient updates.
  • It leverages Bayesian inference and optimization via self-attention, enabling the model to infer underlying generative processes from provided examples.
  • ICL finds practical use in tasks like sentiment analysis and synthetic regression, balancing robust pattern matching with adaptive function fitting.

Interleaved Concept Learning (ICL), more commonly termed In-Context Learning, refers to the capacity of large autoregressive language models (LLMs) to generalize from a small set of (input, label) demonstrations supplied within a prompt, without requiring gradient-based parameter updates. Underpinning ICL is the notion that the LLM, through its forward pass, uses the prompt as evidence for an underlying text-generation mechanism—termed a "skill"—and continues generation in accordance with this inferred mechanism. Two distinct processes emerge: skill recognition, where the model identifies a previously learned concept from pre-training, and skill learning, where it induces a novel generative process from in-context examples. Both processes can be unified under a data generation perspective, wherein the LLM is effectively inferring the generator that best explains the prompt (Mao et al., 2024).

1. Formal Framework of In-Context Learning

Let $S_n = [(x_1, y_1), \ldots, (x_n, y_n)]$ denote a sequence of in-context demonstrations, and let $x_{\text{test}}$ denote a new input. A "skill" is formalized as a data generation function $g$, mapping input tokens to label sequences. Pre-training yields a family $\Theta = \{\theta\}$ of such concepts, each corresponding to a probability model $p_\theta$ (e.g., an HMM transition matrix). Conversely, tasks not seen during pre-training are modeled via a richer class $\mathcal{F}$, such as linear regressors or decision trees.

  • Skill Recognition: Bayesian inference over $\Theta$ yields the posterior:

p(\theta \mid S_n) \propto p(S_n \mid \theta)\, p(\theta)

and the predictive distribution:

p(y_{\text{test}} \mid S_n, x_{\text{test}}) = \int_\Theta p(y_{\text{test}} \mid S_n, x_{\text{test}}, \theta)\, p(\theta \mid S_n)\, d\theta

  • Skill Learning: For a new function $f^* \notin \Theta$, ICL "fits" $f^*$ to minimize prediction error on $S_n$. Using the forward-pass predictor $T_\omega$ (parameters $\omega$), the meta-learning objective is:

\min_{\omega} \mathbb{E}_{f \sim \mathcal{F}}\, \mathbb{E}_{x_1, \ldots, x_n \sim \mathcal{P}_X} \left[ \sum_{i=2}^n \mathcal{L}\big(f(x_i),\, T_\omega([x_1, f(x_1), \ldots, x_{i-1}, f(x_{i-1}), x_i])\big) \right]
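The objective above can be estimated numerically with a toy Monte-Carlo sketch. As simplifying assumptions, $\mathcal{F}$ is taken to be noiseless scalar linear functions, $\mathcal{L}$ is squared error, and a closed-form least-squares fit on the context stands in for the transformer predictor $T_\omega$; the names `T_omega` and `meta_objective` are illustrative, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

def T_omega(prompt_xs, prompt_ys, x_query):
    """Stand-in for the forward-pass predictor: least-squares fit on the context."""
    X = np.asarray(prompt_xs).reshape(-1, 1)
    y = np.asarray(prompt_ys)
    w = np.linalg.lstsq(X, y, rcond=None)[0]
    return float(x_query * w[0])

def meta_objective(n_tasks=100, n_demos=8):
    """Monte-Carlo estimate of E_f E_x [ sum_{i>=2} L(f(x_i), T(prompt_i)) ]."""
    total = 0.0
    for _ in range(n_tasks):
        a = rng.normal()              # draw f(x) = a * x from the linear class F
        xs = rng.normal(size=n_demos)
        ys = a * xs
        for i in range(1, n_demos):   # predict y_i from the first i demonstrations
            pred = T_omega(xs[:i], ys[:i], xs[i])
            total += (ys[i] - pred) ** 2
    return total / n_tasks

print(meta_objective())  # ~0: a noiseless linear context pins down f exactly
```

Because the context determines $f$ exactly in this idealized setting, the ideal predictor drives the objective to (numerically) zero; a real transformer trained on this objective only approximates that behavior.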

At inference, the LLM’s self-attention forward pass implements, without gradient updates, a marginalization or optimization procedure that approximately realizes the posterior or the empirical minimizer (Mao et al., 2024).
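The skill-recognition computation can be sketched concretely as discrete Bayesian model selection. The sketch below assumes a toy concept family of two hypothetical binary-label generators ("identity" and "negation"); it computes the posterior $p(\theta \mid S_n)$ and the marginalized predictive distribution from the equations above.

```python
import numpy as np

# Each concept theta gives p(y = 1 | x) for binary x; both concepts are
# illustrative stand-ins for pre-trained generators.
concepts = {
    "identity": lambda x: 0.9 if x == 1 else 0.1,   # y tends to copy x
    "negation": lambda x: 0.1 if x == 1 else 0.9,   # y tends to flip x
}
prior = {"identity": 0.5, "negation": 0.5}

demos = [(1, 1), (0, 0), (1, 1)]  # S_n: (x, y) pairs generated by "identity"

# p(theta | S_n) ∝ p(S_n | theta) p(theta)
post = {}
for name, p1 in concepts.items():
    lik = 1.0
    for x, y in demos:
        lik *= p1(x) if y == 1 else 1 - p1(x)
    post[name] = lik * prior[name]
Z = sum(post.values())
post = {k: v / Z for k, v in post.items()}

# Predictive p(y_test = 1 | S_n, x_test) marginalizes over concepts.
x_test = 1
p_y1 = sum(post[name] * concepts[name](x_test) for name in concepts)
print(post)             # posterior mass concentrates on "identity"
print(round(p_y1, 3))
```

The posterior concentrates on the concept that generated the demonstrations, and the predictive distribution inherits that concept's label statistics—the discrete analogue of what the forward pass is argued to approximate.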

2. The Data Generation Perspective

ICL and LLM pre-training are unified through the lens of token sequence modeling. Pre-training fits the model to $p_{\text{data}}(t_1, \ldots, t_T)$ via autoregressive next-token prediction, approximating a mixture over latent concepts $\theta$. In context, the prompt $[\tau, x_{\text{test}}]$ serves as evidence for a latent generator, where

p(y \mid \tau) = \prod_{i=1}^{|y|} p(y_i \mid \tau, y_{<i})

Bayesian recognition expands this as an integral over concepts, whereas function learning interprets $p(f \mid S_n) \propto \exp\!\left(-\sum_i \mathcal{L}(f(x_i), y_i)\right)$. The self-attention weights act as the computational substrate for the required marginalization or optimization (Mao et al., 2024).
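The autoregressive factorization of $p(y \mid \tau)$ can be sketched numerically with a toy conditional model; `toy_conditional` below is a hypothetical stand-in for the LLM's next-token distribution, chosen only to make the chain-rule product concrete.

```python
import math

def toy_conditional(token, context, vocab=("a", "b")):
    """Hypothetical p(token | context): favors repeating the last token."""
    if not context:
        return 1.0 / len(vocab)
    return 0.8 if token == context[-1] else 0.2 / (len(vocab) - 1)

def sequence_logprob(y, tau):
    """log p(y | tau) as a sum of per-token conditional log-probs."""
    logp, context = 0.0, list(tau)
    for tok in y:
        logp += math.log(toy_conditional(tok, context))
        context.append(tok)
    return logp

tau = ["a", "a"]                           # prompt (demonstrations + query)
print(sequence_logprob(["a", "a"], tau))   # log(0.8) + log(0.8)
```

Any generator inferred from the prompt—recognized or learned—ultimately manifests through exactly this product of conditionals.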

3. Theoretical Underpinnings: Algorithms and Equations

Skill recognition employs discrete Bayesian selection among known concepts. Skill learning’s mechanism, especially in linearized transformer settings, has been rigorously connected to gradient-based learning algorithms:

  • Single-Layer Attention as Optimization: For linearized transformers, a single self-attention layer with pre-trained weights $W_K, W_Q, W_V$ realizes:

T(x_{\text{test}}) = W_0 x_{\text{test}} + \sum_{i=1}^n e_i \, \big(x'_i{}^{\mathsf{T}} x_{\text{test}}\big) = W_0 x_{\text{test}} - \eta \nabla_W \mathrm{Loss}(W; x_i, y_i)

  • Multi-Layer Generalizations: Multi-layer attention architectures approximate ridge regression or higher-order optimizers. In the infinite-width limit, this architecture converges to the Bayes-optimal estimator for the linear-Gaussian function class, thus simulating closed-form least-squares fitting (Bai et al., 2023).

This indicates that the LLM forward pass can perform either marginalization (for recognition) or direct optimization (for learning) within a unified computational paradigm (Mao et al., 2024).
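The attention-as-gradient-step identity can be checked numerically. The sketch below is a minimal construction in the spirit of linear-attention analyses (not the paper's exact weights): with $W_0 = 0$, queries $q = x_{\text{test}}$, keys $k_i = x_i$, and values $v_i = \eta\, y_i$, unnormalized (softmax-free) attention reproduces the prediction after one gradient-descent step on the in-context least-squares loss.

```python
import numpy as np

rng = np.random.default_rng(1)
d, n = 3, 16
w_true = rng.normal(size=d)
X = rng.normal(size=(n, d))          # in-context inputs x_i
y = X @ w_true                       # labels f(x_i)
x_test = rng.normal(size=d)
eta = 0.1

# One GD step from W = 0 on Loss(W) = 0.5 * sum_i (y_i - W . x_i)^2.
grad = -(y - X @ np.zeros(d)) @ X    # dLoss/dW evaluated at W = 0
w_one_step = np.zeros(d) - eta * grad
pred_gd = w_one_step @ x_test

# Linear attention: scores are raw dot products (no softmax), so the
# layer is exactly linear in its values.
scores = X @ x_test                  # k_i . q
pred_attn = (eta * y) @ scores       # sum_i v_i * (k_i . q)

print(np.isclose(pred_gd, pred_attn))  # True: the two computations coincide
```

The equality holds by associativity of the matrix products; multi-layer stacks extend this to several optimization steps, as noted above.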

4. Illustrative Applications

Concrete examples clarify the distinction between ICL’s dual regimes:

  • Sentiment Classification (Skill Recognition) — Prompt: "Example 1: 'This movie was delightful.' → positive … Query: …"
  • Synthetic Regression (Skill Learning) — Prompt: pairs $(x_i, y_i)$ drawn from $y = 3x + 2$; the model fits a least-squares mapping.

In sentiment classification, the model matches examples to an HMM-concept $\theta^\ast$ determined by token transition statistics, then outputs labels according to $p_{\theta^\ast}$. In regression, if given novel pairs from a linear function, the model's architecture simulates two steps of gradient descent or matrix inversion to infer and apply the new function (Mao et al., 2024).
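The regression example can be reproduced with an explicit closed-form least-squares fit—the computation the forward pass is argued to approximate. The demonstration values below are illustrative inputs, not taken from the paper.

```python
import numpy as np

xs = np.array([-1.0, 0.0, 1.0, 2.0])
ys = 3 * xs + 2                                # in-context pairs from y = 3x + 2

A = np.column_stack([xs, np.ones_like(xs)])    # design matrix [x, 1]
slope, intercept = np.linalg.lstsq(A, ys, rcond=None)[0]

x_query = 5.0
print(round(slope, 3), round(intercept, 3))    # 3.0 2.0
print(round(slope * x_query + intercept, 2))   # 17.0
```

Four noiseless pairs over-determine the two parameters, so the fit recovers the generating function exactly and extrapolates correctly to the query.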

5. Advantages and Limitations

The strengths and weaknesses of ICL’s two modalities reveal the trade-offs intrinsic to LLM prompt-based adaptation:

  • Skill Recognition — Strengths: robust to mis-specified demonstrations; insensitive to adversarial noise. Weaknesses: cannot override pre-training priors; inflexible for "specification-heavy" tasks.
  • Skill Learning — Strengths: can induce genuinely novel mappings; enables on-the-fly correction/"editing". Weaknesses: susceptible to noisy context; fails if the task lies outside the pre-training class $\mathcal{F}$.

A key insight is that, in both modes, the LLM is engaging in Bayesian reasoning over latent generative processes—whether selecting a discrete $\theta \in \Theta$ or regressing to a continuous $f^* \in \mathcal{F}$—using fundamentally the same autoregressive, self-attention machinery (Mao et al., 2024).

6. Open Questions and Future Research

Several foundational questions remain unresolved:

  • Origins of Skill Learning: The emergence of skill learning is observed only at large model scale and with skewed, bursty pre-training distributions. The specific causal factors, including architectural motifs like induction head circuits (Olsson et al., 2022) or the impact of distributional Zipf laws, are not yet fully understood.
  • Limits Beyond Pre-training: Empirical results indicate LLMs only generalize to new $f^*$ within the span of the pre-training class $\mathcal{F}$, suggesting fundamental constraints on out-of-class generator inference. The extent to which pre-training can be designed to promote broader, more "imaginative" function learning remains an open research direction.
  • Extensions of the Data Generation Lens: Analogous latent-variable and function-learning frameworks may apply to other emergent LLM abilities such as chain-of-thought reasoning, self-correction, or tool use—interpreting these as multi-step or interactive data-generation inference. Early work (Prystawski & Goodman, 2023) frames chain-of-thought as latent proof-tree generation, though comprehensive formalization is outstanding.

Understanding the precise mechanisms, limits, and optimal regimes for skill recognition and learning, and developing methods for dynamically blending these modes at inference time, remain central to advancing the theoretical and empirical understanding of ICL in LLMs (Mao et al., 2024).
