
In-Context Learning and Representations

Updated 12 February 2026
  • In-context learning with representations is a paradigm in which transformer models condition on demonstration pairs whose outputs are mapped, via an explicit representation function, into diverse representational spaces (tokens, vectors, or images) to perform tasks at inference time.
  • Empirical findings show that the zero-shot ranking of representations is preserved as demonstrations are added, with strong Spearman correlations (ρ ≈ 0.8–0.95) between zero-shot accuracy and k-shot performance; representations of moderate quality see the steepest gains from additional demonstrations.
  • The research highlights transformer mechanisms such as hidden state convergence, attention-based task vectors, and spectral embeddings that underpin robust, generalizable in-context learning.

In-context learning (ICL) with representations encompasses the mechanisms by which large models, particularly transformers, learn to perform new tasks at inference time by conditioning on demonstrations, i.e., examples given as part of the input context. The manner and structure of the provided input-output pairs, and the internal representations they induce, are central to both the performance and the limitations of ICL. This research area covers (i) how the representation or format of demonstrations—whether class labels, continuous vectors, or multimodal constructs—affects both baseline accuracy and the subsequent capacity for learning from additional examples; (ii) the theoretical and mechanistic underpinnings of how transformers process, encode, and update such representations over context and depth; and (iii) the empirical workflows and trade-offs in prompt design, modality integration, and model selection for achieving robust, generalizable in-context learning across domains.

1. Problem Definition and Representation Formalism

In standard ICL settings, a model is presented with $k$ demonstration pairs $\{(x_i, y_i)\}_{i=1}^{k}$ followed by a query $x_*$, with the objective of predicting the correct output $y_*$ associated with $x_*$. A key innovation is to explicitly parametrize the representation function $R$ mapping labels (or, more generally, outputs) to a vocabulary, embedding, or other representational space: $R: \mathcal{Y} \rightarrow \mathcal{V}$, where $\mathcal{Y}$ is the label set and $\mathcal{V}$ is the space of possible representations (tokens, vectors, images, etc.) (Marinescu et al., 9 Oct 2025). The ICL prompt thus consists of the context $D_R(k) = \{(x_1, R(y_1)), \ldots, (x_k, R(y_k))\}$ and the query $x_*$. Task performance is measured by the accuracy or loss statistic $f(R, k)$, with $A_0(R) = f(R, 0)$ the zero-shot baseline accuracy for representation $R$.
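
To make the formalism concrete, the following minimal Python sketch builds an ICL prompt $D_R(k)$ from demonstration pairs under an explicit representation function $R$. The helper names (`build_prompt`, `R_semantic`, `R_arbitrary`) and the serialization format are illustrative assumptions, not taken from any cited paper.

```python
# Minimal sketch of the ICL prompt formalism: a representation function R maps
# labels into a chosen token space before demonstrations are serialized.
# Names and prompt format are illustrative, not from the cited work.

from typing import Callable, Sequence, Tuple

def build_prompt(demos: Sequence[Tuple[str, str]],
                 query: str,
                 R: Callable[[str], str]) -> str:
    """Serialize D_R(k) = {(x_i, R(y_i))} followed by the query x_*."""
    lines = [f"Input: {x}\nOutput: {R(y)}" for x, y in demos]
    lines.append(f"Input: {query}\nOutput:")
    return "\n\n".join(lines)

# A semantically faithful representation vs. an arbitrary one.
R_semantic = {"pos": "positive", "neg": "negative"}.get
R_arbitrary = {"pos": "blue", "neg": "red"}.get

demos = [("great movie", "pos"), ("terrible plot", "neg")]
print(build_prompt(demos, "loved it", R_semantic))
```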

Beyond text-based representations, approaches such as Vector-ICL and In-Context Representation Learning (ICRL) extend this framework to continuous vector inputs (Zhuang et al., 2024, Zhang et al., 22 Sep 2025). These methods leverage projectors or alignment mechanisms to map arbitrary continuous features into the model’s embedding space, enabling the LLM to perform ICL on domains such as molecules, time-series, or vision.
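
As a rough illustration of the projector idea (not the exact architecture of either cited method), the sketch below maps pretrained-encoder features into an assumed LLM embedding dimension with a single trainable linear layer; the dimensions are placeholders.

```python
# Hedged sketch of a Vector-ICL-style projector: a small trainable linear map
# takes encoder features (e.g., a molecule or time-series embedding) into the
# LLM's token-embedding dimension so they can be spliced into the prompt.
# Dimensions and the single-linear-layer choice are illustrative assumptions.

import torch
import torch.nn as nn

class FeatureProjector(nn.Module):
    def __init__(self, feat_dim: int = 256, llm_embed_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(feat_dim, llm_embed_dim)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (num_items, feat_dim) -> (num_items, llm_embed_dim),
        # ready to be interleaved with ordinary token embeddings.
        return self.proj(feats)

projector = FeatureProjector()
pseudo_tokens = projector(torch.randn(4, 256))  # 4 continuous "demonstration" vectors
print(pseudo_tokens.shape)                      # torch.Size([4, 4096])
```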

2. Orthogonality of Representation and Learning in ICL

A central result is that the effect of representation (i.e., the choice of label or demonstration format) and the effect of learning from additional demonstrations are largely orthogonal. For fixed representation $R$, increasing the number of demonstrations $k$ reliably lifts performance $f(R,k)$ above baseline $A_0(R)$—but crucially, the ranking of representations established by their zero-shot accuracy is preserved at all $k$ (Marinescu et al., 9 Oct 2025). This is formalized by the decomposition

$$p(R(y) \mid x, D_R) \propto q(R(y) \mid x, D_R) \cdot p(R(y) \mid x)$$

where $q(\cdot)$ captures the effect of demonstration-based learning (largely invariant to $R$), and $p(\cdot)$ encodes the model’s prior induced by pretraining, which is sensitive to the semantic faithfulness of $R$.

Empirically, representations with extremely low $A_0(R)$ (e.g., meaningless token choices) yield flat $f(R,k)$ curves, manifested as an inability to exploit in-context learning. Representations of moderate $A_0(R)$ see the steepest gains with additional demonstrations. High-quality, semantically aligned representations (high $A_0(R)$) provide strong initial performance but demonstrate saturation, as the upper bound is set by the prior (Marinescu et al., 9 Oct 2025). Spearman correlations of $\rho \approx 0.8$–$0.95$ between $A_0(R)$ and $f(R,k)$ across all $k$ substantiate the orthogonality hypothesis.
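
A minimal way to operationalize this orthogonality check is to score each candidate representation at $k=0$ and at a fixed $k$, then compute Spearman's ρ over the two rankings, as sketched below. The accuracy numbers are placeholders rather than reported results.

```python
# Sketch of the orthogonality check: for each candidate representation R we
# record zero-shot accuracy A0(R) and k-shot accuracy f(R, k), then test
# whether the zero-shot ranking is preserved via Spearman's rho.
# The accuracy values below are placeholders, not results from the paper.

from scipy.stats import spearmanr

# representation name -> (A0(R), f(R, k=16))
results = {
    "semantic_labels":  (0.78, 0.86),
    "synonym_labels":   (0.70, 0.83),
    "abstract_symbols": (0.52, 0.74),
    "random_tokens":    (0.34, 0.38),
}

a0 = [v[0] for v in results.values()]
fk = [v[1] for v in results.values()]
rho, pval = spearmanr(a0, fk)
print(f"Spearman rho between A0(R) and f(R,16): {rho:.2f} (p={pval:.3f})")
```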

3. Architectures and Mechanisms for Representation Manipulation

Transformers support ICL via several distinct but interoperable representation mechanisms:

  • Hidden state dynamics and double convergence: Internal activations $\mathbf{v}_k^{(\ell)}$ at each position $k$ and layer $\ell$ converge, as context length increases, to latent representations $\mathbf{z}_{x_k}^{(\ell)}$ specific to each distinct token or concept (Yang et al., 17 Jul 2025). Across layers, these latents are repeatedly projected towards the low-frequency eigenspace of the data-generating process’s structure (e.g., a graph’s Laplacian), yielding an implicit bias toward globally coherent representations and robustness to high-frequency perturbations (Yang et al., 17 Jul 2025).
  • Task vectors from attention head activations: The effective “task” encoded by a set of in-context examples can be recovered as a weighted sum of attention head activations at specific positions, with learnable weights over heads capturing the task-relevant directions (Saglam et al., 8 Feb 2025). Training only these head weights (task-vector interventions) can restore or steer ICL in settings where base model ICL has been blocked, such as overlong prompts; a minimal readout sketch appears after this list.
  • Layerwise separation of representation and learning: Empirical studies demonstrate that trained transformers organize their computation so that lower layers extract (potentially nonlinear) representations $f(x)$ and upper layers implement linear in-context learning (e.g., ridge regression) over those representations, as revealed by probing and “pasting” experiments (Guo et al., 2023).
  • Memory and non-parametric methods in vision and multi-modal ICL: Scene understanding in vision can be realized by retrieving labels for query features from a memory of prompt features using cross-image or spatial-attention contextualized representations, with nearest-neighbor decoding providing high accuracy across dense vision tasks (Balažević et al., 2023). Aggregated-image (I²L) techniques for multimodal models construct composite pixel-space representations combining demonstrations and queries to leverage vision models’ capability for in-context learning (Wang et al., 2024).
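
Returning to the task-vector mechanism above, the following sketch shows one way such a readout could be parameterized: a learnable weight per attention head combines per-head activations at a chosen in-context position into a single steering vector. The shapes, the softmax weighting, and the injection point are assumptions for illustration, not the exact procedure of the cited work.

```python
# Hypothetical task-vector readout: combine per-head activations collected at a
# chosen in-context position into a single vector via learnable head weights.
# Shapes and the softmax weighting are illustrative assumptions.

import torch
import torch.nn as nn

class TaskVectorReadout(nn.Module):
    def __init__(self, num_heads: int):
        super().__init__()
        self.head_weights = nn.Parameter(torch.zeros(num_heads))  # one weight per head

    def forward(self, head_acts: torch.Tensor) -> torch.Tensor:
        # head_acts: (num_heads, head_dim) activations at the readout position.
        w = torch.softmax(self.head_weights, dim=0)
        return (w[:, None] * head_acts).sum(dim=0)                 # (head_dim,)

readout = TaskVectorReadout(num_heads=32)
task_vector = readout(torch.randn(32, 128))
# A steering intervention would add this vector (suitably projected) to the
# query's residual stream; only `head_weights` would be trained.
```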

4. Theoretical Foundations and Proofs of Transformer Generalization

Provable analyses of shallow (and to some extent, deep) transformer architectures elucidate how they perform contextual regression or learning with unknown representations. For regression tasks with inputs $x_k$ mapped via unknown $f(x_k)$, transformers trained by gradient descent converge linearly to globally optimal solutions, and the learned in-context mapping implements ridge regression over the basis induced by $f$ (Yang et al., 2024). Multi-head attention is theoretically necessary for memorizing and inverting the kernel block formed by demonstrations (Yang et al., 2024). This formalizes the capacity for transformers to “contextually generalize” to unseen queries or even new underlying functions, provided their basis is encoded in the architecture.
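
The theoretical picture can be illustrated with a small worked example: ridge regression in the basis induced by a feature map $f$ produces the predictor the trained transformer is shown to emulate. The specific $f$, dimensions, and regularization strength below are arbitrary choices for illustration.

```python
# Worked sketch: the in-context predictor behaves like ridge regression over
# features f(x) extracted from the demonstrations. The feature map f here is
# a fixed, illustrative nonlinear basis.

import numpy as np

def f(x: np.ndarray) -> np.ndarray:
    # Illustrative basis; the theory treats f as unknown to the learner but
    # representable by the transformer's lower layers.
    return np.concatenate([x, np.sin(x), x**2], axis=-1)

rng = np.random.default_rng(0)
X = rng.normal(size=(16, 4))                  # k = 16 demonstrations, d = 4
w_true = rng.normal(size=f(X).shape[-1])
y = f(X) @ w_true + 0.05 * rng.normal(size=16)

lam = 0.1
Phi = f(X)                                    # (k, p) design matrix in the induced basis
w_ridge = np.linalg.solve(Phi.T @ Phi + lam * np.eye(Phi.shape[1]), Phi.T @ y)

x_query = rng.normal(size=(1, 4))
y_hat = f(x_query) @ w_ridge                  # the prediction ICL is shown to emulate
print(y_hat.item())
```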

Spectral and energy-based analyses further demonstrate that transformer representations settle into minimum Dirichlet energy configurations with respect to the latent data geometry induced by the prompt context—e.g., reconstructing the underlying topology of random walks on a graph (Park et al., 2024, Yang et al., 17 Jul 2025). Minimizing this energy yields spectral (Laplacian) embeddings so that the first principal components of the hidden representations align with the structure of the context (Park et al., 2024).
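
A compact numerical illustration of this spectral picture: for a context whose latent structure is a graph (a ring graph is used below purely as a stand-in for the topology induced by a random-walk context), the minimum-Dirichlet-energy embeddings are the low-frequency eigenvectors of the graph Laplacian.

```python
# Sketch of the spectral claim: low-frequency Laplacian eigenvectors minimize
# Dirichlet energy over the context-induced graph. The ring graph is an
# illustrative stand-in for the latent topology of a random-walk context.

import numpy as np

n = 12
A = np.zeros((n, n))
for i in range(n):                            # ring-graph adjacency
    A[i, (i + 1) % n] = A[(i + 1) % n, i] = 1

D = np.diag(A.sum(axis=1))
L = D - A                                     # graph Laplacian

eigvals, eigvecs = np.linalg.eigh(L)
embedding = eigvecs[:, 1:3]                   # two lowest nontrivial eigenvectors

# Dirichlet energy of the embedding: tr(V^T L V), minimized by these eigenvectors.
energy = np.trace(embedding.T @ L @ embedding)
print(eigvals[:4], energy)
```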

5. Empirical Probing and Metrics for Representation Quality

To quantify and dissect representation learning in context, several metrics and analysis techniques are employed:

  • Representational Similarity Analysis (RSA): By comparing the pairwise geometry of prompt-induced embeddings to a hypothesis matrix encoding task-relevant structure, RSA empirically tracks how internal representations reflect the target relation (Yousefi et al., 2023). Increases in RSA correlation with additional ICL examples signify that the model encodes task structure more faithfully in its activations; a minimal computation is sketched after this list.
  • Peak Inverse Rank (PIR): PIR scores, based on the ranks of task-representative tokens in the hidden state logit distributions, quantify the model’s internal task recognition capability. High PIR indicates the model has internally aligned its representations to match the expected task, even before prediction (Zhao et al., 2024).
  • Attention ratio metrics: Comparing attention allocations to relevant versus irrelevant prompt tokens, these metrics reveal that in-context demonstrations shift attention toward informative cues, with higher ratios linked to better behavioral accuracy (Yousefi et al., 2023).
  • Empirical scaling laws: ICL performance curves $f(R,k)$ as a function of the number of demonstrations, and their slopes $S(R)$, provide a practical measure of how demonstration count and representation interact in determining learning efficiency (Marinescu et al., 9 Oct 2025, Zhuang et al., 2024).
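
As referenced in the RSA item above, the following sketch computes an RSA score by correlating a model representational dissimilarity matrix with a label-based hypothesis matrix. The random activations, the label structure, and the distance metrics are placeholder choices, not those of the cited study.

```python
# Sketch of RSA as used to probe ICL: correlate the pairwise-dissimilarity
# geometry of prompt-induced hidden states with a hypothesis matrix encoding
# task-relevant structure. All data below are placeholders.

import numpy as np
from scipy.stats import spearmanr
from scipy.spatial.distance import pdist

rng = np.random.default_rng(0)
hidden_states = rng.normal(size=(10, 64))     # 10 items x 64-dim activations
labels = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

# Model RDM: pairwise correlation distances between hidden states (condensed form).
model_rdm = pdist(hidden_states, metric="correlation")
# Hypothesis RDM: 0 if two items share a label, 1 otherwise.
hypothesis_rdm = pdist(labels[:, None], metric="hamming")

rsa_score, _ = spearmanr(model_rdm, hypothesis_rdm)
print(f"RSA (Spearman) correlation: {rsa_score:.2f}")
```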

6. Extensions to Continuous, Multimodal, and Semi-Supervised Settings

Recent advances generalize ICL with representations to new data modalities:

  • Vector-ICL and ICRL paradigms process continuous vector representations produced by pretrained encoders (e.g., for molecules, graphs, fMRI) by projecting them into the LLM’s embedding space via lightweight projectors trained on general or task-specific objectives (Zhuang et al., 2024, Zhang et al., 22 Sep 2025). Training-free alignment methods such as optimal-transport projections are effective for on-the-fly adaptation.
  • Implicit in-context learning (I2CL): Rather than concatenating demonstrations in token space, representations of demonstrations are aggregated into a compact context vector and injected into the residual streams of the query at inference time (Li et al., 2024). This yields few-shot performance at zero-shot computational cost, and scalar task representations provide a task-identity that can support transfer.
  • Semi-supervised in-context learning: Transformers can leverage unlabeled context to induce geometry-aware features via in-context Laplacian eigenmap computation, enabling robust label propagation in few-label regimes, as formalized in the IC-SSL framework (Fan et al., 17 Dec 2025). The model performs spectral embedding and subsequent kernel-based inference entirely in context, with forward passes emulating block-power iterations and functional gradient descent in RKHS.
  • Vision, scene understanding, and multimodal ICL: Memory-augmented vision transformers produce spatially coherent representations amenable to dense prompt-based adaptation via nearest-neighbor retrieval, sidestepping explicit parametric decoders (Balažević et al., 2023); a minimal retrieval sketch follows this list. Aggregated-image ICL for multimodal models (I²L) encodes demonstrations and queries as composite images, fully exploiting the vision front-end of large multimodal models (Wang et al., 2024).
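
As referenced in the vision item above, non-parametric prompt-based decoding can be sketched as a nearest-neighbour lookup: query features retrieve labels from a memory of demonstration features by cosine similarity and majority vote. Feature dimensions, the similarity measure, and the voting rule are illustrative assumptions rather than the exact mechanism of the cited work.

```python
# Hedged sketch of non-parametric prompt-based decoding: query features
# retrieve labels from a memory of prompt (demonstration) features via
# nearest-neighbour lookup, with no parametric decoder.

import torch
import torch.nn.functional as F

def nn_retrieve_labels(query_feats: torch.Tensor,
                       memory_feats: torch.Tensor,
                       memory_labels: torch.Tensor,
                       k: int = 5) -> torch.Tensor:
    """query_feats: (Q, d); memory_feats: (M, d); memory_labels: (M,) ints."""
    q = F.normalize(query_feats, dim=-1)
    m = F.normalize(memory_feats, dim=-1)
    sims = q @ m.T                                   # (Q, M) cosine similarities
    topk = sims.topk(k, dim=-1).indices              # (Q, k) nearest memory slots
    votes = memory_labels[topk]                      # (Q, k) retrieved labels
    return votes.mode(dim=-1).values                 # majority vote per query

memory_feats = torch.randn(100, 32)
memory_labels = torch.randint(0, 3, (100,))
preds = nn_retrieve_labels(torch.randn(8, 32), memory_feats, memory_labels)
print(preds)
```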

7. Practical Guidelines, Limitations, and Open Directions

From these findings, practical guidelines emerge for effective in-context learning with representations:

  • Representation selection should precede demonstration scaling: optimizing class names or label tokens on a small proxy set significantly lifts attainable ICL accuracy (Marinescu et al., 9 Oct 2025). Weak representations cannot be “rescued” by additional demonstrations alone; see the selection sketch after this list.
  • Scaling demonstration count yields the largest marginal gains when the representation is of moderate quality; high-quality representations rapidly saturate, while poor ones show limited to no improvement (Marinescu et al., 9 Oct 2025, Zhuang et al., 2024).
  • Model size matters: Larger models leverage demonstration-induced learning more steeply and exhibit greater robustness to less semantically meaningful representations (Marinescu et al., 9 Oct 2025, Zhuang et al., 2024). For small models, careful representation selection is critical.
  • Task recognition and retrieval modes: High PIR tasks permit easy generalization from dissimilar demonstrations; if PIR is low, only similar-example retrieval (perceptual similarity) aids ICL, and otherwise performance may degrade due to position or copying biases (Zhao et al., 2024).
  • Energy minimization dynamics explain why, beyond ICL performance, representations self-organize according to latent task structure, with double-convergence yielding a bias towards smooth, robust, and globally consistent embeddings (Yang et al., 17 Jul 2025, Park et al., 2024).
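
As referenced in the first guideline, representation selection can be operationalized as a small search over candidate label mappings scored by zero-shot accuracy on a proxy set, before any demonstration scaling. The function `zero_shot_accuracy` below is a placeholder for whatever model and evaluation harness is in use.

```python
# Sketch of the "representation selection first" guideline: score each candidate
# label representation by zero-shot accuracy A0(R) on a small proxy set, keep the
# best, and only then scale the demonstration count. `zero_shot_accuracy` is a
# hypothetical stand-in for the user's evaluation routine.

from typing import Callable, Dict, List, Tuple

def select_representation(
    candidates: Dict[str, Dict[str, str]],
    proxy_set: List[Tuple[str, str]],
    zero_shot_accuracy: Callable[[Dict[str, str], List[Tuple[str, str]]], float],
) -> str:
    """Return the candidate name with the highest zero-shot proxy accuracy."""
    scores = {name: zero_shot_accuracy(mapping, proxy_set)
              for name, mapping in candidates.items()}
    return max(scores, key=scores.get)

candidates = {
    "semantic": {"pos": "positive", "neg": "negative"},
    "colors":   {"pos": "blue",     "neg": "red"},
}
# best = select_representation(candidates, proxy_set, zero_shot_accuracy)
```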

Limitations persist: open-weight LLMs may internalize novel in-context representations but often fail to flexibly deploy them in downstream inference (e.g., explicit reasoning or adaptive world modeling tasks) (Lepori et al., 4 Feb 2026). Current next-token prediction objectives are insufficient for driving causal use of newly-induced representations. Addressing these gaps may require architectural interventions, auxiliary objectives, or prompt-engineering strategies targeting explicit deployment and integration of in-context learned representations.

In sum, in-context learning with representations is governed by the interaction of baseline structure conferred by the choice of representation and the additive, often orthogonal, effect of learning from demonstrations. Mechanistically, transformers support remarkable adaptation by reorganizing activations and attention, but full exploitation of newly learned representations—especially in multimodal and compositional settings—remains an ongoing area of research, with implications for prompt optimization, transfer learning, and the design of future generalist models.
