In-Context Representation Learning

Updated 10 February 2026
  • In-Context Representation Learning is a mechanism where transformers adapt internal representations based solely on input context, enabling dynamic, few-shot and cross-modal generalization.
  • It leverages contrastive objectives and attention-driven geometric reorganization to induce context-sensitive latent spaces in a single forward pass.
  • Multimodal extensions use projection techniques to integrate diverse data types, enhancing performance on novel tasks without explicit parameter tuning.

In-context representation learning (ICRL) refers to the mechanism by which transformers, especially large language and multimodal models, induce and adapt internal representations based purely on context provided at inference time, without any explicit parameter updates. This phenomenon enables models to generalize to novel tasks, modalities, or input distributions by leveraging example-driven prompting or external representations, resulting in dynamic, context-sensitive latent geometries and prediction behaviors.

1. Theoretical Foundations: In-Context Learning as Representation Learning

Recent research demonstrates a deep mathematical connection between in-context learning (ICL) in transformers and classical representation learning. At the core, transformer ICL can be viewed as inducing a context-dependent latent space—through a single forward pass over demonstration-rich prompts—where downstream tasks are recast as geometric or contrastive operations in this space.

A formal kernel-learning duality establishes that a softmax attention layer in a transformer implements one exact gradient-descent step on a supervised contrastive objective defined over key-value pairs: for a context $D$, queries $Q = W_Q D$, keys $K = W_K D$, and values $V = W_V D$, the ICL mechanism can be interpreted as minimizing a contrastive loss $L(\hat{x}_K, \hat{x}_V) = \operatorname{dist}(\hat{x}_K, \hat{x}_V)$, where $\hat{x}_K$ and $\hat{x}_V$ are nonlinear transformations of $K$ and $V$ (Miyanishi et al., 2024, Ren et al., 2023). This forms the basis of the contrastive-learning analogy: the model's effective parameter update for a new task takes place in representation space rather than through conventional training.
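
The quantities in this duality can be made concrete with a toy computation. Below is a minimal NumPy sketch of the softmax attention map over context-derived queries, keys, and values (the dimensions and random weights are arbitrary placeholders, not taken from any paper); it shows each query output as a kernel-weighted average of values, which is the object the gradient-step interpretation analyzes.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Toy context D: d_model features x n_ctx context tokens (illustrative sizes).
rng = np.random.default_rng(0)
d_model, n_ctx = 8, 5
D = rng.normal(size=(d_model, n_ctx))
W_Q = rng.normal(size=(d_model, d_model))
W_K = rng.normal(size=(d_model, d_model))
W_V = rng.normal(size=(d_model, d_model))

Q, K, V = W_Q @ D, W_K @ D, W_V @ D  # queries, keys, values as in the text

# Softmax attention: each query's output is a kernel-weighted average of
# the values, with weights given by the softmax over key-query similarities.
A = softmax(K.T @ Q / np.sqrt(d_model), axis=0)  # (n_ctx, n_ctx); columns sum to 1
out = V @ A                                      # (d_model, n_ctx)
```

Each column of `A` is a probability distribution over context positions, so `out` lies in the convex hull of the value vectors — the "one step in representation space" that the contrastive reading formalizes.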

2. Multimodal and Continuous Representation Injection

ICRL has been extended well beyond text, allowing LLMs and multimodal models to consume, process, and reason over arbitrary non-textual representations as first-class context. In “Vector-ICL,” projected continuous vectors from arbitrary encoders (e.g., for vision, molecules, time series, fMRI) are aligned into the LM’s native embedding space through lightweight projectors (linear or MLP). These projected representations are directly injected into the context as pseudo-tokens, enabling the LLM to perform ICL over arbitrary modalities. Supervised projector tuning is optionally employed for further alignment, but even purely pre-trained or random linear projections can suffice due to properties of norm and cosine similarity preservation (Zhuang et al., 2024). This mechanism enables compositional, modality-agnostic prompt construction, supporting cross-modal K-shot generalization and reasoning.
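
A minimal sketch of this injection path, assuming hypothetical encoder and LM embedding sizes: a random linear projector maps an embedding from an arbitrary encoder into the LM's embedding space, where it can be spliced into the prompt as a pseudo-token. The approximate cosine preservation that makes even untuned projectors viable can be checked numerically.

```python
import numpy as np

rng = np.random.default_rng(1)
d_enc, d_lm = 32, 64  # illustrative encoder and LM embedding dimensions
# Random linear projector, scaled so projected norms stay comparable.
P = rng.normal(size=(d_lm, d_enc)) / np.sqrt(d_enc)

z = rng.normal(size=d_enc)   # embedding from some arbitrary encoder
pseudo_token = P @ z         # now lives in the LM's embedding space

# Random linear maps roughly preserve cosine similarity (Johnson-
# Lindenstrauss-style), which is why untuned projectors can already work.
z2 = rng.normal(size=d_enc)
cos = lambda a, b: a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
```

A supervised projector (linear or MLP) would simply replace `P` with trained weights; the injection mechanics are unchanged.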

In a strictly training-free setting, “ICRL” applies untrained linear projections or PCA-compressed embeddings from foundational models (FM) directly into the prompt, with optimal-transport alignment to adapt non-text modalities to LLM embedding statistics. The LLM can then exploit these representations for few-shot regression and classification in previously unencountered modalities (Zhang et al., 22 Sep 2025).
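
The statistics-matching step can be sketched as a per-dimension Gaussian optimal-transport map — a simplification of the paper's alignment, with all sizes and distributions below purely illustrative: shift and rescale the foundation-model embeddings so their first two moments match those of the LLM's input embeddings.

```python
import numpy as np

rng = np.random.default_rng(2)
# Hypothetical: FM embeddings for a non-text modality, and a sample of the
# target LLM's input-embedding rows (LLM embeddings tend to be small-norm).
fm_emb = rng.normal(loc=3.0, scale=5.0, size=(100, 16))
lm_emb = rng.normal(loc=0.0, scale=0.02, size=(1000, 16))

# Per-dimension Gaussian OT map: whiten the FM embeddings, then rescale and
# shift them to the LLM embedding statistics.
mu_f, sd_f = fm_emb.mean(axis=0), fm_emb.std(axis=0)
mu_l, sd_l = lm_emb.mean(axis=0), lm_emb.std(axis=0)
aligned = (fm_emb - mu_f) / sd_f * sd_l + mu_l
```

After this map, the injected vectors are statistically indistinguishable (to second order, per dimension) from native token embeddings, which is what lets the frozen LLM consume them.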

3. Mechanistic Insights: Geometry, Mixed Effects, and Prompt Structure

Transformers, when supplied with structured in-context examples, dynamically reorganize the geometry of their latent representation space. Quantitative and qualitative studies show that model activations corresponding to contextually adjacent or semantically related tokens converge in the residual-stream geometry, minimizing the Dirichlet energy with respect to an emergent, context-specified graph (Park et al., 2024, Lepori et al., 4 Feb 2026). This phenomenon is abrupt: as context length or the number of exemplars increases, a sharp phase transition is observed where the model’s internal structure realigns from pretrained semantics toward the structure implied by the prompt.
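
The Dirichlet energy these studies track can be computed directly. The toy example below (a hypothetical three-token path graph) shows that representations placing graph-adjacent tokens close together attain lower energy than scattered ones.

```python
import numpy as np

def dirichlet_energy(X, W):
    """Sum over edges of w_ij * ||x_i - x_j||^2.
    X: (n, d) representations; W: (n, n) symmetric adjacency weights."""
    diff2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    return 0.5 * (W * diff2).sum()  # 0.5 corrects for double-counted edges

# Context graph connecting tokens 0-1 and 1-2 (illustrative adjacency).
W = np.array([[0, 1, 0],
              [1, 0, 1],
              [0, 1, 0]], dtype=float)

clustered = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]])  # neighbors nearby
scattered = np.array([[0.0, 0.0], [5.0, 0.0], [0.0, 5.0]])  # neighbors far apart
```

Minimizing this energy over the residual stream, with the graph supplied by the prompt, is exactly the geometric reorganization described above.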

Mixed-effect modeling reveals both fixed, global biases (e.g., input formatting style and modality) and random effects tied to the specific examples in multimodal ICL. Analytical decomposition of the loss into semantic and formatting components shows that semantic content dominates model behavior on hard tasks, while formatting bias is more pronounced on easy tasks. The model's performance is thus characterized both by its sensitivity to representational shift (driven by key-value embedding distances) and by its invariance to changes in input surface form (Miyanishi et al., 2024).

4. Architectures and Practical Algorithms

Diverse architectures have been developed to operationalize ICRL. In the “Credibility Transformer,” a credibility-weighted combination of instance-specific and global-prior embeddings forms the basis for context-driven adaptation. In-context enhancement is achieved via cross-batch attention layers and outcome-token decoration, which "recenter" predictions on similar, previously encountered examples; this mechanism supports zero-shot generalization to new categorical feature levels (Padayachy et al., 9 Sep 2025).
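
The credibility-weighted combination can be sketched with a Bühlmann-style weight $p = n/(n+k)$ — an illustrative assumption, since the Credibility Transformer's exact parameterization may differ: with few observations the embedding stays near the global prior, and it shifts toward the instance-specific embedding as evidence accumulates.

```python
import numpy as np

def credibility_embed(x_instance, x_prior, n_obs, k=5.0):
    """Blend an instance-specific embedding with a global prior using a
    Buhlmann-style credibility weight p = n / (n + k) (hypothetical form)."""
    p = n_obs / (n_obs + k)
    return p * x_instance + (1.0 - p) * x_prior

prior = np.zeros(4)   # global-prior embedding (e.g., for an unseen level)
inst = np.ones(4)     # instance-specific embedding

rare = credibility_embed(inst, prior, n_obs=1)      # little data -> near prior
common = credibility_embed(inst, prior, n_obs=100)  # much data -> near instance
```

Falling back to the prior when `n_obs` is small is what enables the zero-shot handling of new categorical feature levels described above.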

Methods such as “Implicit In-context Learning” (I2CL) compress the information from demonstration examples into a compact activation-space "context vector," which is then injected into every transformer layer through linear gating at inference time. This allows the model to approach few-shot ICL accuracy at zero-shot computational and memory cost and to detect and transfer task similarity through learned low-dimensional representations (Li et al., 2024).
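
The injection step can be sketched as follows, with scalar gates and random vectors standing in for the learned quantities (I2CL's actual gating is a learned linear combination; this is a simplified stand-in): a per-layer context vector, distilled from demonstrations, is added to the residual stream through a gate, so a zero gate recovers plain zero-shot behavior.

```python
import numpy as np

def inject_context(residual, context_vec, gate):
    """I2CL-style injection: add a gated context vector to the residual
    stream at one layer (scalar gate here for simplicity)."""
    return residual + gate * context_vec

rng = np.random.default_rng(3)
n_layers, d = 4, 8
# One context vector per layer, distilled offline from demonstrations
# (random stand-ins here).
context_vecs = rng.normal(size=(n_layers, d))
gates = np.full(n_layers, 0.1)  # learned gating coefficients (placeholder)

h = rng.normal(size=d)  # zero-shot residual-stream activation
for l in range(n_layers):
    h = inject_context(h, context_vecs[l], gates[l])
```

Because the demonstrations are compressed offline into `context_vecs`, inference-time cost is identical to the zero-shot forward pass — the source of the claimed zero-shot compute and memory footprint.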

In the visual domain, unified in-context representation learning is achieved by discrete quantization and embedding of both text and image prompts into a single vocabulary, autoregressively modeled using a sparse, decoder-only transformer. The result is a modality-agnostic framework capable of handling multimodal generation and prediction (Sheng et al., 2023). Similarly, for large vision-language models, visual in-context learning is improved by selecting and summarizing demonstration images based on both visual and intent-oriented criteria, composing the final prompt as a sequence of compact, text-based summaries (Zhou et al., 2024, Zhang et al., 2023).
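
The unified-vocabulary idea can be sketched with a VQ-style nearest-neighbor quantizer (codebook size, dimensions, and id offsets below are all illustrative): continuous patch embeddings become discrete ids that then share one autoregressive vocabulary with text tokens.

```python
import numpy as np

def quantize(patches, codebook):
    """Map continuous patch embeddings to discrete codebook indices by
    nearest-neighbor lookup (VQ-style sketch of unified tokenization)."""
    d2 = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

rng = np.random.default_rng(4)
codebook = rng.normal(size=(16, 8))  # hypothetical visual codebook
# Patches near codebook entries 3, 7, 7, lightly perturbed.
patches = codebook[[3, 7, 7]] + 0.01 * rng.normal(size=(3, 8))

image_tokens = quantize(patches, codebook)

# Image ids are offset past the text vocabulary so both modalities live in
# one sequence a decoder-only transformer can model autoregressively.
text_vocab_size = 1000
unified = np.concatenate([[5, 17], image_tokens + text_vocab_size])
```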

5. Representation Quality, Task Generalization, and Orthogonality

The efficacy of ICRL depends not only on prompt content and task structure but also on the choice of representation for both demonstrations and label tokens. Studies that systematically optimize label representations find that, across a spectrum of quality, baseline zero-shot accuracy is largely determined by the representation itself, whereas the incremental gain from additional demonstrations is nearly independent of it. This orthogonality allows representation schemes and prompt size to be optimized separately and additively to maximize few-shot performance. Medium-quality representations derive the most benefit from additional demonstrations, with larger models exhibiting steeper learning curves (Marinescu et al., 9 Oct 2025).

“Learning Task Representations from In-Context Learning” introduces the concept of a Learnable Task Vector: a weighted sum of attention-head outputs, causally optimized per task, which aligns the last-layer hidden state distribution with that of optimal ICL. This approach is modality-agnostic and provides a compact, transferable encoding of task structure (Saglam et al., 8 Feb 2025).
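
A minimal sketch of such a task vector, assuming softmax-normalized head weights (a plausible but unconfirmed parameterization; the paper optimizes the weights causally per task): a weighted sum over attention-head outputs yields one compact vector encoding the task.

```python
import numpy as np

def task_vector(head_outputs, weights):
    """Learnable Task Vector sketch: softmax-weighted sum over attention-head
    outputs. head_outputs: (n_heads, d); weights: (n_heads,) learnable logits."""
    w = np.exp(weights) / np.exp(weights).sum()
    return (w[:, None] * head_outputs).sum(axis=0)

rng = np.random.default_rng(5)
n_heads, d = 12, 16
head_outputs = rng.normal(size=(n_heads, d))
weights = rng.normal(size=n_heads)  # would be optimized per task

v = task_vector(head_outputs, weights)
```

With uniform (zero) logits this reduces to a plain mean over heads; training the logits lets the vector emphasize the heads that carry task identity.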

6. Analytical and Empirical Frameworks

A rich suite of analytical frameworks has been introduced to probe ICRL. The contrastive formulation of ICL unifies the view of self-attention as single-step contrastive learning in feature space, with variants (e.g., augmentation, negative sampling, and regularization) enabling more robust and interpretable shifts in latent geometry (Ren et al., 2023, Miyanishi et al., 2024). Mixed-effect regression and representation-level analyses provide fine-grained decompositions of bias and variance contributions to in-context performance, especially in multimodal and cross-format regimes.

For semi-supervised settings, architectures explicitly encode graph-manifold structure in the context via RBF-affinity self-attention and spectral embedding, enabling the model to learn geometry-aware, context-sensitive representations that transfer to low-label regimes and high-dimensional ambient spaces (Fan et al., 17 Dec 2025). In unsupervised meta-learning, the sequence-modeling reformulation and mixup-based augmentation strategies yield representation spaces with strong cross-domain generalization (Vettoruzzo et al., 2024).
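
The RBF-affinity attention can be sketched directly (bandwidth and sizes are placeholders): affinities decay with squared Euclidean distance, rows are normalized to form an attention map, and features are smoothed over the induced graph.

```python
import numpy as np

def rbf_attention(X, sigma=1.0):
    """Self-attention with RBF affinities: A_ij proportional to
    exp(-||x_i - x_j||^2 / (2 sigma^2)), row-normalized, applied to X."""
    d2 = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
    A = np.exp(-d2 / (2.0 * sigma ** 2))
    A /= A.sum(axis=1, keepdims=True)  # rows are probability distributions
    return A @ X, A

rng = np.random.default_rng(6)
X = rng.normal(size=(6, 4))  # six context points in a 4-d ambient space
out, A = rbf_attention(X)
```

Unlike dot-product attention, the RBF kernel is a similarity on the data manifold itself, which is what makes the learned representations geometry-aware in the low-label regime described above.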

7. Limitations, Challenges, and Future Directions

Substantial challenges remain in ICRL. One key limitation observed is the “representation deployment gap”: despite in-context induction of rich semantic geometry, current models often fail to leverage these representations for downstream prediction after a prompt interruption or in tasks requiring more flexible information deployment. This is especially acute in adaptive world modeling, where latent geometry is well-formed but not referenced for task execution (Lepori et al., 4 Feb 2026). Architectural interventions such as auxiliary regularization, specialized routing, and explicit binding mechanisms may be necessary for true context-driven world modeling.

Scaling context length can unlock novel latent capabilities, but advances in robust cross-modal alignment, compact representation, and transfer-friendly architectures are needed for further progress. Future extensions focus on mechanistic interpretability (e.g., path-patching, attention head analysis), multi-modality (audio, video, tables), integrating negative sampling for full contrastive objectives, and the invention of context-aware tokenization schemes.

In summary, in-context representation learning formalizes, generalizes, and extends the adaptive representation capabilities of transformer-based models, unifying principles from contrastive learning, kernel methods, and meta-learning into a single mechanistic and algorithmic framework that underlies contemporary few-shot and cross-modal generalization (Miyanishi et al., 2024, Ren et al., 2023, Zhang et al., 22 Sep 2025, Zhuang et al., 2024, Sheng et al., 2023).
