In-Context Representation Learning (ICRL)
- In-Context Representation Learning (ICRL) is a method where transformer models dynamically adapt their token embeddings to reflect context-specific relationships.
- It employs prompt-induced energy minimization and layered computations to reconfigure internal representations without gradient updates.
- ICRL enhances few-shot, multimodal, and semi-supervised learning by aligning embeddings to new semantic structures, improving model generalization.
In-Context Representation Learning (ICRL) refers to the capacity of transformer-based models—particularly LLMs—to dynamically reorganize their internal representation geometry in response to contextually supplied demonstrations, thereby overriding or reshaping the geometric organization imposed by pretraining. This paradigm extends conventional in-context learning (ICL) by emphasizing not only output adaptation but also the high-dimensional restructuring of concept representations in response to prompt-level supervision, spanning text and non-text modalities, supervised and semi-supervised settings, and encompassing both theoretical analyses and empirical phenomena (Park et al., 2024, Zhang et al., 22 Sep 2025, Fan et al., 17 Dec 2025).
1. Foundational Definition and Conceptual Framework
ICRL describes the dynamic adaptation, within a frozen model, of token or concept embeddings to reflect context-specific relationships that may encode new semantics, task objectives, or cross-modal associations. Let $\mathcal{T} = \{t_1, \dots, t_N\}$ be a set of tokens whose pretrained embeddings $\{e_i\}$ instantiate a “semantic prior”—a geometry reflecting corpus-driven relationships. Upon presenting a context consisting of in-context demonstrations (e.g., $(t_i, t_j)$ pairs specifying relations on a novel graph structure), the internal hidden-state vectors $\{x_i\}$ are reorganized such that their new geometric relationships reflect the prompt-implied semantics. This latent reconfiguration is typically prompt-length dependent and context-sensitive, and may be formalized as minimizing a context-conditional energy function (Park et al., 2024).
ICRL represents a general mechanism underpinning LLMs’ on-the-fly induction of task-specific representations without gradient updates, closely linking to their efficacy in few-shot learning, adaptive generalization, and emerging multimodal reasoning capabilities (Zhang et al., 22 Sep 2025, Zhuang et al., 2024).
2. Mechanistic and Theoretical Analyses
The functional core of ICRL is the transformer’s ability to implement emergent optimization dynamics in its activations, as elucidated by both empirical probing and formal constructions:
- Energy Minimization Analogy: Given a graph $G = (V, E)$ and a bijection mapping abstract nodes to pretrained tokens with embeddings $e_i$, one defines an energy

  $$E(\{x_i\}) = \sum_{(i,j) \in E} w_{ij}\, \lVert x_i - x_j \rVert^2 + \lambda \sum_i \lVert x_i - e_i \rVert^2,$$

  where the $w_{ij}$ are edge weights and $\lambda$ controls retention of the pretrained geometry. With increasing context size, the graph-smoothness term comes to dominate, inducing a sharp geometric reorganization of token embeddings to reflect the context-induced structure (Park et al., 2024).
- Two-Phase Computation: Theoretical constructions show that a transformer can, within moderate depth and size, implement a pipeline where (i) early layers compute a context-dependent representation (e.g., via an MLP mapping $x \mapsto \phi(x)$); (ii) upper layers execute an in-context algorithm (e.g., Bayesian ridge regression) on those representations, with empirical probing confirming this modular separation (Guo et al., 2023). In structured tasks, lower-layer hidden states encode the representations $\phi(x_i)$, which are subsequently overwritten as upper layers perform context-conditioned computation.
- Semi-Supervised Manifold Structure: Sufficiently deep transformers can use self-attention with non-linear kernels to construct discrete Laplacians and extract eigenmaps, which converge to Laplace–Beltrami eigenfunctions as the number of unlabeled examples grows, thus unifying manifold learning and ICL (Fan et al., 17 Dec 2025).
- Layerwise Compression–Expansion: Empirical geometric analyses reveal that early layers compress in-context demonstrations into a compact, discriminative representation (“task vector”), which is then expanded in late layers to condition predictions, with minimum Task-Distance Normalized Variance (TDNV) identifying the bottleneck layer (Jiang et al., 22 May 2025).
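The energy-minimization view above can be sketched numerically. The following is a toy illustration under stated assumptions, not the construction of Park et al. (2024): the ring graph, dimensions, and $\lambda$ values are arbitrary choices, and plain gradient descent stands in for whatever dynamics the transformer actually implements. Small $\lambda$ plays the role of a long context (the smoothness term dominates); large $\lambda$ preserves the pretrained geometry.

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 8, 4                       # number of tokens, embedding dimension
E0 = rng.normal(size=(n, d))      # pretrained "semantic prior" embeddings e_i

# Ring-graph adjacency standing in for the prompt-implied structure.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i + 1) % n] = W[(i + 1) % n, i] = 1.0
L = np.diag(W.sum(axis=1)) - W    # graph Laplacian, so tr(X^T L X) is the Dirichlet term

def energy(X, lam):
    smooth = np.trace(X.T @ L @ X)           # graph-smoothness (Dirichlet) term
    tether = lam * np.sum((X - E0) ** 2)     # retention of pretrained geometry
    return smooth + tether

def minimize(lam, steps=500, lr=0.01):
    X = E0.copy()
    for _ in range(steps):
        grad = 2.0 * L @ X + 2.0 * lam * (X - E0)
        X = X - lr * grad
    return X

X_context = minimize(lam=0.1)     # context-dominated: embeddings smooth over the graph
X_prior = minimize(lam=10.0)      # prior-dominated: embeddings stay near E0
```

Reducing $\lambda$ moves the minimizer from the pretrained embeddings toward a graph-smooth configuration, mirroring the sharp reorganization reported as context grows.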
3. Modalities, Prompt Design, and Cross-Modal Adaptation
Recent work extends ICRL to non-text data using frozen foundation models (FMs) and cross-modal projections. The paradigm for mapping non-text representations into LLMs comprises:
- Text-level injection: High-dimensional FM representations are first reduced (e.g., PCA), then stringified as comma-separated values and injected into the prompt as pseudo-token sequences (Zhang et al., 22 Sep 2025).
- Embedding-level injection: FM vectors are mapped into the LLM’s embedding space via zero-padding, random projections, or empirical distribution alignment (e.g., optimal transport to match the empirical mean and variance of target embeddings), with theoretical guarantees for norm and cosine similarity preservation under randomized linear maps (Zhang et al., 22 Sep 2025).
- Prompt templates encase the vector representations and corresponding labels, with the LLM attending directly to these projected inputs; ablation studies highlight the necessity of inter-example diversity for effective ICRL (Zhang et al., 22 Sep 2025).
- “Vector-ICL” generalizes this architecture, employing lightweight projectors trained via language modeling objectives, enabling LLMs to process continuous vectors from arbitrary domains, including time-series, graphs, and fMRI, often outperforming domain-tuned baselines after task-specific fine-tuning of the projector (Zhuang et al., 2024).
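Embedding-level injection can be sketched minimally as follows. This is an illustrative sketch, not the cited method's implementation: the dimensions, the Gaussian projection, and the target statistics `tgt_mean`/`tgt_std` are all placeholder assumptions standing in for the LLM's actual token-embedding distribution.

```python
import numpy as np

rng = np.random.default_rng(1)

d_fm, d_llm = 64, 256             # frozen-encoder and LLM embedding dims (illustrative)
fm_vecs = rng.normal(size=(8, d_fm)) * 3.0 + 1.0   # stand-in FM representations

# Random linear map: a Johnson-Lindenstrauss-style Gaussian projection
# approximately preserves norms and pairwise similarities in expectation.
P = rng.normal(size=(d_fm, d_llm)) / np.sqrt(d_llm)
projected = fm_vecs @ P

# Empirical distribution alignment: match the projected vectors' mean and
# variance to assumed statistics of the LLM's token-embedding table.
tgt_mean, tgt_std = 0.0, 0.02     # hypothetical target embedding statistics
aligned = (projected - projected.mean()) / projected.std()
aligned = aligned * tgt_std + tgt_mean
```

The aligned vectors can then be spliced into a prompt template in place of ordinary token embeddings, which is the step the surveyed papers evaluate.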
4. Empirical Signatures and Experimental Findings
Multiple empirical findings characterize ICRL:
- Phase Transition in Representation Geometry: As the context length surpasses a critical threshold $n^*$, models’ internal representations shift abruptly toward alignment with context-imposed structures (e.g., a predefined graph), evidenced by a sharp drop in Dirichlet energy and clear geometry in PCA plots (Park et al., 2024).
- Critical Context Scaling: The threshold $n^*$ scales sublinearly with problem size, implying efficient adaptation even for large semantic domains. Below $n^*$, representations remain dominated by the prior semantic geometry; above it, context semantics become dominant.
- Impact of Pretrained Semantic Correlation: When tokens have strongly correlated pretrained semantics, context-induced topology can only partially override prior structure; new relationships are encoded in higher principal components, while dominant PCs retain pretraining geometry (Park et al., 2024).
- Representation-Learning in Semi-Supervised ICL: Transformers can leverage large amounts of unlabeled context to learn robust, geometry-aware features, improving generalization and supporting steep accuracy improvements in low-label regimes across synthetic and image-based datasets (Fan et al., 17 Dec 2025).
- Robustness to Demonstration Quality: ICRL depends on the informativeness and diversity of the supplied vectors; highly homogeneous input vectors degrade performance toward random guessing, while preserving inter-example diversity via simple alignment methods optimizes in-context adaptation (Zhang et al., 22 Sep 2025).
- Cross-Modality Performance: Without fine-tuning, ICRL can improve molecular property prediction, vision, and time-series tasks, though best results require careful alignment of encoder distributions, and performance still trails that of fully supervised or fine-tuned specialist models (Zhang et al., 22 Sep 2025, Zhuang et al., 2024).
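The diversity dependence noted above can be probed with a simple diagnostic. The mean pairwise cosine similarity used here is a hypothetical proxy for demonstration homogeneity, not a metric from the cited papers:

```python
import numpy as np

rng = np.random.default_rng(2)

def mean_pairwise_cosine(V):
    """Mean off-diagonal cosine similarity of row vectors; values near 1.0
    indicate a nearly homogeneous demonstration set."""
    U = V / np.linalg.norm(V, axis=1, keepdims=True)
    S = U @ U.T
    n = len(V)
    return (S.sum() - n) / (n * (n - 1))

# Diverse demonstrations: independent random vectors.
diverse = rng.normal(size=(16, 32))

# Homogeneous demonstrations: one base vector plus tiny perturbations.
homogeneous = rng.normal(size=(1, 32)) + 0.01 * rng.normal(size=(16, 32))

print(mean_pairwise_cosine(diverse))       # near 0: diverse set
print(mean_pairwise_cosine(homogeneous))   # near 1: degenerate set
```

Under the reported findings, a set scoring near 1.0 on such a diagnostic would be expected to drive in-context performance toward random guessing.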
5. Representation and Demonstration Orthogonality
Systematic studies indicate that representation quality (e.g., label-token assignment in classification) determines the baseline and ceiling accuracy for ICRL, but the process of learning from added context is largely orthogonal: it proceeds independently of the chosen representation and monotonically enhances accuracy atop the representational baseline, without altering the relative ordering of representations (Marinescu et al., 9 Oct 2025). The incremental benefit of additional context is modulated by representation quality and model size; optimal adaptation thus involves both representational search and context scaling.
| Section | Key Findings | Representative Source |
|---|---|---|
| Geometry & Energy | Representations minimize unary-tethered Dirichlet energy; sharp transition as context exceeds $n^*$ | (Park et al., 2024) |
| Cross-Modality | Vector projection and alignment extend ICRL to molecules, fMRI, graphs | (Zhang et al., 22 Sep 2025, Zhuang et al., 2024) |
| Semi-Supervised | In-context unlabeled tokens enable Laplacian eigenmap learning and OOD generalization | (Fan et al., 17 Dec 2025) |
| Demonstration | Representational baseline and learning efficiency are separable and orthogonal | (Marinescu et al., 9 Oct 2025) |
6. Theoretical Links to Optimization and Representation Learning
ICRL presents deep connections to representation learning, kernel methods, and implicit optimization:
- Single attention layers can be viewed as implementing a one-step gradient update in a kernel-induced feature space, establishing a duality with contrastive learning objectives; the excess error decays with the number of demonstrations, yielding generalization bounds (Ren et al., 2023).
- The mechanism is amenable to known modifications from contrastive learning: regularizing value heads, employing nonlinear projections, or incorporating negatives each systematically alters in-context adaptation dynamics.
- Compositionally, stacked layers and feed-forward transformations create a sequence of reference models, supporting block-coordinate gradient ascent on a multi-layer energy landscape (Ren et al., 2023).
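The one-step-gradient view can be verified numerically in the simplest case. This sketch uses a linear kernel and illustrative sizes (it is the standard linear-attention construction, not the full kernel-feature-space argument of Ren et al., 2023): a single gradient step on the least-squares loss from zero initialization produces exactly a linear-attention readout over the demonstrations.

```python
import numpy as np

rng = np.random.default_rng(3)

d, n = 5, 20
X = rng.normal(size=(n, d))      # in-context inputs x_i
w_true = rng.normal(size=d)
y = X @ w_true                   # in-context targets y_i
x_q = rng.normal(size=d)         # query point
eta = 0.1                        # learning rate of the implicit update

# (a) One gradient step on L(w) = 1/(2n) * sum_i (w.x_i - y_i)^2 from w = 0:
#     w1 = (eta / n) * sum_i y_i x_i
w1 = (eta / n) * X.T @ y
pred_gd = w1 @ x_q

# (b) Linear-attention readout: labels as values, un-normalized linear-kernel
#     scores <x_i, x_q> as attention weights, scaled by eta / n.
attn_scores = X @ x_q
pred_attn = (eta / n) * (y * attn_scores).sum()

# The two predictions coincide, exhibiting the attention/gradient-step duality.
assert np.isclose(pred_gd, pred_attn)
```

Swapping the linear kernel for a non-linear feature map reproduces the kernel-induced version of the same identity, which is the form the cited analysis works with.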
7. Applications, Limitations, and Research Directions
ICRL breaks the constraint of fixed representation geometry imposed by pretraining, theoretically enabling:
- Modular in-context “world models”: Rapid adaptation of underlying representations for graph tracing, environment modeling, or structured knowledge induction, directly from prompt-level specification (Park et al., 2024).
- Cross-modal downstream reasoning: Prompt-injected frozen encoders allow LLMs to abstract over molecular, visual, or sensor data, facilitating few-shot reasoning and multimodal classification without weight updates (Zhang et al., 22 Sep 2025, Zhuang et al., 2024).
- In-context reinforcement learning: Integration with explicit reward-belief modules, such as variational autoencoders, supports belief-augmented sequence modeling for in-context reinforcement learning and lifts transformer-based policies toward the Bayes-adaptive regime (Dippel et al., 13 Nov 2025).
Principal limitations include dependence on encoder diversity, finite context-window bottlenecks, and an inability to match fully supervised domain models without further optimization or hybridization. Open research avenues involve principled prompt and representation search, joint encoder–projector training, scaling to higher-dimensional or structured embeddings, and dissecting the internal dynamics of context-induced representational adaptation.
ICRL formalizes a core adaptive capability of large sequence models: rapid, context-sensitive reorganization of their representational geometry. It acts as a bridge between prompt engineering, geometric representation learning, and implicit energy minimization, offering a unifying perspective for few-shot learning, multimodal processing, and dynamic adaptation in advanced neural architectures (Park et al., 2024, Zhang et al., 22 Sep 2025, Fan et al., 17 Dec 2025, Guo et al., 2023, Zhuang et al., 2024).