
In-Context Vectors (M²IV)

Updated 26 July 2025
  • In-Context Vectors (M²IV) are compact, task-specific latent representations derived from contextual demonstrations to drive in-context learning.
  • M²IV techniques use methods like activation extraction, averaging, and projection to inject learned vectors into key network layers for scalable inference.
  • Empirical results show that M²IVs improve accuracy, reduce inference latency, and enhance robustness across multimodal and language tasks.

An In-Context Vector ("M²IV" or Multimodal/Matrix In-Context Vector, Editor's term) refers to a compact, task-dependent internal representation derived from a set of contextual examples (demonstrations) and directly integrated into a model's processing pipeline—either as a latent vector or a collection of such vectors across network layers or attention heads. Rather than concatenating demonstrations as prompt tokens, M²IV methods extract or learn these vectors from activations, patch or inject them at strategic locations (residual streams, attention heads, FFN blocks), and use them to steer model outputs, thus enabling efficient, scalable, and controllable in-context learning across language, multimodal, and hybrid architectures.

1. Theoretical Foundations and Emergence

The foundational perspective on In-Context Vectors is that they emerge as compressed representations of task-specific information required to perform in-context learning (ICL) (Hendel et al., 2023, Yang et al., 16 Jan 2025, Dong et al., 10 Jun 2025). The "task vector" $\theta(S)$ is formally a mapping from a demonstration set $S$ to a vector in the model's latent space such that, for a query $x$, the model operates as $T([S, x]) = f(x; \theta(S))$, with $f$ parameterizing a hypothesis class (Hendel et al., 2023). Linear transformer analyses show that such vectors often manifest as linear combinations of demonstration embeddings (the "Linear Combination Conjecture"), where the extracted vector summarizes mapping information as $\mathbf{z}_{tv} = \sum_i \beta_i \mathbf{z}_i$, with $\beta_i$ as learned coefficients (Dong et al., 10 Jun 2025).
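
The Linear Combination Conjecture can be illustrated with a toy sketch; the embeddings and coefficients below are random placeholders chosen purely to show the shapes involved, not values from any cited model:

```python
import numpy as np

# Toy illustration of the Linear Combination Conjecture: the extracted task
# vector is a weighted sum of demonstration embeddings z_i. The embeddings Z
# and coefficients beta are random/assumed, purely for illustration.
rng = np.random.default_rng(0)
d = 8                                   # latent dimension (illustrative)
Z = rng.normal(size=(4, d))             # embeddings of 4 demonstrations
beta = np.array([0.1, 0.4, 0.2, 0.3])   # learned coefficients (assumed)

z_tv = (beta[:, None] * Z).sum(axis=0)  # z_tv = sum_i beta_i * z_i
```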

In supervised settings with synthetic data and controlled architectures, robust emergence of task vectors is observed when input formats provide clearly separated positions for contextual signals (e.g., explicit tokens or tag delimiters) and when model depth and capacity are calibrated (Yang et al., 16 Jan 2025). In LSTMs, context vectors (cell states) accumulate task-relevant information through nearly quantized updates, facilitating syntactic depth tracking (Shibata et al., 2020).

2. Mechanisms of Construction and Injection

M²IV construction involves either forward-extracting activations from demonstration-augmented sequences (“static” methods) or learning explicit, per-layer vectors through optimization (“learnable” methods). Techniques include averaging head outputs, projecting head activations via learned matrices, principal component extraction, or learning shift/injection vectors via loss alignment (Li et al., 6 Apr 2025, Liu et al., 2023, Peng et al., 19 Jun 2024, Cai et al., 23 May 2025).
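
Two of the "static" recipes above, averaging activations and principal component extraction, can be sketched as follows (shapes and data are illustrative assumptions, not the exact procedure of any single cited method):

```python
import numpy as np

# "Static" extraction sketch: average activations over demonstration passes,
# or take the leading principal component of the centered activations.
rng = np.random.default_rng(1)
n_demos, d = 32, 64
acts = rng.normal(size=(n_demos, d))   # activations from demonstration passes

v_mean = acts.mean(axis=0)             # recipe 1: mean activation vector

centered = acts - v_mean               # recipe 2: top principal component
_, _, Vt = np.linalg.svd(centered, full_matrices=False)
v_pc = Vt[0]                           # leading direction of variation
```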

Injection mechanisms are twofold:

  • Residual Stream Injection: M²IVs are added into the model’s residual stream, often at each layer, modifying both attention (MHA) and MLP branches (Li et al., 6 Apr 2025). For layer $l$ and token $i$, injection proceeds as

$$\mathbf{h}_l^i = \mathbf{h}_{l-1}^i + \left[\mathbf{a}_l^i + \alpha_l^a \mathbf{v}_l^a\right] + \left[\mathbf{m}_l^i + \alpha_l^m \mathbf{v}_l^m\right]$$

with $\mathbf{v}_l^a, \mathbf{v}_l^m$ as learnable vectors and $\alpha_l^a, \alpha_l^m$ as scaling factors.

  • Attention Head or Segment Patching: Mean activations $\mu_{l,j}$ across demonstrations are computed and, at carefully selected head locations $\lambda^{\text{MTV}}_j$, these activations are patched during inference (Huang et al., 21 Jun 2024). Selection of $\lambda$ is data-driven, often via REINFORCE-based optimization, to maximize performance on downstream tasks.
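
Both injection routes can be sketched numerically. All tensors below are random toys, and the scaling factors and head shapes are illustrative assumptions:

```python
import numpy as np

# Toy numeric sketch of the two injection routes.
rng = np.random.default_rng(2)
d = 16

# 1) Residual-stream injection, following the per-layer update above.
h_prev = rng.normal(size=d)                            # h_{l-1}^i
a_out, m_out = rng.normal(size=d), rng.normal(size=d)  # attention / MLP branches
v_a, v_m = rng.normal(size=d), rng.normal(size=d)      # learnable M²IV vectors
alpha_a, alpha_m = 0.1, 0.05                           # learned scaling factors
h = h_prev + (a_out + alpha_a * v_a) + (m_out + alpha_m * v_m)

# 2) Head patching: replace selected heads' activations with the mean
#    activation mu computed over demonstrations.
n_demos, n_heads, d_head = 8, 4, 8
demo_acts = rng.normal(size=(n_demos, n_heads, d_head))
mu = demo_acts.mean(axis=0)                            # mean per head
query_acts = rng.normal(size=(n_heads, d_head))
selected = [1, 3]                                      # chosen head indices (assumed)
for j in selected:
    query_acts[j] = mu[j]                              # patch at inference
```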

Dynamic segmentation and injection strategies partition latent representations based on task complexity and optimization feedback, enabling injection at the most impactful model positions (Cai et al., 23 May 2025). Learnable variants (e.g., LIVE, M²IV, MimIC) further optimize per-head/per-layer shift vectors through alignment and KL-divergence losses against full ICL outputs (Peng et al., 19 Jun 2024, Li et al., 6 Apr 2025, Jiang et al., 11 Apr 2025).
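
The alignment idea behind the learnable variants can be sketched with a toy softmax model: fit a shift vector so the shifted zero-shot output distribution matches the full-ICL distribution under a KL loss. The optimization setup below (treating logits as hidden state plus shift, plain gradient descent) is an assumption for illustration, not the cited training procedure:

```python
import numpy as np

# Hedged sketch of the "learnable" route: fit a shift vector v minimizing
# KL(p_teacher || softmax(h_zero + v)); for softmax outputs the gradient of
# this KL w.r.t. the logits is (q - p).
def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(3)
d = 10
teacher_logits = rng.normal(size=d)    # "full ICL" logits (teacher)
h_zero = rng.normal(size=d)            # zero-shot hidden state (student)
p = softmax(teacher_logits)

v = np.zeros(d)
for _ in range(2000):                  # plain gradient descent on the KL loss
    q = softmax(h_zero + v)
    v -= 0.5 * (q - p)
```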

3. Empirical Performance and Applications

Across extensive benchmarks (VQAv2, OK-VQA, VizWiz, GQA, dense captioning, ImageNet, CUB-200, several text classification and regression datasets), M²IV methods demonstrate:

  • Accuracy Gains: M²IV consistently outperforms vanilla ICL and alternative compact representation methods (Task Vector, Function Vector, ICV), achieving average accuracy improvements of 3.74% over standard ICL at the same shot count and often even exceeding the performance of explicit many-shot ICL as context size increases (Li et al., 6 Apr 2025, Peng et al., 19 Jun 2024, Jiang et al., 11 Apr 2025, Cai et al., 23 May 2025).
  • Efficiency: By replacing demonstration tokens with fixed-size vector sets, FLOPs and latency at inference are reduced by an order of magnitude, enabling rapid adaptation in production systems (Peng et al., 19 Jun 2024, Li et al., 6 Apr 2025).
  • Robustness and Generalization: Aggregation and momentum-based optimization of state vectors (akin to model soup and gradient descent momentum) yield robustness to demonstration order/selection and improved out-of-distribution transfer (Li et al., 17 Apr 2024).
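
The momentum-style aggregation of state vectors mentioned above can be sketched as an exponential moving average; the decay constant and vector shapes are illustrative assumptions:

```python
import numpy as np

# Toy momentum aggregation over per-demonstration-group state vectors,
# smoothing out order/selection noise (beta is an assumed constant).
rng = np.random.default_rng(4)
vectors = rng.normal(size=(8, 32))   # state vectors from 8 demonstration groups
beta = 0.9

agg = vectors[0].copy()
for v in vectors[1:]:
    agg = beta * agg + (1 - beta) * v  # exponential moving average
```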

M²IVs enable a variety of downstream applications, including many-shot scaling, cross-modal alignment, explainability, and safety interventions.

4. Geometric and Functional Interpretation

A unifying geometric framework reveals that M²IVs enhance downstream accuracy via two mechanisms: (1) increasing separability of query hidden states into label-specific clusters in early layers (driven by Previous Token Heads), and (2) refining alignment with label unembedding directions in later layers (driven by Induction Heads and task vectors) (Yang et al., 24 May 2025). The separability determines the theoretical upper bound on accuracy, but actual performance depends critically on alignment—a property sharply increased post-transition in the layerwise trajectory.
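
The two mechanisms can be made concrete with simple proxy metrics on synthetic data. The cluster construction, unembedding rows, and metrics below are illustrative stand-ins, not the exact quantities from the cited analysis:

```python
import numpy as np

# Synthetic illustration: (1) separability of query hidden states into label
# clusters, (2) alignment of class means with label unembedding directions.
rng = np.random.default_rng(5)
n, d = 100, 16
labels = rng.integers(0, 2, size=n)
signs = 2 * labels - 1                               # -1 / +1 per example
H = rng.normal(size=(n, d)) + 2.0 * signs[:, None]   # two separated clusters
U = np.vstack([-np.ones(d), np.ones(d)])             # toy unembedding rows

mu0, mu1 = H[labels == 0].mean(0), H[labels == 1].mean(0)
separability = np.linalg.norm(mu0 - mu1) / H.std()   # between vs. within spread

cos = lambda u, w: u @ w / (np.linalg.norm(u) * np.linalg.norm(w))
alignment = (cos(mu0, U[0]) + cos(mu1, U[1])) / 2    # mean cosine to label rows
```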

Special attention heads (in-context heads, PTH, IH) and distributed rule vectors act in concert: the former extract label features and compute query–key similarity metrics to modulate information flow, while the latter encode distributed high-level abstractions of the underlying rule, especially when compositional generalization across multiple demonstrations is required (Yu et al., 5 Feb 2024, Zheng et al., 23 Jun 2024, Yang et al., 24 May 2025).

Theoretical results explain phenomena such as majority label bias and recency bias as emergent from the similarity metrics learned by query–key towers and suggest normalization or explicit balancing strategies to mitigate such biases (Yu et al., 5 Feb 2024).

5. Scalability, Repository, and Future Directions

M²IV frameworks natively support scaling to many-shot scenarios by aggregating contextual information in fixed-size latent vectors, allowing the implicit compression of hundreds of demonstrations without exceeding context window limitations (Li et al., 6 Apr 2025, Huang et al., 21 Jun 2024). Hierarchical aggregation (divide-and-conquer strategies) and dynamic segmentation further enable scalability in both vision-language and pure language domains (Li et al., 17 Apr 2024, Cai et al., 23 May 2025).
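
A divide-and-conquer aggregation of this kind can be sketched as recursive pairwise averaging of per-chunk vectors into one fixed-size vector; the chunking and averaging scheme is an assumption for illustration:

```python
import numpy as np

# Hierarchical (divide-and-conquer) aggregation sketch: repeatedly average
# adjacent pairs of chunk vectors until one fixed-size vector remains.
def aggregate(vectors):
    while len(vectors) > 1:
        vectors = [(vectors[i] + vectors[i + 1]) / 2 if i + 1 < len(vectors)
                   else vectors[i]
                   for i in range(0, len(vectors), 2)]
    return vectors[0]

rng = np.random.default_rng(6)
chunks = [rng.normal(size=32) for _ in range(100)]  # 100 demonstration chunks
v = aggregate(chunks)
```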

VLibrary is introduced as a dedicated repository for storing, retrieving, and composing learned M²IVs, indexed by semantically meaningful metadata (e.g., MHA scalar weights). This modular infrastructure supports plug-and-play adaptation for cross-modal alignment, explainability, and safety interventions (Li et al., 6 Apr 2025).
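
A VLibrary-style store might look like the following minimal sketch; the interface, key names, and vector shapes are hypothetical, chosen only to illustrate task-indexed storage and retrieval:

```python
import numpy as np

# Hypothetical sketch of a task-indexed vector store: learned per-layer
# (attention, MLP) vector pairs and their scalar weights, keyed by task name.
library = {}

def register(task, vectors, weights):
    library[task] = {"vectors": vectors, "weights": weights}

def retrieve(task):
    entry = library[task]
    return entry["vectors"], entry["weights"]

rng = np.random.default_rng(7)
# 24 layers, one (attention, MLP) vector pair of width 64 each (assumed shape)
register("vqa", rng.normal(size=(24, 2, 64)), rng.normal(size=(24, 2)))
vecs, w = retrieve("vqa")
```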

Notable open challenges include methodologically unified benchmarking, deeper interpretability of injected vectors, and hybrid strategies that combine precise functional mapping with global steering.

6. Comparative Analysis and Methodological Implications

Top-down (global, contrast-based) and bottom-up (head-attribution-based) M²IV extraction methods present a trade-off: top-down approaches (ICV) are highly effective for behavioral modulation and broad steering tasks, while bottom-up approaches (Function Vector, head patching, distributed rule vector) excel in fine-grained, functional mapping scenarios (Brumley et al., 11 Nov 2024, Yang et al., 24 May 2025, Zheng et al., 23 Jun 2024). Sensitivity to demonstration construction, vector injection position, and vector strength must be considered when deploying M²IVs—highlighting the ongoing need for methodologically unified benchmarking, interpretability research, and exploration of hybrid strategies that can combine precision and global steering.

7. Implications for Model Design and Interpretability

M²IV work deepens understanding of information aggregation and task abstraction in neural architectures. In Transformer-based models, they motivate explicit architectural or training modules for localizing, composing, and leveraging in-context knowledge, including auxiliary losses such as TVP-loss or layerwise alignment (Yang et al., 16 Jan 2025, Jiang et al., 11 Apr 2025). Interpretability is enhanced as task- and rule-relevant features become directly manipulable latent objects, and new frameworks for steering, multi-tasking, and post-training adaptation without further parameter updates become feasible.

In summary, In-Context Vectors (M²IV) unify a spectrum of methodologies that achieve compact, efficient, and interpretable in-context learning by integrating learned or extracted representations at key architectural locations. These representations support task-conditional reasoning, scalable adaptation, and robust multi-modal or multi-task control in state-of-the-art large models.
