Multimodal In-Context Learning
- Multimodal In-Context Learning is an approach where models are conditioned on paired image and text examples to infer task mappings without changing parameters.
- It leverages transformer architectures by using demonstration sequences that induce shift vectors through attention, aligning visual and textual cues.
- Practical applications include hybrid question answering and few-shot adaptation, though challenges such as context window limitations and bias persist.
Multimodal In-Context Learning (ICL) is a paradigm extending the in-context learning capabilities of LLMs to domains involving multiple data modalities, most commonly vision and language. In this framework, a large multimodal model (MLLM, VLLM, or LVLM) is provided with a prompt containing a sequence of demonstration examples, where each example typically includes an image and associated text (such as a question, caption, or label), along with the desired output. Following these demonstrations, a query instance (e.g., a new image and question) is presented. The model is then expected to generate the correct response for the query by conditioning on the provided examples, without any updates to its parameters. This approach allows for rapid adaptation to new tasks or variations within tasks based on a few examples presented "in context" within the input sequence, leveraging the extensive pretraining of the large models.
Core Mechanisms and Theoretical Perspectives
Multimodal ICL operates by presenting demonstrations as part of the input sequence, which the model processes using its standard architecture, typically a transformer-based decoder or encoder-decoder. The demonstrations serve to implicitly define a task or specify a desired input-output mapping. The model is theorized to learn this mapping from the examples and apply it to the query.
Mathematically, the in-context examples are often structured as a sequence of (image, text, output) triplets followed by the query (image, text). For a query $(I_q, T_q)$ and a context $C = \{(I_i, T_i, y_i)\}_{i=1}^{k}$, the model generates a response by conditioning on the concatenated sequence, $\hat{y}_q \sim p_\theta\big(y \mid (I_1, T_1, y_1), \ldots, (I_k, T_k, y_k), I_q, T_q\big)$ (Chen et al., 2023, Baldassini et al., 24 Apr 2024).
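As a concrete illustration, the following is a minimal sketch of how such an interleaved prompt can be assembled, assuming a hypothetical chat-style MLLM interface; the `Demo` structure and the commented `model.generate` call are illustrative placeholders rather than any specific model's API.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Demo:
    image_path: str   # demonstration image I_i
    text: str         # accompanying question / instruction T_i
    output: str       # desired answer, caption, or label y_i

def build_icl_prompt(demos: List[Demo], query_image: str, query_text: str) -> List[dict]:
    """Assemble the interleaved sequence (I_1, T_1, y_1), ..., (I_k, T_k, y_k), (I_q, T_q)."""
    segments: List[dict] = []
    for d in demos:
        segments.append({"type": "image", "path": d.image_path})
        segments.append({"type": "text", "content": f"{d.text}\nAnswer: {d.output}"})
    # The query follows the demonstrations; the model completes the missing answer.
    segments.append({"type": "image", "path": query_image})
    segments.append({"type": "text", "content": f"{query_text}\nAnswer:"})
    return segments

# response = model.generate(build_icl_prompt(demos, "query.jpg", "What is shown here?"))
```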
In transformer models, the effect of in-context demonstrations can be interpreted as adding "shift vectors" to the hidden states of the query tokens (Jiang et al., 11 Apr 2025). These shift vectors are induced by the attention mechanism focusing on the demonstration tokens. The attention mechanism computes relevance based on key-value distances between query and demonstration representations (Miyanishi et al., 23 Aug 2024). A contrastive learning perspective suggests that ICL implicitly minimizes the distance between query and relevant context representations (Miyanishi et al., 23 Aug 2024).
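The following toy NumPy sketch illustrates the shift-vector view with made-up dimensions: the attention output computed over the full demonstration-plus-query sequence differs from the query-only attention output by an additive term contributed by the demonstration tokens. It is a conceptual illustration, not the implementation of the cited works.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

d = 16                                             # toy hidden size
rng = np.random.default_rng(0)
q = rng.normal(size=(1, d))                        # attention query of a query token
K_qry, V_qry = rng.normal(size=(4, d)), rng.normal(size=(4, d))  # query-segment keys/values
K_dem, V_dem = rng.normal(size=(8, d)), rng.normal(size=(8, d))  # demonstration keys/values

# Attention over the full sequence [demonstrations; query segment]
K, V = np.vstack([K_dem, K_qry]), np.vstack([V_dem, V_qry])
out_full = softmax(q @ K.T / np.sqrt(d)) @ V

# Attention restricted to the query segment alone (zero-shot behaviour)
out_query_only = softmax(q @ K_qry.T / np.sqrt(d)) @ V_qry

# The difference is the "shift" that the demonstrations induce on the hidden state
shift_vector = out_full - out_query_only
print(shift_vector.shape)  # (1, 16)
```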
Another perspective views multimodal ICL as the process of forming a "task mapping" from the input demonstrations (Li, 5 Mar 2025, Li et al., 21 May 2025). Each demonstration $(x_i, y_i)$ represents a local task mapping $f_i: x_i \mapsto y_i$. The model must integrate these into a global task mapping $f$ that it applies to the query. The quality of this global mapping depends on the selection and configuration of demonstrations. The ability to recognize the task structure from the demonstrations (Task Recognition, TR) is often more critical than learning the specific mappings (Task Learning, TL), as LVLMs can often leverage their pretrained knowledge for TL if TR is successful (Li, 5 Mar 2025).
Key Challenges and Limitations
Multimodal ICL faces several significant challenges that limit its effectiveness and reliability compared to its text-only counterpart:
- Context Window Limitations: Image inputs are typically encoded into a large number of tokens, consuming a substantial portion of the model's fixed context window. This severely restricts the number of multimodal demonstrations that can be included in a prompt, hindering "many-shot" learning (Huang et al., 21 Jun 2024, Gao et al., 11 Jun 2024, Li et al., 6 Apr 2025); see the back-of-the-envelope sketch after this list.
- Ineffective Utilization of Visual Information: Empirical studies reveal that for many tasks, current MLLM architectures predominantly rely on the textual content within demonstrations, with the visual components having minimal impact on the ICL performance, particularly in Visual Question Answering (VQA) tasks (Chen et al., 2023, Baldassini et al., 24 Apr 2024, Xu et al., 1 Jul 2024). Image information in demonstrations is primarily beneficial for pure image-to-text tasks like captioning or classification, or for selecting relevant text demonstrations (Chen et al., 2023, Baldassini et al., 24 Apr 2024).
- Sensitivity to Demonstration Configuration: The performance of multimodal ICL is highly sensitive to the specific set of demonstrations selected and their ordering within the prompt (Li et al., 21 May 2025, Rieff et al., 26 Jun 2025).
- Bias Issues: Models exhibit biases such as a "recency bias," where the model is overly influenced by the last demonstration in the sequence, and a "majority bias," where predictions are swayed by the most frequent labels in the context (Baldassini et al., 24 Apr 2024, Rieff et al., 26 Jun 2025). These biases can lead to unreliable performance.
- Limited Generalization and Induction: Existing models may struggle to generalize from multimodal examples to tasks requiring fine-grained visual recognition, complex rule induction, or reasoning across multiple interleaved images, which are key aspects of true ICL (Doveh et al., 19 Mar 2024, Zong et al., 19 Mar 2024).
- Modest Gains and Shortcut Learning: For simpler tasks like VQA or captioning, few-shot multimodal ICL might offer only marginal gains over zero-shot performance. The model may take "shortcuts" by simply mimicking demonstration answers based on superficial similarity rather than learning the underlying task mapping (Baldassini et al., 24 Apr 2024).
- Lack of Explicit ICL Training: Many base MLLMs are not explicitly pre-trained for handling multi-image interleaved inputs in a semantically coherent ICL format, limiting their ability to fully exploit contextual examples (Doveh et al., 19 Mar 2024, Gao et al., 11 Jun 2024).
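To make the context-window pressure concrete, here is a back-of-the-envelope calculation with illustrative numbers; the per-image token count, context length, and query budget below are assumptions in the style of a LLaVA-like encoder, not figures from the cited papers.

```python
# Rough token budget for multimodal demonstrations (illustrative numbers only).
TOKENS_PER_IMAGE = 576   # e.g. a 24x24 patch grid per image; model-dependent
TOKENS_PER_TEXT = 60     # question + answer text of one demonstration
CONTEXT_WINDOW = 4096    # total context length of a hypothetical MLLM
QUERY_BUDGET = 700       # query image + question + generation headroom

per_demo = TOKENS_PER_IMAGE + TOKENS_PER_TEXT
max_shots = (CONTEXT_WINDOW - QUERY_BUDGET) // per_demo
print(max_shots)  # -> 5: only a handful of image-text demonstrations fit
```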
Strategies for Improving Multimodal ICL
Research efforts aim to address the identified challenges through various strategies:
- Prompt Engineering and Structure Optimization:
- Implementing type-specific ICL tailors the prompt and the use of Chain-of-Thought (CoT) reasoning to the question type (e.g., applying CoT only to complex reasoning questions over tables or multiple modalities) (Liu et al., 2023).
- Training models with a multi-turn conversation format explicitly tunes them for handling sequences of interleaved image-text ICL examples (Doveh et al., 19 Mar 2024).
- Using clear introductory instructions at the start of the prompt enhances task comprehension (Qin et al., 27 Oct 2024).
- Empirical studies suggest that intra-demonstration ordering (e.g., presenting the image before text within each example) can significantly impact performance, while inter-demonstration ordering (the sequence of examples) appears less critical (Qin et al., 27 Oct 2024).
- Demonstration Selection and Configuration:
- Utilizing retrieval-based methods to select demonstrations relevant to the query is crucial (Liu et al., 2023, Qin et al., 27 Oct 2024).
- Mixed-modality selection methods, such as MMICES, filter candidates visually and then re-rank textually to ensure relevance across both modalities (Chen et al., 2023); a simplified sketch of this two-stage idea follows this list.
- Modality-adaptive selection strategies prioritize visual or textual similarity based on the task's inherent requirements (Xu et al., 1 Jul 2024). Text-driven selection is often effective, but visual similarity is vital for tasks requiring fine-grained visual understanding (Xu et al., 1 Jul 2024).
- Learning demonstration selection policies through reinforcement learning frameworks that balance exploration and exploitation allows the model to discover effective combinations of examples that capture inter-demonstration interactions and maximize task performance (Chen et al., 11 Jun 2025).
- Task-aware sequence configuration methods (like SabER (Li, 5 Mar 2025) and TACO (Li et al., 21 May 2025)) use lightweight transformer models and task-aware attention to dynamically select and order demonstrations, aiming for a cohesive global task mapping.
- Efficient Representation and Compression:
- Aggregating image information into the latent space of the corresponding text in demonstrations (AIM) significantly reduces the number of tokens required per demonstration, enabling multi-shot ICL even on models originally trained for single images, improving efficiency and scalability (Gao et al., 11 Jun 2024).
- Compressing the information from many-shot examples into "Multimodal Task Vectors" (MTVs) stored within the model's attention heads allows for efficient many-shot learning by injecting these vectors during inference, bypassing context length limitations (Huang et al., 21 Jun 2024).
- Replacing explicit demonstration tokens with trainable "In-context Vectors" (M²IV) that directly encode the ICL effect offers a token-efficient alternative that leverages the complementary strengths of multi-head attention (MHA) and MLP layers for cross-modal fidelity and semantic distillation (Li et al., 6 Apr 2025).
- Optimizing a compact coreset of demonstrations using untapped support data and visual features as keys (KeCO) improves the informativeness of the coreset for image classification ICL with low computational cost (Chen et al., 19 Apr 2025).
- Learning to directly inject shift vectors after attention layers (MimIC) approximates the effect of demonstrations with lightweight trainable modules, improving efficiency and stability (Jiang et al., 11 Apr 2025).
- Architectural and Internal Mechanism Enhancements:
- Developing unified frameworks that quantize and embed multimodal inputs into a shared space for decoder-only transformers with MoE layers enables handling ICL for tasks requiring multimodal output (e.g., image segmentation masks alongside text) in a single pipeline (Sheng et al., 2023).
- Using a lightweight tuning module (M²IXT) prepended to various multimodal backbones can enhance ICL capabilities with minimal additional parameters and data requirements (Chen et al., 2023).
- Modulating attention logits at inference time based on the query's affinity to each in-context demonstration (ICD) and its positional context (CAMA) calibrates the attention mechanism to better utilize relevant demonstrations and mitigate positional biases (Li et al., 21 May 2025).
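As a concrete example of the mixed-modality selection idea referenced above (visual pre-filtering followed by textual re-ranking), here is a simplified sketch; the embedding inputs are assumed to come from whatever image and text encoders a given system uses (e.g. CLIP towers), and the function is not the released implementation of any cited method.

```python
import numpy as np

def cosine_sim(a: np.ndarray, B: np.ndarray) -> np.ndarray:
    """Cosine similarity between a single vector and a matrix of row vectors."""
    a = a / np.linalg.norm(a)
    B = B / np.linalg.norm(B, axis=1, keepdims=True)
    return B @ a

def select_demonstrations(query_img_emb, query_txt_emb,
                          cand_img_embs, cand_txt_embs,
                          k_visual: int = 32, n_shots: int = 4):
    """Two-stage selection: visual pre-filter, then textual re-rank.

    All *_emb arguments are precomputed embeddings; returns indices of the
    chosen demonstrations in the candidate pool.
    """
    # Stage 1: keep the k_visual candidates whose images are most similar to the query image.
    vis_scores = cosine_sim(query_img_emb, cand_img_embs)
    visual_pool = np.argsort(-vis_scores)[:k_visual]

    # Stage 2: within that pool, re-rank by textual similarity to the query text.
    txt_scores = cosine_sim(query_txt_emb, cand_txt_embs[visual_pool])
    chosen = visual_pool[np.argsort(-txt_scores)[:n_shots]]
    return chosen.tolist()
```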
Evaluation Benchmarks and Empirical Findings
Evaluating multimodal ICL requires benchmarks that go beyond simple VQA and captioning to test true multimodal reasoning, induction, and generalization.
- The MultimodalQA dataset provides hybrid questions over text, tables, and images. The MMHQA-ICL framework achieved state-of-the-art few-shot performance on this dataset, demonstrating the value of type-specific ICL and strong retrieval and captioning components (Liu et al., 2023).
- The VL-ICL Bench (Zong et al., 19 Mar 2024) comprises a diverse set of tasks (e.g., fast binding, operator induction, multi-image reasoning, text-to-image tasks) designed to specifically challenge the ICL abilities of VLLMs. Evaluations on this benchmark reveal that even state-of-the-art models struggle with tasks requiring complex multimodal induction and robust context scaling (Zong et al., 19 Mar 2024, Xu et al., 1 Jul 2024).
- SMMILE (Rieff et al., 26 Jun 2025) is the first expert-driven multimodal ICL benchmark for medical tasks. Evaluations on SMMILE show that most current MLLMs exhibit only moderate to poor multimodal ICL ability in this domain, with average performance improvements over zero-shot around 8-9% (Rieff et al., 26 Jun 2025). The benchmark highlights sensitivity to irrelevant examples and strong recency bias (Rieff et al., 26 Jun 2025).
- Custom benchmarks are used to evaluate specific methods, showing substantial performance gains. For instance, methods focusing on efficient representation like M²IV report average accuracy gains of 3.74% over vanilla ICL on seven benchmarks (Li et al., 6 Apr 2025). KeCO achieves over 20% average improvement for image classification ICL (Chen et al., 19 Apr 2025). Methods targeting explicit ICL tuning or attention modulation like CAMA and MimIC also demonstrate significant boosts over baselines across standard VQA and captioning tasks (Li et al., 21 May 2025, Jiang et al., 11 Apr 2025).
Empirical evidence consistently shows that while random demonstration selection offers limited benefits, strategically selecting and configuring multimodal examples, or using methods that compress or learn the ICL effect, leads to improved performance across a range of tasks. However, the extent of improvement varies significantly depending on the task, the model architecture, and the specific ICL strategy employed.
Practical Considerations and Applications
Multimodal ICL holds promise for various real-world applications by enabling large models to adapt to specific user needs or data distributions without costly fine-tuning:
- Hybrid Question Answering: Systems processing information from diverse sources like text, tables, and images can use ICL to adapt to user query patterns or specific document structures (Liu et al., 2023).
- Resource-Constrained Environments: Efficient methods like AIM (Gao et al., 11 Jun 2024), MTVs (Huang et al., 21 Jun 2024), M²IV (Li et al., 6 Apr 2025), and KeCO (Chen et al., 19 Apr 2025) are crucial for deploying ICL-capable models on devices or cloud infrastructure with limited memory or computational resources, by reducing the token burden of demonstrations. KeCO's online capability makes it suitable for streaming data scenarios (Chen et al., 19 Apr 2025).
- Few-Shot Adaptation in Specialized Domains: ICL is highly relevant for domains where labeled data is scarce, such as medical image analysis (Rieff et al., 26 Jun 2025). However, current models show limitations in this area, indicating a need for domain-specific improvements and robust handling of context noise. Anchored-by-Text ICL offers a strategy for safety-constrained tasks like hateful meme detection by using benign examples as anchors (Miyanishi et al., 23 Aug 2024).
- Interactive AI Systems: Agents that need to understand user instructions and generalize from examples involving images and text can leverage multimodal ICL for flexible interaction (Sheng et al., 2023).
- Unified Vision Tasks: Frameworks enabling multimodal output allow a single model to perform diverse visual understanding tasks (e.g., segmentation, captioning) driven purely by multimodal prompts (Sheng et al., 2023).
- Model Steering and Control: Learnable in-context vectors (M²IV) and vector libraries (VLibrary) offer potential for fine-grained control over LVLM behavior, including cross-modal alignment, output customization, and even safety enforcement or circumvention (Li et al., 6 Apr 2025).
While practical applications are emerging, the sensitivity to context, limited visual utilization in demonstrations, and biases remain critical challenges that require careful consideration and mitigation before widespread deployment.
Future Directions
Research in multimodal ICL is actively pursuing several avenues to overcome current limitations:
- Improving Modality Interaction: Developing model architectures and training strategies that enable MLLMs to genuinely utilize the visual information within demonstrations for reasoning across modalities, rather than primarily relying on text (Chen et al., 2023, Xu et al., 1 Jul 2024).
- Robust Context Scaling: Addressing the context window bottleneck and improving model performance and stability with larger numbers of demonstrations and multiple interleaved images (Doveh et al., 19 Mar 2024, Gao et al., 11 Jun 2024, Huang et al., 21 Jun 2024, Li et al., 6 Apr 2025).
- Advanced Demonstration Management: Researching more sophisticated and model-aware strategies for demonstration selection, ordering, and sequence configuration, potentially using learned policies or explicit task mapping guidance (Xu et al., 1 Jul 2024, Qin et al., 27 Oct 2024, Chen et al., 11 Jun 2025, Li, 5 Mar 2025, Li et al., 21 May 2025).
- Learning ICL Mechanisms: Further exploring methods to learn the ICL effect directly through trainable components or representation engineering, enhancing efficiency, stability, and potentially interpretability (Huang et al., 21 Jun 2024, Li et al., 6 Apr 2025, Jiang et al., 11 Apr 2025, Chen et al., 19 Apr 2025).
- Novel Prompting and Output Formats: Investigating new ways to structure multimodal prompts and enable ICL for tasks with diverse output modalities beyond text (Sheng et al., 2023).
- Development of Challenging Benchmarks: Creating more diverse and rigorous benchmarks that test complex multimodal reasoning, induction, and generalization across various domains, driving progress beyond current limitations observed in benchmarks like VL-ICL Bench and SMMILE (Zong et al., 19 Mar 2024, Rieff et al., 26 Jun 2025).
- Understanding Inductive Biases: Researching how multimodal ICL interacts with model inductive biases learned during pretraining and how to leverage or mitigate these interactions for better adaptation (Xu et al., 1 Jul 2024).
- Combining Methods: Exploring the synergy of different ICL enhancement techniques, such as combining efficient representation methods with advanced selection strategies or attention calibration.