
Mimic In-Context Learning for Multimodal Tasks (2504.08851v2)

Published 11 Apr 2025 in cs.LG and cs.AI

Abstract: Recently, In-context Learning (ICL) has become a significant inference paradigm in Large Multimodal Models (LMMs), utilizing a few in-context demonstrations (ICDs) to prompt LMMs for new tasks. However, the synergistic effects in multimodal data increase the sensitivity of ICL performance to the configurations of ICDs, stimulating the need for a more stable and general mapping function. Mathematically, in Transformer-based models, ICDs act as "shift vectors" added to the hidden states of query tokens. Inspired by this, we introduce Mimic In-Context Learning (MimIC) to learn stable and generalizable shift effects from ICDs. Specifically, compared with some previous shift vector-based methods, MimIC more strictly approximates the shift effects by integrating lightweight learnable modules into LMMs with four key enhancements: 1) inserting shift vectors after attention layers, 2) assigning a shift vector to each attention head, 3) making shift magnitude query-dependent, and 4) employing a layer-wise alignment loss. Extensive experiments on two LMMs (Idefics-9b and Idefics2-8b-base) across three multimodal tasks (VQAv2, OK-VQA, Captioning) demonstrate that MimIC outperforms existing shift vector-based methods. The code is available at https://github.com/Kamichanw/MimIC.

This paper introduces "Mimic In-Context Learning" (MimIC), a novel method to enhance the In-Context Learning (ICL) capabilities of Large Multimodal Models (LMMs) by learning stable and generalizable "shift effects" from In-Context Demonstrations (ICDs). The authors observe that ICL performance in LMMs is highly sensitive to ICD configurations due to the synergistic effects of multimodal data. Traditional ICL relies on ICDs during inference, which can be computationally expensive and sensitive to the choice and order of these demonstrations. Previous "shift vector" based methods, which aim to learn a general mapping function from ICDs, have limitations in their approximation of how ICDs influence model behavior.

The core idea of MimIC is to more rigorously approximate the mathematical effect of ICDs, which is to add "shift vectors" to the hidden representations of query tokens within Transformer-based models. MimIC introduces lightweight learnable modules into LMMs with four key enhancements (a minimal code sketch follows the list):

  1. Inserting shift vectors after attention layers: Unlike previous methods that place them after Feed-Forward Network (FFN) layers, MimIC aligns with the mathematical derivation showing that the shift effect should occur post-attention.
  2. Assigning a shift vector to each attention head: This allows each head to learn a unique shift, capturing distinct representation shifts for different aspects of the input.
  3. Making shift magnitude query-dependent: The scaling factor of the shift vector is dynamically adjusted based on the current query, allowing for more nuanced adaptations.
  4. Employing a layer-wise alignment loss: This loss function ensures that the hidden states of the MimIC-enhanced LMM (processing only the query) closely match the hidden states of the original LMM when performing standard ICL (processing query and ICDs).
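A minimal PyTorch sketch of how the first three enhancements could be realized; the module name, tensor shapes, and the per-head linear parameterization are assumptions for illustration, not the authors' implementation:

import torch
import torch.nn as nn

class MimICMultiHeadShift(nn.Module):
    # Illustrative module (names and shapes are assumptions, not the authors' code):
    # per-head learnable shift vectors (enhancement 2) with query-dependent
    # magnitudes (enhancement 3), applied after the attention layer (enhancement 1).
    def __init__(self, num_heads: int, head_dim: int):
        super().__init__()
        self.shift_vectors = nn.Parameter(torch.zeros(num_heads, head_dim))
        # One scalar "log Z1" predictor per head, conditioned on that head's query
        self.log_z1 = nn.ModuleList([nn.Linear(head_dim, 1) for _ in range(num_heads)])

    def forward(self, attn_out, queries, scores):
        # attn_out, queries: [batch, heads, seq, head_dim]
        # scores: pre-softmax attention logits, [batch, heads, seq, seq]
        log_z2 = torch.logsumexp(scores, dim=-1, keepdim=True)         # [b, h, s, 1]
        log_z1 = torch.stack(
            [proj(queries[:, h]) for h, proj in enumerate(self.log_z1)], dim=1
        )                                                               # [b, h, s, 1]
        mu = torch.sigmoid(log_z1 - log_z2)                             # Z1 / (Z1 + Z2)
        return attn_out + mu * self.shift_vectors[None, :, None, :]     # per-head shift

Each head thus learns its own shift direction, and the sigmoid of the predicted $\log Z_1$ against the query's own $\log Z_2$ yields the query-dependent magnitude $Z_1 / (Z_1 + Z_2)$.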

The training process involves two parallel LMMs:

  • The original LMM processes a query along with $k$ ICDs to generate target hidden states at each layer.
  • The MimIC LMM processes only the query, using its learnable MimIC attention heads to produce shifted hidden states.

The total loss function combines this layer-wise alignment loss ($\mathcal{L}_\text{align}$) with a standard language modeling loss ($\mathcal{L}_\text{gt}$) to maintain task performance: $\mathcal{L} = \mathcal{L}_\text{align} + \lambda \mathcal{L}_\text{gt}$. By training with randomly selected ICDs, MimIC learns a general shift pattern. During inference, MimIC no longer requires ICDs, leading to significant speed improvements.
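A sketch of one training step combining the two losses, assuming hypothetical wrappers `mimic_lmm` (query only, returns per-layer hidden states and logits) and `frozen_lmm` (ICDs plus query, returns target hidden states); the use of mean squared error as the layer-wise distance is likewise an assumption for illustration:

import torch
import torch.nn.functional as F

def mimic_training_step(mimic_lmm, frozen_lmm, query_inputs, icd_inputs, labels, lam=1.0):
    # Target hidden states: the frozen LMM runs on [ICDs; query] (standard ICL)
    with torch.no_grad():
        target_hiddens, _ = frozen_lmm(icd_inputs, query_inputs)   # list of [b, seq_q, d]

    # The MimIC LMM runs on the query alone, producing shifted hidden states
    shifted_hiddens, logits = mimic_lmm(query_inputs)              # list of [b, seq_q, d]

    # Layer-wise alignment loss: match the query-token hidden states at every layer
    l_align = sum(
        F.mse_loss(h_s, h_t) for h_s, h_t in zip(shifted_hiddens, target_hiddens)
    ) / len(shifted_hiddens)

    # Standard language-modeling loss on the ground-truth answer tokens
    l_gt = F.cross_entropy(logits.flatten(0, 1), labels.flatten(), ignore_index=-100)

    # Total loss: L = L_align + lambda * L_gt
    return l_align + lam * l_gt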

Implementation Details:

  • The shift effect is approximated by a learnable vector $\bm{v}$ in each attention head.
  • The query-dependent magnitude $\tilde{\mu}(\bm{q}, \bm{K})$ is determined by a trainable linear layer $f(\cdot)$ that approximates $\log Z_1$, where $Z_1$ represents the sum of exponentiated attention scores over the ICDs (the decomposition behind this form is sketched below).
  • The MimIC attention head output is $\operatorname{SA}(\bm{q}, \bm{K}, \bm{V}) + \tilde{\mu}(\bm{q}, \bm{K})\,\bm{v}$.
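For context, the decomposition behind this head output can be written as follows, where $\bm{K}_D$ and $\bm{V}_D$ (notation introduced here for illustration) denote the keys and values contributed by the ICD tokens:

$$
\begin{aligned}
\operatorname{Attn}(\bm{q}, [\bm{K}_D; \bm{K}], [\bm{V}_D; \bm{V}])
&= \frac{Z_1}{Z_1 + Z_2}\,\operatorname{Attn}(\bm{q}, \bm{K}_D, \bm{V}_D) + \frac{Z_2}{Z_1 + Z_2}\,\operatorname{SA}(\bm{q}, \bm{K}, \bm{V}) \\
&= \operatorname{SA}(\bm{q}, \bm{K}, \bm{V}) + \mu(\bm{q})\,\bigl(\operatorname{Attn}(\bm{q}, \bm{K}_D, \bm{V}_D) - \operatorname{SA}(\bm{q}, \bm{K}, \bm{V})\bigr),
\end{aligned}
$$

where $\mu(\bm{q}) = Z_1 / (Z_1 + Z_2)$ and $Z_2$ is the sum of exponentiated attention scores over the query's own context. MimIC replaces the bracketed difference with the learnable per-head vector $\bm{v}$ and approximates $\mu$ through $f(\cdot)$. A runnable form of a single MimIC head follows.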

import torch

def mimic_attention(query_q, keys_K, values_V, learnable_v, linear_f):
    # Standard scaled dot-product attention over the query's own context
    # (independent of ICDs)
    d_k = keys_K.shape[-1]
    scores = query_q @ keys_K.transpose(-2, -1) / d_k ** 0.5
    standard_attn_output = torch.softmax(scores, dim=-1) @ values_V

    # Approximate log Z1 (log of the summed exponentiated attention scores
    # over the ICDs) with a trainable linear layer applied to the query
    log_Z1_approx = linear_f(query_q)                        # [..., seq, 1]

    # log Z2: log of the summed exponentiated scores over the query's own context
    log_Z2 = torch.logsumexp(scores, dim=-1, keepdim=True)

    # Query-dependent magnitude mu = Z1 / (Z1 + Z2), computed in log space
    # via sigmoid(log Z1 - log Z2) for numerical stability
    mu_approx = torch.sigmoid(log_Z1_approx - log_Z2)

    # Shift vector applied after attention
    return standard_attn_output + mu_approx * learnable_v
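With dummy tensors and illustrative shapes (not values from the paper), the function above can be exercised as:

# Example usage; shapes are illustrative only
q = torch.randn(1, 16, 64)            # [batch, seq, head_dim]
K = torch.randn(1, 16, 64)
V = torch.randn(1, 16, 64)
v = torch.zeros(64)                   # learnable shift vector, zero-initialized
f = torch.nn.Linear(64, 1)            # predicts log Z1 from the query
out = mimic_attention(q, K, V, v, f)  # -> [1, 16, 64]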

Experiments and Results:

MimIC was evaluated on Idefics-9b and Idefics2-8b-base models across VQAv2, OK-VQA, and COCO Captioning tasks.

  • Performance: MimIC consistently outperformed standard ICL (e.g., 32-shot ICL) and previous shift vector-based methods (like Function Vector, Task Vector, and LIVE) as well as LoRA. For instance, on Idefics-9b, MimIC achieved a 3.46% accuracy improvement on VQAv2, 3.57% on OK-VQA, and 9.00 CIDEr on COCO Captioning compared to 32-shot ICL.
  • Data Efficiency: MimIC matched 32-shot ICL performance with guidance from only 1-shot ICL during its training. It also required significantly fewer training samples than methods like LIVE to achieve strong performance (e.g., surpassing LIVE's best with 1/8th the data).
  • Ablation Studies: Confirmed the importance of each of MimIC's four enhancements (multi-head shift vectors, query-dependent magnitude, placement after attention, and layer-wise alignment loss). Using a multi-head, query-dependent magnitude was shown to be crucial.
  • Alignment: MimIC showed closer alignment (smaller L2 distance) to traditional ICL in latent space than other methods, including a variant of MimIC trained with a KL-divergence objective (MimIC$^\dagger$) and LoRA.
  • Hallucinations: MimIC generated fewer hallucinations in image captioning than other non-zero-shot methods while maintaining high recall. While hallucinations increased slightly with more simulated "shots" during training, they remained below those of standard ICL.

Key Contributions:

  1. Provides a more rigorous mathematical approximation of ICL's shift effects in LMMs, highlighting flaws in previous methods.
  2. Proposes a feasible method (MimIC) to achieve this approximation with few learnable parameters integrated into attention heads.
  3. Demonstrates consistent improvements over ICL, prior shift vector methods, and LoRA across multiple tasks and LMM architectures, with better data efficiency and reduced inference latency.

The paper concludes that MimIC effectively learns the ICL shift effect, offering competitive few-shot performance with reduced latency, fewer training samples than comparable methods, and fewer parameters than LoRA while often achieving better results and reducing hallucinations. The code is available at https://github.com/Kamichanw/MimIC.

Authors (7)
  1. Yuchu Jiang (3 papers)
  2. Jiale Fu (4 papers)
  3. Chenduo Hao (4 papers)
  4. Xinting Hu (16 papers)
  5. Yingzhe Peng (7 papers)
  6. Xin Geng (90 papers)
  7. Xu Yang (222 papers)