BMIP: Bi-directional Modality Interaction Prompt Learning for VLM (2501.07769v1)

Published 14 Jan 2025 in cs.LG and cs.CV

Abstract: Vision-language models (VLMs) have exhibited remarkable generalization capabilities, and prompt learning for VLMs has attracted great attention for the ability to adapt pre-trained VLMs to specific downstream tasks. However, existing studies mainly focus on single-modal prompts or uni-directional modality interaction, overlooking the powerful alignment effects resulting from the interaction between the vision and language modalities. To this end, we propose a novel prompt learning method called $\underline{\textbf{B}}$i-directional $\underline{\textbf{M}}$odality $\underline{\textbf{I}}$nteraction $\underline{\textbf{P}}$rompt (BMIP), which dynamically weights bi-modal information through learning the information of the attention layer, enhancing trainability and inter-modal consistency compared to simple information aggregation methods. To evaluate the effectiveness of prompt learning methods, we propose a more realistic evaluation paradigm called open-world generalization complementing the widely adopted cross-dataset transfer and domain generalization tasks. Comprehensive experiments on various datasets reveal that BMIP not only outperforms current state-of-the-art methods across all three evaluation paradigms but is also flexible enough to be combined with other prompt-based methods for consistent performance enhancement.

Summary

  • The paper introduces BMIP, a novel prompt learning method for Vision-Language Models utilizing bi-directional modality interaction to enhance consistency and generalization.
  • A new open-world generalization evaluation paradigm is proposed to provide a more realistic assessment of model generalization capabilities beyond traditional cross-dataset tasks.
  • Experimental results demonstrate BMIP achieves state-of-the-art performance, showing significant gains on datasets with imbalanced vision and text information like EuroSAT and Flowers102.

The paper introduces the Bi-directional Modality Interaction Prompt (BMIP) method, a novel prompt learning approach for vision-language models (VLMs) designed to improve inter-modal consistency and overall performance across various generalization tasks. The key innovation of BMIP lies in its bi-directional modality interaction mechanism, which dynamically weights information from both the visual and textual modalities through an aggregation function that leverages the attention layer outputs. The paper also introduces a new evaluation paradigm called open-world generalization, which complements existing cross-dataset transfer and domain generalization tasks by providing a more realistic assessment of a model's generalization capabilities.

The paper highlights the limitations of existing prompt learning methods, which primarily focus on single-modal prompts or uni-directional modality interaction. Single-modal approaches often struggle with datasets exhibiting high intra-class visual variances or small inter-class textual variances. While some methods attempt to transfer prompts from language to vision, they fail to fully exploit the bi-directional interaction between the two modalities, leading to sub-optimal alignment. BMIP addresses these limitations by introducing three key components:

  • Deep Language Prompt Learning: Introduces layered prompts $\{P_i \in \mathbb{R}^{1 \times b \times d_l}\}_{i=0}^{J}$ to expand the scope of prompt information. The input to the initial layer takes the form $[P_0, W_0]$, where $W_0 \in \mathbb{R}^{N \times x \times d_l}$ is the word embedding of the text $T$; $J$, $b$, and $d_l$ denote the depth, length, and dimension of the language prompts, respectively, $x$ is the number of words, and $N$ is the total number of image categories (a layer-wise sketch of both prompt branches follows this list).
    • In the first $J$ layers of the text encoder $g$, the inputs and outputs at the $i^{th}$ layer are represented as:

      $[\,\_\,, W_i] = g_i([P_{i-1}, W_{i-1}])$ for $i = 1, 2, \dots, J$

    • Beyond the $J^{th}$ layer, the prompts from the output of the preceding layer serve as the input to the next layer. The class feature $z$ is obtained by projecting the class representation to a common embedding space via the text projection head $\mathrm{TextProj}$.

      • $[P_j, W_j] = g_j([P_{j-1}, W_{j-1}])$ for $j = J+1, \dots, K$
      • $z = \mathrm{TextProj}([c_1, c_2, \dots, c_N])$
  • Deep Vision Prompt Learning: Incorporates vision prompt vectors $\{\tilde{P}_i \in \mathbb{R}^{1 \times b \times d_v}\}_{i=0}^{J}$ to extract representative visual features, with a depth and length matching the language branch but a different vision prompt dimension $d_v$. In the first $J$ layers of the image encoder $f$, a learnable vision prompt replaces the output of the previous layer.
    • The inputs and outputs for the first $J$ layers are represented as:

      $[CLS_i, E_i, \_\,] = f_i([CLS_{i-1}, E_{i-1}, \tilde{P}_{i-1}])$ for $i = 1, 2, \dots, J$

    • After the $J^{th}$ layer, the ensuing layer's input is the immediate output of its predecessor. Upon obtaining the final class token $CLS_K$, the image projection head $\mathrm{ImageProj}$ is employed to map the final image feature $x$ to the common embedding space.

      • $[CLS_j, E_j, \tilde{P}_j] = f_j([CLS_{j-1}, E_{j-1}, \tilde{P}_{j-1}])$ for $j = J+1, \dots, K$
      • $x = \mathrm{ImageProj}(CLS_K)$
  • Vision Language Modality Interaction: Achieves effective information aggregation through a language projection head $F_l$, a vision projection head $F_v$, and a learnable aggregation function that uses the output weight of the attention layer (see the second sketch after this list).
    • The vision and language projection heads produce transformed vision and language information, represented as $\{F_v(P_i), F_l(\tilde{P}_i)\}$.
    • The vision and language attention outputs ($A_v$, $A_l$) are extracted from the current attention layer, representing the degree of attention given by other inputs to the current prompt.
    • Modality-specific $1 \times 1$ linear layers, $L_l$ and $L_v$, learn the relationship between attention weights and substitution weights ($w_l$, $w_v$), generating dynamic weights for each prompt.
    • $w_v = L_v(A_v)$, $\tilde{P}_i' = w_v \cdot \tilde{P}_i + (1 - w_v) \cdot F_v(P_i)$, where $w_v$ is the vision attention weight, $L_v$ the vision linear layer, $A_v$ the output of the current attention layer, $\tilde{P}_i$ the vision prompts, and $F_v$ the vision projection head.
    • $w_l = L_l(A_l)$, $P_i' = w_l \cdot P_i + (1 - w_l) \cdot F_l(\tilde{P}_i)$, where $w_l$ is the language attention weight, $L_l$ the language linear layer, $A_l$ the output of the current attention layer, $P_i$ the language prompts, and $F_l$ the language projection head.
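The layer-wise prompt substitution described in the first two components can be sketched as follows. This is a minimal PyTorch-style illustration, not the authors' implementation: the block interface, tensor shapes, and the helper name `encode_with_deep_prompts` are assumptions made for clarity.

```python
import torch

def encode_with_deep_prompts(layers, tokens, prompts, J, prompt_first=True):
    """Run a stack of transformer blocks with deep prompt substitution (sketch).

    layers       : list of K transformer blocks (g_1..g_K or f_1..f_K)
    tokens       : non-prompt token embeddings, shape (batch, n_tokens, d)
    prompts      : list of learnable prompts P_0..P_J, each of shape (1, b, d)
    J            : prompt depth; the prompt positions output by each of the
                   first J layers are replaced by the next learnable prompt
    prompt_first : True for the text branch ([P, W]); False for the vision
                   branch ([CLS, E, P~]), where the prompt comes last
    """
    batch = tokens.shape[0]
    b = prompts[0].shape[1]

    def attach(p, rest):
        p = p.expand(batch, -1, -1)
        return torch.cat([p, rest], dim=1) if prompt_first else torch.cat([rest, p], dim=1)

    def strip_prompt(x):
        return x[:, b:] if prompt_first else x[:, :-b]

    x = attach(prompts[0], tokens)   # layer-1 input: [P_0, W_0] or [CLS_0, E_0, P~_0]
    for i, layer in enumerate(layers, start=1):
        x = layer(x)                 # assumed block interface: (batch, seq, d) -> (batch, seq, d)
        if i <= J:
            # Discard this layer's prompt outputs and substitute the next
            # learnable prompt, as in [_, W_i] = g_i([P_{i-1}, W_{i-1}]).
            x = attach(prompts[i], strip_prompt(x))
    return x
```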

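The bi-directional aggregation itself can be sketched as a small module. Only the weighted-combination equations above come from the paper summary; the use of a sigmoid to keep the substitution weights in $[0, 1]$ and the pooling of per-prompt attention scores into $A_l$ and $A_v$ are assumptions for illustration.

```python
import torch
import torch.nn as nn

class BiModalInteraction(nn.Module):
    """Sketch of BMIP-style bi-directional modality interaction for one layer.

    Assumed shapes: language prompts P  (batch, b, d_l)
                    vision prompts  P~  (batch, b, d_v)
                    A_l, A_v            (batch, b), per-prompt attention scores
                    pooled from the current attention layer.
    """
    def __init__(self, d_l: int, d_v: int):
        super().__init__()
        self.F_l = nn.Linear(d_v, d_l)  # projects vision prompts into the language space
        self.F_v = nn.Linear(d_l, d_v)  # projects language prompts into the vision space
        self.L_l = nn.Linear(1, 1)      # maps a language attention score to a substitution weight
        self.L_v = nn.Linear(1, 1)      # maps a vision attention score to a substitution weight

    def forward(self, P, P_tilde, A_l, A_v):
        # w = L(A); the sigmoid is an assumption that keeps each weight in [0, 1]
        w_l = torch.sigmoid(self.L_l(A_l.unsqueeze(-1)))   # (batch, b, 1)
        w_v = torch.sigmoid(self.L_v(A_v.unsqueeze(-1)))   # (batch, b, 1)
        # P'  = w_l * P  + (1 - w_l) * F_l(P~)
        # P~' = w_v * P~ + (1 - w_v) * F_v(P)
        P_new = w_l * P + (1 - w_l) * self.F_l(P_tilde)
        P_tilde_new = w_v * P_tilde + (1 - w_v) * self.F_v(P)
        return P_new, P_tilde_new

# Example with hypothetical dimensions
inter = BiModalInteraction(d_l=512, d_v=768)
P, Pt = torch.randn(4, 2, 512), torch.randn(4, 2, 768)
A_l, A_v = torch.rand(4, 2), torch.rand(4, 2)
P_new, Pt_new = inter(P, Pt, A_l, A_v)
```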
The open-world generalization evaluation paradigm is introduced to address the limitations of the base-to-new class generalization task, which evaluates base and new classes separately. The open-world generalization paradigm does not pre-determine whether the data belongs to base or new classes, thus providing a more realistic evaluation of a model's ability to generalize to unknown distributions.
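A minimal sketch of what such an open-world evaluation might look like for a CLIP-style model: the classifier is built over the union of base and new class names, and every test image is scored against that unified label space without revealing which split it came from. The helper names (`encode_class_names`, `encode_images`) are hypothetical, and features are assumed to be L2-normalized so that dot products act as similarity scores.

```python
import torch

@torch.no_grad()
def open_world_accuracy(model, test_loader, base_classes, new_classes):
    """Evaluate over the union of base and new classes, without telling the
    model which split each test image belongs to (open-world generalization)."""
    all_classes = base_classes + new_classes                  # single, unified label space
    text_features = model.encode_class_names(all_classes)     # assumed helper, (C, d), L2-normalized
    correct, total = 0, 0
    for images, labels in test_loader:                        # labels index into all_classes
        image_features = model.encode_images(images)          # assumed helper, (B, d), L2-normalized
        logits = image_features @ text_features.t()           # similarity scores, (B, C)
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```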

The experimental results demonstrate that BMIP achieves state-of-the-art (SOTA) performance across various tasks and datasets. Specifically, BMIP exhibits significant performance improvements on datasets with imbalanced text and image information, such as EuroSAT and Flowers102, which aligns with the motivation of addressing the shortcomings of single-modal prompt learning methods. Furthermore, BMIP's modular design allows it to be combined with other prompt-based methods, such as PromptSRC and CoPrompt, for consistent performance enhancement. Ablation studies validate the effectiveness of the proposed aggregation function and demonstrate that BMIP's performance gains are not solely attributed to an increase in the number of parameters.
