- The paper introduces BMIP, a novel prompt learning method for Vision-Language Models utilizing bi-directional modality interaction to enhance consistency and generalization.
- A new open-world generalization evaluation paradigm is proposed to provide a more realistic assessment of model generalization capabilities beyond traditional cross-dataset tasks.
- Experimental results demonstrate BMIP achieves state-of-the-art performance, showing significant gains on datasets with imbalanced vision and text information like EuroSAT and Flowers102.
The paper introduces the Bi-directional Modality Interaction Prompt (BMIP) method, a novel prompt learning approach for Vision-Language Models (VLMs) designed to improve inter-modal consistency and overall performance across various generalization tasks. The key innovation of BMIP lies in its bi-directional modality interaction mechanism, which dynamically weights information from both the visual and textual modalities through an aggregation function that leverages the outputs of the attention layers. The paper also introduces a new evaluation paradigm called open-world generalization, which complements existing cross-dataset transfer and domain generalization tasks by providing a more realistic assessment of a model's generalization capabilities.
The paper highlights the limitations of existing prompt learning methods, which primarily focus on single-modal prompts or uni-directional modality interaction. Single-modal approaches often struggle with datasets exhibiting high intra-class visual variances or small inter-class textual variances. While some methods attempt to transfer prompts from language to vision, they fail to fully exploit the bi-directional interaction between the two modalities, leading to sub-optimal alignment. BMIP addresses these limitations by introducing three key components:
- Deep Language Prompt Learning: Introduces layered prompts $\{P_i \in \mathbb{R}^{1 \times b \times d_l}\}_{i=0}^{J}$ to expand the scope of prompt information. The input to the initial layer takes the form $[P_0, W_0]$, where $W_0 \in \mathbb{R}^{N \times x \times d_l}$ is the word embedding of the text $T$; here $J$, $b$, and $d_l$ denote the depth, length, and dimension of the language prompts, $x$ is the number of words, and $N$ is the total number of image categories. (A minimal code sketch of this branch follows the equations below.)
In the first $J$ layers of the text encoder $g$, the inputs and outputs at the $i$-th layer are represented as:
$[\_, W_i] = g_i([P_{i-1}, W_{i-1}])$ for $i = 1, 2, \ldots, J$
Beyond the $J$-th layer, the prompts output by the preceding layer serve as the input to the next layer. The class feature $z$ is obtained by projecting the class representation into the common embedding space via the text projection head $\mathrm{TextProj}$:
- $[P_j, W_j] = g_j([P_{j-1}, W_{j-1}])$ for $j = J+1, \ldots, K$
- $z = \mathrm{TextProj}([c_1, c_2, \ldots, c_N])$
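A minimal PyTorch-style sketch of the deep language prompt branch is given below. It assumes each element of `text_layers` is a standard transformer block operating on `(batch, sequence, dim)` tensors and that the final token's output serves as the class representation; the module names and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DeepLanguagePrompts(nn.Module):
    """Sketch of deep language prompt learning (illustrative, not the paper's code)."""
    def __init__(self, prompt_depth_J, prompt_len_b, dim_dl):
        super().__init__()
        self.J = prompt_depth_J
        self.b = prompt_len_b
        # One learnable prompt P_i of shape (b, d_l) per prompted layer.
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(prompt_len_b, dim_dl)) for _ in range(prompt_depth_J)]
        )

    def forward(self, word_emb, text_layers, text_proj):
        # word_emb: (N, x, d_l) word embeddings W_0 for N class names of x tokens each.
        N = word_emb.shape[0]
        P = self.prompts[0].unsqueeze(0).expand(N, -1, -1)   # broadcast P_0 per class
        h = torch.cat([P, word_emb], dim=1)                  # [P_0, W_0]
        for i, layer in enumerate(text_layers):
            h = layer(h)                                     # g_i(...)
            if i + 1 < self.J:
                # First J layers: discard the output prompt tokens and inject a fresh P_i.
                W = h[:, self.b:, :]
                P = self.prompts[i + 1].unsqueeze(0).expand(N, -1, -1)
                h = torch.cat([P, W], dim=1)
            # Beyond layer J, the prompts simply flow through as ordinary tokens.
        # Assumption: use the final token's representation as the class representation c_n.
        z = text_proj(h[:, -1, :])                           # (N, d_common)
        return z
```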
- Deep Vision Prompt Learning: Incorporates vision prompt vectors $\{\tilde{P}_i \in \mathbb{R}^{1 \times b \times d_v}\}_{i=0}^{J}$ to extract representative visual features, with depth and length matching the language branch but with a different vision prompt dimension $d_v$. In the first $J$ layers of the image encoder $f$, a learnable vision prompt replaces the prompt output of the previous layer. (A minimal code sketch of this branch follows the equations below.)
The inputs and outputs for the first J layers are represented as:
$[\mathrm{CLS}_i, E_i, \_] = f_i([\mathrm{CLS}_{i-1}, E_{i-1}, \tilde{P}_{i-1}])$ for $i = 1, 2, \ldots, J$
After the $J$-th layer, each layer takes the output of its predecessor directly as input. Once the final class token $\mathrm{CLS}_K$ is obtained, the image projection head $\mathrm{ImageProj}$ maps it to the final image feature $x$ in the common embedding space.
- $[\mathrm{CLS}_j, E_j, \tilde{P}_j] = f_j([\mathrm{CLS}_{j-1}, E_{j-1}, \tilde{P}_{j-1}])$ for $j = J+1, \ldots, K$
- $x = \mathrm{ImageProj}(\mathrm{CLS}_K)$
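The vision branch mirrors the language branch, with the prompts appended after the class and patch tokens. The sketch below makes the same illustrative assumptions about layer interfaces and tensor shapes as the language sketch above.

```python
import torch
import torch.nn as nn

class DeepVisionPrompts(nn.Module):
    """Sketch of deep vision prompt learning (illustrative, not the paper's code)."""
    def __init__(self, prompt_depth_J, prompt_len_b, dim_dv):
        super().__init__()
        self.J = prompt_depth_J
        self.b = prompt_len_b
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(prompt_len_b, dim_dv)) for _ in range(prompt_depth_J)]
        )

    def forward(self, cls_tok, patch_emb, image_layers, image_proj):
        # cls_tok: (B, 1, d_v) class token CLS_0; patch_emb: (B, M, d_v) patch embeddings E_0.
        B = cls_tok.shape[0]
        P = self.prompts[0].unsqueeze(0).expand(B, -1, -1)
        h = torch.cat([cls_tok, patch_emb, P], dim=1)        # [CLS_0, E_0, P~_0]
        for i, layer in enumerate(image_layers):
            h = layer(h)                                     # f_i(...)
            if i + 1 < self.J:
                # First J layers: replace the output prompt tokens with the next learnable prompt.
                P = self.prompts[i + 1].unsqueeze(0).expand(B, -1, -1)
                h = torch.cat([h[:, :-self.b, :], P], dim=1)
        x = image_proj(h[:, 0, :])                           # project the final class token CLS_K
        return x
```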
- Vision-Language Modality Interaction: Achieves effective information aggregation through a language projection head $F_l$, a vision projection head $F_v$, and a learnable aggregation function that uses the output weights of the attention layer (a minimal code sketch follows this list).
- The vision and language projection heads map prompts across modalities, producing the transformed vision and language information $F_v(P_i)$ and $F_l(\tilde{P}_i)$.
- The vision and language attention weights ($A_v$, $A_l$) are extracted from the output of the current attention layer; they represent the degree of attention that the other inputs give to the current prompt.
- Modality-specific $1 \times 1$ linear layers, $L_l$ and $L_v$, learn the mapping from attention weights to substitution weights ($w_l$, $w_v$), generating a dynamic weight for each prompt.
- $w_v = L_v(A_v)$, $\tilde{P}_i' = w_v \ast \tilde{P}_i + (1 - w_v) \ast F_v(P_i)$
  - $w_v$: vision substitution weight
  - $L_v$: vision linear layer
  - $A_v$: output of the current attention layer
  - $\tilde{P}_i$: vision prompts
  - $F_v$: vision projection head
- $w_l = L_l(A_l)$, $P_i' = w_l \ast P_i + (1 - w_l) \ast F_l(\tilde{P}_i)$
  - $w_l$: language substitution weight
  - $L_l$: language linear layer
  - $A_l$: output of the current attention layer
  - $P_i$: language prompts
  - $F_l$: language projection head
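A minimal sketch of the aggregation step is shown below. It assumes the per-prompt attention weights $A_v$ and $A_l$ have already been pooled into one scalar per prompt, and it bounds the substitution weights with a sigmoid; both choices are illustrative assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class BiModalAggregation(nn.Module):
    """Sketch of the bi-directional prompt aggregation (illustrative assumptions)."""
    def __init__(self, dim_dl, dim_dv):
        super().__init__()
        self.F_v = nn.Linear(dim_dl, dim_dv)   # vision projection head: language prompt -> vision space
        self.F_l = nn.Linear(dim_dv, dim_dl)   # language projection head: vision prompt -> language space
        self.L_v = nn.Linear(1, 1)             # 1x1 layer: attention weight -> substitution weight
        self.L_l = nn.Linear(1, 1)

    def forward(self, P_lang, P_vis, A_l, A_v):
        # P_lang: (b, d_l) language prompts; P_vis: (b, d_v) vision prompts.
        # A_l, A_v: (b, 1) attention paid by other tokens to each prompt (assumed pre-pooled).
        w_l = torch.sigmoid(self.L_l(A_l))     # dynamic per-prompt weights in (0, 1) -- an assumption
        w_v = torch.sigmoid(self.L_v(A_v))
        P_vis_new = w_v * P_vis + (1 - w_v) * self.F_v(P_lang)    # updated vision prompt P~'_i
        P_lang_new = w_l * P_lang + (1 - w_l) * self.F_l(P_vis)   # updated language prompt P'_i
        return P_lang_new, P_vis_new
```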
The open-world generalization evaluation paradigm is introduced to address a limitation of the base-to-new class generalization task, which evaluates base and new classes separately. The open-world paradigm does not reveal in advance whether a test sample belongs to the base or the new classes, and therefore provides a more realistic evaluation of a model's ability to generalize to unknown distributions.
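As a rough illustration of the protocol, the sketch below scores test images against the union of base and new class names in a single label space; `encode_class_names` and `encode_images` are hypothetical helpers standing in for the VLM's text and image encoders, not an API defined by the paper.

```python
import torch

@torch.no_grad()
def open_world_eval(model, test_loader, base_classes, new_classes):
    all_classes = base_classes + new_classes                 # one unified label space
    text_features = model.encode_class_names(all_classes)    # hypothetical helper
    correct = total = 0
    for images, labels in test_loader:                       # labels indexed into all_classes
        image_features = model.encode_images(images)         # hypothetical helper
        logits = image_features @ text_features.t()          # similarity scores over all classes
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```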
The experimental results demonstrate that BMIP achieves state-of-the-art (SOTA) performance across various tasks and datasets. Specifically, BMIP exhibits significant performance improvements on datasets with imbalanced text and image information, such as EuroSAT and Flowers102, which aligns with the motivation of addressing the shortcomings of single-modal prompt learning methods. Furthermore, BMIP's modular design allows it to be combined with other prompt-based methods, such as PromptSRC and CoPrompt, for consistent performance enhancement. Ablation studies validate the effectiveness of the proposed aggregation function and demonstrate that BMIP's performance gains are not solely attributed to an increase in the number of parameters.