- The paper introduces BMIP, a novel prompt learning method for Vision-Language Models utilizing bi-directional modality interaction to enhance consistency and generalization.
- A new open-world generalization evaluation paradigm is proposed to provide a more realistic assessment of model generalization capabilities beyond traditional cross-dataset tasks.
- Experimental results demonstrate BMIP achieves state-of-the-art performance, showing significant gains on datasets with imbalanced vision and text information like EuroSAT and Flowers102.
The paper introduces the Bi-directional Modality Interaction Prompt (BMIP) method, a novel prompt learning approach for Vision-Language Models (VLMs) designed to improve inter-modal consistency and overall performance across various generalization tasks. The key innovation of BMIP lies in its bi-directional modality interaction mechanism, which dynamically weights information from both the visual and textual modalities through an aggregation function that leverages the outputs of the attention layers. The paper also introduces a new evaluation paradigm called open-world generalization, which complements existing cross-dataset transfer and domain generalization tasks by providing a more realistic assessment of a model's generalization capabilities.
The paper highlights the limitations of existing prompt learning methods, which primarily focus on single-modal prompts or uni-directional modality interaction. Single-modal approaches often struggle with datasets exhibiting high intra-class visual variances or small inter-class textual variances. While some methods attempt to transfer prompts from language to vision, they fail to fully exploit the bi-directional interaction between the two modalities, leading to sub-optimal alignment. BMIP addresses these limitations by introducing three key components:
- Deep Language Prompt Learning: Introduces layered prompts $\{P_i \in \mathbb{R}^{1 \times b \times d_l}\}_{i=0}^{J}$ to expand the scope of prompt information. The input to the initial layer takes the form $[P_0, W_0]$, where $W_0 \in \mathbb{R}^{N \times x \times d_l}$ is the word embedding of the text $T$; here $J$, $b$, and $d_l$ denote the depth, length, and dimension of the language prompts, $x$ is the number of words, and $N$ is the total number of image categories. (A minimal code sketch of this branch follows the equations below.)
In the first $J$ layers of the text encoder $g$, the inputs and outputs at the $i$-th layer are represented as:
$[\_, W_i] = g_i([P_{i-1}, W_{i-1}])$ for $i = 1, 2, \ldots, J$
Beyond the $J$-th layer, the prompts output by the preceding layer serve as the input to the next layer. The class feature $z$ is obtained by projecting the class representation into the common embedding space via the text projection head $\mathrm{TextProj}$:
- $[P_j, W_j] = g_j([P_{j-1}, W_{j-1}])$ for $j = J+1, \ldots, K$
- $z = \mathrm{TextProj}([c_1, c_2, \ldots, c_N])$
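A minimal PyTorch-style sketch of the deep language prompt branch is given below. It assumes each element of `text_layers` is a standard transformer block operating on `(batch, sequence, dim)` tensors and that the final token's output serves as the class representation; the module names and tensor shapes are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class DeepLanguagePrompts(nn.Module):
    """Sketch of deep language prompt learning (illustrative, not the paper's code)."""
    def __init__(self, prompt_depth_J, prompt_len_b, dim_dl):
        super().__init__()
        self.J = prompt_depth_J
        self.b = prompt_len_b
        # One learnable prompt P_i of shape (b, d_l) per prompted layer.
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(prompt_len_b, dim_dl)) for _ in range(prompt_depth_J)]
        )

    def forward(self, word_emb, text_layers, text_proj):
        # word_emb: (N, x, d_l) word embeddings W_0 for N class names of x tokens each.
        N = word_emb.shape[0]
        P = self.prompts[0].unsqueeze(0).expand(N, -1, -1)   # broadcast P_0 per class
        h = torch.cat([P, word_emb], dim=1)                  # [P_0, W_0]
        for i, layer in enumerate(text_layers):
            h = layer(h)                                     # g_i(...)
            if i + 1 < self.J:
                # First J layers: discard the output prompt tokens and inject a fresh P_i.
                W = h[:, self.b:, :]
                P = self.prompts[i + 1].unsqueeze(0).expand(N, -1, -1)
                h = torch.cat([P, W], dim=1)
            # Beyond layer J, the prompts simply flow through as ordinary tokens.
        # Assumption: use the final token's representation as the class representation c_n.
        z = text_proj(h[:, -1, :])                           # (N, d_common)
        return z
```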
- Deep Vision Prompt Learning: Incorporates vision prompt vectors $\{\tilde{P}_i \in \mathbb{R}^{1 \times b \times d_v}\}_{i=0}^{J}$ to extract representative visual features, with depth and length matching the language branch but with a different vision prompt dimension $d_v$. In the first $J$ layers of the image encoder $f$, a learnable vision prompt replaces the prompt output of the previous layer. (A minimal code sketch of this branch follows the equations below.)
The inputs and outputs for the first J layers are represented as:
$[\mathrm{CLS}_i, E_i, \_] = f_i([\mathrm{CLS}_{i-1}, E_{i-1}, \tilde{P}_{i-1}])$ for $i = 1, 2, \ldots, J$
After the $J$-th layer, each layer takes the output of its predecessor directly as input. Once the final class token $\mathrm{CLS}_K$ is obtained, the image projection head $\mathrm{ImageProj}$ maps it to the final image feature $x$ in the common embedding space.
- $[\mathrm{CLS}_j, E_j, \tilde{P}_j] = f_j([\mathrm{CLS}_{j-1}, E_{j-1}, \tilde{P}_{j-1}])$ for $j = J+1, \ldots, K$
- $x = \mathrm{ImageProj}(\mathrm{CLS}_K)$
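The vision branch mirrors the language branch, with the prompts appended after the class and patch tokens. The sketch below makes the same illustrative assumptions about layer interfaces and tensor shapes as the language sketch above.

```python
import torch
import torch.nn as nn

class DeepVisionPrompts(nn.Module):
    """Sketch of deep vision prompt learning (illustrative, not the paper's code)."""
    def __init__(self, prompt_depth_J, prompt_len_b, dim_dv):
        super().__init__()
        self.J = prompt_depth_J
        self.b = prompt_len_b
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(prompt_len_b, dim_dv)) for _ in range(prompt_depth_J)]
        )

    def forward(self, cls_tok, patch_emb, image_layers, image_proj):
        # cls_tok: (B, 1, d_v) class token CLS_0; patch_emb: (B, M, d_v) patch embeddings E_0.
        B = cls_tok.shape[0]
        P = self.prompts[0].unsqueeze(0).expand(B, -1, -1)
        h = torch.cat([cls_tok, patch_emb, P], dim=1)        # [CLS_0, E_0, P~_0]
        for i, layer in enumerate(image_layers):
            h = layer(h)                                     # f_i(...)
            if i + 1 < self.J:
                # First J layers: replace the output prompt tokens with the next learnable prompt.
                P = self.prompts[i + 1].unsqueeze(0).expand(B, -1, -1)
                h = torch.cat([h[:, :-self.b, :], P], dim=1)
        x = image_proj(h[:, 0, :])                           # project the final class token CLS_K
        return x
```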
- Vision-Language Modality Interaction: Achieves effective information aggregation through a language projection head $F_l$, a vision projection head $F_v$, and a learnable aggregation function that uses the output weights of the attention layer (a minimal code sketch follows this list).
- The vision and language projection heads map prompts across modalities, producing the transformed vision and language information $F_v(P_i)$ and $F_l(\tilde{P}_i)$.
- The vision and language attention weights ($A_v$, $A_l$) are extracted from the output of the current attention layer; they represent the degree of attention that the other inputs give to the current prompt.
- Modality-specific $1 \times 1$ linear layers, $L_l$ and $L_v$, learn the mapping from attention weights to substitution weights ($w_l$, $w_v$), generating a dynamic weight for each prompt.
- $w_v = L_v(A_v)$, $\tilde{P}_i' = w_v \ast \tilde{P}_i + (1 - w_v) \ast F_v(P_i)$
  - $w_v$: vision substitution weight
  - $L_v$: vision linear layer
  - $A_v$: output of the current attention layer
  - $\tilde{P}_i$: vision prompts
  - $F_v$: vision projection head
- $w_l = L_l(A_l)$, $P_i' = w_l \ast P_i + (1 - w_l) \ast F_l(\tilde{P}_i)$
  - $w_l$: language substitution weight
  - $L_l$: language linear layer
  - $A_l$: output of the current attention layer
  - $P_i$: language prompts
  - $F_l$: language projection head
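A minimal sketch of the aggregation step is shown below. It assumes the per-prompt attention weights $A_v$ and $A_l$ have already been pooled into one scalar per prompt, and it bounds the substitution weights with a sigmoid; both choices are illustrative assumptions rather than details confirmed by the paper.

```python
import torch
import torch.nn as nn

class BiModalAggregation(nn.Module):
    """Sketch of the bi-directional prompt aggregation (illustrative assumptions)."""
    def __init__(self, dim_dl, dim_dv):
        super().__init__()
        self.F_v = nn.Linear(dim_dl, dim_dv)   # vision projection head: language prompt -> vision space
        self.F_l = nn.Linear(dim_dv, dim_dl)   # language projection head: vision prompt -> language space
        self.L_v = nn.Linear(1, 1)             # 1x1 layer: attention weight -> substitution weight
        self.L_l = nn.Linear(1, 1)

    def forward(self, P_lang, P_vis, A_l, A_v):
        # P_lang: (b, d_l) language prompts; P_vis: (b, d_v) vision prompts.
        # A_l, A_v: (b, 1) attention paid by other tokens to each prompt (assumed pre-pooled).
        w_l = torch.sigmoid(self.L_l(A_l))     # dynamic per-prompt weights in (0, 1) -- an assumption
        w_v = torch.sigmoid(self.L_v(A_v))
        P_vis_new = w_v * P_vis + (1 - w_v) * self.F_v(P_lang)    # updated vision prompt P~'_i
        P_lang_new = w_l * P_lang + (1 - w_l) * self.F_l(P_vis)   # updated language prompt P'_i
        return P_lang_new, P_vis_new
```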
The open-world generalization evaluation paradigm is introduced to address a limitation of the base-to-new class generalization task, which evaluates base and new classes separately. The open-world paradigm does not reveal in advance whether a test sample belongs to the base or the new classes, and therefore provides a more realistic evaluation of a model's ability to generalize to unknown distributions.
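As a rough illustration of the protocol, the sketch below scores test images against the union of base and new class names in a single label space; `encode_class_names` and `encode_images` are hypothetical helpers standing in for the VLM's text and image encoders, not an API defined by the paper.

```python
import torch

@torch.no_grad()
def open_world_eval(model, test_loader, base_classes, new_classes):
    all_classes = base_classes + new_classes                 # one unified label space
    text_features = model.encode_class_names(all_classes)    # hypothetical helper
    correct = total = 0
    for images, labels in test_loader:                       # labels indexed into all_classes
        image_features = model.encode_images(images)         # hypothetical helper
        logits = image_features @ text_features.t()          # similarity scores over all classes
        correct += (logits.argmax(dim=-1) == labels).sum().item()
        total += labels.numel()
    return correct / total
```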
The experimental results demonstrate that BMIP achieves state-of-the-art (SOTA) performance across various tasks and datasets. Specifically, BMIP exhibits significant performance improvements on datasets with imbalanced text and image information, such as EuroSAT and Flowers102, which aligns with the motivation of addressing the shortcomings of single-modal prompt learning methods. Furthermore, BMIP's modular design allows it to be combined with other prompt-based methods, such as PromptSRC and CoPrompt, for consistent performance enhancement. Ablation studies validate the effectiveness of the proposed aggregation function and demonstrate that BMIP's performance gains are not solely attributed to an increase in the number of parameters.