- The paper proposes TGP-T, which leverages dual-level text supervision to guide prompt optimization and reduces GPU memory usage by up to 93%.
- It integrates image-adaptive cues via the Bonder module to align the visual and textual modalities, achieving a 2.5% accuracy gain on 16-shot ImageNet.
- Empirical tests across 11 datasets validate its scalability and superior performance compared to methods like CoOp and CoCoOp.
An Expert Review of "Compound Text-Guided Prompt Tuning via Image-Adaptive Cues"
The paper "Compound Text-Guided Prompt Tuning via Image-Adaptive Cues" introduces an innovative approach to prompt tuning for Vision-Language Models (VLMs), particularly CLIP-like models. The authors address key limitations of existing prompt tuning frameworks, such as excessive GPU memory consumption when dealing with numerous categories and underperformance when category names are ambiguous. Their solution, Compound Text-Guided Prompt Tuning (TGP-T), substantially reduces resource requirements while improving performance.
Contributions
- Text Supervision for Prompt Optimization: The authors use text supervision to guide the optimization of prompts, a strategy that offers two primary advantages:
- Mitigates the dependency on predefined category names, facilitating a more flexible and robust prompt generation process.
- Reduces the number of prompts that must pass through the text encoder, thereby lowering GPU memory demands.
- Dual-Level Text Supervision: The paper distinguishes between category-wise and content-wise text supervision. The former provides a broad category-level perspective, while the latter captures intra-class variations. This dual strategy is shown to enhance performance by improving inter-class separability and accommodating intra-class diversity; a minimal sketch of such a compound objective follows this list.
- Image-Conditioned Prompt Generation: Through a module termed "Bonder," TGP-T aligns prompt generation with visual features. This conditional prompt formulation process ensures tighter integration between visual and textual modalities, improving the model's adaptation capabilities.
- Computational Efficiency and Performance Gains: The experiments show substantial efficiency improvements, with TGP-T reducing GPU memory usage by 93% while achieving a 2.5% accuracy gain in the 16-shot ImageNet setting, underscoring the framework's scalability and its potential for broader application across datasets.
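To make the dual-level supervision concrete, the following is a minimal PyTorch sketch of a compound objective in this spirit. The function name, the cosine-alignment form of the two text losses, and the weights `w_cat`/`w_cont` are illustrative assumptions rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def compound_text_guided_loss(cat_feat, cont_feat, cat_text, cont_text,
                              logits, labels, w_cat=1.0, w_cont=1.0):
    """Hypothetical compound objective: a classification term plus two
    text-alignment terms (category-wise and content-wise)."""
    # Standard cross-entropy on the class logits.
    cls_loss = F.cross_entropy(logits, labels)
    # Category-wise supervision: align the prompt-derived feature with the
    # text embedding of the ground-truth category description.
    cat_loss = 1.0 - F.cosine_similarity(cat_feat, cat_text, dim=-1).mean()
    # Content-wise supervision: align with an image-specific description
    # (e.g., a caption) to capture intra-class variation.
    cont_loss = 1.0 - F.cosine_similarity(cont_feat, cont_text, dim=-1).mean()
    return cls_loss + w_cat * cat_loss + w_cont * cont_loss

# Toy shapes only: batch of 8, 512-d features, 100 classes.
B, D, C = 8, 512, 100
loss = compound_text_guided_loss(
    cat_feat=torch.randn(B, D), cont_feat=torch.randn(B, D),
    cat_text=torch.randn(B, D), cont_text=torch.randn(B, D),
    logits=torch.randn(B, C), labels=torch.randint(0, C, (B,)))
```

Here `cat_text` and `cont_text` stand in for pre-computed text embeddings of a category-level description and an image-specific description, respectively, while `cat_feat` and `cont_feat` are the corresponding prompt-derived features.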
Methodological Insights
TGP-T's architecture is built upon a pre-trained VLM such as CLIP. By moving the learning of category centers to after the text encoder, it needs to feed only two prompts through the text encoder rather than one per category, which substantially reduces the computational footprint. The Bonder module, built on cross-modal attention, is pivotal to this efficiency, adapting the prompt queries through interaction with the visual features of each input image; a sketch of this mechanism follows.
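The cross-modal attention at the heart of the Bonder can be sketched as follows. This is an illustrative stand-in rather than the paper's implementation: the class name `BonderSketch`, the single attention block, and the layer sizes are assumptions.

```python
import torch
import torch.nn as nn

class BonderSketch(nn.Module):
    """Illustrative stand-in for the Bonder: learnable prompt queries read
    from the visual tokens via cross-attention, so the generated prompts
    are conditioned on the input image."""
    def __init__(self, dim=512, num_queries=16, num_heads=8):
        super().__init__()
        # Learnable queries that become the prompt tokens after attention.
        self.queries = nn.Parameter(torch.randn(num_queries, dim) * 0.02)
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, visual_tokens):
        # visual_tokens: (batch, num_patches, dim) from the frozen image encoder.
        q = self.queries.unsqueeze(0).expand(visual_tokens.size(0), -1, -1)
        prompts, _ = self.cross_attn(q, visual_tokens, visual_tokens)
        return self.norm(prompts)  # (batch, num_queries, dim)

# Only two prompt sequences per image (category-wise and content-wise) are
# then passed through the frozen text encoder, rather than one per class.
visual_tokens = torch.randn(4, 196, 512)      # e.g., ViT-B/16 patch tokens
prompt_cat = BonderSketch()(visual_tokens)    # category-wise prompt tokens
prompt_cont = BonderSketch()(visual_tokens)   # content-wise prompt tokens
```

Because only the two resulting prompt sequences are encoded per image, the text-encoder workload no longer grows with the number of categories, which is where the reported memory savings come from.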
The framework is evaluated across 11 datasets under few-shot recognition settings, where it outperforms prominent methods such as CoOp and CoCoOp. The results indicate that TGP-T's design enhances the VLM's cross-modal capabilities, which it leverages effectively for both few-shot learning and domain generalization.
Theoretical and Practical Implications
From a theoretical standpoint, TGP-T offers a new lens on how the integration of visual and textual modalities can be optimized, specifically by using dual-level text supervision to drive prompt optimization. This challenges existing paradigms in which textual prompts rely heavily on category names that are susceptible to ambiguity.
Practically, the implications of this research are significant for deploying VLMs in resource-constrained environments or applications demanding efficiency without sacrificing performance. The reduction in computational overhead positions TGP-T as a viable approach for real-world applications requiring scalable and adaptable vision-language interactions.
Future Directions
The paper opens several avenues for future exploration:
- Extending the framework to include more diverse and task-adaptive forms of text supervision.
- Investigating further optimizations around the Bonder module for even more efficient cross-modal integrations.
- Exploring the potential of integrating TGP-T with other foundation models besides CLIP to test its adaptability and impact across a broader span of VLM architectures.
In conclusion, "Compound Text-Guided Prompt Tuning via Image-Adaptive Cues" presents a compelling advancement in the domain of efficient VLM tuning. Its methodological rigor and applicability substantiate its contributions, providing a solid foundation for subsequent research in prompt optimization and cross-modal learning frameworks.