Deeply Coupled Cross-Modal Prompt Learning (2305.17903v3)

Published 29 May 2023 in cs.CV

Abstract: Recent multimodal foundation models (e.g., CLIP) excel at zero-shot generalization. Prompt tuning, which transfers knowledge from foundation models to downstream tasks, has gained significant attention recently. Existing prompt-tuning methods in cross-modal learning, however, either focus solely on the language branch or learn vision-language interaction through a shallow mechanism. In this context, we propose a Deeply coupled Cross-modal Prompt learning (DCP) method based on CLIP. DCP flexibly accommodates the interplay between vision and language with a Cross-Modal Prompt Attention (CMPA) mechanism, which enables the progressive and strong mutual exchange of the respective representations through a well-connected multi-head attention module. We then conduct comprehensive few-shot learning experiments on 11 image classification datasets and also analyze robustness to domain shift. Thorough experimental analysis demonstrates the superb few-shot generalization and compelling domain adaptation capacity of a well-executed DCP. The code can be found at https://github.com/GingL/CMPA.
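
The abstract describes CMPA only at a high level: prompts on the vision and language sides exchange information through attention. The sketch below illustrates that general idea in plain Python and is not the authors' implementation. It assumes a single attention head, identity query/key/value projections, and a residual update, none of which are stated in the abstract; the names `cross_attend` and `coupled_prompt_update` are hypothetical. The released code at the URL above is the authoritative reference.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys_values):
    """Scaled dot-product attention where keys and values are the same vectors.

    Assumption: one head and identity projections. A real multi-head module
    would apply learned query/key/value projections per head.
    """
    dim = len(queries[0])
    attended = []
    for q in queries:
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(dim)
            for k in keys_values
        ]
        weights = softmax(scores)
        attended.append(
            [sum(w * v[j] for w, v in zip(weights, keys_values)) for j in range(dim)]
        )
    return attended


def coupled_prompt_update(text_prompts, image_prompts):
    """Let each modality's prompts attend to the other's, then add a residual.

    Assumption: the residual connection and the symmetric update are
    illustrative choices, not details taken from the paper.
    """
    text_ctx = cross_attend(text_prompts, image_prompts)
    image_ctx = cross_attend(image_prompts, text_prompts)
    new_text = [
        [p + c for p, c in zip(prompt, ctx)]
        for prompt, ctx in zip(text_prompts, text_ctx)
    ]
    new_image = [
        [p + c for p, c in zip(prompt, ctx)]
        for prompt, ctx in zip(image_prompts, image_ctx)
    ]
    return new_text, new_image


# Toy prompts: 2 text prompt vectors and 3 image prompt vectors, each 2-dimensional.
text_prompts = [[1.0, 0.0], [0.0, 1.0]]
image_prompts = [[0.5, 0.5], [1.0, -1.0], [0.0, 2.0]]
new_text, new_image = coupled_prompt_update(text_prompts, image_prompts)
```

Each modality keeps its own number of prompt vectors after the update; only their contents change. A "deep" coupling in the abstract's sense would presumably repeat such an exchange across several encoder layers, but the layer schedule is not specified in the abstract.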

Authors (6)

Xuejing Liu
Wei Tang
Jinghui Lu
Rui Zhao
Zhaojun Guo
Fei Tan

Citations (13)

GitHub

GitHub - GingL/CMPA (13 stars)
