
MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models (2306.11400v2)

Published 20 Jun 2023 in cs.CV and cs.CL

Abstract: Prompt tuning, like CoOp, has recently shown promising visual recognition and transfer learning ability on various downstream tasks with the emergence of large pre-trained vision-language models like CLIP. However, we identify that existing uni-modal prompt tuning approaches may result in sub-optimal performance, since this uni-modal design breaks the original alignment of textual and visual representations in the pre-trained model. Inspired by the nature of pre-trained vision-language models, we aim to achieve completeness in prompt tuning and propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed MuDPT, which extends independent multi-modal prompt tuning by additionally learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion. We evaluate the effectiveness of MuDPT on few-shot visual recognition and out-of-domain generalization tasks. Compared with state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin thanks to the synergistic alignment of textual and visual representations. Our code is available at: https://github.com/Mechrev0/MuDPT.
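The abstract describes learnable prompt tokens in both the text and vision branches of a frozen CLIP-like model, fused at multiple transformer depths by a small "transformative" network. Below is a minimal PyTorch sketch of that idea. The class names, dimensions (512-d text / 768-d vision, matching CLIP ViT-B/16), prompt depth, and the additive fusion rule are illustrative assumptions for exposition, not the authors' released implementation (see the linked GitHub repository for that).

```python
# Minimal sketch of multi-modal deep prompt tuning with bi-directional fusion.
# All names and hyperparameters below are assumptions, not MuDPT's actual code.
import torch
import torch.nn as nn


class BiDirectionalPromptFusion(nn.Module):
    """Hypothetical 'transformative network': projects prompts across the
    textual and visual embedding spaces so each branch conditions the other."""

    def __init__(self, text_dim: int = 512, vision_dim: int = 768):
        super().__init__()
        self.text_to_vision = nn.Linear(text_dim, vision_dim)
        self.vision_to_text = nn.Linear(vision_dim, text_dim)

    def forward(self, text_prompts: torch.Tensor, vision_prompts: torch.Tensor):
        # Additive fusion: each branch's prompts absorb a projection of the other's.
        fused_text = text_prompts + self.vision_to_text(vision_prompts)
        fused_vision = vision_prompts + self.text_to_vision(text_prompts)
        return fused_text, fused_vision


class MultiModalDeepPrompts(nn.Module):
    """Learnable prompt tokens injected at several layers of a frozen
    CLIP-style backbone (the backbone itself is omitted from this sketch)."""

    def __init__(self, depth: int = 4, n_tokens: int = 4,
                 text_dim: int = 512, vision_dim: int = 768):
        super().__init__()
        self.text_prompts = nn.ParameterList(
            nn.Parameter(torch.randn(n_tokens, text_dim) * 0.02) for _ in range(depth))
        self.vision_prompts = nn.ParameterList(
            nn.Parameter(torch.randn(n_tokens, vision_dim) * 0.02) for _ in range(depth))
        self.fusion = nn.ModuleList(
            BiDirectionalPromptFusion(text_dim, vision_dim) for _ in range(depth))

    def prompts_for_layer(self, layer: int):
        # Fused text/visual prompt tokens to prepend at the given depth.
        return self.fusion[layer](self.text_prompts[layer], self.vision_prompts[layer])


# Usage: only these prompt parameters and the fusion networks would be trained,
# while the pre-trained text and image encoders stay frozen.
prompter = MultiModalDeepPrompts()
text_p, vision_p = prompter.prompts_for_layer(0)
print(text_p.shape, vision_p.shape)  # torch.Size([4, 512]) torch.Size([4, 768])
```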

References (5)
  1. Ba, J. L., Swersky, K., and Fidler, S., "Predicting deep zero-shot convolutional neural networks using textual descriptions," IEEE International Conference on Computer Vision, 2015, pp. 4247-4255.
  2. Wang, Z., Yu, J., Yu, A. W., Dai, Z., Tsvetkov, Y., and Cao, Y., "Simple visual language model pretraining with weak supervision," International Conference on Learning Representations, 2022.
  3. Song, H., Dong, L., Zhang, W., Liu, T., and Wei, F., "CLIP models are few-shot learners: empirical studies on VQA and visual entailment," Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2022, pp. 6088-6100.
  4. Zhou, K., Yang, J., Loy, C. C., and Liu, Z., "Learning to prompt for vision-language models," International Journal of Computer Vision, 2022, 130(9), pp. 2337-2348.
  5. Zhou, K., Yang, J., Loy, C. C., and Liu, Z., "Conditional prompt learning for vision-language models," IEEE Conference on Computer Vision and Pattern Recognition, 2022, pp. 16816-16825.
Authors (4)
  1. Yongzhu Miao (1 paper)
  2. Shasha Li (57 papers)
  3. Jintao Tang (8 papers)
  4. Ting Wang (213 papers)
Citations (3)