MuDPT: Multi-modal Deep-symphysis Prompt Tuning for Large Pre-trained Vision-Language Models (2306.11400v2)
Abstract: Prompt tuning, like CoOp, has recently shown promising visual recognition and transfer learning ability on various downstream tasks with the emergence of large pre-trained vision-language models like CLIP. However, we identify that existing uni-modal prompt tuning approaches may result in sub-optimal performance, since this uni-modal design breaks the original alignment of textual and visual representations in the pre-trained model. Inspired by the nature of pre-trained vision-language models, we aim to achieve completeness in prompt tuning and propose a novel approach called Multi-modal Deep-symphysis Prompt Tuning, dubbed MuDPT, which extends independent multi-modal prompt tuning by additionally learning a model-agnostic transformative network to allow deep hierarchical bi-directional prompt fusion. We evaluate the effectiveness of MuDPT on few-shot visual recognition and out-of-domain generalization tasks. Compared with state-of-the-art methods, MuDPT achieves better recognition and generalization ability by a clear margin thanks to the synergistic alignment of textual and visual representations. Our code is available at: https://github.com/Mechrev0/MuDPT.
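To make the idea of deep hierarchical bi-directional prompt fusion concrete, here is a minimal NumPy sketch. It assumes per-layer learnable text and visual prompts and a simple shared linear map per direction as the "model-agnostic transformative network"; all dimensions, names, and the additive fusion rule are illustrative assumptions, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes: 4 prompt tokens, CLIP-like widths, 3 prompted layers.
N_CTX, D_TEXT, D_VIS, DEPTH = 4, 512, 768, 3

# Deep prompts: one learnable set per transformer layer, per modality.
text_prompts = [rng.standard_normal((N_CTX, D_TEXT)) for _ in range(DEPTH)]
vis_prompts = [rng.standard_normal((N_CTX, D_VIS)) for _ in range(DEPTH)]

# Illustrative "transformative network": one linear map per direction,
# shared across layers (a simplifying assumption for this sketch).
W_t2v = rng.standard_normal((D_TEXT, D_VIS)) * 0.01
W_v2t = rng.standard_normal((D_VIS, D_TEXT)) * 0.01

def fuse(layer: int):
    """Bi-directionally fuse the text and visual prompts of one layer."""
    t, v = text_prompts[layer], vis_prompts[layer]
    fused_t = t + v @ W_v2t  # visual information flows into the text prompts
    fused_v = v + t @ W_t2v  # textual information flows into the visual prompts
    return fused_t, fused_v

# The fused prompts would be prepended to each encoder layer's input.
fused = [fuse(i) for i in range(DEPTH)]
print(fused[0][0].shape, fused[0][1].shape)  # (4, 512) (4, 768)
```

The point of the sketch is only the data flow: each modality's prompt at each depth is updated with a projection of its counterpart, so the two encoders are tuned jointly rather than independently.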
- Lei Ba J., Swersky K. and Fidler S., “Predicting deep zero-shot convolutional neural networks using textual descriptions,” IEEE International Conference on Computer Vision. 2015, pp. 4247-4255.
- Wang Z., Yu J., Yu A. W., Dai Z., Tsvetkov Y. and Cao Y., “Simple visual language model pretraining with weak supervision,” International Conference on Learning Representations. 2022.
- Song H., Dong L., Zhang W., Liu T. and Wei F., “CLIP models are few-shot learners: empirical studies on VQA and visual entailment,” Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2022, pp. 6088-6100.
- Zhou K., Yang J., Loy C. C. and Liu Z., “Learning to prompt for vision-language models,” International Journal of Computer Vision. 2022, 130(9) pp. 2337-2348.
- Zhou K., Yang J., Loy C. C. and Liu Z., “Conditional prompt learning for vision-language models,” IEEE Conference on Computer Vision and Pattern Recognition. 2022, pp. 16816-16825.
- Yongzhu Miao
- Shasha Li
- Jintao Tang
- Ting Wang