DialCLIP: Empowering CLIP as Multi-Modal Dialog Retriever (2401.01076v2)

Published 2 Jan 2024 in cs.CL

Abstract: Recently, substantial advancements in pre-trained vision-language models have greatly enhanced the capabilities of multi-modal dialog systems. These models have demonstrated significant improvements by fine-tuning on downstream tasks. However, the existing pre-trained models primarily focus on effectively capturing the alignment between vision and language modalities, often ignoring the intricate nature of dialog context. In this paper, we propose a parameter-efficient prompt-tuning method named DialCLIP for multi-modal dialog retrieval. Specifically, our approach introduces a multi-modal context prompt generator to learn context features, which are subsequently distilled into prompts within the pre-trained vision-language model CLIP. In addition, we introduce a domain prompt to mitigate the discrepancy from the downstream dialog data. To facilitate various types of retrieval, we also design multiple experts to learn mappings from CLIP outputs to the multi-modal representation space, with each expert responsible for one specific retrieval type. Extensive experiments show that DialCLIP achieves state-of-the-art performance on two widely recognized benchmark datasets (i.e., PhotoChat and MMDialog) by tuning a mere 0.04% of the total parameters. These results highlight the efficacy and efficiency of our proposed approach, underscoring its potential to advance the field of multi-modal dialog retrieval.
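
The mechanism the abstract describes (a frozen CLIP backbone, learnable context and domain prompts, and per-retrieval-type expert heads) can be illustrated with a short sketch. The PyTorch code below is a minimal illustration under stated assumptions: the toy frozen encoder, prompt lengths, module names, and expert count are hypothetical stand-ins, not the authors' implementation.

```python
# Minimal sketch of the DialCLIP prompt-tuning idea (illustrative only).
# All sizes, names, and the stand-in frozen encoder are assumptions.
import torch
import torch.nn as nn

class FrozenEncoder(nn.Module):
    """Stand-in for a frozen pre-trained CLIP encoder (assumption)."""
    def __init__(self, dim=512):
        super().__init__()
        self.layer = nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
        for p in self.parameters():
            p.requires_grad = False  # backbone stays frozen; only prompts/experts train

    def forward(self, x):
        return self.layer(x).mean(dim=1)  # pooled sequence representation

class DialCLIPSketch(nn.Module):
    def __init__(self, dim=512, prompt_len=4, num_experts=2):
        super().__init__()
        self.encoder = FrozenEncoder(dim)
        # Learnable domain prompt shared across examples (assumed length 4).
        self.domain_prompt = nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
        # Context prompt generator: distills dialog-context features into prompts.
        self.context_gen = nn.Sequential(
            nn.Linear(dim, dim), nn.Tanh(), nn.Linear(dim, prompt_len * dim)
        )
        # One expert head per retrieval type (e.g., context->image, context->text).
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])

    def forward(self, token_embs, context_feat, expert_id):
        b, _, dim = token_embs.shape
        ctx_prompt = self.context_gen(context_feat).view(b, -1, dim)
        dom_prompt = self.domain_prompt.unsqueeze(0).expand(b, -1, -1)
        # Prepend domain and context prompts to the input token embeddings.
        x = torch.cat([dom_prompt, ctx_prompt, token_embs], dim=1)
        pooled = self.encoder(x)
        return self.experts[expert_id](pooled)  # retrieval-type-specific embedding

model = DialCLIPSketch()
# Example forward pass with random inputs (batch of 2, 16 token embeddings).
emb = torch.randn(2, 16, 512)
ctx = torch.randn(2, 512)
out = model(emb, ctx, expert_id=0)  # shape (2, 512)

trainable = sum(p.numel() for p in model.parameters() if p.requires_grad)
total = sum(p.numel() for p in model.parameters())
print(f"trainable fraction: {trainable / total:.4%}")
```

Because only the prompt parameters, the prompt generator, and the expert heads receive gradients, the trainable fraction stays tiny relative to the frozen backbone, which is the same parameter-efficiency argument behind the 0.04% figure reported in the abstract.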

Authors (5)
  1. Zhichao Yin
  2. Binyuan Hui
  3. Min Yang
  4. Fei Huang
  5. Yongbin Li