Conditional Prompt Tuning for Multimodal Fusion (2312.03734v1)

Published 28 Nov 2023 in cs.CL and cs.AI

Abstract: We show that the representation of one modality can effectively guide the prompting of another modality for parameter-efficient multimodal fusion. Specifically, we first encode one modality and use its representation as a prior to conditionally prompt all frozen layers of the other modality. This is achieved by disentangling the vanilla prompt vectors into three types of specialized prompts that adaptively capture global-level and instance-level features. To better produce the instance-wise prompt, we introduce the mixture of prompt experts (MoPE), which dynamically routes each instance to the most suitable prompt experts for encoding. We further study a regularization term that avoids degenerate prompt-expert routing. Thanks to this design, our method effectively transfers the pretrained knowledge in unimodal encoders to downstream multimodal tasks. Compared with vanilla prompting, our MoPE-based conditional prompting is more expressive and thus scales better with training data and the total number of prompts. We also demonstrate that our prompt tuning is architecture-agnostic and therefore highly modular. Extensive experiments on three multimodal datasets demonstrate state-of-the-art results, matching or surpassing fine-tuning performance while requiring only 0.7% of the trainable parameters. Code will be released: https://github.com/songrise/ConditionalPrompt.
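
The abstract compresses the method into a single paragraph, so a small sketch may help show how the pieces fit together. Below is a minimal PyTorch-style module reconstructed from the abstract alone: a static (global-level) prompt, a dynamically routed (instance-level) prompt produced by a mixture of prompt experts, a mapped prompt projected from the conditioning modality, and a regularizer against degenerate routing. All names, dimensions, the dense softmax router, and the coefficient-of-variation penalty are assumptions drawn from common mixture-of-experts practice, not the paper's released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoPEPromptLayer(nn.Module):
    """Sketch: builds the prompt for one frozen layer of the modality-B
    encoder, conditioned on a representation of modality A."""

    def __init__(self, cond_dim=512, embed_dim=768, prompt_len=4, num_experts=8):
        super().__init__()
        # Global-level prompt: a vanilla prompt shared by all instances.
        self.static_prompt = nn.Parameter(torch.randn(prompt_len, embed_dim) * 0.02)
        # Instance-level prompt experts; the router mixes them per instance.
        self.experts = nn.Parameter(torch.randn(num_experts, prompt_len, embed_dim) * 0.02)
        self.router = nn.Linear(cond_dim, num_experts)
        # Mapped prompt: a direct projection of the conditioning features.
        self.mapper = nn.Linear(cond_dim, embed_dim)

    def forward(self, cond):
        # cond: (B, cond_dim) pooled features from the modality-A encoder.
        gates = F.softmax(self.router(cond), dim=-1)           # (B, num_experts)
        # Instance-wise prompt as a gate-weighted sum of expert prompts.
        dynamic = torch.einsum('be,eld->bld', gates, self.experts)
        mapped = self.mapper(cond).unsqueeze(1)                # (B, 1, embed_dim)
        static = self.static_prompt.expand(cond.size(0), -1, -1)
        prompts = torch.cat([static, dynamic, mapped], dim=1)  # (B, 2*L+1, D)
        # Anti-degeneracy regularizer: squared coefficient of variation of
        # per-expert usage over the batch (small when usage is balanced).
        importance = gates.sum(dim=0)                          # (num_experts,)
        reg = importance.var(unbiased=False) / (importance.mean() ** 2 + 1e-6)
        return prompts, reg
```

In use, `prompts` would be prepended to the token sequence at each frozen layer of the modality-B encoder, and `reg` (scaled by a small coefficient) added to the task loss so the router does not collapse onto a single expert.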
