Mixture-of-Prompt-Experts for Multi-modal Semantic Understanding (2403.11311v2)

Published 17 Mar 2024 in cs.CL and cs.MM

Abstract: Deep multimodal semantic understanding that goes beyond mere superficial content-relation mining has received increasing attention in the realm of artificial intelligence. The challenges of collecting and annotating high-quality multi-modal data have underscored the significance of few-shot learning. In this paper, we focus on two critical tasks under this context: few-shot multi-modal sarcasm detection (MSD) and multi-modal sentiment analysis (MSA). To address them, we propose Mixture-of-Prompt-Experts with Block-Aware Prompt Fusion (MoPE-BAF), a novel multi-modal soft prompt framework built on a unified vision-language model (VLM). Specifically, we design three soft-prompt experts: a text prompt and an image prompt that extract modality-specific features to enrich the single-modal representations, and a unified prompt that assists multi-modal interaction. Additionally, we reorganize the Transformer layers into several blocks and introduce cross-modal prompt attention between adjacent blocks, which smooths the transition from single-modal representation to multi-modal fusion. On both MSD and MSA datasets in the few-shot setting, our proposed model not only surpasses the 8.2B-parameter InstructBLIP with merely 2% of its parameters (150M), but also significantly outperforms other widely used prompt methods on VLMs as well as task-specific methods.
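
To make the architecture concrete, the sketch below shows one plausible PyTorch rendering of the abstract's two ideas: three soft-prompt experts (text, image, unified) prepended per block, and cross-modal prompt attention between adjacent blocks. All names, dimensions, layer counts, the additive carry of fused prompt states into the next block's prompts, and the pooled classification head are illustrative assumptions reconstructed from the abstract alone, not the authors' implementation.

```python
import torch
import torch.nn as nn


class MoPEBAFSketch(nn.Module):
    """Toy MoPE-BAF: three prompt experts per block, cross-modal prompt
    attention between adjacent blocks, one shared encoder for both streams."""

    def __init__(self, dim=768, heads=8, n_blocks=3, layers_per_block=2,
                 prompt_len=4, n_classes=2):
        super().__init__()

        def make_block(depth):
            layer = nn.TransformerEncoderLayer(
                d_model=dim, nhead=heads, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=depth)

        # Transformer layers regrouped into blocks (block-aware design).
        self.blocks = nn.ModuleList(
            [make_block(layers_per_block) for _ in range(n_blocks)])
        # Per-block soft prompts for the three experts.
        self.text_prompts = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(prompt_len, dim))
             for _ in range(n_blocks)])
        self.image_prompts = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(prompt_len, dim))
             for _ in range(n_blocks)])
        self.unified_prompts = nn.ParameterList(
            [nn.Parameter(0.02 * torch.randn(prompt_len, dim))
             for _ in range(n_blocks)])
        # Cross-modal prompt attention applied between adjacent blocks.
        self.prompt_xattn = nn.ModuleList(
            [nn.MultiheadAttention(dim, heads, batch_first=True)
             for _ in range(n_blocks - 1)])
        self.head = nn.Linear(dim, n_classes)
        self.p = prompt_len

    def forward(self, text_emb, image_emb):
        # text_emb: (B, Lt, D) token embeddings; image_emb: (B, Li, D) patches.
        B = text_emb.size(0)
        t, v = text_emb, image_emb
        t_carry = v_carry = None  # fused prompt states carried across blocks
        for i, block in enumerate(self.blocks):
            pt = self.text_prompts[i].expand(B, -1, -1)
            pv = self.image_prompts[i].expand(B, -1, -1)
            pu = self.unified_prompts[i].expand(B, -1, -1)
            if t_carry is not None:  # inject fused prompts from previous block
                pt = pt + t_carry
                pv = pv + v_carry
            # The shared encoder processes each modality with the unified
            # prompt plus its own modality-specific expert prompt prepended.
            t_seq = block(torch.cat([pu, pt, t], dim=1))
            v_seq = block(torch.cat([pu, pv, v], dim=1))
            k = 2 * self.p
            t_state, t = t_seq[:, :k], t_seq[:, k:]
            v_state, v = v_seq[:, :k], v_seq[:, k:]
            if i < len(self.blocks) - 1:
                # Cross-modal prompt attention: each modality's prompt states
                # query the other's, smoothing the shift toward fusion.
                xattn = self.prompt_xattn[i]
                t_carry, _ = xattn(t_state, v_state, v_state)
                v_carry, _ = xattn(v_state, t_state, t_state)
                t_carry = t_carry[:, self.p:]  # match expert-prompt length
                v_carry = v_carry[:, self.p:]
        # Mean-pool both streams and classify (sarcasm / sentiment label).
        fused = torch.cat([t, v], dim=1).mean(dim=1)
        return self.head(fused)


model = MoPEBAFSketch()
text = torch.randn(2, 16, 768)   # stand-in for BERT-style token embeddings
image = torch.randn(2, 49, 768)  # stand-in for ViT patch embeddings
logits = model(text, image)      # shape: (2, 2)
```

How the fused prompt states feed the next block is the loosest assumption here: this sketch injects them additively into the next block's expert prompts, which is only one reading of "cross-modal prompt attention between adjacent blocks."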

References (58)
  1. VLMo: Unified vision-language pre-training with mixture-of-modality-experts. In Advances in Neural Information Processing Systems, volume 35, pages 32897–32912. Curran Associates, Inc.
  2. Alexandru-Costin Băroiu and Ștefan Trăușan-Matu. 2022. Automatic sarcasm detection: Systematic literature review. Information, 13(8).
  3. Multi-modal sarcasm detection in Twitter with hierarchical fusion model. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 2506–2515, Florence, Italy. Association for Computational Linguistics.
  4. Towards multimodal sarcasm detection (an _Obviously_ perfect paper). In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4619–4629, Florence, Italy. Association for Computational Linguistics.
  5. Sentiment and emotion help sarcasm? a multi-task learning framework for multi-modal sarcasm, sentiment and emotion analysis. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4351–4360, Online. Association for Computational Linguistics.
  6. RandAugment: Practical automated data augmentation with a reduced search space. In Advances in Neural Information Processing Systems, volume 33, pages 18613–18624. Curran Associates, Inc.
  7. InstructBLIP: Towards general-purpose vision-language models with instruction tuning.
  8. Nice perfume. How long did you marinate in it? Multimodal sarcasm explanation. Proceedings of the AAAI Conference on Artificial Intelligence, 36(10):10563–10571.
  9. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, Minneapolis, Minnesota. Association for Computational Linguistics.
  10. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations.
  11. Making pre-trained language models better few-shot learners. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 3816–3830, Online. Association for Computational Linguistics.
  12. A systematic survey of prompt engineering on vision-language foundation models.
  13. Bi-bimodal modality fusion for correlation-controlled multimodal sentiment analysis. In Proceedings of the 2021 International Conference on Multimodal Interaction, ICMI ’21, page 6–15, New York, NY, USA. Association for Computing Machinery.
  14. PTR: Prompt tuning with rules for text classification. arXiv preprint arXiv:2105.11259.
  15. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
  16. Unsupervised prompt learning for vision-language models. arXiv preprint arXiv:2204.03649.
  17. Scaling up visual and vision-language representation learning with noisy text supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 4904–4916. PMLR.
  18. MaPLe: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19113–19122.
  19. ViLT: Vision-and-language transformer without convolution or region supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 5583–5594. PMLR.
  20. The power of scale for parameter-efficient prompt tuning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pages 3045–3059, Online and Punta Cana, Dominican Republic. Association for Computational Linguistics.
  21. Xiang Lisa Li and Percy Liang. 2021. Prefix-tuning: Optimizing continuous prompts for generation. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4582–4597, Online. Association for Computational Linguistics.
  22. Multi-modal sarcasm detection with interactive in-modal and cross-modal graphs. In Proceedings of the 29th ACM International Conference on Multimedia, MM ’21, page 4707–4715, New York, NY, USA. Association for Computing Machinery.
  23. Multi-modal sarcasm detection via cross-modal graph convolutional network. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1767–1777, Dublin, Ireland. Association for Computational Linguistics.
  24. Towards multi-modal sarcasm detection via hierarchical congruity modeling with knowledge enhancement. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 4995–5006, Abu Dhabi, United Arab Emirates. Association for Computational Linguistics.
  25. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Comput. Surv., 55(9).
  26. P-tuning v2: Prompt tuning can be comparable to fine-tuning universally across scales and tasks. CoRR, abs/2110.07602.
  27. P-tuning: Prompt tuning can be comparable to fine-tuning across scales and tasks. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), pages 61–68, Dublin, Ireland. Association for Computational Linguistics.
  28. GPT understands, too.
  29. Deeply coupled cross-modal prompt learning. In Findings of the Association for Computational Linguistics: ACL 2023, pages 7957–7970, Toronto, Canada. Association for Computational Linguistics.
  30. Vision-and-language pretrained models: A survey. In Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence, IJCAI-22, pages 5530–5537. International Joint Conferences on Artificial Intelligence Organization. Survey Track.
  31. Ilya Loshchilov and Frank Hutter. 2019. Decoupled weight decay regularization. In International Conference on Learning Representations.
  32. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, volume 32. Curran Associates, Inc.
  33. Rishabh Misra and Prahal Arora. 2023. Sarcasm detection using news headlines dataset. AI Open, 4:13–18.
  34. Sentiment analysis on multi-view social data. In MultiMedia Modeling, pages 15–27, Cham. Springer International Publishing.
  35. Modeling intra and inter-modality incongruity for multi-modal sarcasm detection. In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 1383–1392, Online. Association for Computational Linguistics.
  36. Multimodal learning using optimal transport for sarcasm and humor detection. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 3930–3940.
  37. Learning transferable visual models from natural language supervision. In Proceedings of the 38th International Conference on Machine Learning, volume 139 of Proceedings of Machine Learning Research, pages 8748–8763. PMLR.
  38. I didn’t mean what I wrote! Exploring multimodality for sarcasm detection. In 2020 International Joint Conference on Neural Networks (IJCNN), pages 1–8.
  39. Timo Schick and Hinrich Schütze. 2021. It’s not just size that matters: Small language models are also few-shot learners. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2339–2352, Online. Association for Computational Linguistics.
  40. Detecting sarcasm in multimodal social platforms. In Proceedings of the 24th ACM International Conference on Multimedia, MM ’16, page 1136–1145, New York, NY, USA. Association for Computing Machinery.
  41. Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pages 5100–5111, Hong Kong, China. Association for Computational Linguistics.
  42. Dynamic routing transformer network for multimodal sarcasm detection. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2468–2480, Toronto, Canada. Association for Computational Linguistics.
  43. Multimodal sarcasm target identification in tweets. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8164–8175, Dublin, Ireland. Association for Computational Linguistics.
  44. Why do pretrained language models help in downstream tasks? an analysis of head and prompt tuning. In Advances in Neural Information Processing Systems, volume 34, pages 16158–16170. Curran Associates, Inc.
  45. DIP: Dual incongruity perceiving network for sarcasm detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 2540–2550.
  46. Modeling incongruity between modalities for multimodal sarcasm detection. IEEE MultiMedia, 28(2):86–95.
  47. A co-memory network for multimodal sentiment analysis. In The 41st International ACM SIGIR Conference on Research & Development in Information Retrieval, SIGIR ’18, page 929–932, New York, NY, USA. Association for Computing Machinery.
  48. Reasoning with multimodal sarcastic tweets via modeling cross-modality contrast and semantic association. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 3777–3786, Online. Association for Computational Linguistics.
  49. Image-text multimodal emotion classification via multi-view attentional network. IEEE Transactions on Multimedia, 23:4014–4026.
  50. Multimodal sentiment detection based on multi-channel graph neural networks. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 328–339, Online. Association for Computational Linguistics.
  51. Unified multi-modal pre-training for few-shot sentiment analysis with prompt-based learning. In Proceedings of the 30th ACM International Conference on Multimedia, MM ’22, page 189–198, New York, NY, USA. Association for Computing Machinery.
  52. Unified vision and language prompt learning.
  53. Vision-language models for vision tasks: A survey.
  54. Stance-level sarcasm detection with bert and stance-centered graph attention networks. ACM Trans. Internet Technol., 23(2).
  55. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16816–16825.
  56. Learning to prompt for vision-language models. Int. J. Comput. Vision, 130(9):2337–2348.
  57. Multi-Modal Sarcasm Detection in Twitter with Hierarchical Fusion Model. Institute of Computer Science and Technology, Peking University.
  58. Teng Niu, Shiai Zhu, Lei Pang, and Abdulmotaleb El-Saddik. 2016. MVSA: Sentiment analysis on multi-view social data. MCRLab, University of Ottawa.