
Token-Level Contrastive Learning with Modality-Aware Prompting for Multimodal Intent Recognition (2312.14667v2)

Published 22 Dec 2023 in cs.MM and cs.LG

Abstract: Multimodal intent recognition aims to leverage diverse modalities such as expressions, body movements and tone of speech to comprehend the user's intent, constituting a critical task for understanding human language and behavior in real-world multimodal scenarios. Nevertheless, the majority of existing methods ignore potential correlations among different modalities and have limitations in effectively learning semantic features from nonverbal modalities. In this paper, we introduce a token-level contrastive learning method with modality-aware prompting (TCL-MAP) to address the above challenges. To establish an optimal multimodal semantic environment for the text modality, we develop a modality-aware prompting module (MAP), which effectively aligns and fuses features from the text, video and audio modalities with similarity-based modality alignment and a cross-modality attention mechanism. Based on the modality-aware prompt and ground-truth labels, the proposed token-level contrastive learning framework (TCL) constructs augmented samples and employs the NT-Xent loss on the label token. Specifically, TCL capitalizes on the optimal textual semantic insights derived from intent labels to guide the learning processes of the other modalities in return. Extensive experiments show that our method achieves remarkable improvements compared to state-of-the-art methods. Additionally, ablation analyses demonstrate the superiority of the modality-aware prompt over the handcrafted prompt, which holds substantial significance for multimodal prompt learning. The code is released at https://github.com/thuiar/TCL-MAP.
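The abstract states that TCL builds augmented samples and applies the NT-Xent loss on the label token. Below is a minimal sketch of such a loss in PyTorch, assuming paired label-token embeddings from an original and an augmented view; the function name, tensor shapes, and temperature value are illustrative assumptions and do not reproduce the authors' released implementation.

```python
# Minimal NT-Xent sketch over label-token embeddings (illustrative only).
import torch
import torch.nn.functional as F


def nt_xent_label_token(z_orig: torch.Tensor,
                        z_aug: torch.Tensor,
                        temperature: float = 0.07) -> torch.Tensor:
    """NT-Xent loss between label-token embeddings of original and
    augmented samples. z_orig, z_aug: (batch, dim)."""
    batch = z_orig.size(0)
    # L2-normalize and stack both views: rows 0..B-1 are originals,
    # rows B..2B-1 are the augmented counterparts.
    z = F.normalize(torch.cat([z_orig, z_aug], dim=0), dim=1)
    sim = z @ z.t() / temperature                  # cosine similarities
    # Mask out self-similarity so each anchor ignores itself.
    mask = torch.eye(2 * batch, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(mask, float('-inf'))
    # The positive for sample i is its counterpart in the other view.
    targets = torch.cat([torch.arange(batch, 2 * batch),
                         torch.arange(0, batch)]).to(z.device)
    return F.cross_entropy(sim, targets)


# Example usage with random embeddings (hypothetical dimensions).
if __name__ == "__main__":
    z1, z2 = torch.randn(8, 768), torch.randn(8, 768)
    print(nt_xent_label_token(z1, z2).item())
```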

Authors (7)
  1. Qianrui Zhou (6 papers)
  2. Hua Xu (78 papers)
  3. Hao Li (803 papers)
  4. Hanlei Zhang (13 papers)
  5. Xiaohan Zhang (78 papers)
  6. Yifan Wang (319 papers)
  7. Kai Gao (55 papers)
Citations (4)

