
Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation (2310.04456v1)

Published 4 Oct 2023 in cs.CL, cs.SD, and eess.AS

Abstract: Emotion Recognition in Conversation (ERC) plays an important role in advancing human-machine interaction. Emotions can be expressed through multiple modalities, and multimodal ERC faces two main problems: (1) noise introduced during cross-modal information fusion, and (2) difficulty predicting emotion labels that have few samples and are semantically similar to other labels but belong to different categories. To address these issues and make full use of the features of each modality, we adopt the following strategies. First, we extract deep emotion cues from modalities with strong representation ability, and we design feature filters that serve as multimodal prompt information for modalities with weak representation ability. Second, we design a Multimodal Prompt Transformer (MPT) to perform cross-modal information fusion. MPT embeds multimodal fusion information into each attention layer of the Transformer, so that the prompt information participates in encoding textual features and is fused with multi-level textual information to obtain better multimodal fusion features. Finally, we use a Hybrid Contrastive Learning (HCL) strategy to improve the model's handling of labels with few samples. This strategy uses unsupervised contrastive learning to improve the representation ability of the multimodal fusion features and supervised contrastive learning to mine the information in labels with few samples. Experimental results show that the proposed model outperforms state-of-the-art ERC models on two benchmark datasets.
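To make the idea of combining an unsupervised and a supervised contrastive objective concrete, here is a minimal NumPy sketch. It is not the paper's HCL formulation, which the abstract does not spell out. It assumes a standard InfoNCE loss between two views of the same fused features for the unsupervised term and a standard supervised contrastive loss over emotion labels for the supervised term. The function names, the `temperature` value, and the mixing weight `alpha` are all hypothetical.

```python
import numpy as np


def _normalize(x):
    # L2-normalize each row so dot products become cosine similarities.
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def _log_softmax_rows(logits):
    # Numerically stable row-wise log-softmax.
    m = logits.max(axis=1, keepdims=True)
    return logits - m - np.log(np.exp(logits - m).sum(axis=1, keepdims=True))


def info_nce(view_a, view_b, temperature=0.1):
    """Unsupervised term (assumed form): row i of view_a should match row i of view_b."""
    a, b = _normalize(view_a), _normalize(view_b)
    log_prob = _log_softmax_rows(a @ b.T / temperature)
    return -np.mean(np.diag(log_prob))


def supervised_contrastive(features, labels, temperature=0.1):
    """Supervised term (assumed form): pull together samples that share a label."""
    z = _normalize(features)
    labels = np.asarray(labels)
    n = z.shape[0]
    self_mask = np.eye(n, dtype=bool)
    # Exclude each sample's similarity with itself from the softmax.
    logits = np.where(self_mask, -np.inf, z @ z.T / temperature)
    log_prob = _log_softmax_rows(logits)
    positives = (labels[:, None] == labels[None, :]) & ~self_mask
    pos_count = positives.sum(axis=1)
    valid = pos_count > 0  # anchors with no same-label partner are skipped
    if not valid.any():
        return 0.0
    summed = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(np.mean(-summed[valid] / pos_count[valid]))


def hybrid_contrastive(view_a, view_b, labels, alpha=0.5, temperature=0.1):
    # Hypothetical weighting: alpha balances the two terms.
    unsup = info_nce(view_a, view_b, temperature)
    sup = supervised_contrastive(view_a, labels, temperature)
    return alpha * unsup + (1.0 - alpha) * sup


# Toy fused features: two tight clusters in 2-D.
features = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]])
matching_labels = [0, 0, 1, 1]     # labels agree with the clusters
mismatched_labels = [0, 1, 0, 1]   # labels cut across the clusters
```

In this toy setup the supervised term is small when labels agree with the feature clusters and large when they cut across them. That is the signal that could help a model separate rare labels that sit close to frequent ones.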

Authors (3)
  1. Shihao Zou
  2. Xianying Huang
  3. Xudong Shen
Citations (4)