Multimodal Prompt Transformer with Hybrid Contrastive Learning for Emotion Recognition in Conversation (2310.04456v1)
Abstract: Emotion Recognition in Conversation (ERC) plays an important role in advancing human-machine interaction. Emotions can be expressed in multiple modalities, and multimodal ERC faces two main problems: (1) noise introduced during cross-modal information fusion, and (2) the difficulty of predicting emotion labels that have few samples and are semantically similar to other labels but belong to different categories. To address these issues and fully utilize the features of each modality, we adopt the following strategies. First, we extract deep emotion cues from modalities with strong representation ability, and for modalities with weak representation ability we design feature filters that serve as multimodal prompt information. Second, we design a Multimodal Prompt Transformer (MPT) to perform cross-modal information fusion. MPT embeds multimodal fusion information into each attention layer of the Transformer, so that prompt information participates in encoding textual features and is fused with multi-level textual information to obtain better multimodal fusion features. Finally, we use a Hybrid Contrastive Learning (HCL) strategy to improve the model's handling of labels with few samples. This strategy uses unsupervised contrastive learning to improve the representation ability of multimodal fusion and supervised contrastive learning to mine the information of labels with few samples. Experimental results show that our proposed model outperforms state-of-the-art models in ERC on two benchmark datasets.
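The abstract names the two components of HCL but does not give the loss itself. As a rough illustration only, the sketch below assumes an InfoNCE-style unsupervised term over two views of each fused utterance feature and a supervised term that treats same-label utterances as positives. The function names, the mixing weight `alpha`, and the `temperature` value are hypothetical and are not taken from the paper.

```python
import math


def _normalize(v):
    # L2-normalize a vector so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def contrastive_loss(features, positives, temperature=0.5):
    """InfoNCE-style loss; positives[i] is the set of positive indices for anchor i.

    Anchors without any positive are skipped.
    """
    z = [_normalize(f) for f in features]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        if not positives[i]:
            continue
        # Similarity of the anchor to every other sample (the anchor itself is excluded).
        logits = {j: _dot(z[i], z[j]) / temperature for j in range(n) if j != i}
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability over this anchor's positives.
        total += -sum(logits[p] - log_denom for p in positives[i]) / len(positives[i])
        counted += 1
    return total / counted if counted else 0.0


def unsupervised_positives(batch_size):
    # Features are laid out as [view 1 of samples 0..B-1, view 2 of samples 0..B-1];
    # the only positive of each entry is the other view of the same sample.
    return [{(i + batch_size) % (2 * batch_size)} for i in range(2 * batch_size)]


def supervised_positives(labels):
    # Every other sample with the same emotion label counts as a positive.
    return [
        {j for j, lj in enumerate(labels) if j != i and lj == li}
        for i, li in enumerate(labels)
    ]


def hybrid_contrastive_loss(view1, view2, labels, alpha=0.5, temperature=0.5):
    # alpha is a hypothetical mixing weight between the two terms.
    unsup = contrastive_loss(view1 + view2, unsupervised_positives(len(view1)), temperature)
    sup = contrastive_loss(view1, supervised_positives(labels), temperature)
    return alpha * unsup + (1 - alpha) * sup
```

In this sketch a label that appears only once in a batch contributes nothing to the supervised term, so a batch would need at least two utterances of a rare emotion for that label to receive any supervised contrastive signal.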
Authors: Shihao Zou, Xianying Huang, Xudong Shen