A Comprehensive Survey on Multi-modal Conversational Emotion Recognition with Deep Learning (2312.05735v1)

Published 10 Dec 2023 in cs.AI

Abstract: Multi-modal conversational emotion recognition (MCER) aims to recognize and track a speaker's emotional state using the text, speech, and visual information available in a conversation. Analyzing and studying MCER is significant to the fields of affective computing, intelligent recommendation, and human-computer interaction. Unlike traditional single-utterance multi-modal emotion recognition or single-modal conversational emotion recognition, MCER is a more challenging problem that must handle more complex emotional interaction relationships. The critical issue is learning consistent and complementary semantics for multi-modal feature fusion based on these emotional interaction relationships. To address this problem, extensive research on MCER based on deep learning has been conducted, but a systematic review of the modeling methods is still lacking. A timely and comprehensive overview of recent deep learning advances in MCER is therefore of great value to both academia and industry. In this survey, we provide a comprehensive overview of MCER modeling methods and divide them into four broad categories: context-free modeling, sequential context modeling, speaker-differentiated modeling, and speaker-relationship modeling. We further discuss MCER's popular publicly available datasets, multi-modal feature extraction methods, application areas, existing challenges, and future development directions. We hope that this review helps MCER researchers understand the current state of emotion recognition research, provides some inspiration, and supports the development of more efficient models.
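To make the fusion problem concrete, below is a minimal, illustrative sketch of one of the four categories named in the abstract (sequential context modeling): pre-extracted utterance-level features from each modality are projected into a shared space, concatenated, and contextualized by a recurrent network over the dialogue. The feature dimensions, the GRU-based fusion, and the seven-class label set are assumptions chosen for illustration, not an architecture prescribed by the survey.

```python
# Illustrative sketch only: sequential-context multi-modal fusion for
# conversational emotion recognition. Assumes utterance-level features have
# already been extracted per modality (e.g., text/audio/visual encoders).
import torch
import torch.nn as nn

class SequentialContextMCER(nn.Module):
    def __init__(self, text_dim=768, audio_dim=128, visual_dim=512,
                 hidden_dim=256, num_emotions=7):
        super().__init__()
        # Project each modality into a shared space before fusion.
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.audio_proj = nn.Linear(audio_dim, hidden_dim)
        self.visual_proj = nn.Linear(visual_dim, hidden_dim)
        # A bidirectional GRU models the sequential conversational context
        # across utterances (the "sequential context modeling" category).
        self.context_gru = nn.GRU(3 * hidden_dim, hidden_dim,
                                  batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, text, audio, visual):
        # Each input: (batch, num_utterances, modality_dim).
        fused = torch.cat([self.text_proj(text),
                           self.audio_proj(audio),
                           self.visual_proj(visual)], dim=-1)
        context, _ = self.context_gru(fused)  # contextualized utterances
        return self.classifier(context)       # per-utterance emotion logits

# Usage on random features: 2 dialogues of 10 utterances each.
model = SequentialContextMCER()
logits = model(torch.randn(2, 10, 768),
               torch.randn(2, 10, 128),
               torch.randn(2, 10, 512))
print(logits.shape)  # torch.Size([2, 10, 7])
```

Speaker-differentiated and speaker-relationship models in the survey's taxonomy extend this kind of sequential baseline by maintaining per-speaker states or by modeling inter-speaker dependencies with graph structures.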

Authors (5)
  1. Yuntao Shou (28 papers)
  2. Tao Meng (48 papers)
  3. Wei Ai (48 papers)
  4. Nan Yin (33 papers)
  5. Keqin Li (61 papers)
Citations (25)