Masked Graph Learning with Recurrent Alignment for Multimodal Emotion Recognition in Conversation (2407.16714v1)
Abstract: Because Multimodal Emotion Recognition in Conversation (MERC) can be applied to public opinion monitoring, intelligent dialogue robots, and other fields, it has received extensive research attention in recent years. Unlike traditional unimodal emotion recognition, MERC can fuse complementary semantic information across multiple modalities (e.g., text, audio, and vision) to improve emotion recognition. However, previous work has ignored inter-modal alignment and intra-modal noise before multimodal fusion, fusing multimodal features directly, which hinders representation learning. In this study, we develop a novel approach called Masked Graph Learning with Recurrent Alignment (MGLRA) to tackle this problem: it uses a recurrent iterative module with memory to align multimodal features and then applies a masked GCN for multimodal feature fusion. First, we employ an LSTM to capture contextual information and a graph attention filtering mechanism to effectively eliminate noise within each modality. Second, we build a recurrent iteration module with a memory function that exploits communication between modalities to narrow the inter-modal gap and achieve a preliminary alignment of features. Then, a cross-modal multi-head attention mechanism is introduced to complete feature alignment across modalities, and a masked GCN is constructed for multimodal feature fusion, performing random mask reconstruction on graph nodes to obtain better node feature representations. Finally, we use a multilayer perceptron (MLP) for emotion recognition. Extensive experiments on two benchmark datasets (i.e., IEMOCAP and MELD) demonstrate that MGLRA outperforms state-of-the-art methods.
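To make two of the abstract's components more concrete, the sketch below illustrates (1) a cross-modal multi-head attention step that aligns one modality's utterance features against another's, and (2) a masked-GCN-style layer that randomly masks node features and reconstructs them. This is a minimal illustration under our own assumptions, not the authors' implementation; all module names, dimensions, and the mask ratio are hypothetical.

```python
# Illustrative sketch (not the paper's code) of cross-modal attention alignment
# followed by a masked GCN layer with random node-feature reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F


class CrossModalAlignment(nn.Module):
    """Aligns a query modality (e.g., text) against another modality (e.g., audio)."""

    def __init__(self, dim: int = 128, num_heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_mod: torch.Tensor, other_mod: torch.Tensor) -> torch.Tensor:
        # query_mod, other_mod: (batch, seq_len, dim) utterance features
        aligned, _ = self.attn(query_mod, other_mod, other_mod)
        return self.norm(query_mod + aligned)  # residual keeps the original modality content


class MaskedGCNLayer(nn.Module):
    """One GCN layer that randomly masks node features and reconstructs them."""

    def __init__(self, dim: int = 128, mask_ratio: float = 0.3):
        super().__init__()
        self.weight = nn.Linear(dim, dim)
        self.mask_token = nn.Parameter(torch.zeros(dim))
        self.decoder = nn.Linear(dim, dim)
        self.mask_ratio = mask_ratio

    def forward(self, x: torch.Tensor, adj: torch.Tensor):
        # x: (num_nodes, dim) utterance-node features; adj: (num_nodes, num_nodes) with self-loops
        if self.training:
            mask = torch.rand(x.size(0), device=x.device) < self.mask_ratio
            x_in = x.clone()
            x_in[mask] = self.mask_token  # replace masked nodes with a learnable token
        else:
            mask = torch.zeros(x.size(0), dtype=torch.bool, device=x.device)
            x_in = x
        # symmetric normalisation D^{-1/2} A D^{-1/2}
        deg = adj.sum(dim=1).clamp(min=1.0)
        norm_adj = adj / torch.sqrt(deg.unsqueeze(1) * deg.unsqueeze(0))
        h = F.relu(self.weight(norm_adj @ x_in))
        # reconstruction loss on the masked nodes only
        recon_loss = F.mse_loss(self.decoder(h[mask]), x[mask]) if mask.any() else x.new_zeros(())
        return h, recon_loss


# Toy usage: 6 utterance nodes, 128-dim features, chain-shaped conversation graph.
align, gcn = CrossModalAlignment(), MaskedGCNLayer()
text, audio = torch.randn(1, 6, 128), torch.randn(1, 6, 128)
fused = align(text, audio).squeeze(0)
adj = torch.eye(6) + torch.diag(torch.ones(5), 1) + torch.diag(torch.ones(5), -1)
node_repr, aux_loss = gcn(fused, adj)
```

In a full model the reconstruction term would presumably be added to the emotion-classification loss as an auxiliary objective; how the paper weights or schedules it is not stated in the abstract, so that detail is left open here.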
Tao Meng
FuChen Zhang
Yuntao Shou
HongEn Shao
Wei Ai
Keqin Li