Tracing Intricate Cues in Dialogue: Joint Graph Structure and Sentiment Dynamics for Multimodal Emotion Recognition (2407.21536v1)
Abstract: Multimodal emotion recognition in conversation (MERC) has recently attracted substantial research attention. Existing MERC methods face several challenges: (1) they fail to fully exploit direct inter-modal cues, which can leave cross-modal modeling incomplete; (2) they extract information from the same and from different modalities concurrently at each network layer, which can introduce conflicts when fusing multi-source data; (3) they lack the flexibility to track dynamic sentiment changes, which can lead to misclassification of utterances with abrupt sentiment shifts. To address these issues, a novel approach named GraphSmile is proposed for tracking intricate emotional cues in multimodal dialogues. GraphSmile comprises two key components, the GSF and SDP modules. GSF leverages graph structures to alternately assimilate inter-modal and intra-modal emotional dependencies layer by layer, capturing cross-modal cues while avoiding fusion conflicts. SDP is an auxiliary task that explicitly models the sentiment dynamics between utterances, improving the model's ability to distinguish sentiment shifts. Furthermore, GraphSmile extends readily to multimodal sentiment analysis in conversation (MSAC), yielding a unified multimodal affective model that handles both MERC and MSAC. Empirical results on multiple benchmarks demonstrate that GraphSmile handles complex emotional and sentimental patterns and significantly outperforms baseline models.
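The abstract describes GSF and SDP only at a high level. As a rough illustration of how an alternating inter-/intra-modal graph fusion stack and a sentiment-dynamics auxiliary head could fit together, the PyTorch sketch below is a minimal reading of that description; the class names, the dense row-normalized adjacency aggregation, the mean pooling over modality nodes, and the binary shift target are all assumptions for illustration, not the authors' implementation.

```python
# Minimal sketch (not the authors' code) of two ideas from the abstract:
# (1) alternating inter-modal / intra-modal graph aggregation, layer by layer;
# (2) an auxiliary "sentiment dynamics" head that predicts whether sentiment
#     shifts between adjacent utterances. All names and sizes are assumed.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AlternatingGraphFusion(nn.Module):
    """One block: an inter-modal hop followed by an intra-modal hop."""

    def __init__(self, dim: int):
        super().__init__()
        self.inter_proj = nn.Linear(dim, dim)
        self.intra_proj = nn.Linear(dim, dim)

    @staticmethod
    def _aggregate(h: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        # Row-normalized dense adjacency as a simple stand-in for a GNN layer.
        deg = adj.sum(dim=-1, keepdim=True).clamp(min=1.0)
        return (adj / deg) @ h

    def forward(self, h, inter_adj, intra_adj):
        # h: (num_utterances * num_modalities, dim); residual connections keep
        # the inter-modal and intra-modal messages in separate sub-steps.
        h = F.relu(self.inter_proj(self._aggregate(h, inter_adj))) + h
        h = F.relu(self.intra_proj(self._aggregate(h, intra_adj))) + h
        return h


class GraphSmileSketch(nn.Module):
    def __init__(self, dim: int = 128, num_layers: int = 2, num_modalities: int = 3,
                 num_emotions: int = 7, num_shift_classes: int = 2):
        super().__init__()
        self.num_modalities = num_modalities
        self.layers = nn.ModuleList(
            AlternatingGraphFusion(dim) for _ in range(num_layers))
        self.emotion_head = nn.Linear(dim, num_emotions)        # main MERC task
        # SDP-style auxiliary head: classify the sentiment change between
        # adjacent utterances from their concatenated fused representations.
        self.dynamics_head = nn.Linear(2 * dim, num_shift_classes)

    def forward(self, h, inter_adj, intra_adj):
        # Nodes are ordered utterance-major: node index = utterance * M + modality.
        for layer in self.layers:
            h = layer(h, inter_adj, intra_adj)
        fused = h.view(-1, self.num_modalities, h.size(-1)).mean(dim=1)
        emotion_logits = self.emotion_head(fused)               # per utterance
        pairs = torch.cat([fused[:-1], fused[1:]], dim=-1)      # adjacent utterances
        dynamics_logits = self.dynamics_head(pairs)
        return emotion_logits, dynamics_logits
```

Under these assumptions, the dynamics logits would contribute an auxiliary loss term alongside the main emotion-classification loss during training, which is how the abstract's explicit supervision of sentiment dynamics would plug into the model.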
Authors: Jiang Li, Xiaoping Wang, Zhigang Zeng