CFN-ESA: A Cross-Modal Fusion Network with Emotion-Shift Awareness for Dialogue Emotion Recognition (2307.15432v2)
Abstract: Multimodal emotion recognition in conversation (ERC) has garnered growing attention from research communities in various fields. In this paper, we propose a Cross-modal Fusion Network with Emotion-Shift Awareness (CFN-ESA) for ERC. Existing approaches treat all modalities equally without distinguishing how much emotional information each carries, which makes it difficult to adequately extract complementary information from multimodal data. To address this problem, CFN-ESA treats the textual modality as the primary source of emotional information and the visual and acoustic modalities as secondary sources. In addition, most multimodal ERC models ignore emotion-shift information and over-focus on contextual information, causing them to fail in emotion-shift scenarios. We design an emotion-shift module to address this challenge. CFN-ESA mainly consists of a unimodal encoder (RUME), a cross-modal encoder (ACME), and an emotion-shift module (LESM). RUME extracts conversation-level contextual emotional cues while drawing the data distributions of the modalities closer together; ACME performs multimodal interaction centered on the textual modality; LESM models emotion shifts and captures emotion-shift information to guide the learning of the main task. Experimental results demonstrate that CFN-ESA effectively improves ERC performance and substantially outperforms state-of-the-art models.
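The abstract's two central ideas, text-centred cross-modal attention (the ACME role) and an auxiliary emotion-shift objective (the LESM role), can be illustrated with a minimal sketch. The code below is not the authors' implementation: the module name `TextCentredFusion`, the hidden size, the two-class shift head, the loss weight `lam`, and the choice to derive shift targets from consecutive emotion labels are all assumptions made only for this example, one plausible reading of the abstract rather than the paper's exact formulation.

```python
# Illustrative sketch (not the authors' code): text-centred cross-modal fusion
# with an auxiliary emotion-shift head. All names, sizes, and the loss
# weighting are assumptions for demonstration purposes.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TextCentredFusion(nn.Module):
    """Cross-modal attention in which text queries attend to audio/visual keys."""

    def __init__(self, dim: int = 256, heads: int = 4, num_emotions: int = 6):
        super().__init__()
        self.txt2aud = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.txt2vis = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.emotion_head = nn.Linear(3 * dim, num_emotions)  # main ERC task
        self.shift_head = nn.Linear(2 * dim, 2)               # auxiliary: shift / no shift

    def forward(self, text, audio, visual):
        # text/audio/visual: (batch, seq_len, dim) utterance-level features,
        # e.g. produced by per-modality conversation encoders (a RUME-like role).
        a, _ = self.txt2aud(text, audio, audio)    # text-queried audio context
        v, _ = self.txt2vis(text, visual, visual)  # text-queried visual context
        fused = torch.cat([text, a, v], dim=-1)
        emo_logits = self.emotion_head(fused)

        # Emotion-shift cue between adjacent utterances (a LESM-like role):
        # compare each utterance's text state with its predecessor's.
        pairs = torch.cat([text[:, 1:], text[:, :-1]], dim=-1)
        shift_logits = self.shift_head(pairs)
        return emo_logits, shift_logits


def joint_loss(emo_logits, emo_labels, shift_logits, lam: float = 0.3):
    """Main ERC loss plus an auxiliary emotion-shift loss derived from labels."""
    main = F.cross_entropy(emo_logits.flatten(0, 1), emo_labels.flatten())
    # A shift is assumed whenever consecutive utterances carry different labels.
    shift_targets = (emo_labels[:, 1:] != emo_labels[:, :-1]).long()
    aux = F.cross_entropy(shift_logits.flatten(0, 1), shift_targets.flatten())
    return main + lam * aux


if __name__ == "__main__":
    B, T, D = 2, 8, 256
    model = TextCentredFusion(dim=D)
    feats = [torch.randn(B, T, D) for _ in range(3)]   # text, audio, visual
    labels = torch.randint(0, 6, (B, T))
    emo_logits, shift_logits = model(*feats)
    print(joint_loss(emo_logits, labels, shift_logits).item())
```

The text-as-query design mirrors the abstract's claim that the textual modality is the primary emotional source: audio and visual features only contribute through attention conditioned on text, and the shift head supplies an auxiliary signal rather than replacing the main emotion classifier.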
Authors: Jiang Li, Xiaoping Wang, Yingjian Liu, Zhigang Zeng