
Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition (2312.16778v2)

Published 28 Dec 2023 in cs.CL

Abstract: With the release of a growing number of open-source emotion recognition datasets from social media platforms and the rapid development of computing resources, the multimodal emotion recognition (MER) task has begun to receive widespread research attention. MER extracts and fuses complementary semantic information from different modalities in order to classify the speaker's emotions. However, existing feature fusion methods usually map the features of different modalities into the same feature space, which cannot eliminate the heterogeneity between modalities and therefore makes subsequent learning of emotion class boundaries difficult. To tackle these problems, we propose a novel Adversarial Representation with Intra-Modal and Inter-Modal Graph Contrastive Learning for Multimodal Emotion Recognition (AR-IIGCN) method. First, we input video, audio, and text features into a multi-layer perceptron (MLP) to map them into separate feature spaces. Second, we build a generator and a discriminator for the three modal features via adversarial representation learning, which enables information interaction between modalities and eliminates inter-modal heterogeneity. Third, we introduce contrastive graph representation learning to capture intra-modal and inter-modal complementary semantic information and to learn intra-class and inter-class boundary information of the emotion categories. Specifically, we construct a graph structure over the three modal features and perform contrastive representation learning on nodes with different emotions in the same modality and the same emotion in different modalities, which improves the representation ability of the node features. Extensive experiments show that the AR-IIGCN method significantly improves emotion recognition accuracy on the IEMOCAP and MELD datasets.
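To make the three stages concrete, below is a minimal, hypothetical PyTorch sketch of the pipeline the abstract describes: per-modality MLP projection, an adversarial modality discriminator that the projectors learn to fool (reducing inter-modal heterogeneity), and a supervised contrastive loss over graph-node features in which nodes sharing an emotion label, within or across modalities, act as positives. The class names, dimensions, and loss combination are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of the AR-IIGCN pipeline described above (assumed PyTorch,
# invented dimensions); not the authors' released implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityProjector(nn.Module):
    """Stage 1: MLP that maps a raw modality feature into its own 128-d space."""
    def __init__(self, in_dim, hidden=256, out_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                 nn.Linear(hidden, out_dim))

    def forward(self, x):
        return self.net(x)


class ModalityDiscriminator(nn.Module):
    """Stage 2: predicts which modality a feature came from; the projectors are
    trained adversarially to fool it, reducing inter-modal heterogeneity."""
    def __init__(self, dim=128, n_modalities=3):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(dim, 64), nn.ReLU(),
                                 nn.Linear(64, n_modalities))

    def forward(self, z):
        return self.net(z)


def supervised_contrastive_loss(z, labels, temperature=0.1):
    """Stage 3: graph-node contrastive loss. Nodes that share an emotion label
    (same or different modality) are positives; all other nodes are negatives."""
    z = F.normalize(z, dim=-1)
    sim = z @ z.t() / temperature
    n = z.size(0)
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    pos_mask = labels.unsqueeze(0).eq(labels.unsqueeze(1)) & ~self_mask
    log_prob = sim - sim.exp().masked_fill(self_mask, 0.0).sum(1, keepdim=True).log()
    loss = -(log_prob * pos_mask).sum(1) / pos_mask.sum(1).clamp(min=1)
    return loss[pos_mask.any(1)].mean()


# Toy batch: 4 utterances observed in 3 modalities (video, audio, text).
torch.manual_seed(0)
proj_v, proj_a, proj_t = ModalityProjector(512), ModalityProjector(100), ModalityProjector(768)
disc = ModalityDiscriminator()
z = torch.cat([proj_v(torch.randn(4, 512)),
               proj_a(torch.randn(4, 100)),
               proj_t(torch.randn(4, 768))], dim=0)        # 12 graph nodes
emotion = torch.randint(0, 6, (4,)).repeat(3)              # same utterance label per modality
modality = torch.arange(3).repeat_interleave(4)            # 0=video, 1=audio, 2=text

adv_loss = F.cross_entropy(disc(z), modality)              # discriminator objective
con_loss = supervised_contrastive_loss(z, emotion)         # intra-/inter-modal contrast
projector_loss = con_loss - adv_loss                       # projectors try to fool the discriminator
```

In the full method the contrastive loss would presumably be computed on GCN-refined node features rather than raw projections, and the generator and discriminator would be updated alternately; the toy code above only illustrates how the adversarial and contrastive objectives fit together.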

Authors (5)
  1. Yuntao Shou (28 papers)
  2. Tao Meng (48 papers)
  3. Wei Ai (48 papers)
  4. Keqin Li (61 papers)
  5. Nan Yin (33 papers)
Citations (13)