Cooperative Sentiment Agents for Multimodal Sentiment Analysis (2404.12642v1)

Published 19 Apr 2024 in cs.CL and cs.CV

Abstract: In this paper, we propose a new Multimodal Representation Learning (MRL) method for Multimodal Sentiment Analysis (MSA), named Co-SA, which facilitates adaptive interaction between modalities through Cooperative Sentiment Agents. Co-SA comprises two critical components: the Sentiment Agents Establishment (SAE) phase and the Sentiment Agents Cooperation (SAC) phase. During the SAE phase, each sentiment agent processes a unimodal signal and highlights explicit dynamic sentiment variations within the modality via the Modality-Sentiment Disentanglement (MSD) and Deep Phase Space Reconstruction (DPSR) modules. Subsequently, in the SAC phase, Co-SA designs task-specific interaction mechanisms for the sentiment agents so that multimodal signals are coordinated to learn the joint representation. Specifically, Co-SA equips each sentiment agent with an independent policy model that captures significant properties within its modality. These policies are optimized jointly through a unified reward adapted to downstream tasks. Benefiting from this rewarding mechanism, Co-SA transcends the limitations of pre-defined fusion modes and adaptively captures unimodal properties for MRL in the multimodal interaction setting. To demonstrate its effectiveness, we apply Co-SA to the Multimodal Sentiment Analysis (MSA) and Multimodal Emotion Recognition (MER) tasks. Comprehensive experimental results demonstrate that Co-SA excels at discovering diverse cross-modal features, encompassing both common and complementary aspects. The code is available at https://github.com/smwanghhh/Co-SA.
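The abstract's core idea, one agent per modality, each with an independent policy model, all trained through a unified task-driven reward, can be illustrated with a minimal PyTorch sketch. The class names (SentimentAgent, CoSASketch), the "property branch" selection, and the REINFORCE-style update below are illustrative assumptions, not the authors' implementation; the actual code is in the linked repository.

```python
# Minimal sketch: cooperative per-modality agents sharing a unified reward.
# All names and design details here are assumptions for illustration only.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical

class SentimentAgent(nn.Module):
    """One agent per modality: a sequence encoder plus an independent policy head."""
    def __init__(self, input_dim, hidden_dim, n_actions):
        super().__init__()
        self.encoder = nn.GRU(input_dim, hidden_dim, batch_first=True)
        self.policy = nn.Linear(hidden_dim, n_actions)   # per-agent policy model

    def forward(self, x):
        _, h = self.encoder(x)          # summarize the unimodal sequence
        h = h.squeeze(0)                # (batch, hidden_dim)
        dist = Categorical(logits=self.policy(h))
        action = dist.sample()          # which unimodal "property" to emphasize
        return h, action, dist.log_prob(action)

class CoSASketch(nn.Module):
    """Agents cooperate by fusing their selected features for a downstream task."""
    def __init__(self, dims, hidden_dim=64, n_actions=4):
        super().__init__()
        self.agents = nn.ModuleList(SentimentAgent(d, hidden_dim, n_actions) for d in dims)
        self.property_banks = nn.ModuleList(
            nn.ModuleList(nn.Linear(hidden_dim, hidden_dim) for _ in range(n_actions))
            for _ in dims)
        self.head = nn.Linear(hidden_dim * len(dims), 1)  # sentiment regression head

    def forward(self, inputs):
        feats, log_probs = [], []
        for agent, banks, x in zip(self.agents, self.property_banks, inputs):
            h, action, log_p = agent(x)
            # each agent routes its feature through the branch chosen by its policy
            chosen = torch.stack([banks[int(a)](h[i]) for i, a in enumerate(action)])
            feats.append(chosen)
            log_probs.append(log_p)
        joint = torch.cat(feats, dim=-1)
        return self.head(joint).squeeze(-1), torch.stack(log_probs, dim=0)

# Toy training step: the task loss trains the fusion head directly and, through a
# shared REINFORCE-style reward, also drives every agent's policy.
model = CoSASketch(dims=[300, 74, 35])             # e.g. text / audio / visual feature dims
inputs = [torch.randn(8, 20, d) for d in [300, 74, 35]]
target = torch.randn(8)
pred, log_probs = model(inputs)
task_loss = F.mse_loss(pred, target)
reward = -task_loss.detach()                       # unified reward shared by all agents
policy_loss = -(reward * log_probs).mean()
(task_loss + policy_loss).backward()
```

The key point the sketch tries to capture is that the fusion behavior is not hard-coded: each agent's policy decides which unimodal property to contribute, and all policies are improved together by the same downstream reward rather than by a pre-defined fusion rule.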

Authors (4)
  1. Shanmin Wang (37 papers)
  2. Hui Shuai (7 papers)
  3. Qingshan Liu (46 papers)
  4. Fei Wang (573 papers)
Citations (1)