Trustworthy Multimodal Fusion for Sentiment Analysis in Ordinal Sentiment Space (2404.08923v1)
Abstract: Multimodal video sentiment analysis aims to integrate information from multiple modalities to analyze speakers' opinions and attitudes. Most previous work focuses on exploring intra- and inter-modality semantic interactions. However, these works ignore the reliability of multimodal data: modalities tend to contain noise, semantic ambiguity, missing modalities, and so on. In addition, previous multimodal approaches treat different modalities equally, largely ignoring their differing contributions. Furthermore, existing multimodal sentiment analysis methods directly regress sentiment scores without considering the ordinal relationships among sentiment categories, which limits performance. To address these problems, we propose a trustworthy multimodal sentiment ordinal network (TMSON) to improve sentiment analysis performance. Specifically, we first devise a unimodal feature extractor for each modality to obtain modality-specific features. Then, a customized uncertainty distribution estimation network estimates the unimodal uncertainty distributions. Next, Bayesian fusion is performed on the learned unimodal distributions to obtain a multimodal distribution for sentiment prediction. Finally, an ordinal-aware sentiment space is constructed, where ordinal regression is used to constrain the multimodal distributions. Our proposed TMSON outperforms baselines on multimodal sentiment analysis tasks, and empirical results demonstrate that TMSON reduces uncertainty to obtain more robust predictions.
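The abstract describes estimating a per-modality uncertainty distribution and then combining these distributions with Bayesian fusion before prediction. As an illustration only, the minimal sketch below fuses per-modality Gaussian estimates with a precision-weighted product-of-Gaussians rule; the modality names, dimensions, and this particular fusion rule are assumptions chosen for clarity and are not taken from the paper's implementation.

```python
# Illustrative sketch (not the authors' code): Bayesian fusion of per-modality
# Gaussian uncertainty estimates via a product-of-Gaussians rule. All names and
# values below are hypothetical.
import numpy as np


def fuse_gaussians(mus, sigmas2):
    """Precision-weighted fusion of independent Gaussian estimates.

    mus:     list of per-modality mean vectors, each of shape (d,)
    sigmas2: list of per-modality variance vectors, each of shape (d,)
    Returns the fused mean and variance of the (renormalized) product of Gaussians.
    """
    precisions = [1.0 / s2 for s2 in sigmas2]        # higher precision = more trusted modality
    fused_precision = np.sum(precisions, axis=0)
    fused_var = 1.0 / fused_precision
    fused_mu = fused_var * np.sum([p * m for p, m in zip(precisions, mus)], axis=0)
    return fused_mu, fused_var


# Toy example: three modalities (text, audio, visual) predicting a 1-D sentiment score.
mus     = [np.array([1.2]), np.array([0.4]), np.array([0.9])]   # per-modality means
sigmas2 = [np.array([0.1]), np.array([1.0]), np.array([0.3])]   # per-modality variances
mu, var = fuse_gaussians(mus, sigmas2)
print("fused sentiment score:", mu, "fused variance:", var)
```

Under this kind of rule, a noisier modality (larger variance) contributes less to the fused estimate, which matches the abstract's point that modalities should not be weighted equally; the paper's actual fusion and ordinal-regression constraints may differ in form.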
Authors: Zhuyang Xie, Yan Yang, Jie Wang, Xiaorong Liu, Xiaofan Li