MSEVA: A System for Multimodal Short Videos Emotion Visual Analysis (2312.04279v2)
Abstract: YouTube Shorts, a section launched by YouTube in 2021 as a direct competitor to short-video platforms such as TikTok, reflects the rising demand for short video content among online users. Social media platforms are frequently flooded with short videos that capture different perspectives and emotions on trending events; such videos can go viral and strongly influence public mood and opinion. Affective computing for short videos, however, has received little research attention, and manually monitoring public emotion through these videos is too slow and labor-intensive to prevent undesirable outcomes. In this paper, we construct the first multimodal dataset of short news videos covering trending events, propose an automatic technique for audio segmentation and transcription, and improve the accuracy of a multimodal affective computing model by about 4.17% through targeted optimization. We further propose MSEVA, a novel system for emotion analysis of short videos. MSEVA achieves good results on the bili-news dataset and applies multimodal emotion analysis in real-world settings, supporting timely guidance of public opinion and curbing the spread of negative emotions. Data and code from our investigations can be accessed at: http://xxx.github.com.
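To make the audio segmenting-and-transcribing step concrete, the sketch below shows one way such a stage could be wired up. It is a minimal illustration only: the choice of ffmpeg for audio extraction, the openai-whisper package for segmentation and transcription, and all function and file names are assumptions, not the authors' released MSEVA pipeline.

```python
# Hypothetical sketch of an audio segmenting-and-transcribing stage for short
# videos. Tool choices (ffmpeg, openai-whisper) are assumptions for
# illustration and may differ from the MSEVA implementation.
import subprocess
import whisper  # pip install openai-whisper


def extract_audio(video_path: str, wav_path: str = "audio.wav") -> str:
    """Extract a 16 kHz mono WAV track from a short video using ffmpeg."""
    subprocess.run(
        ["ffmpeg", "-y", "-i", video_path, "-ac", "1", "-ar", "16000", wav_path],
        check=True,
    )
    return wav_path


def segment_and_transcribe(video_path: str, language: str = "zh"):
    """Return (start, end, text) tuples from Whisper's built-in segmentation."""
    model = whisper.load_model("base")
    result = model.transcribe(extract_audio(video_path), language=language)
    return [(seg["start"], seg["end"], seg["text"]) for seg in result["segments"]]


if __name__ == "__main__":
    # "example_short_video.mp4" is a placeholder file name.
    for start, end, text in segment_and_transcribe("example_short_video.mp4"):
        print(f"[{start:7.2f}s - {end:7.2f}s] {text}")
```

In a full system, the timestamped segments would then be aligned with the corresponding visual frames and fed to the multimodal emotion model as the text modality.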