EALD-MLLM: Emotion Analysis in Long-sequential and De-identity videos with Multi-modal Large Language Model (2405.00574v1)
Abstract: Emotion AI is the ability of computers to understand human emotional states. Existing works have achieved promising progress, but two limitations remain to be solved: 1) Previous studies have focused more on short sequential video emotion analysis while overlooking long sequential videos. However, the emotions in short sequential videos only reflect instantaneous emotions, which may be deliberately guided or hidden. In contrast, long sequential videos can reveal authentic emotions; 2) Previous studies commonly utilize various signals such as facial, speech, and even sensitive biological signals (e.g., electrocardiogram). However, due to the increasing demand for privacy, developing Emotion AI without relying on sensitive signals is becoming important. To address these limitations, in this paper we construct a dataset for Emotion Analysis in Long-sequential and De-identity videos, called EALD, by collecting and processing athletes' post-match interviews. In addition to annotating the overall emotional state of each video, we also provide Non-Facial Body Language (NFBL) annotations for each player. NFBL is an inner-driven emotional expression and can serve as an identity-free clue to understanding the emotional state. Moreover, we provide a simple but effective baseline for further research. More precisely, we evaluate Multimodal LLMs (MLLMs) with de-identified signals (e.g., visual, speech, and NFBLs) to perform emotion analysis. Our experimental results demonstrate that: 1) MLLMs can achieve comparable, or even better, performance than supervised single-modal models, even in a zero-shot scenario; 2) NFBL is an important cue in long sequential emotion analysis. EALD will be available on an open-source platform.
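The sketch below illustrates, under stated assumptions, the kind of zero-shot MLLM baseline the abstract describes: a de-identified interview clip plus text descriptions of annotated NFBL cues are fed to a video-language model, and its free-form answer is mapped back to an overall emotion label. The prompt wording, the binary positive/negative label set, and the `query_mllm` callable are illustrative assumptions, not the authors' released code or the exact EALD label space.

```python
# Minimal sketch of zero-shot emotion analysis with an MLLM over de-identified
# inputs. Assumes a generic `query_mllm(video_path, prompt) -> str` wrapper
# around whatever video-language model is used (e.g., a Video-LLaMA-style model).

from typing import Callable, List

EMOTION_LABELS = ["positive", "negative"]  # illustrative overall-state labels

PROMPT_TEMPLATE = (
    "This is a de-identified post-match interview (face obscured, voice anonymized). "
    "Non-facial body language cues observed: {nfbl}. "
    "Answer with one word, 'positive' or 'negative', for the athlete's overall emotional state."
)


def classify_clip(
    video_path: str,
    nfbl_cues: List[str],
    query_mllm: Callable[[str, str], str],
) -> str:
    """Zero-shot emotion prediction for one long, de-identified clip."""
    # NFBL annotations are injected as text, since identity cues are removed from the video.
    prompt = PROMPT_TEMPLATE.format(nfbl=", ".join(nfbl_cues) or "none annotated")
    answer = query_mllm(video_path, prompt).lower()
    # Map the model's free-form answer back onto the label set.
    for label in EMOTION_LABELS:
        if label in answer:
            return label
    return "negative"  # conservative fallback for ambiguous answers


if __name__ == "__main__":
    # Stub backend so the sketch runs end-to-end; replace with real MLLM inference.
    stub = lambda video, prompt: "The athlete appears negative overall."
    print(classify_clip("interview_001.mp4", ["covering the face", "crossed arms"], stub))
```

Because the visual stream is de-identified, the NFBL cues arrive only as textual annotations in the prompt; swapping them out of the prompt gives a simple way to probe how much they contribute to the prediction.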
Authors: Deng Li, Xin Liu, Bohao Xing, Baiqiang Xia, Yuan Zong, Bihan Wen, Heikki Kälviäinen