Multimodal Group Emotion Recognition In-the-wild Using Privacy-Compliant Features (2312.05265v1)
Abstract: This paper explores privacy-compliant group-level emotion recognition "in-the-wild" within the EmotiW Challenge 2023. Group-level emotion recognition can be useful in many fields, including social robotics, conversational agents, e-coaching and learning analytics. This work restricts itself to global features and avoids individual ones, i.e. any feature that can be used to identify or track people in videos (facial landmarks, body poses, audio diarization, etc.). The proposed multimodal model is composed of a video branch and an audio branch, with cross-attention between the two modalities. The video branch is based on a fine-tuned ViT architecture. The audio branch extracts Mel-spectrograms and feeds them through CNN blocks into a transformer encoder. Our training paradigm includes a generated synthetic dataset that increases, in a data-driven way, the model's sensitivity to facial expressions within the image. Extensive experiments support the relevance of our methodology. Our privacy-compliant proposal performs fairly well on the EmotiW challenge, with the best models reaching an accuracy of 79.24% on the validation set and 75.13% on the test set. Notably, our findings show that this accuracy level can be reached with privacy-compliant features using only 5 frames uniformly distributed over the video.
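As an illustration of the "5 frames uniformly distributed" idea, here is a minimal sketch of one way to pick evenly spaced frame indices from a video. The abstract does not say how the frames are selected, so the function name `uniform_frame_indices` and the choice of taking the middle frame of each of `k` equal segments are assumptions made for this example, not the authors' procedure.

```python
def uniform_frame_indices(num_frames: int, k: int = 5) -> list[int]:
    """Return k frame indices spread evenly over a video of num_frames frames.

    Assumption (not from the paper): the video is split into k equal
    segments and the middle frame of each segment is selected.
    """
    if num_frames <= 0 or k <= 0:
        raise ValueError("num_frames and k must be positive")
    # Integer arithmetic for the segment centres: floor((i + 0.5) * num_frames / k).
    # If num_frames < k, some indices repeat, which keeps the output length at k.
    return [((2 * i + 1) * num_frames) // (2 * k) for i in range(k)]


if __name__ == "__main__":
    # A hypothetical 100-frame clip, sampled with the default k = 5.
    print(uniform_frame_indices(100))
```

Only the selected frames would then be passed to the video branch; other sampling rules (for example, including the first and last frame) are equally compatible with the abstract's wording.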
Authors: Anderson Augusma, Dominique Vaufreydaz, Frédérique Letué