Efficient Feature Extraction and Late Fusion Strategy for Audiovisual Emotional Mimicry Intensity Estimation (2403.11757v2)

Published 18 Mar 2024 in cs.MM, cs.LG, cs.SD, and eess.AS

Abstract: In this paper, we present our solution to the Emotional Mimicry Intensity (EMI) Estimation challenge, part of the 6th Affective Behavior Analysis in-the-wild (ABAW) Competition. The EMI Estimation task evaluates the emotional intensity of seed videos by assessing them against a set of predefined emotion categories (i.e., "Admiration", "Amusement", "Determination", "Empathic Pain", "Excitement", and "Joy"). To tackle this challenge, we extracted rich dual-channel visual features based on ResNet18 and Action Units (AUs) for the video modality, and effective single-channel features based on Wav2Vec2.0 for the audio modality, yielding comprehensive emotional features for the audiovisual modality. Additionally, leveraging a late fusion strategy, we averaged the predictions of the visual and acoustic models, resulting in a more accurate estimation of audiovisual emotional mimicry intensity. Experimental results validate the effectiveness of our approach, with the average Pearson's correlation coefficient ($\rho$) across the six emotion dimensions reaching 0.3288 on the validation set.
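The late-fusion step the abstract describes reduces to averaging the two unimodal prediction matrices and scoring with the mean Pearson's $\rho$ over the six emotion dimensions. Below is a minimal sketch of that step; the array names, shapes, and stand-in data are illustrative assumptions, not taken from the authors' released code.

```python
# Sketch of the paper's late-fusion strategy and evaluation metric.
# Assumption: `visual_preds` and `audio_preds` are per-sample intensity
# predictions of shape (n_samples, 6), one column per emotion category,
# and `labels` holds the ground-truth intensities in the same layout.
import numpy as np

EMOTIONS = ["Admiration", "Amusement", "Determination",
            "Empathic Pain", "Excitement", "Joy"]

def late_fusion(visual_preds: np.ndarray, audio_preds: np.ndarray) -> np.ndarray:
    """Average the unimodal predictions (the paper's late-fusion strategy)."""
    return (visual_preds + audio_preds) / 2.0

def mean_pearson(preds: np.ndarray, labels: np.ndarray) -> float:
    """Average Pearson's rho across the six emotion dimensions."""
    rhos = [np.corrcoef(preds[:, i], labels[:, i])[0, 1]
            for i in range(labels.shape[1])]
    return float(np.mean(rhos))

# Usage with random stand-in data (real inputs would be model outputs):
rng = np.random.default_rng(0)
labels = rng.random((100, len(EMOTIONS)))
visual_preds = labels + 0.1 * rng.standard_normal(labels.shape)
audio_preds = labels + 0.1 * rng.standard_normal(labels.shape)
fused = late_fusion(visual_preds, audio_preds)
print(f"mean Pearson rho: {mean_pearson(fused, labels):.4f}")
```

Averaging at the prediction level (rather than concatenating features early) keeps the visual and acoustic models independent, which is what makes the two-branch training the paper describes possible.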

Authors (3)
  1. Jun Yu (233 papers)
  2. Wangyuan Zhu (8 papers)
  3. Jichao Zhu (10 papers)
Citations (2)
