
Multimodal Fusion Method with Spatiotemporal Sequences and Relationship Learning for Valence-Arousal Estimation (2403.12425v2)

Published 19 Mar 2024 in cs.CV, cs.SD, and eess.AS

Abstract: This paper presents our approach to the VA (Valence-Arousal) estimation task in the ABAW6 competition. We devised a comprehensive model by preprocessing video frames and audio segments to extract visual and audio features. Using Temporal Convolutional Network (TCN) modules, we captured the temporal and spatial correlations between these features. We then employed a Transformer encoder to learn long-range dependencies, improving the model's performance and generalization ability. Our method follows a multimodal fusion approach: pre-trained audio and video backbones extract features, TCN modules perform spatiotemporal encoding, and a Transformer encoder captures temporal information. Experimental results demonstrate the effectiveness of our approach, achieving competitive performance in VA estimation on the Aff-Wild2 dataset.
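The abstract describes the pipeline only at a high level (pretrained backbones → TCN → Transformer encoder → per-frame VA regression). The sketch below is a minimal, hypothetical PyTorch rendering of that pipeline; all module names, feature dimensions, layer counts, and other hyperparameters are assumptions for illustration, not the authors' actual configuration.

```python
# Hypothetical sketch of the pipeline described in the abstract:
# pretrained backbone features -> TCN -> Transformer encoder -> VA regression.
# Dimensions and hyperparameters are illustrative assumptions.
import torch
import torch.nn as nn


class TCNBlock(nn.Module):
    """One dilated temporal convolution block with a residual connection."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        padding = (kernel_size - 1) * dilation // 2  # "same" length output
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=padding, dilation=dilation)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return x + self.act(self.norm(self.conv(x)))


class VAFusionModel(nn.Module):
    def __init__(self, visual_dim: int = 512, audio_dim: int = 128,
                 d_model: int = 256, num_layers: int = 4):
        super().__init__()
        # Project concatenated audio-visual features to a common width.
        self.fuse = nn.Linear(visual_dim + audio_dim, d_model)
        # TCN stack with growing dilation for local temporal correlations.
        self.tcn = nn.Sequential(*[TCNBlock(d_model, dilation=2 ** i)
                                   for i in range(3)])
        # Transformer encoder for long-range temporal dependencies.
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=num_layers)
        # Per-frame regression head: valence and arousal in [-1, 1].
        self.head = nn.Sequential(nn.Linear(d_model, 2), nn.Tanh())

    def forward(self, visual: torch.Tensor, audio: torch.Tensor) -> torch.Tensor:
        # visual: (batch, time, visual_dim), audio: (batch, time, audio_dim),
        # both assumed to come from frozen pretrained backbones.
        x = self.fuse(torch.cat([visual, audio], dim=-1))  # (B, T, d_model)
        x = self.tcn(x.transpose(1, 2)).transpose(1, 2)    # Conv1d expects (B, C, T)
        x = self.encoder(x)                                # (B, T, d_model)
        return self.head(x)                                # (B, T, 2): (valence, arousal)


# Usage example: two clips of 16 frames with per-frame features.
model = VAFusionModel()
va = model(torch.randn(2, 16, 512), torch.randn(2, 16, 128))
print(va.shape)  # torch.Size([2, 16, 2])
```

Note the symmetric (non-causal) padding in the TCN block, which lets each frame attend to nearby future frames; a streaming system would instead use causal padding. The paper does not specify which variant the authors chose.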
