Recursive Joint Cross-Modal Attention for Multimodal Fusion in Dimensional Emotion Recognition (2403.13659v4)

Published 20 Mar 2024 in cs.CV, cs.SD, and eess.AS

Abstract: Though multimodal emotion recognition has achieved significant progress in recent years, the potential of rich synergistic relationships across modalities is not fully exploited. In this paper, we introduce Recursive Joint Cross-Modal Attention (RJCMA) to effectively capture both intra- and inter-modal relationships across audio, visual, and text modalities for dimensional emotion recognition. In particular, we compute attention weights based on the cross-correlation between the joint audio-visual-text feature representations and the feature representations of the individual modalities, thereby capturing intra- and inter-modal relationships simultaneously. The attended features of the individual modalities are fed back as input to the fusion model in a recursive mechanism to obtain more refined feature representations. We also explore Temporal Convolutional Networks (TCNs) to improve the temporal modeling of the individual modalities' feature representations. Extensive experiments evaluate the performance of the proposed fusion model on the challenging Aff-Wild2 dataset. By effectively capturing the synergistic intra- and inter-modal relationships across the audio, visual, and text modalities, the proposed fusion model achieves a Concordance Correlation Coefficient (CCC) of 0.585 (0.542) for valence and 0.674 (0.619) for arousal on the validation (test) set. This is a significant improvement over the baseline of 0.240 (0.211) for valence and 0.200 (0.191) for arousal on the validation (test) set, and earned second place in the valence-arousal challenge of the 6th Affective Behavior Analysis in-the-Wild (ABAW) competition.
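The fusion mechanism described in the abstract can be made concrete with a small sketch. The following is a minimal PyTorch illustration of the recursive joint cross-modal attention idea, assuming all three modalities are pre-projected to a common feature size over a shared temporal axis; the tanh-scaled correlation, the softmax normalization, the residual update, and all module and parameter names are assumptions made for illustration, not the authors' exact implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RecursiveJointCrossModalAttention(nn.Module):
    """Sketch of recursive joint cross-modal attention over three modalities.

    Inputs are audio, visual, and text features, each of shape
    (batch, L, d) with L time steps and a shared feature size d.
    """

    def __init__(self, d: int, num_iters: int = 2):
        super().__init__()
        self.num_iters = num_iters  # number of recursive refinement steps
        # Projects concatenated audio-visual-text features to a joint space.
        self.joint_proj = nn.Linear(3 * d, d)
        # Per-modality projections for the correlation and the output.
        self.corr_proj = nn.ModuleList(nn.Linear(d, d) for _ in range(3))
        self.out_proj = nn.ModuleList(nn.Linear(d, d) for _ in range(3))

    def forward(self, feats):
        # feats: [audio, visual, text], each of shape (batch, L, d).
        d = feats[0].size(-1)
        for _ in range(self.num_iters):
            # Joint audio-visual-text representation, (batch, L, d).
            joint = self.joint_proj(torch.cat(feats, dim=-1))
            refined = []
            for m, x in enumerate(feats):
                # Cross-correlation between one modality and the joint
                # representation across time steps, (batch, L, L).
                corr = torch.tanh(
                    self.corr_proj[m](x) @ joint.transpose(1, 2) / d ** 0.5
                )
                attn = F.softmax(corr, dim=-1)
                # Attend over the joint features; the residual keeps the
                # original modality stream intact.
                refined.append(x + self.out_proj[m](attn @ joint))
            # Recursive step: the attended features become the next input.
            feats = refined
        return feats

# Usage with assumed shapes: batch 8, 50 time steps, 128 features.
a, v, t = (torch.randn(8, 50, 128) for _ in range(3))
fused_a, fused_v, fused_t = RecursiveJointCrossModalAttention(d=128)([a, v, t])
```

In a full pipeline the refined per-modality outputs would typically be concatenated and passed to regressors predicting valence and arousal; the paper additionally applies TCNs to each modality before fusion for temporal modeling, which this sketch omits.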

Authors (2)
  1. R. Gnana Praveen
  2. Jahangir Alam
Citations (11)