Cross-Attention is Not Always Needed: Dynamic Cross-Attention for Audio-Visual Dimensional Emotion Recognition (2403.19554v1)

Published 28 Mar 2024 in cs.CV

Abstract: In video-based emotion recognition, the audio and visual modalities are often expected to have a complementary relationship, which is widely explored using cross-attention. However, they may also exhibit weak complementary relationships, which lead to poor audio-visual feature representations and degrade system performance. To address this issue, we propose Dynamic Cross-Attention (DCA), which dynamically selects cross-attended or unattended features on the fly, depending on whether the modalities exhibit a strong or weak complementary relationship with each other. Specifically, a simple yet efficient gating layer is designed to evaluate the contribution of the cross-attention mechanism, choosing cross-attended features only when the modalities exhibit a strong complementary relationship and unattended features otherwise. We evaluate the proposed approach on the challenging RECOLA and Aff-Wild2 datasets. We also compare it with other variants of cross-attention and show that the proposed model consistently improves performance on both datasets.
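The selection idea in the abstract can be made concrete with a short sketch: a standard cross-attention block whose output is blended with the unattended input by a learned gate. The following PyTorch code is a minimal illustration, not the authors' implementation; the class name, dimensions, and the soft sigmoid gate are all assumptions (the paper's actual gating layer may, for example, make a harder selection between the two feature sets).

```python
import torch
import torch.nn as nn

class DynamicCrossAttention(nn.Module):
    """Illustrative sketch of the DCA gating idea: blend cross-attended
    and unattended features based on a learned gate. All layer choices
    and sizes here are assumptions, not the paper's exact architecture."""

    def __init__(self, dim: int, num_heads: int = 4):
        super().__init__()
        # Standard cross-attention: queries from one modality,
        # keys/values from the other.
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Gating layer: scores how much the cross-attended view helps,
        # given both the unattended and attended features.
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())

    def forward(self, x_a: torch.Tensor, x_v: torch.Tensor) -> torch.Tensor:
        # x_a: audio features (B, T, dim); x_v: visual features (B, T, dim).
        attended, _ = self.cross_attn(x_a, x_v, x_v)
        g = self.gate(torch.cat([x_a, attended], dim=-1))  # (B, T, 1)
        # Strong complementarity -> rely on cross-attended features;
        # weak complementarity -> fall back to the unattended input.
        return g * attended + (1.0 - g) * x_a

# Usage sketch with random features for two 50-step clips.
audio = torch.randn(2, 50, 128)
video = torch.randn(2, 50, 128)
dca = DynamicCrossAttention(dim=128)
fused_audio = dca(audio, video)  # (2, 50, 128)
```

A symmetric block with the roles of the two modalities swapped would produce the visual-side features; the gate lets each time step fall back to its unattended representation when cross-attention does not add useful information.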

Authors (2)
  1. R. Gnana Praveen (15 papers)
  2. Jahangir Alam (16 papers)
Citations (1)
