Cross-Attention is Not Always Needed: Dynamic Cross-Attention for Audio-Visual Dimensional Emotion Recognition (2403.19554v1)
Abstract: In video-based emotion recognition, the audio and visual modalities are often assumed to have a complementary relationship, which is widely exploited using cross-attention. However, they may also exhibit a weak complementary relationship, in which case cross-attention yields poor audio-visual feature representations and degrades system performance. To address this issue, we propose Dynamic Cross-Attention (DCA), which dynamically selects cross-attended or unattended features on the fly, depending on whether the modalities exhibit a strong or weak complementary relationship with each other. Specifically, a simple yet efficient gating layer is designed to evaluate the contribution of the cross-attention mechanism and to choose the cross-attended features only when the modalities exhibit a strong complementary relationship, and the unattended features otherwise. We evaluate the proposed approach on the challenging RECOLA and Aff-Wild2 datasets, compare it with other variants of cross-attention, and show that it consistently improves performance on both datasets.
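The abstract describes a gating layer that switches between cross-attended and unattended features depending on how complementary the audio and visual streams are. Below is a minimal PyTorch sketch of that idea, not the authors' implementation: the module and variable names, the sigmoid gate, and the soft blending of attended and unattended features are illustrative assumptions; the paper itself may use a different gating formulation.

```python
# Illustrative sketch of a dynamic cross-attention gate (assumed design,
# not the paper's code): each modality attends to the other, and a learned
# gate decides how much the cross-attended features should contribute.
import torch
import torch.nn as nn


class DynamicCrossAttention(nn.Module):
    def __init__(self, dim_a: int, dim_v: int, dim: int = 128, heads: int = 4):
        super().__init__()
        # Project audio/visual features to a common dimension.
        self.proj_a = nn.Linear(dim_a, dim)
        self.proj_v = nn.Linear(dim_v, dim)
        # Standard cross-attention: each modality queries the other.
        self.attn_a = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.attn_v = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Gating layers: from concatenated (unattended, attended) features,
        # predict the contribution of the cross-attended branch.
        self.gate_a = nn.Linear(2 * dim, 1)
        self.gate_v = nn.Linear(2 * dim, 1)

    def forward(self, x_a, x_v):
        # x_a: (B, T, dim_a) audio features; x_v: (B, T, dim_v) visual features
        a = self.proj_a(x_a)
        v = self.proj_v(x_v)

        # Cross-attended features: audio queries attend to visual keys/values,
        # and vice versa.
        a_att, _ = self.attn_a(a, v, v)
        v_att, _ = self.attn_v(v, a, a)

        # Gate in [0, 1]: near 1 -> strong complementary relationship, keep the
        # cross-attended features; near 0 -> fall back to unattended features.
        g_a = torch.sigmoid(self.gate_a(torch.cat([a, a_att], dim=-1)))
        g_v = torch.sigmoid(self.gate_v(torch.cat([v, v_att], dim=-1)))

        out_a = g_a * a_att + (1.0 - g_a) * a
        out_v = g_v * v_att + (1.0 - g_v) * v
        return out_a, out_v


# Usage example with random tensors standing in for audio/visual embeddings.
if __name__ == "__main__":
    model = DynamicCrossAttention(dim_a=512, dim_v=512)
    audio = torch.randn(2, 50, 512)   # (batch, time, feature)
    visual = torch.randn(2, 50, 512)
    fused_a, fused_v = model(audio, visual)
    print(fused_a.shape, fused_v.shape)  # torch.Size([2, 50, 128]) each
```

A soft sigmoid gate is used here so the selection stays differentiable end to end; a hard per-frame choice between the two branches would be closer to literally "selecting" one feature set but would require a relaxation such as Gumbel-Softmax to train.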
Authors: R. Gnana Praveen, Jahangir Alam