Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
120 tokens/sec
GPT-4o
10 tokens/sec
Gemini 2.5 Pro Pro
44 tokens/sec
o3 Pro
5 tokens/sec
GPT-4.1 Pro
3 tokens/sec
DeepSeek R1 via Azure Pro
55 tokens/sec
2000 character limit reached

Cross-Attention is Not Enough: Incongruity-Aware Dynamic Hierarchical Fusion for Multimodal Affect Recognition (2305.13583v4)

Published 23 May 2023 in cs.CL, cs.MM, eess.AS, and eess.IV

Abstract: Fusing multiple modalities has proven effective for multimodal information processing. However, the incongruity between modalities poses a challenge for multimodal fusion, especially in affect recognition. In this study, we first analyze how the salient affective information in one modality can be affected by the other, and demonstrate that inter-modal incongruity exists latently in crossmodal attention. Based on this finding, we propose the Hierarchical Crossmodal Transformer with Dynamic Modality Gating (HCT-DMG), a lightweight incongruity-aware model, which dynamically chooses the primary modality in each training batch and reduces fusion times by leveraging the learned hierarchy in the latent space to alleviate incongruity. The experimental evaluation on five benchmark datasets: CMU-MOSI, CMU-MOSEI, and IEMOCAP (sentiment and emotion), where incongruity implicitly lies in hard samples, as well as UR-FUNNY (humour) and MUStaRD (sarcasm), where incongruity is common, verifies the efficacy of our approach, showing that HCT-DMG: 1) outperforms previous multimodal models with a reduced size of approximately 0.8M parameters; 2) recognizes hard samples where incongruity makes affect recognition difficult; 3) mitigates the incongruity at the latent level in crossmodal attention.

Citations (8)

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.