TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation (2401.12987v2)
Abstract: Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to respond effectively to user requests. The emotions in a conversation can be identified from the representations of various modalities, such as audio, visual, and text. However, because the non-verbal modalities contribute only weakly to emotion recognition, multimodal ERC has long been considered a challenging task. In this paper, we propose the Teacher-leading Multimodal fusion network for ERC (TelME). TelME incorporates cross-modal knowledge distillation to transfer information from an LLM acting as the teacher to the non-verbal students, thereby boosting the efficacy of the weak modalities. We then combine multimodal features using a shifting fusion approach in which the student networks support the teacher. TelME achieves state-of-the-art performance on MELD, a multi-speaker conversation dataset for ERC. Finally, we demonstrate the effectiveness of our components through additional experiments.
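To make the two components concrete, below is a minimal PyTorch sketch of (i) response-level cross-modal knowledge distillation in the style of Hinton et al., where the text teacher's softened emotion distribution supervises an audio or visual student, and (ii) a shifting-style fusion in which non-verbal features displace the teacher's text embedding. All names, dimensions, and loss weights here are illustrative assumptions, not TelME's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Sketch of response-level KD (illustrative, not TelME's exact
    objective): the non-verbal student mimics the text teacher's
    softened emotion distribution while also fitting the gold labels."""
    # Softened teacher distribution and student log-probabilities.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 so its gradients stay comparable to CE.
    kd = F.kl_div(log_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

class ShiftingFusion(nn.Module):
    """Hypothetical shifting fusion: the students produce a small
    displacement that nudges the teacher's text embedding."""
    def __init__(self, text_dim, audio_dim, visual_dim):
        super().__init__()
        self.proj = nn.Linear(audio_dim + visual_dim, text_dim)

    def forward(self, text_emb, audio_emb, visual_emb):
        nonverbal = torch.cat([audio_emb, visual_emb], dim=-1)
        shift = torch.tanh(self.proj(nonverbal))
        return text_emb + shift
```

Note the asymmetry: consistent with the abstract, the distilled students support rather than replace the teacher, so the shift is applied to the text embedding instead of the reverse.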
Authors: Taeyang Yun, Hyunkuk Lim, Jeonghwan Lee, Min Song