TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation (2401.12987v2)

Published 16 Jan 2024 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to effectively respond to user requests. The emotions in a conversation can be identified by the representations from various modalities, such as audio, visual, and text. However, due to the weak contribution of non-verbal modalities to recognizing emotions, multimodal ERC has always been considered a challenging task. In this paper, we propose Teacher-leading Multimodal fusion network for ERC (TelME). TelME incorporates cross-modal knowledge distillation to transfer information from an LLM acting as the teacher to the non-verbal students, thereby optimizing the efficacy of the weak modalities. We then combine multimodal features using a shifting fusion approach in which student networks support the teacher. TelME achieves state-of-the-art performance on MELD, a multi-speaker conversation dataset for ERC. Finally, we demonstrate the effectiveness of our components through additional experiments.
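
The two components named in the abstract can be illustrated with a minimal sketch. The snippet below is an assumption-laden illustration, not TelME's released code: it shows (1) response-level knowledge distillation, where a frozen text teacher's temperature-softened emotion logits supervise a non-verbal student through a KL term plus hard-label cross-entropy, and (2) a gated shifting fusion in which student features displace the teacher's text representation. All module names, feature dimensions, and hyperparameters here are assumptions made for illustration; only the seven-class MELD label set comes from the source setting.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_EMOTIONS = 7  # MELD's seven emotion classes

class NonverbalStudent(nn.Module):
    """Stand-in for an audio or visual student encoder with an emotion head."""
    def __init__(self, feat_dim: int = 128, hidden: int = 64):
        super().__init__()
        self.encoder = nn.Sequential(nn.Linear(feat_dim, hidden), nn.ReLU())
        self.head = nn.Linear(hidden, NUM_EMOTIONS)

    def forward(self, x):
        return self.head(self.encoder(x))

def kd_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Classic response-level distillation (Hinton et al., 2015):
    temperature-softened KL to the teacher, mixed with hard-label CE."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),  # teacher assumed frozen
        reduction="batchmean",
    ) * (T * T)  # T^2 keeps gradient scale comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

class ShiftingFusion(nn.Module):
    """Assumed sketch of teacher-centric fusion: student features produce a
    gated displacement added to the text (teacher) representation, so the
    strong text modality stays dominant while non-verbal cues nudge it."""
    def __init__(self, dim: int):
        super().__init__()
        self.gate = nn.Linear(2 * dim, dim)
        self.shift = nn.Linear(dim, dim)

    def forward(self, text_feat, nonverbal_feat):
        g = torch.sigmoid(self.gate(torch.cat([text_feat, nonverbal_feat], dim=-1)))
        return text_feat + g * self.shift(nonverbal_feat)

# Toy usage: 4 utterances with 128-d audio features; in practice the teacher
# logits and text features would come from a frozen language-model encoder.
student = NonverbalStudent()
audio = torch.randn(4, 128)
teacher_logits = torch.randn(4, NUM_EMOTIONS)
labels = torch.randint(0, NUM_EMOTIONS, (4,))
loss = kd_loss(student(audio), teacher_logits, labels)
loss.backward()

text_feat = torch.randn(4, 128)       # teacher's utterance representation
nonverbal_feat = torch.randn(4, 128)  # pooled student features
fused = ShiftingFusion(dim=128)(text_feat, nonverbal_feat)  # shape (4, 128)
```

The residual-style shift keeps fusion anchored on the reliable text modality while letting the distilled non-verbal students contribute corrections, which is one plausible reading of the paper's "students support the teacher" framing.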

Authors (4)
  1. Taeyang Yun (1 paper)
  2. Hyunkuk Lim (2 papers)
  3. Jeonghwan Lee (6 papers)
  4. Min Song (25 papers)
Citations (4)