TelME: Teacher-leading Multimodal Fusion Network for Emotion Recognition in Conversation (2401.12987v2)
Abstract: Emotion Recognition in Conversation (ERC) plays a crucial role in enabling dialogue systems to respond effectively to user requests. The emotions in a conversation can be identified from the representations of various modalities, such as audio, visual, and text. However, because the non-verbal modalities contribute only weakly to emotion recognition, multimodal ERC has long been considered a challenging task. In this paper, we propose the Teacher-leading Multimodal fusion network for ERC (TelME). TelME incorporates cross-modal knowledge distillation to transfer information from an LLM acting as the teacher to the non-verbal students, thereby boosting the efficacy of the weak modalities. We then combine multimodal features using a shifting fusion approach in which the student networks support the teacher. TelME achieves state-of-the-art performance on MELD, a multi-speaker conversation dataset for ERC. Finally, we demonstrate the effectiveness of our components through additional experiments.
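To make the two components concrete, below is a minimal PyTorch sketch of (i) response-level cross-modal knowledge distillation in the style of Hinton et al., where the text teacher's softened emotion distribution supervises an audio or visual student, and (ii) a shifting-style fusion in which non-verbal features displace the teacher's text embedding. All names, dimensions, and loss weights here are illustrative assumptions, not TelME's exact formulation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels,
                      temperature=4.0, alpha=0.5):
    """Sketch of response-level KD (illustrative, not TelME's exact
    objective): the non-verbal student mimics the text teacher's
    softened emotion distribution while also fitting the gold labels."""
    # Softened teacher distribution and student log-probabilities.
    soft_teacher = F.softmax(teacher_logits / temperature, dim=-1)
    log_student = F.log_softmax(student_logits / temperature, dim=-1)
    # Scale the KL term by T^2 so its gradients stay comparable to CE.
    kd = F.kl_div(log_student, soft_teacher,
                  reduction="batchmean") * temperature ** 2
    ce = F.cross_entropy(student_logits, labels)
    return alpha * kd + (1.0 - alpha) * ce

class ShiftingFusion(nn.Module):
    """Hypothetical shifting fusion: the students produce a small
    displacement that nudges the teacher's text embedding."""
    def __init__(self, text_dim, audio_dim, visual_dim):
        super().__init__()
        self.proj = nn.Linear(audio_dim + visual_dim, text_dim)

    def forward(self, text_emb, audio_emb, visual_emb):
        nonverbal = torch.cat([audio_emb, visual_emb], dim=-1)
        shift = torch.tanh(self.proj(nonverbal))
        return text_emb + shift
```

Note the asymmetry: consistent with the abstract, the distilled students support rather than replace the teacher, so the shift is applied to the text embedding instead of the reverse.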
Authors: Taeyang Yun, Hyunkuk Lim, Jeonghwan Lee, Min Song