MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation (2107.06779v1)

Published 14 Jul 2021 in cs.CL, cs.SD, and eess.AS

Abstract: Emotion recognition in conversation (ERC) is a crucial component in affective dialogue systems, which helps the system understand users' emotions and generate empathetic responses. However, most works focus on modeling speaker and contextual information primarily on the textual modality or simply leveraging multimodal information through feature concatenation. In order to explore a more effective way of utilizing both multimodal and long-distance contextual information, we propose a new model based on multimodal fused graph convolutional network, MMGCN, in this work. MMGCN can not only make use of multimodal dependencies effectively, but also leverage speaker information to model inter-speaker and intra-speaker dependency. We evaluate our proposed model on two public benchmark datasets, IEMOCAP and MELD, and the results prove the effectiveness of MMGCN, which outperforms other SOTA methods by a significant margin under the multimodal conversation setting.

Authors (4)
  1. Jingwen Hu (9 papers)
  2. Yuchen Liu (156 papers)
  3. Jinming Zhao (26 papers)
  4. Qin Jin (94 papers)
Citations (165)

Summary

Multimodal Fusion in ERC: MMGCN Approach

Emotion recognition in conversation (ERC) is an increasingly significant task at the intersection of natural language processing and multimodal machine learning. It helps affective dialogue systems detect users' emotional states and generate responses that are empathetic and contextually appropriate. The paper "MMGCN: Multimodal Fusion via Deep Graph Convolution Network for Emotion Recognition in Conversation" addresses existing limitations in ERC research by focusing on multimodal fusion and contextual dependency modeling with a novel graph convolutional approach.

Methodology and Model Design

The paper introduces a multimodal fused graph convolutional network (MMGCN), designed to exploit multimodal dependencies effectively and to model both inter-speaker and intra-speaker dependencies in conversations. Unlike previous models, which emphasize textual information or rely on simple feature concatenation, MMGCN constructs a fully connected graph across the textual, acoustic, and visual modalities. This structure allows richer interaction of contextual information across modalities. Integrating speaker embeddings further distinguishes speaker-specific emotional tendencies and improves the utterance representations within the graph; a minimal sketch of this construction follows.
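To make the graph construction concrete, the sketch below shows one plausible way to assemble modality-specific utterance nodes with added speaker embeddings and a fully connected adjacency matrix. The function name, tensor shapes, the additive speaker-injection scheme, and the unweighted edges are illustrative assumptions, not the authors' implementation (the paper's edge weighting is omitted here).

```python
# Illustrative sketch (not the authors' code): building a multimodal graph for one
# conversation, assuming per-utterance features for each modality are already
# extracted and projected to a common dimension.
import torch
import torch.nn as nn

def build_multimodal_graph(text_feats, audio_feats, visual_feats, speaker_ids,
                           speaker_embedding: nn.Embedding):
    """
    text_feats, audio_feats, visual_feats: (num_utterances, dim) tensors.
    speaker_ids: (num_utterances,) long tensor of speaker indices.
    Returns node features of shape (3 * num_utterances, dim) and a dense adjacency.
    """
    spk = speaker_embedding(speaker_ids)            # speaker-aware component
    # Add the speaker embedding to every modality's utterance representation
    # (one simple way to inject speaker information; the paper's scheme may differ).
    nodes = torch.cat([text_feats + spk,
                       audio_feats + spk,
                       visual_feats + spk], dim=0)  # one node per (utterance, modality)

    n = nodes.size(0)
    # Fully connected graph over all modality-specific utterance nodes,
    # self-loops removed; any similarity-based edge weighting is left out.
    adj = torch.ones(n, n) - torch.eye(n)
    return nodes, adj
```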

MMGCN encodes the multimodal graph with spectral-domain graph convolution and stacks the convolutional layers to greater depth, improving information flow and the capture of dependencies between utterances; a simplified encoder is sketched below. Experiments are conducted on two widely used ERC benchmarks, IEMOCAP and MELD. Strong numerical results highlight the efficacy of MMGCN, which outperforms existing state-of-the-art methods by a notable margin.
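The following is a minimal stand-in for such a deep spectral-domain encoder, using a standard Kipf-and-Welling-style layer with symmetric adjacency normalization. The layer count, dimensions, and residual connection are illustrative choices rather than the paper's exact formulation.

```python
# Minimal sketch of spectral-domain graph convolution stacked over several layers,
# as a stand-in for the paper's deep GCN encoder. All hyperparameters are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

def normalize_adjacency(adj: torch.Tensor) -> torch.Tensor:
    """Symmetric normalization D^{-1/2} (A + I) D^{-1/2}; assumes no isolated nodes."""
    a_hat = adj + torch.eye(adj.size(0))
    deg_inv_sqrt = a_hat.sum(dim=1).pow(-0.5)
    return deg_inv_sqrt.unsqueeze(1) * a_hat * deg_inv_sqrt.unsqueeze(0)

class DeepGCNEncoder(nn.Module):
    def __init__(self, dim: int, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_layers)])

    def forward(self, nodes: torch.Tensor, adj: torch.Tensor) -> torch.Tensor:
        a_norm = normalize_adjacency(adj)
        h = nodes
        for layer in self.layers:
            # Each layer aggregates neighbor information through the normalized
            # adjacency; a residual connection helps when stacking many layers.
            h = F.relu(layer(a_norm @ h)) + h
        return h
```

The `nodes` and `adj` produced by the graph-construction sketch above could feed directly into this encoder; the resulting node representations would then be pooled per utterance and passed to an emotion classifier.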

Experimental Results

The experimental evaluation shows that MMGCN consistently delivers superior performance across multimodal settings. Compared against models using early fusion, late fusion, and gated attention mechanisms, MMGCN maintains a higher weighted average F1-score on both datasets, reaching 66.22% on IEMOCAP and 58.65% on MELD and outperforming other graph-based models such as DialogueGCN, which relies only on textual information. The experiments also demonstrate the importance of modality-specific modeling: MMGCN leverages acoustic and visual cues alongside textual signals to enrich emotional understanding in conversations.
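For reference, the weighted average F1-score used here weights each class's F1 by its support, which matters for imbalanced emotion distributions such as MELD's. A quick illustration with scikit-learn follows; the labels are toy values, not data from the paper.

```python
# Illustrative computation of the weighted average F1-score used in the evaluation;
# the label lists below are made up for demonstration only.
from sklearn.metrics import f1_score

y_true = ["happy", "sad", "neutral", "neutral", "angry", "neutral"]
y_pred = ["happy", "neutral", "neutral", "neutral", "angry", "sad"]

# 'weighted' averages per-class F1 scores by each class's support, so frequent
# classes (e.g. 'neutral') contribute proportionally more to the final score.
print(f1_score(y_true, y_pred, average="weighted"))
```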

Implications and Future Directions

The effectiveness of MMGCN in ERC brings several implications. Practically, it can enhance applications requiring emotional detection, such as virtual assistants, sentiment analysis tools, and human-computer interaction systems. Theoretically, it opens avenues for further exploration into multimodal fusion techniques and graph-based modeling. Future research can aim at refining graph construction strategies, investigating alternate convolutional mechanisms, and exploring scalability across larger and more diverse datasets.

Furthermore, MMGCN's success underscores the importance of considering both speaker-level details and cross-modal dependencies—elements critical in developing robust affective computing systems. The model can be extended or modified to suit non-dyadic conversation settings or integrate additional context cues like sentiment intensity or emotion trajectory over time, which could provide more nuanced insights into conversational dynamics.

In summary, MMGCN represents a significant step forward in multimodal emotion recognition in conversations, combining deep graph convolutional techniques with the integration of textual, acoustic, and visual sources. It yields a richer understanding of conversational context, which is crucial for both evolving ERC methodology and practical applications.