MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations
The paper "MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations" addresses the pressing need for a comprehensive dataset that aids in the development of robust emotion recognition systems in natural conversational settings. Emotion recognition in conversations (ERC) is a nuanced and complex task due to the inherent multimodal nature of human communication, necessitating data that captures textual, auditory, and visual cues.
Introduction and Motivation
Emotion recognition in multimodal dialogues has garnered substantial interest due to applications spanning dialogue systems, user behavior analysis, and emotional state monitoring. Existing multimodal datasets such as IEMOCAP and SEMAINE contain only dyadic conversations and therefore do not represent multi-party interaction scenarios. The scarcity of large multimodal multi-party datasets impedes progress in this domain. To bridge this gap, the authors propose MELD, the Multimodal EmotionLines Dataset. An extension of the EmotionLines dataset, MELD contains approximately 13,000 utterances from 1,433 dialogues extracted from the TV series Friends; each utterance is annotated with one of seven emotion labels (anger, disgust, fear, joy, neutral, sadness, surprise) and a sentiment label (positive, negative, or neutral), and is accompanied by aligned textual, audio, and visual data.
Dataset Construction and Annotation
The original EmotionLines dataset was limited to the textual modality. MELD extends it with audio and visual modalities, providing a richer dataset suited to multimodal ERC. Construction of the dataset involved the following steps (a short data-loading sketch follows the list):
- Extracting the start and end timestamps of every utterance from the subtitle files so that the corresponding audio-visual clips could be cut and aligned.
- Splitting dialogues whose utterances spanned multiple scenes or episodes into separate dialogues to ensure coherence.
- Employing three annotators to re-label each utterance using the full multimodal information, yielding a higher Fleiss' kappa (0.43) than the original text-only EmotionLines annotation (0.34).
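For orientation, the following is a minimal sketch of how one MELD split could be loaded and regrouped into dialogues. It assumes the CSV layout distributed with the dataset (columns such as Utterance, Speaker, Emotion, Sentiment, Dialogue_ID, Utterance_ID); the file name is a placeholder for a local copy of the training split.

```python
# Minimal sketch: loading one MELD split and regrouping it into dialogues.
# Column names follow the CSV layout distributed with the dataset; the
# file name below is a placeholder for a local copy of the train split.
import pandas as pd

df = pd.read_csv("train_sent_emo.csv")

# Group utterances by dialogue so that context-aware models can consume
# one conversation at a time, in utterance order.
dialogues = {
    dialogue_id: group.sort_values("Utterance_ID")[
        ["Speaker", "Utterance", "Emotion", "Sentiment"]
    ].to_dict("records")
    for dialogue_id, group in df.groupby("Dialogue_ID")
}

print(f"{len(df)} utterances across {len(dialogues)} dialogues")
```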
Challenges in ERC
ERC systems must handle several challenges, such as modeling conversational context, recognizing emotion shifts within a dialogue, and classifying short utterances whose emotion is ambiguous in isolation. Recognizing emotions from isolated utterances can therefore be inaccurate, underscoring the need for context-aware models. The paper argues that facial expressions, vocal tonality, and textual content together provide a more robust basis for ERC than any single modality alone.
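To make the context argument concrete, the sketch below shows one simple way to pair each utterance with its preceding utterances in the same dialogue, which is the kind of input a context-aware classifier would consume. The helper name and the fixed window size are illustrative choices, not part of the paper.

```python
# Hypothetical helper: attach up to `window` preceding utterances of the
# same dialogue to each target utterance as conversational context.
from typing import Dict, List


def with_context(dialogue: List[Dict], window: int = 3) -> List[Dict]:
    examples = []
    for i, utterance in enumerate(dialogue):
        preceding = dialogue[max(0, i - window):i]
        examples.append({
            "context": [u["Utterance"] for u in preceding],
            "utterance": utterance["Utterance"],
            "label": utterance["Emotion"],
        })
    return examples


# Toy dialogue for illustration only (not taken from MELD).
example_dialogue = [
    {"Utterance": "Hi!", "Emotion": "joy"},
    {"Utterance": "You scared me.", "Emotion": "surprise"},
    {"Utterance": "Sorry about that.", "Emotion": "sadness"},
]
print(with_context(example_dialogue, window=2))
```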
Baseline Models and Results
To demonstrate the efficacy of MELD, the paper presents strong baselines:
- text-CNN: Applies a convolutional neural network to each utterance in isolation, without any conversational context.
- bcLSTM: A bidirectional contextual LSTM that encodes the neighboring utterances of each target utterance and can operate on textual, audio, or fused multimodal features (a rough sketch of the contextual idea follows this list).
- DialogueRNN: A state-of-the-art ERC model that tracks individual speaker states and models inter-speaker dependencies with gated recurrent units (GRUs).
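As a rough illustration of the contextual-baseline idea (not the authors' implementation), the sketch below runs a bidirectional LSTM over precomputed per-utterance feature vectors of a dialogue and classifies every time step. The feature dimension, hidden size, and class count are assumed values.

```python
# Sketch of a bcLSTM-style contextual classifier: a bidirectional LSTM over
# per-utterance feature vectors, followed by a linear emotion head.
# utterance_dim / hidden_dim / num_emotions are illustrative choices.
import torch
import torch.nn as nn


class ContextualLSTMClassifier(nn.Module):
    def __init__(self, utterance_dim: int = 600, hidden_dim: int = 300,
                 num_emotions: int = 7):
        super().__init__()
        self.context_encoder = nn.LSTM(utterance_dim, hidden_dim,
                                       batch_first=True, bidirectional=True)
        self.emotion_head = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, utterance_feats: torch.Tensor) -> torch.Tensor:
        # utterance_feats: (batch, dialogue_length, utterance_dim)
        context, _ = self.context_encoder(utterance_feats)
        return self.emotion_head(context)  # per-utterance emotion logits


# One dialogue of 12 utterances, each represented by a 600-d feature vector.
logits = ContextualLSTMClassifier()(torch.randn(1, 12, 600))
```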
The reported results indicate that the multimodal variant of DialogueRNN achieves the highest performance, with a weighted F-score of 67.56% for 3-class sentiment classification and 60.25% for 7-class emotion classification. The addition of multimodal features yields only a modest improvement over the text-only models, underscoring the need for better fusion methodologies and improved audio feature extraction.
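The weighted F-score reported above can be computed with scikit-learn's `f1_score` using `average="weighted"`; the label lists below are dummy placeholders rather than actual model outputs.

```python
# Illustration of the weighted F-score metric used in the reported results,
# computed over placeholder labels with scikit-learn.
from sklearn.metrics import f1_score

y_true = ["neutral", "joy", "anger", "neutral", "surprise"]
y_pred = ["neutral", "joy", "neutral", "neutral", "surprise"]

print(f1_score(y_true, y_pred, average="weighted"))
```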
Implications and Future Directions
The introduction of MELD holds significant implications for the development of contextually and multimodally aware ERC systems. Future research directions highlighted by the paper include:
- Improvement of contextual modeling in ERC systems.
- Enhanced audio and visual feature extraction methodologies.
- Exploration of advanced multimodal fusion techniques beyond the simple feature concatenation used by the baselines (illustrated in the sketch after this list).
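For reference, concatenation-style fusion amounts to nothing more than stacking modality features along the feature axis before classification, as the minimal sketch below shows; the shapes are assumed for illustration.

```python
# "Simple concatenation" fusion: per-utterance textual and acoustic feature
# vectors are stacked along the feature dimension before classification.
# Shapes are illustrative: 1 dialogue, 12 utterances, assumed feature sizes.
import torch

text_feats = torch.randn(1, 12, 600)   # textual utterance features
audio_feats = torch.randn(1, 12, 300)  # acoustic utterance features

fused = torch.cat([text_feats, audio_feats], dim=-1)  # shape (1, 12, 900)
```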
Applications
The MELD dataset will benefit various applications such as empathetic response generation in dialogue systems and personality modeling in interactive settings. Artificially intelligent personal assistants could leverage such datasets to interpret and respond to users more naturally.
Conclusion
Overall, MELD represents an incremental but crucial step toward building more sophisticated multimodal ERC systems. The rigorous methodology and comprehensive dataset enable advancements in understanding and responding to human emotions within the conversational context. The potential for future research leveraging MELD is vast, promising deeper insights and more nuanced applications in emotion recognition and beyond.