MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge (2505.24493v1)

Published 30 May 2025 in cs.AI, cs.SD, and eess.AS

Abstract: Although speech emotion recognition (SER) has advanced significantly with deep learning, annotation remains a major hurdle. Human annotation is not only costly but also subject to inconsistencies: annotators often have different preferences and may lack the necessary contextual knowledge, which can lead to varied and inaccurate labels. Meanwhile, LLMs have emerged as a scalable alternative for annotating text data. However, the potential of LLMs to perform emotional speech data annotation without human supervision has yet to be thoroughly investigated. To address these problems, we apply GPT-4o to annotate a multimodal dataset collected from the sitcom Friends, using only textual cues as inputs. By crafting structured text prompts, our methodology capitalizes on the knowledge GPT-4o has accumulated during its training, showcasing that it can generate accurate and contextually relevant annotations without direct access to multimodal inputs. Therefore, we propose MELT, a multimodal emotion dataset fully annotated by GPT-4o. We demonstrate the effectiveness of MELT by fine-tuning four self-supervised learning (SSL) backbones and assessing speech emotion recognition performance across emotion datasets. Additionally, our subjective experiments' results demonstrate a consistent performance improvement on SER.

Summary

  • The paper introduces MELT, a dataset fully annotated by GPT-4o using structured text prompts, demonstrating a scalable and cost-effective method for automated multimodal emotion data annotation.
  • Experimental results indicate that models fine-tuned on the MELT dataset achieve better generalization and robustness across various emotion recognition benchmarks compared to models trained on traditional human-annotated data.
  • This research highlights the potential of leveraging LLMs like GPT-4o to overcome inefficiencies in emotion data annotation, paving the way for more expansive and varied emotion recognition systems while acknowledging potential LLM biases.

MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge

The paper "MELT: Towards Automated Multimodal Emotion Data Annotation by Leveraging LLM Embedded Knowledge" presents a novel approach to annotating multimodal emotion data by utilizing the capabilities of LLMs, specifically GPT-4o. The work addresses critical challenges in the field of Speech Emotion Recognition (SER), focusing on the inefficiencies and inconsistencies present in human-based annotation processes.

Methodological Framework

The researchers propose MELT, a dataset fully annotated using GPT-4o, derived from the television series "Friends." The dataset utilizes structured text prompts to leverage GPT-4o's contextual and embedded knowledge, demonstrating the LLM's ability to annotate multimodal data based solely on textual input. This method removes the need for direct human supervision or for access to the audio itself, yielding a scalable and cost-effective annotation pipeline.

Key to their methodology is a structured prompting framework that incorporates context-awareness, Chain-of-Thought (CoT) reasoning, and cross-validation. This approach ensures that the model not only accurately captures the emotional content of the dialogue but also maintains consistency across annotations. Using these structured prompts, GPT-4o produced coherent and contextually relevant emotion annotations for the MELT dataset.
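The paper's exact prompts are not reproduced here; the Python sketch below only illustrates the kind of context-aware, chain-of-thought annotation call described above, using the OpenAI chat API. The prompt wording, label set, and agreement-based consistency check are illustrative assumptions, not the authors' published MELT configuration.

```python
# Illustrative sketch: prompt wording, label set, and consistency check are
# assumptions, not the authors' published MELT configuration.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

EMOTIONS = ["anger", "disgust", "fear", "joy", "neutral", "sadness", "surprise"]

def annotate_utterance(context_turns: list[str], target: str) -> str:
    """Ask GPT-4o for an emotion label using textual cues only."""
    prompt = (
        "You are annotating dialogue from the sitcom Friends.\n"
        "Preceding turns:\n" + "\n".join(context_turns) + "\n\n"
        f'Target utterance: "{target}"\n\n'
        "Reason step by step about the speaker's situation and intent, then "
        "end your answer with a single line 'Label: <emotion>' using exactly "
        "one of: " + ", ".join(EMOTIONS) + "."
    )
    resp = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": prompt}],
    )
    # The model is asked to finish with "Label: <emotion>"; parse the last line.
    last = resp.choices[0].message.content.strip().splitlines()[-1].lower()
    return last.removeprefix("label:").strip()

def consistent_label(context_turns: list[str], target: str, n: int = 3) -> str | None:
    """Simple consistency check: keep a label only if repeated queries agree."""
    labels = {annotate_utterance(context_turns, target) for _ in range(n)}
    return labels.pop() if len(labels) == 1 else None
```

Including the preceding dialogue turns supplies the context-awareness the authors emphasize, while the repeated-query filter stands in for their cross-validation step.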

Experimental Analysis and Results

The research employs both subjective and objective evaluations to ascertain the effectiveness of the GPT-4o-based annotations. Subjectively, a Mean Opinion Score (MOS) experiment involving human participants revealed a preference for MELT's annotations over those of the traditionally human-annotated MELD dataset. This indicates that human evaluators perceive the LLM-generated annotations as more contextually relevant and accurate.
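For reference, a MOS is simply the mean of per-rater scores on a fixed scale (typically 1 to 5), often reported with a confidence interval. A minimal sketch of this aggregation follows; the ratings are hypothetical, not values from the paper.

```python
# Minimal MOS aggregation sketch; ratings below are hypothetical placeholders.
import math
import statistics

def mos(ratings: list[int]) -> tuple[float, float]:
    """Return the mean opinion score and the half-width of a 95% normal CI."""
    mean = statistics.mean(ratings)
    ci = 1.96 * statistics.stdev(ratings) / math.sqrt(len(ratings))
    return mean, ci

melt_ratings = [4, 5, 4, 4, 3, 5, 4]  # hypothetical per-rater scores (1-5)
meld_ratings = [3, 3, 4, 2, 3, 4, 3]
print(mos(melt_ratings), mos(meld_ratings))
```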

Objectively, the authors fine-tuned four self-supervised learning (SSL) models on the MELT dataset and compared their performance on several emotion recognition benchmarks, including IEMOCAP, TESS, RAVDESS, and CREMA-D. The results show that models trained on MELT generally achieve better generalization and robustness across these datasets, yielding improvements in Unweighted Average Recall (UAR) and other classification metrics.
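UAR is the unweighted mean of per-class recalls, which keeps majority classes such as neutral from dominating the score; in scikit-learn it corresponds to macro-averaged recall. A minimal sketch with hypothetical labels:

```python
# UAR = mean of per-class recalls, i.e. macro-averaged recall in scikit-learn.
from sklearn.metrics import recall_score

# Hypothetical labels for illustration only.
y_true = ["joy", "anger", "neutral", "joy", "sadness", "neutral"]
y_pred = ["joy", "neutral", "neutral", "joy", "sadness", "anger"]

uar = recall_score(y_true, y_pred, average="macro")
print(f"UAR: {uar:.3f}")  # (0 + 1 + 0.5 + 1) / 4 = 0.625 here
```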

Implications and Future Directions

This research highlights the potential of integrating LLMs into the process of emotion annotation, particularly for multimodal datasets. The ability of LLMs like GPT-4o to understand and generate contextually appropriate emotion annotations without requiring direct audio input or human intervention suggests a significant shift towards more scalable and efficient data annotation practices.

However, the work also points to potential biases in LLMs due to their reliance on internet-derived data, which the authors note may affect the contextual understanding of certain emotional expressions. Future research may focus on refining these models further, potentially through hybrid systems that combine LLM annotations with human validation to mitigate biases and enhance reliability.

In summary, this paper presents a significant step forward in the field of affective computing, illustrating how LLMs can be leveraged to overcome long-standing challenges in emotion data annotation. The development of datasets like MELT could pave the way for more expansive and varied emotion recognition systems, enhancing both the breadth and accuracy of multimodal emotion analysis in real-world applications.