Explainable Multimodal Emotion Recognition (2306.15401v6)

Published 27 Jun 2023 in cs.MM and cs.HC

Abstract: Multimodal emotion recognition is an important research topic in artificial intelligence, whose main goal is to integrate multimodal clues to identify human emotional states. Current works generally assume accurate labels for benchmark datasets and focus on developing more effective architectures. However, emotion annotation relies on subjective judgment. To obtain more reliable labels, existing datasets usually restrict the label space to some basic categories, then hire plenty of annotators and use majority voting to select the most likely label. However, this process may result in some correct but non-candidate or non-majority labels being ignored. To ensure reliability without ignoring subtle emotions, we propose a new task called "Explainable Multimodal Emotion Recognition (EMER)". Unlike traditional emotion recognition, EMER takes a step further by providing explanations for these predictions. Through this task, we can extract relatively reliable labels since each label has a certain basis. Meanwhile, we borrow LLMs to disambiguate unimodal clues and generate more complete multimodal explanations. From them, we can extract richer emotions in an open-vocabulary manner. This paper presents our initial attempt at this task, including introducing a new dataset, establishing baselines, and defining evaluation metrics. In addition, EMER can serve as a benchmark task to evaluate the audio-video-text understanding performance of multimodal LLMs.

Explainable Multimodal Emotion Recognition: Advancements and Challenges

The paper "Explainable Multimodal Emotion Recognition" introduces a novel approach to addressing the complexities of emotion recognition through Explainable Multimodal Emotion Recognition (EMER). The researchers highlight the limitations of existing emotion recognition systems, primarily stemming from label ambiguity and the subjective nature of emotion annotations. This paper seeks to rectify these issues by proposing EMER, a task that not only identifies emotions from multimodal data but also provides explanations for these emotions, thereby enhancing both the transparency and reliability of emotion recognition models.

Key Contributions

  1. Introduction of EMER: The EMER task is designed to provide explanations for identified emotions, addressing the prevalent issue of label ambiguity in conventional datasets. By generating explanations, the task aids in producing reliable and interpretable emotion labels.
  2. Database and Metrics: The paper introduces a newly constructed dataset tailored for EMER, alongside baseline models and evaluation metrics specifically developed for this task. The dataset is derived from the MER2023 corpus, selectively annotated to focus on detailed emotion explanations.
  3. Role of LLMs: EMER utilizes LLMs to disambiguate unimodal clues and synthesize comprehensive multimodal explanations. This approach leverages the reasoning capabilities of LLMs to interpret audio, video, and textual data in concert, providing a richer set of emotional categories in an open-vocabulary format.
  4. Open Vocabulary Approach: Unlike traditional models that limit emotion identification to a fixed set of categories, EMER allows for an open-vocabulary emotion recognition process. This flexibility enables the extraction of nuanced emotional states that would otherwise be overlooked by predefined label sets; a minimal sketch of this explanation-then-label process follows the list.
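
To make the pipeline concrete, here is a minimal sketch of the explanation-then-label idea, assuming per-modality clue descriptions are already available as plain text: an LLM is prompted to reason over the audio, visual, and transcript clues, resolve conflicts between them, and return an explanation together with an open-vocabulary label list. The prompt wording, the `emer_infer` helper, and the `query_llm` backend are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the explanation-then-label idea behind EMER.
# Prompt wording, helper names, and the LLM backend are illustrative assumptions.
import json
from typing import Callable

PROMPT_TEMPLATE = """You are given clues about one video clip from three modalities.
Audio clue: {audio}
Visual clue: {visual}
Transcript: {text}

First explain, step by step, which emotional states the clues support, resolving any
conflicts between modalities. Then output a line starting with LABELS: followed by a
JSON list of open-vocabulary emotion words."""


def emer_infer(audio_clue: str, visual_clue: str, transcript: str,
               query_llm: Callable[[str], str]) -> tuple[str, list[str]]:
    """Return (explanation, open-vocabulary emotion labels) for one clip."""
    prompt = PROMPT_TEMPLATE.format(audio=audio_clue, visual=visual_clue, text=transcript)
    response = query_llm(prompt)   # any chat-completion backend can be plugged in here
    explanation, _, label_part = response.partition("LABELS:")
    try:
        labels = [str(w).strip().lower() for w in json.loads(label_part.strip())]
    except json.JSONDecodeError:
        labels = []                # fall back to an empty label set if the format is ignored
    return explanation.strip(), labels
```

Any chat-completion function can be passed in as `query_llm`; in the paper's pipeline, the unimodal clues themselves come from modality-specific models and are disambiguated with LLMs rather than supplied by hand.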

Numerical Results and Findings

The paper reports empirical results indicating that EMER improves both the reliability and the coverage of emotion recognition. Evaluated on the newly constructed dataset, the proposed baselines outperform traditional one-hot labeling by recovering a wider range of emotion categories, and the extracted labels align closely with human annotations, as reflected in high Top-1 and Top-2 accuracy rates.
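
As a rough illustration of how Top-k accuracy can be computed over open-vocabulary label sets, the sketch below counts a clip as correct when any of its k highest-ranked predicted emotion words matches an annotated one. The `top_k_accuracy` helper and the exact-string matching rule are assumptions made for illustration; the paper defines its own evaluation metrics.

```python
# Illustrative Top-k scoring for open-vocabulary emotion labels: a clip counts as
# correct if any of its k highest-ranked predicted labels appears in the annotation.
def top_k_accuracy(predictions: list[list[str]], annotations: list[list[str]], k: int) -> float:
    """Fraction of clips whose top-k predicted labels overlap the annotated labels."""
    hits = 0
    for pred, gold in zip(predictions, annotations):
        top_k = {p.lower() for p in pred[:k]}        # predictions assumed ranked by confidence
        if top_k & {g.lower() for g in gold}:
            hits += 1
    return hits / len(predictions) if predictions else 0.0


# Example: Top-1 misses the first clip, Top-2 recovers it.
preds = [["surprised", "worried"], ["happy"]]
golds = [["worried", "anxious"], ["happy", "relieved"]]
print(top_k_accuracy(preds, golds, k=1))  # 0.5
print(top_k_accuracy(preds, golds, k=2))  # 1.0
```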

Practical and Theoretical Implications

The practical significance of EMER lies in its potential applications in human-computer interaction, sentiment analysis, and affective computing, where understanding nuanced human emotions is critical. Theoretically, this paper pushes the boundaries of multimodal learning by integrating explicability into emotion recognition, thus fostering the development of more robust and human-like AI systems.

Future Directions

The research paves the way for future studies focused on expanding EMER to other domains and further refining the distinction between subtle emotional nuances. There is also an opportunity to enhance the dataset by integrating more diverse cultural and linguistic contexts to improve the generalization capabilities of the model. Additionally, further exploration into the interpretability of AI models can be facilitated through the methodological frameworks introduced in this paper.

In conclusion, the exploration of Explainable Multimodal Emotion Recognition as detailed in this paper represents an important stride towards more transparent and accurate emotion AI systems. By leveraging multimodal data and emphasizing explainability, the proposed EMER framework not only enhances emotion recognition but also opens new avenues for research in AI interpretability and human-centric AI development.

References (19)
  1. Multimodal machine learning: A survey and taxonomy. IEEE Transactions on Pattern Analysis and Machine Intelligence, 41(2):423–443, 2018.
  2. MERBench: A unified evaluation benchmark for multimodal emotion recognition. arXiv preprint arXiv:2401.03429, 2024.
  3. MELD: A multimodal multi-party dataset for emotion recognition in conversations. In Proceedings of the 57th Conference of the Association for Computational Linguistics, pages 527–536, 2019.
  4. CTNet: Conversational transformer network for emotion recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:985–1000, 2021.
  5. Reliable crowdsourcing and deep locality-preserving learning for expression recognition in the wild. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2852–2861, 2017.
  6. MER 2023: Multi-label learning, modality robustness, and semi-supervised learning. arXiv preprint arXiv:2304.08981, 2023.
  7. VideoChat: Chat-centric video understanding. arXiv preprint arXiv:2305.06355, 2023.
  8. Video-LLaMA: An instruction-tuned audio-visual language model for video understanding. arXiv preprint arXiv:2306.02858, 2023.
  9. PandaGPT: One model to instruction-follow them all. In Proceedings of the 1st Workshop on Taming Large Language Models: Controllability in the Era of Interactive Assistants, pages 11–23, 2023.
  10. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 1–13, 2023.
  11. SALMONN: Towards generic hearing abilities for large language models. In The Twelfth International Conference on Learning Representations, 2024.
  12. ImageBind: One embedding space to bind them all. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15180–15190, 2023.
  13. Valley: Video assistant with large language model enhanced ability. arXiv preprint arXiv:2306.07207, 2023.
  14. Otter: A multi-modal model with in-context instruction tuning. arXiv preprint arXiv:2305.03726, 2023.
  15. MVBench: A comprehensive multi-modal video understanding benchmark. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024.
  16. Video-ChatGPT: Towards detailed video understanding via large vision and language models. arXiv preprint arXiv:2306.05424, 2023.
  17. mPLUG-Owl: Modularization empowers large language models with multimodality. arXiv preprint arXiv:2304.14178, 2023.
  18. Qwen-Audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919, 2023.
  19. OpenAI. GPT-4V(ision) system card, 2023.
Authors (16)
  1. Zheng Lian
  2. Licai Sun
  3. Haiyang Sun
  4. Hao Gu
  5. Zhuofan Wen
  6. Siyuan Zhang
  7. Shun Chen
  8. Mingyu Xu
  9. Ke Xu
  10. Lan Chen
  11. Jiangyan Yi
  12. Bin Liu
  13. Jianhua Tao
  14. Kang Chen
  15. Shan Liang
  16. Ya Li