The paper "Emotion-LLaMA: Multimodal Emotion Recognition and Reasoning with Instruction Tuning" (Cheng et al., 17 Jun 2024 ) introduces a novel multimodal LLM (MLLM) designed for advanced emotion recognition and reasoning. The key contributions and findings of this work are:
- MERR Dataset: The authors created the Multimodal Emotion Recognition and Reasoning (MERR) dataset, which contains 28,618 coarse-grained and 4,487 fine-grained annotated samples spanning a broad range of emotion categories. MERR addresses the scarcity of instruction-style annotations in existing multimodal emotion datasets and supports learning across diverse scenarios.
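To make the annotation structure concrete, here is a hypothetical sketch of what a fine-grained, MERR-style instruction sample could look like. The field names and example values are illustrative assumptions, not the dataset's actual schema.

```python
# Hypothetical fine-grained sample in the spirit of MERR; all field names
# and values are illustrative assumptions, not the released schema.
merr_sample = {
    "video_id": "sample_00042",                            # clip identifier (hypothetical)
    "emotion": "surprise",                                 # categorical emotion label
    "visual_expression": "raised eyebrows, widened eyes",  # facial cue description
    "audio_tone": "rising pitch, faster speech rate",      # vocal cue description
    "transcript": "Wait, you did what?",                   # spoken content
    "reasoning": (
        "The raised eyebrows and widened eyes, together with the rising pitch, "
        "suggest the speaker is reacting to unexpected news, so the emotion is surprise."
    ),
}

# A corresponding instruction-tuning pair could then be assembled as:
instruction = "Describe the person's emotional state and explain the reasoning behind it."
response = merr_sample["reasoning"]
```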
- Emotion-LLaMA Model: The paper details Emotion-LLaMA, an MLLM that integrates audio, visual, and textual inputs through specialized emotion encoders. HuBERT processes the audio stream, while multiview visual encoders (MAE, VideoMAE, and EVA) capture detailed facial information; instruction tuning further refines the model's emotion recognition and reasoning capabilities.
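As a rough illustration of this multimodal design, the sketch below shows one common way such features could be fused: each encoder's output is linearly projected into the LLM's token-embedding space and prepended to the text embeddings. The module names, feature dimensions, and token counts are assumptions for illustration; the actual Emotion-LLaMA implementation may differ.

```python
import torch
import torch.nn as nn

class MultimodalEmotionFusion(nn.Module):
    """Minimal sketch: project modality-specific features (e.g., HuBERT audio,
    MAE/VideoMAE/EVA visual) into the LLM embedding space and prepend them to
    the text tokens. Dimensions are assumptions, not the paper's exact values."""

    def __init__(self, audio_dim=1024, frame_dim=768, video_dim=768,
                 global_dim=1408, llm_dim=4096):
        super().__init__()
        self.audio_proj = nn.Linear(audio_dim, llm_dim)    # HuBERT-style audio features
        self.frame_proj = nn.Linear(frame_dim, llm_dim)    # MAE-style facial-frame features
        self.video_proj = nn.Linear(video_dim, llm_dim)    # VideoMAE-style temporal features
        self.global_proj = nn.Linear(global_dim, llm_dim)  # EVA-style global visual features

    def forward(self, audio_feat, frame_feat, video_feat, global_feat, text_emb):
        # Each *_feat has shape (batch, num_tokens, encoder_dim);
        # text_emb has shape (batch, seq_len, llm_dim).
        multimodal_tokens = torch.cat([
            self.audio_proj(audio_feat),
            self.frame_proj(frame_feat),
            self.video_proj(video_feat),
            self.global_proj(global_feat),
        ], dim=1)
        # The fused sequence would then be fed to the instruction-tuned LLaMA backbone.
        return torch.cat([multimodal_tokens, text_emb], dim=1)

# Dummy-tensor usage:
fusion = MultimodalEmotionFusion()
fused = fusion(
    torch.randn(1, 8, 1024),   # audio tokens
    torch.randn(1, 4, 768),    # facial-frame tokens
    torch.randn(1, 4, 768),    # temporal video tokens
    torch.randn(1, 1, 1408),   # global visual token
    torch.randn(1, 32, 4096),  # text token embeddings
)
print(fused.shape)  # torch.Size([1, 49, 4096])
```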
- Performance Benchmarking: Emotion-LLaMA was evaluated extensively, demonstrating superior performance compared to other MLLMs across multiple datasets. Key performance metrics include:
  - Clue Overlap score of 7.83 and Label Overlap score of 6.25 on the EMER dataset
  - F1 score of 0.9036 on the MER2023 challenge
  - Unweighted Average Recall (UAR) of 45.59 and Weighted Average Recall (WAR) of 59.37 in zero-shot evaluation on the DFEW dataset (see the sketch after this list for how these two metrics are computed)
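The DFEW metrics above follow standard definitions: UAR is the mean of per-class recalls with each class weighted equally, while WAR weights each class's recall by its support, which reduces to overall accuracy. The helper below is a minimal sketch of these definitions for integer class labels, not the paper's evaluation code.

```python
import numpy as np

def uar_war(y_true, y_pred, num_classes):
    """Compute Unweighted Average Recall (UAR) and Weighted Average Recall (WAR)."""
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    recalls, supports = [], []
    for c in range(num_classes):
        mask = (y_true == c)
        if mask.sum() == 0:
            continue  # skip classes absent from the ground truth
        recalls.append((y_pred[mask] == c).mean())  # per-class recall
        supports.append(mask.sum())                 # number of true samples in class c
    recalls, supports = np.array(recalls), np.array(supports)
    uar = recalls.mean()                               # equal weight per class
    war = (recalls * supports).sum() / supports.sum()  # support-weighted == overall accuracy
    return uar, war

# Toy usage:
uar, war = uar_war([0, 0, 1, 1, 2, 2], [0, 1, 1, 1, 2, 0], num_classes=3)
print(f"UAR={uar:.4f}, WAR={war:.4f}")  # UAR=0.6667, WAR=0.6667
```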
The paper's primary finding is that Emotion-LLaMA substantially improves emotion recognition and reasoning through effective multimodal input integration and instruction tuning, establishing a new state of the art for multimodal emotion analysis.