- The paper improves Emotion-LLaMA with a novel Conv-Attention mechanism to enhance multimodal emotion recognition, achieving state-of-the-art performance on MER-NOISE.
- Emotion-LLaMA is utilized to generate pseudo-labels for unlabeled samples, effectively augmenting training data and enhancing the model's ability to reason across audio, visual, and text modalities.
- The Conv-Attention framework efficiently fuses multimodal features while prioritizing crucial information and minimizing noise, offering a lightweight yet robust approach for feature extraction.
Improving Emotion-LLaMA for Multimodal Emotion Recognition
The paper "SZTU-CMU at MER2024: Improving Emotion-LLaMA with Conv-Attention for Multimodal Emotion Recognition" presents an approach that addresses challenges in Multimodal Emotion Recognition (MER) through enhanced emotional understanding capabilities with the Emotion-LLaMA model and a novel Conv-Attention mechanism. The research focuses on two critical domains within the MER context: noise robustness (MER-NOISE) and open-vocabulary recognition (MER-OV), demonstrating significant advancements achieved by integrating new methodologies into existing frameworks.
Summary of Contributions
This paper presents a system that combines Emotion-LLaMA's capabilities with a Conv-Attention model to tackle the twin issues of limited labeled data and modality-specific noise. The proposed model achieves a state-of-the-art weighted average F-score of 85.30% on the MER-NOISE track, outperforming other competitors by a notable margin. Using Emotion-LLaMA for open-vocabulary annotation in the MER-OV track likewise yields an impressive 8.52% improvement in average accuracy and recall over existing models such as GPT-4V.
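For context, a weighted average F-score of this kind can be computed with scikit-learn's `f1_score(..., average="weighted")`, which weights each class's F1 by its support. This is only a reference computation of the metric, not the challenge's evaluation code, and the labels below are toy placeholders rather than challenge data:

```python
from sklearn.metrics import f1_score

# Toy example: per-class F1 scores are averaged, weighted by class frequency (support).
y_true = ["happy", "sad", "angry", "happy", "neutral", "happy"]
y_pred = ["happy", "sad", "happy", "happy", "neutral", "sad"]

waf = f1_score(y_true, y_pred, average="weighted")
print(f"weighted average F-score: {waf:.4f}")
```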
Key Methodological Insights
- Emotion-LLaMA for Pseudo-Labeling: The research leverages the Emotion-LLaMA model to generate high-quality pseudo-labels for unlabeled samples. By alleviating the shortage of labeled data, the augmented training set improves the generalization of the emotion recognition system. Emotion-LLaMA is further highlighted for its ability to reason across modalities, aligning audio, visual, and text features to sharpen the interpretation of emotions (a minimal sketch of such a pseudo-labeling loop follows this list).
- Conv-Attention for Feature Fusion: The Conv-Attention framework offers a significant improvement over traditional attention-based models. By combining convolutional operations with a global attention mechanism, it fuses multimodal features efficiently, prioritizing crucial information while suppressing noise. This lightweight design extracts critical features with fewer data demands, addressing both computational efficiency and robustness in noisy environments (an illustrative sketch of such a fusion block also follows this list).
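The paper does not reproduce its pseudo-labeling code here; the Python sketch below only illustrates a confidence-filtered pseudo-labeling loop under assumed names. The `Sample` container, the `model.predict` interface, and the 0.9 threshold are hypothetical, not the authors' actual API or hyperparameters:

```python
# Hypothetical sketch of confidence-filtered pseudo-labeling for unlabeled clips.
# The Sample container, model.predict interface, and 0.9 threshold are assumptions.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Sample:
    audio: str                  # path to the audio clip
    video: str                  # path to the video clip
    text: str                   # transcript of the utterance
    label: Optional[str] = None # emotion label, if known

def generate_pseudo_labels(model, unlabeled, threshold=0.9):
    """Keep only predictions the multimodal model is confident about."""
    pseudo_labeled = []
    for sample in unlabeled:
        label, confidence = model.predict(sample.audio, sample.video, sample.text)
        if confidence >= threshold:
            pseudo_labeled.append(Sample(sample.audio, sample.video, sample.text, label))
    return pseudo_labeled

# Augmented training set = human-labeled data + high-confidence pseudo-labels:
# train_set = labeled_samples + generate_pseudo_labels(emotion_llama, unlabeled_samples)
```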
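Likewise, the following PyTorch sketch shows only the general idea of pairing a local convolutional branch with global self-attention over concatenated modality tokens. The layer sizes, the depthwise-convolution choice, and the residual layout are assumptions, not the paper's exact architecture:

```python
import torch
import torch.nn as nn

class ConvAttentionFusion(nn.Module):
    """Illustrative fusion block: local depthwise convolution + global self-attention.
    Dimensions and structure are assumptions, not the paper's exact design."""

    def __init__(self, dim=256, num_heads=4, kernel_size=3):
        super().__init__()
        # Local branch: depthwise 1D convolution over the token sequence.
        self.conv = nn.Conv1d(dim, dim, kernel_size, padding=kernel_size // 2, groups=dim)
        # Global branch: multi-head self-attention across all modality tokens.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual, text):
        # Each input: (batch, seq_len, dim). Concatenate modality tokens along the sequence.
        x = torch.cat([audio, visual, text], dim=1)
        local = self.conv(x.transpose(1, 2)).transpose(1, 2)  # local patterns per token
        global_out, _ = self.attn(x, x, x)                    # cross-modal global context
        fused = self.norm(x + local + global_out)             # residual fusion
        return fused.mean(dim=1)                              # pooled utterance representation

# Example: fuse 10 audio, 10 visual, and 10 text tokens of width 256 for a batch of 2.
fusion = ConvAttentionFusion()
a, v, t = (torch.randn(2, 10, 256) for _ in range(3))
print(fusion(a, v, t).shape)  # torch.Size([2, 256])
```

Because the convolutional branch is depthwise, it adds only a small number of parameters per channel, which is one way such a fusion module can stay lightweight relative to a purely attention-based design.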
Experimental Validation
The authors conducted extensive experiments to validate the proposed methodologies. Both unimodal and multimodal configurations were tested, with the Conv-Attention model showing superior performance, especially when trained on the augmented dataset that includes pseudo-labels. Emotion-LLaMA's impact is particularly evident in the MER-OV track, where it outperformed contemporary models in both accuracy and recall, reflecting its robustness and versatility on complex emotional datasets.
Implications and Future Directions
This research outlines a promising direction for future developments in AI and affective computing by integrating pseudo-label generation and multimodal feature fusion through Emotion-LLaMA and Conv-Attention. The proposed system's scalability and efficiency highlight its potential applicability in real-world settings where emotion recognition is demanding, such as human-computer interaction, healthcare, and education.
Future work could focus on extending the model's applicability across languages and cultural contexts. Additionally, efforts to improve real-time processing and further reduce computational costs would make these solutions more practical for broader applications. As emotion recognition technologies advance, the methodologies presented in this paper could significantly influence the development of robust multimodal systems that operate reliably in diverse and challenging environments.