AffectGPT: Dataset and Framework for Explainable Multimodal Emotion Recognition (2407.07653v1)

Published 10 Jul 2024 in cs.HC

Abstract: Explainable Multimodal Emotion Recognition (EMER) is an emerging task that aims to achieve reliable and accurate emotion recognition. However, due to the high annotation cost, the existing dataset (denoted as EMER-Fine) is small, making it difficult to perform supervised training. To reduce the annotation cost and expand the dataset size, this paper reviews the previous dataset construction process. Then, we simplify the annotation pipeline, avoid manual checks, and replace the closed-source models with open-source models. Finally, we build EMER-Coarse, a coarsely-labeled dataset containing large-scale samples. Besides the dataset, we propose a two-stage training framework AffectGPT. The first stage exploits EMER-Coarse to learn a coarse mapping between multimodal inputs and emotion-related descriptions; the second stage uses EMER-Fine to better align with manually-checked results. Experimental results demonstrate the effectiveness of our proposed method on the challenging EMER task. To facilitate further research, we will make the code and dataset available at: https://github.com/zeroQiaoba/AffectGPT.

Authors (6)
  1. Zheng Lian (51 papers)
  2. Haiyang Sun (45 papers)
  3. Licai Sun (19 papers)
  4. Jiangyan Yi (77 papers)
  5. Bin Liu (441 papers)
  6. Jianhua Tao (139 papers)
Citations (2)

Summary

Explaining AffectGPT: Advances in Multimodal Emotion Recognition

The paper "AffectGPT: Dataset and Framework for Explainable Multimodal Emotion Recognition" addresses a notable challenge in emotion recognition—achieving reliable and accurate predictions across multiple modalities while incorporating explainability into the process. The researchers introduced a comprehensive approach to enhance Explainable Multimodal Emotion Recognition (EMER), including the development of a new dataset, EMER-Coarse, and a two-stage training framework named AffectGPT.

Dataset Development and Methodological Approach

The authors identify a critical limitation in the field: high annotation costs impede the collection of datasets large enough for effective supervised training. To reduce these costs, they construct EMER-Coarse, a coarsely labeled dataset of 115,595 samples, using an automated pipeline that replaces the manual checks and closed-source models of the original annotation process with open-source models. This automation is intended both to keep annotation quality acceptable at a reduced cost and to significantly expand the volume of data available for training.
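
For concreteness, an automated coarse-labeling pipeline of this kind might look like the sketch below. This is an illustration, not the authors' released code: the prompt, file layout, and the `query_open_source_mllm` helper are hypothetical placeholders for whichever open-source multimodal LLM the pipeline actually calls.

```python
# Hypothetical sketch of an automated coarse-labeling pipeline.
# `query_open_source_mllm` stands in for a call to an open-source
# multimodal LLM; no manual check is applied to its output.
import json
from pathlib import Path

PROMPT = (
    "Describe the emotional state of the speaker in this clip, "
    "considering facial expressions, voice, and spoken content."
)

def query_open_source_mllm(video_path: str, prompt: str) -> str:
    """Placeholder: wire up the open-source model of your choice here."""
    raise NotImplementedError

def build_emer_coarse(video_dir: str, out_path: str) -> None:
    records = []
    for video in sorted(Path(video_dir).glob("*.mp4")):
        description = query_open_source_mllm(str(video), PROMPT)
        # The raw model output is kept as the coarse label.
        records.append({"video": video.name, "description": description})
    Path(out_path).write_text(json.dumps(records, indent=2))

# build_emer_coarse("clips/", "emer_coarse.json")
```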

The Two-Stage Training Framework: AffectGPT

AffectGPT is presented as a two-stage framework designed to optimize the learning process using the EMER-Coarse and EMER-Fine datasets. The first stage uses EMER-Coarse to learn a generalized mapping between multimodal inputs and emotion-related descriptions, establishing a foundation for emotion recognition. The second stage then employs EMER-Fine, a smaller but manually checked dataset, to refine this mapping so that the model better conforms to verified emotion annotations.
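
In outline, the two-stage schedule amounts to sequential fine-tuning on the two datasets. The sketch below is a minimal illustration rather than the authors' implementation; the model wrapper (assumed to return a language-modeling loss, Hugging Face-style), the data loaders, and the learning rates are all assumptions.

```python
# Minimal sketch of the two-stage training schedule (illustrative only).
# `model` is any multimodal-LLM wrapper whose forward pass returns an
# object with a .loss attribute; loaders yield dicts of tensors.
import torch

def train_epochs(model, loader, optimizer, epochs, device="cuda"):
    model.train()
    for _ in range(epochs):
        for batch in loader:
            batch = {k: v.to(device) for k, v in batch.items()}
            loss = model(**batch).loss  # loss on emotion descriptions
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

# Stage 1: coarse mapping from multimodal inputs to descriptions.
# train_epochs(model, emer_coarse_loader,
#              torch.optim.AdamW(model.parameters(), lr=1e-5), epochs=1)

# Stage 2: align with manually checked annotations on the small EMER-Fine set.
# train_epochs(model, emer_fine_loader,
#              torch.optim.AdamW(model.parameters(), lr=5e-6), epochs=3)
```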

The two-stage approach yields quantitative improvements over single-stage training. The reported metrics show superior results on "Accuracyₛ" and "Recallₛ", with the two-stage model consistently outperforming alternatives trained exclusively on either EMER-Coarse or EMER-Fine. The findings indicate that both large-scale data exposure and meticulous fine-tuning are necessary for robust EMER systems.
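
The summary does not spell out how Accuracyₛ and Recallₛ are computed; one plausible reading, assumed here (the paper's actual protocol may differ, e.g. with synonym grouping), scores the overlap between predicted and annotated emotion-label sets:

```python
# Hedged sketch: set-level accuracy/recall over emotion-label sets.
def set_accuracy(pred: set[str], gold: set[str]) -> float:
    # Fraction of predicted labels that appear in the annotation.
    return len(pred & gold) / len(pred) if pred else 0.0

def set_recall(pred: set[str], gold: set[str]) -> float:
    # Fraction of annotated labels that the model recovered.
    return len(pred & gold) / len(gold) if gold else 0.0

pred = {"happy", "excited"}
gold = {"happy", "surprised"}
print(set_accuracy(pred, gold), set_recall(pred, gold))  # 0.5 0.5
```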

Implications and Future Directions

The implications of this research for EMER are substantial. For practical deployment, models derived from this framework could support more nuanced human-computer interaction systems capable of discerning complex emotional states from diverse inputs. By moving away from fixed emotion labels toward open-vocabulary classification, the approach aligns better with real-world interactions.

Theoretically, this work opens pathways for exploring emotion recognition beyond traditional boundaries, encouraging further research into the integration of multimodal and multilingual models for improved contextual understanding. Future work might include the exploration of larger LLMs or testing different neural architectures, particularly those that incorporate advanced techniques such as attention mechanisms to more effectively leverage multimodal inputs.

Conclusion

The paper offers a thorough exposition of an innovative approach to multimodal emotion recognition, contributing both to dataset construction methodology and to the training paradigms employed. The introduction of AffectGPT and EMER-Coarse represents a meaningful step forward for emotion recognition research, combining large-scale data collection with carefully staged training.
