Explainable Multimodal Emotion Recognition
Abstract: Multimodal emotion recognition is an important research topic in artificial intelligence; its main goal is to integrate multimodal clues to identify human emotional states. Current work generally assumes that benchmark labels are accurate and focuses on developing more effective architectures. However, emotion annotation relies on subjective judgment. To obtain more reliable labels, existing datasets usually restrict the label space to a few basic categories, hire many annotators, and use majority voting to select the most likely label. This process can discard labels that are correct but fall outside the candidate set or the majority vote. To ensure reliability without ignoring subtle emotions, we propose a new task called "Explainable Multimodal Emotion Recognition (EMER)". Unlike traditional emotion recognition, EMER goes a step further by providing explanations for its predictions. Through this task, we can extract relatively reliable labels, since each label has an explicit basis. Meanwhile, we use large language models (LLMs) to disambiguate unimodal clues and generate more complete multimodal explanations, from which we can extract richer emotions in an open-vocabulary manner. This paper presents our initial attempt at this task, including a new dataset, baselines, and evaluation metrics. In addition, EMER can serve as a benchmark task for evaluating the audio-video-text understanding performance of multimodal LLMs.