MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition (2404.17113v4)

Published 26 Apr 2024 in cs.LG and cs.HC

Abstract: Multimodal emotion recognition is an important research topic in artificial intelligence. Over the past few decades, researchers have made remarkable progress by increasing the dataset size and building more effective algorithms. However, due to problems such as complex environments and inaccurate annotations, current systems are hard to meet the demands of practical applications. Therefore, we organize the MER series of competitions to promote the development of this field. Last year, we launched MER2023, focusing on three interesting topics: multi-label learning, noise robustness, and semi-supervised learning. In this year's MER2024, besides expanding the dataset size, we further introduce a new track around open-vocabulary emotion recognition. The main purpose of this track is that existing datasets usually fix the label space and use majority voting to enhance the annotator consistency. However, this process may lead to inaccurate annotations, such as ignoring non-majority or non-candidate labels. In this track, we encourage participants to generate any number of labels in any category, aiming to describe emotional states as accurately as possible. Our baseline code relies on MERTools and is available at: https://github.com/zeroQiaoba/MERTools/tree/master/MER2024.

Insights into MER 2024: Challenges in Multimodal Emotion Recognition

The paper "MER 2024: Semi-Supervised Learning, Noise Robustness, and Open-Vocabulary Multimodal Emotion Recognition" explores the development of the MER2024 challenge, an initiative focused on advancing the field of multimodal emotion recognition through diverse and targeted tasks. The research team's ongoing efforts are encapsulated in this work, which builds upon the foundation laid by the MER2023 challenge. In this paper, three distinct tracks are outlined: MER-SEMI, MER-NOISE, and the newly introduced MER-OV, each addressing specific challenges inherent to multimodal emotion recognition systems.

The MER2024 challenge aims to overcome obstacles such as dataset limitations, noise interference, and the constrained expressive capacity of fixed-vocabulary models. The authors describe the dataset expansion strategy, the methodological advances, and the performance metrics established to foster the development of robust and effective emotion recognition systems.

Dataset and Methodology

The construction of the MER2024 dataset signifies a critical leap from its predecessor, MER2023, by incorporating a larger corpus of labeled and unlabeled samples, tailored to enhance training efficacy under semi-supervised frameworks. A novel aspect introduced in this iteration is the deployment of open-vocabulary labels to more accurately capture nuanced emotional expressions. The dataset is structured to address key technical challenges in emotion recognition: domain adaptation, label sparsity, and noise robustness.

In the MER-SEMI track, the core focus is leveraging unlabeled data using semi-supervised techniques to enhance model performance, aiming for effective domain adaptation and improved generalization on unseen data. The MER-NOISE track centers on developing systems resilient to audio-visual noise perturbations, addressing common issues encountered in practical deployment scenarios.
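
As a concrete illustration of the semi-supervised setting, the sketch below shows a generic confidence-thresholded pseudo-labeling loop. It is not the official MER2024 baseline; the model interface, the fused feature shapes, and the 0.9 confidence threshold are illustrative assumptions.

import torch
import torch.nn.functional as F

@torch.no_grad()
def pseudo_label(model, unlabeled_loader, threshold=0.9):
    # Keep only unlabeled clips whose maximum softmax confidence clears the
    # threshold; everything else stays unlabeled for the next round.
    model.eval()
    kept_feats, kept_labels = [], []
    for feats in unlabeled_loader:            # feats: (batch, feat_dim) fused features
        probs = F.softmax(model(feats), dim=-1)
        conf, labels = probs.max(dim=-1)
        mask = conf >= threshold
        kept_feats.append(feats[mask])
        kept_labels.append(labels[mask])
    return torch.cat(kept_feats), torch.cat(kept_labels)

# Typical loop: train on the labeled set, pseudo-label the unlabeled pool,
# retrain on the union, and repeat until the kept set stops growing.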

The MER-OV track advances the field by allowing models to predict an unbounded number of emotion labels across categories, a significant shift from traditional classification approaches. This encourages models to learn more granular emotional representations, challenging them to engage with the complexity of human affective states directly.

Baselines and Performance Metrics

The establishment of robust baselines using both traditional and state-of-the-art methods sets the groundwork for fair and comprehensive comparison. For MER-SEMI and MER-NOISE, the baselines rely on MERTools, with a focus on attention-based fusion strategies. For MER-OV, multimodal LLMs (MLLMs) are leveraged for the open-vocabulary task, demonstrating their adaptability in capturing complex sentiment and affective nuance.
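
The exact fusion modules live in the MERTools repository; as a rough illustration of attention-based fusion, the PyTorch sketch below weights projected audio, visual, and text features with learned attention scores before classification. The input dimensions, hidden size, and six-class output head are assumptions, not the official configuration.

import torch
import torch.nn as nn

class AttentionFusion(nn.Module):
    # Project audio, visual, and text features into a shared space, weight
    # them with learned attention scores, and classify the fused vector.
    def __init__(self, dims=(768, 768, 768), hidden=256, num_classes=6):
        super().__init__()
        self.proj = nn.ModuleList([nn.Linear(d, hidden) for d in dims])
        self.attn = nn.Linear(hidden, 1)                     # one score per modality
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, audio, visual, text):
        h = torch.stack([p(x) for p, x in zip(self.proj, (audio, visual, text))], dim=1)
        w = torch.softmax(self.attn(torch.tanh(h)), dim=1)   # (batch, 3, 1) weights
        fused = (w * h).sum(dim=1)                           # attention-weighted sum
        return self.head(fused)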

Focusing on metric choice, the paper prioritizes weighted average F-score (WAF) for MER-SEMI and MER-NOISE, emphasizing its utility in evaluating class-imbalanced datasets. For MER-OV, set-level metrics extend traditional accuracy and recall to accommodate the open nature of predicted labels. This framework evaluates the intersection of predicted versus ground truth emotion sets, providing a nuanced understanding of model performance in capturing the breadth of emotional expressions inherent in human interaction.
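
The sketch below illustrates both metric families under stated assumptions: WAF computed as a support-weighted F1 (here via scikit-learn), and per-sample set-level accuracy and recall based on the intersection of predicted and reference label sets, mirroring the intersection-based evaluation described above. Any additional normalization the official MER-OV scoring applies (for example, grouping synonymous labels) is not reproduced here.

from sklearn.metrics import f1_score

def waf(y_true, y_pred):
    # Weighted average F-score: per-class F1 weighted by class support,
    # which is why it suits class-imbalanced discrete labels.
    return f1_score(y_true, y_pred, average="weighted")

def set_level_scores(pred_sets, true_sets):
    # Per-sample set-level accuracy (|pred ∩ true| / |pred|) and recall
    # (|pred ∩ true| / |true|), averaged over samples.
    accs, recs = [], []
    for pred, true in zip(pred_sets, true_sets):
        pred, true = set(pred), set(true)
        inter = len(pred & true)
        accs.append(inter / len(pred) if pred else 0.0)
        recs.append(inter / len(true) if true else 0.0)
    return sum(accs) / len(accs), sum(recs) / len(recs)

# Example: set_level_scores([{"happy", "excited"}], [{"happy"}]) -> (0.5, 1.0)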

Implications and Future Directions

The MER2024 challenge positions itself as pivotal in advancing multimodal emotion recognition research by addressing prevailing limitations in contemporary systems. By fostering innovation in semi-supervised learning, noise robustness, and open-vocabulary label generation, it pushes research towards more nuanced and human-like emotion detection systems. These contributions may significantly enhance the applicability of emotion recognition in real-world settings, from human-computer interaction to healthcare.

Future developments will likely refine semi-supervised learning techniques, strengthen noise-handling algorithms, and continue to broaden the scope of open-vocabulary recognition. Additionally, the adaptation and integration of models across diverse multimedia sources, enriched by increasingly sophisticated contextual processing, will remain a priority.

By capturing the complexity of emotional experience through a multimodal lens, initiatives like MER2024 are poised to significantly enhance our comprehension and automation of human emotionality, driving impactful innovation across a range of applications in artificial intelligence.

Authors (18)
  1. Zheng Lian
  2. Haiyang Sun
  3. Licai Sun
  4. Zhuofan Wen
  5. Siyuan Zhang
  6. Shun Chen
  7. Hao Gu
  8. Jinming Zhao
  9. Ziyang Ma
  10. Xie Chen
  11. Jiangyan Yi
  12. Rui Liu
  13. Kele Xu
  14. Bin Liu
  15. Erik Cambria
  16. Guoying Zhao
  17. Björn W. Schuller
  18. Jianhua Tao