EMO-SUPERB: An In-depth Look at Speech Emotion Recognition (2402.13018v4)

Published 20 Feb 2024 in eess.AS and cs.SD

Abstract: Speech emotion recognition (SER) is a pivotal technology for human-computer interaction systems. However, 80.77% of SER papers yield results that cannot be reproduced. We develop EMO-SUPERB, short for EMOtion Speech Universal PERformance Benchmark, which aims to enhance open-source initiatives for SER. EMO-SUPERB includes a user-friendly codebase that leverages 15 state-of-the-art speech self-supervised learning models (SSLMs) for exhaustive evaluation across six open-source SER datasets. EMO-SUPERB streamlines result sharing via an online leaderboard, fostering collaboration within a community-driven benchmark and thereby advancing the development of SER. On average, 2.58% of annotations are given as natural-language descriptions rather than categorical labels. Because typical SER systems rely on classification models that cannot process natural language, these valuable annotations are discarded. We prompt ChatGPT to mimic annotators, comprehend the natural-language annotations, and re-label the data. Using the labels generated by ChatGPT, we consistently achieve an average relative gain of 3.08% across all settings.
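The ChatGPT re-labeling step described in the abstract lends itself to a short sketch. The snippet below is a minimal, hypothetical illustration of prompting a chat model to map a free-text ("typed") emotion annotation onto a fixed categorical label set; the prompt wording, the model name (gpt-3.5-turbo), and the label set are assumptions made for illustration, not the paper's exact configuration.

```python
# Hypothetical sketch of re-labeling natural-language annotations with ChatGPT.
# Assumptions: an IEMOCAP-style label set, a simple annotator-role prompt, and
# the OpenAI chat completions API; the paper's actual prompt is not shown here.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Example categorical label set; the real set differs per dataset.
LABELS = ["angry", "happy", "sad", "neutral", "frustrated", "excited"]

def relabel(free_text_annotation: str) -> str:
    """Ask the model to act as a human annotator and pick one categorical label."""
    prompt = (
        "You are an emotion annotator. An annotator described a speech clip as:\n"
        f'"{free_text_annotation}"\n'
        f"Choose the single best-matching label from: {', '.join(LABELS)}. "
        "Reply with the label only."
    )
    resp = client.chat.completions.create(
        model="gpt-3.5-turbo",  # assumed model; the paper only says "ChatGPT"
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic labeling
    )
    answer = resp.choices[0].message.content.strip().lower()
    # Fall back to "neutral" if the reply is not in the known label set.
    return answer if answer in LABELS else "neutral"

if __name__ == "__main__":
    print(relabel("sounds mildly annoyed but trying to stay calm"))
```

In this sketch, annotations that a classifier would otherwise discard are converted into usable categorical labels, which is the mechanism behind the reported 3.08% average relative gain.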

Authors (8)
  1. Haibin Wu (84 papers)
  2. Huang-Cheng Chou (9 papers)
  3. Kai-Wei Chang (292 papers)
  4. Lucas Goncalves (5 papers)
  5. Jiawei Du (31 papers)
  6. Jyh-Shing Roger Jang (28 papers)
  7. Chi-Chun Lee (11 papers)
  8. Hung-yi Lee (325 papers)
Citations (7)