
Testing Correctness, Fairness, and Robustness of Speech Emotion Recognition Models (2312.06270v3)

Published 11 Dec 2023 in eess.AS and cs.SD

Abstract: Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated on a few available datasets per task. Tasks could include arousal, valence, dominance, emotional categories, or tone of voice. These models are mainly evaluated in terms of correlation or recall, and always show some errors in their predictions. The errors manifest themselves in model behaviour, which can be very different along different dimensions even if the model achieves the same recall or correlation. This paper introduces a testing framework to investigate the behaviour of speech emotion recognition models by requiring different metrics to reach a certain threshold in order to pass a test. The test metrics can be grouped in terms of correctness, fairness, and robustness. It also provides a method for automatically specifying test thresholds for fairness tests, based on the datasets used, and recommendations on how to select the remaining test thresholds. Nine different transformer-based models, an xLSTM-based model, and a convolutional baseline model are tested for arousal, valence, dominance, and emotional categories. The test results highlight that models with high correlation or recall might rely on shortcuts, such as text sentiment, and differ in terms of fairness.
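
The idea of passing a test only when a metric reaches a threshold can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the class `MetricTest`, the test names, the toy arousal values, and the thresholds 0.1 and 0.05 are all hypothetical, and mean absolute error merely stands in for whichever metrics the framework actually uses.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


def mean_absolute_error(a: Sequence[float], b: Sequence[float]) -> float:
    """Average absolute difference between two equally long sequences."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


@dataclass
class MetricTest:
    """Hypothetical test: a metric plus the threshold it must reach."""
    name: str
    group: str  # "correctness", "fairness", or "robustness"
    metric: Callable[[Sequence[float], Sequence[float]], float]
    threshold: float  # illustrative value, not from the paper
    higher_is_better: bool = False

    def run(self, a: Sequence[float], b: Sequence[float]) -> tuple[float, bool]:
        value = self.metric(a, b)
        if self.higher_is_better:
            passed = value >= self.threshold
        else:
            passed = value <= self.threshold
        return value, passed


# Toy arousal values in [0, 1]; made up for this example.
truth = [0.2, 0.5, 0.8]
clean_pred = [0.25, 0.45, 0.9]   # predictions on the original audio
noisy_pred = [0.45, 0.30, 0.6]   # predictions after a perturbation

correctness = MetricTest("mae_vs_truth", "correctness",
                         mean_absolute_error, threshold=0.1)
robustness = MetricTest("shift_under_noise", "robustness",
                        mean_absolute_error, threshold=0.05)

results = {
    correctness.name: correctness.run(truth, clean_pred),
    robustness.name: robustness.run(clean_pred, noisy_pred),
}
for name, (value, passed) in results.items():
    print(f"{name}: {value:.3f} -> {'pass' if passed else 'fail'}")
```

In this toy setup the same model passes the correctness test but fails the robustness test, which mirrors the abstract's point that a single headline score can hide very different behaviour.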
