Testing Correctness, Fairness, and Robustness of Speech Emotion Recognition Models (2312.06270v3)
Abstract: Machine learning models for speech emotion recognition (SER) can be trained for different tasks and are usually evaluated on a few available datasets per task. Tasks include arousal, valence, dominance, emotional categories, or tone of voice. These models are mainly evaluated in terms of correlation or recall and always show some errors in their predictions. The errors manifest themselves in model behaviour, which can differ greatly along different dimensions even when models achieve the same recall or correlation. This paper introduces a testing framework to investigate the behaviour of speech emotion recognition models by requiring different metrics to reach a certain threshold in order to pass a test. The test metrics can be grouped in terms of correctness, fairness, and robustness. The framework also provides a method for automatically specifying test thresholds for fairness tests, based on the datasets used, together with recommendations on how to select the remaining test thresholds. Nine transformer-based models, an xLSTM-based model, and a convolutional baseline model are tested for arousal, valence, dominance, and emotional categories. The test results highlight that models with high correlation or recall may rely on shortcuts, such as text sentiment, and differ in terms of fairness.
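The pass/fail logic the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the `Test` dataclass, its field names, and the example metric are hypothetical, and the concordance correlation coefficient is used here only because it is a common correctness metric for dimensional SER.

```python
# Hedged sketch of threshold-based behavioural testing for an SER model.
# All names (Test, run_tests, concordance_cc) are illustrative assumptions,
# not the framework's actual API.
import statistics as st
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple


@dataclass
class Test:
    name: str                                              # e.g. "correctness-ccc"
    metric: Callable[[List[float], List[float]], float]    # truth, prediction -> score
    threshold: float                                       # score needed to pass
    higher_is_better: bool = True


def concordance_cc(truth: List[float], pred: List[float]) -> float:
    """Concordance correlation coefficient, a common SER correctness metric."""
    mt, mp = st.mean(truth), st.mean(pred)
    vt, vp = st.pvariance(truth), st.pvariance(pred)
    cov = sum((t - mt) * (p - mp) for t, p in zip(truth, pred)) / len(truth)
    return 2 * cov / (vt + vp + (mt - mp) ** 2)


def run_tests(tests: List[Test],
              truth: List[float],
              pred: List[float]) -> Dict[str, Tuple[float, bool]]:
    """Each test passes only if its metric clears the given threshold."""
    results = {}
    for t in tests:
        score = t.metric(truth, pred)
        passed = score >= t.threshold if t.higher_is_better else score <= t.threshold
        results[t.name] = (score, passed)
    return results
```

A fairness test would follow the same pattern but compare the metric across speaker groups (e.g. the score difference between groups must stay below a threshold derived from the dataset), and a robustness test would compare scores on clean versus perturbed audio.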
- “Scientific machine learning benchmarks” In Nature Reviews Physics 4.6, 2022, pp. 413–420
- “On the opportunities and risks of foundation models” In arXiv preprint arXiv:2108.07258, 2021
- “GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding” In Proceedings of the 2018 EMNLP Workshop BlackboxNLP: Analyzing and Interpreting Neural Networks for NLP Brussels, Belgium: Association for Computational Linguistics, 2018, pp. 353–355 DOI: 10.18653/v1/W18-5446
- “HEAR 2021: Holistic Evaluation of Audio Representations” In arXiv preprint arXiv:2203.03022, 2022
- “Underspecification Presents Challenges for Credibility in Modern Machine Learning” In Journal of Machine Learning Research 23, 2022, pp. 1–61
- “Model Cards for Model Reporting” In Proceedings of the Conference on Fairness, Accountability, and Transparency New York, NY, USA: Association for Computing Machinery, 2019, pp. 220–229 DOI: 10.1145/3287560.3287596
- “Machine Learning Testing: Survey, Landscapes and Horizons” In IEEE Transactions on Software Engineering 48.1, 2020, pp. 1–36 DOI: 10.1109/TSE.2019.2962027
- “Introduction to software testing” Cambridge University Press, 2016
- Christian Murphy, Gail E Kaiser and Marta Arias “An approach to software testing of machine learning applications” In Proceedings of the Nineteenth International Conference on Software Engineering & Knowledge Engineering (SEKE) Boston, MA, USA: Knowledge Systems Institute Graduate School, 2007, pp. 167–172
- “DeepTest: Automated testing of deep-neural-network-driven autonomous cars” In Proceedings of the 40th International Conference on Software Engineering, 2018, pp. 303–314
- “Beyond Accuracy: Behavioral Testing of NLP models with CheckList” In Association for Computational Linguistics (ACL), 2020
- “DeepBillboard: Systematic physical-world testing of autonomous driving systems” In Proceedings of the ACM/IEEE 42nd International Conference on Software Engineering, 2020, pp. 347–358
- “Affective and behavioural computing: Lessons learnt from the first computational paralinguistics challenge” In Computer Speech & Language 53 Elsevier, 2019, pp. 156–180
- “SERAB: A Multi-Lingual Benchmark for Speech Emotion Recognition” In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2022, pp. 7697–7701 DOI: 10.1109/ICASSP43922.2022.9747348
- Mimansa Jaiswal and Emily Mower Provost “Best Practices for Noise-Based Augmentation to Improve the Performance of Emotion Recognition “In the Wild”” In arXiv preprint arXiv:2104.08806, 2021
- “Probing speech emotion recognition transformers for linguistic knowledge” In Interspeech 2022, Incheon, Korea, 18-22 September 2022, 2022, pp. 146–150 DOI: 10.21437/interspeech.2022-10371
- Matheus Schmitz, Rehan Ahmed and Jimi Cao “Bias and Fairness on Multimodal Emotion Detection Algorithms” In arXiv preprint arXiv:2205.08383, 2022
- “AI Fairness 360: An extensible toolkit for detecting, understanding, and mitigating unwanted algorithmic bias” In arXiv preprint arXiv:1810.01943, 2018
- “Fast Yet Effective Speech Emotion Recognition with Self-distillation” In arXiv preprint arXiv:2210.14636, 2022
- “Towards testing of deep learning systems with training set reduction” In arXiv preprint arXiv:1901.04169, 2019
- “CREMA-D: Crowd-sourced emotional multimodal actors dataset” In IEEE Transactions on Affective Computing 5.4 IEEE, 2014, pp. 377–390
- “Test splits for CREMA-D, emoDB, IEMOCAP, MELD, RAVDESS”, 2023 DOI: 10.5281/zenodo.10229583
- “Design, recording and verification of a Danish emotional speech database” In Proc. 5th European Conference on Speech Communication and Technology (Eurospeech 1997), 1997, pp. 1695–1698 DOI: 10.21437/Eurospeech.1997-482
- “A database of German emotional speech.” In Proceedings of the Annual Conference of the International Speech Communication Association (INTERSPEECH) 5 Lisbon, Portugal: ISCA, 2005, pp. 1517–1520
- “EMOVO corpus: an Italian emotional speech database” In Proceedings of the Ninth International Conference on Language Resources and Evaluation (LREC’14), 2014, pp. 3501–3504 European Language Resources Association (ELRA)
- “IEMOCAP: Interactive emotional dyadic motion capture database” In Language resources and evaluation 42 Springer, 2008, pp. 335–359
- “MELD: A Multimodal Multi-Party Dataset for Emotion Recognition in Conversations” In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, 2019, pp. 527–536
- “EmotionLines: An Emotion Corpus of Multi-Party Conversations” In Proceedings of the Eleventh International Conference on Language Resources and Evaluation (LREC 2018), 2018
- “Building Naturalistic Emotionally Balanced Speech Corpus by Retrieving Emotional Speech from Existing Podcast Recordings” In IEEE Transactions on Affective Computing 10.4, 2019, pp. 471–483
- “Polish Emotional Speech Database” Retrieved from http://www.eletel.p.lodz.pl/bronakowski/med_catalog/, 2020
- Steven R Livingstone and Frank A Russo “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English” In PloS one 13.5 Public Library of Science, 2018, pp. e0196391
- “The World of Emotions is not Two-Dimensional” In Psychological Science 18.12, 2007, pp. 1050–1057 DOI: 10.1111/j.1467-9280.2007.02024.x
- “Mapping emotion terms into affective space” In Swiss Journal of Psychology Hogrefe AG, 2016
- “Mapping discrete emotions into the dimensional space: An empirical approach” In 2012 IEEE International Conference on Systems, Man, and Cybernetics (SMC), 2012, pp. 3316–3320 IEEE
- Gyanendra K Verma and Uma Shanker Tiwary “Affect representation and recognition in 3D continuous valence–arousal–dominance space” In Multimedia Tools and Applications 76 Springer, 2017, pp. 2159–2183
- Dominik Maria Endres and Johannes E Schindelin “A new metric for probability distributions” In IEEE Transactions on Information Theory 49.7 IEEE, 2003, pp. 1858–1860
- “A Survey on Bias and Fairness in Machine Learning” In ACM Comput. Surv. 54.6 New York, NY, USA: Association for Computing Machinery, 2021 DOI: 10.1145/3457607
- “The measure and mismeasure of fairness: A critical review of fair machine learning” In arXiv preprint arXiv:1808.00023, 2018
- Eustasio Del Barrio, Paula Gordaliza and Jean-Michel Loubes “Review of mathematical frameworks for fairness in machine learning” In arXiv preprint arXiv:2005.13755, 2020
- Alekh Agarwal, Miroslav Dudík and Zhiwei Steven Wu “Fair regression: Quantitative definitions and reduction-based algorithms” In International Conference on Machine Learning, 2019, pp. 120–129 PMLR
- Steven Weinberger “Speech accent archive” Retrieved from http://accent.gmu.edu, 2015
- “Common Voice: A Massively-Multilingual Speech Corpus” In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), 2020, pp. 4211–4215
- “OPUS-MT — Building open translation services for the World” In Proceedings of the 22nd Annual Conference of the European Association for Machine Translation (EAMT), 2020
- P.J. Finlay and Contributors Argos Translate “Argos Translate”, 2023 URL: https://github.com/argosopentech/argos-translate
- Gölge Eren and The Coqui TTS Team “coqui-ai/TTS, version 0.6.1”, 2021 DOI: 10.5281/zenodo.6334862
- “Espnet-TTS: Unified, reproducible, and integratable open source end-to-end text-to-speech toolkit” In Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7654–7658 IEEE
- Paul Boersma “Praat: doing phonetics by computer [Computer program]” Retrieved from http://www.praat.org/, 2023
- “Intriguing properties of neural networks” In arXiv preprint arXiv:1312.6199, 2013
- “Measuring neural net robustness with constraints” In Advances in neural information processing systems 29, 2016
- David Snyder, Guoguo Chen and Daniel Povey “MUSAN: A Music, Speech, and Noise Corpus” In arXiv preprint arXiv:1510.08484, 2015
- “CAST a database: Rapid targeted large-scale big data acquisition via small-world modelling of social media platforms” In 2017 Seventh International Conference on Affective Computing and Intelligent Interaction (ACII), 2017, pp. 340–345 DOI: 10.1109/ACII.2017.8273622
- “Building the Singapore English National Speech Corpus” In Proc. Interspeech 2019, 2019, pp. 321–325 DOI: 10.21437/Interspeech.2019-1525
- “Evaluation of speech dereverberation algorithms using the MARDY database” In Proceedings of the International Workshop on Acoustic Echo and Noise Control (IWAENC), 2006, pp. 1–4
- Marco Jeub, Magnus Schäfer and Peter Vary “A Binaural Room Impulse Response Database for the Evaluation of Dereverberation Algorithms” In Proceedings of International Conference on Digital Signal Processing (DSP) Santorini, Greece: IEEE, 2009, pp. 1–4 IEEE, IET, EURASIP
- “TIMIT Acoustic-Phonetic Continuous Speech Corpus” In Linguistic Data Consortium, Philadelphia, 1993 DOI: 10.35111/17gk-bn40
- “Dawn of the transformer era in speech emotion recognition: closing the valence gap” In IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023, pp. 1–13
- “HuBERT: Self-supervised speech representation learning by masked prediction of hidden units” In IEEE/ACM Transactions on Audio, Speech, and Language Processing 29 IEEE, 2021, pp. 3451–3460
- “wav2vec 2.0: A framework for self-supervised learning of speech representations” In Advances in Neural Information Processing Systems (NeurIPS), 2020, pp. 12449–12460
- “Libri-light: A benchmark for ASR with limited or no supervision” In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 7669–7673 IEEE
- “Librispeech: an ASR corpus based on public domain audio books” In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5206–5210 IEEE
- “Switchboard-1 Release 2 LDC97S62” In Linguistic Data Consortium, 1993, pp. 34
- “Fisher English training speech part 1 transcripts” In Philadelphia: Linguistic Data Consortium, 2004
- “VoxPopuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation” In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers) Online: Association for Computational Linguistics, 2021, pp. 993–1003 URL: https://aclanthology.org/2021.acl-long.80
- “MLS: A Large-Scale Multilingual Dataset for Speech Research”, 2020
- “VoxLingua107: a dataset for spoken language recognition” In 2021 IEEE Spoken Language Technology Workshop (SLT), 2021, pp. 652–658 IEEE
- “Speech recognition and keyword spotting for low-resource languages: Babel project research at CUED” In Fourth International Workshop on Spoken Language Technologies for Under-Resourced Languages (SLTU-2014), 2014, pp. 16–23 International Speech Communication Association (ISCA)
- Geoffrey Hinton, Oriol Vinyals and Jeff Dean “Distilling the knowledge in a neural network” In arXiv preprint arXiv:1503.02531, 2015 (NIPS 2014 Deep Learning Workshop)