Lost in Transcription: Identifying and Quantifying the Accuracy Biases of Automatic Speech Recognition Systems Against Disfluent Speech (2405.06150v1)
Abstract: Automatic speech recognition (ASR) systems, increasingly prevalent in education, healthcare, employment, and mobile technology, face significant challenges in inclusivity, particularly for the 80 million-strong global community of people who stutter. These systems often fail to accurately interpret speech patterns deviating from typical fluency, leading to critical usability issues and misinterpretations. This study evaluates six leading ASR systems, analyzing their performance on both a real-world dataset of speech samples from individuals who stutter and a synthetic dataset derived from the widely-used LibriSpeech benchmark. The synthetic dataset, uniquely designed to incorporate various stuttering events, enables an in-depth analysis of each ASR system's handling of disfluent speech. Our comprehensive assessment includes metrics such as word error rate (WER), character error rate (CER), and semantic accuracy of the transcripts. The results reveal a consistent and statistically significant accuracy bias across all ASR systems against disfluent speech, manifesting as significant syntactic and semantic inaccuracies in transcriptions. These findings highlight a critical gap in current ASR technologies, underscoring the need for effective bias mitigation strategies. Addressing this bias is imperative not only to improve the technology's usability for people who stutter but also to ensure their equitable and inclusive participation in the rapidly evolving digital landscape.
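The WER and CER metrics mentioned in the abstract can be illustrated with a minimal sketch. This is not the paper's actual evaluation pipeline, only the standard edit-distance formulation: WER is the Levenshtein distance between the reference and hypothesis word sequences, normalized by the reference length, and CER is the same computation over characters.

```python
def edit_distance(ref, hyp):
    # Classic dynamic-programming Levenshtein distance over token sequences,
    # counting substitutions, deletions, and insertions.
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution
    return d[m][n]

def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (S + D + I) / N over whitespace-tokenized words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: the same distance computed over characters."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

For example, a transcript that drops a word repetition ("the the cat" → "the cat") incurs one deletion, so WER against the disfluent reference is 1/3. In practice, evaluations like the paper's typically normalize text (case, punctuation) before scoring.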
Authors: Dena Mujtaba, Nihar R. Mahapatra, Megan Arney, J. Scott Yaruss, Hope Gerlach-Houck, Caryn Herring, Jia Bin