Twists, Humps, and Pebbles: Multilingual Speech Recognition Models Exhibit Gender Performance Gaps (2402.17954v3)
Abstract: Current automatic speech recognition (ASR) models are designed to be used across many languages and tasks without substantial changes. However, this broad language coverage hides performance gaps within languages, for example, across genders. Our study systematically evaluates the performance of two widely used multilingual ASR models on three datasets, encompassing 19 languages from eight language families and two speaking conditions. Our findings reveal clear gender disparities, with the advantaged group varying across languages and models. Surprisingly, those gaps are not explained by acoustic or lexical properties. However, probing internal model states reveals a correlation with the gendered performance gap: the easier it is to distinguish speaker gender in a language using probes, the more the gap reduces, favoring female speakers. Our results show that gender disparities persist even in state-of-the-art models. Our findings have implications for the improvement of multilingual ASR systems, underscoring the importance of accessibility to training data and nuanced evaluation to predict and mitigate gender gaps. We release all code and artifacts at https://github.com/g8a9/multilingual-asr-gender-gap.
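To make the notion of a within-language gender performance gap concrete, the following is a minimal sketch of how one might compute word error rate (WER) per gender group and a relative gap between the groups. It is an illustration only, not the evaluation code from the released repository: the function names, the `(group, reference, hypothesis)` sample layout, the toy transcripts, and the choice of a percentage gap relative to the male WER are all assumptions made for this example.

```python
from collections import defaultdict


def word_errors(reference: str, hypothesis: str) -> tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_group(samples):
    """Corpus-level WER per group from (group, reference, hypothesis) triples."""
    errors, words = defaultdict(int), defaultdict(int)
    for group, ref, hyp in samples:
        e, n = word_errors(ref, hyp)
        errors[group] += e
        words[group] += n
    return {g: errors[g] / words[g] for g in errors}


def relative_gap(wer, a="female", b="male"):
    """Hypothetical gap metric: percent difference of WER(a) relative to WER(b).

    Negative values mean group `a` has the lower error rate.
    Undefined (ZeroDivisionError) when WER(b) is exactly zero.
    """
    return 100.0 * (wer[a] - wer[b]) / wer[b]


if __name__ == "__main__":
    # Toy transcripts, invented for illustration.
    toy = [
        ("female", "the cat sat", "the cat sat"),
        ("female", "a dog barked loudly", "a dog barks loudly"),
        ("male", "the cat sat", "the bat sat"),
        ("male", "a dog barked loudly", "a dog barked"),
    ]
    scores = wer_by_group(toy)
    print(scores, relative_gap(scores))
```

Computing the gap separately for each language and model, rather than pooling all utterances, is what exposes disparities that an aggregate score would hide.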
Authors: Giuseppe Attanasio, Beatrice Savoldi, Dennis Fucci, Dirk Hovy