Careless Whisper: Speech-to-Text Hallucination Harms (2402.08021v2)

Published 12 Feb 2024 in cs.CL and cs.CY

Abstract: Speech-to-text services aim to transcribe input audio as accurately as possible. They increasingly play a role in everyday life, for example in personal voice assistants or in customer-company interactions. We evaluate OpenAI's Whisper, a state-of-the-art automated speech recognition service outperforming industry competitors, as of 2023. While many of Whisper's transcriptions were highly accurate, we find that roughly 1% of audio transcriptions contained entire hallucinated phrases or sentences which did not exist in any form in the underlying audio. We thematically analyze the Whisper-hallucinated content, finding that 38% of hallucinations include explicit harms such as perpetuating violence, making up inaccurate associations, or implying false authority. We then study why hallucinations occur by observing the disparities in hallucination rates between speakers with aphasia (who have a lowered ability to express themselves using speech and voice) and a control group. We find that hallucinations disproportionately occur for individuals who speak with longer shares of non-vocal durations -- a common symptom of aphasia. We call on industry practitioners to ameliorate these language-model-based hallucinations in Whisper, and to raise awareness of potential biases amplified by hallucinations in downstream applications of speech-to-text models.


Summary

  • The paper reveals that approximately 1% of Whisper transcriptions include hallucinated phrases, occurring at a higher rate for speakers with aphasia.
  • It employs logistic regression on 13,140 audio segments to link speech disfluencies and non-vocal pauses to these hallucinations.
  • The analysis advocates for design adjustments to reduce harmful outputs, contrasting Whisper’s approach with other speech-to-text systems.

Analyzing Speech-to-Text Hallucination Harms in OpenAI's Whisper

The proliferation of automated speech-to-text systems has marked a significant technological advancement, leveraging deep learning to transcribe audio with increasing accuracy. In this context, OpenAI's Whisper represents a state-of-the-art solution, reported to outperform its industry competitors as of 2023. However, as this paper by Koenecke et al. reveals, Whisper is not without limitations, and the authors open a critical discussion of 'hallucinations' in speech-to-text transcription. These hallucinations are defined as segments of output text that do not correspond to any input audio, posing potential risks across a range of applications.

Experimental Analysis of Whisper Transcriptions

Koenecke et al.'s paper centers on evaluating Whisper's transcription performance, focusing specifically on the hallucinations that occur during transcription. According to their findings, approximately 1% of transcriptions contain hallucinated phrases, a notable portion (38%) of which carry harmful content. The researchers processed 13,140 audio segments through Whisper, sourced from the AphasiaBank repository, which pairs recordings of speakers with aphasia with recordings from control speakers and thereby enables a systematic comparison of hallucination rates between the two groups.
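
The following is a minimal sketch, not the authors' code, of the kind of pipeline this evaluation implies: transcribe each segment with the open-source `whisper` package and flag transcript content with no counterpart in a human reference transcript. The file path, checkpoint choice, and the crude word-overlap filter are illustrative assumptions; the paper's actual hallucination labels came from careful human review.

```python
import whisper  # open-source Whisper package (pip install openai-whisper)

model = whisper.load_model("large-v2")  # any Whisper checkpoint can be substituted

def transcribe(path: str) -> str:
    """Transcribe one audio segment with Whisper."""
    result = model.transcribe(path)
    return result["text"].strip()

def candidate_hallucination(hypothesis: str, reference: str) -> bool:
    """Crude first-pass filter: flag transcripts containing a run of words
    absent from the reference. Real thematic coding needs human review."""
    ref_words = set(reference.lower().split())
    novel = [w for w in hypothesis.lower().split() if w not in ref_words]
    return len(novel) > 5  # illustrative threshold, not the paper's criterion

# Hypothetical usage against a ground-truth transcript for one segment:
print(candidate_hallucination(transcribe("segment_001.wav"),
                              "thank you for watching"))
```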

A striking observation is that hallucinations are not uniformly distributed: they are more prevalent in segments spoken by people with aphasia. Specifically, 1.7% of audio files from speakers with aphasia contained hallucinations, versus 1.2% among control speakers. This disparity raises substantial concerns about bias, which may exacerbate existing inequalities for individuals with speech disorders and influence decisions in high-stakes settings such as healthcare or legal proceedings.
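
To make the disparity concrete, one could ask whether a 1.7% versus 1.2% gap is statistically distinguishable. Below is a hedged sketch using a simple two-proportion z-test; note that this is not the paper's method (the authors use matched samples and logistic regression), and the group sizes are hypothetical placeholders rather than the paper's actual counts.

```python
from statsmodels.stats.proportion import proportions_ztest

# Placeholder even split of the 13,140 segments; the true split is not shown here.
n_aphasia, n_control = 6570, 6570
counts = [round(0.017 * n_aphasia),  # 1.7% hallucination rate, aphasia group
          round(0.012 * n_control)]  # 1.2% hallucination rate, control group
stat, pval = proportions_ztest(counts, [n_aphasia, n_control])
print(f"z = {stat:.2f}, p = {pval:.4f}")
```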

Investigation of Hallucination Content and Causes

The analysis identifies 312 hallucination instances and categorizes them into themes such as violence, demographic stereotypes, false associations, and fraudulent information. This categorization underscores the potential for such hallucinations to perpetuate damaging stereotypes or misinformation. Furthermore, the paper hypothesizes that these hallucinations result from Whisper's fusion of audio encoding with generative language modeling, a mechanism akin to that underlying LLMs such as ChatGPT.

The observed disparities suggest that non-vocal durations within audio segments contribute significantly to hallucination occurrences. Speech disfluencies, in particular the extended pauses typical of aphasic speech, are especially predictive of hallucinations, as confirmed by logistic regression analysis. This link between non-speech audio and fabricated text underscores the need for nuanced handling of silence in speech-to-text system design and deployment.
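
A hedged illustration of that regression is sketched below: model the probability that a segment is hallucinated as a function of its share of non-vocal time (e.g., as measured by a voice activity detector) and speaker group. The file and column names are hypothetical, and the paper's actual analysis additionally matches aphasia and control speakers on covariates.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("segments.csv")  # hypothetical: one row per audio segment
# hallucinated: 0/1 label; nonvocal_share: fraction of the segment with no
# detected voice activity; has_aphasia: 0/1 speaker-group indicator.
result = smf.logit("hallucinated ~ nonvocal_share + has_aphasia", data=df).fit()
print(result.summary())  # a positive nonvocal_share coefficient matches the finding
```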

Comparative Analysis and Implications

A notable contrast arises when comparing Whisper's output with speech-to-text systems from Google and other major technology companies, in which the authors found no comparable hallucinations. This absence suggests fundamental differences in modeling approach and points to avenues for improving Whisper's fidelity and reliability. The paper suggests mitigations such as constraining Whisper's decoding randomness (as sketched below) and guiding users on how to handle potential hallucinations.
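
One concrete instance of constraining decoding randomness: both the open-source Whisper package and OpenAI's hosted transcription API expose a `temperature` parameter, and pinning it to 0 makes decoding greedy rather than sampled. A minimal sketch against the hosted API follows; this may reduce, but does not eliminate, fabricated text, and is offered as an illustration rather than a fix the paper validates.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

with open("segment_001.wav", "rb") as audio:  # hypothetical file
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio,
        temperature=0.0,  # greedy decoding: removes sampling randomness
    )
print(transcript.text)
```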

Conclusion and Future Considerations

This paper elevates the discourse on automated speech-to-text systems, emphasizing the necessity of accounting for inaccuracies like hallucinations that can undermine the utility and trustworthiness of AI technologies. The implications are far-reaching: the findings highlight potential biases against users with inherent speech disfluencies and accentuate broader ethical considerations in AI deployment. Future research is encouraged to examine how hallucination harms intersect with other demographic and disability-related disparities, with the aim of refining models for greater inclusivity and fairness.

The work by Koenecke et al. presents a compelling narrative that urges industry practitioners to prioritize transparency and ethical responsibility in addressing the intricacies of AI hallucinations. This ongoing dialogue will be crucial in fostering advancements that respect the diverse needs of all users, ensuring the broader societal benefits of AI technologies are equitably realized.