Hallucinations in Neural Automatic Speech Recognition: Identifying Errors and Hallucinatory Models (2401.01572v1)

Published 3 Jan 2024 in cs.CL, cs.SD, and eess.AS

Abstract: Hallucinations are a type of output error produced by deep neural networks. While hallucinations have been studied in natural language processing, they have not previously been researched in automatic speech recognition. Here, we define hallucinations in ASR as transcriptions generated by a model that are semantically unrelated to the source utterance, yet still fluent and coherent. The similarity of hallucinations to probable natural language outputs of the model creates a danger of deception and impacts the credibility of the system. We show that commonly used metrics, such as word error rates, cannot differentiate between hallucinatory and non-hallucinatory models. To address this, we propose a perturbation-based method for assessing the susceptibility of an automatic speech recognition (ASR) model to hallucination at test time, which does not require access to the training dataset. We demonstrate that this method helps to distinguish between hallucinatory and non-hallucinatory models that have similar baseline word error rates. We further explore the relationship between the types of ASR errors and the types of dataset noise to determine which types of noise are most likely to create hallucinatory outputs. We devise a framework for identifying hallucinations by analysing their semantic connection with the ground truth and their fluency. Finally, we show how to induce hallucinations by injecting random noise into the utterance.

References (28)
  1. Hussam Alkaissi and Samy I McFarlane. 2023. Artificial hallucinations in chatgpt: Implications in scientific writing. Cureus, 15.
  2. Detection of confusable words in automatic speech recognition. IEEE Signal Processing Letters, 12(8):585–588.
  3. Common voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, Marseille, France. European Language Resources Association.
  4. wav2vec 2.0: A framework for self-supervised learning of speech representations. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual.
  5. Scaling instruction-finetuned language models.
  6. Automatic speech recognition errors detection and correction: A review. Procedia Computer Science, 128:32–37. 1st International Conference on Natural Language and Speech Processing.
  7. Non-autoregressive Chinese ASR error correction with phonological training. In Proceedings of the 2022 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 5907–5917, Seattle, United States. Association for Computational Linguistics.
  8. Vitaly Feldman. 2020. Does learning require memorization? a short tale about a long tail. In Proceedings of the 52nd Annual ACM SIGACT Symposium on Theory of Computing, STOC 2020, page 954–959, New York, NY, USA. Association for Computing Machinery.
  9. Topic model robustness to automatic speech recognition errors in podcast transcripts. ArXiv, abs/2109.12306.
  10. State-of-the-art generalisation research in nlp: A taxonomy and review.
  11. How bad are artifacts?: Analyzing the impact of speech enhancement errors on asr. In Interspeech.
  12. Survey of hallucination in natural language generation. ACM Comput. Surv., 55(12).
  13. Preethi Jyothi and Eric Fosler-Lussier. 2010. Discriminative language modeling using simulated asr errors. In Interspeech.
  14. Philipp Koehn and Rebecca Knowles. 2017. Six challenges for neural machine translation. In Proceedings of the First Workshop on Neural Machine Translation, pages 28–39, Vancouver. Association for Computational Linguistics.
  15. Hallucinations in neural machine translation.
  16. Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, Barcelona, Spain. Association for Computational Linguistics.
  17. Auto-avsr: Audio-visual speech recognition with automatic labels. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5.
  18. Identifying fluently inadequate output in neural and statistical machine translation. In Proceedings of Machine Translation Summit XVII: Research Track, pages 233–243, Dublin, Ireland. European Association for Machine Translation.
  19. Adding noise to improve noise robustness in speech recognition. In Eighth Annual Conference of the International Speech Communication Association.
  20. GLEU: Automatic evaluation of sentence-level fluency. In Proceedings of the 45th Annual Meeting of the Association of Computational Linguistics, pages 344–351, Prague, Czech Republic. Association for Computational Linguistics.
  21. fairseq: A fast, extensible toolkit for sequence modeling. In Proceedings of NAACL-HLT 2019: Demonstrations.
  22. Librispeech: An asr corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210.
  23. Specaugment: A simple data augmentation method for automatic speech recognition. Interspeech 2019.
  24. The curious case of hallucinations in neural machine translation. pages 1172–1183.
  25. Hallucinated n-best lists for discriminative language modeling. In 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5001–5004.
  26. Hallucination of speech recognition errors with sequence to sequence learning. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:890–900.
  27. Learning from past mistakes: improving automatic speech recognition output via noisy-clean phrase context modeling. APSIPA Transactions on Signal and Information Processing, 8:e8.
  28. Synthetic data augmentation for improving low-resource asr. In 2019 IEEE Western New York Image and Signal Processing Workshop (WNYISPW), pages 1–9.
Citations (5)

Summary

  • The paper introduces a novel framework that uses a perturbation-based method to identify hallucinations in neural ASR outputs.
  • It reveals that traditional metrics such as word error rate (WER) cannot differentiate between hallucinatory and non-hallucinatory models, as illustrated in the sketch after this list.
  • The study links the types of dataset noise to the likelihood of hallucinatory outputs, emphasizing the need for further research on mitigation strategies.
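
To make the WER point concrete, here is a minimal, self-contained illustration (ours, not taken from the paper): two hypothetical models reach the same corpus-level WER, yet only one of them hallucinates.

```python
# Illustration only: corpus-level WER cannot distinguish a model that makes
# small phonetic errors everywhere from one that transcribes most audio
# perfectly but occasionally emits a fluent, unrelated transcript.

def edit_distance(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (single-row dynamic programming)."""
    dp = list(range(len(hyp) + 1))
    for i in range(1, len(ref) + 1):
        prev, dp[0] = dp[0], i
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            prev, dp[j] = dp[j], min(dp[j] + 1,      # deletion
                                     dp[j - 1] + 1,  # insertion
                                     prev + cost)    # substitution
    return dp[-1]

def corpus_wer(refs: list[str], hyps: list[str]) -> float:
    """Total word edits across the corpus over total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    return errors / sum(len(r.split()) for r in refs)

refs = ["the cat sat on the mat",
        "he walked slowly down the road",
        "birds sing early in the morning",
        "we ate dinner after the game",
        "rain fell softly on the roof"]

# Model A: one or two phonetic substitutions in every utterance.
model_a = ["the bat sat on the hat",
           "he walked slowly down the load",
           "birds sing early in the mourning",
           "we ate dinner after the gate",
           "rain fell softly on the root"]

# Model B: perfect everywhere, except one fluent hallucination.
model_b = refs[:4] + ["thank you all for joining today"]

print(corpus_wer(refs, model_a))  # 0.2
print(corpus_wer(refs, model_b))  # 0.2 -- same WER, very different failure mode
```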

Introduction

Automatic Speech Recognition (ASR) systems have improved significantly with the advent of neural architectures. A lesser-discussed challenge, however, is that they can produce 'hallucinations': outputs that are coherent and fluently structured yet have no semantic connection to the input utterance. Such outputs can mislead users, raising concerns about the system's reliability and credibility.

Hallucination Detection

A novel aspect of this paper is its investigation of hallucinations specifically in ASR, with a framework developed to detect and analyze them. An intriguing finding is that commonly used metrics such as Word Error Rate (WER) fail to discern between models prone to hallucinations and those that are not. To tackle this, the researchers propose a perturbation-based method, applied at test time, that does not require access to the original training data. The method can identify models susceptible to hallucinations even when their baseline WERs match those of non-hallucinatory models.
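
The paper does not reproduce its implementation here, so the following is a minimal sketch of how such a test-time probe could be structured. The `transcribe`, `semantic_similarity`, and `is_fluent` callables are hypothetical stand-ins for an ASR model wrapper and the paper's semantic and fluency checks, and the SNR and similarity threshold are illustrative defaults, not values from the paper.

```python
import numpy as np

def inject_noise(waveform: np.ndarray, snr_db: float, rng) -> np.ndarray:
    """Add white Gaussian noise at a target signal-to-noise ratio (in dB)."""
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

def hallucination_susceptibility(utterances, transcribe, semantic_similarity,
                                 is_fluent, snr_db=0.0, sim_threshold=0.2,
                                 seed=0):
    """Fraction of perturbed transcripts that stay fluent yet become
    semantically unrelated to the clean transcript: a proxy for how prone
    the model is to hallucinating, computed without the training data."""
    rng = np.random.default_rng(seed)
    flagged = 0
    for waveform in utterances:
        clean_text = transcribe(waveform)
        noisy_text = transcribe(inject_noise(waveform, snr_db, rng))
        unrelated = semantic_similarity(clean_text, noisy_text) < sim_threshold
        if unrelated and is_fluent(noisy_text):
            flagged += 1
    return flagged / len(utterances)
```

Comparing this score across models with similar baseline WER is the kind of test-time check the paper argues for: the model whose outputs drift into fluent but unrelated text under perturbation is the hallucinatory one.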

Understanding Hallucination-Prone Models

By injecting random noise into utterances to induce hallucinations, the paper examines how factors such as dataset noise contribute to ASR errors, elucidating which types of dataset noise are most likely to produce hallucinatory outputs. An analysis of the outputs' semantic content and fluency then distinguishes hallucinations from other ASR errors, such as phonetic substitutions or oscillations (repeated n-grams).
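
As a rough illustration of that distinction, the decision rule below (our assumption, not the paper's exact procedure) separates oscillations, detected as repeated n-grams, from fluent hallucinations and ordinary recognition errors; the similarity threshold and the `semantic_similarity` and `is_fluent` helpers are again placeholders.

```python
from collections import Counter

def has_oscillation(text: str, n: int = 3, min_repeats: int = 3) -> bool:
    """True if any word n-gram repeats min_repeats times or more."""
    words = text.split()
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return bool(ngrams) and max(ngrams.values()) >= min_repeats

def classify_error(hypothesis: str, reference: str, semantic_similarity,
                   is_fluent, sim_threshold=0.2) -> str:
    """Coarse error taxonomy: oscillation, hallucination (fluent but
    unrelated), degenerate (unrelated and disfluent), or ordinary error."""
    if has_oscillation(hypothesis):
        return "oscillation"
    if semantic_similarity(hypothesis, reference) < sim_threshold:
        return "hallucination" if is_fluent(hypothesis) else "degenerate"
    return "ordinary error"  # e.g. phonetic substitutions

print(has_oscillation("so we can we can we can we can do it"))  # True
```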

Implications and Future Research

While the paper makes significant headway in understanding and detecting ASR hallucinations, it also points out limitations and necessary further research. The proposed methods are currently applied only to English ASR, so broader linguistic coverage remains to be demonstrated. Furthermore, while the paper delineates hallucinations, it does not address their mitigation, a critical next step in enhancing ASR system security and performance.