Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition (2405.15216v1)

Published 24 May 2024 in cs.LG, cs.CL, cs.SD, and eess.AS

Abstract: Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors; however, they have shown little improvement over traditional LMs, mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a $\textit{scaled}$ error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts while achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, which is fed into an ASR system to produce noisy hypotheses, which are then paired with the original texts to train the DLM. DLM has several $\textit{key ingredients}$: (i) up-scaled model and data; (ii) usage of multi-speaker TTS systems; (iii) combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on $\textit{test-clean}$ and 3.3% WER on $\textit{test-other}$ on LibriSpeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used, and even match self-supervised methods which use external audio data. Furthermore, a single DLM is applicable to different ASRs and greatly surpasses the performance of conventional LM-based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems.

Denoising Language Models for Superior ASR Error Correction

The paper presents a Denoising Language Model (DLM) that significantly advances the state of the art in automatic speech recognition (ASR) by directly addressing ASR errors. Unlike traditional language models (LMs), which are oblivious to the specific errors an ASR system makes, the DLM is trained on vast amounts of synthetic data to correct those errors, thereby improving ASR accuracy.

Key Contributions

  1. Error Correction with Synthetic Data: The DLM is trained on a large-scale synthetic dataset generated using text-to-speech (TTS) systems. Data creation involves several steps (see the sketch after this list):
    • Text from a language-model training corpus is converted into audio via TTS systems.
    • The audio is processed by an ASR system to generate noisy hypotheses.
    • These noisy hypotheses are paired with the original clean text for training the DLM.
  2. Model Composition: The DLM is characterized by:
    • Extensive dataset and model scaling.
    • The use of multi-speaker TTS to introduce audio variability.
    • The application of various noise augmentation techniques, such as frequency masking and random character substitutions.
    • Novel decoding strategies to enhance performance.
  3. Empirical Evidence: The model achieves remarkable results on the LibriSpeech benchmark:
    • 1.5% word error rate (WER) on the test-clean subset.
    • 3.3% WER on the test-other subset.

These results are achieved without any external audio data, and they even match methods that use self-supervised pretraining on additional audio.
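
The data pipeline in contribution 1 is simple enough to summarize in a few lines. The sketch below is illustrative rather than the paper's code: synthesize_speech and transcribe are hypothetical stand-ins for a multi-speaker TTS system and a trained ASR system.

    import random

    def make_dlm_training_pairs(corpus, speakers, synthesize_speech, transcribe):
        """Yield (noisy_hypothesis, clean_text) pairs for DLM training."""
        for clean_text in corpus:
            # Render the text as audio with a randomly chosen TTS speaker,
            # so the same sentence yields varied acoustic realizations.
            audio = synthesize_speech(clean_text, speaker=random.choice(speakers))
            # Transcribe the synthetic audio; the ASR system's mistakes are
            # exactly the "noise" the DLM learns to remove.
            noisy_hypothesis = transcribe(audio)
            # The noisy hypothesis is the model input; the clean text is the target.
            yield noisy_hypothesis, clean_text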

Universal Utility

The DLM demonstrates universal applicability, functioning effectively across different ASR architectures, including Transformer-CTC, QuartzNet, and Whisper models. Moreover, it remains robust when evaluated on out-of-domain datasets such as TED-LIUM.
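
These cross-system and cross-domain comparisons are all measured in word error rate (WER): the minimum number of word-level substitutions, insertions, and deletions needed to turn a hypothesis into the reference, divided by the number of reference words. A minimal reference implementation (ours, not the paper's) using standard Levenshtein dynamic programming:

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """(substitutions + insertions + deletions) / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution / match
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    # One substituted word in a four-word reference gives 25% WER.
    assert word_error_rate("the cat sat down", "the cat sad down") == 0.25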

Scalability

The paper explores various dimensions of scalability:

  • Model Size: Larger DLMs consistently outperform smaller counterparts, indicating the benefits of model scaling.
  • Text Corpus Size: Increasing the size of the training text corpus leads to performance gains, although the marginal benefit decreases at higher corpus sizes.
  • Speaker Variability: Training with audio from a larger number of speakers improves the DLM’s efficacy, bolstering its ability to correct diverse types of ASR errors.

Data Generation Strategies

Various data augmentation strategies are explored to fine-tune the training distribution to better reflect real-world scenarios:

  • Character Substitutions: Introducing random character substitutions in the training data helps the DLM learn to correct more varied errors (a minimal sketch follows this list).
  • Frequency Masking: Adding frequency masking consistent with SpecAugment techniques improves performance.
  • Mixing Real and Synthetic Data: The inclusion of real noisy hypotheses alongside synthetic data aids in anchoring the training distribution closer to reality.
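
As a concrete instance of the character-substitution augmentation above, the sketch below corrupts clean text at a fixed rate; the 5% rate and lowercase-only replacement alphabet are illustrative assumptions, not the paper's settings.

    import random
    import string

    def substitute_characters(text: str, rate: float = 0.05, rng=None) -> str:
        """Replace a random fraction of non-space characters with random
        lowercase letters (assumed alphabet; rate is illustrative)."""
        rng = rng or random.Random()
        out = []
        for ch in text:
            if ch != " " and rng.random() < rate:
                out.append(rng.choice(string.ascii_lowercase))
            else:
                out.append(ch)
        return "".join(out)

    # e.g. "the quick brown fox ..." -> "the quibk brown fox ..."
    print(substitute_characters("the quick brown fox jumps over the lazy dog"))

Mixing such synthetically corrupted text with genuine ASR hypotheses, as the last bullet notes, keeps the training inputs anchored to errors the ASR system actually makes.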

Comparison with High-Quality TTS

Interestingly, the quality of the TTS system used to generate synthetic data is not directly correlated with the performance improvements observed. For instance, although Tacotron produces higher-quality audio, the DLM trained on its outputs does not perform as well as those trained on noisier TTS outputs from YourTTS or RichTTS. This suggests that diversified and noisier training data might be more beneficial for error correction tasks.

Practical and Theoretical Implications

The practical implications of deploying DLMs include significant enhancements in ASR systems across various domains without the need for extensive real-world audio data. The ability of DLMs to generalize and apply to different ASR architectures makes them a versatile tool in the field of speech recognition.

Theoretically, this research underscores the potential of using error correction models over traditional LMs in ASR systems. By focusing on correcting the specific types of errors made by ASR systems, DLMs offer a deeper integration between acoustic and linguistic modeling.

Future Directions

Potential future directions for research include:

  • Variational Approaches: Exploring variational methods to refine the error correction mechanism.
  • Enhanced Noise Models: Developing more sophisticated noise models to generate synthetic training data that better capture real-world variability.
  • Larger and More Diverse Corpora: Scaling the text corpus to more varied domains to further enhance the generalization capability of DLMs.

Conclusion

The Denoising Language Model presented in this paper marks a significant advance in ASR performance. By leveraging synthetic data and carefully designed error correction techniques, the DLM achieves state-of-the-art results, demonstrating the potential of error correction models to surpass traditional language models in practical ASR applications. This work lays the groundwork for future exploration of model scalability, data augmentation strategies, and the utility of DLMs across diverse ASR tasks.

Authors (6)
  1. Zijin Gu (9 papers)
  2. Tatiana Likhomanenko (41 papers)
  3. He Bai (50 papers)
  4. Erik McDermott (9 papers)
  5. Ronan Collobert (55 papers)
  6. Navdeep Jaitly (67 papers)