Denoising LM: Pushing the Limits of Error Correction Models for Speech Recognition (2405.15216v1)

Published 24 May 2024 in cs.LG, cs.CL, cs.SD, and eess.AS

Abstract: Language models (LMs) have long been used to improve results of automatic speech recognition (ASR) systems, but they are unaware of the errors that ASR systems make. Error correction models are designed to fix ASR errors; however, they have shown little improvement over traditional LMs, mainly due to the lack of supervised training data. In this paper, we present Denoising LM (DLM), which is a $\textit{scaled}$ error correction model trained with vast amounts of synthetic data, significantly exceeding prior attempts while achieving new state-of-the-art ASR performance. We use text-to-speech (TTS) systems to synthesize audio, which is fed into an ASR system to produce noisy hypotheses, which are then paired with the original texts to train the DLM. DLM has several $\textit{key ingredients}$: (i) up-scaled model and data; (ii) usage of multi-speaker TTS systems; (iii) combination of multiple noise augmentation strategies; and (iv) new decoding techniques. With a Transformer-CTC ASR, DLM achieves 1.5% word error rate (WER) on $\textit{test-clean}$ and 3.3% WER on $\textit{test-other}$ on LibriSpeech, which to our knowledge are the best reported numbers in the setting where no external audio data are used, and even match self-supervised methods which use external audio data. Furthermore, a single DLM is applicable to different ASRs and greatly surpasses the performance of conventional LM-based beam-search rescoring. These results indicate that properly investigated error correction models have the potential to replace conventional LMs, holding the key to a new level of accuracy in ASR systems.

Denoising Language Models for Superior ASR Error Correction

The paper presents a Denoising Language Model (DLM) that significantly advances the state of the art in automatic speech recognition (ASR) by directly addressing ASR errors. Unlike traditional language models (LMs), which are oblivious to the specific errors an ASR system makes, the DLM is trained on vast amounts of synthetic data to correct those errors, thereby improving ASR accuracy.

Key Contributions

  1. Error Correction with Synthetic Data: The DLM is trained on a large-scale synthetic dataset generated using text-to-speech (TTS) systems. Data creation involves several steps (see the sketch after this list):
    • Text from a language-model training corpus is converted into audio via TTS systems.
    • The audio is processed by an ASR system to generate noisy hypotheses.
    • These noisy hypotheses are paired with the original clean text for training the DLM.
  2. Model Composition: The DLM is characterized by:
    • Extensive dataset and model scaling.
    • The use of multi-speaker TTS to introduce audio variability.
    • The application of various noise augmentation techniques, such as frequency masking and random character substitutions.
    • Novel decoding strategies to enhance performance.
  3. Empirical Evidence: The model achieves remarkable results on the LibriSpeech benchmark:
    • 1.5% word error rate (WER) on the test-clean subset.
    • 3.3% WER on the test-other subset.

These results are achieved without any external audio data, and they even match methods that use self-supervised pretraining on additional audio.
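
The data pipeline in contribution 1 is simple enough to summarize in a few lines. The sketch below is illustrative rather than the paper's code: synthesize_speech and transcribe are hypothetical stand-ins for a multi-speaker TTS system and a trained ASR system.

    import random

    def make_dlm_training_pairs(corpus, speakers, synthesize_speech, transcribe):
        """Yield (noisy_hypothesis, clean_text) pairs for DLM training."""
        for clean_text in corpus:
            # Render the text as audio with a randomly chosen TTS speaker,
            # so the same sentence yields varied acoustic realizations.
            audio = synthesize_speech(clean_text, speaker=random.choice(speakers))
            # Transcribe the synthetic audio; the ASR system's mistakes are
            # exactly the "noise" the DLM learns to remove.
            noisy_hypothesis = transcribe(audio)
            # The noisy hypothesis is the model input; the clean text is the target.
            yield noisy_hypothesis, clean_text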

Universal Utility

The DLM demonstrates universal applicability, functioning effectively across different ASR architectures, including Transformer-CTC, QuartzNet, and Whisper models. Moreover, it remains robust when evaluated on out-of-domain datasets such as TED-LIUM.
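
These cross-system and cross-domain comparisons are all measured in word error rate (WER): the minimum number of word-level substitutions, insertions, and deletions needed to turn a hypothesis into the reference, divided by the number of reference words. A minimal reference implementation (ours, not the paper's) using standard Levenshtein dynamic programming:

    def word_error_rate(reference: str, hypothesis: str) -> float:
        """(substitutions + insertions + deletions) / reference length."""
        ref, hyp = reference.split(), hypothesis.split()
        # dp[i][j] = edit distance between ref[:i] and hyp[:j]
        dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
        for i in range(len(ref) + 1):
            dp[i][0] = i
        for j in range(len(hyp) + 1):
            dp[0][j] = j
        for i in range(1, len(ref) + 1):
            for j in range(1, len(hyp) + 1):
                cost = 0 if ref[i - 1] == hyp[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution / match
        return dp[len(ref)][len(hyp)] / max(len(ref), 1)

    # One substituted word in a four-word reference gives 25% WER.
    assert word_error_rate("the cat sat down", "the cat sad down") == 0.25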

Scalability

The paper explores various dimensions of scalability:

  • Model Size: Larger DLMs consistently outperform smaller counterparts, indicating the benefits of model scaling.
  • Text Corpus Size: Increasing the size of the training text corpus leads to performance gains, although the marginal benefit decreases at higher corpus sizes.
  • Speaker Variability: Training with audio from a larger number of speakers improves the DLM’s efficacy, bolstering its ability to correct diverse types of ASR errors.

Data Generation Strategies

Various data augmentation strategies are explored to fine-tune the training distribution to better reflect real-world scenarios:

  • Character Substitutions: Introducing random character substitutions in the training data helps the DLM learn to correct more varied errors (a minimal sketch follows this list).
  • Frequency Masking: Adding frequency masking consistent with SpecAugment techniques improves performance.
  • Mixing Real and Synthetic Data: The inclusion of real noisy hypotheses alongside synthetic data aids in anchoring the training distribution closer to reality.
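
As a concrete instance of the character-substitution augmentation above, the sketch below corrupts clean text at a fixed rate; the 5% rate and lowercase-only replacement alphabet are illustrative assumptions, not the paper's settings.

    import random
    import string

    def substitute_characters(text: str, rate: float = 0.05, rng=None) -> str:
        """Replace a random fraction of non-space characters with random
        lowercase letters (assumed alphabet; rate is illustrative)."""
        rng = rng or random.Random()
        out = []
        for ch in text:
            if ch != " " and rng.random() < rate:
                out.append(rng.choice(string.ascii_lowercase))
            else:
                out.append(ch)
        return "".join(out)

    # e.g. "the quick brown fox ..." -> "the quibk brown fox ..."
    print(substitute_characters("the quick brown fox jumps over the lazy dog"))

Mixing such synthetically corrupted text with genuine ASR hypotheses, as the last bullet notes, keeps the training inputs anchored to errors the ASR system actually makes.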

Comparison with High-Quality TTS

Interestingly, the quality of the TTS system used to generate synthetic data is not directly correlated with the performance improvements observed. For instance, although Tacotron produces higher-quality audio, the DLM trained on its outputs does not perform as well as those trained on noisier TTS outputs from YourTTS or RichTTS. This suggests that diversified and noisier training data might be more beneficial for error correction tasks.

Practical and Theoretical Implications

The practical implications of deploying DLMs include significant enhancements in ASR systems across various domains without the need for extensive real-world audio data. The ability of DLMs to generalize and apply to different ASR architectures makes them a versatile tool in the field of speech recognition.

Theoretically, this research underscores the potential of using error correction models over traditional LMs in ASR systems. By focusing on correcting the specific types of errors made by ASR systems, DLMs offer a deeper integration between acoustic and linguistic modeling.

Future Directions

Potential future directions for research include:

  • Variational Approaches: Exploring variational methods to refine the error correction mechanism.
  • Enhanced Noise Models: Developing more sophisticated noise models to generate synthetic training data that better capture real-world variability.
  • Larger and More Diverse Corpora: Scaling the text corpus to more varied domains to further enhance the generalization capability of DLMs.

Conclusion

The Denoising Language Model presented in this paper marks a significant advance in ASR performance. By leveraging synthetic data and carefully designed error correction techniques, the DLM achieves state-of-the-art results, demonstrating the potential of error correction models to surpass traditional language models in practical ASR applications. This work lays the groundwork for future exploration of model scalability, data augmentation strategies, and the utility of DLMs across diverse ASR tasks.

Authors (6)
  1. Zijin Gu (9 papers)
  2. Tatiana Likhomanenko (41 papers)
  3. He Bai (50 papers)
  4. Erik McDermott (9 papers)
  5. Ronan Collobert (55 papers)
  6. Navdeep Jaitly (67 papers)