Denoising LMs for Superior ASR Error Correction Performance
The paper presents a Denoising LM (DLM), an error correction model that substantially advances the state of the art in automatic speech recognition (ASR). Unlike traditional language models (LMs), which are oblivious to the specific errors an ASR system makes, the DLM is trained on vast amounts of synthetic data to correct those errors directly, thereby improving ASR accuracy.
Key Contributions
- Error Correction with Synthetic Data: The DLM is trained on a large-scale synthetic dataset generated with text-to-speech (TTS) systems (a minimal pipeline sketch follows this list). Data creation involves several steps:
- Text from a language-model training corpus is converted into audio via TTS.
- The audio is processed by an ASR system to generate noisy hypotheses.
- These noisy hypotheses are paired with the original clean text for training the DLM.
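The following is a minimal sketch of this pipeline, assuming generic `tts` and `asr` callables as stand-ins for whatever systems are actually used; it is an illustration of the data-generation recipe, not the paper's implementation.

```python
from typing import Callable, Iterable, List, Tuple

def generate_dlm_pairs(
    texts: Iterable[str],
    tts: Callable[[str], bytes],    # text -> synthesized audio
    asr: Callable[[bytes], str],    # audio -> (possibly erroneous) transcript
) -> List[Tuple[str, str]]:
    """Build (noisy hypothesis, clean text) training pairs for the DLM."""
    pairs = []
    for clean_text in texts:
        audio = tts(clean_text)          # synthesize speech from corpus text
        noisy_hypothesis = asr(audio)    # run the ASR system to obtain errors
        pairs.append((noisy_hypothesis, clean_text))
    return pairs

# Trivial stand-ins so the sketch runs end to end:
if __name__ == "__main__":
    fake_tts = lambda text: text.encode()            # pretend "audio"
    fake_asr = lambda audio: audio.decode().lower()  # pretend transcription
    print(generate_dlm_pairs(["HELLO WORLD"], fake_tts, fake_asr))
```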
- Model Composition: The DLM is characterized by:
- Extensive dataset and model scaling.
- The use of multi-speaker TTS to introduce audio variability.
- The application of various noise augmentation techniques, such as frequency masking and random character substitutions.
- Novel decoding strategies to enhance performance (a generic rescoring sketch follows this list).
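The paper's exact decoding techniques are not reproduced here; as a hedged illustration of the general idea of combining ASR and correction-model evidence, the sketch below rescores an n-best list by interpolating the two scores with a weight `lam` (an assumed hyperparameter, not a value from the paper).

```python
from typing import Callable, List, Optional, Tuple

def rescore_nbest(
    nbest: List[Tuple[str, float]],     # (candidate text, ASR log-prob)
    dlm_score: Callable[[str], float],  # candidate -> correction-model log-prob
    lam: float = 0.5,                   # assumed interpolation weight
) -> Optional[str]:
    """Return the candidate maximizing (1 - lam) * ASR score + lam * DLM score."""
    best_text, best_score = None, float("-inf")
    for text, asr_logprob in nbest:
        score = (1.0 - lam) * asr_logprob + lam * dlm_score(text)
        if score > best_score:
            best_text, best_score = text, score
    return best_text
```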
- Empirical Evidence: The model achieves remarkable results on the LibriSpeech benchmark (a minimal word error rate computation sketch follows this list):
- 1.5% word error rate (WER) on the test-clean subset.
- 3.3% WER on the test-other subset.
These results are achieved without any external audio data, and they match those of methods that use self-supervised pretraining on additional audio.
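For reference, word error rate is the word-level edit distance between hypothesis and reference, normalized by the reference length. A self-contained sketch:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = dp[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

print(wer("the cat sat", "the cat sad"))  # 1 substitution / 3 words ≈ 0.33
```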
Universal Utility
The DLM demonstrates universal applicability, functioning effectively across different ASR architectures, including Transformer-CTC, QuartzNet, and Whisper models. Moreover, it shows robust performance even when evaluated on out-of-domain datasets such as TED-LIUM.
Scalability
The paper explores various dimensions of scalability:
- Model Size: Larger DLMs consistently outperform smaller counterparts, indicating the benefits of model scaling.
- Text Corpus Size: Increasing the size of the training text corpus leads to performance gains, although the marginal benefit decreases at higher corpus sizes.
- Speaker Variability: Training with audio from a larger number of speakers improves the DLM’s efficacy, bolstering its ability to correct diverse types of ASR errors.
Data Generation Strategies
Various data augmentation strategies are explored to bring the training distribution closer to the errors a real ASR system produces (a small augmentation sketch follows this list):
- Character Substitutions: Introducing random character substitutions in the training data helps the DLM learn to correct more varied errors.
- Frequency Masking: Adding frequency masking consistent with SpecAugment techniques improves performance.
- Mixing Real and Synthetic Data: The inclusion of real noisy hypotheses alongside synthetic data aids in anchoring the training distribution closer to reality.
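As an illustration of the first two augmentations, the sketch below applies SpecAugment-style frequency masking to a spectrogram and random character substitutions to a transcript; the mask widths, mask counts, and substitution rate are assumed defaults, not the paper's settings.

```python
import random
import string

import numpy as np

def freq_mask(spec: np.ndarray, max_width: int = 10, n_masks: int = 2) -> np.ndarray:
    """Zero out random frequency bands of a (n_freq_bins, n_frames) spectrogram."""
    out = spec.copy()
    for _ in range(n_masks):
        width = np.random.randint(0, max_width + 1)
        start = np.random.randint(0, max(1, out.shape[0] - width))
        out[start:start + width, :] = 0.0
    return out

def substitute_chars(text: str, p: float = 0.02) -> str:
    """Replace each character with a random lowercase letter/space with probability p."""
    alphabet = string.ascii_lowercase + " "
    return "".join(
        random.choice(alphabet) if random.random() < p else c for c in text
    )

# Example usage:
augmented_spec = freq_mask(np.random.rand(80, 100))
augmented_text = substitute_chars("the quick brown fox", p=0.1)
```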
Comparison with High-Quality TTS
Interestingly, the quality of the TTS system used to generate synthetic data is not directly correlated with the performance improvements observed. For instance, although Tacotron produces higher-quality audio, the DLM trained on its outputs does not perform as well as those trained on noisier TTS outputs from YourTTS or RichTTS. This suggests that diversified and noisier training data might be more beneficial for error correction tasks.
Practical and Theoretical Implications
The practical implications of deploying DLMs include significant enhancements in ASR systems across various domains without the need for extensive real-world audio data. The ability of DLMs to generalize and apply to different ASR architectures makes them a versatile tool in the field of speech recognition.
Theoretically, this research underscores the potential of using error correction models over traditional LMs in ASR systems. By focusing on correcting the specific types of errors made by ASR systems, DLMs offer a deeper integration between acoustic and linguistic modeling.
Future Directions
Potential future directions for research include:
- Variational Approaches: Exploring variational methods to refine the error correction mechanism.
- Enhanced Noise Models: Developing more sophisticated noise models to generate synthetic training data that better capture real-world variability.
- Larger and More Diverse Corpora: Scaling up the text corpus to more varied domains to further enhance the generalization capability of DLMs.
Conclusion
The Denoising LM presented in this paper marks a significant advancement in ASR performance. By leveraging synthetic data and targeted error correction, the DLM achieves state-of-the-art results, demonstrating the potential of error correction models to surpass traditional LMs in practical ASR applications. This work lays the groundwork for future exploration of model scalability, data augmentation strategies, and the utility of DLMs in diverse ASR tasks.