Introduction
Large language models (LLMs) have demonstrated strong capabilities across a wide range of natural language processing tasks. This progress has spurred research into leveraging LLMs for Automatic Speech Recognition (ASR), particularly for recognition error correction via Generative Error Correction (GER). While GER has shown promise in improving recognition results by fine-tuning LLMs on the N-best hypotheses produced by ASR decoding, performance in noisy environments, a common real-world challenge, has received little attention. Against this backdrop, this paper addresses the gap by extending the GER benchmark to noisy conditions and introducing the Robust HyPoradise (RobustHP) dataset.
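For readers unfamiliar with the GER setup, the sketch below shows one plausible way to pack N-best hypotheses into an instruction-style fine-tuning example. The prompt template and field names are illustrative assumptions, not the paper's exact format.

```python
# Illustrative sketch: building a GER fine-tuning example from ASR N-best hypotheses.
# The prompt template and field names are assumptions, not the paper's exact format.

def build_ger_example(nbest_hypotheses, ground_truth=None):
    """Pack N-best ASR hypotheses into an instruction-style prompt for LLM fine-tuning."""
    hyp_lines = "\n".join(
        f"hypothesis {i + 1}: {hyp}" for i, hyp in enumerate(nbest_hypotheses)
    )
    prompt = (
        "Below are the N-best transcriptions of an utterance from a speech recognizer. "
        "Report the most likely true transcription.\n"
        f"{hyp_lines}\n"
        "true transcription:"
    )
    # During fine-tuning, the target is the ground-truth transcription;
    # at inference time the LLM generates the correction itself.
    return {"prompt": prompt, "target": ground_truth}


example = build_ger_example(
    ["turn off the lights in the kitchen", "turn of the light in the kitchen"],
    ground_truth="turn off the lights in the kitchen",
)
print(example["prompt"])
```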
Methodology
The authors tackle noise-robust GER by extracting a noise embedding in language space from the N-best hypotheses. The key insight is that more adverse noise conditions yield greater diversity among the N-best hypotheses, and this diversity can be encoded as a noise embedding to guide the denoising process. To strengthen its representational capacity, the paper further proposes a knowledge distillation (KD) strategy based on mutual information estimation (MIE) that distills real noise information from audio embeddings into the language-space noise embedding.
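The following sketch illustrates both ingredients: a language-space noise embedding built from the diversity of hypothesis embeddings, and a MINE-style critic whose mutual-information lower bound can drive the KD objective. It assumes PyTorch and a generic sentence encoder; the paper's exact embedding construction and estimator may differ.

```python
# Sketch under assumptions (PyTorch, generic sentence encoder); not the paper's exact implementation.
import itertools
import torch
import torch.nn as nn


def language_noise_embedding(hyp_embeddings: torch.Tensor) -> torch.Tensor:
    """Aggregate pairwise differences between N-best hypothesis embeddings.

    hyp_embeddings: (N, D) sentence embeddings of the N-best hypotheses.
    Intuition: noisier audio -> more diverse hypotheses -> larger pairwise differences.
    """
    diffs = [hyp_embeddings[i] - hyp_embeddings[j]
             for i, j in itertools.combinations(range(hyp_embeddings.size(0)), 2)]
    return torch.stack(diffs).mean(dim=0)  # (D,) utterance-level noise embedding


class MINECritic(nn.Module):
    """Small critic network for a MINE-style (Donsker-Varadhan) MI lower bound."""

    def __init__(self, dim_lang: int, dim_audio: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim_lang + dim_audio, hidden), nn.ReLU(), nn.Linear(hidden, 1)
        )

    def forward(self, lang_emb, audio_emb):
        return self.net(torch.cat([lang_emb, audio_emb], dim=-1))


def mi_lower_bound(critic, lang_emb, audio_emb):
    """Estimate I(language noise emb; audio noise emb) over a batch.

    Maximizing this bound during KD encourages the language-space embedding
    to carry the noise information present in the audio embedding.
    """
    joint = critic(lang_emb, audio_emb).mean()
    # Shuffle the audio embeddings to obtain samples from the product of marginals.
    marginal = critic(lang_emb, audio_emb[torch.randperm(audio_emb.size(0))])
    return joint - torch.log(torch.exp(marginal).mean() + 1e-8)
```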
Experimental Results
Using recent LLMs, including LLaMA-2, LLaMA, and Falcon, the proposed approach, termed RobustGER, achieves significant performance improvements, with up to a 53.9% reduction in word error rate (WER) on the RobustHP test sets. Ablation studies further examine the relative contributions of utterance-level and token-level information within the noise embedding, corroborating the essential role of the latter for effective denoising in GER.
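Since the results are reported in WER, a minimal reference for the metric itself may be useful: word-level edit distance (substitutions, deletions, insertions) divided by the reference length. This is the standard definition, not code from the paper.

```python
# Standard WER: word-level edit distance normalized by the reference length.

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between the first i reference words and first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


print(word_error_rate("turn off the lights", "turn of the light"))  # 0.5
```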
Analysis
A closer examination reveals that while the abstracted language embedding can represent certain noise types adequately, others remain entangled with clean speech representations. The KD technique improves noise distinguishability, yielding a more noise-representative embedding and better WER. Moreover, data efficiency is demonstrated by sustained GER performance under substantial reductions in training data volume, highlighting the robustness and generalizability of RobustGER. Finally, case studies illustrate the method's ability to correct transcription errors that carry significant semantic weight.
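As a rough illustration of the kind of probe that could support the distinguishability claim above, one could score how well noise embeddings cluster by noise type before and after KD. The snippet below uses scikit-learn's silhouette score and t-SNE; it is an assumption for illustration, not the paper's actual analysis code.

```python
# Illustrative probe (an assumption, not the paper's analysis code):
# quantify how well noise embeddings separate by noise type, before vs. after KD.
import numpy as np
from sklearn.manifold import TSNE
from sklearn.metrics import silhouette_score


def separability_report(embeddings: np.ndarray, noise_labels: np.ndarray):
    """embeddings: (num_utterances, dim); noise_labels: integer noise-type ids."""
    # Higher silhouette score -> noise types are more distinguishable in embedding space.
    score = silhouette_score(embeddings, noise_labels)
    # 2-D projection for visual inspection of the clusters.
    projection = TSNE(n_components=2, perplexity=30, random_state=0).fit_transform(embeddings)
    return score, projection

# Comparing the score of the raw language-space embedding against the KD-refined one
# would quantify the improved noise distinguishability described above.
```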
Conclusion
This paper extends the utility of GER for ASR to noisy conditions through a noise-aware correction method. By introducing a language-space noise embedding and refining it via KD from audio embeddings, the proposed method not only represents audio noise more effectively but also instructs LLMs efficiently, advancing the state of GER in noisy environments without heavy dependence on training data. This work paves the way for practical ASR systems that are robust to real-world acoustic disturbances, and the open-sourced code invites further enhancements and adaptations within the speech processing community.