Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction

Published 6 Jun 2025 in cs.CL, cs.AI, and cs.IR | (2506.06117v1)

Abstract: End-to-end (E2E) Automatic Speech Recognition (ASR) models are trained using paired audio-text samples that are expensive to obtain, since high-quality ground-truth data requires human annotators. Voice search applications, such as digital media players, leverage ASR to allow users to search by voice as opposed to an on-screen keyboard. However, recent or infrequent movie titles may not be sufficiently represented in the E2E ASR system's training data, and hence, may suffer poor recognition. In this paper, we propose a phonetic correction system that consists of (a) a phonetic search based on the ASR model's output that generates phonetic alternatives that may not be considered by the E2E system, and (b) a rescorer component that combines the ASR model recognition and the phonetic alternatives and selects a final system output. We find that our approach improves word error rate by between 4.4% and 7.6% relative on benchmarks of popular movie titles over a series of competitive baselines.

Summary

  • The paper introduces a phonetically-augmented discriminative rescoring method that enhances ASR accuracy for voice search by reducing errors in recognizing entity names.
  • It combines an HMM-driven phonetic search with a rescoring step that integrates phoneme-based language models to optimize transcription selection.
  • Empirical results show a 4.4% to 7.6% relative reduction in word error rate on movie queries, demonstrating clear advantages over traditional correction methods.

Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction

The paper "Phonetically-Augmented Discriminative Rescoring for Voice Search Error Correction" introduces a novel strategy for improving Automatic Speech Recognition (ASR) performance in voice search applications, specifically addressing the challenge of accurately recognizing and transcribing entity names such as movie titles. The authors propose an innovative phonetic correction system that augments the discriminative rescoring process by leveraging a phonetic search approach alongside the existing ASR model's outputs.

Overview of the Proposed Methodology

The central mechanism of the proposed system comprises two main components: a phonetic search for alternative transcriptions and a rescoring step that selects the most accurate transcription. After the initial ASR pass, a phonetic transcription is derived from the decoded speech and used to search for phonetic alternatives across a lexicon by employing a Hidden Markov Model (HMM). This phonetic search is driven both by acoustic similarity to the original ASR output and by the likelihood of the word sequences under a phoneme-based language model (LM).
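
To make the search step concrete, the following is a minimal sketch rather than the paper's implementation: it assumes a lexicon mapping entity names (e.g., movie titles) to phoneme sequences and a caller-supplied phoneme-LM scoring function, and it uses phoneme edit distance as a stand-in for the HMM-based phonetic similarity.

```python
from typing import Callable, Dict, List, Tuple

def phoneme_edit_distance(a: List[str], b: List[str]) -> int:
    # Levenshtein distance over phoneme sequences; a crude stand-in for the
    # HMM-based phonetic similarity described in the paper.
    dp = list(range(len(b) + 1))
    for i, pa in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, pb in enumerate(b, 1):
            cur = min(dp[j] + 1,            # deletion
                      dp[j - 1] + 1,        # insertion
                      prev + (pa != pb))    # substitution (0 if phonemes match)
            prev, dp[j] = dp[j], cur
    return dp[-1]

def phonetic_search(asr_phonemes: List[str],
                    lexicon: Dict[str, List[str]],
                    phoneme_lm_logprob: Callable[[List[str]], float],
                    top_k: int = 5,
                    lm_weight: float = 0.5) -> List[Tuple[str, float]]:
    # Score every lexicon entry (e.g., a movie title) by its phonetic distance
    # to the ASR output plus a phoneme-LM penalty; lower cost is better.
    scored = []
    for title, title_phonemes in lexicon.items():
        cost = phoneme_edit_distance(asr_phonemes, title_phonemes)
        cost -= lm_weight * phoneme_lm_logprob(title_phonemes)
        scored.append((title, cost))
    return sorted(scored, key=lambda pair: pair[1])[:top_k]
```

In the actual system the lexicon search and similarity scoring are carried out with an HMM rather than plain edit distance, but the interface is analogous: phonemes in, ranked textual alternatives out.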

Subsequently, the phonetic alternatives are combined with the original ASR output using a discriminative rescoring model. This rescorer integrates several features, including phonetic and acoustic alignment costs and language model probabilities, to score each hypothesis in the combined N-best list. This approach notably diverges from traditional token-to-token (T2T) correction models that require extensive data and retraining, thereby providing practical advantages in industrial applications, such as those involving embedded virtual assistants.
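
As an illustration of the rescoring step, here is a minimal sketch of a linear discriminative rescorer; the feature names, weights, and numbers are illustrative assumptions, not the paper's exact feature set or training procedure.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Hypothesis:
    text: str
    features: Dict[str, float]  # all on a "higher is better" log-score scale

def rescore(hypotheses: List[Hypothesis], weights: Dict[str, float]) -> Hypothesis:
    # Combine per-hypothesis features with learned weights and return the
    # highest-scoring candidate from the merged N-best list.
    def score(h: Hypothesis) -> float:
        return sum(weights.get(name, 0.0) * value
                   for name, value in h.features.items())
    return max(hypotheses, key=score)

# Hypothetical usage: the ASR 1-best ("openhammer trailer") competes with a
# phonetic alternative; the alternative wins once phonetic and LM features count.
nbest = [
    Hypothesis("oppenheimer trailer",
               {"asr_logprob": -2.4, "phonetic_sim": 0.0, "lm_logprob": -9.1}),
    Hypothesis("openhammer trailer",
               {"asr_logprob": -2.1, "phonetic_sim": -1.8, "lm_logprob": -15.3}),
]
weights = {"asr_logprob": 1.0, "phonetic_sim": 0.8, "lm_logprob": 0.3}
print(rescore(nbest, weights).text)  # -> "oppenheimer trailer"
```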

Numerical Results

The empirical results demonstrate meaningful improvements in word error rate (WER) on targeted voice search tasks, with relative WER reductions ranging from 4.4% to 7.6% on movie-related queries when compared to state-of-the-art second-pass baselines that use only language model rescoring or acoustic model fusion. This indicates that phonetic correction effectively addresses recognition errors involving entity names that are absent or underrepresented in the training data.
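
For clarity on how these figures are read, relative WER reduction measures the fraction of the baseline's error that is removed, not an absolute drop in WER points. The snippet below uses made-up WER values (not figures from the paper) purely to show the computation.

```python
def relative_wer_reduction(baseline_wer: float, system_wer: float) -> float:
    # Fraction of the baseline word error rate removed by the new system.
    return (baseline_wer - system_wer) / baseline_wer

# Hypothetical example: a baseline WER of 10.0% dropping to 9.24% is a
# 7.6% relative reduction, even though the absolute drop is only 0.76 points.
print(f"{relative_wer_reduction(10.0, 9.24):.1%}")
```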

Implications and Future Directions

The paper's findings underscore the viability of phonetic correction and discriminative rescoring for improving ASR robustness when training data is limited or incomplete, particularly for emerging or infrequent entities such as recent movie titles. Because the correction components are decoupled from the main ASR model, new linguistic data can be incorporated through simple updates rather than full-scale retraining, which is crucial for adaptive systems in dynamic content environments.

Future research directions may explore enhanced phonetic search algorithms, for instance by integrating more nuanced models of phonetic similarity or by leveraging advances in unsupervised learning to reduce the supervision needed to train the rescorer. Additionally, widening the scope of the correction system to handle multilingual or code-switched input could expand its applicability to a broader range of user scenarios.

In conclusion, this paper contributes a substantial advancement to the domain of speech recognition, offering a compelling method for overcoming typical data limitations in end-to-end ASR frameworks, with promising implications for improving the accuracy and efficiency of voice-driven search technologies.
