Adaptive Contextual Biasing for Streaming Speech Recognition
Speech recognition systems have increasingly adopted end-to-end (E2E) frameworks to achieve high accuracy and efficiency. This paper explores an approach to improving the recognition of rare and personalized words using contextual information. The authors propose an adaptive contextual biasing technique for Transducer models, addressing a key challenge in real-world voice assistant applications: the degradation of recognition accuracy on common words caused by over-biasing towards personalized vocabulary.
The proposed method utilizes a Context-Aware Transformer Transducer (CATT) model, which employs biased encoder and predictor embeddings to predict the occurrence of contextual phrases. A distinctive feature of this method is its ability to dynamically toggle the contextual bias list on and off based on the predicted presence of contextual words, thereby adapting effectively to both personalized and generic speech recognition scenarios.
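The cross-attention fusion at the heart of CATT-style biasing can be sketched as follows: model states attend over an embedding per catalog phrase, and the attended context vector is fused back into the states. This is a minimal single-head illustration, not the paper's implementation; all names and dimensions are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bias_attention(queries, phrase_embs):
    """Attention of audio (encoder) or label (predictor) states over
    bias-phrase embeddings.

    queries:     (T, d) per-frame or per-label model states
    phrase_embs: (K, d) one embedding per catalog phrase
    Returns a (T, d) context vector to fuse (e.g. concatenate) with the states.
    """
    d = queries.shape[-1]
    scores = queries @ phrase_embs.T / np.sqrt(d)  # (T, K) scaled dot products
    attn = softmax(scores, axis=-1)                # attention weights over phrases
    return attn @ phrase_embs                      # (T, d) attended context

# Toy usage with random states and a three-phrase catalog.
rng = np.random.default_rng(0)
q = rng.standard_normal((5, 8))
p = rng.standard_normal((3, 8))
ctx = bias_attention(q, p)
print(ctx.shape)  # (5, 8)
```

In the full model this attention would run per head with learned query/key/value projections; the sketch keeps only the scaled dot-product core.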
Key Contributions
- Adaptive Contextual Biasing:
  - The paper introduces an Entity Detector (ED) module that uses multi-head attention to predict whether any contextual phrase occurs in the utterance. This enables dynamic adjustment of the bias list, keeping the model adaptable across different recognition contexts.
- Mitigation of Common-Case Degradation:
  - By filtering out irrelevant contextual phrases, the approach reduces the accuracy penalty that biasing otherwise imposes on frequently occurring words.
- Implementation Variants:
  - Two placements of the ED module are explored: Predictor-based ED (P-ED) and Encoder-Predictor-based ED (EP-ED). They differ mainly in complexity and in whether they condition on predictor states alone or on both encoder and predictor states.
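The detector-gated biasing described in the contributions above can be sketched as a simple on/off gate: score whether any catalog phrase is present, and zero out the biasing context when it is not. This is an illustrative sketch, not the paper's architecture; the detector weight `w_det` and the threshold are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_bias_context(state, phrase_embs, w_det, threshold=0.5):
    """Entity-detector gating, sketched.

    state:       (d,)   a model state; predictor-only for P-ED, or a fused
                        encoder+predictor state for EP-ED
    phrase_embs: (K, d) bias-phrase embeddings
    w_det:       (2d,)  illustrative linear detector over [state, context]
    Returns (context, p_entity); the context is zeroed when the detector
    predicts no catalog phrase is present, falling back to generic decoding.
    """
    d = state.shape[-1]
    scores = state @ phrase_embs.T / np.sqrt(d)  # (K,) attention logits
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                           # softmax over phrases
    ctx = attn @ phrase_embs                     # (d,) attended context
    p_entity = sigmoid(float(np.concatenate([state, ctx]) @ w_det))
    if p_entity < threshold:
        ctx = np.zeros_like(ctx)                 # bias list switched off
    return ctx, p_entity
```

With a zero detector weight, `p_entity` is exactly 0.5, which makes the gate's behavior easy to test around the threshold.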
Experimental Results
Significant improvements are reported in both personalized and common test scenarios on LibriSpeech and an internal voice assistant dataset. The proposed method achieves:
- Up to 6.7% and 20.7% relative reduction in Word Error Rate (WER) on LibriSpeech and the voice assistant dataset, respectively.
- Up to 96.7% and 84.9% reduction of the WER degradation on common (non-entity) cases, indicating that the method largely suppresses false alarms in which ordinary words are misrecognized as entity names.
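For readers interpreting the figures above, "relative reduction" is measured against the baseline WER. A one-line helper makes the convention explicit; the example numbers are illustrative, not operating points from the paper.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Illustrative only: a drop from 10.0% to 9.33% absolute WER
# corresponds to a 6.7% relative reduction.
print(round(relative_reduction(10.0, 9.33), 1))  # 6.7
```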
Practical and Theoretical Implications
The practical implications of this research lie in its potential to improve the real-world applicability of speech recognition systems, especially in settings where personalization matters but must not compromise general recognition. Theoretically, this work contributes to the understanding of dynamic biasing in neural network-based ASR, striking a balance between specificity and generalization.
Future Directions
The paper highlights avenues for further investigation, including the exploration of more sophisticated attention mechanisms within the ED module and the evaluation of the approach across additional languages and dialects. Future work could also delve into the optimization of the bias list and its integration with unsupervised learning techniques to adaptively learn context without predefined lists.
In summary, this paper presents a methodologically sound approach to addressing contextual biasing in speech recognition, demonstrating substantial improvements while maintaining real-time inference capability. This makes it a beneficial advancement for deploying adaptive speech recognition systems in diverse application settings.