Adaptive Contextual Biasing for Streaming Speech Recognition
Speech recognition systems have increasingly adopted end-to-end (E2E) frameworks to achieve high accuracy and efficiency. This paper explores an approach to improving the recognition of rare and personalized words using contextual information. The authors propose an adaptive contextual biasing technique for Transducer models, addressing a key challenge in real-world voice assistant applications: the degradation of recognition accuracy on common words caused by over-biasing towards personalized vocabulary.
The proposed method utilizes a Context-Aware Transformer Transducer (CATT) model, which employs biased encoder and predictor embeddings to predict the occurrence of contextual phrases. A distinctive feature of this method is its ability to dynamically toggle the contextual bias list on and off based on the predicted presence of contextual words, thereby adapting effectively to both personalized and generic speech recognition scenarios.
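The cross-attention fusion at the heart of CATT-style biasing can be sketched as follows: model states attend over an embedding per catalog phrase, and the attended context vector is fused back into the states. This is a minimal single-head illustration, not the paper's implementation; all names and dimensions are assumptions for the sketch.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def bias_attention(queries, phrase_embs):
    """Attention of audio (encoder) or label (predictor) states over
    bias-phrase embeddings.

    queries:     (T, d) per-frame or per-label model states
    phrase_embs: (K, d) one embedding per catalog phrase
    Returns a (T, d) context vector to fuse (e.g. concatenate) with the states.
    """
    d = queries.shape[-1]
    scores = queries @ phrase_embs.T / np.sqrt(d)  # (T, K) scaled dot products
    attn = softmax(scores, axis=-1)                # attention weights over phrases
    return attn @ phrase_embs                      # (T, d) attended context

# Toy usage with random states and a three-phrase catalog.
rng = np.random.default_rng(0)
q = rng.standard_normal((5, 8))
p = rng.standard_normal((3, 8))
ctx = bias_attention(q, p)
print(ctx.shape)  # (5, 8)
```

In the full model this attention would run per head with learned query/key/value projections; the sketch keeps only the scaled dot-product core.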
Key Contributions
- Adaptive Contextual Biasing:
  - The paper introduces an Entity Detector (ED) module that uses multi-head attention to predict whether any contextual phrase occurs in the utterance. This enables dynamic adjustment of the bias list, keeping the model adaptable across different recognition contexts.
- Mitigation of Common-Case Degradation:
  - By filtering out irrelevant contextual phrases, the approach reduces the accuracy penalty that biasing otherwise imposes on frequently occurring words.
- Implementation Variants:
  - Two placements of the ED module are explored: Predictor-based ED (P-ED) and Encoder-Predictor-based ED (EP-ED). They differ mainly in complexity and in whether they condition on predictor states alone or on both encoder and predictor states.
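The detector-gated biasing described in the contributions above can be sketched as a simple on/off gate: score whether any catalog phrase is present, and zero out the biasing context when it is not. This is an illustrative sketch, not the paper's architecture; the detector weight `w_det` and the threshold are assumptions.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_bias_context(state, phrase_embs, w_det, threshold=0.5):
    """Entity-detector gating, sketched.

    state:       (d,)   a model state; predictor-only for P-ED, or a fused
                        encoder+predictor state for EP-ED
    phrase_embs: (K, d) bias-phrase embeddings
    w_det:       (2d,)  illustrative linear detector over [state, context]
    Returns (context, p_entity); the context is zeroed when the detector
    predicts no catalog phrase is present, falling back to generic decoding.
    """
    d = state.shape[-1]
    scores = state @ phrase_embs.T / np.sqrt(d)  # (K,) attention logits
    attn = np.exp(scores - scores.max())
    attn /= attn.sum()                           # softmax over phrases
    ctx = attn @ phrase_embs                     # (d,) attended context
    p_entity = sigmoid(float(np.concatenate([state, ctx]) @ w_det))
    if p_entity < threshold:
        ctx = np.zeros_like(ctx)                 # bias list switched off
    return ctx, p_entity
```

With a zero detector weight, `p_entity` is exactly 0.5, which makes the gate's behavior easy to test around the threshold.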
Experimental Results
Significant improvements are reported in both personalized and common test scenarios on LibriSpeech and an internal voice assistant dataset. The proposed method achieves:
- Up to 6.7% and 20.7% relative reduction in Word Error Rate (WER) on LibriSpeech and the voice assistant dataset, respectively.
- Up to 96.7% and 84.9% reduction of the WER degradation on common (non-entity) cases, indicating that the method largely suppresses false alarms in which ordinary words are misrecognized as entity names.
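For readers interpreting the figures above, "relative reduction" is measured against the baseline WER. A one-line helper makes the convention explicit; the example numbers are illustrative, not operating points from the paper.

```python
def relative_reduction(baseline_wer, new_wer):
    """Relative WER reduction in percent: 100 * (baseline - new) / baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Illustrative only: a drop from 10.0% to 9.33% absolute WER
# corresponds to a 6.7% relative reduction.
print(round(relative_reduction(10.0, 9.33), 1))  # 6.7
```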
Practical and Theoretical Implications
The practical implications of this research lie in its potential to improve the real-world applicability of speech recognition systems, especially in settings where personalization matters but must not compromise general recognition. Theoretically, this work contributes to the understanding of dynamic biasing in neural network-based ASR, striking a balance between specificity and generalization.
Future Directions
The paper highlights avenues for further investigation, including the exploration of more sophisticated attention mechanisms within the ED module and the evaluation of the approach across additional languages and dialects. Future work could also delve into the optimization of the bias list and its integration with unsupervised learning techniques to adaptively learn context without predefined lists.
In summary, this paper presents a methodologically sound approach to addressing contextual biasing in speech recognition, demonstrating substantial improvements while maintaining real-time inference capability. This makes it a beneficial advancement for deploying adaptive speech recognition systems in diverse application settings.