A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
This paper presents a compelling advancement in Automatic Speech Recognition (ASR): a two-pass end-to-end (E2E) model that performs favorably against state-of-the-art conventional models in both quality and latency. The system pairs a first-pass Recurrent Neural Network Transducer (RNN-T) with a Listen, Attend, and Spell (LAS) rescorer, both running on-device and positioned to replace conventional server-side architectures.
Model Architecture and Training
The proposed architecture comprises a first-pass RNN-T, a streaming neural transducer that produces initial hypotheses, followed by a second-pass LAS rescorer that refines them. The model ingests acoustic frames as stacked log-mel filterbank energies, encoded by a multi-layer stack of Long Short-Term Memory (LSTM) units. The first pass emits word-piece tokens in a streaming fashion, yielding a lattice of hypotheses that the LAS decoder then rescores.
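The two-pass flow can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the interpolation weight `LAMBDA`, the toy hypotheses, and the scoring values are all assumptions made for the example.

```python
# Sketch of two-pass rescoring: the first pass (RNN-T) produces n-best
# hypotheses with log-probability scores; the second pass (LAS) rescores
# each one, and the final score interpolates the two passes.

LAMBDA = 0.5  # illustrative interpolation weight between the two passes

def rescore(nbest, las_score, lam=LAMBDA):
    """nbest: list of (hypothesis, rnnt_log_prob) pairs.
    las_score: callable mapping a hypothesis to a LAS log-probability.
    Returns the hypothesis with the best interpolated score."""
    best_hyp, best_score = None, float("-inf")
    for hyp, rnnt_lp in nbest:
        combined = (1 - lam) * rnnt_lp + lam * las_score(hyp)
        if combined > best_score:
            best_hyp, best_score = hyp, combined
    return best_hyp, best_score

# Toy example: the LAS pass prefers the second hypothesis strongly
# enough to overturn the first-pass ranking.
nbest = [("play the song", -3.0), ("play this song", -3.2)]
las = {"play the song": -4.0, "play this song": -2.0}.__getitem__
hyp, score = rescore(nbest, las)
```

The key design point is that the streaming first pass commits nothing final; the rescorer only reorders a small lattice, keeping second-pass cost low.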
Training employs a large and diverse corpus spanning voice search, telephony, and YouTube sources, introducing both acoustic and linguistic diversity to the E2E model. To reinforce robustness, the training data includes speech from locales with accented English, capturing varied pronunciations directly from data rather than through traditional lexicon-based methods. A constant learning rate schedule is explored as an alternative to a decaying one, supported by empirical findings that suggest improved convergence on large data sets.
Quality and Latency Improvements
The results showcase significant gains over prior methodologies. Training on multi-domain data improves performance on core test sets, reducing Word Error Rate (WER) by about 8% relative to the baseline. Moreover, including accented data reduces WER on the corresponding test sets, illustrating the model's ability to generalize across diverse speech patterns.
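For clarity on how "8% relative" differs from an absolute reduction, the metric can be computed as follows; the baseline and improved WER values used in the example are illustrative, not figures from the paper.

```python
def wer(errors, ref_words):
    """Word Error Rate: (substitutions + insertions + deletions)
    divided by the number of words in the reference transcript."""
    return errors / ref_words

def relative_wer_reduction(baseline_wer, improved_wer):
    """Relative reduction: e.g. going from 10.0% to 9.2% WER is an
    0.8-point absolute drop but an 8% relative improvement."""
    return (baseline_wer - improved_wer) / baseline_wer
```

Reporting relative reductions is standard in ASR because a fixed absolute drop means very different things at 30% WER versus 5% WER.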
The paper makes strides in latency optimization, a pivotal metric in real-time ASR, by integrating the end-of-query decision directly into the RNN-T model, thus unifying recognition and endpointing. This yields low endpointer (EP) latency, reducing 90th-percentile endpointer latency (EP90) to levels competitive with traditional server-side models without compromising recognition quality.
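As a small illustration of the EP90 metric itself (not the paper's measurement pipeline), the 90th-percentile latency over a set of per-utterance endpoint delays can be computed with the nearest-rank method; the sample latencies below are invented for the example.

```python
import math

def ep90(latencies_ms):
    """90th-percentile endpointer latency (nearest-rank method):
    the smallest value such that at least 90% of utterances
    reached their endpoint decision at or below it."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.9 * len(ordered)) - 1)
    return ordered[rank]

# Ten invented per-utterance endpoint delays, in milliseconds.
sample = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
```

A tail percentile is used rather than the mean because users perceive the slowest responses most acutely; a system can have a good average latency and still feel sluggish if its tail is long.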
Implications and Future Directions
The implications of this research are manifold. E2E models of this nature bolster the case for on-device ASR solutions, offering a strong balance of computational efficiency, privacy preservation, and reduced latency, all critical for user-centric applications. Furthermore, matching or surpassing conventional systems on diverse accents and multi-domain tasks marks a significant milestone in ASR research.
Looking forward, incremental improvements in model size and computational efficiency, possibly through further exploration of quantization techniques or more advanced learning rate schedules, could see such models deployed across a broader spectrum of devices. This could usher in an era of ubiquitous on-device ASR, enabled by reduced dependence on server infrastructure. Subsequent research could also explore end-to-end learning dynamics, potentially integrating unsupervised learning to boost performance in data-scarce environments.
This paper serves as a pivotal reference in the evolution of ASR architecture, blending advancements in neural network designs with practical deployment considerations to deliver marked improvements in ASR capabilities.