A Streaming On-Device End-to-End Model Surpassing Server-Side Conventional Model Quality and Latency
This paper presents a compelling advancement in Automatic Speech Recognition (ASR): a two-pass end-to-end (E2E) model that performs favorably against state-of-the-art conventional models in both quality and latency. The system pairs a first-pass Recurrent Neural Network Transducer (RNN-T) with a Listen, Attend, and Spell (LAS) rescorer, both running on-device and positioned to replace conventional server-side architectures.
Model Architecture and Training
The proposed architecture comprises a first-pass RNN-T, a streaming neural transducer that produces initial hypotheses, followed by a second-pass LAS rescorer that refines them. The model ingests acoustic frames as stacked log-mel filterbank energies, encoded by a multi-layer stack of Long Short-Term Memory (LSTM) units. The first pass emits word-piece tokens in a streaming fashion, yielding a lattice of hypotheses that the LAS decoder then rescores.
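The two-pass flow can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the interpolation weight `LAMBDA`, the toy hypotheses, and the scoring values are all assumptions made for the example.

```python
# Sketch of two-pass rescoring: the first pass (RNN-T) produces n-best
# hypotheses with log-probability scores; the second pass (LAS) rescores
# each one, and the final score interpolates the two passes.

LAMBDA = 0.5  # illustrative interpolation weight between the two passes

def rescore(nbest, las_score, lam=LAMBDA):
    """nbest: list of (hypothesis, rnnt_log_prob) pairs.
    las_score: callable mapping a hypothesis to a LAS log-probability.
    Returns the hypothesis with the best interpolated score."""
    best_hyp, best_score = None, float("-inf")
    for hyp, rnnt_lp in nbest:
        combined = (1 - lam) * rnnt_lp + lam * las_score(hyp)
        if combined > best_score:
            best_hyp, best_score = hyp, combined
    return best_hyp, best_score

# Toy example: the LAS pass prefers the second hypothesis strongly
# enough to overturn the first-pass ranking.
nbest = [("play the song", -3.0), ("play this song", -3.2)]
las = {"play the song": -4.0, "play this song": -2.0}.__getitem__
hyp, score = rescore(nbest, las)
```

The key design point is that the streaming first pass commits nothing final; the rescorer only reorders a small lattice, keeping second-pass cost low.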
Training employs a large and diverse corpus spanning voice search, telephony, and YouTube sources, introducing both acoustic and linguistic diversity to the E2E model. To reinforce robustness, the training data includes speech from locales with accented English, capturing varied pronunciations directly from data rather than through traditional lexicon-based methods. A constant learning rate schedule is explored as an alternative to a decaying one, supported by empirical findings that suggest improved convergence on large data sets.
Quality and Latency Improvements
The results showcase significant gains over prior methodologies. Training on multi-domain data improves performance on core test sets, reducing Word Error Rate (WER) by about 8% relative to the baseline. Moreover, including accented data reduces WER on the corresponding test sets, illustrating the model's ability to generalize across diverse speech patterns.
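For clarity on how "8% relative" differs from an absolute reduction, the metric can be computed as follows; the baseline and improved WER values used in the example are illustrative, not figures from the paper.

```python
def wer(errors, ref_words):
    """Word Error Rate: (substitutions + insertions + deletions)
    divided by the number of words in the reference transcript."""
    return errors / ref_words

def relative_wer_reduction(baseline_wer, improved_wer):
    """Relative reduction: e.g. going from 10.0% to 9.2% WER is an
    0.8-point absolute drop but an 8% relative improvement."""
    return (baseline_wer - improved_wer) / baseline_wer
```

Reporting relative reductions is standard in ASR because a fixed absolute drop means very different things at 30% WER versus 5% WER.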
The paper makes strides in latency optimization, a pivotal metric in real-time ASR, by integrating the end-of-query decision directly into the RNN-T model, thus unifying recognition and endpointing. This yields low endpointer (EP) latency, reducing 90th-percentile endpointer latency (EP90) to levels competitive with traditional server-side models without compromising recognition quality.
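As a small illustration of the EP90 metric itself (not the paper's measurement pipeline), the 90th-percentile latency over a set of per-utterance endpoint delays can be computed with the nearest-rank method; the sample latencies below are invented for the example.

```python
import math

def ep90(latencies_ms):
    """90th-percentile endpointer latency (nearest-rank method):
    the smallest value such that at least 90% of utterances
    reached their endpoint decision at or below it."""
    ordered = sorted(latencies_ms)
    rank = max(0, math.ceil(0.9 * len(ordered)) - 1)
    return ordered[rank]

# Ten invented per-utterance endpoint delays, in milliseconds.
sample = [100, 200, 300, 400, 500, 600, 700, 800, 900, 1000]
```

A tail percentile is used rather than the mean because users perceive the slowest responses most acutely; a system can have a good average latency and still feel sluggish if its tail is long.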
Implications and Future Directions
The implications of this research are manifold. E2E models of this nature bolster the case for on-device ASR solutions, offering a strong balance of computational efficiency, privacy preservation, and reduced latency, all critical for user-centric applications. Furthermore, matching or surpassing conventional systems on diverse accents and multi-domain tasks marks a significant milestone in ASR research.
Looking forward, incremental improvements in model size and computational efficiency, possibly through further exploration of quantization techniques or more advanced learning rate schedules, could see such models deployed across a broader spectrum of devices. This could usher in an era of ubiquitous on-device ASR, enabled by reduced dependence on server infrastructure. Subsequent research could also explore end-to-end learning dynamics, potentially integrating unsupervised learning to boost performance in data-scarce environments.
This paper serves as a pivotal reference in the evolution of ASR architecture, blending advancements in neural network designs with practical deployment considerations to deliver marked improvements in ASR capabilities.