Improving RNN Transducer Modeling for End-to-End Speech Recognition
The paper "Improving RNN Transducer Modeling for End-to-End Speech Recognition" presents significant advances in end-to-end (E2E) automatic speech recognition (ASR), focusing primarily on the efficiency and effectiveness of RNN Transducer (RNN-T) models. RNN-T has attracted growing interest because it supports online streaming without the frame-independence assumption inherent in Connectionist Temporal Classification (CTC) models, making it a better fit for real-time applications than the Attention Encoder-Decoder (AED) approach.
Enhancements in RNN-T Training
Memory Optimization and Training Speed:
The authors tackle the substantial memory consumption that constrains RNN-T training, which requires computing over a three-dimensional grid of time and label alignments. By redesigning how the encoder and prediction network outputs are combined and by revising the gradient calculation, the paper reports a significant reduction in memory requirements. This permits larger training minibatches, accelerating training without compromising model accuracy or increasing hardware demands.
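To see why RNN-T training is memory-hungry, consider how the encoder and prediction network outputs are combined. The sketch below uses hypothetical tensor shapes (not the paper's exact implementation): a naive broadcast addition materializes a four-dimensional tensor over the full time-by-label grid, so memory grows with T × U.

```python
import numpy as np

# Hypothetical sizes: batch B, time frames T, label positions U, hidden dim D.
B, T, U, D = 2, 100, 20, 640

enc = np.random.randn(B, T, 1, D)    # encoder output, expanded on the label axis
pred = np.random.randn(B, 1, U, D)   # prediction network output, expanded on time

# Naive joint computation: broadcasting materializes a (B, T, U, D) tensor,
# so memory scales with T * U -- the cost that the paper's redesigned
# combination step and gradient calculation aim to reduce.
joint = np.tanh(enc + pred)

print(joint.shape)    # (2, 100, 20, 640)
print(joint.size * 8) # bytes at float64: B * T * U * D * 8
```

Even at these toy sizes the joint tensor holds 2.56 million values; at realistic sequence lengths and vocabulary sizes, avoiding or shrinking this intermediate is what enables the larger minibatches reported in the paper.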
Model Architecture Improvements:
Beyond algorithmic optimizations, the paper introduces novel model structures. Leveraging the Layer Trajectory LSTM (ltLSTM) and its contextual variant (cltLSTM), which incorporates future frame context at each layer, the authors report improved recognition accuracy. They further propose an Element-wise Contextual Layer Trajectory GRU (ecltGRU), which retains the accuracy benefit of future context while reducing model size, favoring deployment in memory-constrained environments such as on-device ASR systems.
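The idea of incorporating future frame context at each layer can be illustrated with a toy lookahead function; this is a simplification with made-up uniform weights, not the paper's cltLSTM equations.

```python
# Toy illustration of per-layer future-frame context (lookahead), in the
# spirit of cltLSTM. The uniform averaging rule here is hypothetical; the
# actual model learns how to combine future frames.

def add_future_context(frames, tau):
    """Combine each frame with its next `tau` frames (uniform weights)."""
    T = len(frames)
    out = []
    for t in range(T):
        window = frames[t:min(t + tau + 1, T)]  # current frame plus lookahead
        out.append(sum(window) / len(window))
    return out

h = [1.0, 2.0, 3.0, 4.0]
print(add_future_context(h, tau=1))  # [1.5, 2.5, 3.5, 4.0]
```

Note the trade-off this implies for streaming: each layer's lookahead of tau frames adds latency, and the total latency accumulates across stacked layers, which is why the amount of future context must be budgeted carefully in online ASR.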
Experimental Validation
Trained on a large dataset of 30,000 hours of transcribed speech, the proposed models achieved notable word error rate (WER) reductions across several test sets, including Cortana, Conversation, and DMA. For example, the ecltGRU model achieved an 11.8% relative WER reduction over the baseline. The model size was also reduced to 216 MB, and the model outperformed a similarly sized hybrid model by roughly 15% in WER, demonstrating its robustness and readiness for real-time, on-device deployment. Notably, it also achieved WERs competitive with much larger hybrid models used in server setups.
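For readers unfamiliar with the metric, "relative WER reduction" is computed against the baseline's WER, not as an absolute difference; the numbers below are invented solely to illustrate the arithmetic.

```python
def relative_werr(baseline_wer, new_wer):
    """Relative word error rate reduction, as a percentage of the baseline."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer

# Hypothetical WERs: an 11.8% relative reduction would mean, for instance,
# dropping from 10.0% WER to 8.82% WER.
print(round(relative_werr(10.0, 8.82), 1))  # 11.8
```

So a reported 11.8% relative reduction corresponds to a much smaller absolute change (here, 1.18 percentage points), which is worth keeping in mind when comparing systems.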
Theoretical and Practical Implications
The presented advancements have dual significance. Theoretically, the paper contributes efficient RNN-T structures such as ltLSTM and cltLSTM, which decouple temporal modeling from classification and incorporate future context to improve prediction. Practically, these insights allow ASR systems to be deployed on resource-constrained devices without sacrificing quality, meeting growing demand for ubiquitous, responsive speech interfaces.
Future Considerations
Given the demonstrated reductions in WER and resource requirements, future research might focus on further reducing latency and on more sophisticated sequence-discriminative training methodologies. Additionally, extending these methods to broader language contexts and integrating multi-modal input could further enhance the real-world applicability and performance of E2E ASR systems. The paper lays a strong foundation for such explorations, underscoring the potential of RNN-T models to shape the frontier of ASR technology.