- The paper shows that pre-trained encoder initialization and deeper architectures yield 5–10% relative word error rate improvements.
- The paper demonstrates effective use of combined acoustic and text data, integrating phoneme targets and LSTM language model pre-training to reduce errors.
- The paper finds that employing a large vocabulary of wordpieces outperforms grapheme-based systems, enhancing context modeling and reducing substitution errors.
Overview of Streaming End-to-End Speech Recognition with RNN-Transducer
The paper "Exploring Architectures, Data and Units For Streaming End-to-End Speech Recognition with RNN-Transducer" presents an investigation into the application of the Recurrent Neural Network Transducer (RNN-T) for automatic speech recognition (ASR), specifically targeting streaming applications. This work is notable in its exploration of how different model architectures, data configurations, and unit representations can influence the effectiveness of end-to-end speech recognition systems.
The focus of this research is the RNN-T model, an architecture that integrates the acoustic and language modeling components into a single framework capable of producing text transcripts from audio input in a streaming manner. Key components of the RNN-T model include an encoder, typically pre-trained with a Connectionist Temporal Classification (CTC) loss, and a decoder that can be initialized from a recurrent neural network language model trained on text-only data.
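To make this structure concrete, here is a minimal PyTorch sketch of the three RNN-T components: an acoustic encoder, a prediction network over previously emitted labels, and a joint network that combines them. The layer counts, sizes, and names below are illustrative assumptions rather than the paper's exact configuration; training would minimize the RNN-T loss (available, for example, as torchaudio.functional.rnnt_loss), which marginalizes over all alignments between the audio frames and the label sequence.

```python
import torch
import torch.nn as nn

class RNNTransducer(nn.Module):
    """Minimal RNN-T sketch: acoustic encoder, prediction network, joint network.
    Sizes are illustrative; output index 0 is conventionally reserved for blank."""

    def __init__(self, num_mels=80, vocab_size=1000, hidden=640):
        super().__init__()
        # Acoustic encoder (plays the role of the acoustic model);
        # unidirectional LSTMs keep the model streamable.
        self.encoder = nn.LSTM(num_mels, hidden, num_layers=8, batch_first=True)
        # Prediction network (plays the role of a language model) over previous labels.
        self.embed = nn.Embedding(vocab_size, hidden)
        self.decoder = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
        # Joint network combines both streams and scores the next label or blank.
        self.joint = nn.Sequential(nn.Linear(2 * hidden, hidden), nn.Tanh(),
                                   nn.Linear(hidden, vocab_size))

    def forward(self, feats, labels):
        # feats: (B, T, num_mels) acoustic frames; labels: (B, U) previously
        # emitted label ids (in practice prepended with a blank/start symbol).
        enc, _ = self.encoder(feats)               # (B, T, H)
        dec, _ = self.decoder(self.embed(labels))  # (B, U, H)
        # Broadcast over the time (T) and label (U) axes -> (B, T, U, 2H).
        joint_in = torch.cat(
            [enc.unsqueeze(2).expand(-1, -1, dec.size(1), -1),
             dec.unsqueeze(1).expand(-1, enc.size(1), -1, -1)], dim=-1)
        return self.joint(joint_in)                # logits: (B, T, U, vocab_size)
```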
Key Findings and Contributions
- Model Architecture and Initialization:
- The authors demonstrate that initializing the encoder with CTC pre-trained weights improves performance, yielding an approximately 5% relative word error rate (WER) reduction (a weight-transfer sketch follows this list).
- Deeper architectures, such as an 8-layer encoder, further improve recognition accuracy, giving a 10% relative improvement over shallower 5-layer configurations.
- Data Utilization:
- The paper highlights the value of combining acoustic and text data when training the system. Incorporating pronunciation knowledge via phoneme targets within a hierarchical-CTC framework provides noticeable WER reductions, particularly on voice-search tasks.
- Text-only data is leveraged by pre-training the decoder as an LSTM language model, which yields an additional 5% relative WER reduction (a language-model transfer sketch follows this list).
- Use of Wordpieces:
- The paper shows that employing sub-word units such as wordpieces, especially with a large vocabulary (e.g., 30,000 wordpieces), significantly outperforms grapheme-based systems. Wordpiece models benefit from longer-context modeling and are more effective at reducing substitution errors (a subword tokenization sketch follows this list).
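As a rough illustration of the encoder initialization described above, the sketch below pre-trains an LSTM encoder with PyTorch's CTC loss and then copies its weights into an identically shaped RNN-T encoder. All sizes and the dummy batch are assumptions for illustration; the paper's hierarchical phoneme-CTC variant is not shown.

```python
import torch
import torch.nn as nn

# Stage 1: pre-train an LSTM encoder with a CTC loss over grapheme (or phoneme) targets.
num_mels, hidden, vocab = 80, 640, 75          # illustrative sizes
encoder = nn.LSTM(num_mels, hidden, num_layers=8, batch_first=True)
ctc_head = nn.Linear(hidden, vocab + 1)        # +1 output for the CTC blank symbol
ctc_loss = nn.CTCLoss(blank=vocab, zero_infinity=True)

feats = torch.randn(4, 200, num_mels)          # dummy (B, T, mels) batch
targets = torch.randint(0, vocab, (4, 30))     # dummy (B, U) label ids
enc_out, _ = encoder(feats)
log_probs = ctc_head(enc_out).log_softmax(-1).transpose(0, 1)  # CTCLoss expects (T, B, C)
loss = ctc_loss(log_probs, targets,
                torch.full((4,), 200), torch.full((4,), 30))
loss.backward()                                # in practice, iterate over real batches

# Stage 2: initialize the RNN-T encoder from the CTC-trained weights; the CTC
# output head is discarded, since the joint network replaces it.
rnnt_encoder = nn.LSTM(num_mels, hidden, num_layers=8, batch_first=True)
rnnt_encoder.load_state_dict(encoder.state_dict())
```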
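Similarly, the decoder initialization from text data can be sketched as training an LSTM language model on text-only data and transferring its embedding and LSTM weights into the RNN-T prediction network, discarding the LM softmax head. Again, the sizes and dummy batch are illustrative assumptions.

```python
import torch
import torch.nn as nn

# Train an LSTM language model on text-only data (next-label prediction).
vocab, hidden = 30000, 640                         # e.g. a 30k wordpiece vocabulary
embed = nn.Embedding(vocab, hidden)
lm_lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
lm_head = nn.Linear(hidden, vocab)
xent = nn.CrossEntropyLoss()

tokens = torch.randint(0, vocab, (8, 32))          # dummy (B, U) wordpiece ids
states, _ = lm_lstm(embed(tokens[:, :-1]))
loss = xent(lm_head(states).reshape(-1, vocab), tokens[:, 1:].reshape(-1))
loss.backward()                                    # in practice, iterate over a large text corpus

# Transfer the trained weights into the RNN-T prediction network (same shapes);
# the LM softmax head is not carried over.
pred_embed = nn.Embedding(vocab, hidden)
pred_lstm = nn.LSTM(hidden, hidden, num_layers=2, batch_first=True)
pred_embed.load_state_dict(embed.state_dict())
pred_lstm.load_state_dict(lm_lstm.state_dict())
```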
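For the wordpiece units, a subword inventory can be built with an off-the-shelf tokenizer. The paper's wordpiece model is not necessarily SentencePiece, so the snippet below is only an illustrative sketch; it assumes the sentencepiece package and a placeholder transcript file named transcripts.txt.

```python
import sentencepiece as spm

# Train a 30k-piece subword model on a text corpus (the file path is a placeholder).
spm.SentencePieceTrainer.train(
    input="transcripts.txt", model_prefix="wp30k",
    vocab_size=30000, model_type="unigram")

sp = spm.SentencePieceProcessor(model_file="wp30k.model")
pieces = sp.encode("play some jazz music", out_type=str)   # subword strings
ids = sp.encode("play some jazz music", out_type=int)      # integer ids for RNN-T targets
print(pieces, ids)
```

Frequent words remain whole pieces while rare words split into smaller units, which is what gives wordpiece models longer effective context than graphemes and helps reduce substitution errors.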
Numerical Performance
The paper reports a leading system with a 12-layer LSTM encoder and a 2-layer LSTM decoder using 30,000 wordpieces, achieving an 8.5% WER on voice-search and a 5.2% WER on voice-dictation tasks. These results are closely competitive with a state-of-the-art conventional ASR baseline, which achieves WERs of 8.3% and 5.4% on the same tasks.
Implications and Future Prospects
The findings suggest that RNN-T models, when optimally configured and pre-trained, can match the performance of conventional ASR systems while retaining the benefits of end-to-end training. The research further underscores the potential for such models to simplify the ASR pipeline by eliminating the need for separate acoustic, pronunciation, and language models.
For future developments, exploring even deeper network architectures or alternative pre-training methods might yield further gains. Advances in unsupervised learning could improve the utilization of text and acoustic data, strengthening the model's language modeling capability. As ASR technology advances, deploying RNN-T models in real-time applications could transform voice-interface functionality, improving robustness and accuracy across diverse operating environments.