On the Comparison of Popular End-to-End Models for Large Scale Speech Recognition (2005.14327v2)

Published 28 May 2020 in eess.AS and cs.CL

Abstract: Recently, there has been a strong push to transition from hybrid models to end-to-end (E2E) models for automatic speech recognition. Currently, there are three promising E2E methods: recurrent neural network transducer (RNN-T), RNN attention-based encoder-decoder (AED), and Transformer-AED. In this study, we conduct an empirical comparison of RNN-T, RNN-AED, and Transformer-AED models, in both non-streaming and streaming modes. We use 65 thousand hours of Microsoft anonymized training data to train these models. Because E2E models are more data hungry, it is better to compare their effectiveness with a large amount of training data. To the best of our knowledge, no such comprehensive study has been conducted yet. We show that although AED models are stronger than RNN-T in non-streaming mode, RNN-T is very competitive in streaming mode if its encoder can be properly initialized. Among all three E2E models, Transformer-AED achieved the best accuracy in both streaming and non-streaming modes. We show that both streaming RNN-T and Transformer-AED models can obtain better accuracy than a highly optimized hybrid model.

Authors (6)

Jinyu Li
Yu Wu
Yashesh Gaur
Chengyi Wang
Rui Zhao
Shujie Liu
