Exploring Pre-training with Alignments for RNN Transducer based End-to-End Speech Recognition (2005.00572v1)

Published 1 May 2020 in cs.CL and eess.AS

Abstract: Recently, the recurrent neural network transducer (RNN-T) architecture has become an emerging trend in end-to-end automatic speech recognition research due to its suitability for online streaming speech recognition. However, RNN-T training is made difficult by its huge memory requirements and complicated neural structure. A common solution to ease RNN-T training is to employ a connectionist temporal classification (CTC) model along with an RNN language model (RNNLM) to initialize the RNN-T parameters. In this work, we conversely leverage external alignments to seed the RNN-T model. Two different pre-training solutions are explored, referred to as encoder pre-training and whole-network pre-training, respectively. Evaluated on Microsoft's 65,000-hour anonymized production data with personally identifiable information removed, our proposed methods obtain significant improvements. In particular, the encoder pre-training solution achieves 10% and 8% relative word error rate reductions compared with random initialization and the widely used CTC+RNNLM initialization strategy, respectively. Our solutions also significantly reduce the RNN-T model latency from the baseline.
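
The encoder pre-training idea described in the abstract lends itself to a short illustration: the encoder is first trained as a frame-level classifier against the external alignments using a cross-entropy loss, and the resulting weights then seed the encoder of the full RNN-T. The sketch below is a minimal, hypothetical PyTorch rendering of that recipe; the names and sizes (`Encoder`, `pretrain_encoder`, `num_targets`, the layer dimensions) are assumptions for illustration, not the paper's actual implementation.

```python
# Hypothetical sketch of alignment-based encoder pre-training for RNN-T.
# The encoder is trained to predict per-frame alignment labels with a
# cross-entropy loss; the classification head is then discarded and the
# encoder weights are used to initialize the RNN-T encoder.
import torch
import torch.nn as nn

class Encoder(nn.Module):
    """LSTM encoder shared between pre-training and the final RNN-T."""
    def __init__(self, feat_dim=80, hidden=640, layers=4):
        super().__init__()
        self.lstm = nn.LSTM(feat_dim, hidden, layers, batch_first=True)

    def forward(self, x):                 # x: (batch, frames, feat_dim)
        out, _ = self.lstm(x)
        return out                        # (batch, frames, hidden)

def pretrain_encoder(encoder, loader, num_targets, epochs=1, lr=1e-4):
    """Train encoder + linear head on frame-level alignment targets."""
    head = nn.Linear(encoder.lstm.hidden_size, num_targets)
    params = list(encoder.parameters()) + list(head.parameters())
    opt = torch.optim.Adam(params, lr=lr)
    ce = nn.CrossEntropyLoss()
    for _ in range(epochs):
        for feats, align in loader:       # align: (batch, frames) labels
            logits = head(encoder(feats))             # (B, T, num_targets)
            loss = ce(logits.transpose(1, 2), align)  # CE over every frame
            opt.zero_grad()
            loss.backward()
            opt.step()
    return encoder  # head is discarded; encoder seeds the RNN-T
```

Whole-network pre-training would follow the same spirit but propagate the alignment-derived supervision through the prediction and joint networks as well, rather than through the encoder alone.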

Authors (5)
  1. Hu Hu (18 papers)
  2. Rui Zhao (241 papers)
  3. Jinyu Li (164 papers)
  4. Liang Lu (42 papers)
  5. Yifan Gong (82 papers)
Citations (27)
