Effective Decoder Masking for Transformer Based End-to-End Speech Recognition

Published 27 Oct 2020 in eess.AS | (2010.14764v2)

Abstract: The attention-based encoder-decoder modeling paradigm has achieved promising results on a variety of speech processing tasks like automatic speech recognition (ASR), text-to-speech (TTS) and among others. This paradigm takes advantage of the generalization ability of neural networks to learn a direct mapping from an input sequence to an output sequence, without recourse to prior knowledge such as audio-text alignments or pronunciation lexicons. However, ASR models stemming from this paradigm are prone to overfitting, especially when the training data is limited. Inspired by SpecAugment and BERT-like masked language modeling, we propose in the paper a decoder masking based training approach for end-to-end (E2E) ASR models. During the training phase we randomly replace some portions of the decoder's historical text input with the symbol [mask], in order to encourage the decoder to robustly output a correct token even when parts of its decoding history are masked or corrupted. The proposed approach is instantiated with the top-of-the-line transformer-based E2E ASR model. Extensive experiments on the Librispeech960h and TedLium2 benchmark datasets demonstrate the superior performance of our approach in comparison to some existing strong E2E ASR systems.