An Investigation of Monotonic Transducers for Large-Scale Automatic Speech Recognition (2204.08858v2)

Published 19 Apr 2022 in eess.AS and cs.SD

Abstract: The two most popular loss functions for streaming end-to-end automatic speech recognition (ASR) are the RNN-Transducer (RNN-T) and connectionist temporal classification (CTC). Between these two loss types fall the monotonic RNN-T (MonoRNN-T) and the recently proposed CTC-like Transducer (CTC-T). Monotonic transducers have a few advantages. First, RNN-T can suffer from runaway hallucination, where a model keeps emitting non-blank symbols without advancing in time. Second, monotonic transducers consume exactly one model score per time step and are therefore more compatible with traditional FST-based ASR decoders. However, MonoRNN-T has so far been found to have worse accuracy than RNN-T. It does not have to be that way: by regularizing the training, via joint LAS training or parameter initialization from RNN-T, both MonoRNN-T and CTC-T perform as well as or better than RNN-T. This is demonstrated on LibriSpeech and on a large-scale in-house data set.
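
The contrast between the transducer variants can be made concrete with their forward recursions. The following is a minimal sketch using standard transducer lattice notation (the symbols $\alpha$, $b$, and $y$ are illustrative assumptions, not taken from the paper): let $\alpha(t, u)$ be the total probability of having emitted the first $u$ reference labels after consuming $t$ encoder frames, $b(t, u)$ the blank probability at lattice node $(t, u)$, and $y(t, u)$ the probability of the next reference label at that node.

$$\alpha_{\mathrm{RNN\text{-}T}}(t, u) = \alpha(t-1, u)\, b(t-1, u) + \alpha(t, u-1)\, y(t, u-1)$$

$$\alpha_{\mathrm{MonoRNN\text{-}T}}(t, u) = \alpha(t-1, u)\, b(t-1, u) + \alpha(t-1, u-1)\, y(t-1, u-1)$$

In RNN-T, a label emission leaves $t$ unchanged, so a model can in principle emit unboundedly many labels per frame, which is the runaway-hallucination failure mode mentioned above. In MonoRNN-T, every transition, blank or label, advances $t$ by one, so exactly one model score is consumed per time step.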

Authors (5)
  1. Niko Moritz (23 papers)
  2. Frank Seide (16 papers)
  3. Duc Le (46 papers)
  4. Jay Mahadeokar (36 papers)
  5. Christian Fuegen (36 papers)
Citations (7)
