Alignment Restricted Streaming Recurrent Neural Network Transducer (2011.03072v1)

Published 5 Nov 2020 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: There is a growing interest in the speech community in developing Recurrent Neural Network Transducer (RNN-T) models for automatic speech recognition (ASR) applications. RNN-T is trained with a loss function that does not enforce temporal alignment of the training transcripts and audio. As a result, RNN-T models built with uni-directional long short term memory (LSTM) encoders tend to wait for longer spans of input audio, before streaming already decoded ASR tokens. In this work, we propose a modification to the RNN-T loss function and develop Alignment Restricted RNN-T (Ar-RNN-T) models, which utilize audio-text alignment information to guide the loss computation. We compare the proposed method with existing works, such as monotonic RNN-T, on LibriSpeech and in-house datasets. We show that the Ar-RNN-T loss provides a refined control to navigate the trade-offs between the token emission delays and the Word Error Rate (WER). The Ar-RNN-T models also improve downstream applications such as the ASR End-pointing by guaranteeing token emissions within any given range of latency. Moreover, the Ar-RNN-T loss allows for bigger batch sizes and 4 times higher throughput for our LSTM model architecture, enabling faster training and convergence on GPUs.
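
To illustrate the core idea in the abstract (using an external audio-text alignment to restrict which lattice nodes may emit each token), the sketch below builds a boolean mask over the T x U emission grid from a per-token forced alignment and left/right frame buffers. This is a minimal sketch, not the authors' implementation: the function name `alignment_restriction_mask` and the parameters `b_left` / `b_right` are hypothetical, and the paper applies the restriction inside the RNN-T forward-backward loss computation rather than as a standalone mask.

```python
import numpy as np

def alignment_restriction_mask(T, token_alignments, b_left, b_right):
    """Build a boolean mask over the RNN-T emission grid of shape (T, U).

    mask[t, u] is True if emitting token u at audio frame t is allowed,
    i.e. t lies within [a_u - b_left, a_u + b_right], where a_u is the
    frame to which token u is aligned by an external forced alignment.

    T                : number of encoder (audio) frames
    token_alignments : sequence of length U with each token's aligned frame
    b_left, b_right  : buffer sizes (in frames) around the alignment
    """
    U = len(token_alignments)
    mask = np.zeros((T, U), dtype=bool)
    for u, a_u in enumerate(token_alignments):
        lo = max(0, a_u - b_left)
        hi = min(T - 1, a_u + b_right)
        mask[lo:hi + 1, u] = True
    return mask

# Example usage (hypothetical alignment): during the RNN-T forward-backward
# pass, the log-probability of emitting token u at frame t would be set to
# -inf wherever mask[t, u] is False, so only alignment-consistent lattice
# paths contribute to the loss.
mask = alignment_restriction_mask(
    T=100, token_alignments=[12, 30, 55, 80], b_left=5, b_right=15
)
```

A practical consequence, as the abstract notes, is that discarding out-of-window lattice nodes also shrinks the memory footprint of the loss computation, which is what permits the larger batch sizes and higher training throughput reported.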

Authors (9)
  1. Jay Mahadeokar (36 papers)
  2. Yuan Shangguan (25 papers)
  3. Duc Le (46 papers)
  4. Gil Keren (22 papers)
  5. Hang Su (224 papers)
  6. Thong Le (3 papers)
  7. Ching-Feng Yeh (22 papers)
  8. Christian Fuegen (36 papers)
  9. Michael L. Seltzer (34 papers)
Citations (63)