
Language model fusion for streaming end to end speech recognition (2104.04487v1)

Published 9 Apr 2021 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: Streaming processing of speech audio is required for many contemporary practical speech recognition tasks. Even with the large corpora of manually transcribed speech data available today, it is impossible for such corpora to cover adequately the long tail of linguistic content that's important for tasks such as open-ended dictation and voice search. We seek to address both the streaming and the tail recognition challenges by using a language model (LM) trained on unpaired text data to enhance the end-to-end (E2E) model. We extend shallow fusion and cold fusion approaches to streaming Recurrent Neural Network Transducer (RNNT), and also propose two new competitive fusion approaches that further enhance the RNNT architecture. Our results on multiple languages with varying training set sizes show that these fusion methods improve streaming RNNT performance through introducing extra linguistic features. Cold fusion works consistently better on streaming RNNT with up to an 8.5% WER improvement.
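
As a rough illustration of the shallow fusion idea named in the abstract, the sketch below combines E2E and LM scores for one decoding step by log-linear interpolation, which is the generic textbook form of shallow fusion. It is not the paper's implementation: the function name `shallow_fusion_step`, the weight `lm_weight`, and the toy probabilities are assumptions made for illustration, and RNNT-specific details such as blank-symbol handling are left out.

```python
import math


def shallow_fusion_step(e2e_log_probs, lm_log_probs, lm_weight=0.3):
    """Fuse per-token scores: log p_E2E(token) + lm_weight * log p_LM(token).

    Hypothetical helper for illustration; both inputs map each candidate
    token to a log-probability and must share the same keys.
    """
    return {
        token: e2e_lp + lm_weight * lm_log_probs[token]
        for token, e2e_lp in e2e_log_probs.items()
    }


# Toy scores (made up): the acoustic-side model slightly prefers "their",
# while the text-only LM strongly prefers "there" in this context.
e2e = {"their": math.log(0.5), "there": math.log(0.4), "they're": math.log(0.1)}
lm = {"their": math.log(0.1), "there": math.log(0.8), "they're": math.log(0.1)}

# With lm_weight=0 the LM is ignored and the E2E choice is kept.
no_lm = shallow_fusion_step(e2e, lm, lm_weight=0.0)
best_without_lm = max(no_lm, key=no_lm.get)

# With a positive weight the LM can change the top hypothesis.
fused = shallow_fusion_step(e2e, lm, lm_weight=0.5)
best_with_lm = max(fused, key=fused.get)

print(best_without_lm, best_with_lm)
```

In this form the LM is only consulted at decoding time, so the E2E model needs no retraining; the weight is a tunable hyperparameter. Cold fusion, by contrast, involves the LM during training of the E2E model and is not shown here.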

Authors (6)
  1. Rodrigo Cabrera (3 papers)
  2. Xiaofeng Liu (124 papers)
  3. Mohammadreza Ghodsi (3 papers)
  4. Zebulun Matteson (1 paper)
  5. Eugene Weinstein (5 papers)
  6. Anjuli Kannan (19 papers)
Citations (14)
