Bridging the gap between streaming and non-streaming ASR systems by distilling ensembles of CTC and RNN-T models (2104.14346v1)

Published 25 Apr 2021 in cs.CL, cs.SD, and eess.AS

Abstract: Streaming end-to-end automatic speech recognition (ASR) systems are widely used in everyday applications that require transcribing speech to text in real time. Their minimal latency makes them suitable for such tasks. Unlike their non-streaming counterparts, streaming models are constrained to be causal with no future context and suffer from higher word error rates (WER). To improve streaming models, a recent study [1] proposed to distill a non-streaming teacher model on unsupervised utterances, and then train a streaming student using the teacher's predictions. However, the performance gap between teacher and student WERs remains high. In this paper, we aim to close this gap by using a diversified set of non-streaming teacher models and combining them using Recognizer Output Voting Error Reduction (ROVER). In particular, we show that, despite being weaker than RNN-T models, CTC models are remarkable teachers. Further, by fusing RNN-T and CTC models together, we build the strongest teachers. The resulting student models drastically improve upon the streaming models of previous work [1]: the WER decreases by 41% on Spanish, 27% on Portuguese, and 13% on French.
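The teacher-combination step relies on ROVER, which aligns the word sequences produced by several recognizers into a word transition network and takes a per-slot vote. The Python sketch below is a simplified illustration of that idea, not the paper's implementation: it aligns each hypothesis to a running backbone with plain Levenshtein alignment and uses unweighted majority voting (real ROVER can also weight votes by word confidence). All names here (`align`, `rover`, `NULL`) are ours, introduced for illustration.

```python
# Minimal ROVER-style combination sketch: align each teacher hypothesis
# to a word transition network, then pick the majority word per slot.
from collections import Counter

NULL = "<null>"  # marks a deletion (the hypothesis casts no vote in a slot)

def align(ref, hyp):
    """Levenshtein-align hyp to ref; return (ref_slot, hyp_word) columns."""
    n, m = len(ref), len(hyp)
    cost = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        cost[i][0] = i
    for j in range(m + 1):
        cost[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = min(cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]),
                             cost[i - 1][j] + 1,   # deletion
                             cost[i][j - 1] + 1)   # insertion
    cols, i, j = [], n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and cost[i][j] == cost[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1]):
            cols.append((i - 1, hyp[j - 1])); i -= 1; j -= 1   # match / substitution
        elif i > 0 and cost[i][j] == cost[i - 1][j] + 1:
            cols.append((i - 1, NULL)); i -= 1                 # slot gets no vote
        else:
            cols.append((None, hyp[j - 1])); j -= 1            # inserted slot
    return list(reversed(cols))

def rover(hypotheses):
    """Combine word-level hypotheses by per-slot majority vote."""
    network = [Counter({w: 1}) for w in hypotheses[0]]
    seen = 1
    for hyp in hypotheses[1:]:
        backbone = [c.most_common(1)[0][0] for c in network]
        new_network = []
        for slot, word in align(backbone, hyp):
            if slot is None:
                new_network.append(Counter({word: 1}))  # fresh slot from insertion
            else:
                c = network[slot]
                if word != NULL:
                    c[word] += 1
                new_network.append(c)
        network, seen = new_network, seen + 1
    out = []
    for c in network:
        word, votes = c.most_common(1)[0]
        if votes > seen - sum(c.values()):  # word must beat implicit NULL votes
            out.append(word)
    return " ".join(out)

teachers = [
    "the cat sat on the mat".split(),
    "the cat sat on a mat".split(),
    "a cat sat down on the mat".split(),
]
print(rover(teachers))  # -> "the cat sat on the mat"
```

In the pipeline the abstract describes, the combined transcript would then serve as the distillation target for training the streaming student on unsupervised audio.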

Authors (6)
  1. Thibault Doutre (3 papers)
  2. Wei Han (202 papers)
  3. Chung-Cheng Chiu (48 papers)
  4. Ruoming Pang (59 papers)
  5. Olivier Siohan (13 papers)
  6. Liangliang Cao (52 papers)
Citations (5)
