Streaming Attention-Based Models with Augmented Memory for End-to-End Speech Recognition (2011.07120v1)

Published 3 Nov 2020 in cs.CL

Abstract: Attention-based models have been gaining popularity recently for their strong performance demonstrated in fields such as machine translation and automatic speech recognition. One major challenge of attention-based models is the need for access to the full sequence and the computational cost that grows quadratically with the sequence length. These characteristics pose challenges, especially for low-latency scenarios, where the system is often required to be streaming. In this paper, we build a compact and streaming speech recognition system on top of the end-to-end neural transducer architecture with attention-based modules augmented with convolution. The proposed system equips the end-to-end models with streaming capability and reduces the large footprint of the streaming attention-based model using augmented memory. On the LibriSpeech dataset, our proposed system achieves word error rates of 2.7% on test-clean and 5.8% on test-other, which are, to the best of our knowledge, the lowest among streaming approaches reported so far.
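
The central idea in the abstract, attending over fixed-size chunks plus a compact memory bank rather than the full sequence, can be illustrated with a minimal sketch. This is not the paper's implementation: the module name AugmentedMemoryAttention, the mean-pooled summary vector, and all dimensions are illustrative assumptions built on standard PyTorch components.

```python
# Minimal sketch (not the authors' code) of chunk-wise self-attention with an
# augmented memory bank, as described at a high level in the abstract.
import torch
import torch.nn as nn


class AugmentedMemoryAttention(nn.Module):
    """Processes the input in fixed-size chunks; each chunk attends to itself
    plus a bank of memory vectors summarizing previously seen chunks, so the
    attention context stays bounded and the model can run in streaming mode."""

    def __init__(self, d_model: int, n_heads: int, chunk_size: int):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.chunk_size = chunk_size

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, d_model)
        outputs, memory_bank = [], []
        for start in range(0, x.size(1), self.chunk_size):
            chunk = x[:, start:start + self.chunk_size]
            # Summarize the chunk into one memory vector (mean pooling here;
            # the paper's summarization choice may differ).
            summary = chunk.mean(dim=1, keepdim=True)
            # Keys/values = memory bank so far + current chunk, instead of the
            # full sequence, avoiding quadratic growth with sequence length.
            context = torch.cat(memory_bank + [chunk], dim=1)
            out, _ = self.attn(query=chunk, key=context, value=context)
            outputs.append(out)
            memory_bank.append(summary)
        return torch.cat(outputs, dim=1)


if __name__ == "__main__":
    layer = AugmentedMemoryAttention(d_model=256, n_heads=4, chunk_size=16)
    feats = torch.randn(2, 64, 256)   # (batch, frames, features)
    print(layer(feats).shape)         # torch.Size([2, 64, 256])
```

In practice a streaming system would also cap the size of the memory bank (keeping only the most recent summaries) to keep latency and footprint fixed; the sketch above lets the bank grow for simplicity.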

Authors (7)
  1. Ching-Feng Yeh (22 papers)
  2. Yongqiang Wang (92 papers)
  3. Yangyang Shi (53 papers)
  4. Chunyang Wu (24 papers)
  5. Frank Zhang (22 papers)
  6. Julian Chan (11 papers)
  7. Michael L. Seltzer (34 papers)
Citations (8)