Mask-CTC-based Encoder Pre-training for Streaming End-to-End Speech Recognition (2309.04654v1)

Published 9 Sep 2023 in cs.SD and eess.AS

Abstract: Achieving high accuracy with low latency has always been a challenge in streaming end-to-end automatic speech recognition (ASR) systems. By attending to more future context, a streaming ASR model achieves higher accuracy but incurs larger latency, which hurts streaming performance. In the Mask-CTC framework, an encoder network is trained to learn feature representations that anticipate long-term context, which is desirable for streaming ASR. Mask-CTC-based encoder pre-training has been shown to be beneficial for achieving low latency and high accuracy in triggered attention-based ASR. However, the effectiveness of this method has not been demonstrated for various model architectures, nor has it been verified that the encoder has the expected look-ahead capability to reduce latency. This study therefore examines the effectiveness of Mask-CTC-based pre-training for models with different architectures, such as the Transformer-Transducer and contextual block streaming ASR. We also discuss the effect of the proposed pre-training method on obtaining accurate output spike timing.
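The page gives no implementation details, so the following is a minimal sketch of the Mask-CTC training objective the abstract describes: a shared encoder is trained with a CTC loss while a conditional masked language model (CMLM) decoder learns to recover randomly masked ground-truth tokens from the encoder output, which pushes the encoder to anticipate long-term context. All module choices, dimensions, loss weighting, and the masking ratio below are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Hypothetical hyperparameters for the sketch.
VOCAB = 100        # vocabulary size, including blank and <mask>
BLANK = 0          # CTC blank id
MASK = VOCAB - 1   # <mask> token id
D_MODEL = 256

# Encoder and CTC head (a stand-in for the paper's streaming encoder).
encoder = nn.TransformerEncoder(
    nn.TransformerEncoderLayer(d_model=D_MODEL, nhead=4, batch_first=True),
    num_layers=4,
)
ctc_head = nn.Linear(D_MODEL, VOCAB)

# CMLM decoder: embeds a partially masked token sequence and attends to
# the encoder output to predict the masked positions.
token_emb = nn.Embedding(VOCAB, D_MODEL)
cmlm_decoder = nn.TransformerDecoder(
    nn.TransformerDecoderLayer(d_model=D_MODEL, nhead=4, batch_first=True),
    num_layers=2,
)
mlm_head = nn.Linear(D_MODEL, VOCAB)

def mask_ctc_loss(feats, feat_lens, tokens, token_lens, mask_ratio=0.3):
    """Joint CTC + masked-token-prediction loss in the Mask-CTC style."""
    enc = encoder(feats)                                  # (B, T, D)
    log_probs = F.log_softmax(ctc_head(enc), dim=-1)      # (B, T, V)
    ctc = F.ctc_loss(log_probs.transpose(0, 1), tokens,   # ctc_loss wants (T, B, V)
                     feat_lens, token_lens, blank=BLANK)

    # Randomly mask ground-truth tokens; the decoder must recover them
    # conditioned only on the encoder output and the unmasked tokens.
    masked = tokens.clone()
    mask = torch.rand(tokens.shape) < mask_ratio
    masked[mask] = MASK
    dec = cmlm_decoder(token_emb(masked), enc)            # (B, L, D)
    mlm = F.cross_entropy(mlm_head(dec)[mask], tokens[mask])

    # The two losses would be weighted in practice; the weight is a
    # hyperparameter not given here.
    return ctc + mlm

# Usage with a dummy batch (fixed-length token sequences for simplicity):
feats = torch.randn(2, 50, D_MODEL)
feat_lens = torch.tensor([50, 45])
tokens = torch.randint(1, VOCAB - 1, (2, 12))
token_lens = torch.tensor([12, 12])
loss = mask_ctc_loss(feats, feat_lens, tokens, token_lens)
```

Per the abstract, an encoder pre-trained this way is then used to initialize streaming architectures such as the Transformer-Transducer and contextual block streaming ASR; that transfer step is not shown in the sketch.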

Authors (5)
  1. Huaibo Zhao (3 papers)
  2. Yosuke Higuchi (23 papers)
  3. Yusuke Kida (10 papers)
  4. Tetsuji Ogawa (22 papers)
  5. Tetsunori Kobayashi (15 papers)
Citations (1)