
VAD-free Streaming Hybrid CTC/Attention ASR for Unsegmented Recording (2107.07509v1)

Published 15 Jul 2021 in eess.AS, cs.CL, and cs.SD

Abstract: In this work, we propose novel decoding algorithms to enable streaming automatic speech recognition (ASR) on unsegmented long-form recordings without voice activity detection (VAD), based on monotonic chunkwise attention (MoChA) with an auxiliary connectionist temporal classification (CTC) objective. We propose a block-synchronous beam search decoding to take advantage of efficient batched output-synchronous and low-latency input-synchronous searches. We also propose a VAD-free inference algorithm that leverages CTC probabilities to determine a suitable timing to reset the model states to tackle the vulnerability to long-form data. Experimental evaluations demonstrate that the block-synchronous decoding achieves comparable accuracy to the label-synchronous one. Moreover, the VAD-free inference can recognize long-form speech robustly for up to a few hours.
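The abstract's VAD-free inference idea can be pictured as monitoring frame-level CTC posteriors during streaming decoding and resetting the model states only after a sufficiently long stretch of blank-dominant frames. Below is a minimal sketch of that idea; all names and values (BLANK_ID, BLANK_THRESHOLD, RESET_RUN_FRAMES, should_reset) are illustrative assumptions, not the authors' actual implementation.

```python
# Hypothetical sketch of the VAD-free state-reset idea: inspect CTC posteriors
# from each streaming block, and when the blank label dominates for a long
# enough run (i.e., no speech is being emitted), treat that as a safe point to
# reset the encoder/decoder states so errors cannot accumulate over hours of audio.
import numpy as np

BLANK_ID = 0            # index of the CTC blank label (assumption)
BLANK_THRESHOLD = 0.99  # frame counts as "blank-dominant" above this probability
RESET_RUN_FRAMES = 40   # reset after this many consecutive blank-dominant frames

def should_reset(ctc_posteriors: np.ndarray, run_length: int) -> tuple[bool, int]:
    """Update the count of consecutive blank-dominant frames and decide on a reset.

    ctc_posteriors: (T, V) frame-level CTC probabilities for one block.
    run_length: blank-dominant frame count carried over from earlier blocks.
    Returns (reset_now, new_run_length).
    """
    for frame in ctc_posteriors:
        if frame[BLANK_ID] >= BLANK_THRESHOLD:
            run_length += 1
            if run_length >= RESET_RUN_FRAMES:
                return True, 0  # long non-speech stretch: safe point to reset
        else:
            run_length = 0      # speech resumed; keep the current states
    return False, run_length

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    run = 0
    for block_idx in range(10):
        # Stand-in for the encoder + CTC head output on one streaming block.
        posteriors = rng.dirichlet(np.ones(30), size=80)
        if block_idx > 5:  # pretend the tail of the recording is silence
            posteriors[:, BLANK_ID] = 0.995
            posteriors[:, 1:] *= 0.005 / posteriors[:, 1:].sum(axis=1, keepdims=True)
        reset, run = should_reset(posteriors, run)
        if reset:
            print(f"block {block_idx}: resetting model states")
```

In this reading, the CTC branch doubles as an implicit voice activity detector, which is why no external VAD module is needed at inference time.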

Authors (2)
  1. Hirofumi Inaguma (42 papers)
  2. Tatsuya Kawahara (61 papers)
Citations (2)
