Streaming Multi-Talker ASR with Token-Level Serialized Output Training (2202.00842v5)

Published 2 Feb 2022 in eess.AS, cs.CL, and cs.SD

Abstract: This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output branches, the t-SOT model has a single output branch that generates the recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token that indicates a change of "virtual" output channel is introduced to keep track of overlapping utterances. Compared to prior streaming multi-talker ASR models, the t-SOT model has the advantages of lower inference cost and a simpler model architecture. Moreover, in our experiments with the LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves state-of-the-art word error rates, improving on prior results by a significant margin. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door to deploying one model for both single- and multi-talker scenarios.
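
The serialization idea in the abstract can be illustrated with a small sketch. It is not the authors' implementation: it assumes exactly two virtual channels, assumes each token already comes with an emission time and a channel assignment, and writes the channel-change token as `<cc>`. The function names, the token spelling, and the timestamps are all illustrative choices that do not come from the abstract.

```python
from typing import List, Tuple

# Hypothetical spelling of the special channel-change token.
CC = "<cc>"

# One recognition token: (emission_time_in_seconds, virtual_channel, text).
Token = Tuple[float, int, str]


def serialize(tokens: List[Token]) -> List[str]:
    """Merge tokens from two virtual channels into one stream.

    Tokens are ordered by emission time; CC is inserted whenever the
    next token belongs to the other channel. Channel 0 is assumed to
    be the starting channel.
    """
    stream: List[str] = []
    current = 0
    for _, channel, text in sorted(tokens, key=lambda t: t[0]):
        if channel != current:
            stream.append(CC)
            current = channel
        stream.append(text)
    return stream


def deserialize(stream: List[str]) -> List[List[str]]:
    """Split a serialized stream back into its two virtual channels."""
    channels: List[List[str]] = [[], []]
    current = 0
    for text in stream:
        if text == CC:
            current = 1 - current  # toggle between channel 0 and 1
        else:
            channels[current].append(text)
    return channels


# Two overlapping utterances with made-up emission times.
tokens = [
    (0.5, 0, "hello"), (1.0, 0, "how"), (1.6, 0, "are"), (2.0, 0, "you"),
    (1.2, 1, "i"), (1.8, 1, "am"), (2.4, 1, "fine"),
]
stream = serialize(tokens)
channels = deserialize(stream)
print(" ".join(stream))
print(channels)
```

Because the single output stream carries both the words and the channel switches, splitting it back into per-channel transcripts is a linear pass over the tokens. For non-overlapping speech no `<cc>` is ever emitted, so the stream reduces to an ordinary single-talker transcript.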



Authors (10)

Naoyuki Kanda (61 papers)
Jian Wu (314 papers)
Yu Wu (196 papers)
Xiong Xiao (35 papers)
Zhong Meng (53 papers)
Xiaofei Wang (138 papers)
Yashesh Gaur (43 papers)
Zhuo Chen (319 papers)
Jinyu Li (164 papers)
Takuya Yoshioka (77 papers)

Citations (50)


