
t-SOT FNT: Streaming Multi-talker ASR with Text-only Domain Adaptation Capability (2309.08131v1)

Published 15 Sep 2023 in eess.AS and cs.SD

Abstract: Token-level serialized output training (t-SOT) was recently proposed to address the challenge of streaming multi-talker automatic speech recognition (ASR). T-SOT effectively handles overlapped speech by representing multi-talker transcriptions as a single token stream with $\langle \text{cc}\rangle$ (channel change) symbols interspersed. However, the use of a naive neural transducer architecture significantly constrained its applicability for text-only adaptation. To overcome this limitation, we propose a novel t-SOT model structure that incorporates the idea of factorized neural transducers (FNT). The proposed method separates a language model (LM) from the transducer's predictor and handles the unnatural token order resulting from the use of $\langle \text{cc}\rangle$ symbols in t-SOT. We achieve this by maintaining multiple hidden states and introducing special handling of the $\langle \text{cc}\rangle$ tokens within the LM. The proposed t-SOT FNT model achieves comparable performance to the original t-SOT model while retaining the ability to reduce word error rate (WER) on both single- and multi-talker datasets through text-only adaptation.
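The abstract's core mechanism, an LM that keeps one hidden state per virtual channel and switches the active state when a $\langle \text{cc}\rangle$ token arrives, so each speaker's history stays in natural word order, can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the class name `TwoChannelLM`, the token id `CC_TOKEN`, the use of an LSTM cell, and the two-channel limit are all assumptions, and the paper's actual factorized handling of the $\langle \text{cc}\rangle$ prediction is not reproduced here.

```python
# Minimal sketch (assumed, not the authors' code) of <cc>-aware LM state
# handling: keep one recurrent state per virtual channel and flip the
# active state whenever a <cc> (channel change) token is consumed.

import torch
import torch.nn as nn

CC_TOKEN = 1  # assumed vocabulary id of the <cc> symbol


class TwoChannelLM(nn.Module):
    """LSTM LM that maintains two hidden states, one per virtual channel."""

    def __init__(self, vocab_size: int, hidden: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.cell = nn.LSTMCell(hidden, hidden)
        self.proj = nn.Linear(hidden, vocab_size)
        self.hidden_size = hidden

    def init_states(self, batch: int):
        # One (h, c) pair per channel; channel 0 starts active.
        zeros = lambda: torch.zeros(batch, self.hidden_size)
        return [(zeros(), zeros()), (zeros(), zeros())], 0

    def step(self, token: torch.Tensor, states, active: int):
        """Consume one token; on <cc>, switch the active channel so later
        tokens update the other speaker's history (batch=1 for simplicity)."""
        if (token == CC_TOKEN).all():
            return states, 1 - active
        h, c = self.cell(self.embed(token), states[active])
        states[active] = (h, c)
        return states, active

    def logits(self, states, active: int):
        # Next-token distribution conditioned on the active channel's history.
        return self.proj(states[active][0])


if __name__ == "__main__":
    lm = TwoChannelLM(vocab_size=100)
    states, active = lm.init_states(batch=1)
    # Serialized stream: spk A tokens 5 6, <cc>, spk B token 7, <cc>, spk A 8
    for tok in [5, 6, CC_TOKEN, 7, CC_TOKEN, 8]:
        states, active = lm.step(torch.tensor([tok]), states, active)
    print(lm.logits(states, active).shape)  # torch.Size([1, 100])
```

Because the LM sees each channel's tokens as an ordinary monotonic word sequence, it can, in principle, be adapted on plain text without paired audio, which is the text-only adaptation capability the abstract claims.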

Authors (6)
  1. Jian Wu (314 papers)
  2. Naoyuki Kanda (61 papers)
  3. Takuya Yoshioka (77 papers)
  4. Rui Zhao (241 papers)
  5. Zhuo Chen (319 papers)
  6. Jinyu Li (164 papers)
Citations (5)