
Streaming Multi-Talker ASR with Token-Level Serialized Output Training (2202.00842v5)

Published 2 Feb 2022 in eess.AS, cs.CL, and cs.SD

Abstract: This paper proposes token-level serialized output training (t-SOT), a novel framework for streaming multi-talker automatic speech recognition (ASR). Unlike existing streaming multi-talker ASR models that use multiple output branches, the t-SOT model has only a single output branch that generates recognition tokens (e.g., words, subwords) of multiple speakers in chronological order based on their emission times. A special token indicating a change of "virtual" output channel is introduced to keep track of overlapping utterances. Compared to prior streaming multi-talker ASR models, the t-SOT model offers lower inference cost and a simpler model architecture. Moreover, in our experiments on the LibriSpeechMix and LibriCSS datasets, the t-SOT-based transformer transducer model achieves state-of-the-art word error rates, surpassing prior results by a significant margin. For non-overlapping speech, the t-SOT model is on par with a single-talker ASR model in terms of both accuracy and computational cost, opening the door to deploying one model for both single- and multi-talker scenarios.
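
To make the serialization idea concrete, here is a minimal sketch of how a t-SOT label sequence could be built from overlapping utterances: all tokens are sorted by emission time and a channel-change token is inserted whenever consecutive tokens come from different speakers (equivalent to a switch of virtual output channel when at most two utterances overlap). The `<cc>` token name follows the paper's notation; the `Token` container, field names, and timings are illustrative assumptions, not the authors' code.

```python
from dataclasses import dataclass

CC = "<cc>"  # special channel-change token, written <cc> in the paper

@dataclass
class Token:
    text: str      # recognition token (word or subword)
    time: float    # emission time in seconds (assumed known from alignment)
    speaker: int   # utterance/speaker id

def serialize_t_sot(tokens: list[Token]) -> list[str]:
    """Serialize multi-talker tokens into a single t-SOT label stream.

    Tokens from all speakers are ordered chronologically; <cc> is inserted
    whenever the stream switches to a different virtual output channel,
    i.e., consecutive tokens belong to different speakers.
    """
    ordered = sorted(tokens, key=lambda t: t.time)
    labels: list[str] = []
    prev_speaker = None
    for tok in ordered:
        if prev_speaker is not None and tok.speaker != prev_speaker:
            labels.append(CC)  # mark the change of virtual output channel
        labels.append(tok.text)
        prev_speaker = tok.speaker
    return labels

# Two partially overlapping utterances:
tokens = [
    Token("hello", 0.0, speaker=0), Token("world", 0.5, speaker=0),
    Token("good", 0.3, speaker=1), Token("morning", 0.8, speaker=1),
]
print(serialize_t_sot(tokens))
# ['hello', '<cc>', 'good', '<cc>', 'world', '<cc>', 'morning']
```

Because the result is a single token stream, a standard single-output streaming model (such as the transformer transducer used in the paper) can be trained on it unchanged; for non-overlapping speech the `<cc>` token simply never fires and the labels reduce to an ordinary single-talker transcription.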

Authors (10)
  1. Naoyuki Kanda (61 papers)
  2. Jian Wu (314 papers)
  3. Yu Wu (196 papers)
  4. Xiong Xiao (35 papers)
  5. Zhong Meng (53 papers)
  6. Xiaofei Wang (138 papers)
  7. Yashesh Gaur (43 papers)
  8. Zhuo Chen (319 papers)
  9. Jinyu Li (164 papers)
  10. Takuya Yoshioka (77 papers)
Citations (50)