
Streaming Speaker-Attributed ASR with Token-Level Speaker Embeddings (2203.16685v2)

Published 30 Mar 2022 in eess.AS, cs.CL, and cs.SD

Abstract: This paper presents a streaming speaker-attributed automatic speech recognition (SA-ASR) model that can recognize "who spoke what" with low latency even when multiple people are speaking simultaneously. Our model is based on token-level serialized output training (t-SOT) which was recently proposed to transcribe multi-talker speech in a streaming fashion. To further recognize speaker identities, we propose an encoder-decoder based speaker embedding extractor that can estimate a speaker representation for each recognized token not only from non-overlapping speech but also from overlapping speech. The proposed speaker embedding, named t-vector, is extracted synchronously with the t-SOT ASR model, enabling joint execution of speaker identification (SID) or speaker diarization (SD) with the multi-talker transcription with low latency. We evaluate the proposed model for a joint task of ASR and SID/SD by using LibriSpeechMix and LibriCSS corpora. The proposed model achieves substantially better accuracy than a prior streaming model and shows comparable or sometimes even superior results to the state-of-the-art offline SA-ASR model.
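The abstract's central mechanism, emitting a speaker embedding (t-vector) alongside each recognized token and matching it against enrolled speaker profiles for speaker identification, can be illustrated with a minimal sketch. This is not the paper's actual model: the embedding extractor is replaced by synthetic vectors, and all names (`profiles`, `identify_speaker`, the 128-dimensional embeddings) are hypothetical placeholders. Only the matching step (cosine similarity against enrolled profiles) reflects standard SID practice.

```python
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def identify_speaker(t_vector: np.ndarray, profiles: dict) -> str:
    """Attribute a token to the enrolled profile most similar to its t-vector."""
    return max(profiles, key=lambda name: cosine(t_vector, profiles[name]))

# Hypothetical enrolled speaker profiles (stand-ins for real d-vectors).
rng = np.random.default_rng(0)
profiles = {"spk_A": rng.normal(size=128), "spk_B": rng.normal(size=128)}

# Simulated streaming ASR output: each token arrives with its t-vector.
# Here t-vectors are noisy copies of a profile, standing in for the
# encoder-decoder extractor described in the paper.
stream = [
    ("hello", profiles["spk_A"] + 0.1 * rng.normal(size=128)),
    ("hi",    profiles["spk_B"] + 0.1 * rng.normal(size=128)),
    ("there", profiles["spk_A"] + 0.1 * rng.normal(size=128)),
]

# Token-level attribution runs synchronously with the token stream,
# which is what enables low-latency "who spoke what" output.
attributed = [(token, identify_speaker(vec, profiles)) for token, vec in stream]
```

Because attribution happens per token rather than per utterance, tokens from overlapping talkers can be interleaved in the serialized output yet still carry their own speaker labels, which is the property t-SOT plus t-vectors exploits.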

Authors (10)
  1. Naoyuki Kanda
  2. Jian Wu
  3. Yu Wu
  4. Xiong Xiao
  5. Zhong Meng
  6. Xiaofei Wang
  7. Yashesh Gaur
  8. Zhuo Chen
  9. Jinyu Li
  10. Takuya Yoshioka
Citations (24)
