DiariST: Streaming Speech Translation with Speaker Diarization (2309.08007v2)

Published 14 Sep 2023 in eess.AS, cs.CL, and cs.SD

Abstract: End-to-end speech translation (ST) for conversation recordings involves several under-explored challenges, such as speaker diarization (SD) without accurate word timestamps and handling of overlapping speech in a streaming fashion. In this work, we propose DiariST, the first streaming ST and SD solution. It is built upon a neural transducer-based streaming ST system and integrates token-level serialized output training and t-vector, which were originally developed for multi-talker speech recognition. Due to the absence of evaluation benchmarks in this area, we develop a new evaluation dataset, DiariST-AliMeeting, by translating the reference Chinese transcriptions of the AliMeeting corpus into English. We also propose new metrics, called speaker-agnostic BLEU and speaker-attributed BLEU, to measure ST quality while taking SD accuracy into account. Our system achieves strong ST and SD capability compared to offline systems based on Whisper, while performing streaming inference for overlapping speech. To facilitate research in this new direction, we release the evaluation data, the offline baseline systems, and the evaluation code.

Authors (8)
  1. Mu Yang (35 papers)
  2. Naoyuki Kanda (61 papers)
  3. Xiaofei Wang (138 papers)
  4. Junkun Chen (27 papers)
  5. Peidong Wang (33 papers)
  6. Jian Xue (30 papers)
  7. Jinyu Li (164 papers)
  8. Takuya Yoshioka (77 papers)
Citations (4)
