TS-SEP: Joint Diarization and Separation Conditioned on Estimated Speaker Embeddings (2303.03849v3)

Published 7 Mar 2023 in eess.AS and cs.SD

Abstract: Since diarization and source separation of meeting data are closely related tasks, we here propose an approach to perform the two objectives jointly. It builds upon the target-speaker voice activity detection (TS-VAD) diarization approach, which assumes that initial speaker embeddings are available. We replace the final combined speaker activity estimation network of TS-VAD with a network that produces speaker activity estimates at a time-frequency resolution. Those act as masks for source extraction, either via masking or via beamforming. The technique can be applied both for single-channel and multi-channel input and, in both cases, achieves a new state-of-the-art word error rate (WER) on the LibriCSS meeting data recognition task. We further compute speaker-aware and speaker-agnostic WERs to isolate the contribution of diarization errors to the overall WER performance.
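The masking route to source extraction mentioned in the abstract can be illustrated with a minimal sketch. It assumes a single-channel mixture STFT and one real-valued mask per speaker at time-frequency resolution. The function name `extract_sources_by_masking`, the array shapes, and the random toy data are illustrative assumptions, not the authors' implementation; the network that estimates the masks and the beamforming variant are not shown.

```python
import numpy as np


def extract_sources_by_masking(mixture_stft, masks):
    """Apply per-speaker time-frequency masks to a single-channel mixture STFT.

    mixture_stft: complex array, shape (frames, freq_bins)
    masks: real array with values in [0, 1], shape (speakers, frames, freq_bins)
    Returns a complex array of shape (speakers, frames, freq_bins).
    """
    mixture_stft = np.asarray(mixture_stft)
    masks = np.asarray(masks)
    if masks.shape[1:] != mixture_stft.shape:
        raise ValueError("mask and mixture time-frequency shapes must match")
    # Broadcast the mixture over the speaker axis and scale each bin by its mask.
    return masks * mixture_stft[np.newaxis, :, :]


# Toy data standing in for a real STFT and for network-estimated masks (assumption).
rng = np.random.default_rng(0)
mixture = rng.standard_normal((4, 5)) + 1j * rng.standard_normal((4, 5))
masks = rng.uniform(0.0, 1.0, size=(2, 4, 5))
estimates = extract_sources_by_masking(mixture, masks)
print(estimates.shape)  # (2, 4, 5)
```

In this sketch, a mask of all ones returns the mixture unchanged and a mask of all zeros silences the speaker, which matches the reading of the masks as time-frequency speaker activity estimates.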

Authors (5)
  1. Christoph Boeddeker (36 papers)
  2. Aswin Shanmugam Subramanian (20 papers)
  3. Gordon Wichern (51 papers)
  4. Reinhold Haeb-Umbach (60 papers)
  5. Jonathan Le Roux (82 papers)
Citations (20)
