
Target-speaker Voice Activity Detection with Improved I-Vector Estimation for Unknown Number of Speaker (2108.03342v1)

Published 7 Aug 2021 in eess.AS

Abstract: Target-speaker voice activity detection (TS-VAD) has recently shown promising results for speaker diarization on highly overlapped speech. However, the original model requires a fixed (and known) number of speakers, which limits its application to real conversations. In this paper, we extend TS-VAD to speaker diarization with unknown numbers of speakers. This is achieved in two steps: first, an initial diarization system is applied for speaker number estimation, followed by TS-VAD network output masking according to this estimate. We further investigate different diarization methods, including clustering-based and region proposal networks, for estimating the initial i-vectors. Since these systems have complementary strengths, we propose a fusion-based method to combine frame-level decisions from the systems for an improved initialization. We demonstrate through experiments on variants of the LibriCSS meeting corpus that our proposed approach can improve the DER by up to 50% relative across varying numbers of speakers. This improvement also results in better downstream ASR performance approaching that using oracle segments.
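
The output-masking step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation. It assumes a TS-VAD-style network that emits one speech probability per frame for a fixed number of output slots, that the speakers found by the initial diarization occupy the first slots, and that a simple threshold turns probabilities into decisions. The function name `mask_tsvad_outputs`, the slot ordering, and the threshold value are all hypothetical choices made for illustration.

```python
from typing import List


def mask_tsvad_outputs(
    probs: List[List[float]],
    num_speakers: int,
    threshold: float = 0.5,
) -> List[List[int]]:
    """Threshold per-frame speaker probabilities and silence unused slots.

    probs: one row per frame, one column per network output slot
           (the fixed maximum number of speakers the model supports).
    num_speakers: speaker count estimated by an initial diarization system.
    threshold: assumed decision threshold (hypothetical value).
    """
    if not probs:
        return []
    max_speakers = len(probs[0])
    if not 0 <= num_speakers <= max_speakers:
        raise ValueError("num_speakers must lie in [0, number of output slots]")

    decisions = []
    for frame in probs:
        # Keep decisions only for the slots assumed to hold real speakers.
        active = [1 if p >= threshold else 0 for p in frame[:num_speakers]]
        # Force every remaining slot to "inactive", whatever the network said.
        decisions.append(active + [0] * (max_speakers - num_speakers))
    return decisions


if __name__ == "__main__":
    # Two frames, four output slots, but only two speakers were estimated.
    frame_probs = [
        [0.9, 0.2, 0.8, 0.7],
        [0.6, 0.7, 0.1, 0.9],
    ]
    for row in mask_tsvad_outputs(frame_probs, num_speakers=2):
        print(row)
```

In this sketch, high probabilities in slots 3 and 4 are discarded because the initial system estimated only two speakers, which is why the quality of that estimate matters for the final diarization output.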

Authors (6)
  1. Maokui He (8 papers)
  2. Desh Raj (32 papers)
  3. Zili Huang (18 papers)
  4. Jun Du (130 papers)
  5. Zhuo Chen (319 papers)
  6. Shinji Watanabe (416 papers)
Citations (32)
