
Target-Speaker Voice Activity Detection via Sequence-to-Sequence Prediction (2210.16127v3)

Published 28 Oct 2022 in eess.AS and cs.SD

Abstract: Target-speaker voice activity detection is currently a promising approach for speaker diarization in complex acoustic environments. This paper presents a novel Sequence-to-Sequence Target-Speaker Voice Activity Detection (Seq2Seq-TSVAD) method that can efficiently model a large number of speakers jointly and predict high-resolution voice activities. Experimental results show that larger speaker capacity and higher output resolution can significantly reduce the diarization error rate (DER). Under the widely used evaluation metrics, the method achieves new state-of-the-art DERs of 4.55% on the VoxConverse test set and 10.77% on Track 1 of the DIHARD-III evaluation set.
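
For readers unfamiliar with the metric, DER is conventionally defined as the sum of missed speech, false-alarm speech, and speaker-confusion time, divided by the total reference speech time. The following is a minimal sketch of that standard definition, not the paper's scoring pipeline; the function name and the durations are hypothetical and chosen only for illustration.

```python
def diarization_error_rate(missed, false_alarm, confusion, total_speech):
    """Return DER as a fraction of total reference speech time.

    All arguments are durations in seconds. This follows the conventional
    DER definition; it is not taken from the paper's evaluation code.
    """
    if total_speech <= 0:
        raise ValueError("total_speech must be positive")
    return (missed + false_alarm + confusion) / total_speech


# Hypothetical durations, not results from the paper.
der = diarization_error_rate(
    missed=12.0, false_alarm=8.0, confusion=5.0, total_speech=500.0
)
print(f"DER = {der:.2%}")  # prints "DER = 5.00%"
```

Reported DER values also depend on scoring choices such as a forgiveness collar around segment boundaries and whether overlapped speech is scored, so figures are only comparable under the same protocol.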

Authors (5)
  1. Ming Cheng (69 papers)
  2. Weiqing Wang (54 papers)
  3. Yucong Zhang (6 papers)
  4. Xiaoyi Qin (27 papers)
  5. Ming Li (787 papers)
Citations (32)
