Speaker Embeddings With Weakly Supervised Voice Activity Detection For Efficient Speaker Diarization (2405.09142v1)

Published 15 May 2024 in eess.AS and cs.SD

Abstract: Current speaker diarization systems rely on an external voice activity detection (VAD) model that runs before speaker embeddings are extracted from the detected speech segments. In this paper, we establish that the attention system of a speaker embedding extractor acts as a weakly supervised internal VAD model and performs as well as or better than comparable supervised VAD systems. Consequently, speaker diarization can be performed efficiently by extracting the VAD logits and the corresponding speaker embedding simultaneously, removing both the need for an external VAD model and its computational overhead. We provide an extensive analysis of the behavior of the frame-level attention system in current speaker verification models and propose a novel speaker diarization pipeline that uses ECAPA2 speaker embeddings for both VAD and embedding extraction. The proposed strategy achieves state-of-the-art performance on the AMI, VoxConverse and DIHARD III diarization benchmarks.
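
The core idea in the abstract, reading frame-level attention logits as VAD scores while the same forward pass produces the speaker embedding, can be illustrated with a minimal sketch. This is not the authors' ECAPA2 implementation: it assumes a single scalar attention head over time (ECAPA-style models use channel-dependent attention), and the function name `attentive_pool_with_vad`, the weight vector `w`, the bias `b`, and the mean-logit threshold are all hypothetical choices made for illustration.

```python
import numpy as np


def attentive_pool_with_vad(frames, w, b=0.0, threshold=None):
    """Toy attentive statistics pooling that also exposes frame-level logits.

    frames:    (T, D) array of frame-level features (hypothetical input).
    w, b:      hypothetical attention parameters, one scalar logit per frame.
    threshold: cut-off on the logits; defaults to their mean (an assumption,
               not a value taken from the paper).
    """
    frames = np.asarray(frames, dtype=float)

    # One attention logit per frame; these double as the "VAD logits".
    logits = frames @ w + b

    # Softmax over time (shifted by the max for numerical stability).
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()

    # Attention-weighted mean and standard deviation, concatenated
    # into a fixed-size utterance-level embedding of size 2 * D.
    mean = weights @ frames
    var = weights @ (frames - mean) ** 2
    std = np.sqrt(np.clip(var, 1e-8, None))
    embedding = np.concatenate([mean, std])

    # Frames whose logit exceeds the threshold are treated as speech.
    if threshold is None:
        threshold = logits.mean()
    vad_mask = logits > threshold

    return embedding, logits, vad_mask


if __name__ == "__main__":
    # Five "silent" frames followed by five "speech" frames (toy data).
    frames = np.vstack([np.zeros((5, 4)), np.ones((5, 4))])
    emb, logits, mask = attentive_pool_with_vad(frames, w=np.ones(4))
    print(emb.shape, mask.astype(int))
```

Because the logits are already computed as part of pooling, exposing them adds no extra model evaluation, which is the efficiency argument the abstract makes against running a separate VAD model first.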

Authors
  1. Jenthe Thienpondt
  2. Kris Demuynck