
TOLD: A Novel Two-Stage Overlap-Aware Framework for Speaker Diarization (2303.05397v2)

Published 8 Mar 2023 in cs.SD, cs.AI, and eess.AS

Abstract: Recently, end-to-end neural diarization (EEND) has been introduced and has achieved promising results in speaker-overlapped scenarios. In EEND, speaker diarization is formulated as a multi-label prediction problem, where speaker activities are estimated independently and their dependencies are not well considered. To overcome these disadvantages, we employ power set encoding to reformulate speaker diarization as a single-label classification problem and propose the overlap-aware EEND (EEND-OLA) model, in which speaker overlaps and dependencies can be modeled explicitly. Inspired by the success of two-stage hybrid systems, we further propose a novel Two-stage OverLap-aware Diarization framework (TOLD), which adds a speaker overlap-aware post-processing (SOAP) model to iteratively refine the diarization results of EEND-OLA. Experimental results show that, compared with the original EEND, the proposed EEND-OLA achieves a 14.39% relative improvement in diarization error rate (DER), and SOAP provides a further 19.33% relative improvement. As a result, TOLD achieves a DER of 10.14% on the CALLHOME dataset, which is, to the best of our knowledge, a new state-of-the-art result on this benchmark.
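
The power set encoding mentioned in the abstract can be illustrated with a minimal sketch. It assumes a full power set over N speakers, with each frame's binary activity vector read as an integer; the function names `powerset_encode` and `powerset_decode` and this particular bit ordering are illustrative assumptions, not the paper's implementation, which may restrict or order the label set differently.

```python
import numpy as np

def powerset_encode(activity):
    """Map per-frame multi-label speaker activity to one class label per frame.

    activity: (T, N) binary array, activity[t, k] == 1 if speaker k talks in frame t.
    Returns a (T,) integer array with values in [0, 2**N).
    Assumption: speaker k contributes 2**k to the label (hypothetical ordering).
    """
    activity = np.asarray(activity, dtype=np.int64)
    n_speakers = activity.shape[1]
    weights = 1 << np.arange(n_speakers)  # [1, 2, 4, ...]
    return activity @ weights

def powerset_decode(labels, n_speakers):
    """Invert powerset_encode: recover the (T, N) binary activity matrix."""
    labels = np.asarray(labels, dtype=np.int64)
    return (labels[:, None] >> np.arange(n_speakers)) & 1

# Four frames, two speakers: silence, speaker 0, speaker 1, both overlapping.
frames = np.array([[0, 0], [1, 0], [0, 1], [1, 1]])
labels = powerset_encode(frames)
print(labels.tolist())  # [0, 1, 2, 3]
```

Under this encoding the overlapped frame gets its own class (label 3 here), so a classifier predicting one label per frame models the overlap as a single outcome instead of two independent per-speaker decisions.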
