
End-to-End Supervised Hierarchical Graph Clustering for Speaker Diarization (2401.12850v2)

Published 23 Jan 2024 in eess.AS, cs.AI, and cs.SD

Abstract: Speaker diarization, the task of segmenting an audio recording based on speaker identity, constitutes an important speech pre-processing step for several downstream applications. The conventional approach to diarization involves multiple steps of embedding extraction and clustering, which are often optimized in an isolated fashion. While end-to-end diarization systems attempt to learn a single model for the task, they are often cumbersome to train and require large supervised datasets. In this paper, we propose an end-to-end supervised hierarchical clustering algorithm based on graph neural networks (GNN), called End-to-end Supervised HierARchical Clustering (E-SHARC). The embedding extractor is initialized using a pre-trained x-vector model, while the GNN model is trained initially using the x-vector embeddings from the pre-trained model. Finally, the E-SHARC model uses the front-end mel-filterbank features as input and jointly optimizes the embedding extractor and the GNN clustering module, performing representation learning, metric learning, and clustering with end-to-end optimization. Further, with additional inputs from an external overlap detector, the E-SHARC approach is capable of predicting the speakers in the overlapping speech regions. The experimental evaluation on benchmark datasets like AMI, VoxConverse, and DISPLACE illustrates that the proposed E-SHARC framework provides competitive diarization results using graph-based clustering methods.
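The core idea of the abstract, clustering speaker embeddings over a neighbourhood graph and merging nodes level by level, can be illustrated with a minimal sketch. This is not the authors' E-SHARC implementation: where E-SHARC trains a GNN to predict edge merge scores, this toy uses raw cosine similarity on a k-nearest-neighbour graph, and all function names, thresholds, and the 2-D embeddings are illustrative assumptions.

```python
import numpy as np

def cosine_knn_graph(X, k=2):
    # Build a k-nearest-neighbour graph over L2-normalised embeddings.
    # Returns the full cosine-similarity matrix and each node's top-k neighbours.
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    S = Xn @ Xn.T
    np.fill_diagonal(S, -np.inf)  # exclude self-edges
    nbrs = np.argsort(-S, axis=1)[:, :k]
    return S, nbrs

def hierarchical_merge(X, threshold=0.8, k=2):
    # One level of graph-based hierarchical clustering: merge each node
    # with its nearest neighbour when the edge score clears a threshold.
    # E-SHARC instead predicts these edge scores with a trained GNN and
    # repeats the merge over multiple levels of the hierarchy.
    labels = np.arange(len(X))
    S, nbrs = cosine_knn_graph(X, k)
    for i in range(len(X)):
        j = nbrs[i, 0]
        if S[i, j] >= threshold:
            old, new = labels[j], labels[i]
            labels[labels == old] = new  # union the two clusters
    # Relabel clusters to consecutive integer ids.
    _, labels = np.unique(labels, return_inverse=True)
    return labels

# Hypothetical 2-D "embeddings" from two well-separated speakers.
X = np.array([[1.0, 0.0], [0.99, 0.05], [0.0, 1.0], [0.05, 0.99]])
print(hierarchical_merge(X, threshold=0.9))
```

In the real system each merge level would re-embed the merged clusters and re-score edges with the GNN, and the whole stack (including the x-vector front end) is trained jointly; the sketch only shows the graph-merge step that the clustering module performs at inference.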
