Assessing the Robustness of Spectral Clustering for Deep Speaker Diarization (2403.14286v1)
Abstract: Clustering speaker embeddings is a crucial step in speaker diarization, yet it has received less attention than other components. Moreover, the robustness of speaker diarization has not been explored for the case where the development and evaluation data come from different domains. To bridge this gap, this study thoroughly examines spectral clustering for both same-domain and cross-domain speaker diarization. Our extensive experiments on two widely used corpora, AMI and DIHARD, reveal the performance trend of speaker diarization in the presence of domain mismatch. We observe that the performance difference between the two domain conditions can be attributed to the role of spectral clustering. In particular, keeping the other modules unchanged, we show that differences in the optimal tuning parameters as well as in speaker count estimation originate from the mismatch. This study opens several future directions for speaker diarization research.
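To make the two quantities named in the abstract concrete (a tuning parameter of spectral clustering and the estimated speaker count), the following is a minimal sketch, not the pipeline evaluated in the paper. It assumes cosine affinities, row-wise pruning controlled by a hypothetical `keep_fraction` parameter, an unnormalized graph Laplacian, and a maximum-eigengap rule for the speaker count. All function names and the synthetic embeddings are illustrative assumptions.

```python
import numpy as np


def cosine_affinity(emb):
    """Pairwise cosine similarity, with negative values clipped to zero."""
    unit = emb / np.linalg.norm(emb, axis=1, keepdims=True)
    return np.clip(unit @ unit.T, 0.0, None)


def prune_rows(aff, keep_fraction):
    """Keep the largest `keep_fraction` of entries in each row, then symmetrize.

    `keep_fraction` is a hypothetical stand-in for the kind of tuning
    parameter that is normally chosen on development data.
    """
    n = aff.shape[0]
    n_keep = max(1, int(round(keep_fraction * n)))
    pruned = np.zeros_like(aff)
    for i in range(n):
        top = np.argsort(aff[i])[-n_keep:]  # indices of the largest entries
        pruned[i, top] = aff[i, top]
    return 0.5 * (pruned + pruned.T)


def estimate_num_speakers(aff, max_speakers=10):
    """Estimate the speaker count from the largest Laplacian eigengap."""
    laplacian = np.diag(aff.sum(axis=1)) - aff  # unnormalized (assumed choice)
    eigvals = np.linalg.eigvalsh(laplacian)     # returned in ascending order
    k_max = min(max_speakers, len(eigvals) - 1)
    gaps = np.diff(eigvals[: k_max + 1])        # gaps[i] = eigvals[i+1] - eigvals[i]
    return int(np.argmax(gaps)) + 1


# Synthetic stand-in for segment embeddings: 3 speakers, 20 segments each.
rng = np.random.default_rng(0)
centers = 5.0 * np.eye(3, 16)
emb = np.repeat(centers, 20, axis=0) + 0.1 * rng.standard_normal((60, 16))

pruned = prune_rows(cosine_affinity(emb), keep_fraction=0.3)
estimated = estimate_num_speakers(pruned)
print("estimated speakers:", estimated)
```

In this sketch the speaker count is read off the pruned affinity matrix, so a `keep_fraction` tuned on one corpus directly shapes the estimate on another. That coupling is the kind of dependence between tuning parameters and speaker count estimation that the abstract attributes to domain mismatch.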