Discriminative Training of VBx Diarization (2310.02732v1)
Abstract: Bayesian HMM clustering of x-vector sequences (VBx) has become a widely adopted diarization baseline model in publications and challenges. It uses an HMM to model speaker turns, a generatively trained probabilistic linear discriminant analysis (PLDA) model for speaker distribution modeling, and Bayesian inference to estimate the assignment of x-vectors to speakers. This paper presents a new framework for updating the VBx parameters using discriminative training, which directly optimizes a predefined loss. We also propose a new loss that correlates better with the diarization error rate than binary cross-entropy, the default choice for end-to-end diarization systems. Proof-of-concept results across three datasets (AMI, CALLHOME, and DIHARD II) demonstrate that the method can find hyperparameters automatically, achieving performance comparable to that of hyperparameters found by extensive grid search, which typically requires additional knowledge of how the hyperparameters behave. Moreover, we show that discriminative fine-tuning of PLDA can further improve the model's performance. We release the source code with this publication.
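For readers unfamiliar with the binary cross-entropy objective the abstract uses as its point of comparison, the following is a minimal sketch of a permutation-invariant frame-level BCE, as commonly used in end-to-end diarization. It is not the loss proposed in the paper and not taken from the released source code; the function names (`frame_bce`, `permutation_invariant_bce`) and the toy data are assumptions made purely for illustration.

```python
import itertools
import math


def frame_bce(p, y, eps=1e-8):
    """Binary cross-entropy for one predicted probability p and one 0/1 label y."""
    p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def permutation_invariant_bce(posteriors, labels):
    """Return (lowest mean BCE, best permutation) over all speaker orderings.

    posteriors: list of T frames, each a list of S speaker-activity probabilities.
    labels:     list of T frames, each a list of S reference 0/1 activities.
    Output column s is compared against reference column perm[s].
    """
    n_frames = len(labels)
    n_speakers = len(labels[0])
    best_loss, best_perm = math.inf, None
    # Exhaustive search is only practical for a small number of speakers.
    for perm in itertools.permutations(range(n_speakers)):
        total = 0.0
        for t in range(n_frames):
            for s in range(n_speakers):
                total += frame_bce(posteriors[t][s], labels[t][perm[s]])
        loss = total / (n_frames * n_speakers)
        if loss < best_loss:
            best_loss, best_perm = loss, perm
    return best_loss, best_perm


# Toy example: the reference labels use the opposite speaker order.
posteriors = [[0.9, 0.1], [0.8, 0.2], [0.1, 0.9]]
labels = [[0, 1], [0, 1], [1, 0]]
loss, perm = permutation_invariant_bce(posteriors, labels)
print(perm, round(loss, 4))
```

Because speaker labels in diarization are arbitrary, the loss is minimized over label permutations. BCE scores every frame and speaker independently on a log scale, whereas the diarization error rate counts durations of missed, false-alarm, and confused speech, which is one reason the two need not move together.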