Joint Training of Speaker Embedding Extractor, Speech and Overlap Detection for Diarization (2411.02165v1)
Abstract: Despite the current popularity of end-to-end diarization systems, modular systems composed of voice activity detection (VAD), speaker embedding extraction plus clustering, and overlapped speech detection (OSD) plus handling still attain competitive performance in many conditions. However, one of the main drawbacks of modular systems is the need to run (and train) each module independently. In this work, we propose an approach to jointly train a single model that produces speaker embeddings, VAD, and OSD simultaneously, reaching competitive performance at a fraction of the inference time of the standard approach. Furthermore, joint inference simplifies the overall pipeline, bringing us one step closer to a unified clustering-based method that can be trained end-to-end towards a diarization-specific objective.
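The abstract does not spell out the architecture, but the core idea, one shared backbone producing speaker embeddings, VAD, and OSD in a single forward pass, can be sketched. The PyTorch snippet below is a minimal illustration under assumed choices (layer types and sizes, mean+std statistics pooling, a plain cross-entropy standing in for the margin-based speaker loss commonly used when training embedding extractors); it is a sketch of the general technique, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDiarizationModel(nn.Module):
    """Shared frame-level encoder with three heads: per-frame VAD logits,
    per-frame OSD logits, and an utterance-level speaker embedding.
    All layer choices and sizes are illustrative placeholders."""

    def __init__(self, n_feats: int = 80, hidden: int = 256,
                 emb_dim: int = 192, n_speakers: int = 1000):
        super().__init__()
        # Stand-in for a speaker-embedding backbone (real systems use
        # ResNet/TDNN encoders); two 1-D convolutions keep the sketch small.
        self.encoder = nn.Sequential(
            nn.Conv1d(n_feats, hidden, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Frame-level heads: one logit per frame for speech / overlap.
        self.vad_head = nn.Conv1d(hidden, 1, kernel_size=1)
        self.osd_head = nn.Conv1d(hidden, 1, kernel_size=1)
        # Utterance-level embedding via mean+std statistics pooling,
        # followed by a speaker classifier used only during training.
        self.emb_head = nn.Linear(2 * hidden, emb_dim)
        self.spk_classifier = nn.Linear(emb_dim, n_speakers)

    def forward(self, feats: torch.Tensor):
        # feats: (batch, n_feats, frames)
        h = self.encoder(feats)                    # (batch, hidden, frames)
        vad = self.vad_head(h).squeeze(1)          # (batch, frames)
        osd = self.osd_head(h).squeeze(1)          # (batch, frames)
        stats = torch.cat([h.mean(dim=2), h.std(dim=2)], dim=1)
        emb = self.emb_head(stats)                 # (batch, emb_dim)
        return emb, self.spk_classifier(emb), vad, osd

def joint_loss(spk_logits, vad, osd, spk_labels, vad_labels, osd_labels,
               w_vad: float = 1.0, w_osd: float = 1.0):
    """Weighted sum of the three objectives; the weights are assumptions."""
    return (F.cross_entropy(spk_logits, spk_labels)
            + w_vad * F.binary_cross_entropy_with_logits(vad, vad_labels)
            + w_osd * F.binary_cross_entropy_with_logits(osd, osd_labels))

# Example joint forward/backward pass on random data.
model = JointDiarizationModel()
feats = torch.randn(4, 80, 200)                    # 4 utterances, 200 frames
emb, spk_logits, vad, osd = model(feats)
loss = joint_loss(spk_logits, vad, osd,
                  spk_labels=torch.randint(0, 1000, (4,)),
                  vad_labels=torch.randint(0, 2, (4, 200)).float(),
                  osd_labels=torch.randint(0, 2, (4, 200)).float())
loss.backward()
```

Because the detection heads reuse the encoder's frame-level features, a single forward pass yields the embeddings, speech regions, and overlap regions that a clustering-based pipeline needs, which is where the claimed inference-time savings come from.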