
Online speaker diarization of meetings guided by speech separation (2402.00067v1)

Published 30 Jan 2024 in eess.AS, cs.LG, cs.SD, and eess.SP

Abstract: Overlapped speech is notoriously problematic for speaker diarization systems. Consequently, the use of speech separation has recently been proposed to improve their performance. Although promising, speech separation models struggle with realistic data because they are trained on simulated mixtures with a fixed number of speakers. In this work, we introduce a new speech separation-guided diarization scheme suitable for the online speaker diarization of long meeting recordings with a variable number of speakers, as found in the AMI corpus. We consider ConvTasNet and DPRNN as alternatives for the separation network, each with two or three output sources. To obtain the speaker diarization result, voice activity detection is applied to each estimated source. The final model is fine-tuned end-to-end, after first adapting the separation to real data using AMI. The system operates on short segments, and inference is performed by stitching the local predictions using speaker embeddings and incremental clustering. The results show that our system improves the state of the art on the AMI headset mix, using no oracle information and under full evaluation (no collar and including overlapped speech). Finally, we demonstrate the strength of our system on overlapped speech sections in particular.
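The stitching step described above (mapping per-segment source embeddings to global speaker labels via incremental clustering) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the paper's exact algorithm: the `IncrementalClusterer` class, the cosine-similarity threshold, and the running-mean centroid update are all illustrative choices.

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

class IncrementalClusterer:
    """Greedy online clustering: match each new embedding to the most
    similar centroid, or open a new speaker when similarity is too low."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold
        self.centroids = []  # running-mean embedding per speaker
        self.counts = []     # number of embeddings merged into each centroid

    def assign(self, emb):
        emb = np.asarray(emb, dtype=float)
        if self.centroids:
            sims = [cosine(emb, c) for c in self.centroids]
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                # update the matched centroid's running mean
                n = self.counts[best]
                self.centroids[best] = (self.centroids[best] * n + emb) / (n + 1)
                self.counts[best] += 1
                return best
        # no sufficiently similar speaker seen so far: open a new one
        self.centroids.append(emb)
        self.counts.append(1)
        return len(self.centroids) - 1

# Toy usage: embeddings of the active sources in three consecutive segments.
clusterer = IncrementalClusterer(threshold=0.5)
segments = [
    [np.array([1.0, 0.0]), np.array([0.0, 1.0])],  # two active sources
    [np.array([0.9, 0.1])],                        # same speaker as source 1
    [np.array([0.1, 0.9])],                        # same speaker as source 2
]
labels = [[clusterer.assign(e) for e in seg] for seg in segments]
# → [[0, 1], [0], [1]]
```

In a real system the toy vectors would be replaced by neural speaker embeddings (e.g. x-vectors) extracted from each separated source, and the threshold would be tuned on development data.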

Authors (3)
  1. Elio Gruttadauria (1 paper)
  2. Mathieu Fontaine (15 papers)
  3. Slim Essid (37 papers)
Citations (2)
