USED: Universal Speaker Extraction and Diarization (2309.10674v2)
Abstract: Speaker extraction and diarization are two enabling techniques for real-world speech applications. Speaker extraction aims to extract a target speaker's voice from a speech mixture, while speaker diarization demarcates speech segments by speaker, annotating 'who spoke when'. Previous studies have typically treated the two tasks independently. In practical applications, however, it is more meaningful to know 'who spoke what and when', which is jointly captured by the two tasks. The two tasks share a similar objective of disentangling speakers: speaker extraction operates in the frequency domain, whereas diarization operates in the temporal domain. It is therefore reasonable to expect that speaker activities obtained from speaker diarization can benefit speaker extraction, while the extracted speech enables more accurate speaker activity detection than the speech mixture. In this paper, we propose a unified model called Universal Speaker Extraction and Diarization (USED) to address output inconsistency and scenario mismatch issues. It is designed to handle speech mixtures with varying overlap ratios and a variable number of speakers. We show that the USED model significantly outperforms competitive baselines for speaker extraction and diarization on the LibriMix and SparseLibriMix datasets. We further validate its diarization performance on CALLHOME, a dataset based on real recordings, and the results indicate that our model surpasses recently proposed approaches.
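To make the joint formulation concrete, below is a minimal PyTorch sketch of a model that takes a speech mixture and a target-speaker reference utterance and jointly predicts the extracted waveform and the target speaker's frame-level activity. This is an illustrative assumption, not the USED architecture from the paper: the module choices (conv encoder, GRU separator, sigmoid mask and VAD heads) and all dimensions are placeholders chosen only to show how a single shared backbone can serve both tasks.

```python
# Hypothetical joint extraction-and-diarization sketch (not the authors' USED model).
import torch
import torch.nn as nn


class JointExtractDiarize(nn.Module):
    def __init__(self, n_filters=256, emb_dim=256, hidden=256):
        super().__init__()
        # 1-D conv encoder: waveform -> frame-level features (B, F, T)
        self.encoder = nn.Conv1d(1, n_filters, kernel_size=16, stride=8, padding=4)
        # Speaker encoder: summarizes the reference utterance into one embedding
        self.spk_encoder = nn.GRU(n_filters, emb_dim, batch_first=True)
        # Shared separator conditioned on the speaker embedding
        self.separator = nn.GRU(n_filters + emb_dim, hidden, num_layers=2, batch_first=True)
        # Extraction head: per-frame mask over encoder features
        self.mask_head = nn.Sequential(nn.Linear(hidden, n_filters), nn.Sigmoid())
        # Diarization head: per-frame speech activity of the target speaker
        self.vad_head = nn.Sequential(nn.Linear(hidden, 1), nn.Sigmoid())
        # Transposed conv decoder: masked features -> waveform
        self.decoder = nn.ConvTranspose1d(n_filters, 1, kernel_size=16, stride=8, padding=4)

    def forward(self, mixture, reference):
        # mixture, reference: (batch, samples)
        mix_feat = torch.relu(self.encoder(mixture.unsqueeze(1)))      # (B, F, T)
        ref_feat = torch.relu(self.encoder(reference.unsqueeze(1)))    # (B, F, T_ref)
        _, spk_emb = self.spk_encoder(ref_feat.transpose(1, 2))        # (1, B, E)
        spk_emb = spk_emb[-1].unsqueeze(1).expand(-1, mix_feat.size(-1), -1)
        joint = torch.cat([mix_feat.transpose(1, 2), spk_emb], dim=-1)  # (B, T, F+E)
        h, _ = self.separator(joint)                                    # (B, T, H)
        mask = self.mask_head(h).transpose(1, 2)                        # (B, F, T)
        activity = self.vad_head(h).squeeze(-1)                         # (B, T) in [0, 1]
        extracted = self.decoder(mix_feat * mask).squeeze(1)            # (B, ~samples)
        return extracted, activity


model = JointExtractDiarize()
mix, ref = torch.randn(2, 16000), torch.randn(2, 16000)   # 1 s of 16 kHz audio
wav, act = model(mix, ref)  # extracted waveform and frame-level target activity
```

In such a sketch, the two heads share the same separator states, so frame-level activity predictions and the extraction mask can inform each other during joint training, which is the intuition behind coupling the two tasks; how USED actually couples them is detailed in the paper itself.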