Probing Self-supervised Learning Models with Target Speech Extraction (2402.13200v1)

Published 17 Feb 2024 in eess.AS and cs.SD

Abstract: Large-scale pre-trained self-supervised learning (SSL) models have shown remarkable advancements in speech-related tasks. However, the utilization of these models in complex multi-talker scenarios, such as extracting a target speaker from a mixture, has yet to be fully evaluated. In this paper, we introduce target speech extraction (TSE) as a novel downstream task for evaluating the feature extraction capabilities of pre-trained SSL models. TSE uniquely requires both speaker identification and speech separation, distinguishing it from other tasks in the Speech processing Universal PERformance Benchmark (SUPERB) evaluation. Specifically, we propose a TSE downstream model composed of two lightweight task-oriented modules based on the same frozen SSL model. One module functions as a speaker encoder to obtain target speaker information from an enrollment speech, while the other estimates the target speaker's mask to extract their speech from the mixture. Experimental results on the Libri2mix dataset reveal the relevance of the TSE downstream task for probing SSL models, as its performance cannot be simply deduced from related tasks such as speaker verification and separation.
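
To make the two-module design concrete, below is a minimal PyTorch sketch of the general idea: a frozen SSL feature extractor feeds both a lightweight speaker encoder (enrollment features to speaker embedding) and a lightweight mask estimator (mixture features plus speaker embedding to a mask over the features). The DummySSL stand-in, all module names, layer choices, and dimensions here are illustrative assumptions, not the authors' implementation; the paper builds on actual frozen checkpoints such as WavLM or HuBERT.

```python
# Hypothetical sketch of a two-module TSE probe on a frozen SSL model.
# Everything below (names, sizes, the dummy SSL stand-in) is an assumption.
import torch
import torch.nn as nn

FEAT_DIM = 768  # typical SSL feature size (e.g. WavLM Base); an assumption


class DummySSL(nn.Module):
    """Stand-in for a frozen pre-trained SSL model (HuBERT / wav2vec 2.0 / WavLM)."""
    def __init__(self, feat_dim=FEAT_DIM):
        super().__init__()
        # 20 ms hop over 16 kHz audio, roughly matching common SSL strides
        self.conv = nn.Conv1d(1, feat_dim, kernel_size=400, stride=320)

    def forward(self, wav):                                   # wav: (B, T)
        return self.conv(wav.unsqueeze(1)).transpose(1, 2)    # (B, F, D)


class SpeakerEncoder(nn.Module):
    """Lightweight module: enrollment SSL features -> speaker embedding."""
    def __init__(self, feat_dim=FEAT_DIM, emb_dim=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim, emb_dim)

    def forward(self, feats):                                 # (B, F, D)
        return self.proj(feats.mean(dim=1))                   # mean pooling -> (B, E)


class MaskEstimator(nn.Module):
    """Lightweight module: mixture SSL features + speaker embedding -> mask."""
    def __init__(self, feat_dim=FEAT_DIM, emb_dim=256, hidden=256):
        super().__init__()
        self.rnn = nn.LSTM(feat_dim + emb_dim, hidden, batch_first=True)
        self.out = nn.Sequential(nn.Linear(hidden, feat_dim), nn.Sigmoid())

    def forward(self, feats, spk_emb):                        # (B, F, D), (B, E)
        e = spk_emb.unsqueeze(1).expand(-1, feats.size(1), -1)
        h, _ = self.rnn(torch.cat([feats, e], dim=-1))
        return self.out(h)                                    # mask in [0, 1], (B, F, D)


ssl = DummySSL().eval()
for p in ssl.parameters():        # the SSL model stays frozen; only the
    p.requires_grad_(False)       # two lightweight modules would be trained

spk_enc, mask_est = SpeakerEncoder(), MaskEstimator()
mixture, enrollment = torch.randn(2, 32000), torch.randn(2, 48000)

with torch.no_grad():
    mix_feats, enr_feats = ssl(mixture), ssl(enrollment)
spk_emb = spk_enc(enr_feats)
masked = mask_est(mix_feats, spk_emb) * mix_feats  # target-speaker features
print(masked.shape)                                # torch.Size([2, 99, 768])
```

Because only the two small modules receive gradients, performance differences across SSL checkpoints can be attributed to the quality of the frozen representations rather than to task-specific fine-tuning, which is what makes TSE usable as a probing task.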

Authors (7)
  1. Junyi Peng (15 papers)
  2. Marc Delcroix (94 papers)
  3. Tsubasa Ochiai (43 papers)
  4. Takanori Ashihara (28 papers)
  5. Shoko Araki (41 papers)
  6. Oldřich Plchot (80 papers)
  7. Jan Černocký (80 papers)
Citations (1)
