Target Speech Extraction with Pre-trained Self-supervised Learning Models (2402.13199v1)

Published 17 Feb 2024 in eess.AS and cs.SD

Abstract: Pre-trained self-supervised learning (SSL) models have achieved remarkable success in various speech tasks. However, their potential in target speech extraction (TSE) has not been fully exploited. TSE aims to extract the speech of a target speaker from a mixture, guided by enrollment utterances. We exploit pre-trained SSL models for two purposes within a TSE framework: to process the input mixture and to derive speaker embeddings from the enrollment. In this paper, we focus on how to effectively use SSL models for TSE. We first introduce a novel TSE downstream task following the SUPERB principles. This simple experiment shows the potential of SSL models for TSE, but extraction performance remains far behind the state of the art. We then extend a powerful TSE architecture by incorporating two SSL-based modules: an Adaptive Input Enhancer (AIE) and a speaker encoder. Specifically, the proposed AIE utilizes intermediate representations from the CNN encoder by adjusting the time resolution of the CNN encoder and transformer blocks through progressive upsampling, capturing both fine-grained and hierarchical features. Our method outperforms current TSE systems, achieving an SI-SDR improvement of 14.0 dB on LibriMix. Moreover, we can further improve performance by 0.7 dB by fine-tuning the whole model, including the SSL model parameters.
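The core idea behind the AIE is to fuse SSL features that live at different time resolutions (coarse transformer outputs, finer CNN-encoder intermediates) by progressively upsampling the coarser stream to each finer resolution before combining. The paper's actual module involves learned projections and the full model; the sketch below is only a minimal NumPy illustration of the progressive-upsampling fusion idea, using nearest-neighbour upsampling and simple addition, with the function names (`upsample_time`, `adaptive_input_enhancer`) being hypothetical.

```python
import numpy as np

def upsample_time(x: np.ndarray, factor: int) -> np.ndarray:
    """Nearest-neighbour upsampling along the time axis (axis 0)."""
    return np.repeat(x, factor, axis=0)

def adaptive_input_enhancer(features, up_factors):
    """Progressively fuse multi-resolution features, coarse to fine.

    features:   list of (T_i, D) arrays; features[0] is the coarsest
                (e.g. a transformer-block output), features[-1] the
                finest (an early CNN-encoder intermediate).
    up_factors: up_factors[i] brings the running fusion from the time
                resolution of features[i] to that of features[i+1].
    """
    fused = features[0]
    for feat, factor in zip(features[1:], up_factors):
        up = upsample_time(fused, factor)[: feat.shape[0]]  # match length
        fused = up + feat  # real model would use a learned projection here
    return fused

# Toy example: three streams at 2, 4, and 8 frames, feature dim 1.
coarse, mid, fine = np.ones((2, 1)), np.ones((4, 1)), np.ones((8, 1))
out = adaptive_input_enhancer([coarse, mid, fine], up_factors=[2, 2])
# out has the finest time resolution: shape (8, 1)
```

A learned 1-D transposed convolution (as in U-Net-style decoders) would replace the nearest-neighbour repeat in a trainable version; the shape bookkeeping is the same.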

Authors (6)
  1. Junyi Peng
  2. Marc Delcroix
  3. Tsubasa Ochiai
  4. Shoko Araki
  5. Oldřich Plchot
  6. Jan Černocký
Citations (7)