
A lightweight dual-stage framework for personalized speech enhancement based on DeepFilterNet2 (2404.08022v1)

Published 11 Apr 2024 in cs.SD and eess.AS

Abstract: Isolating the desired speaker's voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker's voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures, unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage Speech Enhancement (SE) model and implement it within DeepFilterNet2, an SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for the integration of the speaker embeddings within the dual-stage enhancement architecture. We also investigate a tailored training strategy when adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performance of DeepFilterNet2 while incurring minimal computational overhead.
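
The abstract describes conditioning a dual-stage enhancement model on a speaker embedding, with the integration point within the architecture treated as a design choice. As a rough illustration of what such conditioning can look like (not the paper's actual integration scheme), the PyTorch sketch below injects an embedding into one stage via feature-wise modulation; the FiLM mechanism, the layer sizes, and the 192-dimensional embedding are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeakerConditionedStage(nn.Module):
    """Minimal sketch of injecting a speaker embedding into one
    enhancement stage via feature-wise linear modulation (FiLM).
    The FiLM choice, the GRU stage, and all dimensions are
    illustrative assumptions, not the paper's exact scheme."""

    def __init__(self, feat_dim: int = 64, spk_dim: int = 192):
        super().__init__()
        # Project the speaker embedding to a per-feature scale and shift.
        self.film = nn.Linear(spk_dim, 2 * feat_dim)
        self.gru = nn.GRU(feat_dim, feat_dim, batch_first=True)

    def forward(self, feats: torch.Tensor, spk_emb: torch.Tensor) -> torch.Tensor:
        # feats: (batch, time, feat_dim); spk_emb: (batch, spk_dim)
        scale, shift = self.film(spk_emb).chunk(2, dim=-1)
        # Broadcast the (batch, feat_dim) modulation over the time axis.
        conditioned = feats * (1 + scale.unsqueeze(1)) + shift.unsqueeze(1)
        out, _ = self.gru(conditioned)
        return out

# Usage with dummy tensors; 192 dims matches common ECAPA-TDNN outputs.
x = torch.randn(2, 100, 64)        # spectral features (batch, time, feat)
e = torch.randn(2, 192)            # enrollment speaker embedding
stage = SpeakerConditionedStage()
print(stage(x, e).shape)           # torch.Size([2, 100, 64])
```

In a dual-stage model, a block like this could in principle be placed before, between, or after the two stages, which is the kind of placement question the paper explores empirically.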

