A lightweight dual-stage framework for personalized speech enhancement based on DeepFilterNet2 (2404.08022v1)
Abstract: Isolating the desired speaker's voice amidst multiple speakers in a noisy acoustic context is a challenging task. Personalized speech enhancement (PSE) endeavours to achieve this by leveraging prior knowledge of the speaker's voice. Recent research efforts have yielded promising PSE models, albeit often accompanied by computationally intensive architectures, unsuitable for resource-constrained embedded devices. In this paper, we introduce a novel method to personalize a lightweight dual-stage Speech Enhancement (SE) model and implement it within DeepFilterNet2, an SE model renowned for its state-of-the-art performance. We seek an optimal integration of speaker information within the model, exploring different positions for the integration of the speaker embeddings within the dual-stage enhancement architecture. We also investigate a tailored training strategy when adapting DeepFilterNet2 to a PSE task. We show that our personalization method greatly improves the performance of DeepFilterNet2 while adding only minimal computational overhead.
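The abstract describes injecting speaker embeddings at chosen positions inside a dual-stage enhancement network. The paper does not specify the fusion mechanism here, so the following is only a minimal illustrative sketch of one common approach (additive FiLM-style conditioning): an utterance-level speaker embedding is projected to the feature dimension and added to every time frame of an intermediate representation. All names, shapes, and the projection itself are hypothetical, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def condition_on_speaker(frame_feats, spk_emb, W, b):
    """Additive conditioning sketch (illustrative, not the paper's method):
    project the utterance-level speaker embedding to the feature dimension,
    then add the resulting bias to every time frame."""
    bias = spk_emb @ W + b           # shape: (feat_dim,)
    return frame_feats + bias        # broadcasts over the time axis

# Hypothetical dimensions: T frames, F-dim features, E-dim speaker embedding
T, F, E = 100, 64, 192
frames = rng.standard_normal((T, F))   # intermediate network features
emb = rng.standard_normal(E)           # e.g. an ECAPA-TDNN-style embedding
W = rng.standard_normal((E, F)) * 0.01 # learned projection (random here)
b = np.zeros(F)

out = condition_on_speaker(frames, emb, W, b)
print(out.shape)  # (100, 64)
```

In a real model the projection would be a trained layer, and the paper's contribution concerns *where* in the dual-stage architecture such conditioning is applied; this sketch only illustrates the per-frame broadcast of a single utterance-level embedding.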