Speech enhancement deep-learning architecture for efficient edge processing (2405.16834v1)

Published 27 May 2024 in eess.AS

Abstract: Deep learning has become the de facto method of choice for speech enhancement, yielding significant improvements in speech quality. However, shrinking model size and computation for real-time processing on low-power edge devices drastically degrades speech quality. Recently, transformer-based architectures have greatly reduced memory requirements and offered ways to improve model performance through local and global contexts, but transformer operations remain computationally heavy. In this work, we introduce a WaveUNet squeeze-excitation Res2 (WSR)-based metric generative adversarial network (WSR-MGAN) architecture that can be implemented efficiently on low-power edge devices for noise suppression while maintaining speech quality. We extract multi-scale features using Res2Net blocks, which can be related to the spectral content used in speech-processing tasks. In the generator, we combine squeeze-excitation blocks (SEB) with these multi-scale features to maintain local and global contexts, together with gated recurrent units (GRUs). The proposed approach is optimized with a combined loss function computed over the raw waveform, multi-resolution magnitude spectrograms, and objective metrics via a metric discriminator. Experimental results in terms of various objective metrics on the VoiceBank+DEMAND and DNS-2020 challenge datasets demonstrate that the proposed speech enhancement (SE) approach outperforms the baselines and achieves state-of-the-art (SOTA) performance in the time domain.
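
The generator's building block, as described above, can be sketched compactly. Below is a minimal, hypothetical PyTorch implementation of a squeeze-excitation Res2Net-style 1-D convolution block for waveform features; the scale factor, kernel size, and reduction ratio are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

class SERes2Block1d(nn.Module):
    """Res2Net-style multi-scale 1-D conv block with squeeze-excitation.

    Hypothetical sketch: the scale factor and reduction ratio are
    illustrative defaults, not the paper's configuration.
    """
    def __init__(self, channels: int, scale: int = 4, reduction: int = 8):
        super().__init__()
        assert channels % scale == 0, "channels must be divisible by scale"
        self.scale = scale
        width = channels // scale
        # One small conv per branch (the first branch passes through, as in
        # Res2Net); chaining branch outputs grows the receptive field across
        # branches, giving the multi-scale view of the signal.
        self.convs = nn.ModuleList([
            nn.Conv1d(width, width, kernel_size=3, padding=1)
            for _ in range(scale - 1)
        ])
        # Squeeze-excitation: global average pool -> bottleneck -> channel
        # gates, injecting global (utterance-level) context into local features.
        self.se = nn.Sequential(
            nn.AdaptiveAvgPool1d(1),
            nn.Conv1d(channels, channels // reduction, kernel_size=1),
            nn.ReLU(inplace=True),
            nn.Conv1d(channels // reduction, channels, kernel_size=1),
            nn.Sigmoid(),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        chunks = list(torch.chunk(x, self.scale, dim=1))
        outputs = [chunks[0]]          # identity branch
        prev = None
        for i, conv in enumerate(self.convs, start=1):
            inp = chunks[i] if prev is None else chunks[i] + prev
            prev = torch.relu(conv(inp))
            outputs.append(prev)
        y = torch.cat(outputs, dim=1)  # back to (batch, channels, time)
        y = y * self.se(y)             # channel-wise reweighting
        return x + y                   # residual connection
```

In the paper's generator, blocks like this would sit inside a WaveUNet-style encoder-decoder with GRUs handling temporal modeling; the sketch covers only the per-block computation.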

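The combined objective is the other key ingredient. The sketch below shows one plausible way to assemble it, assuming an L1 term on the raw waveform, an L1 term on multi-resolution STFT magnitudes, and a MetricGAN-style term in which a metric discriminator predicts a normalized objective score (e.g., scaled PESQ) that the generator pushes toward 1. The FFT sizes, loss weights, and discriminator interface are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def multi_res_mag_loss(est, ref, fft_sizes=(512, 1024, 2048)):
    """L1 distance between magnitude spectrograms at several resolutions.

    `est` and `ref` are (batch, time) waveforms; the FFT sizes are
    illustrative, not the paper's configuration.
    """
    loss = 0.0
    for n_fft in fft_sizes:
        window = torch.hann_window(n_fft, device=est.device)
        spec_est = torch.stft(est, n_fft, n_fft // 4, window=window,
                              return_complex=True).abs()
        spec_ref = torch.stft(ref, n_fft, n_fft // 4, window=window,
                              return_complex=True).abs()
        loss = loss + F.l1_loss(spec_est, spec_ref)
    return loss / len(fft_sizes)

def generator_loss(est, ref, metric_disc, w_wave=1.0, w_spec=1.0, w_metric=1.0):
    """Combined loss: waveform L1 + multi-resolution magnitude + metric term.

    `metric_disc` is assumed to map an (estimate, reference) pair to a
    normalized objective score in [0, 1]; the generator drives it toward 1.
    The weights are placeholders, not the paper's values.
    """
    l_wave = F.l1_loss(est, ref)                  # raw-waveform term
    l_spec = multi_res_mag_loss(est, ref)         # spectral term
    score = metric_disc(est, ref)                 # predicted metric score
    l_metric = F.mse_loss(score, torch.ones_like(score))
    return w_wave * l_wave + w_spec * l_spec + w_metric * l_metric
```

The metric term mirrors the MetricGAN family of approaches: since metrics such as PESQ are non-differentiable, a learned surrogate discriminator provides a gradient path that lets the generator optimize the evaluation metric directly.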