CMGAN: Conformer-based Metric GAN for Speech Enhancement (2203.15149v4)

Published 28 Mar 2022 in cs.SD, cs.AI, cs.LG, and eess.AS

Abstract: Recently, the convolution-augmented transformer (Conformer) has achieved promising performance in automatic speech recognition (ASR) and time-domain speech enhancement (SE), as it can capture both local and global dependencies in the speech signal. In this paper, we propose a conformer-based metric generative adversarial network (CMGAN) for SE in the time-frequency (TF) domain. In the generator, we utilize two-stage conformer blocks to aggregate all magnitude and complex spectrogram information by modeling both time and frequency dependencies. The estimation of the magnitude and complex spectrograms is decoupled in the decoder stage and then jointly incorporated to reconstruct the enhanced speech. In addition, a metric discriminator is employed to further improve the quality of the enhanced speech estimate by optimizing the generator with respect to a corresponding evaluation score. Quantitative analysis on the Voice Bank+DEMAND dataset indicates the capability of CMGAN to outperform various previous models by a clear margin, i.e., a PESQ of 3.41 and an SSNR of 11.10 dB.
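The "two-stage conformer blocks" mentioned in the abstract model dependencies along the time axis and along the frequency axis in two consecutive passes over the TF feature map. The sketch below illustrates only that time-then-frequency factorization, with plain multi-head self-attention standing in for full conformer blocks; the module name, tensor shapes, and hyperparameters are illustrative assumptions, not the paper's implementation.

```python
# Minimal sketch of two-stage (time-axis, then frequency-axis) self-attention over
# a (batch, time, freq, channels) TF feature map. CMGAN uses full conformer blocks
# (attention + convolution + feed-forward); plain attention is used here for brevity.
import torch
import torch.nn as nn


class TwoStageAttention(nn.Module):
    def __init__(self, dim: int, heads: int = 4):
        super().__init__()
        self.time_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.freq_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm_t = nn.LayerNorm(dim)
        self.norm_f = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, F, C) -- encoded magnitude + complex spectrogram features.
        b, t, f, c = x.shape

        # Stage 1: attend along time, treating each frequency bin as a sequence.
        xt = x.permute(0, 2, 1, 3).reshape(b * f, t, c)      # (B*F, T, C)
        q = self.norm_t(xt)
        xt = xt + self.time_attn(q, q, q, need_weights=False)[0]
        x = xt.reshape(b, f, t, c).permute(0, 2, 1, 3)       # back to (B, T, F, C)

        # Stage 2: attend along frequency, treating each frame as a sequence.
        xf = x.reshape(b * t, f, c)                          # (B*T, F, C)
        q = self.norm_f(xf)
        xf = xf + self.freq_attn(q, q, q, need_weights=False)[0]
        return xf.reshape(b, t, f, c)


if __name__ == "__main__":
    block = TwoStageAttention(dim=64)
    feats = torch.randn(2, 100, 201, 64)   # e.g. 100 frames, 201 frequency bins
    print(block(feats).shape)              # torch.Size([2, 100, 201, 64])
```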

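The metric discriminator is trained to regress a normalized evaluation score (e.g., PESQ mapped to [0, 1]) for a pair of magnitude spectrograms, giving the generator a differentiable surrogate of the otherwise non-differentiable metric. The following is a minimal LSGAN-style sketch of that objective, assuming a toy discriminator and placeholder shapes; it is not the paper's exact architecture or loss weighting.

```python
# Illustrative sketch: a discriminator that predicts a normalized quality score from
# (estimated, clean) magnitude spectrograms, plus the squared-error losses used to
# train it and the generator. Architecture and shapes are placeholders.
import torch
import torch.nn as nn
import torch.nn.functional as F


class MetricDiscriminator(nn.Module):
    def __init__(self, channels: int = 16):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(2, channels, kernel_size=3, stride=2, padding=1),
            nn.PReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, stride=2, padding=1),
            nn.PReLU(),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(channels, 1),
            nn.Sigmoid(),  # score in [0, 1], matching a normalized PESQ target
        )

    def forward(self, est_mag: torch.Tensor, clean_mag: torch.Tensor) -> torch.Tensor:
        # Stack the two magnitude spectrograms as input channels: (B, 2, T, F).
        return self.net(torch.stack([est_mag, clean_mag], dim=1))


def discriminator_loss(d, est_mag, clean_mag, pesq_norm):
    """Clean/clean pairs target a score of 1; enhanced/clean pairs target the
    externally computed normalized PESQ of the enhanced utterance (shape (B, 1))."""
    score_clean = d(clean_mag, clean_mag)
    score_est = d(est_mag.detach(), clean_mag)
    return (F.mse_loss(score_clean, torch.ones_like(score_clean))
            + F.mse_loss(score_est, pesq_norm))


def generator_metric_loss(d, est_mag, clean_mag):
    """Adversarial term for the generator: push the predicted score toward 1."""
    score_est = d(est_mag, clean_mag)
    return F.mse_loss(score_est, torch.ones_like(score_est))
```

In the full model, this adversarial term would be combined with reconstruction losses on the decoupled magnitude and complex spectrogram estimates; it is shown in isolation here for clarity.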