Explicit Estimation of Magnitude and Phase Spectra in Parallel for High-Quality Speech Enhancement (2308.08926v2)

Published 17 Aug 2023 in eess.AS and cs.SD

Abstract: Phase information has a significant impact on speech perceptual quality and intelligibility. However, existing speech enhancement methods encounter limitations in explicit phase estimation due to the non-structural nature and wrapping characteristics of the phase, leading to a bottleneck in enhanced speech quality. To overcome this issue, in this paper we propose MP-SENet, a novel Speech Enhancement Network that explicitly enhances Magnitude and Phase spectra in parallel. The proposed MP-SENet comprises a Transformer-embedded encoder-decoder architecture. The encoder encodes the input distorted magnitude and phase spectra into time-frequency representations, which are further fed into time-frequency Transformers that alternately capture time and frequency dependencies. The decoder comprises a magnitude mask decoder and a phase decoder, which directly enhance the magnitude and wrapped phase spectra through a magnitude masking architecture and a phase parallel estimation architecture, respectively. Multi-level loss functions explicitly defined on the magnitude spectra, wrapped phase spectra, and short-time complex spectra are adopted to jointly train the MP-SENet model. A metric discriminator is further employed to compensate for the incomplete correlation between these losses and human auditory perception. Experimental results demonstrate that the proposed MP-SENet achieves state-of-the-art performance across multiple speech enhancement tasks, including speech denoising, dereverberation, and bandwidth extension. Compared to existing phase-aware speech enhancement methods, it further mitigates the compensation effect between the magnitude and phase through explicit phase estimation, elevating the perceptual quality of enhanced speech.
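
To make the parallel magnitude/phase design concrete, the sketch below illustrates the two ideas the abstract refers to: a magnitude-mask decoder applied to the noisy magnitude spectrum, a phase decoder that recovers the wrapped phase via atan2 over two parallel outputs, and an anti-wrapping phase loss. This is a minimal, hypothetical sketch under assumed tensor shapes and layer choices, not the authors' implementation; names such as MagnitudeMaskDecoder, ParallelPhaseDecoder, and anti_wrapping are illustrative.

```python
# Hypothetical sketch (not the authors' code) of magnitude masking,
# parallel phase estimation, and an anti-wrapping phase loss.
import torch
import torch.nn as nn


class MagnitudeMaskDecoder(nn.Module):
    """Predicts a bounded mask that is applied to the noisy magnitude spectrum."""

    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, 1, kernel_size=1)

    def forward(self, feats, noisy_mag):
        # feats: (B, C, T, F) time-frequency features; noisy_mag: (B, T, F)
        mask = torch.sigmoid(self.conv(feats)).squeeze(1)  # bounded in (0, 1)
        return mask * noisy_mag                            # enhanced magnitude


class ParallelPhaseDecoder(nn.Module):
    """Estimates the wrapped phase from two parallel outputs via atan2."""

    def __init__(self, channels: int):
        super().__init__()
        self.real_conv = nn.Conv2d(channels, 1, kernel_size=1)  # pseudo-real part
        self.imag_conv = nn.Conv2d(channels, 1, kernel_size=1)  # pseudo-imaginary part

    def forward(self, feats):
        r = self.real_conv(feats).squeeze(1)
        i = self.imag_conv(feats).squeeze(1)
        # atan2 keeps the output naturally wrapped to (-pi, pi],
        # sidestepping direct regression of the non-structural phase.
        return torch.atan2(i, r)


def anti_wrapping(x: torch.Tensor) -> torch.Tensor:
    """Map a phase difference to its principal value so that, e.g.,
    an error of 2*pi is not penalized."""
    return torch.abs(x - 2 * torch.pi * torch.round(x / (2 * torch.pi)))


def phase_loss(pred_phase: torch.Tensor, clean_phase: torch.Tensor) -> torch.Tensor:
    # Instantaneous-phase term of a multi-level phase loss; comparable terms
    # can be built on group delay and instantaneous angular frequency.
    return anti_wrapping(pred_phase - clean_phase).mean()
```

The key design choice the sketch tries to convey is that the phase is never predicted as a raw scalar: combining two parallel outputs with atan2 yields values that are wrapped by construction, and the anti-wrapping function keeps the loss insensitive to 2π ambiguities.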

Authors (3)
  1. Ye-Xin Lu (17 papers)
  2. Yang Ai (41 papers)
  3. Zhen-Hua Ling (114 papers)
Citations (3)