Mel-FullSubNet: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR (2402.13511v2)

Published 21 Feb 2024 in eess.AS

Abstract: In this work, we propose Mel-FullSubNet, a single-channel Mel-spectrogram denoising and dereverberation network for improving both speech quality and automatic speech recognition (ASR) performance. Mel-FullSubNet takes as input a noisy and reverberant Mel-spectrogram and predicts the corresponding clean Mel-spectrogram. The enhanced Mel-spectrogram can be either transformed to a speech waveform with a neural vocoder or used directly for ASR. Mel-FullSubNet interleaves full-band and sub-band networks, which learn the full-band spectral pattern of signals and their sub-band/narrow-band properties, respectively. Compared to linear-frequency-domain or time-domain speech enhancement, the major advantage of Mel-spectrogram enhancement is that the Mel-frequency representation presents speech more compactly and is thus easier to learn, benefiting both speech quality and ASR. Experimental results demonstrate that the proposed model achieves significant improvements in both speech quality and ASR performance.
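To make the interleaved full-band/sub-band idea concrete, below is a minimal PyTorch sketch of the design described in the abstract. It is an illustration only: the layer types (LSTMs), hidden sizes, residual connections, number of interleaved pairs, and the names MelFullSubNetSketch, FullBandBlock, and SubBandBlock are all assumptions, not the authors' exact architecture.

# Hypothetical sketch of an interleaved full-band / sub-band mel enhancer.
# All architectural details below are assumptions based only on the abstract.
import torch
import torch.nn as nn


class FullBandBlock(nn.Module):
    """Models the full-band spectral pattern: an LSTM runs along time,
    treating all F mel bins of a frame as one feature vector."""
    def __init__(self, n_mels: int, hidden: int = 256):
        super().__init__()
        self.lstm = nn.LSTM(n_mels, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, n_mels)

    def forward(self, x):            # x: (B, T, F)
        h, _ = self.lstm(x)
        return self.proj(h)          # (B, T, F)


class SubBandBlock(nn.Module):
    """Models narrow-band properties: an LSTM runs along time independently
    for every mel bin (bins folded into the batch dimension)."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):            # x: (B, T, F)
        b, t, f = x.shape
        y = x.permute(0, 2, 1).reshape(b * f, t, 1)   # one sequence per bin
        h, _ = self.lstm(y)
        return self.proj(h).reshape(b, f, t).permute(0, 2, 1)  # (B, T, F)


class MelFullSubNetSketch(nn.Module):
    """Interleaves full-band and sub-band blocks and regresses the clean
    log-mel spectrogram from the noisy/reverberant one."""
    def __init__(self, n_mels: int = 80, n_pairs: int = 2):
        super().__init__()
        self.blocks = nn.ModuleList()
        for _ in range(n_pairs):
            self.blocks.append(FullBandBlock(n_mels))
            self.blocks.append(SubBandBlock())

    def forward(self, noisy_mel):     # (B, T, F) log-mel
        x = noisy_mel
        for blk in self.blocks:
            x = x + blk(x)            # residual connection (assumption)
        return x                      # predicted clean log-mel


if __name__ == "__main__":
    model = MelFullSubNetSketch(n_mels=80)
    noisy = torch.randn(4, 100, 80)   # batch of 4, 100 frames, 80 mel bins
    clean_pred = model(noisy)
    print(clean_pred.shape)           # torch.Size([4, 100, 80])

The point the sketch tries to capture is the factorization: the full-band block sees an entire mel frame at once and can model cross-band spectral structure, while the sub-band block treats each mel bin as an independent time sequence, which suits narrow-band cues such as per-band noise statistics and reverberation decay. The predicted clean mel could then be passed to a neural vocoder for waveform synthesis or fed directly to an ASR front end, as the abstract describes.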

Authors (4)
  1. Rui Zhou (87 papers)
  2. Xian Li (116 papers)
  3. Ying Fang (15 papers)
  4. Xiaofei Li (71 papers)
Citations (1)
