Mel-FullSubNet: Mel-Spectrogram Enhancement for Improving Both Speech Quality and ASR (2402.13511v2)
Abstract: In this work, we propose Mel-FullSubNet, a single-channel Mel-spectrogram denoising and dereverberation network that improves both speech quality and automatic speech recognition (ASR) performance. Mel-FullSubNet takes as input a noisy and reverberant Mel-spectrogram and predicts the corresponding clean Mel-spectrogram. The enhanced Mel-spectrogram can either be transformed to a speech waveform with a neural vocoder or used directly for ASR. Mel-FullSubNet interleaves full-band and sub-band networks, which learn the full-band spectral pattern and the sub-band/narrow-band properties of signals, respectively. Compared to linear-frequency-domain or time-domain speech enhancement, the major advantage of Mel-spectrogram enhancement is that the Mel-frequency representation of speech is more compact and thus easier to learn, which benefits both speech quality and ASR. Experimental results demonstrate a significant improvement in both speech quality and ASR performance achieved by the proposed model.
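The interleaved full-band/sub-band design described above can be illustrated with a short sketch. The following PyTorch code conveys only the idea, not the paper's exact architecture: the block count, hidden sizes, residual connections, bidirectional full-band LSTM, and the 80-band Mel input are all illustrative assumptions.

```python
# Minimal sketch of interleaved full-band / sub-band processing on a
# Mel-spectrogram. Layer sizes, residual connections, and normalization
# choices are assumptions, not the authors' exact Mel-FullSubNet design.
import torch
import torch.nn as nn


class FullBandBlock(nn.Module):
    """Full-band spectral pattern: for each frame, an LSTM runs across the
    Mel-frequency axis, so every band sees the whole spectrum."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, 1)

    def forward(self, x):                     # x: (batch, time, mel)
        b, t, f = x.shape
        y = x.reshape(b * t, f, 1)            # sequence over frequency
        y, _ = self.lstm(y)
        y = self.proj(y).reshape(b, t, f)
        return x + y                          # residual connection (assumed)


class SubBandBlock(nn.Module):
    """Narrow-band properties: each Mel band is an independent sequence over
    time, with one LSTM shared across all bands."""
    def __init__(self, hidden: int = 128):
        super().__init__()
        self.lstm = nn.LSTM(1, hidden, batch_first=True)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):                     # x: (batch, time, mel)
        b, t, f = x.shape
        y = x.permute(0, 2, 1).reshape(b * f, t, 1)   # sequence over time
        y, _ = self.lstm(y)
        y = self.proj(y).reshape(b, f, t).permute(0, 2, 1)
        return x + y


class MelFullSubNetSketch(nn.Module):
    """Stacks alternating full-band and sub-band blocks to map a noisy and
    reverberant Mel-spectrogram to a clean Mel-spectrogram estimate."""
    def __init__(self, num_blocks: int = 2):
        super().__init__()
        blocks = []
        for _ in range(num_blocks):
            blocks += [FullBandBlock(), SubBandBlock()]
        self.blocks = nn.Sequential(*blocks)

    def forward(self, noisy_mel):             # (batch, time, mel)
        return self.blocks(noisy_mel)


if __name__ == "__main__":
    net = MelFullSubNetSketch()
    noisy = torch.randn(4, 100, 80)           # 100 frames, 80 Mel bands
    print(net(noisy).shape)                   # torch.Size([4, 100, 80])
```

The key point of the interleaving is that full-band blocks let every band condition on the whole spectrum within a frame, while sub-band blocks model each band's temporal dynamics with weights shared across bands; the enhanced Mel output could then feed a neural vocoder or an ASR front end, as the abstract describes.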
- Rui Zhou
- Xian Li
- Ying Fang
- Xiaofei Li