A Refining Underlying Information Framework for Monaural Speech Enhancement (2312.11201v2)

Published 18 Dec 2023 in eess.AS, cs.SD, and eess.SP

Abstract: Supervised speech enhancement has benefited significantly from recent advances in neural networks, particularly their ability to non-linearly fit diverse representations of target speech, such as the waveform or spectrum. However, these direct-fitting solutions still suffer from degraded speech and residual noise in listening evaluations. By bridging speech enhancement and the Information Bottleneck principle, this letter rethinks a universal plug-and-play strategy and proposes a Refining Underlying Information framework (RUI) to address these challenges in both theory and practice. Specifically, we first recast the objective of speech enhancement as an incremental convergence problem of the mutual information between comprehensive speech characteristics and individual speech characteristics, e.g., spectral and acoustic characteristics. In contrast to existing direct-fitting solutions, the underlying information then stems from the conditional entropy of the acoustic characteristics given the spectral characteristics. We therefore design a dual-path multiple refinement iterator, based on the chain rule of entropy, to refine this underlying information and further approximate the target speech. Experimental results on the DNS-Challenge dataset show that our solution consistently improves the PESQ score by 0.3+ over baselines with only 1.18 M additional parameters. The source code is available at https://github.com/caoruitju/RUI_SE.
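To make the information-theoretic framing concrete, the block below is a minimal sketch of the standard identities the abstract appeals to; the symbols S (spectral characteristics), A (acoustic characteristics), and T (target speech) are our own shorthand, and the mapping onto the paper's exact objective is an assumption.

```latex
\begin{align}
  % Chain rule of entropy: the joint uncertainty of the two characteristics
  % splits into a spectral term and a conditional term.
  H(A, S) &= H(S) + H(A \mid S) \\
  % Chain rule of mutual information with the target speech T:
  % fitting the spectrum alone captures only I(T; S); the conditional
  % term I(T; A \mid S) is the "underlying information" left to refine.
  I(T; A, S) &= I(T; S) + I(T; A \mid S)
\end{align}
```

Under this reading, a direct-fitting model captures only the first term on the right-hand side, while the refinement iterator is aimed at recovering the second.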

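For context on the reported 0.3+ PESQ gain, the snippet below is a minimal evaluation sketch, not the authors' code: it assumes the third-party `pesq` and `soundfile` Python packages, 16 kHz wideband audio as in the DNS-Challenge data, and placeholder file names.

```python
# Minimal wide-band PESQ (ITU-T P.862.2) comparison sketch.
# Assumes the third-party `pesq` and `soundfile` packages; file names are placeholders.
import soundfile as sf
from pesq import pesq

SR = 16000  # wideband sampling rate used by the DNS-Challenge data

# Load the clean reference, the unprocessed noisy input, and the enhanced output.
clean, _ = sf.read("clean.wav")
noisy, _ = sf.read("noisy.wav")
enhanced, _ = sf.read("enhanced.wav")

# PESQ scores against the clean reference; higher is better.
pesq_noisy = pesq(SR, clean, noisy, "wb")
pesq_enhanced = pesq(SR, clean, enhanced, "wb")

print(f"PESQ (noisy):    {pesq_noisy:.2f}")
print(f"PESQ (enhanced): {pesq_enhanced:.2f}")
print(f"Improvement:     {pesq_enhanced - pesq_noisy:+.2f}")
```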