Speech Enhancement and Dereverberation with Diffusion-based Generative Models (2208.05830v2)

Published 11 Aug 2022 in eess.AS, cs.LG, and cs.SD

Abstract: In this work, we build upon our previous publication and use diffusion-based generative models for speech enhancement. We present a detailed overview of the diffusion process that is based on a stochastic differential equation and delve into an extensive theoretical examination of its implications. As opposed to usual conditional generation tasks, we do not start the reverse process from pure Gaussian noise but from a mixture of noisy speech and Gaussian noise. This matches our forward process, which moves from clean speech to noisy speech by including a drift term. We show that this procedure enables using only 30 diffusion steps to generate high-quality clean speech estimates. By adapting the network architecture, we are able to significantly improve the speech enhancement performance, indicating that the network, rather than the formalism, was the main limitation of our original approach. In an extensive cross-dataset evaluation, we show that the improved method can compete with recent discriminative models and achieves better generalization when evaluated on a corpus different from the one used for training. We complement the results with an instrumental evaluation using real-world noisy recordings and a listening experiment, in which our proposed method is rated best. Examining different sampler configurations for solving the reverse process allows us to balance the performance and computational speed of the proposed method. Moreover, we show that the proposed method is also suitable for dereverberation and thus not limited to additive background noise removal. Code and audio examples are available online at https://github.com/sp-uhh/sgmse.
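The reverse process described in the abstract can be made concrete with a short sampler sketch. The Python snippet below is a minimal illustration, not the authors' implementation (see the linked repository for that). It assumes hypothetical placeholders: `score_model` (the trained score network), `drift` (the forward drift term that pulls the state toward the noisy speech y), and `sigma` (the diffusion coefficient), and it integrates the reverse-time SDE with a plain Euler-Maruyama scheme. The points emphasized in the abstract show up in the initialization, which starts from the noisy speech plus Gaussian noise rather than from pure noise, and in the small number of steps (around 30).

```python
# Minimal sketch (assumption-laden, not the authors' code) of a reverse-time
# SDE sampler for diffusion-based speech enhancement. `score_model`, `drift`,
# and `sigma` are hypothetical placeholders for the trained score network and
# the SDE coefficients defined in the paper.
import torch

def reverse_sample(y, score_model, drift, sigma, n_steps=30, t_eps=0.03):
    """Euler-Maruyama integration of the reverse-time SDE from t=1 to t_eps.

    y           : spectrogram (or other representation) of the noisy mixture
    score_model : callable estimating the score s(x_t, y, t)
    drift       : callable f(x_t, y, t) giving the forward drift term
    sigma       : callable g(t) giving the diffusion coefficient
    """
    ts = torch.linspace(1.0, t_eps, n_steps + 1)
    # Start from noisy speech plus Gaussian noise (not pure noise), matching
    # the terminal distribution of the forward process.
    x = y + sigma(ts[0]) * torch.randn_like(y)
    for i in range(n_steps):
        t, dt = ts[i], ts[i + 1] - ts[i]      # dt is negative (reverse time)
        g = sigma(t)
        score = score_model(x, y, t)
        # Reverse SDE step: dx = [f(x, y, t) - g(t)^2 * score] dt + g(t) dw
        x_mean = x + (drift(x, y, t) - g**2 * score) * dt
        x = x_mean + g * torch.sqrt(-dt) * torch.randn_like(x)
    return x_mean  # return the final mean estimate as the clean-speech guess
```

This Euler-Maruyama predictor is only one of the sampler configurations the abstract alludes to; the paper also examines other solver settings to trade off enhancement quality against the number of reverse steps.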

Authors (5)
  1. Julius Richter (20 papers)
  2. Simon Welker (22 papers)
  3. Jean-Marie Lemercier (19 papers)
  4. Bunlong Lay (9 papers)
  5. Timo Gerkmann (70 papers)
Citations (146)

Summary

We haven't generated a summary for this paper yet.