
Investigating the Design Space of Diffusion Models for Speech Enhancement (2312.04370v2)

Published 7 Dec 2023 in eess.AS, cs.LG, and cs.SD

Abstract: Diffusion models are a new class of generative models that have shown outstanding performance in the image generation literature. As a consequence, studies have attempted to apply diffusion models to other tasks, such as speech enhancement. A popular approach to adapting diffusion models to speech enhancement is to model a progressive transformation between the clean and noisy speech signals. However, one popular diffusion model framework from the image generation literature does not account for such a transformation towards the system input, which prevents existing diffusion-based speech enhancement systems from being related to that framework. To address this, we extend the framework to account for the progressive transformation between the clean and noisy speech signals. This allows us to apply recent developments from the image generation literature and to systematically investigate design aspects of diffusion models that remain largely unexplored for speech enhancement, such as the neural network preconditioning, the training loss weighting, the stochastic differential equation (SDE), and the amount of stochasticity injected in the reverse process. We show that the performance of previous diffusion-based speech enhancement systems cannot be attributed to the progressive transformation between the clean and noisy speech signals. Moreover, we show that a proper choice of preconditioning, training loss weighting, SDE and sampler allows us to outperform a popular diffusion-based speech enhancement system while using fewer sampling steps, reducing the computational cost by a factor of four.

Authors (6)
  1. Philippe Gonzalez (5 papers)
  2. Zheng-Hua Tan (85 papers)
  3. Jan Østergaard (60 papers)
  4. Jesper Jensen (41 papers)
  5. Tommy Sonne Alstrøm (9 papers)
  6. Tobias May (6 papers)
Citations (3)

