Diffusion-Based Speech Enhancement in Matched and Mismatched Conditions Using a Heun-Based Sampler (2312.02683v2)

Published 5 Dec 2023 in eess.AS, cs.LG, and cs.SD

Abstract: Diffusion models are a new class of generative models that have recently been applied successfully to speech enhancement. Previous works have demonstrated their superior performance in mismatched conditions compared to state-of-the-art discriminative models. However, this was investigated using a single database for training and another for testing, which makes the results highly dependent on the particular databases chosen. Moreover, recent developments from the image generation literature remain largely unexplored for speech enhancement. These include several design aspects of diffusion models, such as the noise schedule and the reverse sampler. In this work, we systematically assess the generalization performance of a diffusion-based speech enhancement model by using multiple speech, noise and binaural room impulse response (BRIR) databases to simulate mismatched acoustic conditions. We also experiment with a noise schedule and a sampler that have not been applied to speech enhancement before. We show that the proposed system substantially benefits from using multiple databases for training, and achieves superior performance compared to state-of-the-art discriminative models in both matched and mismatched conditions. We also show that a Heun-based sampler achieves superior performance at a smaller computational cost compared to a sampler commonly used in speech enhancement.
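For readers unfamiliar with the sampler in question: the "Heun-based sampler" refers to the deterministic second-order solver popularized by Karras et al. (2022) for diffusion-based image generation. The sketch below is a minimal, hypothetical illustration of that general technique, not the paper's implementation; the `denoiser(x, sigma)` interface, the `sigmas` schedule values, and the signal shapes are assumptions made for the example.

```python
import torch

def heun_sampler(denoiser, x_T, sigmas):
    """Deterministic second-order (Heun) sampler in the style of
    Karras et al. (2022). `denoiser(x, sigma)` is assumed to return
    the denoised estimate D(x; sigma); `sigmas` is a decreasing noise
    schedule whose last entry is 0."""
    x = x_T
    for i in range(len(sigmas) - 1):
        sigma, sigma_next = sigmas[i], sigmas[i + 1]
        # Euler (first-order) step along d = (x - D(x; sigma)) / sigma.
        d = (x - denoiser(x, sigma)) / sigma
        x_euler = x + (sigma_next - sigma) * d
        if sigma_next > 0:
            # Heun correction: average the slopes at sigma and sigma_next.
            d_next = (x_euler - denoiser(x_euler, sigma_next)) / sigma_next
            x = x + (sigma_next - sigma) * 0.5 * (d + d_next)
        else:
            # Final step to sigma = 0 stays first-order (no division by 0).
            x = x_euler
    return x

# Example usage with a trivial stand-in denoiser (shape-checking only):
sigmas = torch.tensor([80.0, 20.0, 5.0, 1.0, 0.0])
x_T = torch.randn(1, 16000) * sigmas[0]          # start from pure noise
dummy = lambda x, sigma: torch.zeros_like(x)     # placeholder, not a real model
out = heun_sampler(dummy, x_T, sigmas)
```

Although each Heun step calls the denoiser twice, the second-order correction typically permits far fewer steps than a first-order sampler for comparable quality, which is consistent with the abstract's claim of superior performance at a smaller computational cost.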
