Non-intrusive Speech Quality Assessment with Diffusion Models Trained on Clean Speech (2410.17834v2)

Published 23 Oct 2024 in eess.AS, cs.LG, and cs.SD

Abstract: Diffusion models have found great success in generating high-quality, natural samples of speech, but their potential for density estimation of speech has so far remained largely unexplored. In this work, we leverage an unconditional diffusion model trained only on clean speech for the assessment of speech quality. We show that the quality of a speech utterance can be assessed by estimating the likelihood of a corresponding sample in the terminating Gaussian distribution, obtained via a deterministic noising process. The resulting method is purely unsupervised, trained only on clean speech, and therefore does not rely on annotations. Our diffusion-based approach leverages clean-speech priors to assess quality based on how the input relates to the learned distribution of clean data. Our proposed log-likelihoods show promising results, correlating well with intrusive speech quality metrics and showing the best correlation with human scores in a listening experiment.
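The abstract outlines the core mechanism: an input utterance is pushed through a deterministic noising process (the probability-flow ODE of a diffusion model trained on clean speech), and quality is scored via the log-likelihood of the resulting sample under the terminating Gaussian. Below is a minimal sketch of one plausible reading of this pipeline; the `score_model(x, sigma)` interface, the variance-exploding schedule, and all parameter values are assumptions made for illustration, not the paper's actual implementation.

```python
import math
import torch

@torch.no_grad()
def terminal_gaussian_loglik(x0: torch.Tensor, score_model,
                             sigma_min: float = 1e-2,
                             sigma_max: float = 10.0,
                             n_steps: int = 100) -> float:
    """Deterministically noise `x0` and score the terminal sample under
    the terminating Gaussian N(0, sigma_max^2 I). `score_model(x, sigma)`
    is a hypothetical stand-in for an unconditional score model trained
    on clean speech."""
    # Geometric noise schedule from sigma_min up to sigma_max (assumed).
    sigmas = torch.exp(torch.linspace(math.log(sigma_min),
                                      math.log(sigma_max), n_steps + 1))
    x = x0.clone()
    for i in range(n_steps):
        s, s_next = sigmas[i].item(), sigmas[i + 1].item()
        # Forward Euler step of the variance-exploding probability-flow
        # ODE, dx = -(1/2) d(sigma^2) * score(x, sigma): the deterministic
        # noising direction (Song et al., ICLR 2021).
        x = x - 0.5 * (s_next**2 - s**2) * score_model(x, s)
    # Log-density of the terminal sample under N(0, sigma_max^2 I).
    d = x.numel()
    return (-0.5 * d * math.log(2.0 * math.pi)
            - d * math.log(sigma_max)
            - 0.5 * float((x**2).sum()) / sigma_max**2)
```

Under this reading, utterances close to the learned clean-speech distribution are carried to high-density regions of the terminal Gaussian and receive high log-likelihoods, while degraded inputs land in lower-density regions, consistent with the reported correlation with intrusive metrics and listener scores.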
