GLA-Grad: A Griffin-Lim Extended Waveform Generation Diffusion Model (2402.15516v1)

Published 9 Feb 2024 in cs.SD, cs.LG, eess.AS, and eess.SP

Abstract: Diffusion models are receiving growing interest for a variety of signal generation tasks such as speech or music synthesis. WaveGrad, for example, is a successful diffusion model that conditionally uses the mel spectrogram to guide a diffusion process for the generation of high-fidelity audio. However, such models face important challenges concerning the noise diffusion process for training and inference, and they have difficulty generating high-quality speech for speakers that were not seen during training. With the aim of minimizing the conditioning error and increasing the efficiency of the noise diffusion process, we propose in this paper a new scheme called GLA-Grad, which consists of introducing a phase recovery algorithm such as the Griffin-Lim algorithm (GLA) at each step of the regular diffusion process. Furthermore, it can be directly applied to an already-trained waveform generation model, without additional training or fine-tuning. We show that our algorithm outperforms state-of-the-art diffusion models for speech generation, especially when generating speech for a previously unseen target speaker.
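
For readers unfamiliar with the phase recovery step mentioned in the abstract, the sketch below shows the classical Griffin-Lim iteration (refs. 24 and 26) that GLA-Grad interleaves with the reverse diffusion process. It is a minimal, illustrative implementation: the function name, STFT parameters, and iteration count are assumptions, not the authors' code, and in GLA-Grad the projection is applied to intermediate waveform estimates during sampling rather than run as a stand-alone vocoder.

```python
import numpy as np
import librosa


def griffin_lim(magnitude, hop_length=256, n_iters=32, seed=0):
    """Classical Griffin-Lim phase recovery from an STFT magnitude.

    magnitude: array of shape (1 + n_fft // 2, n_frames) holding |STFT| values.
    Returns a time-domain waveform whose STFT magnitude approximates `magnitude`.
    """
    n_fft = 2 * (magnitude.shape[0] - 1)
    rng = np.random.default_rng(seed)
    # Start from the target magnitude combined with a random phase estimate.
    stft = magnitude * np.exp(2j * np.pi * rng.random(magnitude.shape))

    for _ in range(n_iters):
        # Project onto the set of consistent spectrograms (inverse STFT, then STFT)...
        signal = librosa.istft(stft, hop_length=hop_length)
        rebuilt = librosa.stft(signal, n_fft=n_fft, hop_length=hop_length)
        # ...then restore the target magnitude, keeping only the estimated phase.
        stft = magnitude * np.exp(1j * np.angle(rebuilt))

    return librosa.istft(stft, hop_length=hop_length)
```

Note that librosa also provides an equivalent built-in, librosa.griffinlim, which additionally supports the momentum-accelerated variant of ref. 26.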

References (32)
  1. H. Lu, Z. Wu, X. Wu, X. Li, S. Kang, X. Liu et al., “VAENAR-TTS: Variational Auto-Encoder Based Non-AutoRegressive Text-to-Speech Synthesis,” in Proc. Interspeech, 2021, pp. 3775–3779.
  2. A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves et al., “WaveNet: A generative model for raw audio,” in Proc. ISCA SSW, 2016, p. 125.
  3. R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A flow-based generative network for speech synthesis,” in Proc. ICASSP, 2019, pp. 3617–3621.
  4. D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in Proc. ICML, 2015, pp. 1530–1538.
  5. J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” Proc. NeurIPS, vol. 33, pp. 17022–17033, 2020.
  6. I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair et al., “Generative adversarial networks,” arXiv preprint arXiv:1406.2661, 2014.
  7. J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” Proc. NeurIPS, vol. 33, pp. 6840–6851, 2020.
  8. Y. Song, J. Sohl-Dickstein, D. P. Kingma, A. Kumar, S. Ermon, and B. Poole, “Score-based generative modeling through stochastic differential equations,” in Proc. ICLR, 2021.
  9. H. Huang, P. S. Yu, and C. Wang, “An introduction to image synthesis with generative adversarial nets,” arXiv preprint arXiv:1803.04469, 2018.
  10. M.-Y. Liu, X. Huang, J. Yu, T.-C. Wang, and A. Mallya, “Generative adversarial networks for image and video synthesis: Algorithms and applications,” Proc. IEEE, 2021.
  11. P. Dhariwal and A. Nichol, “Diffusion models beat GANs on image synthesis,” Proc. NeurIPS, vol. 34, pp. 8780–8794, 2021.
  12. J.-M. Lemercier, J. Richter, S. Welker, and T. Gerkmann, “StoRM: A diffusion-based stochastic regeneration model for speech enhancement and dereverberation,” IEEE Trans. Audio, Speech, Lang. Process., 2023.
  13. J. Richter, S. Welker, J.-M. Lemercier, B. Lay, and T. Gerkmann, “Speech enhancement and dereverberation with diffusion-based generative models,” IEEE Trans. Audio, Speech, Lang. Process., 2023.
  14. Y.-J. Lu, Z.-Q. Wang, S. Watanabe, A. Richard, C. Yu, and Y. Tsao, “Conditional diffusion probabilistic model for speech enhancement,” in Proc. ICASSP, 2022, pp. 7402–7406.
  15. R. Scheibler, Y. Ji, S.-W. Chung, J. Byun, S. Choe, and M.-S. Choi, “Diffusion-based generative speech source separation,” in Proc. ICASSP, 2023, pp. 1–5.
  16. B. Chen, C. Wu, and W. Zhao, “SepDiff: Speech separation based on denoising diffusion model,” in Proc. ICASSP, 2023, pp. 1–5.
  17. N. Chen, Y. Zhang, H. Zen, R. J. Weiss, M. Norouzi, and W. Chan, “WaveGrad: Estimating gradients for waveform generation,” in Proc. ICLR, 2021.
  18. Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A versatile diffusion model for audio synthesis,” Proc. ICLR, 2021.
  19. S.-g. Lee, H. Kim, C. Shin, X. Tan, C. Liu, Q. Meng et al., “PriorGrad: Improving conditional denoising diffusion models with data-dependent adaptive prior,” Proc. ICLR, 2022.
  20. Y. Koizumi, H. Zen, K. Yatabe, N. Chen, and M. Bacchiani, “SpecGrad: Diffusion probabilistic model based neural vocoder with adaptive noise spectral shaping,” in Proc. Interspeech, 2022.
  21. Z. Wang, Y. Jiang, H. Zheng, P. Wang, P. He, Z. Wang et al., “Patch diffusion: Faster and more data-efficient training of diffusion models,” arXiv preprint arXiv:2304.12526, 2023.
  22. R. Huang, M. W. Lam, J. Wang, D. Su, D. Yu, Y. Ren et al., “FastDiff: A fast conditional diffusion model for high-quality speech synthesis,” in Proc. IJCAI, 2022.
  23. A. Farahani, S. Voghoei, K. Rasheed, and H. R. Arabnia, “A brief review of domain adaptation,” Proc. ICDATA and IKE, pp. 877–894, 2021.
  24. D. Griffin and J. Lim, “Signal estimation from modified short-time Fourier transform,” IEEE Trans. Acoust., Speech, Signal Process., vol. 32, no. 2, pp. 236–243, 1984.
  25. J. Le Roux, N. Ono, and S. Sagayama, “Explicit consistency constraints for STFT spectrograms and their application to phase reconstruction,” in Proc. SAPA, Sep. 2008, pp. 23–28.
  26. N. Perraudin, P. Balazs, and P. L. Søndergaard, “A fast Griffin-Lim algorithm,” in Proc. WASPAA, 2013, pp. 1–4.
  27. M. S. Graham, W. H. Pinaya, P.-D. Tudosiu, P. Nachev, S. Ourselin, and J. Cardoso, “Denoising diffusion models for out-of-distribution detection,” in Proc. CVPR, 2023, pp. 2947–2956.
  28. K. Ito and L. Johnson, “The LJ Speech dataset,” https://keithito.com/LJ-Speech-Dataset/, 2017.
  29. J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” University of Edinburgh. The Centre for Speech Technology Research (CSTR), Tech. Rep., 2019.
  30. A. Rix, J. Beerends, M. Hollier, and A. Hekstra, “Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, vol. 2, 2001, pp. 749–752.
  31. C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “A short-time objective intelligibility measure for time-frequency weighted noisy speech,” in Proc. ICASSP, 2010, pp. 4214–4217.
  32. W. Jassim, J. Skoglund, M. Chinen, and A. Hines, “Speech quality assessment with WARP-Q: From similarity to subsequence dynamic time warp cost,” IET Signal Processing, vol. 16, Aug. 2022.
Authors (5)
  1. Haocheng Liu (3 papers)
  2. Teysir Baoueb (4 papers)
  3. Mathieu Fontaine (15 papers)
  4. Jonathan Le Roux (82 papers)
  5. Gaël Richard (14 papers)
Citations (2)
