
Boosting Fast and High-Quality Speech Synthesis with Linear Diffusion (2306.05708v2)

Published 9 Jun 2023 in cs.SD, cs.LG, and eess.AS

Abstract: Denoising Diffusion Probabilistic Models have shown extraordinary ability on various generative tasks. However, their slow inference speed renders them impractical for speech synthesis. This paper proposes a linear diffusion model (LinDiff) based on an ordinary differential equation to simultaneously reach fast inference and high sample quality. First, we employ linear interpolation between the target and noise to design the diffusion sequence used for training, whereas the diffusion paths linking noise and target in prior work are curved. When the number of sampling steps (i.e., the number of line segments used to fit the path) is decreased, the ease of fitting straight lines compared to curves allows us to generate higher-quality samples from random noise with fewer iterations. Second, to reduce computational complexity and achieve effective global modeling of the noisy speech, LinDiff employs a patch-based processing approach that partitions the input signal into small patches. The patch-wise tokens are modeled with a Transformer architecture to capture global information effectively. Adversarial training is used to further improve sample quality with a decreased number of sampling steps. We test the proposed method on speech synthesis conditioned on acoustic features (Mel-spectrograms). Experimental results verify that our model can synthesize high-quality speech even with only one diffusion step. Both subjective and objective evaluations demonstrate that our model can synthesize speech of a quality comparable to that of autoregressive models, with faster synthesis speed (3 diffusion steps).
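
The abstract describes two mechanisms worth making concrete: (i) training pairs are built by linearly interpolating between clean speech and Gaussian noise, so the sampling ODE follows a straight line that can be traversed in very few Euler steps, and (ii) the waveform is partitioned into patches that a Transformer then models globally. The following PyTorch sketch illustrates both ideas under stated assumptions; `make_training_pair`, `patchify`, `sample`, the `model(x, mel, t)` signature, and the patch size are illustrative placeholders based on the abstract, not the paper's actual implementation.

```python
import torch

def make_training_pair(x0: torch.Tensor, t: torch.Tensor):
    """Linear interpolation between clean speech x0 and Gaussian noise.

    x_t = (1 - t) * x0 + t * noise, so the path from noise (t = 1) to the
    target (t = 0) is a straight line rather than the curved trajectory of
    a standard DDPM noise schedule.
    """
    noise = torch.randn_like(x0)
    t = t.view(-1, *([1] * (x0.dim() - 1)))        # broadcast t over waveform dims
    x_t = (1.0 - t) * x0 + t * noise
    return x_t, noise

def patchify(x: torch.Tensor, patch_size: int = 64) -> torch.Tensor:
    """Split a waveform (batch, samples) into tokens (batch, n_patches, patch_size)
    so a Transformer can model global structure at a reduced sequence length."""
    b, n = x.shape
    n = (n // patch_size) * patch_size             # drop any trailing remainder
    return x[:, :n].reshape(b, -1, patch_size)

@torch.no_grad()
def sample(model, mel: torch.Tensor, n_steps: int = 3, length: int = 22050):
    """Few-step sampling: follow the straight-line path from noise toward speech.

    Because the path is (approximately) straight, a handful of Euler steps --
    even a single one -- can already land close to the clean signal.
    Assumes a hypothetical model(x, mel, t) that predicts the clean waveform.
    """
    x = torch.randn(mel.size(0), length)           # start from pure noise (t = 1)
    ts = torch.linspace(1.0, 0.0, n_steps + 1)
    for i in range(n_steps):
        t_cur, t_next = ts[i], ts[i + 1]
        t_batch = torch.full((x.size(0),), t_cur.item())
        x0_hat = model(x, mel, t_batch)            # predicted clean speech
        # Step along the straight line between the current point and x0_hat.
        x = x0_hat + (t_next / t_cur) * (x - x0_hat)
    return x
```

With `n_steps = 1` the loop reduces to a single model call followed by `x = x0_hat`, which matches the abstract's claim that one diffusion step can already yield usable speech; adversarial training would then refine the quality of that one-step estimate.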

Authors (5)
  1. Haogeng Liu (8 papers)
  2. Tao Wang (700 papers)
  3. Jie Cao (79 papers)
  4. Ran He (172 papers)
  5. Jianhua Tao (139 papers)
Citations (2)