Matcha-TTS: A fast TTS architecture with conditional flow matching (2309.03199v2)

Published 6 Sep 2023 in eess.AS, cs.HC, cs.LG, and cs.SD

Abstract: We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest models on long utterances, and attains the highest mean opinion score in a listening test. Please see https://shivammehta25.github.io/Matcha-TTS/ for audio examples, code, and pre-trained models.
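To make the abstract's central idea concrete, here is a minimal, hypothetical PyTorch sketch of optimal-transport conditional flow matching (OT-CFM) training and a few-step Euler ODE sampler, following the general flow-matching recipe the paper builds on. This is not the authors' released implementation: the toy VectorField module, tensor shapes, sigma_min value, and step count are illustrative assumptions.

```python
# Hypothetical OT-CFM sketch (not the official Matcha-TTS code).
# The model regresses a vector field along straight-line paths from noise to data,
# then synthesis integrates that field with a few Euler steps.
import torch
import torch.nn as nn


class VectorField(nn.Module):
    """Toy stand-in for the decoder: predicts v_theta(x_t, t | cond)."""

    def __init__(self, n_mels: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(2 * n_mels + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, n_mels),
        )

    def forward(self, x_t, t, cond):
        # x_t, cond: (batch, frames, n_mels); t: (batch, 1, 1)
        t = t.expand(x_t.shape[0], x_t.shape[1], 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))


def ot_cfm_loss(model, x1, cond, sigma_min: float = 1e-4):
    """OT-CFM objective: regress the straight-line field from noise x0 to data x1."""
    x0 = torch.randn_like(x1)                        # noise sample
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1    # conditional OT path
    u_t = x1 - (1 - sigma_min) * x0                  # target field (constant in t)
    return ((model(x_t, t, cond) - u_t) ** 2).mean()


@torch.no_grad()
def synthesise(model, cond, n_steps: int = 10):
    """Few-step Euler ODE solve from noise to a mel-spectrogram."""
    x = torch.randn_like(cond)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.shape[0], 1, 1), i * dt, device=x.device)
        x = x + dt * model(x, t, cond)               # x_{t+dt} = x_t + dt * v_theta
    return x


if __name__ == "__main__":
    model = VectorField()
    mel = torch.randn(2, 100, 80)    # dummy target mel-spectrograms
    cond = torch.randn(2, 100, 80)   # dummy frame-aligned encoder output
    loss = ot_cfm_loss(model, mel, cond)
    loss.backward()
    print(float(loss), synthesise(model, cond).shape)
```

The near-straight probability paths learned this way are what allow the abstract's claim of high output quality in fewer synthesis steps: each utterance needs only a handful of vector-field evaluations, compared with the longer sampling chains typical of score-matching diffusion decoders.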

