Matcha-TTS: A fast TTS architecture with conditional flow matching (2309.03199v2)
Abstract: We introduce Matcha-TTS, a new encoder-decoder architecture for speedy TTS acoustic modelling, trained using optimal-transport conditional flow matching (OT-CFM). This yields an ODE-based decoder capable of high output quality in fewer synthesis steps than models trained using score matching. Careful design choices additionally ensure each synthesis step is fast to run. The method is probabilistic, non-autoregressive, and learns to speak from scratch without external alignments. Compared to strong pre-trained baseline models, the Matcha-TTS system has the smallest memory footprint, rivals the speed of the fastest models on long utterances, and attains the highest mean opinion score in a listening test. Please see https://shivammehta25.github.io/Matcha-TTS/ for audio examples, code, and pre-trained models.
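To make the abstract's training and synthesis claims concrete, below is a minimal, hedged sketch of the OT-CFM objective and the resulting ODE-based sampling loop, not the official Matcha-TTS implementation. The `VectorField` module, the `sigma_min` value, the step count, and the dummy conditioning tensors are illustrative assumptions; the actual decoder architecture and hyperparameters are described in the paper and released code.

```python
# Minimal sketch of OT-CFM training and ODE-based synthesis (not the official
# Matcha-TTS code). `VectorField` is a toy stand-in for the paper's decoder;
# its architecture, `sigma_min`, and the Euler step count are illustrative.
import torch
import torch.nn as nn

class VectorField(nn.Module):
    """Toy decoder stand-in: predicts d(x_t)/dt from (x_t, t, conditioning)."""
    def __init__(self, mel_dim: int = 80, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(mel_dim * 2 + 1, hidden), nn.SiLU(),
            nn.Linear(hidden, mel_dim),
        )

    def forward(self, x_t, t, cond):
        # Broadcast the scalar time over the frame axis and concatenate inputs.
        t = t.expand(*x_t.shape[:-1], 1)
        return self.net(torch.cat([x_t, cond, t], dim=-1))

def ot_cfm_loss(model, x1, cond, sigma_min: float = 1e-4):
    """OT-CFM: regress the (nearly) straight vector field from noise to data."""
    x0 = torch.randn_like(x1)                        # sample from the prior
    t = torch.rand(x1.size(0), 1, 1)                 # one flow time per utterance
    x_t = (1 - (1 - sigma_min) * t) * x0 + t * x1    # OT conditional flow
    u_t = x1 - (1 - sigma_min) * x0                  # target field (constant in t)
    return ((model(x_t, t, cond) - u_t) ** 2).mean()

@torch.no_grad()
def synthesise(model, cond, n_steps: int = 10, mel_dim: int = 80):
    """Fixed-step Euler ODE solve from t=0 (noise) to t=1 (mel-spectrogram)."""
    x = torch.randn(*cond.shape[:-1], mel_dim)
    dt = 1.0 / n_steps
    for i in range(n_steps):
        t = torch.full((x.size(0), 1, 1), i * dt)
        x = x + dt * model(x, t, cond)
    return x

if __name__ == "__main__":
    model = VectorField()
    mels = torch.randn(4, 100, 80)   # dummy target mel frames
    cond = torch.randn(4, 100, 80)   # dummy frame-aligned encoder output
    loss = ot_cfm_loss(model, mels, cond)
    loss.backward()
    print(float(loss), synthesise(model, cond).shape)
```

Because the conditional flows in OT-CFM are close to straight lines, a coarse Euler solve with few steps already lands near the data distribution, which is the intuition behind the abstract's claim of high output quality in fewer synthesis steps than score-matching-based decoders.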