CM-TTS: Enhancing Real Time Text-to-Speech Synthesis Efficiency through Weighted Samplers and Consistency Models (2404.00569v1)
Abstract: Neural Text-to-Speech (TTS) systems find broad application in voice assistants, e-learning, and audiobook creation. Modern generative models, such as Diffusion Models (DMs), hold promise for high-fidelity, real-time speech synthesis, yet their multi-step sampling limits inference efficiency. Prior work has integrated GANs with DMs to speed up inference by approximating the denoising distribution, but adversarial training introduces convergence issues. To overcome this, we introduce CM-TTS, a novel architecture grounded in consistency models (CMs). Drawing inspiration from continuous-time diffusion models, CM-TTS achieves top-quality speech synthesis in fewer steps without adversarial training or dependence on pre-trained models. We further design weighted samplers that incorporate different sampling positions into training with dynamic probabilities, ensuring unbiased learning throughout the training process. We present a real-time mel-spectrogram generation consistency model, validated through comprehensive evaluations. Experimental results underscore CM-TTS's superiority over existing single-step speech synthesis systems, representing a significant advance in the field.
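The weighted-sampler idea described in the abstract (drawing training positions along the trajectory with dynamic probabilities while keeping learning unbiased) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name, the loss-tracking EMA, and the uniform "floor" that keeps every position reachable are all assumptions introduced here for clarity.

```python
import numpy as np

class WeightedTimestepSampler:
    """Hypothetical sketch of a dynamic weighted sampler.

    Positions whose recent training loss is larger are drawn more
    often; mixing in a uniform floor keeps every position reachable,
    so no part of the trajectory is starved during training.
    """

    def __init__(self, num_positions: int, floor: float = 0.1, seed: int = 0):
        self.num_positions = num_positions
        self.floor = floor                       # weight of the uniform mixture component
        self.loss_ema = np.ones(num_positions)   # running loss estimate per position
        self.rng = np.random.default_rng(seed)

    def probabilities(self) -> np.ndarray:
        # Loss-proportional weights, mixed with a uniform distribution.
        w = self.loss_ema / self.loss_ema.sum()
        u = np.full(self.num_positions, 1.0 / self.num_positions)
        p = (1.0 - self.floor) * w + self.floor * u
        return p / p.sum()

    def sample(self, batch_size: int) -> np.ndarray:
        # Draw a batch of discretization positions for training.
        return self.rng.choice(self.num_positions, size=batch_size,
                               p=self.probabilities())

    def update(self, positions: np.ndarray, losses: np.ndarray,
               decay: float = 0.9) -> None:
        # Refresh the per-position loss estimates after a training step.
        for n, l in zip(positions, losses):
            self.loss_ema[n] = decay * self.loss_ema[n] + (1.0 - decay) * l
```

In use, each training step would call `sample` to pick positions, compute the consistency loss at those positions, and feed the observed losses back through `update`, so the sampling distribution tracks where the model currently struggles.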
Authors: Xiang Li, Fan Bu, Ambuj Mehrish, Yingting Li, Jiale Han, Bo Cheng, Soujanya Poria