Accelerating Large-Scale Zero-Shot Speech Synthesis with FlashSpeech
Introduction
Speech synthesis has made significant strides, but most state-of-the-art systems remain hindered by heavy computational requirements and slow, multi-step inference. To address these issues, this paper introduces FlashSpeech, an efficient approach to zero-shot speech synthesis. FlashSpeech builds on a latent consistency model (LCM) trained with a novel adversarial consistency training method, eliminating the need for a pre-trained diffusion model as a teacher. It synthesizes speech in one or two sampling steps with high audio quality and speaker similarity, and its inference is about 20 times faster than existing systems.
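To make the one- or two-step claim concrete, the following minimal sketch shows how sampling from a trained consistency function might look; `f_theta`, the noise levels in `sigmas`, and the latent `shape` are illustrative assumptions rather than FlashSpeech's actual interface.

```python
import torch

@torch.no_grad()
def consistency_sample(f_theta, cond, shape, sigmas=(80.0, 2.0), sigma_min=0.002):
    """Hypothetical one-/two-step sampling with a trained consistency function.

    f_theta(x, sigma, cond) is assumed to map a noisy latent at noise level
    sigma directly back to an estimate of the clean latent (the defining
    property of a consistency model).
    """
    # Step 1: start from pure noise at the largest noise level.
    x = torch.randn(shape) * sigmas[0]
    x0 = f_theta(x, sigmas[0], cond)          # one-step result

    # Optional step 2: re-noise the estimate to an intermediate level
    # and apply the consistency function once more to refine it.
    for sigma in sigmas[1:]:
        noise = torch.randn_like(x0)
        x = x0 + (sigma ** 2 - sigma_min ** 2) ** 0.5 * noise
        x0 = f_theta(x, sigma, cond)
    return x0
```

Passing a single noise level gives one-step generation; each additional level costs one more network evaluation in exchange for a small quality gain.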
Key Contributions
- FlashSpeech System: An efficient zero-shot speech synthesis system with drastically reduced inference time and computational requirements.
- Adversarial Consistency Training: A training method that combines adversarial training with consistency training, using pre-trained speech language models as discriminators so the LCM can be trained from scratch without a diffusion teacher.
- Prosody Generation: A prosody generator module that increases prosodic diversity, yielding more natural rhythm in synthesized speech without compromising stability (a toy sketch of the idea follows this list).
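To make the stability-plus-diversity idea concrete, here is a toy sketch of a prosody module that pairs a deterministic regressor (stable baseline) with a noise-driven refinement branch (diversity). The module layout, layer sizes, and the pitch/duration outputs are illustrative assumptions, not the paper's actual design.

```python
import torch
import torch.nn as nn

class ToyProsodyGenerator(nn.Module):
    """Toy prosody generator: a deterministic regressor predicts a base
    pitch/duration trajectory from phoneme features, and a noise-driven
    refinement branch adds controlled variation. Purely illustrative.
    """
    def __init__(self, d_phoneme=256, d_hidden=256, d_prosody=2):
        super().__init__()
        self.regressor = nn.Sequential(
            nn.Linear(d_phoneme, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_prosody),          # e.g. [pitch, duration]
        )
        self.refiner = nn.Sequential(
            nn.Linear(d_phoneme + d_prosody, d_hidden), nn.ReLU(),
            nn.Linear(d_hidden, d_prosody),
        )

    def forward(self, phoneme_feats, noise_scale=1.0):
        base = self.regressor(phoneme_feats)          # stable baseline
        noise = noise_scale * torch.randn_like(base)  # source of diversity
        residual = self.refiner(torch.cat([phoneme_feats, base + noise], dim=-1))
        return base + residual                        # varied but anchored
```

Setting `noise_scale=0` recovers a fully deterministic prosody prediction; raising it trades some stability for variety.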
System Architecture and Training
FlashSpeech pairs a latent consistency model with a prosody generator, both conditioned on prior vectors obtained from a phoneme encoder and a prompt encoder. Adversarial consistency training uses pre-trained speech models as discriminators, allowing the LCM to be trained efficiently from scratch without a diffusion teacher. Because synthesis requires only one or two network evaluations, the number of sampling steps is constant, O(1), irrespective of the input sequence length, which is the source of the system's large efficiency gains at inference.
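As a rough illustration of the training recipe described above, the sketch below combines a consistency loss (predictions at two adjacent noise levels on the same trajectory should agree) with an adversarial loss whose discriminator sits on top of a pre-trained speech model. All names, signatures, and the exact loss forms are assumptions for exposition, not FlashSpeech's published objective.

```python
import torch
import torch.nn.functional as F

def adversarial_consistency_losses(f_theta, f_theta_ema, discriminator,
                                   x0, cond, sigma_hi, sigma_lo):
    """One hypothetical training step for adversarial consistency training.

    f_theta / f_theta_ema: online and EMA ("target") consistency functions.
    discriminator: a scoring head on top of a frozen pre-trained speech model
    that rates latents (or decoded audio) as real vs. generated.
    """
    noise = torch.randn_like(x0)
    x_hi = x0 + sigma_hi * noise                  # same noise, two adjacent levels
    x_lo = x0 + sigma_lo * noise

    pred_hi = f_theta(x_hi, sigma_hi, cond)
    with torch.no_grad():
        pred_lo = f_theta_ema(x_lo, sigma_lo, cond)

    # Consistency loss: outputs along the same trajectory should agree.
    loss_consistency = F.mse_loss(pred_hi, pred_lo)

    # Adversarial loss (non-saturating GAN form) on the one-step prediction.
    loss_adv = F.softplus(-discriminator(pred_hi, cond)).mean()

    # Discriminator loss, with the generator output detached.
    loss_disc = (F.softplus(-discriminator(x0, cond)) +
                 F.softplus(discriminator(pred_hi.detach(), cond))).mean()

    return loss_consistency, loss_adv, loss_disc
```

In this sketch the generator would minimize `loss_consistency` plus a weighted `loss_adv`, while the discriminator head minimizes `loss_disc`; the frozen speech model supplies the features that make the adversarial signal informative.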
Experimental Results
Experimental evaluations show that FlashSpeech outperforms other zero-shot speech synthesis systems in synthesis speed while maintaining comparable voice quality and speaker similarity. The system is also robust across tasks such as voice conversion, speech editing, and diverse speech sampling, which map directly onto real-world applications like virtual assistants and interactive educational content.
Practical and Theoretical Implications
Practically, FlashSpeech's speed and efficiency facilitate real-time speech synthesis applications, reducing hardware demands and operational costs markedly. Theoretically, the introduction of adversarial consistency training provides a novel way of leveraging pre-trained models for speech synthesis tasks, offering potential insights into model training optimizations across other domains of generative modeling.
Future Directions
Continuing this line of research could involve scaling the system to larger datasets and more languages, further refining its ability to capture nuances in speech. Refining the adversarial consistency training could also yield more efficient training methods or extensions to other forms of audio such as music or environmental sounds.
In conclusion, FlashSpeech sets a new precedent for speed and efficiency in speech synthesis while maintaining high standards for audio quality and speaker similarity, marking a significant step forward in the practical deployment of zero-shot speech synthesis technologies.