Better speech synthesis through scaling (2305.07243v2)

Published 12 May 2023 in cs.SD, cs.CL, and eess.AS

Abstract: In recent years, the field of image generation has been revolutionized by the application of autoregressive transformers and DDPMs. These approaches model the process of image generation as a step-wise probabilistic process and leverage large amounts of compute and data to learn the image distribution. This methodology of improving performance need not be confined to images. This paper describes a way to apply advances in the image generative domain to speech synthesis. The result is TorToise -- an expressive, multi-voice text-to-speech system. All model code and trained weights have been open-sourced at https://github.com/neonbjb/tortoise-tts.

Better Speech Synthesis Through Scaling

Overview:

The paper, "Better Speech Synthesis through Scaling" by James Betker, presents a novel approach to text-to-speech (TTS) synthesis, leveraging methodologies derived from the image generation domain. Specifically, it explores the application of autoregressive transformers and denoising diffusion probabilistic models (DDPMs) to the task of speech synthesis. The resulting system, termed TorToise, is noted for its expressive, multi-voice capabilities.

Background:

Progress in text-to-speech has traditionally been constrained by a focus on efficient, low-latency models and by the scarcity of large transcribed speech datasets. Modern TTS systems predominantly operate on MEL spectrograms, a highly compressed representation of the audio waveform, which is decoded back into audio by neural MEL inverters, commonly called vocoders.
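
To make the degree of compression concrete, the sketch below computes a MEL spectrogram with torchaudio. The hop length and number of MEL bins are common illustrative defaults, not necessarily the settings used by TorToise.

```python
# Illustrative MEL spectrogram computation with torchaudio.
# Parameter values are common defaults, not TorToise's actual settings.
import torch
import torchaudio

sample_rate = 22050
waveform = torch.randn(1, sample_rate * 5)  # stand-in for 5 seconds of audio

to_mel = torchaudio.transforms.MelSpectrogram(
    sample_rate=sample_rate,
    n_fft=1024,
    hop_length=256,  # one spectrogram frame per 256 waveform samples
    n_mels=80,       # 80 MEL frequency bins per frame
)

mel = to_mel(waveform)
print(waveform.shape)  # torch.Size([1, 110250])
print(mel.shape)       # torch.Size([1, 80, 431]): roughly 3x fewer values overall
```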

In contrast, the field of image generation, exemplified by systems like DALL-E, prioritizes output quality over latency and routinely spends substantial compute at inference time. DALL-E uses an autoregressive decoder to transform text tokens into image token sequences, though this approach is limited by the quadratic cost of self-attention and by the need to operate on discretized data. DDPMs have emerged as a robust alternative, capable of producing high-quality, diverse images by reconstructing high-dimensional continuous data from low-quality guidance signals.

Methodology:

TorToise integrates an autoregressive decoder and a DDPM to generate high-quality synthesized speech. The framework consists of four stages (a minimal sketch of how they compose follows this list):

  1. Autoregressive Decoder: Predicts a distribution over speech tokens from the input text.
  2. Contrastive Model: A CLIP-like model named CLVP (Contrastive Language-Voice Pretrained Transformer) ranks the autoregressive outputs.
  3. DDPM: Converts the selected speech tokens into high-fidelity MEL spectrograms.
  4. Vocoder: Converts the MEL spectrograms back into audio waveforms.
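
At a high level, the stages compose as in the sketch below. All names (`ar_model`, `clvp`, `ddpm`, `vocoder`, `make_conditioning`) are hypothetical placeholders, not the actual TorToise API.

```python
# Hypothetical composition of the four stages; names are placeholders,
# not the real TorToise API.
def synthesize(text, reference_clips, num_candidates=16):
    cond = make_conditioning(reference_clips)  # speaker conditioning (see below)
    # 1. Sample several candidate speech-token sequences autoregressively.
    candidates = [ar_model.sample(text, cond) for _ in range(num_candidates)]
    # 2. Keep the candidate that CLVP scores as best matching the text.
    best = max(candidates, key=lambda tokens: clvp.score(text, tokens))
    # 3. Decode the chosen tokens into a MEL spectrogram with the DDPM.
    mel = ddpm.decode(best, cond)
    # 4. Render the spectrogram as an audio waveform with the vocoder.
    return vocoder(mel)
```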

A distinctive aspect of TorToise is its speech conditioning input: MEL spectrograms computed from audio clips of the target speaker. This input lets the autoregressive and DDPM components infer the target speaker's vocal characteristics, narrowing the search space of plausible speech outputs.
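
A minimal sketch of such conditioning, assuming a hypothetical `conditioning_encoder` network: each reference clip is converted to a MEL spectrogram, encoded, and the resulting latents are averaged into a single voice summary. The real model's encoder and pooling details may differ.

```python
# Sketch of speaker conditioning; `conditioning_encoder` is hypothetical.
import torch
import torchaudio

to_mel = torchaudio.transforms.MelSpectrogram(sample_rate=22050, n_mels=80)

def make_conditioning(reference_clips):
    """Encode each reference clip's MEL spectrogram and average the latents."""
    latents = [conditioning_encoder(to_mel(clip)) for clip in reference_clips]
    return torch.stack(latents).mean(dim=0)  # one latent summarizing the voice
```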

Key Techniques:

The “TorToise Trick”: The DDPM is fine-tuned to condition on the latent representations produced by the autoregressive model rather than on the discrete speech tokens themselves, which improves the efficiency and quality of the downstream diffusion process.
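
A sketch of one fine-tuning step under this trick, with hypothetical names: the frozen autoregressive model runs in teacher-forcing mode, and the diffusion decoder is trained to condition on its hidden states instead of the discrete token codes.

```python
# Hypothetical fine-tuning step for the "TorToise Trick": the DDPM learns to
# condition on AR latents instead of discrete speech tokens.
import torch

def tortoise_trick_step(text, speech_tokens, target_mel, optimizer):
    with torch.no_grad():  # the AR model stays frozen
        # Hidden states aligned with each speech token, via teacher forcing.
        latents = ar_model.hidden_states(text, speech_tokens)
    # Standard diffusion training loss, conditioned on the richer latents.
    loss = ddpm.training_loss(target_mel, conditioning=latents)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```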

Inference Process:

  1. Decode multiple speech candidates with the autoregressive model, conditioned on the input text and reference audio.
  2. Use CLVP to score how well each candidate matches the text, and rank the candidates by this score (a reranking sketch follows this list).
  3. Convert the top-ranked candidates into MEL spectrograms with the DDPM.
  4. Use a vocoder to transform the MEL spectrograms into final audio waveforms.
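
Step 2 is essentially CLIP-style reranking. The sketch below scores candidates by cosine similarity between text and speech embeddings; `text_encoder` and `speech_encoder` are hypothetical stand-ins for CLVP's two towers.

```python
# CLIP-style reranking sketch; text_encoder/speech_encoder stand in for
# CLVP's two towers and are assumed to return 1-D embedding vectors.
import torch
import torch.nn.functional as F

def rerank(text, candidates, text_encoder, speech_encoder, k=1):
    """Return the k candidates whose embeddings best match the text."""
    t = F.normalize(text_encoder(text), dim=-1)               # (d,)
    s = torch.stack([speech_encoder(c) for c in candidates])  # (n, d)
    scores = F.normalize(s, dim=-1) @ t                       # cosine per candidate
    top = torch.topk(scores, k).indices
    return [candidates[i] for i in top.tolist()]
```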

The models were implemented and trained on a cluster of 8 NVIDIA RTX 3090 GPUs over approximately one year, using an extended dataset of 49,000 hours of scraped audio.

Experiments and Results:

The evaluation used a custom suite in which CLVP measures the closeness between real and generated samples, analogous to the FID score in image generation; speech intelligibility was assessed with a wav2vec ASR model. Although specific numerical results were not disclosed, qualitative comparisons indicate that TorToise produces highly realistic speech, outperforming previous state-of-the-art TTS systems.
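
One plausible way to implement the intelligibility check is to transcribe the generated audio with an off-the-shelf wav2vec 2.0 CTC model and compare the transcript against the prompt. The Hugging Face checkpoint below is an assumption, not necessarily the model used in the paper.

```python
# Plausible intelligibility check: greedy CTC transcription with wav2vec 2.0.
# The checkpoint choice is an assumption, not the paper's exact setup.
import torch
from transformers import Wav2Vec2ForCTC, Wav2Vec2Processor

processor = Wav2Vec2Processor.from_pretrained("facebook/wav2vec2-large-960h")
model = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-large-960h")

def transcribe(waveform_16k):
    """Transcribe a 16 kHz mono waveform (1-D numpy array or list of floats)."""
    inputs = processor(waveform_16k, sampling_rate=16000, return_tensors="pt")
    with torch.no_grad():
        logits = model(inputs.input_values).logits
    ids = torch.argmax(logits, dim=-1)
    return processor.batch_decode(ids)[0]
```

A character error rate between this transcript and the prompt text then serves as an intelligibility score.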

Implications and Future Work:

The implications of this research are manifold, impacting both theoretical and practical advancements in TTS. The generalist architectural approach, leveraging large datasets and substantial computational resources, might be extended to other digitized modalities, suggesting a broader applicability of this framework.

Future research could explore several enhancements, such as optimizing VQVAE codebook embedding dimensions, implementing relative positional encodings, and training on larger datasets with extended audio sequences. Additionally, improvements to the diffusion decoder architecture and cross-modal integrations could further elevate the system's performance.

Conclusion:

This paper marks a significant step in speech synthesis by importing methodologies from image generation. While it establishes a new benchmark in TTS realism through the TorToise system, it also opens avenues for further exploration and optimization in generative modeling for speech and, potentially, other modalities. This line of work holds promise for substantial contributions to AI-driven speech synthesis.
