Overview of Grad-TTS: A Diffusion Probabilistic Model for Text-to-Speech
Grad-TTS represents a significant contribution to the field of text-to-speech (TTS) synthesis by introducing a diffusion probabilistic model as a compelling alternative for generating acoustic features. Its core innovation is a score-based decoder that produces mel-spectrograms in the style of denoising diffusion probabilistic models (DDPM): noise is gradually transformed into speech features, while Monotonic Alignment Search (MAS) aligns the encoder's text representations with those features.
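MAS, introduced with Glow-TTS, is a dynamic program that finds the most likely monotonic, surjective mapping from text tokens to mel frames. Below is a minimal NumPy sketch of the search; the function name, array layout, and likelihood inputs are illustrative, not drawn from the Grad-TTS codebase.

```python
import numpy as np

def monotonic_alignment_search(log_p):
    """Most-likely monotonic alignment of T text tokens to S mel frames.

    log_p: (T, S) array; log_p[i, j] is the log-likelihood of frame j under
    token i's predicted distribution. Requires S >= T, since every token
    must cover at least one frame.
    Returns an (S,) array mapping each frame index to a token index.
    """
    T, S = log_p.shape
    # Q[i, j]: best total score over frames 0..j with the path on token i.
    Q = np.full((T, S), -np.inf)
    Q[0, 0] = log_p[0, 0]
    for j in range(1, S):
        for i in range(min(j + 1, T)):           # token index cannot exceed frame index
            stay = Q[i, j - 1]                                   # token i continues
            advance = Q[i - 1, j - 1] if i > 0 else -np.inf      # move to next token
            Q[i, j] = max(stay, advance) + log_p[i, j]
    # Backtrack from the final state (last token, last frame).
    alignment = np.empty(S, dtype=np.int64)
    i = T - 1
    for j in range(S - 1, 0, -1):
        alignment[j] = i
        if i > 0 and Q[i - 1, j - 1] >= Q[i, j - 1]:
            i -= 1
    alignment[0] = i
    return alignment
```

Per-token durations, which supervise the duration predictor, then follow as frame counts: `np.bincount(alignment, minlength=T)`.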
Background and Motivation
Recent advances in generative modeling for TTS have leveraged frameworks such as GANs, normalizing flows, and diffusion probabilistic models to improve synthesis speed and quality. Existing methods, however, still face limits in computational efficiency and in robustness to alignment errors. Grad-TTS addresses these issues by exploiting diffusion models' capacity to fit complex data distributions and by drawing on stochastic calculus to obtain flexible inference schemes.
Methodology
Grad-TTS casts mel-spectrogram generation as the transformation of Gaussian noise, whose terminal mean is given by text-aligned encoder outputs, into coherent acoustic features. This transformation is governed by a stochastic differential equation (SDE), allowing explicit control over the trade-off between sound quality and inference speed. The primary components of Grad-TTS are:
- Encoder and Duration Predictor: As in Glow-TTS, the encoder transforms the input text into feature representations, while the duration predictor estimates how many mel frames each token should span; MAS supplies these alignments during training, and the duration predictor stands in for it at inference time.
- Score-Based Decoder: The decoder generates mel-spectrograms with a score-based diffusion model, framing generation as a reverse diffusion process that learns to invert the forward diffusion which gradually corrupts data into noise (see the training sketch after this list).
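The forward diffusion in Grad-TTS has a closed-form Gaussian solution whose mean interpolates from the data toward the encoder output, so training reduces to score matching on noised samples. The following is a condensed PyTorch sketch of that loss under the paper's linear noise schedule; the `score_net` interface and tensor shapes are assumptions, not the reference implementation.

```python
import torch

# Linear noise schedule; beta_0 = 0.05 and beta_1 = 20 follow the paper.
BETA_0, BETA_1 = 0.05, 20.0

def diffusion_loss(score_net, x0, mu):
    """Weighted score-matching loss for one training batch.

    x0: target mel-spectrogram, shape (batch, n_mels, n_frames).
    mu: aligned encoder output of the same shape, which is also the mean
        the forward diffusion converges to.
    score_net(x_t, mu, t) is assumed to estimate the score grad log p_t(x_t).
    """
    b = x0.shape[0]
    t = torch.rand(b, device=x0.device).clamp(min=1e-5)  # t ~ U(0, 1]
    t_ = t.view(b, 1, 1)
    # Cumulative noise: the integral of beta_s from 0 to t.
    cum = BETA_0 * t_ + 0.5 * (BETA_1 - BETA_0) * t_ ** 2
    # Closed-form forward process: x_t | x0 is Gaussian with mean
    # interpolating x0 -> mu and variance lambda_t = 1 - exp(-cum).
    mean = torch.exp(-0.5 * cum) * x0 + (1.0 - torch.exp(-0.5 * cum)) * mu
    lam = 1.0 - torch.exp(-cum)
    eps = torch.randn_like(x0)
    x_t = mean + lam.sqrt() * eps
    # The conditional score is -eps / sqrt(lambda_t); weighting the squared
    # error by lambda_t yields this simple residual form.
    score = score_net(x_t, mu, t)
    return ((lam.sqrt() * score + eps) ** 2).mean()
```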
Because the number of reverse diffusion steps can be chosen freely at inference time, the model can trade synthesis quality against speed, as the sampling sketch below illustrates.
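A minimal sketch of that sampler follows, reusing the `score_net` interface and schedule assumed above: it integrates the reverse-time SDE with a fixed Euler-Maruyama scheme, starting from noise centered at the encoder output. Step count and names are illustrative.

```python
import torch

BETA_0, BETA_1 = 0.05, 20.0  # same linear schedule as in the training sketch

@torch.no_grad()
def reverse_diffusion(score_net, mu, n_steps=50):
    """Synthesize a mel-spectrogram by integrating the reverse-time SDE
    from t = 1 down to t = 0.

    mu: aligned encoder output, shape (batch, n_mels, n_frames).
    n_steps controls the quality/speed trade-off: fewer steps run faster
    but track the reverse trajectory less accurately.
    """
    h = 1.0 / n_steps
    x = mu + torch.randn_like(mu)  # terminal distribution is N(mu, I)
    for n in range(n_steps):
        t = 1.0 - n * h            # time is walked backwards
        beta_t = BETA_0 + (BETA_1 - BETA_0) * t
        t_batch = torch.full((mu.shape[0],), t, device=mu.device)
        # Reverse-SDE drift is beta_t * (0.5 * (mu - x) - score).
        drift = 0.5 * (mu - x) - score_net(x, mu, t_batch)
        noise = torch.randn_like(x)
        x = x - h * beta_t * drift + (h * beta_t) ** 0.5 * noise
    return x
```

The paper also derives a deterministic, ODE-like variant of this update; either way, `n_steps` can be dialed down (e.g., to 10) for faster inference at some cost in fidelity.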
Results
Subjective evaluations show that Grad-TTS attains Mean Opinion Scores (MOS) competitive with state-of-the-art models such as Tacotron 2, while offering substantially faster inference. Objective evaluation in terms of log-likelihood points the same way: Grad-TTS reaches better log-likelihood without requiring a more complex decoder architecture.
Implications and Future Prospects
The Grad-TTS approach provides a robust framework for TTS synthesis, offering a scalable model that can balance quality and computational requirements. Future avenues for research include:
- End-to-End TTS: Preliminary experiments suggest the approach can be extended to an end-to-end model that synthesizes waveforms directly, without a separate vocoder.
- Optimizing Diffusion Processes: Enhancing noise schedules and loss weighting could lead to refined control and further improvements in synthesis quality.
- Exploring Alternative Architectures: Investigating other architectures for encoder and decoder components could yield additional performance gains.
In conclusion, Grad-TTS stands as a promising text-to-speech model grounded in diffusion probabilistic modeling, laying a foundation for future innovation in the domain.