End-to-End Adversarial Text-to-Speech
The paper "End-to-End Adversarial Text-to-Speech" proposes an innovative approach towards synthesizing human-like speech from text or phoneme inputs. The authors present a model named EATS (End-to-End Adversarial Text-to-Speech), which addresses several limitations of traditional text-to-speech (TTS) systems that involve multi-stage pipelines requiring significant amounts of supervision and complex sequential training. This research introduces the concept of adversarial training in the domain of TTS to develop a model that is both efficient and able to produce high-quality audio outputs with less supervision.
Methodology
The authors detail an adversarial approach in which the generator, a feed-forward neural network, maps input sequences of characters or phonemes to raw audio waveforms at a 24 kHz sample rate. It does so through two main components:
- Aligner: Computes per-token representations with a stack of dilated convolutions, predicts a length for each token, and interpolates the token representations into audio-aligned features at 200 Hz (a sketch of this interpolation follows the list).
- Decoder: Upsamples the 200 Hz aligner features by a factor of 120 using 1D convolutions to produce the 24 kHz waveform.
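The distinctive piece here is the aligner's differentiable monotonic interpolation: predicted token lengths are cumulatively summed to give each token a center position, and every output timestep takes a softmax-weighted average of token features based on its distance to those centers. Below is a minimal NumPy sketch of this idea; the function name, shapes, and temperature value are illustrative assumptions, not the authors' code.

```python
# Minimal sketch of EATS-style monotonic interpolation in the aligner.
# Names and the temperature value are illustrative, not from the paper's code.
import numpy as np

def align(token_feats, token_lengths, out_steps, temperature=10.0):
    """token_feats: (N, D) per-token representations (e.g. from dilated convs).
    token_lengths: (N,) non-negative predicted lengths, in output timesteps.
    Returns (out_steps, D) audio-aligned features at the low frame rate."""
    ends = np.cumsum(token_lengths)          # end position of each token
    centers = ends - 0.5 * token_lengths     # center position of each token
    t = np.arange(out_steps)[:, None]        # output timestep grid, shape (T, 1)
    # Softmax over tokens of a squared-distance kernel: each output step is a
    # convex combination of token features, weighted by proximity to centers.
    logits = -((t - centers[None, :]) ** 2) / temperature
    weights = np.exp(logits - logits.max(axis=1, keepdims=True))
    weights /= weights.sum(axis=1, keepdims=True)
    return weights @ token_feats             # (T, D)

# Example: 3 tokens, 8-dim features, predicted total length of 12 steps.
feats = np.random.randn(3, 8)
lengths = np.array([3.0, 5.0, 4.0])
aligned = align(feats, lengths, out_steps=int(lengths.sum()))
print(aligned.shape)  # (12, 8)
```

Because the interpolation is a smooth function of the predicted lengths, the alignment is learned end-to-end from the training losses rather than supplied by an external forced aligner.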
Adversarial feedback comes from random window discriminators, which operate on randomly cropped windows of the raw waveform, and from a spectrogram discriminator that operates on the mel-spectrogram domain. This adversarial setup pushes the generator toward realism, markedly improving the naturalness of the synthesized speech. In addition, the loss function integrates a dynamic time warping (DTW) based spectrogram prediction loss, which tolerates the temporal variability of speech while still enforcing alignment with the input conditioning.
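To see what the DTW prediction loss buys, consider the dynamic program below, which scores generated against ground-truth mel-spectrogram frames over monotonic alignments, charging a fixed penalty each time the path warps. This is a hard-minimum sketch for intuition only; the paper uses a soft, differentiable relaxation, and the frame distance and penalty value here are illustrative assumptions.

```python
# Sketch of a DTW-style spectrogram prediction loss (hard-minimum variant).
import numpy as np

def dtw_loss(pred, target, warp_penalty=1.0):
    """pred, target: (T, M) arrays of mel-spectrogram frames. Returns the
    minimum accumulated frame distance over monotonic alignments, with a
    fixed penalty whenever the path repeats a frame on either side."""
    Tp, Tt = len(pred), len(target)
    cost = np.full((Tp + 1, Tt + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, Tp + 1):
        for j in range(1, Tt + 1):
            d = np.abs(pred[i - 1] - target[j - 1]).mean()  # L1 frame distance
            cost[i, j] = d + min(
                cost[i - 1, j - 1],             # advance both (no warp)
                cost[i - 1, j] + warp_penalty,  # repeat target frame
                cost[i, j - 1] + warp_penalty,  # repeat predicted frame
            )
    return cost[Tp, Tt] / max(Tp, Tt)
```

The warp penalty discourages excessive stretching, so the loss rewards outputs whose pacing matches the reference without demanding frame-exact timing.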
Results and Discussions
The EATS model achieves a mean opinion score (MOS) of 4.083, approaching the benchmarks set by state-of-the-art models such as Tacotron 2 and GAN-TTS while using far weaker supervision. Ablation studies underscore the contribution of individual components, such as the adversarial discriminators and the dynamic time warping loss, showing that each is necessary for the model's final quality.
One of the most noteworthy aspects is speed: the model generates speech roughly two orders of magnitude faster than real time on modern hardware (NVIDIA V100 GPU, Google Cloud TPU v3), i.e., about a second of 24 kHz audio in roughly 10 ms, positioning EATS as a practical option for real-world applications.
Implications and Future Directions
The implications of this research are both theoretical and practical. Theoretically, the paper demonstrates that adversarial learning can be integrated effectively into TTS systems, and that end-to-end learning is feasible without the burden of complex multi-stage pipelines.
Practically, the EATS model points toward generating synthetic speech more efficiently and accurately with reduced reliance on extensive supervision. As training datasets grow and computational capabilities improve, further refinement of end-to-end models like EATS could raise the quality of text-to-speech synthesis and extend its applicability across languages and dialects.
The methodologies and findings laid out in this paper open avenues for future work, both on improved model architectures and on evaluation metrics that better reflect human perception, driving the continued evolution of TTS technology.