High Fidelity Speech Synthesis with Adversarial Networks: An Expert Overview
The paper presents GAN-TTS, a generative adversarial network for text-to-speech (TTS) synthesis. The work brings to audio generation the advances that GANs have already delivered in image generation. Whereas prior neural audio synthesis has relied predominantly on autoregressive models such as WaveNet, which emit one sample at a time, GAN-TTS is fully adversarial and parallelisable, sidestepping the computational bottleneck inherent in sequential generation.
Architecture and Methodology
The GAN-TTS architecture pairs a conditional feed-forward generator with an ensemble of discriminators. Each discriminator evaluates realism on random windows of a different size cropped from the waveform; some are conditioned on linguistic features and judge how well the audio matches the input text, while others are unconditional and judge acoustic realism alone (a minimal sketch of this window sampling follows below). Unlike autoregressive methods, the generator is a convolutional network that produces all audio samples in parallel, which makes real-time deployment far more practical.
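To make the random-window idea concrete, here is a minimal NumPy sketch of how such windows could be cropped for a discriminator ensemble. The clip length (2 s at 24 kHz), the 200 Hz linguistic-feature rate, and the window sizes follow the paper's reported setup; the function and variable names and the feature dimensionality are illustrative assumptions, not the authors' code.

```python
import numpy as np

SAMPLE_RATE = 24_000                        # 24 kHz audio (per the paper)
CLIP_LEN = 2 * SAMPLE_RATE                  # 2-second training clips
HOP = SAMPLE_RATE // 200                    # 120 samples per 200 Hz feature frame
WINDOW_SIZES = (240, 480, 960, 1920, 3600)  # window sizes reported in the paper
FEAT_DIM = 128                              # illustrative feature dimensionality

def sample_random_window(audio, linguistic, window_size, rng):
    """Crop an aligned random window from the waveform and the
    linguistic features. Starts are snapped to feature-frame
    boundaries so the two streams stay aligned."""
    max_start = (len(audio) - window_size) // HOP
    start = int(rng.integers(0, max_start + 1)) * HOP
    audio_win = audio[start:start + window_size]
    feat_win = linguistic[start // HOP:(start + window_size) // HOP]
    return audio_win, feat_win

rng = np.random.default_rng(0)
audio = rng.standard_normal(CLIP_LEN)                          # stand-in waveform
linguistic = rng.standard_normal((CLIP_LEN // HOP, FEAT_DIM))  # stand-in features

for w in WINDOW_SIZES:
    audio_win, feat_win = sample_random_window(audio, linguistic, w, rng)
    # A conditional discriminator would score (audio_win, feat_win);
    # its unconditional counterpart would score audio_win alone.
```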
Key contributions of the paper include:
- The introduction of GAN-TTS, a GAN framework for text-conditional speech synthesis.
- An ensemble of discriminators that jointly assess general audio realism and consistency with the linguistic conditioning.
- Two quantitative metrics, Fréchet DeepSpeech Distance (FDSD) and Kernel DeepSpeech Distance (KDSD), which score generated speech using features from the DeepSpeech speech recognizer and correlate well with human evaluation scores (see the formula sketch after this list).
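Both metrics are audio analogues of the image-domain FID and KID. As a point of reference, the Fréchet variant takes the standard FID form, here written with μ and Σ denoting the mean and covariance of DeepSpeech features over real (r) and generated (g) speech:

```latex
\mathrm{FDSD}^2 = \lVert \mu_r - \mu_g \rVert_2^2
  + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

The kernel variant is, analogously to KID, an unbiased estimate of the maximum mean discrepancy between the two feature distributions.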
Empirical Validation
The authors present comprehensive evaluations of GAN-TTS, showing performance comparable to state-of-the-art systems such as WaveNet. GAN-TTS achieves a Mean Opinion Score (MOS) of 4.2, close to WaveNet's 4.4 benchmark, demonstrating that a fully adversarial, parallel model can approach autoregressive quality in producing natural-sounding speech.
Extensive ablation studies underscore the importance of architectural components, most notably the random window discriminators: removing them or weakening the ensemble measurably degrades output quality. Because each discriminator sees only a short window (at most 3,600 samples in the paper's setup) rather than the full two-second clip, the ensemble also remains cheap to evaluate, which speeds up training.
Metrics and Comparisons
The new metrics adapt the Inception-based distances used for image synthesis (FID and KID) to audio, substituting DeepSpeech recognition features for Inception features; this is a meaningful step toward standardized, automatic evaluation of text-to-speech models. The metrics provide an objective comparison framework that aligns well with the qualitative MOS assessments collected from human evaluators, and the sketch below shows how the Fréchet variant can be computed from two sets of features.
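A minimal NumPy/SciPy sketch of the Fréchet computation, assuming feature matrices have already been extracted (the DeepSpeech feature extractor itself is omitted, and all names here are illustrative rather than the authors' code):

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_gen):
    """FID-style distance between two feature sets (rows = examples).
    FDSD applies this formula to DeepSpeech features extracted from
    real and generated audio."""
    mu_r, mu_g = feats_real.mean(axis=0), feats_gen.mean(axis=0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_g = np.cov(feats_gen, rowvar=False)
    covmean = sqrtm(cov_r @ cov_g)
    if np.iscomplexobj(covmean):      # sqrtm can return tiny imaginary
        covmean = covmean.real        # parts due to numerical noise
    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r + cov_g - 2.0 * covmean))

# Toy usage with random features standing in for DeepSpeech activations:
rng = np.random.default_rng(0)
feats_real = rng.standard_normal((1000, 64))
feats_gen = rng.standard_normal((1000, 64)) + 0.1
print(frechet_distance(feats_real, feats_gen))
```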
Implications and Future Directions
The proposed approach offers a robust alternative to autoregressive TTS models, enabling faster synthesis and integration into applications where real-time audio generation is essential. It also invites further exploration of non-autoregressive networks for complex sequential tasks beyond TTS.
Future work could enhance the complexity and robustness of GAN architectures for audio, for example by introducing multi-speaker capabilities, adapting the generator to diverse acoustic environments, and extending adversarial training to unsupervised or semi-supervised settings to reduce the need for large labeled datasets.
In conclusion, GAN-TTS exemplifies the potential of GANs in the audio generation space, offering a noteworthy balance between computational efficiency and audio fidelity. This paper positions GAN-TTS as a valuable contribution to the TTS landscape and a catalyst for ongoing advancements in generative audio modeling.