Parallel WaveNet: Fast High-Fidelity Speech Synthesis
The paper "Parallel WaveNet: Fast High-Fidelity Speech Synthesis" presents a significant advancement in the domain of neural network-based speech synthesis by introducing a method termed Probability Density Distillation. This method addresses the critical issue of the slow generation speed inherent in the original WaveNet architecture, enabling high-quality speech synthesis at real-time speeds.
Introduction
WaveNet, an autoregressive model, set a new benchmark for speech synthesis quality, substantially narrowing the gap between synthetic and natural human speech. Its sequential, sample-by-sample generation, however, cannot exploit the parallelism of modern hardware, posing a substantial barrier to real-time applications.
This paper proposes a novel approach to overcome these limitations by distilling WaveNet into a parallel feed-forward neural network, achieving comparable output quality while significantly accelerating audio sample generation.
WaveNet Architecture
WaveNet models the joint distribution of raw audio autoregressively, factorizing it into a product of per-sample conditionals. Stacks of causal, dilated convolutions grow the receptive field exponentially with depth, letting the network capture long-range temporal dependencies, and because every sample of a training utterance is already observed, training is fully parallel. Generation, however, remains sequential: each sample must be produced before the next can be computed.
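To make the convolutional mechanism concrete, the sketch below builds a causal, dilated convolution stack in PyTorch. It is a minimal illustration rather than the paper's configuration: the channel count, kernel size, and number of layers are assumptions chosen for clarity.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalDilatedConv1d(nn.Module):
    """A 1-D convolution whose output at time t sees only inputs at times <= t.

    Causality is enforced by left-padding the input by
    (kernel_size - 1) * dilation, so no future samples leak in.
    """
    def __init__(self, channels, kernel_size=2, dilation=1):
        super().__init__()
        self.left_pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

    def forward(self, x):                    # x: (batch, channels, time)
        x = F.pad(x, (self.left_pad, 0))     # pad the time axis on the left only
        return self.conv(x)

# Doubling the dilation at each layer grows the receptive field
# exponentially: 10 layers with kernel size 2 cover 1024 timesteps.
stack = nn.Sequential(*[CausalDilatedConv1d(64, 2, 2 ** i) for i in range(10)])
y = stack(torch.randn(1, 64, 16000))         # one second of 16 kHz features
```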
Parallel WaveNet and Inverse-Autoregressive Flows (IAFs)
The core innovation of the paper is the adaptation of inverse-autoregressive flows (IAFs) to speech synthesis. An IAF transforms a sequence of random noise into audio using an autoregressive network that conditions on the noise rather than on previously generated samples; because the entire noise sequence is known up front, every output timestep can be computed in parallel. IAFs are thus the mirror image of WaveNet: sampling is parallel, but evaluating the likelihood of arbitrary data is sequential, which makes direct maximum-likelihood training impractical. The goal is therefore to combine the rapid sampling of IAFs with the tractable training of an autoregressive WaveNet.
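The sketch below illustrates one affine IAF step and why sampling parallelizes; `scale_net` and `shift_net` are hypothetical stand-ins for causal networks (such as the dilated stack above), not the paper's API.

```python
import torch

def iaf_sample(z, scale_net, shift_net):
    """One affine inverse-autoregressive flow step: x_t = z_t * s_t + mu_t.

    z:         (batch, time) noise, all of it drawn before any computation.
    scale_net, shift_net: causal networks whose output at time t depends
               only on z[:, :t] (hypothetical stand-ins).

    Because each x_t depends on the *known* noise z_{<=t} rather than on
    previously generated samples, every timestep is computed in a single
    parallel forward pass; no sequential loop is needed.
    """
    s = scale_net(z)    # s_t  = s(z_{<t})
    mu = shift_net(z)   # mu_t = mu(z_{<t})
    return z * s + mu

# Sampling: draw all the noise at once, then run one pass per flow step.
z = torch.randn(1, 16000)
# x = iaf_sample(z, scale_net, shift_net)   # nets omitted in this sketch
```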
Probability Density Distillation
Crucially, the authors introduce Probability Density Distillation, a new technique for transferring knowledge from a pre-trained autoregressive WaveNet (the teacher) to a parallel feed-forward IAF network (the student). The student is trained to minimize the Kullback-Leibler (KL) divergence between its distribution and the teacher's, which decomposes into the cross-entropy with the teacher minus the student's own entropy. Because the teacher can score a complete waveform in a single parallel pass, training avoids any sequential sampling loop, and the student learns to approximate the teacher's output distribution, preserving synthesis quality.
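A rough Monte-Carlo sketch of this loss follows. It assumes a logistic output distribution for the student (whose per-timestep entropy has the closed form log s + 2) and hypothetical `student` and `teacher` interfaces, so it illustrates the idea rather than reproducing the authors' implementation.

```python
import torch

def distillation_loss(z, student, teacher):
    """Estimate KL(P_student || P_teacher) = H(P_S, P_T) - H(P_S).

    `student(z)` is assumed to return the generated waveform x together
    with the per-timestep log-scales of its output distribution, and
    `teacher.log_prob(x)` the frozen teacher's per-sample log-likelihood;
    both interfaces are illustrative.
    """
    x, log_s = student(z)                    # parallel IAF sample from the student
    # Cross-entropy term H(P_S, P_T): score the student's own samples
    # under the teacher's autoregressive likelihood (one parallel pass).
    cross_entropy = -teacher.log_prob(x).mean()
    # Entropy term H(P_S): for a logistic with scale s the entropy is
    # log s + 2, so it is available in closed form with no sampling.
    entropy = (log_s + 2.0).mean()
    return cross_entropy - entropy
```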
Experimental Results
Speed and Quality
Experimental evaluations showed that the distilled model retains the high fidelity of the original WaveNet while sampling roughly three orders of magnitude faster: the parallel model exceeded 500,000 timesteps per second on an NVIDIA P100 GPU, compared to 172 timesteps per second for the original autoregressive WaveNet.
Mean Opinion Scores (MOS)
Subjective evaluation using Mean Opinion Scores (MOS) confirmed that the distilled and original WaveNet models are perceived as equivalent in quality. The distilled model also matched or exceeded competitive baselines, including parametric and concatenative systems; for instance, it achieved a MOS of 4.41 versus 3.67 for an LSTM-RNN parametric system.
Multi-Speaker and Cross-Language Evaluation
The paper also evaluated multi-speaker and cross-language synthesis. Conditioned on speaker IDs, a single parallel WaveNet generated multiple voices while sustaining high quality across speakers and languages (e.g., English and Japanese), demonstrating the model's versatility in diverse settings.
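A minimal sketch of such global conditioning is shown below, assuming a learned per-speaker embedding added to the hidden activations; the module name and exact placement are illustrative assumptions.

```python
import torch
import torch.nn as nn

class SpeakerConditioning(nn.Module):
    """Global conditioning on a speaker ID (illustrative sketch).

    A learned embedding for each speaker is broadcast across time and
    added to the hidden activations, steering a single shared model
    toward a particular voice.
    """
    def __init__(self, num_speakers, channels):
        super().__init__()
        self.embed = nn.Embedding(num_speakers, channels)

    def forward(self, h, speaker_id):        # h: (batch, channels, time)
        e = self.embed(speaker_id)           # (batch, channels)
        return h + e.unsqueeze(-1)           # broadcast over the time axis

cond = SpeakerConditioning(num_speakers=100, channels=64)
h = cond(torch.randn(1, 64, 16000), torch.tensor([3]))  # speaker #3
```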
Implications and Future Directions
The introduction of Probability Density Distillation and parallel WaveNet has profound implications for the deployment of neural speech synthesis in practical applications. Real-time synthesis is now feasible without compromising quality, paving the way for broad adoption in virtual assistants, automated announcements, and more.
Future research may further optimize the distillation process, improve model scalability, and adapt the technique to other domains requiring fast, high-fidelity generation, such as music and other forms of audio. Integrating more sophisticated conditioning mechanisms could also give finer-grained control over synthesized speech characteristics.
Conclusion
This paper successfully addresses the critical challenge of slow sample generation in the original WaveNet architecture by introducing Probability Density Distillation. The proposed parallel WaveNet model retains high-fidelity speech synthesis capabilities while achieving real-time generation speeds. This advancement not only enhances the practicality of neural network-based speech synthesis but also establishes a robust framework for future development in real-time audio generation technologies.