- The paper introduces a complete text-to-speech system in which every stage of the traditional TTS pipeline is replaced by a deep neural network model.
- Its component models achieve competitive error rates, and a WaveNet variant paired with optimized CPU/GPU inference kernels synthesizes audio faster than real time.
- The work demonstrates audio quality approaching that of human speech (when ground-truth durations and fundamental frequencies are used) and paves the way for adaptable, high-quality voice synthesis in real-world applications.
Deep Voice: Real-time Neural Text-to-Speech
The paper "Deep Voice: Real-time Neural Text-to-Speech" introduces a comprehensive text-to-speech (TTS) system implemented entirely with deep neural networks (DNNs). The work addresses the complexities associated with traditional TTS pipelines by proposing a modular framework that eschews hand-engineered features in favor of neural network models, which simplifies adaptation to new datasets and voices.
System Architecture and Components
Deep Voice comprises the following neural components, plus the optimized inference machinery that makes the system run in real time:
- Grapheme-to-Phoneme (G2P) Model: Converts text into phoneme sequences using an encoder-decoder (sequence-to-sequence) architecture built from gated recurrent unit (GRU) layers. The model achieves a phoneme error rate of 5.8% and a word error rate of 28.7% (a minimal sketch follows this list).
- Segmentation Model: Utilizes a convolutional recurrent neural network architecture with connectionist temporal classification (CTC) loss for phoneme boundary detection, achieving a phoneme pair error rate of 7%.
- Phoneme Duration and Fundamental Frequency Model: Jointly predicts phoneme duration and fundamental frequency (F0) with a GRU-based architecture, reaching a mean absolute error of 38 milliseconds for phoneme duration (see the joint-prediction sketch after this list).
- WaveNet-based Audio Synthesis Model: A variant of WaveNet whose inference has been restructured to reduce computational cost and enable faster-than-real-time synthesis. The model generates high-quality 16 kHz audio (a stripped-down residual layer is sketched after this list).
- Inference Optimization: Implements optimized CPU and GPU inference kernels, achieving up to 400x speedups over prior implementations.
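A minimal sketch of what a GRU-based encoder-decoder G2P model could look like in PyTorch is shown below; the vocabulary sizes, hidden width, single-layer encoder/decoder, and teacher-forced training step are illustrative assumptions rather than the paper's exact configuration.

```python
import torch
import torch.nn as nn

class G2P(nn.Module):
    """Sketch of a GRU-based encoder-decoder for grapheme-to-phoneme conversion."""

    def __init__(self, n_graphemes, n_phonemes, hidden=256):
        super().__init__()
        self.g_embed = nn.Embedding(n_graphemes, hidden)
        self.p_embed = nn.Embedding(n_phonemes, hidden)
        # Bidirectional GRU encoder over grapheme embeddings.
        self.encoder = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        # Unidirectional GRU decoder that emits phonemes autoregressively.
        self.decoder = nn.GRU(hidden, hidden, batch_first=True)
        self.bridge = nn.Linear(2 * hidden, hidden)   # merge encoder directions
        self.out = nn.Linear(hidden, n_phonemes)

    def forward(self, graphemes, phonemes_in):
        # graphemes: (batch, T_g); phonemes_in: (batch, T_p) teacher-forced inputs.
        _, h = self.encoder(self.g_embed(graphemes))             # h: (2, batch, hidden)
        h0 = torch.tanh(self.bridge(torch.cat([h[0], h[1]], dim=-1))).unsqueeze(0)
        dec_out, _ = self.decoder(self.p_embed(phonemes_in), h0)
        return self.out(dec_out)                                  # (batch, T_p, n_phonemes)

# Training would minimize cross-entropy between predicted and reference phonemes:
model = G2P(n_graphemes=30, n_phonemes=62)
logits = model(torch.randint(0, 30, (8, 12)), torch.randint(0, 62, (8, 10)))
loss = nn.CrossEntropyLoss()(logits.reshape(-1, 62), torch.randint(0, 62, (8 * 10,)))
```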
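The joint duration/F0 predictor can be sketched in the same spirit as a bidirectional recurrent regressor over the phoneme sequence; the output parameterization (one duration, one voicedness probability, and a short F0 contour per phoneme) and all layer sizes below are assumptions for illustration, not the paper's exact design.

```python
import torch
import torch.nn as nn

class DurationF0(nn.Module):
    """Sketch of a joint phoneme-duration / F0 predictor over a phoneme sequence."""

    def __init__(self, n_phonemes, hidden=128, f0_samples=20):
        super().__init__()
        self.embed = nn.Embedding(n_phonemes, hidden)
        self.rnn = nn.GRU(hidden, hidden, num_layers=2,
                          batch_first=True, bidirectional=True)
        # Per phoneme: 1 duration, 1 voicedness probability, f0_samples F0 values.
        self.head = nn.Linear(2 * hidden, 1 + 1 + f0_samples)

    def forward(self, phonemes):
        h, _ = self.rnn(self.embed(phonemes))           # (batch, T, 2*hidden)
        y = self.head(h)
        duration = torch.relu(y[..., 0])                # non-negative duration per phoneme
        p_voiced = torch.sigmoid(y[..., 1])             # probability the phoneme is voiced
        f0 = y[..., 2:]                                  # coarse F0 contour per phoneme
        return duration, p_voiced, f0

model = DurationF0(n_phonemes=62)
dur, voiced, f0 = model(torch.randint(0, 62, (4, 25)))
```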
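The synthesis model builds on WaveNet's dilated causal convolutions with gated activations conditioned on linguistic features. The stripped-down residual block below illustrates that structure; the channel widths, layer counts, and conditioning shape are assumptions, and the paper's output distribution and other details are omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class WaveNetLayer(nn.Module):
    """One residual block: dilated causal convolution with a gated activation."""

    def __init__(self, channels, dilation, cond_channels):
        super().__init__()
        self.dilation = dilation
        # Two output bands: one for the tanh "filter", one for the sigmoid "gate".
        self.conv = nn.Conv1d(channels, 2 * channels, kernel_size=2, dilation=dilation)
        self.cond = nn.Conv1d(cond_channels, 2 * channels, kernel_size=1)
        self.res = nn.Conv1d(channels, channels, kernel_size=1)
        self.skip = nn.Conv1d(channels, channels, kernel_size=1)

    def forward(self, x, c):
        # Left-pad so the convolution is causal (no access to future samples).
        h = self.conv(F.pad(x, (self.dilation, 0))) + self.cond(c)
        filt, gate = h.chunk(2, dim=1)
        z = torch.tanh(filt) * torch.sigmoid(gate)
        return x + self.res(z), self.skip(z)             # residual output, skip connection

# Layers are stacked with exponentially growing dilations (1, 2, 4, ...).
layers = nn.ModuleList(WaveNetLayer(64, 2 ** i, cond_channels=80) for i in range(10))
x = torch.zeros(1, 64, 1600)   # embedded audio history: 0.1 s at 16 kHz
c = torch.zeros(1, 80, 1600)   # conditioning features upsampled to the same rate
skips = torch.zeros_like(x)
for layer in layers:
    x, s = layer(x, c)
    skips = skips + s
# `skips` would feed the output network that predicts the next audio sample.
```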
Key Results and Findings
- The segmentation and grapheme-to-phoneme models successfully integrate into the end-to-end system and show competitive error rates without reliance on pre-existing systems.
- The WaveNet variant's conditioning network uses quasi-RNN (QRNN) layers for upsampling, which reduces noise artifacts and improves training time. The optimized model synthesizes audio faster than real time with no significant loss in perceptual quality relative to slower inference (a sketch of the upsampling idea follows this list).
- Mean opinion scores (MOS) collected via crowd-sourcing show that the synthesized audio closely approaches the perceptual quality of human speech when ground-truth phoneme durations and fundamental frequencies are supplied; using the model-predicted durations and frequencies lowers the MOS, identifying duration and F0 prediction as the clearest avenue for future improvement.
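To illustrate the upsampling idea referenced above, the sketch below smooths frame-rate conditioning features with a bidirectional recurrent layer and then upsamples them to the audio sample rate by repetition. A standard GRU stands in for the paper's QRNN layers (which are not in core PyTorch), and the frame/sample rates and the ordering of smoothing versus repetition are illustrative assumptions.

```python
import torch
import torch.nn as nn

class Conditioner(nn.Module):
    """Smooth frame-rate features recurrently, then upsample to the audio sample rate."""

    def __init__(self, feat_dim, hidden=64, samples_per_frame=160):  # 10 ms frames at 16 kHz
        super().__init__()
        self.samples_per_frame = samples_per_frame
        # Stand-in for the paper's bidirectional QRNN layers.
        self.rnn = nn.GRU(feat_dim, hidden, batch_first=True, bidirectional=True)

    def forward(self, frames):
        # frames: (batch, n_frames, feat_dim) at frame rate.
        smoothed, _ = self.rnn(frames)                               # (batch, n_frames, 2*hidden)
        return smoothed.repeat_interleave(self.samples_per_frame, dim=1)

cond = Conditioner(feat_dim=32)
features = torch.randn(1, 50, 32)          # 50 frames of duration/F0-derived features
print(cond(features).shape)                # torch.Size([1, 8000, 128]) -> 0.5 s at 16 kHz
```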
Implications and Future Directions
Practically, Deep Voice offers significant advancements for applications such as virtual assistants and accessibility technologies. The ability to retrain quickly on new datasets without extensive tuning lowers the barrier for diverse voice synthesis. Theoretically, the move towards a complete end-to-end neural framework indicates meaningful progress in AI-driven TTS systems.
Future directions may focus on further fusing the models to eliminate intermediate stages, creating a truly seamless sequence-to-sequence system, and on improving duration and frequency prediction. Real-time applications can also benefit from continued optimization of the inference kernels and from exploring alternative conditioning architectures beyond QRNNs.
Additionally, the insights on computational efficiency highlight the potential of extending the techniques to other generative models, such as image synthesis, marking a significant step towards real-time multimedia generation.