SpeedySpeech: Efficient Neural Speech Synthesis (2008.03802v1)

Published 9 Aug 2020 in eess.AS, cs.CL, cs.LG, and cs.SD

Abstract: While recent neural sequence-to-sequence models have greatly improved the quality of speech synthesis, there has not been a system capable of fast training, fast inference and high-quality audio synthesis at the same time. We propose a student-teacher network capable of high-quality faster-than-real-time spectrogram synthesis, with low requirements on computational resources and fast training time. We show that self-attention layers are not necessary for generation of high quality audio. We utilize simple convolutional blocks with residual connections in both student and teacher networks and use only a single attention layer in the teacher model. Coupled with a MelGAN vocoder, our model's voice quality was rated significantly higher than Tacotron 2. Our model can be efficiently trained on a single GPU and can run in real time even on a CPU. We provide both our source code and audio samples in our GitHub repository.

Citations (38)

Summary

  • The paper presents a novel teacher-student framework that simplifies traditional TTS architectures by eliminating self-attention in the student network.
  • The model achieves synthesis 49× faster than real time on GPU and 5× on CPU while delivering superior user-rated audio quality.
  • It introduces a data augmentation technique to mitigate sequential error propagation, enhancing overall training robustness and stability.

SpeedySpeech: Efficient Neural Speech Synthesis

The paper presents SpeedySpeech, a neural text-to-speech (TTS) system designed to deliver fast training, fast inference, and high-quality audio at the same time. Traditional sequence-to-sequence models such as Tacotron 2, though successful in generating high-quality speech, demand significant computational resources and lengthy training and inference times. SpeedySpeech mitigates these limitations with a streamlined, almost entirely convolutional architecture that offers both efficiency and quality.

Model Architecture

SpeedySpeech employs a teacher-student framework reminiscent of FastSpeech, but with several critical differences. The teacher network is responsible for phoneme duration extraction: a deep convolutional network with a single attention layer, built on the convolutional blocks of Deep Voice 3 and DCTTS, aligns phonemes with their corresponding spectrogram frames. The student network, in contrast, is fully convolutional and non-autoregressive: it predicts per-phoneme durations, expands the encoded phonemes accordingly, and decodes the expanded sequence into a mel-scale spectrogram.
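
The core of the student's non-autoregressive design is this expansion of phoneme encodings by their predicted durations, in the style of FastSpeech's length regulator. The snippet below is a minimal sketch of that step; the function name, shapes, and values are illustrative assumptions, not the authors' code.

```python
import torch

def expand_by_durations(encodings: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Repeat each phoneme encoding along the time axis according to its
    predicted duration, yielding one row per spectrogram frame.
    encodings: (num_phonemes, channels); durations: (num_phonemes,) integer frames."""
    return torch.repeat_interleave(encodings, durations, dim=0)

# Three phonemes with 8-dimensional encodings lasting 2, 3, and 1 frames:
enc = torch.randn(3, 8)
dur = torch.tensor([2, 3, 1])
frames = expand_by_durations(enc, dur)
print(frames.shape)  # torch.Size([6, 8]) -- six frames for the decoder
```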

A distinctive aspect of SpeedySpeech is the omission of self-attention layers in the student network. Such layers are frequently treated as integral to similar models for order induction and feature aggregation, but the authors show they are unnecessary for generating high-quality synthesized speech. This simplification reduces computational complexity without compromising output quality.
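
As a concrete illustration of the kind of layer the model relies on instead of self-attention, the block below applies a 1D convolution with a residual connection. The channel count, kernel size, and normalization choice here are assumptions made for the sketch, not the paper's exact hyperparameters.

```python
import torch
import torch.nn as nn

class ResidualConvBlock(nn.Module):
    """Minimal residual 1D-convolution block (illustrative hyperparameters)."""

    def __init__(self, channels: int, kernel_size: int = 3, dilation: int = 1):
        super().__init__()
        # 'Same' padding keeps the time dimension fixed so the residual
        # addition lines up frame-for-frame.
        padding = (kernel_size - 1) * dilation // 2
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              dilation=dilation, padding=padding)
        self.norm = nn.BatchNorm1d(channels)
        self.act = nn.ReLU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, time)
        return x + self.norm(self.act(self.conv(x)))
```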

Performance Evaluation

The results presented in the paper indicate significant improvements over existing benchmarks. Paired with a MelGAN vocoder, SpeedySpeech's voice quality was rated significantly higher than Tacotron 2's in a MUSHRA-like listening test.

In terms of computational performance, the model exhibits impressive training and inference speed. Training completes on a single 8GB GPU within 40 hours, and the model generates speech in real time on both GPU and CPU. Notably, SpeedySpeech synthesizes audio 49 times faster than real time on a GPU and 5 times faster than real time on a CPU.
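
A convenient way to read these numbers is as a real-time factor, the ratio of audio duration to synthesis time. The timings below are hypothetical values chosen only to reproduce the reported factors.

```python
def real_time_factor(audio_seconds: float, synthesis_seconds: float) -> float:
    """Seconds of audio produced per second of compute; >1 means faster
    than real time."""
    return audio_seconds / synthesis_seconds

# Hypothetical timings consistent with the reported speed-ups:
print(real_time_factor(10.0, 10.0 / 49))  # 49.0 -> 49x real time (GPU)
print(real_time_factor(10.0, 10.0 / 5))   # 5.0  -> 5x real time (CPU)
```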

Contributions and Implications

The paper makes several noteworthy contributions:

  1. Simplification of the FastSpeech teacher-student architecture, optimizing it for speed and model stability.
  2. Elimination of self-attention mechanisms in the student network, contrary to previous assumptions about their necessity, simplifying the model further and enhancing training efficiency.
  3. Introduction of a data augmentation technique to bolster the teacher network's tolerance to sequential error propagation (a sketch of one possible form appears after this list).
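
The summary above does not spell out the augmentation, so the sketch below shows one plausible reading under stated assumptions: the teacher's input spectrogram is degraded with Gaussian noise and randomly substituted frames, so that training conditions resemble the imperfect previous frames the autoregressive teacher sees at inference. The noise level and replacement probability are invented for illustration.

```python
import torch

def augment_spectrogram(spec: torch.Tensor,
                        noise_std: float = 0.01,
                        replace_prob: float = 0.2) -> torch.Tensor:
    """Degrade a (time, mel_bins) spectrogram fed to the teacher during
    training; noise_std and replace_prob are assumed values, not the paper's."""
    noisy = spec + noise_std * torch.randn_like(spec)      # additive Gaussian noise
    num_frames = spec.size(0)
    mask = torch.rand(num_frames) < replace_prob           # frames to corrupt
    random_idx = torch.randint(0, num_frames, (num_frames,))
    noisy[mask] = spec[random_idx[mask]]                   # substitute random frames
    return noisy
```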

Through these innovations, SpeedySpeech represents a notable advancement in the efficient synthesis of high-quality speech. Its ability to function effectively on systems with limited computational resources has significant practical implications. The model could expand the accessibility of advanced speech synthesis technologies and support real-time applications on less powerful hardware.

Future Directions

Looking forward, the authors suggest extending SpeedySpeech to accommodate multi-speaker datasets. Such an expansion could further broaden the model's applicability, particularly in scenarios that require diverse voice outputs.

In conclusion, SpeedySpeech not only addresses prevailing challenges in the field of TTS but also offers a pathway for future innovations. Its contributions underscore the potential for utilizing simpler, non-autoregressive architectures to achieve efficient and high-quality speech synthesis, potentially impacting both theoretical explorations and practical deployments in AI-driven voice technology.
