FastSpeech 2: Fast and High-Quality End-to-End Text to Speech
The paper "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech" presents an improved methodology for text-to-speech (TTS) synthesis by addressing limitations in the FastSpeech model. The key contributions discussed are: simplification of the training pipeline, provision of additional variance information for pitch and energy, and development of an end-to-end text-to-waveform synthesis model named FastSpeech 2s.
FastSpeech 2 Model Enhancements
The FastSpeech model is notable for its non-autoregressive TTS architecture, which generates mel-spectrograms significantly faster than autoregressive models such as Tacotron 2 and Transformer TTS. However, FastSpeech was trained through a complex teacher-student distillation pipeline: the distilled mel-spectrogram targets lost information relative to the ground truth, and the durations extracted from the teacher were not accurate enough, both of which limited voice quality. FastSpeech 2 addresses these issues through the following enhancements:
- Rather than relying on teacher-student distillation, FastSpeech 2 trains directly on ground-truth mel-spectrograms. This simplifies the training pipeline and avoids the information loss observed in distilled mel-spectrograms.
- FastSpeech 2 conditions the model on additional variance information, namely pitch, energy, and more accurate duration (extracted by forced alignment), during training. This extra information eases the one-to-many mapping problem of TTS, in which many valid speech renditions correspond to the same text, by narrowing the information gap between the input text and the target speech, and thereby improves voice quality. A minimal sketch of such a variance adaptor follows this list.
- To further improve pitch prediction, a continuous wavelet transform (CWT) decomposes the pitch contour into a pitch spectrogram, which is easier to predict accurately; the predicted spectrogram is then converted back into a contour with the inverse CWT (iCWT). This yields more precise modeling of pitch variation, which is critical for prosody in synthesized speech (see the wavelet sketch after this list).
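The summary does not include the authors' code, but the core idea of a variance adaptor can be pictured with the short PyTorch-style sketch below. The module names, hidden sizes, bin count, and the simple length regulator are illustrative assumptions, not the paper's exact implementation.

```python
# Illustrative sketch of a FastSpeech 2-style variance adaptor (not the official code).
import torch
import torch.nn as nn


class VariancePredictor(nn.Module):
    """Predicts one scalar per position (duration, pitch, or energy) from hidden states."""

    def __init__(self, hidden: int = 256, kernel: int = 3, dropout: float = 0.5):
        super().__init__()
        self.conv1 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(hidden, hidden, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(hidden)
        self.norm2 = nn.LayerNorm(hidden)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(hidden, 1)

    def forward(self, x):                                    # x: (batch, time, hidden)
        h = self.conv1(x.transpose(1, 2)).transpose(1, 2).relu()
        h = self.dropout(self.norm1(h))
        h = self.conv2(h.transpose(1, 2)).transpose(1, 2).relu()
        h = self.dropout(self.norm2(h))
        return self.proj(h).squeeze(-1)                      # (batch, time)


def length_regulate(x, durations):
    """Repeat each phoneme's hidden state by its integer duration in frames."""
    out = [x[b].repeat_interleave(durations[b], dim=0) for b in range(x.size(0))]
    return nn.utils.rnn.pad_sequence(out, batch_first=True)


class VarianceAdaptor(nn.Module):
    """Adds duration, pitch, and energy information to the encoder output."""

    def __init__(self, hidden: int = 256, n_bins: int = 256):
        super().__init__()
        self.duration_predictor = VariancePredictor(hidden)
        self.pitch_predictor = VariancePredictor(hidden)
        self.energy_predictor = VariancePredictor(hidden)
        # Quantize pitch/energy values into bins and look up learned embeddings.
        self.pitch_embedding = nn.Embedding(n_bins, hidden)
        self.energy_embedding = nn.Embedding(n_bins, hidden)
        self.n_bins = n_bins

    def forward(self, x, durations=None, pitch=None, energy=None):
        log_dur_pred = self.duration_predictor(x)
        if durations is None:        # inference: fall back to predicted durations
            durations = torch.clamp(torch.round(torch.exp(log_dur_pred) - 1), min=0).long()
        x = length_regulate(x, durations)

        pitch_pred = self.pitch_predictor(x)
        energy_pred = self.energy_predictor(x)
        pitch = pitch if pitch is not None else pitch_pred   # ground truth during training
        energy = energy if energy is not None else energy_pred
        # Values assumed pre-normalized to roughly [0, 1) before binning.
        pitch_ids = torch.clamp((pitch * self.n_bins).long(), 0, self.n_bins - 1)
        energy_ids = torch.clamp((energy * self.n_bins).long(), 0, self.n_bins - 1)
        x = x + self.pitch_embedding(pitch_ids) + self.energy_embedding(energy_ids)
        return x, log_dur_pred, pitch_pred, energy_pred


adaptor = VarianceAdaptor()
enc = torch.randn(2, 12, 256)                  # fake encoder output (batch, phonemes, hidden)
dur = torch.randint(1, 6, (2, 12))             # ground-truth durations in frames
out, log_dur, pitch_hat, energy_hat = adaptor(enc, durations=dur)
```

During training the predictors are supervised against ground-truth duration, pitch, and energy; at inference the predicted values are used instead, which is what closes the information gap described above.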
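As a rough illustration of the pitch decomposition, the NumPy snippet below converts a pitch contour into a "pitch spectrogram" with a continuous wavelet transform and then crudely recombines the scales. The Mexican-hat (Ricker) wavelet, the ten dyadic scales, and the reconstruction weights are assumptions for illustration, not the paper's exact recipe.

```python
# Sketch: pitch contour -> CWT "pitch spectrogram" -> rough reconstruction.
import numpy as np


def ricker(points: int, width: float) -> np.ndarray:
    """Mexican-hat (Ricker) mother wavelet sampled at `points` positions."""
    t = np.arange(points) - (points - 1) / 2.0
    a = 2.0 / (np.sqrt(3.0 * width) * np.pi ** 0.25)
    return a * (1.0 - (t / width) ** 2) * np.exp(-(t ** 2) / (2.0 * width ** 2))


def cwt(signal: np.ndarray, scales: np.ndarray) -> np.ndarray:
    """Continuous wavelet transform: one band-passed copy of the signal per scale."""
    out = np.empty((len(scales), len(signal)))
    for i, s in enumerate(scales):
        wavelet = ricker(min(10 * int(s), len(signal)), s)
        out[i] = np.convolve(signal, wavelet, mode="same")
    return out


# A synthetic, normalized log-F0 contour standing in for real pitch data.
frames = 1000
f0 = np.sin(np.linspace(0, 12 * np.pi, frames)) + 0.1 * np.random.randn(frames)
f0 = (f0 - f0.mean()) / (f0.std() + 1e-8)

scales = 2.0 ** np.arange(1, 11)              # ten dyadic scales, fine to coarse
pitch_spectrogram = cwt(f0, scales)           # (10, frames): the prediction target

# Crude "inverse": weighted sum of the scale components. The model would predict
# `pitch_spectrogram` and apply this kind of recomposition (iCWT) at inference.
weights = scales ** -0.5
f0_recon = (weights[:, None] * pitch_spectrogram).sum(axis=0)
f0_recon = (f0_recon - f0_recon.mean()) / (f0_recon.std() + 1e-8)
print("correlation with original contour:", np.corrcoef(f0, f0_recon)[0, 1])
```

The point of the decomposition is that each scale captures pitch variation at a different temporal resolution, which is easier for the model to predict than the raw contour.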
Introduction of FastSpeech 2s
FastSpeech 2s builds upon FastSpeech 2 by pushing the model towards fully end-to-end text-to-waveform generation. This model bypasses intermediate mel-spectrogram generation and directly outputs waveforms from text. Key aspects of FastSpeech 2s include:
- A waveform decoder with a WaveNet-like structure that generates speech waveforms directly from the hidden text representations after they have been adapted with variance information.
- Adversarial training in the waveform decoder, which forces the model to implicitly recover phase information and spectral detail that are difficult to predict directly.
- Use of the mel-spectrogram decoder as an auxiliary branch that aids text feature extraction during training; it is discarded at inference to preserve the efficiency of the end-to-end path (a toy sketch of this two-branch arrangement follows the list).
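The role of the auxiliary mel-spectrogram decoder can be pictured with the toy PyTorch module below: one frame-level hidden sequence feeds both a mel branch (training only) and a waveform branch. The layer shapes, the upsampling factor, and the transposed-convolution stack standing in for the WaveNet-like decoder are illustrative assumptions.

```python
# Toy sketch of the FastSpeech 2s idea: shared hidden states feed an auxiliary
# mel-spectrogram decoder (training only) and a waveform decoder (always used).
import torch
import torch.nn as nn

HOP = 256          # assumed frames-to-samples upsampling factor (8 * 8 * 4)
HIDDEN = 256
N_MELS = 80


class Text2Wav(nn.Module):
    def __init__(self):
        super().__init__()
        # Auxiliary branch: predicts mel-spectrograms to guide text feature learning.
        self.mel_decoder = nn.Sequential(
            nn.Linear(HIDDEN, HIDDEN), nn.ReLU(), nn.Linear(HIDDEN, N_MELS)
        )
        # Main branch: upsamples frame-level hidden states to waveform samples.
        # (A simple stand-in for the paper's WaveNet-like dilated-convolution decoder.)
        self.waveform_decoder = nn.Sequential(
            nn.ConvTranspose1d(HIDDEN, 64, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(64, 32, kernel_size=16, stride=8, padding=4),
            nn.LeakyReLU(0.1),
            nn.ConvTranspose1d(32, 1, kernel_size=8, stride=4, padding=2),
            nn.Tanh(),
        )

    def forward(self, hidden, training: bool = True):
        # `hidden`: (batch, frames, HIDDEN) states after the variance adaptor.
        wav = self.waveform_decoder(hidden.transpose(1, 2)).squeeze(1)
        # In the paper, an adversarial loss on the waveform helps recover phase;
        # the discriminator is omitted in this sketch.
        if training:
            mel = self.mel_decoder(hidden)       # auxiliary target, dropped at inference
            return wav, mel
        return wav


model = Text2Wav()
hidden = torch.randn(2, 100, HIDDEN)             # fake variance-adapted hidden states
wav, mel = model(hidden, training=True)
print(wav.shape, mel.shape)                      # (2, 25600) samples and (2, 100, 80) mel frames
```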
Experimental Results and Implications
The experimental results validate the efficacy of FastSpeech 2 and 2s in terms of quality, speed, and robustness. Key outcomes include:
- FastSpeech 2 achieves a 3x faster training pipeline compared to the original FastSpeech while maintaining or exceeding the voice quality benchmarks set by autoregressive models.
- MOS evaluations indicate that FastSpeech 2 can outperform autoregressive models like Tacotron 2 and Transformer TTS.
- FastSpeech 2s achieves even faster inference speed due to its streamlined end-to-end architecture.
In practical terms, these advancements have notable implications. The reduced training time and increased inference speed make FastSpeech 2 and 2s more viable for real-time applications. The enhanced pitch and energy modeling opens opportunities for more nuanced and expressive speech synthesis, potentially advancing applications in virtual assistants, dubbing systems, and interactive voice response systems.
Speculative Future Directions
Future research could focus on several potential improvements. First, removing the dependence on external pitch-extraction and alignment tools would make the pipeline fully end-to-end. Additionally, incorporating more variance information such as emotion, style, and multi-speaker characteristics could improve the versatility and adaptability of the models. Reducing model size and computational cost could also make these models more practical for deployment on edge devices.
Ultimately, FastSpeech 2 and FastSpeech 2s mark significant strides in the field of non-autoregressive TTS, introducing efficient methodologies poised to push the boundaries of high-fidelity speech synthesis.