
FastSpeech: Fast, Robust and Controllable Text to Speech (1905.09263v5)

Published 22 May 2019 in cs.CL, cs.LG, cs.SD, and eess.AS

Abstract: Neural network based end-to-end text to speech (TTS) has significantly improved the quality of synthesized speech. Prominent methods (e.g., Tacotron 2) usually first generate mel-spectrogram from text, and then synthesize speech from the mel-spectrogram using vocoder such as WaveNet. Compared with traditional concatenative and statistical parametric approaches, neural network based end-to-end models suffer from slow inference speed, and the synthesized speech is usually not robust (i.e., some words are skipped or repeated) and lack of controllability (voice speed or prosody control). In this work, we propose a novel feed-forward network based on Transformer to generate mel-spectrogram in parallel for TTS. Specifically, we extract attention alignments from an encoder-decoder based teacher model for phoneme duration prediction, which is used by a length regulator to expand the source phoneme sequence to match the length of the target mel-spectrogram sequence for parallel mel-spectrogram generation. Experiments on the LJSpeech dataset show that our parallel model matches autoregressive models in terms of speech quality, nearly eliminates the problem of word skipping and repeating in particularly hard cases, and can adjust voice speed smoothly. Most importantly, compared with autoregressive Transformer TTS, our model speeds up mel-spectrogram generation by 270x and the end-to-end speech synthesis by 38x. Therefore, we call our model FastSpeech.

An Overview of FastSpeech: Enhancing Text-to-Speech Systems with Parallel Processing

The paper "FastSpeech: Fast, Robust, and Controllable Text to Speech" introduces a novel approach to improving the efficiency and robustness of text-to-speech (TTS) systems using a non-autoregressive model based on a feed-forward Transformer architecture. Traditional neural network-based TTS models, such as Tacotron 2, typically operate by generating mel-spectrograms from text input in a sequential manner, subsequently converting these into speech with vocoders like WaveNet. While these autoregressive methods have shown impressive results in terms of speech quality compared to older concatenative and statistical parametric approaches, they face significant shortcomings, including slow inference speeds, errors in speech synthesis (such as word skipping and repetition), and limited control over speech characteristics like prosody and speed.

FastSpeech: Architecture and Technique

FastSpeech addresses these issues through several key innovations. The model employs a feed-forward Transformer structure that generates mel-spectrograms in parallel rather than sequentially. The foundational idea is to extract attention alignments from an autoregressive teacher model and use them as targets for phoneme duration prediction. A length regulator then expands each phoneme's hidden state according to its duration, so that the expanded sequence matches the length of the target mel-spectrogram and can be decoded in parallel.
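To obtain training targets for duration prediction, the paper selects the teacher's most diagonal encoder-decoder attention head (chosen via a "focus rate" criterion) and counts, for each phoneme, how many mel frames attend to it most strongly. Below is a minimal NumPy sketch of that counting step; the function name and array shapes are our own, not the authors' code.

```python
import numpy as np

def durations_from_attention(attn):
    """Derive per-phoneme durations from a teacher attention matrix.

    attn: [T_mel, T_phoneme] attention weights from one (assumed
    near-diagonal) encoder-decoder attention head of the teacher.
    Each mel frame is assigned to its most-attended phoneme; a
    phoneme's duration is the number of frames assigned to it.
    """
    t_mel, t_phoneme = attn.shape
    frame_to_phoneme = attn.argmax(axis=1)   # phoneme index per mel frame
    durations = np.bincount(frame_to_phoneme, minlength=t_phoneme)
    return durations                         # entries sum to T_mel
```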

Key components in FastSpeech's architecture, as per the paper, include:

  • Feed-Forward Transformer: A stack of blocks combining self-attention with 1D convolutional layers, applied on both the phoneme side and the mel-spectrogram side, which performs the phoneme-to-spectrogram transformation in parallel rather than sequentially.
  • Length Regulator: It expands each phoneme's hidden state according to its duration so the sequence matches the target mel-spectrogram length; scaling the durations allows smooth adjustment of voice speed and some prosodic control (see the sketch after this list).
  • Duration Predictor: Trained on phoneme durations extracted from the teacher model's attention alignments, this component predicts each phoneme's duration at inference time, replacing the attention mechanism as the source of alignment.
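The latter two components can be summarized in a short PyTorch sketch. This is a minimal illustration, not the authors' implementation: the hyperparameters (model width, kernel size, dropout) and the speed-factor convention are assumptions, though the paper does predict durations in the log domain with an MSE loss against the teacher-extracted targets.

```python
import torch
import torch.nn as nn

def length_regulator(hidden, durations, alpha=1.0):
    """Expand phoneme hidden states to mel-frame resolution.

    hidden:    [T_phoneme, d] phoneme-level hidden states
    durations: [T_phoneme] integer frame counts per phoneme
    alpha:     speed factor (illustrative convention: alpha scales
               durations, so alpha > 1 yields slower speech)
    """
    repeats = (durations.float() * alpha).round().long().clamp(min=0)
    return torch.repeat_interleave(hidden, repeats, dim=0)  # [T_mel, d]

class DurationPredictor(nn.Module):
    """Two 1D conv layers with ReLU, layer norm and dropout, then a
    linear projection to one scalar (log duration) per phoneme."""

    def __init__(self, d_model=256, kernel_size=3, dropout=0.1):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(d_model, d_model, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(d_model, d_model, kernel_size, padding=pad)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
        self.proj = nn.Linear(d_model, 1)

    def forward(self, x):                        # x: [B, T_phoneme, d_model]
        y = self.conv1(x.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm1(torch.relu(y)))
        y = self.conv2(y.transpose(1, 2)).transpose(1, 2)
        y = self.dropout(self.norm2(torch.relu(y)))
        return self.proj(y).squeeze(-1)          # predicted log duration
```

At inference time, predicted log durations are exponentiated and rounded before being fed to the length regulator, removing any dependence on an attention alignment during synthesis.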

Experimental Evaluation and Impacts

Utilizing the LJSpeech dataset, the authors demonstrate that FastSpeech matches autoregressive models in speech quality. Importantly, the model nearly eliminates severe synthesis errors such as word skipping and repetition on particularly hard cases, issues that arise in autoregressive systems from unstable attention alignments and error accumulation during sequential generation. The most notable result is a substantial increase in speed: FastSpeech achieves a 270x speedup in mel-spectrogram generation and a 38x speedup in end-to-end speech synthesis compared with the autoregressive Transformer TTS model.

Through these enhancements, FastSpeech not only addresses inference speed and robustness but also adds controllability: scaling the predicted durations adjusts voice speed smoothly, and inserting breaks between adjacent words provides a degree of prosody control.
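Continuing the earlier sketch, speed control reduces to scaling the predicted durations before length regulation; the variable names below follow the illustrative code above.

```python
# Re-synthesize the same utterance at different speeds by scaling
# durations before length regulation (hidden, durations as above).
for alpha in (0.5, 1.0, 1.5):   # shorter durations -> faster speech
    expanded = length_regulator(hidden, durations, alpha=alpha)
    # `expanded` feeds the parallel mel-spectrogram decoder
```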

Theoretical and Practical Implications

The theoretical implications of FastSpeech indicate a shift towards non-autoregressive techniques in TTS tasks, leveraging parallel data processing to circumvent the limitations associated with sequential generation. By doing so, it opens avenues for further research into efficient model architectures that balance quality, speed, and control.

Practically, FastSpeech can significantly impact applications requiring rapid and high-quality speech synthesis, such as virtual assistants, automated customer service systems, and real-time language translation. The enhanced control over speed and prosody also makes it viable for personalized and varied speech synthesis tailored to user preferences.

Future Directions

The authors suggest continuing to improve the quality of the synthesized speech, and extending FastSpeech to more complex settings such as multi-speaker synthesis and low-resource languages to broaden its application scope. Training FastSpeech jointly with a parallel neural vocoder as a fully end-to-end, parallel model could further improve efficiency and simplify deployment.

In summary, FastSpeech represents a significant advancement in TTS technology, showcasing how non-autoregressive models can provide robust, rapid, and versatile speech synthesis suitable for diverse use cases.

Authors (7)
  1. Yi Ren
  2. Yangjun Ruan
  3. Xu Tan
  4. Tao Qin
  5. Sheng Zhao
  6. Zhou Zhao
  7. Tie-Yan Liu