
A Survey on Neural Speech Synthesis (2106.15561v3)

Published 29 Jun 2021 in eess.AS, cs.CL, cs.LG, cs.MM, and cs.SD

Abstract: Text to speech (TTS), or speech synthesis, which aims to synthesize intelligible and natural speech given text, is a hot research topic in the speech, language, and machine learning communities and has broad applications in industry. With the development of deep learning and artificial intelligence, neural network-based TTS has significantly improved the quality of synthesized speech in recent years. In this paper, we conduct a comprehensive survey on neural TTS, aiming to provide a good understanding of current research and future trends. We focus on the key components in neural TTS, including text analysis, acoustic models, and vocoders, and several advanced topics, including fast TTS, low-resource TTS, robust TTS, expressive TTS, and adaptive TTS. We further summarize resources related to TTS (e.g., datasets and open-source implementations) and discuss future research directions. This survey can serve both academic researchers and industry practitioners working on TTS.

Authors (4)
  1. Xu Tan (164 papers)
  2. Tao Qin (201 papers)
  3. Frank Soong (9 papers)
  4. Tie-Yan Liu (242 papers)
Citations (325)

Summary

  • The paper provides a comprehensive examination of neural TTS evolution, highlighting modern acoustic models and vocoders for enhanced voice quality.
  • The paper details innovations like non-autoregressive generation, low-resource adaptation, and robust alignment techniques to address synthesis challenges.
  • The paper identifies future research directions in expressive control, adaptive voice cloning, and energy-efficient architectures for scalable TTS systems.

A Survey on Neural Speech Synthesis

The landscape of text-to-speech (TTS) synthesis has seen considerable advancements, primarily driven by developments in neural network-based methodologies. The paper "A Survey on Neural Speech Synthesis" by Xu Tan, Tao Qin, Frank Soong, and Tie-Yan Liu offers a comprehensive examination of the current state and evolution of neural TTS systems. This survey methodically dissects the modern neural TTS pipeline, exploring both fundamental components and advanced topics, thereby offering insights for researchers and practitioners in the field.

Fundamental Components of Neural TTS

Neural TTS consists of several key components: text analysis, acoustic models, and vocoders. Text analysis transforms input text into linguistic features that guide speech synthesis. Historically, TTS systems required extensive text processing, including tasks such as text normalization and grapheme-to-phoneme conversion. Modern neural systems, however, often operate on simplified inputs, either characters or phonemes, streamlining the preprocessing requirements.
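To make the front-end concrete, here is a minimal, hypothetical sketch of text normalization followed by dictionary-based grapheme-to-phoneme lookup. The tiny `LEXICON` and the abbreviation rule are illustrative stand-ins; real systems use large pronunciation dictionaries (e.g. CMUdict) plus a learned G2P model for out-of-vocabulary words.

```python
import re

# Hypothetical mini-lexicon (ARPAbet-like symbols); purely illustrative.
LEXICON = {
    "hello": ["HH", "AH", "L", "OW"],
    "world": ["W", "ER", "L", "D"],
    "doctor": ["D", "AA", "K", "T", "ER"],
}

def normalize(text: str) -> list:
    """Lowercase, expand one sample abbreviation, strip punctuation."""
    text = text.lower()
    text = re.sub(r"\bdr\.", "doctor", text)   # "Dr." -> "doctor"
    text = re.sub(r"[^a-z\s]", "", text)       # drop punctuation/digits
    return text.split()

def graphemes_to_phonemes(words: list) -> list:
    """Dictionary lookup; unknown words fall back to character symbols."""
    phones = []
    for w in words:
        phones.extend(LEXICON.get(w, list(w.upper())))
    return phones

tokens = normalize("Hello, world!")
print(graphemes_to_phonemes(tokens))
```

Character-input models skip the lookup entirely and feed `list(text)` to the encoder, trading front-end engineering for learned pronunciation.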

The survey outlines the evolution from early HMM and DNN-based models to sophisticated end-to-end systems like Tacotron and FastSpeech. Modern acoustic models, leveraging architectures such as CNNs and Transformers, have achieved notable improvements in voice quality. These systems generally convert text directly into mel-spectrograms, which are then synthesized into waveforms by neural vocoders like WaveNet.
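The mel-spectrogram targets these acoustic models predict are linear-frequency spectra warped onto a perceptual mel scale. A minimal NumPy sketch of that warping, assuming a standard triangular filterbank (the 22.05 kHz sample rate, 1024-point FFT, and 80 mel bands are common but illustrative choices):

```python
import numpy as np

def hz_to_mel(f):
    return 2595.0 * np.log10(1.0 + f / 700.0)

def mel_to_hz(m):
    return 700.0 * (10.0 ** (m / 2595.0) - 1.0)

def mel_filterbank(n_mels=80, n_fft=1024, sr=22050):
    """Triangular filters mapping linear FFT bins to mel bands."""
    mels = np.linspace(hz_to_mel(0.0), hz_to_mel(sr / 2), n_mels + 2)
    bins = np.floor((n_fft + 1) * mel_to_hz(mels) / sr).astype(int)
    fb = np.zeros((n_mels, n_fft // 2 + 1))
    for i in range(n_mels):
        lo, center, hi = bins[i], bins[i + 1], bins[i + 2]
        fb[i, lo:center] = (np.arange(lo, center) - lo) / max(center - lo, 1)
        fb[i, center:hi] = (hi - np.arange(center, hi)) / max(hi - center, 1)
    return fb

# Dummy magnitude spectrogram (frames x FFT bins) stands in for the STFT
# of real audio; the acoustic model learns to predict the log-mel result.
spec = np.abs(np.random.randn(100, 513))
mel = np.log(spec @ mel_filterbank().T + 1e-6)   # shape (100, 80)
```

In training, the ground-truth log-mel frames computed this way are the regression targets; at inference, the predicted frames are handed to the vocoder.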

Vocoder development has similarly progressed, moving from traditional STRAIGHT and WORLD systems to advanced neural architectures. Contemporary approaches employ autoregressive, flow-based, and GAN-based models to efficiently generate high-quality audio.

Advancements in Neural TTS

The paper also explores several advanced TTS topics, focusing on issues like synthesis speed, data scarcity, robustness, expressiveness, and adaptability:

  1. Fast TTS: Techniques such as non-autoregressive generation allow for significant improvements in synthesis speed. Models like FastSpeech demonstrate the capability to generate speech more efficiently without compromising quality.
  2. Low-Resource TTS: Cross-lingual and cross-speaker transfer methods, along with self-supervised learning, are critical for expanding TTS capabilities to languages and speakers with limited available data.
  3. Robustness: Addressing issues like word skipping and repetition, the paper highlights enhanced attention mechanisms and robust alignment methods (e.g., duration prediction) to ensure more reliable synthesis.
  4. Expressiveness and Control: Modeling variance in speech (e.g., prosody, style, and emotion) is crucial for generating expressive, human-like speech. Techniques using reference encoders and style tokens enable better prosody representation and manipulation.
  5. Adaptive TTS: The ability to efficiently clone voices using few-shot or zero-shot adaptations enhances TTS applicability in personalizing voice for individual users.

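The non-autoregressive idea behind FastSpeech (item 1) and duration-based alignment (item 3) can be sketched with a length regulator: each phoneme's encoder output is repeated by its predicted duration so the expanded sequence matches the mel-spectrogram length, letting the decoder generate all frames in parallel. The shapes and durations below are toy values for illustration.

```python
import numpy as np

def length_regulate(hidden, durations):
    """FastSpeech-style length regulator.

    hidden:    (num_phonemes, hidden_dim) encoder outputs
    durations: (num_phonemes,) integer frame counts per phoneme
    Returns the frame-level sequence of length sum(durations).
    """
    return np.repeat(hidden, durations, axis=0)

# Toy example: 4 phoneme encodings with hidden_dim 8 and hypothetical
# durations (in practice produced by a learned duration predictor).
hidden = np.random.randn(4, 8)
durations = np.array([3, 5, 2, 4])
frames = length_regulate(hidden, durations)
print(frames.shape)  # (14, 8): 3+5+2+4 frames, decoded in parallel
```

Because the alignment is an explicit hard duration rather than a learned soft attention, word skipping and repetition failures largely disappear, which is exactly the robustness benefit the survey attributes to duration prediction.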
Future Directions

The survey identifies speech synthesis that is simultaneously high-quality, efficient, robust across domains, and expressively controllable as an ongoing challenge. It points to powerful generative models such as diffusion models, better representation learning, and pre-training methods as key research areas. Additionally, advances in energy-efficient models are imperative for deploying sophisticated TTS systems in real-world scenarios.

By consolidating recent advancements and highlighting potential research directions, this survey serves as a crucial resource for advancing the development and deployment of TTS technologies.
