NaturalSpeech 2: Latent Diffusion Models are Natural and Zero-Shot Speech and Singing Synthesizers (2304.09116v3)

Published 18 Apr 2023 in eess.AS, cs.AI, cs.CL, cs.LG, and cs.SD

Abstract: Scaling text-to-speech (TTS) to large-scale, multi-speaker, and in-the-wild datasets is important to capture the diversity in human speech such as speaker identities, prosodies, and styles (e.g., singing). Current large TTS systems usually quantize speech into discrete tokens and use LLMs to generate these tokens one by one, which suffer from unstable prosody, word skipping/repeating issue, and poor voice quality. In this paper, we develop NaturalSpeech 2, a TTS system that leverages a neural audio codec with residual vector quantizers to get the quantized latent vectors and uses a diffusion model to generate these latent vectors conditioned on text input. To enhance the zero-shot capability that is important to achieve diverse speech synthesis, we design a speech prompting mechanism to facilitate in-context learning in the diffusion model and the duration/pitch predictor. We scale NaturalSpeech 2 to large-scale datasets with 44K hours of speech and singing data and evaluate its voice quality on unseen speakers. NaturalSpeech 2 outperforms previous TTS systems by a large margin in terms of prosody/timbre similarity, robustness, and voice quality in a zero-shot setting, and performs novel zero-shot singing synthesis with only a speech prompt. Audio samples are available at https://speechresearch.github.io/naturalspeech2.

Overview of NaturalSpeech 2: Diffusion Models for Text-to-Speech and Singing Synthesis

This paper presents NaturalSpeech 2, a text-to-speech (TTS) system that adopts latent diffusion models to generate both speech and singing in a zero-shot setting. The challenges addressed include achieving high fidelity, expressiveness, and robust synthesis across varied speaker identities and styles. Prior large-scale TTS systems typically model speech as sequences of discrete tokens generated autoregressively, which leads to unstable prosody and practical failures such as word skipping or repetition. NaturalSpeech 2 instead takes a non-autoregressive approach built on latent diffusion over continuous latent vectors, augmented with a speech prompting mechanism for in-context learning.

Methodological Advances

  1. Neural Audio Codec and Continuous Latent Vectors: The authors propose a codec framework that produces continuous rather than discrete audio representations. A neural audio codec with residual vector quantization (RVQ) encodes speech into latent vectors, and synthesis operates on these continuous latents rather than on discrete token indices. This avoids the fidelity and robustness trade-offs of discrete token-based systems and retains fine acoustic detail (an illustrative RVQ sketch follows this list).
  2. Latent Diffusion Processes: The core of NaturalSpeech 2 is a non-autoregressive latent diffusion model, a departure from previous autoregressive approaches. A forward stochastic differential equation (SDE) gradually perturbs the codec latents with noise, and a learned reverse process denoises them back into speech latents conditioned on the text. Because all latent frames are generated jointly rather than token by token, the error propagation inherent in autoregressive models is mitigated, improving output stability (a generic denoising loop is sketched after this list).
  3. Speech Prompting for Zero-Shot Learning: The paper integrates an in-context learning mechanism based on speech prompts, which lets the model adapt to a target speaker's identity and prosody without retraining. Query-key-value attention over the encoded prompt steers the duration/pitch predictors and the latent diffusion model so that the generated speech matches the prompt's voice characteristics (see the cross-attention sketch after this list).

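To make the residual vector quantization concrete, here is a minimal, self-contained sketch of how a stack of codebooks quantizes continuous latents stage by stage; the codebook sizes, latent dimension, and the `residual_vector_quantize` helper are illustrative assumptions, not the paper's implementation.

```python
import torch

def residual_vector_quantize(z, codebooks):
    """Quantize latent vectors with a stack of codebooks (RVQ sketch).

    z:         (batch, dim) continuous encoder outputs
    codebooks: list of (codebook_size, dim) tensors, one per RVQ stage
    Returns the summed quantized vectors and the per-stage code indices.
    """
    residual = z
    quantized = torch.zeros_like(z)
    indices = []
    for cb in codebooks:
        dists = torch.cdist(residual, cb)   # (batch, codebook_size) distances
        idx = dists.argmin(dim=-1)          # nearest codeword per vector
        q = cb[idx]                         # (batch, dim) quantized residual
        quantized = quantized + q           # accumulate across stages
        residual = residual - q             # pass the remainder to the next stage
        indices.append(idx)
    return quantized, indices

# Toy usage: 4 RVQ stages, 1024-entry codebooks, 256-dim latents.
torch.manual_seed(0)
codebooks = [torch.randn(1024, 256) for _ in range(4)]
z = torch.randn(8, 256)
z_q, codes = residual_vector_quantize(z, codebooks)
```

The key point is that the quantized vectors are sums over many codebook stages, so treating them (or the pre-quantization latents) as continuous regression targets preserves detail that a single discrete token stream would lose.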
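The reverse diffusion step can be pictured with a generic denoising loop. The sketch below is standard DDPM-style ancestral sampling over continuous latents; NaturalSpeech 2 uses its own SDE-based formulation conditioned on text, duration, pitch, and the speech prompt, so treat this as an assumption-laden illustration of the general idea, with `denoiser` standing in for a hypothetical noise-prediction network.

```python
import torch

@torch.no_grad()
def ddpm_sample(denoiser, shape, betas):
    """Generic DDPM-style ancestral sampling (illustrative only)."""
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(shape)                                 # start from pure noise
    for t in reversed(range(len(betas))):
        eps_hat = denoiser(x, t)                           # predicted noise at step t
        coef = betas[t] / torch.sqrt(1.0 - alpha_bars[t])
        mean = (x - coef * eps_hat) / torch.sqrt(alphas[t])
        noise = torch.randn_like(x) if t > 0 else torch.zeros_like(x)
        x = mean + torch.sqrt(betas[t]) * noise            # one denoising step
    return x

# Toy usage with a placeholder "denoiser" that predicts zero noise.
betas = torch.linspace(1e-4, 0.02, steps=50)
latents = ddpm_sample(lambda x, t: torch.zeros_like(x), (2, 100, 256), betas)
```

Because every latent frame is refined jointly at each step, there is no left-to-right error accumulation of the kind that causes word skipping or repetition in autoregressive token generation.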
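For the prompting mechanism, the following sketch shows one plausible way to let the latent sequence attend to an encoded speech prompt via cross-attention; the class name, shapes, and the residual connection are illustrative assumptions rather than the paper's exact module.

```python
import torch
import torch.nn as nn

class PromptCrossAttention(nn.Module):
    """Cross-attention where queries come from the latents being generated
    and keys/values come from the encoded speech prompt (illustrative)."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, latents, prompt):
        # latents: (batch, T_latent, dim); prompt: (batch, T_prompt, dim)
        out, _ = self.attn(query=latents, key=prompt, value=prompt)
        return latents + out                     # residual connection

# Toy usage: 100 target latent frames attending to a 300-frame prompt.
x = torch.randn(2, 100, 256)
p = torch.randn(2, 300, 256)
attended = PromptCrossAttention(256)(x, p)
```

The same idea applies to the duration and pitch predictors: conditioning them on the prompt lets a single trained model imitate an unseen speaker's timing and intonation from a short reference clip.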
Experimental Evidence and Performance

NaturalSpeech 2 achieves strong results across multiple metrics on standard benchmarks such as LibriSpeech and VCTK. It surpasses prior systems such as YourTTS and VALL-E, showing better prosody similarity, higher speaker similarity (SMOS) and naturalness (CMOS), and lower word error rates. Specific benefits include:

  • Prosody and Expressiveness: Explicit duration and pitch prediction allows NaturalSpeech 2 to closely match both the prosody of the prompt audio and the ground-truth references, demonstrating gains in expressive synthesis.
  • Robustness: Its non-autoregressive nature facilitates superior robustness against typical autoregressive failure modes, such as word repetition and misalignment, positioning it as reliable in challenging synthesis settings.

Implications and Future Directions

The research proposes a paradigm shift in TTS and singing synthesis, expanding its potential applications—from accessible voice interfaces to creative sound design in digital content. The architecture sets a foundation for future exploration of efficient model scaling, possibly integrating techniques like consistency modeling to further reduce computational overhead while maintaining synthesis quality.

This work has implications for AI research in advancing human-computer interaction through more expressive and varied vocal synthesis. Continued exploration in this domain may leverage parallel developments in unsupervised learning and larger-scale data to further improve the model's adaptability and diversity.

In summary, NaturalSpeech 2 exemplifies a forward-thinking approach to tackling core TTS challenges, offering a robust, high-quality, and versatile synthesis platform. It sets a benchmark for continued innovation in speech technology and aligns closely with the trajectory of integrating deep learning-based generative models across human-centric application domains.

Authors (9)
  1. Kai Shen (29 papers)
  2. Zeqian Ju (13 papers)
  3. Xu Tan (164 papers)
  4. Yanqing Liu (48 papers)
  5. Yichong Leng (27 papers)
  6. Lei He (120 papers)
  7. Tao Qin (201 papers)
  8. Sheng Zhao (75 papers)
  9. Jiang Bian (229 papers)
Citations (183)