
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models (2403.03100v3)

Published 5 Mar 2024 in eess.AS, cs.AI, cs.CL, cs.LG, and cs.SD

Abstract: While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility, and achieves on-par quality with human recordings. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.

Exploring Zero-Shot Speech Synthesis with NaturalSpeech 3: A Leap Towards Natural and Controllable TTS Systems

Introduction

Text-to-speech (TTS) synthesis, a cornerstone of contemporary voice applications, has advanced remarkably with the integration of deep learning. Despite these achievements, current large-scale TTS models still fall short in speech quality, similarity, and prosody. To address these challenges, our paper introduces NaturalSpeech 3 (NS3), which uses factorized diffusion models for zero-shot speech synthesis and builds on a novel neural codec equipped with factorized vector quantization (FVQ) for speech attribute disentanglement.

Key Contributions

NaturalSpeech 3 centers on two pivotal components: the FACodec for attribute factorization and the factorized diffusion model for efficient speech generation across disentangled subspaces.

  • FACodec: This new codec disentangles speech into distinct subspaces, specifically content, prosody, timbre, and acoustic details, thereby simplifying the modeling process (a minimal illustrative sketch follows this list).
  • Factorized Diffusion Model: Extended from FACodec's disentanglement, this diffusion model generates individual speech attributes in their respective subspaces, offering enhanced control and flexibility in speech synthesis.
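
The snippet below is a minimal, hypothetical sketch of the factorized-quantization idea: shared encoder features are projected into separate attribute subspaces, and each subspace is quantized with its own codebook, yielding per-attribute token streams that a downstream generator can model independently. Module names, dimensions, and codebook sizes are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class SubspaceQuantizer(nn.Module):
    """Nearest-neighbour vector quantizer for one attribute subspace (illustrative)."""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, dim); pick the closest codebook entry per frame
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, T, K)
        idx = dist.argmin(dim=-1)                                       # discrete tokens
        quantized = self.codebook(idx)
        # Straight-through estimator so gradients still reach the encoder
        quantized = z + (quantized - z).detach()
        return quantized, idx


class FactorizedQuantizer(nn.Module):
    """Projects shared encoder features into per-attribute subspaces and quantizes each."""

    def __init__(self, enc_dim: int = 256, sub_dim: int = 64, codebook_size: int = 1024):
        super().__init__()
        self.attributes = ["content", "prosody", "timbre", "acoustic_details"]
        self.proj = nn.ModuleDict({a: nn.Linear(enc_dim, sub_dim) for a in self.attributes})
        self.quant = nn.ModuleDict({a: SubspaceQuantizer(codebook_size, sub_dim) for a in self.attributes})

    def forward(self, enc_out: torch.Tensor):
        # Returns one discrete token stream per speech attribute.
        return {a: self.quant[a](self.proj[a](enc_out))[1] for a in self.attributes}


if __name__ == "__main__":
    features = torch.randn(2, 100, 256)   # stand-in for frame-level encoder output
    tokens = FactorizedQuantizer()(features)
    print({name: t.shape for name, t in tokens.items()})   # each: (2, 100)
```

Note that the structural split alone does not guarantee disentanglement; the paper's codec relies on additional training techniques for that, which this sketch omits.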

Empirical Evaluation

Our comprehensive experiments demonstrate NaturalSpeech 3's superiority over existing TTS systems across multiple dimensions:

  • Significantly improved speech quality, matching or surpassing ground-truth recordings in both qualitative and quantitative measures on the LibriSpeech test set.
  • Markedly higher accuracy in reproducing the prompt speech's voice and prosody, yielding state-of-the-art similarity scores.
  • Enhanced speech intelligibility, reflected in lower word error rate (WER); a toy WER computation is sketched after this list.
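
To make the intelligibility metric concrete, the helper below is a small, self-contained WER computation of the standard kind (an assumed illustration, not the paper's evaluation pipeline): synthesized speech is transcribed with an ASR model, and WER is the word-level edit distance between that transcript and the reference text, normalized by the reference length.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# One substitution and one deletion against a 6-word reference -> WER ~ 0.33
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```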

Furthermore, the scalability of NS3 is showcased through experiments that expand the system to 1 billion parameters and 200k hours of training data, presenting a promising avenue for future enhancements.

Theoretical Implications and Future Directions

The introduction of NS3 constitutes a crucial step toward highly natural and controllable speech synthesis. By conceptualizing speech as a composition of disentangled attributes and applying a divide-and-conquer strategy to their generation, we inherently gain finer control over the characteristics of the synthesized speech. This flexibility paves the way for a range of applications, from customizable voice assistants to sophisticated audio content generation.

Future research directions could extend the efficacy of the factorized diffusion model and explore its applicability in multi-lingual contexts or other forms of audio synthesis. Additionally, investigating the semantic integration between textual content and prosodic features could yield further improvements in naturalness and expressiveness.

Conclusion

NaturalSpeech 3 pushes the boundary of what is achievable in text-to-speech synthesis, marking a significant step toward truly lifelike and customizable synthetic speech. Through its novel approach to speech factorization and generation, NS3 not only achieves state-of-the-art results but also introduces a versatile framework for future innovations in the field of generative AI.

Authors (19)
  1. Zeqian Ju
  2. Yuancheng Wang
  3. Kai Shen
  4. Xu Tan
  5. Detai Xin
  6. Dongchao Yang
  7. Yanqing Liu
  8. Yichong Leng
  9. Kaitao Song
  10. Siliang Tang
  11. Zhizheng Wu
  12. Tao Qin
  13. Xiang-Yang Li
  14. Wei Ye
  15. Shikun Zhang
  16. Jiang Bian
  17. Lei He
  18. Jinyu Li
  19. Sheng Zhao