
NaturalSpeech 3: Zero-Shot Speech Synthesis with Factorized Codec and Diffusion Models (2403.03100v3)

Published 5 Mar 2024 in eess.AS, cs.AI, cs.CL, cs.LG, and cs.SD

Abstract: While recent large-scale text-to-speech (TTS) models have achieved significant progress, they still fall short in speech quality, similarity, and prosody. Considering speech intricately encompasses various attributes (e.g., content, prosody, timbre, and acoustic details) that pose significant challenges for generation, a natural idea is to factorize speech into individual subspaces representing different attributes and generate them individually. Motivated by it, we propose NaturalSpeech 3, a TTS system with novel factorized diffusion models to generate natural speech in a zero-shot way. Specifically, 1) we design a neural codec with factorized vector quantization (FVQ) to disentangle speech waveform into subspaces of content, prosody, timbre, and acoustic details; 2) we propose a factorized diffusion model to generate attributes in each subspace following its corresponding prompt. With this factorization design, NaturalSpeech 3 can effectively and efficiently model intricate speech with disentangled subspaces in a divide-and-conquer way. Experiments show that NaturalSpeech 3 outperforms the state-of-the-art TTS systems on quality, similarity, prosody, and intelligibility, and achieves on-par quality with human recordings. Furthermore, we achieve better performance by scaling to 1B parameters and 200K hours of training data.

Exploring Zero-Shot Speech Synthesis with NaturalSpeech 3: A Leap Towards Natural and Controllable TTS Systems

Introduction

Text-to-speech (TTS) synthesis, a cornerstone of contemporary voice applications, has advanced remarkably with the integration of deep learning. Despite these achievements, current large-scale TTS models still fall short in speech quality, similarity, and prosody. To address these challenges, our paper introduces NaturalSpeech 3 (NS3), which uses factorized diffusion models for zero-shot speech synthesis and builds on a novel neural codec equipped with factorized vector quantization (FVQ) for speech attribute disentanglement.

Key Contributions

NaturalSpeech 3 centers on two pivotal components: the FACodec for attribute factorization and the factorized diffusion model for efficient speech generation across disentangled subspaces.

  • FACodec: This new codec disentangles speech into distinct subspaces, specifically content, prosody, timbre, and acoustic details, thereby simplifying the modeling process (a minimal illustrative sketch follows this list).
  • Factorized Diffusion Model: Extended from FACodec's disentanglement, this diffusion model generates individual speech attributes in their respective subspaces, offering enhanced control and flexibility in speech synthesis.
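
The snippet below is a minimal, hypothetical sketch of the factorized-quantization idea: shared encoder features are projected into separate attribute subspaces, and each subspace is quantized with its own codebook, yielding per-attribute token streams that a downstream generator can model independently. Module names, dimensions, and codebook sizes are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn


class SubspaceQuantizer(nn.Module):
    """Nearest-neighbour vector quantizer for one attribute subspace (illustrative)."""

    def __init__(self, codebook_size: int, dim: int):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):
        # z: (batch, frames, dim); pick the closest codebook entry per frame
        dist = (z.unsqueeze(-2) - self.codebook.weight).pow(2).sum(-1)  # (B, T, K)
        idx = dist.argmin(dim=-1)                                       # discrete tokens
        quantized = self.codebook(idx)
        # Straight-through estimator so gradients still reach the encoder
        quantized = z + (quantized - z).detach()
        return quantized, idx


class FactorizedQuantizer(nn.Module):
    """Projects shared encoder features into per-attribute subspaces and quantizes each."""

    def __init__(self, enc_dim: int = 256, sub_dim: int = 64, codebook_size: int = 1024):
        super().__init__()
        self.attributes = ["content", "prosody", "timbre", "acoustic_details"]
        self.proj = nn.ModuleDict({a: nn.Linear(enc_dim, sub_dim) for a in self.attributes})
        self.quant = nn.ModuleDict({a: SubspaceQuantizer(codebook_size, sub_dim) for a in self.attributes})

    def forward(self, enc_out: torch.Tensor):
        # Returns one discrete token stream per speech attribute.
        return {a: self.quant[a](self.proj[a](enc_out))[1] for a in self.attributes}


if __name__ == "__main__":
    features = torch.randn(2, 100, 256)   # stand-in for frame-level encoder output
    tokens = FactorizedQuantizer()(features)
    print({name: t.shape for name, t in tokens.items()})   # each: (2, 100)
```

Note that the structural split alone does not guarantee disentanglement; the paper's codec relies on additional training techniques for that, which this sketch omits.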

Empirical Evaluation

Our comprehensive experiments demonstrate NaturalSpeech 3's superiority over existing TTS systems across multiple dimensions:

  • Significantly improved speech quality, matching or surpassing ground-truth recordings in both qualitative and quantitative measures on the LibriSpeech test set.
  • Markedly higher accuracy in reproducing the prompt speech's voice and prosody, yielding state-of-the-art similarity scores.
  • Enhanced speech intelligibility, reflected in lower word error rate (WER); a toy WER computation is sketched after this list.
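
To make the intelligibility metric concrete, the helper below is a small, self-contained WER computation of the standard kind (an assumed illustration, not the paper's evaluation pipeline): synthesized speech is transcribed with an ASR model, and WER is the word-level edit distance between that transcript and the reference text, normalized by the reference length.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)


# One substitution and one deletion against a 6-word reference -> WER ~ 0.33
print(word_error_rate("the cat sat on the mat", "the cat sit on mat"))
```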

Furthermore, the scalability of NS3 is showcased through experiments that expand the system to 1 billion parameters and 200k hours of training data, presenting a promising avenue for future enhancements.

Theoretical Implications and Future Directions

The introduction of NS3 constitutes a crucial step toward highly natural and controllable speech synthesis. By conceptualizing speech as a composition of disentangled attributes and applying a divide-and-conquer strategy to their generation, we inherently gain finer control over the characteristics of the synthesized speech. This flexibility paves the way for a range of applications, from customizable voice assistants to sophisticated audio content generation.

Future research directions could extend the efficacy of the factorized diffusion model and explore its applicability in multi-lingual contexts or other forms of audio synthesis. Additionally, investigating the semantic integration between textual content and prosodic features could yield further improvements in naturalness and expressiveness.

Conclusion

NaturalSpeech 3 pushes the boundary of what is achievable in text-to-speech synthesis, marking a significant step toward truly lifelike and customizable synthetic speech. Through its novel approach to speech factorization and generation, NS3 not only achieves state-of-the-art results but also introduces a versatile framework for future innovations in the field of generative AI.

Authors (19)
  1. Zeqian Ju
  2. Yuancheng Wang
  3. Kai Shen
  4. Xu Tan
  5. Detai Xin
  6. Dongchao Yang
  7. Yanqing Liu
  8. Yichong Leng
  9. Kaitao Song
  10. Siliang Tang
  11. Zhizheng Wu
  12. Tao Qin
  13. Xiang-Yang Li
  14. Wei Ye
  15. Shikun Zhang
  16. Jiang Bian
  17. Lei He
  18. Jinyu Li
  19. Sheng Zhao