
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Published 5 Jan 2023 in cs.CL, cs.SD, and eess.AS | (2301.02111v1)

Abstract: We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech, which is hundreds of times larger than existing systems. VALL-E demonstrates in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and the acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

Citations (539)

Summary

  • The paper demonstrates that VALL-E redefines text-to-speech synthesis by leveraging neural codec language models for zero-shot capabilities, outperforming existing systems.
  • It employs an autoregressive model for initial acoustic token prediction followed by a non-autoregressive decoder for efficient multi-layer token generation.
  • VALL-E nearly matches ground-truth quality on CMOS and SMOS metrics, evidencing its potential for personalized and robust speech synthesis.

Neural Codec Language Models for Zero-Shot Text to Speech

Introduction

The paper "Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers" presents VALL-E, a novel approach to text-to-speech (TTS) synthesis that leverages neural codec language models to achieve zero-shot TTS capabilities. This approach treats TTS as a conditional language modeling task by utilizing discrete codes from a neural audio codec model, departing from the traditional method of continuous signal regression over mel spectrograms. Aiming to overcome the generalization challenges of TTS, particularly for unseen speakers, the paper explores a large-scale data-driven method inspired by advances in text LLMs.

VALL-E Architecture

VALL-E is a conditional codec language model built on top of a neural audio codec. The architecture consists of two main components: an autoregressive (AR) model that predicts the first layer of acoustic tokens and a non-autoregressive (NAR) model that predicts the subsequent layers. The codec converts waveforms into discrete acoustic tokens, and the language model learns to map phoneme sequences to those tokens, trained on approximately 60,000 hours of English speech (see Figure 1).

Figure 1: The overview of VALL-E. Unlike the previous pipeline (e.g., phoneme → mel-spectrogram → waveform), the pipeline of VALL-E is phoneme → discrete code → waveform.
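The "discrete code" stage of this pipeline can be made concrete with a little arithmetic. The paper uses an off-the-shelf EnCodec model at 6 kbps: 24 kHz audio is tokenized into 75 code frames per second, each frame holding 8 codes drawn from 1024-entry codebooks. The sketch below just computes the resulting token budget (the constants are the paper's reported codec settings, not output of any real model):

```python
import math

# Codec settings reported in the paper (EnCodec at 6 kbps).
SAMPLE_RATE = 24_000     # Hz
FRAME_RATE = 75          # discrete code frames per second
N_QUANTIZERS = 8         # residual quantizer layers per frame
CODEBOOK_SIZE = 1024     # entries per codebook (10 bits per code)

def code_matrix_shape(seconds: float) -> tuple[int, int]:
    """Shape of the discrete code matrix for a clip of given length."""
    return (N_QUANTIZERS, int(seconds * FRAME_RATE))

def bitrate_bps() -> float:
    """Bits per second: 75 frames/s * 8 codes/frame * 10 bits/code."""
    return FRAME_RATE * N_QUANTIZERS * math.log2(CODEBOOK_SIZE)

print(code_matrix_shape(3))  # a 3-second acoustic prompt -> (8, 225) codes
print(bitrate_bps())         # 6000.0 bits/s = 6 kbps
```

So the 3-second enrollment recording becomes an 8x225 matrix of integer codes, which is what the language model conditions on.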

Conditional Codec Language Modeling

Autoregressive and Non-Autoregressive Components

The AR component predicts the first layer of quantized acoustic tokens, conditioned on both the phoneme sequence and a brief acoustic prompt; this stage determines the phoneme-to-acoustic-code alignment. The NAR component then conditions on the AR output to predict the remaining layers of acoustic tokens in parallel, improving efficiency and reducing computational cost. Together, the two components produce a hierarchical representation of the acoustic tokens, enhancing output quality (see Figure 2).

Figure 2: The structure of the conditional codec language modeling, which is built in a hierarchical manner. In practice, the NAR decoder will be called seven times to generate codes in seven quantizers.
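The two-stage decoding described above can be sketched as a toy control flow. This is not the paper's implementation: the random draws stand in for sampling from the AR transformer and for the NAR model's parallel forward passes, and the function names are illustrative only. What it shows is the structure: the AR stage emits first-layer codes one token at a time, then the NAR stage fills in the remaining seven quantizer layers, one whole layer per call.

```python
import numpy as np

rng = np.random.default_rng(0)
CODEBOOK_SIZE = 1024

def ar_decode(phonemes, prompt_codes, steps=6):
    """AR stage: generate first-quantizer codes one token at a time,
    conditioned on the phonemes and the enrolled acoustic prompt."""
    generated = list(prompt_codes)
    for _ in range(steps):
        # stand-in for sampling from the AR model's output distribution
        generated.append(int(rng.integers(0, CODEBOOK_SIZE)))
    return np.array(generated[len(prompt_codes):])

def nar_decode(first_layer, n_quantizers=8):
    """NAR stage: predict each remaining quantizer layer for all frames
    at once, conditioned on the layers generated so far."""
    layers = [first_layer]
    for _ in range(n_quantizers - 1):  # called seven times, as in Figure 2
        # stand-in for one parallel forward pass of the NAR model
        layers.append(rng.integers(0, CODEBOOK_SIZE, size=first_layer.shape))
    return np.stack(layers)  # shape: (n_quantizers, n_frames)

first = ar_decode(phonemes=None, prompt_codes=[1, 2, 3], steps=6)
codes = nar_decode(first)
print(codes.shape)  # (8, 6): 8 quantizer layers for 6 generated frames
```

The design choice this illustrates: only the first layer needs the expensive token-by-token loop, because it carries the alignment and most of the content; the seven refinement layers can each be produced in a single parallel pass.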

Experimentation and Evaluation

The comparative performance of VALL-E against existing TTS systems is evaluated on the LibriSpeech and VCTK datasets. VALL-E demonstrates superior speech naturalness and speaker similarity, reflected in higher CMOS and SMOS scores than YourTTS, the state-of-the-art zero-shot TTS system. Furthermore, VALL-E nearly matches the ground truth on these metrics for unseen speakers, underscoring its robust zero-shot capability.

In-Context Learning via Prompting

VALL-E exhibits strong in-context learning by effectively synthesizing speech from a short acoustic prompt. It maintains speaker-specific nuances like emotion and acoustic environment without additional fine-tuning or structural modifications. This is particularly significant for TTS research, as the model produces diverse outputs from the same input while preserving speaker individuality even in challenging scenarios (see Figure 3).

Figure 3: Revisiting the neural audio codec model. Because RVQ is employed, the first quantizer plays the most important role in reconstruction, and the contribution of the others gradually decreases.
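Residual vector quantization (RVQ), the mechanism behind Figure 3, is simple to sketch. Each quantizer encodes the residual left over by the previous ones, so the first codebook captures the bulk of the signal and later codebooks add progressively finer detail. The following is a minimal NumPy sketch with random codebooks, purely illustrative of the technique rather than EnCodec's trained quantizers:

```python
import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: quantizer i encodes the residual
    left by quantizers 0..i-1, so early layers matter most."""
    residual = x.astype(float)
    codes, recon = [], np.zeros_like(residual)
    for cb in codebooks:
        # pick the nearest codebook entry for each frame's residual
        dists = np.linalg.norm(residual[:, None, :] - cb[None, :, :], axis=-1)
        idx = dists.argmin(axis=1)
        quantized = cb[idx]
        codes.append(idx)
        recon += quantized      # decoding is just summing the entries
        residual -= quantized   # what's left for the next quantizer
    return np.stack(codes), recon

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                               # 4 frames, 8-dim
codebooks = [rng.normal(size=(16, 8)) for _ in range(8)]  # 8 quantizers
codes, recon = rvq_encode(x, codebooks)
print(codes.shape)  # (8, 4): one code per quantizer per frame
```

Because decoding is a plain sum of selected codebook entries, dropping the later codes degrades the reconstruction only gradually, which is exactly why VALL-E can afford to predict the first layer carefully (AR) and the rest cheaply (NAR).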

Implications and Future Directions

The implications of VALL-E extend to various speech synthesis applications, including personalized TTS systems, voice cloning, and audio content creation. Future directions include addressing robustness issues in word articulation during synthesis, enhancing data diversity, and exploring a full-scale non-autoregressive model approach.

Conclusion

VALL-E introduces a transformative method for TTS by redefining traditional paradigms with neural codec language models. It capitalizes on large-scale datasets and in-context learning to achieve impressively high-quality synthesis in zero-shot scenarios, strongly outperforming previous methods. Continued improvements and extensions of this work hold promising potential for the future of speech synthesis research and application development.
