Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (2301.02111v1)

Published 5 Jan 2023 in cs.CL, cs.SD, and eess.AS

Abstract: We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis. See https://aka.ms/valle for demos of our work.

Neural Codec Language Models for Zero-Shot Text-to-Speech Synthesis

The paper "Neural Codec LLMs are Zero-Shot Text to Speech Synthesizers" presents a novel approach to addressing the challenge of high-quality Text-to-Speech (TTS) synthesis for unseen speakers using a large-scale, LLM-based framework named VALL-E. Unlike traditional cascaded TTS systems, which rely on mel-spectrograms and signal regression models, VALL-E employs discrete audio codec codes as intermediate representations and treats TTS as conditional LLMing.

Methodology

VALL-E's methodology revolves around converting the continuous speech signal into discrete tokens using a neural audio codec model, specifically EnCodec. This conversion allows the model to handle TTS as a sequence prediction problem in a discrete token space. The approach consists of two primary components:

  1. Autoregressive (AR) Model: Responsible for generating the initial level of discrete tokens based on phoneme sequences and a short acoustic prompt (an enrolled recording of the target speaker).
  2. Non-Autoregressive (NAR) Model: Used for the remaining levels of token prediction, which refine the output by capturing finer acoustic details. Each level depends on the previously generated tokens together with the phoneme and acoustic prompts (the resulting factorization is sketched below).
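
The division of labor between the two models amounts to a factorization of the distribution over the codec-code matrix. In rough notation reconstructed from the description above (not quoted from the paper), with C the T-by-8 matrix of target codes, x the phoneme sequence, and the tilde marking the codes of the enrolled prompt, the AR model covers the first quantizer and the NAR model the remaining seven:

    p(\mathbf{C} \mid \mathbf{x}, \tilde{\mathbf{C}})
        = \prod_{t=1}^{T} p(c_{t,1} \mid \mathbf{c}_{<t,1}, \tilde{\mathbf{C}}_{:,1}, \mathbf{x})          % AR: first RVQ level, frame by frame
          \times \prod_{l=2}^{8} p(\mathbf{c}_{:,l} \mid \mathbf{c}_{:,<l}, \tilde{\mathbf{C}}, \mathbf{x})  % NAR: levels 2-8, one pass per level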

The neural codec model, EnCodec, quantizes the audio into eight residual vector quantization (RVQ) levels, each providing incremental details for speech synthesis. This structured quantization enables the AR model to capture general features and the NAR model to add specific nuances, ensuring high-quality and natural speech synthesis.
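
As a concrete illustration of this two-stage pipeline, the sketch below walks through decoding with a hypothetical ar_model and nar_model; the names, signatures, and end-of-sequence handling are illustrative assumptions rather than the paper's implementation. The resulting code matrix would then be passed to the EnCodec decoder to recover a waveform.

# Minimal sketch of VALL-E-style two-stage decoding over 8 RVQ levels.
# "ar_model" and "nar_model" are hypothetical stand-ins for the paper's
# autoregressive and non-autoregressive Transformers.
import torch

N_LEVELS = 8          # residual vector-quantization levels in EnCodec
CODEBOOK_SIZE = 1024  # entries per codebook

@torch.no_grad()
def synthesize_codes(ar_model, nar_model, phonemes, prompt_codes, max_frames=1500):
    """phonemes: (S,) phoneme ids; prompt_codes: (T_prompt, 8) codes of the
    3-second enrolled recording. Returns a (T, 8) matrix of predicted codes."""
    # Stage 1: the AR model generates first-level codes frame by frame,
    # conditioned on the phonemes and the first-level prompt codes.
    first_level = prompt_codes[:, 0].tolist()
    prompt_len = len(first_level)
    for _ in range(max_frames):
        logits = ar_model(phonemes, torch.tensor(first_level))  # (vocab,) for next frame
        next_code = torch.multinomial(torch.softmax(logits, dim=-1), 1).item()
        if next_code == CODEBOOK_SIZE:  # assumed end-of-sequence id
            break
        first_level.append(next_code)
    codes = torch.tensor(first_level[prompt_len:]).unsqueeze(1)  # (T, 1)

    # Stage 2: the NAR model fills in levels 2..8, each level predicted in a
    # single pass conditioned on all previously generated levels and the prompt.
    for level in range(1, N_LEVELS):
        level_codes = nar_model(phonemes, prompt_codes, codes, level)  # (T,)
        codes = torch.cat([codes, level_codes.unsqueeze(1)], dim=1)
    return codes  # decode to a waveform with the EnCodec decoder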

Experimental Results

VALL-E was evaluated on LibriSpeech and VCTK, focusing on zero-shot scenarios in which the model encounters speakers not seen during training. The key findings include:

  • LibriSpeech Evaluation: VALL-E outperformed state-of-the-art systems in both robustness, indicated by lower word error rates (WER), and speaker similarity, measured with WavLM-TDNN similarity scores (both metrics are sketched after this list). Specifically, VALL-E achieved a WER of 5.9% compared to 7.7% for the baseline, YourTTS.
  • VCTK Evaluation: VALL-E demonstrated superior speaker similarity scores despite not having seen any VCTK speakers during training. This result underscores its efficacy in generalizing speaker identity.
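
To make the two automatic metrics concrete, the sketch below computes WER as a word-level Levenshtein distance and speaker similarity as the cosine similarity between two speaker-embedding vectors. In the paper the embeddings come from WavLM-TDNN; here the extraction step is assumed to happen elsewhere, and the example vectors are placeholders.

# Minimal sketch of the two automatic metrics: word error rate via edit
# distance over words, and speaker similarity via cosine similarity of
# speaker embeddings (embedding extraction assumed elsewhere).
import math

def word_error_rate(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic programming over substitutions, insertions, and deletions.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(sub, d[i - 1][j] + 1, d[i][j - 1] + 1)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

def speaker_similarity(emb_a: list[float], emb_b: list[float]) -> float:
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm = math.sqrt(sum(a * a for a in emb_a)) * math.sqrt(sum(b * b for b in emb_b))
    return dot / norm

# Example: transcription of a synthesized utterance vs. the reference text.
print(word_error_rate("the quick brown fox", "the quick brown box"))   # 0.25
print(speaker_similarity([0.1, 0.3, 0.5], [0.1, 0.2, 0.6]))            # ~0.98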

Human evaluation metrics, including the Comparative Mean Opinion Score (CMOS) and Similarity MOS (SMOS), were also reported. VALL-E achieved CMOS improvements of +0.12 and +0.11 over YourTTS, and a further +0.23 over YourTTS on a mixed set of seen and unseen speakers from VCTK, showcasing its ability to produce more natural and speaker-similar speech.

Implications

The implications of the VALL-E framework are far-reaching:

  • Practical Applications: The proposed model's ability to perform high-quality TTS for unseen speakers with just a few seconds of enrolled speech introduces significant potential for applications in personalized digital assistants, voice cloning for content creation, and accessibility technologies for individuals with speech impairments.
  • Future TTS Systems: The shift from mel-spectrogram and signal regression models to discrete token-based language models represents a paradigm shift in TTS development. This method leverages advancements in language modeling to achieve better generalization and robustness in speech synthesis.
  • Acoustic and Emotional Consistency: VALL-E's capacity to maintain the acoustic environment and speaker's emotion from the acoustic prompt adds another dimension of realism and applicability, particularly useful in dynamic and context-aware voice applications.

Future Directions

The paper acknowledges several limitations and points towards future developments:

  1. Synthesis Robustness: One current challenge is occasional errors in word clarity and alignment. Future work is proposed to incorporate non-autoregressive models or modified attention mechanisms to improve these aspects.
  2. Data Coverage: While 60K hours of training data is extensive, further expanding this dataset to include more diverse speaking styles, accents, and environments could address current limitations in generalization performance.
  3. Model Architecture: Future iterations of VALL-E could explore the integration of both AR and NAR models into a single, universal model, potentially improving efficiency and performance.

Conclusion

The VALL-E framework introduced in this paper represents a significant advancement in zero-shot TTS, leveraging neural codec language models to achieve superior naturalness and speaker similarity. Its ability to handle diverse acoustic conditions and speaker emotions further enhances its applicability across a wide range of speech synthesis scenarios. Continued development along the outlined future directions could address current limitations and bring us closer to universal, high-fidelity TTS.

Authors (13)
  1. Chengyi Wang (32 papers)
  2. Sanyuan Chen (28 papers)
  3. Yu Wu (196 papers)
  4. Ziqiang Zhang (11 papers)
  5. Long Zhou (57 papers)
  6. Shujie Liu (101 papers)
  7. Zhuo Chen (319 papers)
  8. Yanqing Liu (48 papers)
  9. Huaming Wang (23 papers)
  10. Jinyu Li (164 papers)
  11. Lei He (120 papers)
  12. Sheng Zhao (75 papers)
  13. Furu Wei (291 papers)
Citations (539)