VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers (2406.05370v2)

Published 8 Jun 2024 in cs.CL, cs.SD, and eess.AS

Abstract: This paper introduces VALL-E 2, the latest advancement in neural codec language models that marks a milestone in zero-shot text-to-speech synthesis (TTS), achieving human parity for the first time. Based on its predecessor, VALL-E, the new iteration introduces two significant enhancements: Repetition Aware Sampling refines the original nucleus sampling process by accounting for token repetition in the decoding history. It not only stabilizes the decoding but also circumvents the infinite loop issue. Grouped Code Modeling organizes codec codes into groups to effectively shorten the sequence length, which not only boosts inference speed but also addresses the challenges of long sequence modeling. Our experiments on the LibriSpeech and VCTK datasets show that VALL-E 2 surpasses previous systems in speech robustness, naturalness, and speaker similarity. It is the first of its kind to reach human parity on these benchmarks. Moreover, VALL-E 2 consistently synthesizes high-quality speech, even for sentences that are traditionally challenging due to their complexity or repetitive phrases. The advantages of this work could contribute to valuable endeavors, such as generating speech for individuals with aphasia or people with amyotrophic lateral sclerosis. See https://aka.ms/valle2 for demos of VALL-E 2.


Overview of VALL-E 2: Achieving Human Parity in Zero-Shot Text-to-Speech Synthesis

The paper "VALL-E 2: Neural Codec Language Models are Human Parity Zero-Shot Text to Speech Synthesizers" introduces VALL-E 2, a neural codec language model that the authors report reaches human parity in zero-shot text-to-speech (TTS) synthesis. It builds on its predecessor, VALL-E, and adds two changes aimed at decoding stability and modeling efficiency.

Key Innovations

VALL-E 2 makes two modifications to its predecessor, VALL-E:

Repetition Aware Sampling (RAS):
- RAS refines nucleus sampling by taking token repetition in the decoding history into account. A token is first drawn by nucleus sampling; if that token already repeats too often in the recent history, it is replaced by a token drawn by random sampling from the full distribution. This stabilizes decoding and avoids the infinite loops seen with the original sampling scheme.
Grouped Code Modeling (GCM):
- GCM organizes codec codes into groups, effectively reducing the sequence length. This modification not only accelerates inference but also alleviates issues related to long sequence modeling, thereby improving overall performance.
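The RAS decision rule described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the default `top_p`, `window`, and `threshold` values, and the use of a plain probability list are all hypothetical choices made for the example.

```python
import random
from typing import List, Optional


def nucleus_sample(probs: List[float], top_p: float, rng: random.Random) -> int:
    """Sample a token id from the smallest set of tokens whose mass reaches top_p."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for token in order:
        kept.append(token)
        cumulative += probs[token]
        if cumulative >= top_p:
            break
    return rng.choices(kept, weights=[probs[t] for t in kept], k=1)[0]


def repetition_aware_sample(
    probs: List[float],
    history: List[int],
    top_p: float = 0.8,       # assumed value, for illustration only
    window: int = 10,         # assumed window size (must be >= 1)
    threshold: float = 0.1,   # assumed repetition-ratio threshold
    rng: Optional[random.Random] = None,
) -> int:
    """Nucleus-sample a token; fall back to random sampling if it repeats too often."""
    if window < 1:
        raise ValueError("window must be at least 1")
    rng = rng or random.Random()
    candidate = nucleus_sample(probs, top_p, rng)
    recent = history[-window:]
    if recent:
        ratio = recent.count(candidate) / len(recent)
        if ratio > threshold:
            # The candidate dominates recent history: resample from the
            # full, untruncated distribution to break the loop.
            candidate = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return candidate
```

In a decoding loop, `history` would hold the tokens generated so far, and the returned token would be appended to it after each step.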

Experimental Findings

Evaluation on the LibriSpeech and VCTK datasets shows that VALL-E 2 outperforms prior models in robustness, naturalness, and speaker similarity. The authors report that it reaches human parity on these benchmarks, supported by both subjective and objective metrics.

LibriSpeech Results:

Objective Metrics: VALL-E 2 exhibited enhanced performance with a significant reduction in Word Error Rate (WER) and improvements in DNSMOS scores. For instance, it achieved WER scores even better than the ground truth speech in specific settings, underscoring its robustness and accuracy.
Subjective Metrics: The model surpassed VALL-E in both Similarity Mean Opinion Score (SMOS) and Comparative Mean Opinion Score (CMOS), indicating better speaker similarity and naturalness, respectively.
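For readers unfamiliar with the objective metric above, WER is the word-level edit distance between a reference transcript and a hypothesis transcript, divided by the number of reference words. For TTS, the hypothesis is typically obtained by running a speech recognizer on the synthesized audio. The sketch below shows only the standard metric; the function name is hypothetical, and the text normalization and recognizer used in the paper are not reproduced here.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.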

VCTK Results:

Objective Metrics: Similar trends were observed, with VALL-E 2 significantly lowering WER and improving DNSMOS scores across prompt lengths of 3s, 5s, and 10s.
Subjective Metrics: VALL-E 2 demonstrated superior performance over VALL-E and achieved scores comparable to or even surpassing ground truth speech.

Implications and Future Developments

The implications of VALL-E 2 are both practical and theoretical. On a practical level, this advancement can lead to TTS systems capable of generating natural, human-like speech for previously unseen speakers from minimal enrollment data. Such systems could be invaluable in applications ranging from aiding individuals with speech impairments to enhancing virtual assistants and communication aids.

Theoretically, VALL-E 2's success in reducing sequence lengths and improving decoding techniques sets a new benchmark for future TTS model developments. The introduction of RAS and GCM demonstrates innovative ways to balance stability and efficiency in autoregressive models, providing a blueprint for addressing similar challenges in other neural language modeling tasks.
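The sequence-length reduction from grouping can be illustrated with a small sketch: a code sequence of length T, partitioned into groups of size G, yields roughly T / G steps for the autoregressive model. The helper names and the padding token below are hypothetical, and the sketch covers only the partitioning, not how the model embeds or predicts each group.

```python
from typing import List


def group_codes(codes: List[int], group_size: int, pad_id: int = -1) -> List[List[int]]:
    """Partition a code sequence into consecutive groups of `group_size` codes.

    The final group is padded with `pad_id` (an assumed placeholder) when the
    sequence length is not a multiple of the group size.
    """
    if group_size < 1:
        raise ValueError("group_size must be at least 1")
    pad_count = (-len(codes)) % group_size
    padded = codes + [pad_id] * pad_count
    return [padded[i:i + group_size] for i in range(0, len(padded), group_size)]


def ungroup_codes(groups: List[List[int]], pad_id: int = -1) -> List[int]:
    """Flatten groups back into a single code sequence, dropping padding."""
    return [code for group in groups for code in group if code != pad_id]
```

With a group size of 2, for example, a 7-code sequence becomes 4 groups, so the model takes 4 steps instead of 7.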

Conclusion

VALL-E 2 marks a significant evolution in zero-shot TTS synthesis, achieving human parity through thoughtful advancements in sampling and modeling techniques. As researchers continue to explore and refine these innovations, the implications for AI-driven communication devices and accessibility technologies are vast. This work not only sets a new standard for speech synthesis but also opens up promising avenues for further research and application in human-computer interaction.