RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

Published 4 Apr 2024 in eess.AS, cs.AI, cs.CL, cs.LG, and cs.SD | arXiv:2404.03204v3

Abstract: We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on LLMs shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of LLMs. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To accomplish this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E utilizes the predicted duration prompt to guide the computing of self-attention weights in Transformer to enforce the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Results of comprehensive objective and subjective evaluations demonstrate that, compared to a powerful baseline method VALL-E, RALL-E significantly improves the WER of zero-shot TTS from $5.6\%$ (without reranking) and $1.7\%$ (with reranking) to $2.5\%$ and $1.0\%$, respectively. Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from $68\%$ to $4\%$.


Summary

  • The paper introduces RALL-E's core idea: chain-of-thought (CoT) prompting that predicts prosody features (pitch and duration) before speech token generation.
  • It demonstrates substantial robustness gains, reducing the zero-shot word error rate from 5.6% to 2.5% without reranking and from 1.7% to 1.0% with reranking, relative to VALL-E.
  • It details duration-guided masking in the Transformer, which confines self-attention to the relevant phonemes and prosody features and stabilizes prosody in the synthesized speech.

Enhancing Text-to-Speech Synthesis with RALL-E: A Chain-of-Thought Prompting Approach

Introduction

Recent advances in text-to-speech (TTS) synthesis have drawn significant attention thanks to the integration of LLMs and neural codecs. These systems achieve impressive zero-shot TTS performance, replicating a speaker's voice from only a short audio prompt. However, robustness remains a challenge, particularly in maintaining stable prosody patterns and low word error rates. RALL-E addresses this with chain-of-thought (CoT) prompting, aiming at substantial robustness improvements for LLM-based TTS systems.

RALL-E Overview

RALL-E tackles the robustness problem of LLM-based TTS through two main components:

  • Prosody Feature Prediction: Before generating speech tokens, RALL-E predicts prosody features (pitch and duration) from the input text. These intermediate conditions not only aid the precise generation of speech tokens but also guide the model toward the relevant phonemes and prosody features (see the first sketch after this list).
  • Duration-Guided Masking in Transformers: The predicted durations guide the computation of self-attention weights in the Transformer, confining each speech token's attention to the pertinent phonemes and prosody features and improving prediction accuracy (see the second sketch below).
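
To make the two-stage CoT ordering concrete, here is a minimal, runnable sketch of the decoding flow. The `TinyLM` stand-in model, the token layout, and the toy duration values are illustrative assumptions, not the paper's actual implementation:

```python
import torch

class TinyLM:
    """Stand-in decoder-only LM that samples uniform random tokens;
    a real model would condition autoregressively on the prefix."""
    def __init__(self, vocab_size: int = 1024):
        self.vocab_size = vocab_size

    def generate(self, prefix: torch.Tensor, num_tokens: int) -> torch.Tensor:
        return torch.randint(0, self.vocab_size, (num_tokens,))

def cot_tts(lm: TinyLM, phonemes: torch.Tensor):
    n = len(phonemes)
    # Step 1: predict prosody tokens (a duration and a pitch per phoneme),
    # conditioned on the phoneme sequence alone.
    prosody = lm.generate(prefix=phonemes, num_tokens=2 * n)
    durations = prosody[:n] % 8 + 1   # toy durations, in codec frames
    pitches = prosody[n:]
    # Step 2: predict codec speech tokens conditioned on the phonemes AND
    # the intermediate prosody "thoughts" (the chain-of-thought step).
    prefix = torch.cat([phonemes, durations, pitches])
    speech = lm.generate(prefix=prefix, num_tokens=int(durations.sum()))
    return durations, pitches, speech

durations, pitches, speech = cot_tts(TinyLM(), torch.tensor([3, 14, 15, 9, 2]))
```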
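
The duration-guided masking can be sketched similarly. The mask below restricts each speech frame to a small window of phonemes around its aligned position; the window size and exact alignment details are simplifications of what the paper describes:

```python
import torch

def duration_guided_mask(durations: torch.Tensor, window: int = 1) -> torch.Tensor:
    """Boolean mask of shape (num_frames, num_phonemes): speech frame t may
    attend to phoneme k only if k lies within `window` positions of the
    phoneme whose predicted duration span covers frame t."""
    ends = torch.cumsum(durations, dim=0)    # exclusive span ends per phoneme
    starts = ends - durations                # inclusive span starts
    frames = torch.arange(int(ends[-1])).unsqueeze(1)             # (T, 1)
    # in_span[t, k] is True when frame t falls inside phoneme k's span
    in_span = (frames >= starts.unsqueeze(0)) & (frames < ends.unsqueeze(0))
    aligned = in_span.int().argmax(dim=1, keepdim=True)           # (T, 1)
    k = torch.arange(len(durations)).unsqueeze(0)                 # (1, K)
    return (k - aligned).abs() <= window

mask = duration_guided_mask(torch.tensor([2, 3, 1]), window=1)
# In self-attention: scores.masked_fill(~mask, float("-inf")) before softmax.
```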

Benchmarking RALL-E

Comparative evaluations show that RALL-E clearly outperforms established methods such as VALL-E. Notably, RALL-E markedly reduces the word error rate (WER) of zero-shot TTS, achieving 2.5% without reranking and 1.0% with reranking, versus VALL-E's 5.6% and 1.7%. RALL-E also correctly synthesizes sentences that are inherently challenging for VALL-E, cutting the error rate on a hard-sentence set from 68% to 4%.
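
For reference, the reported WER follows the standard word-level edit-distance definition; a minimal implementation (independent of the paper's ASR and text-normalization pipeline) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. wer("the cat sat", "the cat sad") == 1/3
```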

Contribution and Implications

RALL-E makes several key contributions to the TTS field:

  • Robustness Enhancement: By combining CoT prompting for prosody feature prediction with duration-guided masking, RALL-E substantially improves the robustness of LLM-based TTS, reducing WER while preserving speech quality and naturalness.
  • Prosody Stabilization: Predicting prosody features before speech token generation stabilizes the prosody patterns of the synthesized speech, addressing a prevalent weakness of current LLM-based TTS systems.
  • Focusing Mechanism: Duration-guided masking keeps the model's attention on the relevant phonemes and prosody features, which underpins its accuracy on sentences that are hard for other systems.

Future Prospects

The advances introduced by RALL-E open avenues for further work in TTS. Employing CoT prompting for prosody feature prediction not only improves synthesis quality but also invites research into applications across languages and dialects. The effect of duration-guided masking on attention mechanisms likewise suggests a promising direction for improving the computational efficiency and effectiveness of LLMs in TTS and beyond.

Conclusion

RALL-E sets a new standard for robustness and quality in LLM-based TTS synthesis. By combining CoT prompting with duration-guided attention, it outperforms existing methods on error rate while maintaining speech quality, and it offers a scalable recipe for future TTS work. RALL-E marks a significant step toward more natural, accurate, and versatile synthetic voices.
