RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis
Abstract: We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on large language models (LLMs) shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (irregular pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of LLMs. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To realize this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E utilizes the predicted duration prompt to guide the computation of self-attention weights in the Transformer, forcing the model to attend to the corresponding phonemes and prosody features when predicting speech tokens. Results of comprehensive objective and subjective evaluations demonstrate that, compared to VALL-E, a powerful baseline method, RALL-E significantly improves the WER of zero-shot TTS from $5.6\%$ (without reranking) and $1.7\%$ (with reranking) to $2.5\%$ and $1.0\%$, respectively. Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E, reducing the error rate from $68\%$ to $4\%$.
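The duration-guided attention described above can be sketched as a masking scheme: each speech-token frame is restricted to the phoneme it is aligned to (via the predicted durations) plus a small window of neighbors. The following is a minimal illustrative sketch, not the paper's implementation; the function name, the `window` parameter, and the boolean-list mask representation are all assumptions for illustration.

```python
def duration_guided_mask(durations, window=1):
    """Build a (speech frame x phoneme) attention mask from predicted durations.

    Each speech frame may attend only to its aligned phoneme plus `window`
    neighbors on each side, mimicking the duration-guided attention idea.
    NOTE: illustrative sketch only; names and mask format are assumptions.
    """
    # align[t] = index of the phoneme that speech frame t belongs to,
    # obtained by expanding each phoneme by its predicted duration
    align = [i for i, d in enumerate(durations) for _ in range(d)]
    num_phonemes = len(durations)
    mask = []
    for p in align:
        lo = max(0, p - window)
        hi = min(num_phonemes, p + window + 1)
        # True = this frame is allowed to attend to that phoneme
        mask.append([lo <= j < hi for j in range(num_phonemes)])
    return mask

# Example: 3 phonemes with predicted durations 2, 1, 3 -> 6 speech frames
mask = duration_guided_mask([2, 1, 3], window=1)
```

In practice such a boolean mask would be applied additively (with `-inf` on disallowed positions) to the attention logits before the softmax, so attention mass cannot leak to distant phonemes.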
- Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020.
- MultiSpeech: Multi-speaker text to speech with Transformer. arXiv preprint arXiv:2006.04664, 2020.
- WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505–1518, 2022.
- High fidelity neural audio compression. arXiv preprint arXiv:2210.13438, 2022.
- Scaling vision transformers to 22 billion parameters. In Proc. ICML, pages 7480–7512. PMLR, 2023.
- VALL-T: Decoder-only generative transducer for robust and decoding-controllable text-to-speech. arXiv preprint arXiv:2401.14321, 2024.
- A. Graves. Sequence transduction with recurrent neural networks. arXiv preprint arXiv:1211.3711, 2012.
- Conformer: Convolution-augmented Transformer for Speech Recognition. In Proc. Interspeech, pages 5036–5040, 2020. doi: 10.21437/Interspeech.2020-3015.
- Robust sequence-to-sequence acoustic modeling with stepwise monotonic attention for neural TTS. arXiv preprint arXiv:1906.00672, 2019.
- The curious case of neural text degeneration. In Proc. ICLR, 2019.
- HuBERT: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
- NaturalSpeech 3: Zero-shot speech synthesis with factorized codec and diffusion models. arXiv preprint arXiv:2403.03100, 2024.
- Libri-light: A benchmark for ASR with limited or no supervision. In Proc. ICASSP, pages 7669–7673, 2020. https://github.com/facebookresearch/libri-light.
- Speak, read and prompt: High-fidelity text-to-speech with minimal supervision. arXiv preprint arXiv:2302.03540, 2023.
- D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proc. ICLR, San Diego, USA, May 2015.
- Voicebox: Text-guided multilingual universal speech generation at scale. Advances in neural information processing systems, 36, 2024.
- WORLD: A vocoder-based high-quality speech synthesis system for real-time applications. IEICE Transactions on Information and Systems, 99(7):1877–1884, 2016.
- LibriSpeech: An ASR corpus based on public domain audio books. In Proc. ICASSP, pages 5206–5210. IEEE, 2015.
- MLS: A Large-Scale Multilingual Dataset for Speech Research. In Proc. Interspeech, pages 2757–2761, 2020. doi: 10.21437/Interspeech.2020-2826.
- Language models are unsupervised multitask learners. 2019.
- FastSpeech: Fast, robust and controllable text to speech. Proc. NeurIPS, 32, 2019.
- AudioPaLM: A large language model that can speak and listen. arXiv preprint arXiv:2306.12925, 2023.
- UTMOS: UTokyo-SaruLab system for VoiceMOS Challenge 2022. arXiv preprint arXiv:2204.02152, 2022.
- Non-attentive Tacotron: Robust and controllable neural TTS synthesis including unsupervised duration modeling. arXiv preprint arXiv:2010.04301, 2020.
- NaturalSpeech 2: Latent diffusion models are natural and zero-shot speech and singing synthesizers. arXiv preprint arXiv:2304.09116, 2023.
- ELLA-V: Stable neural codec language modeling with alignment-guided sequence reordering. arXiv preprint arXiv:2401.07333, 2024.
- Token-Level Ensemble Distillation for Grapheme-to-Phoneme Conversion. In Proc. Interspeech 2019, pages 2115–2119, 2019. doi: 10.21437/Interspeech.2019-1208.
- Attention is all you need. Advances in neural information processing systems, 30, 2017.
- Neural codec language models are zero-shot text to speech synthesizers. arXiv preprint arXiv:2301.02111, 2023a.
- VioLA: Unified codec language models for speech recognition, synthesis, and translation. arXiv preprint arXiv:2305.16107, 2023b.
- Emergent abilities of large language models. arXiv preprint arXiv:2206.07682, 2022a.
- Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824–24837, 2022b.
- UniAudio: An audio foundation model toward universal audio generation. arXiv preprint arXiv:2310.00704, 2023.
- SoundStream: An end-to-end neural audio codec. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 30:495–507, 2021.
- LibriTTS: A corpus derived from LibriSpeech for text-to-speech. Proc. Interspeech, 2019.
- Forward attention in sequence-to-sequence acoustic modeling for speech synthesis. In Proc. ICASSP, pages 4789–4793. IEEE, 2018.