RALL-E: Robust Codec Language Modeling with Chain-of-Thought Prompting for Text-to-Speech Synthesis

Published 4 Apr 2024 in eess.AS, cs.AI, cs.CL, cs.LG, and cs.SD | arXiv:2404.03204v3

Abstract: We present RALL-E, a robust language modeling method for text-to-speech (TTS) synthesis. While previous work based on LLMs shows impressive performance on zero-shot TTS, such methods often suffer from poor robustness, such as unstable prosody (weird pitch and rhythm/duration) and a high word error rate (WER), due to the autoregressive prediction style of LLMs. The core idea behind RALL-E is chain-of-thought (CoT) prompting, which decomposes the task into simpler steps to enhance the robustness of LLM-based TTS. To accomplish this idea, RALL-E first predicts prosody features (pitch and duration) of the input text and uses them as intermediate conditions to predict speech tokens in a CoT style. Second, RALL-E utilizes the predicted duration prompt to guide the computing of self-attention weights in Transformer to enforce the model to focus on the corresponding phonemes and prosody features when predicting speech tokens. Results of comprehensive objective and subjective evaluations demonstrate that, compared to a powerful baseline method VALL-E, RALL-E significantly improves the WER of zero-shot TTS from $5.6\%$ (without reranking) and $1.7\%$ (with reranking) to $2.5\%$ and $1.0\%$, respectively. Furthermore, we demonstrate that RALL-E correctly synthesizes sentences that are hard for VALL-E and reduces the error rate from $68\%$ to $4\%$.


Summary

  • The paper introduces RALL-E's core idea: chain-of-thought (CoT) prompting that predicts prosody features (pitch and duration) before speech token generation.
  • It demonstrates substantial robustness gains, reducing the zero-shot word error rate from 5.6% to 2.5% without reranking and from 1.7% to 1.0% with reranking, relative to VALL-E.
  • It details duration-guided masking in the Transformer, which confines self-attention to the relevant phonemes and prosody features and stabilizes prosody in the synthesized speech.

Enhancing Text-to-Speech Synthesis with RALL-E: A Chain-of-Thought Prompting Approach

Introduction

Recent advances in text-to-speech (TTS) synthesis have drawn significant attention thanks to the integration of LLMs and neural codecs. These systems achieve impressive zero-shot TTS performance, replicating a speaker's voice from only a short audio prompt. However, robustness remains a challenge, particularly in maintaining stable prosody patterns and low word error rates. RALL-E addresses this with chain-of-thought (CoT) prompting, aiming at substantial robustness improvements for LLM-based TTS systems.

RALL-E Overview

RALL-E tackles the robustness problem of LLM-based TTS through two main components:

  • Prosody Feature Prediction: Before generating speech tokens, RALL-E predicts prosody features (pitch and duration) from the input text. These intermediate conditions not only aid the precise generation of speech tokens but also guide the model toward the relevant phonemes and prosody features (see the first sketch after this list).
  • Duration-Guided Masking in Transformers: The predicted durations guide the computation of self-attention weights in the Transformer, confining each speech token's attention to the pertinent phonemes and prosody features and improving prediction accuracy (see the second sketch below).
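
To make the two-stage CoT ordering concrete, here is a minimal, runnable sketch of the decoding flow. The `TinyLM` stand-in model, the token layout, and the toy duration values are illustrative assumptions, not the paper's actual implementation:

```python
import torch

class TinyLM:
    """Stand-in decoder-only LM that samples uniform random tokens;
    a real model would condition autoregressively on the prefix."""
    def __init__(self, vocab_size: int = 1024):
        self.vocab_size = vocab_size

    def generate(self, prefix: torch.Tensor, num_tokens: int) -> torch.Tensor:
        return torch.randint(0, self.vocab_size, (num_tokens,))

def cot_tts(lm: TinyLM, phonemes: torch.Tensor):
    n = len(phonemes)
    # Step 1: predict prosody tokens (a duration and a pitch per phoneme),
    # conditioned on the phoneme sequence alone.
    prosody = lm.generate(prefix=phonemes, num_tokens=2 * n)
    durations = prosody[:n] % 8 + 1   # toy durations, in codec frames
    pitches = prosody[n:]
    # Step 2: predict codec speech tokens conditioned on the phonemes AND
    # the intermediate prosody "thoughts" (the chain-of-thought step).
    prefix = torch.cat([phonemes, durations, pitches])
    speech = lm.generate(prefix=prefix, num_tokens=int(durations.sum()))
    return durations, pitches, speech

durations, pitches, speech = cot_tts(TinyLM(), torch.tensor([3, 14, 15, 9, 2]))
```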
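
The duration-guided masking can be sketched similarly. The mask below restricts each speech frame to a small window of phonemes around its aligned position; the window size and exact alignment details are simplifications of what the paper describes:

```python
import torch

def duration_guided_mask(durations: torch.Tensor, window: int = 1) -> torch.Tensor:
    """Boolean mask of shape (num_frames, num_phonemes): speech frame t may
    attend to phoneme k only if k lies within `window` positions of the
    phoneme whose predicted duration span covers frame t."""
    ends = torch.cumsum(durations, dim=0)    # exclusive span ends per phoneme
    starts = ends - durations                # inclusive span starts
    frames = torch.arange(int(ends[-1])).unsqueeze(1)             # (T, 1)
    # in_span[t, k] is True when frame t falls inside phoneme k's span
    in_span = (frames >= starts.unsqueeze(0)) & (frames < ends.unsqueeze(0))
    aligned = in_span.int().argmax(dim=1, keepdim=True)           # (T, 1)
    k = torch.arange(len(durations)).unsqueeze(0)                 # (1, K)
    return (k - aligned).abs() <= window

mask = duration_guided_mask(torch.tensor([2, 3, 1]), window=1)
# In self-attention: scores.masked_fill(~mask, float("-inf")) before softmax.
```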

Benchmarking RALL-E

Comparative evaluations show that RALL-E clearly outperforms established methods such as VALL-E. Notably, RALL-E markedly reduces the word error rate (WER) of zero-shot TTS, achieving 2.5% without reranking and 1.0% with reranking, versus VALL-E's 5.6% and 1.7%. RALL-E also correctly synthesizes sentences that are inherently challenging for VALL-E, cutting the error rate on a hard-sentence set from 68% to 4%.
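
For reference, the reported WER follows the standard word-level edit-distance definition; a minimal implementation (independent of the paper's ASR and text-normalization pipeline) looks like this:

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level Levenshtein distance divided by
    the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    # Dynamic-programming edit distance over words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            sub = d[i - 1][j - 1] + (ref[i - 1] != hyp[j - 1])
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, sub)
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# e.g. wer("the cat sat", "the cat sad") == 1/3
```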

Contribution and Implications

RALL-E makes several key contributions to the TTS field:

  • Robustness Enhancement: By combining CoT prompting for prosody feature prediction with duration-guided masking, RALL-E substantially improves the robustness of LLM-based TTS, reducing WER while preserving speech quality and naturalness.
  • Prosody Stabilization: Predicting prosody features before speech token generation stabilizes the prosody patterns of the synthesized speech, addressing a prevalent weakness of current LLM-based TTS systems.
  • Focusing Mechanism: Duration-guided masking keeps the model's attention on the relevant phonemes and prosody features, which underpins its accuracy on sentences that are hard for other systems.

Future Prospects

The advances introduced by RALL-E open avenues for further work in TTS. Employing CoT prompting for prosody feature prediction not only improves synthesis quality but also invites research into applications across languages and dialects. The effect of duration-guided masking on attention mechanisms likewise suggests a promising direction for improving the computational efficiency and effectiveness of LLMs in TTS and beyond.

Conclusion

RALL-E sets a new standard for robustness and quality in LLM-based TTS synthesis. By combining CoT prompting with duration-guided attention, it outperforms existing methods on error rate while maintaining speech quality, and it offers a scalable recipe for future TTS work. RALL-E marks a significant step toward more natural, accurate, and versatile synthetic voices.
