
BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data (2402.08093v2)

Published 12 Feb 2024 in cs.LG, cs.CL, and eess.AS

Abstract: We introduce a text-to-speech (TTS) model called BASE TTS, which stands for Big Adaptive Streamable TTS with Emergent abilities. BASE TTS is the largest TTS model to date, trained on 100K hours of public domain speech data, achieving a new state of the art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes"), followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely reported "emergent abilities" of LLMs when trained on increasing volumes of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase the state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark, and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.

Building a Billion-Parameter Text-to-Speech Model: Insights from BASE TTS

Introduction to BASE TTS

BASE TTS introduces a novel direction in text-to-speech (TTS) technology, combining an LLM-style modeling paradigm with a novel speech tokenization technique. The paper demonstrates a significant leap in speech synthesis by training a billion-parameter model on an unprecedented dataset of 100,000 hours of speech. This model, named Big Adaptive Streamable TTS with Emergent abilities (BASE TTS), brings text-to-speech synthesis closer to natural, human-like performance, particularly in rendering textually complex sentences with appropriate prosody.

Novel Contributions

The main contributions of this work are threefold:

  1. Largest TTS Model: BASE TTS sets a new benchmark in the field by being the largest model to date, with 1 billion parameters. It outperforms existing large-scale TTS models in subjective evaluations, providing more natural speech synthesis.
  2. Emergent Abilities and Benchmark: By scaling the model and dataset size, BASE TTS exhibits emergent abilities, allowing it to effectively render complex prosodic patterns and textual nuances. A specialized dataset and subjective evaluation benchmark for "emergent abilities" in TTS are also introduced, enabling systematic study of model performance on challenging linguistic phenomena.
  3. Novel Speech Representations: The introduction of speaker-disentangled speechcodes, built atop a WavLM self-supervised learning (SSL) model, demonstrates a sophisticated method to capture only the essential phonemic and prosodic information, achieving high-quality waveform synthesis even at significant compression rates (a minimal sketch of this tokenization idea follows this list).
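
The PyTorch sketch below illustrates the tokenization idea only: it vector-quantizes WavLM features while removing a speaker component. The paper trains its tokenizer with dedicated disentanglement objectives; the simple subtraction here is a stand-in, and all class names, dimensions, and parameters are illustrative assumptions.

```python
# Illustrative-only sketch of speaker-disentangled speech tokenization:
# quantize SSL features to discrete "speechcodes" that carry phonetic and
# prosodic content rather than speaker identity. Names/dims are assumptions.
import torch
import torch.nn as nn

class SpeechcodeTokenizer(nn.Module):
    def __init__(self, feat_dim=1024, code_dim=256, n_codes=1024):
        super().__init__()
        self.encoder = nn.Linear(feat_dim, code_dim)      # project SSL features
        self.spk_proj = nn.Linear(code_dim, code_dim)     # speaker pathway
        self.codebook = nn.Embedding(n_codes, code_dim)   # VQ codebook

    def forward(self, wavlm_feats, spk_emb):
        z = self.encoder(wavlm_feats)                     # (T, code_dim)
        # crude stand-in for disentanglement: remove a projected speaker
        # embedding so the quantized codes favor content over identity
        z = z - self.spk_proj(spk_emb)
        dists = torch.cdist(z, self.codebook.weight)      # (T, n_codes)
        return dists.argmin(dim=-1)                       # discrete speechcodes
```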

Technical Overview

BASE TTS approaches the challenge of TTS through an LLM-based paradigm, treating TTS as a next-token-prediction problem. The architecture comprises a Transformer-based autoregressive model coupled with discrete speech representations termed speechcodes. These speechcodes, derived with a novel tokenization technique, incorporate speaker-ID disentanglement and compression. To convert speechcodes into waveforms, a convolution-based speechcode decoder is employed, markedly improving computational efficiency without sacrificing speech quality.
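
The sketch below makes the next-token formulation concrete: a causal Transformer over a shared text-plus-speechcode vocabulary, with a greedy generation loop. The hyperparameters, class names, and greedy decoding are assumptions for illustration, not the paper's configuration; the convolutional waveform decoder is not shown.

```python
# Hypothetical sketch of the LLM-style TTS formulation: an autoregressive
# Transformer predicts discrete speechcodes from text tokens.
import torch
import torch.nn as nn

class SpeechcodeLM(nn.Module):
    def __init__(self, n_text=50_000, n_codes=1_000, d=1024, max_len=4096):
        super().__init__()
        self.embed = nn.Embedding(n_text + n_codes, d)    # shared vocabulary
        self.pos = nn.Embedding(max_len, d)               # learned positions
        layer = nn.TransformerEncoderLayer(d, nhead=16, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=12)
        self.head = nn.Linear(d, n_text + n_codes)        # next-token logits

    def forward(self, tokens):                            # (B, T): text ++ codes
        t = torch.arange(tokens.size(1), device=tokens.device)
        x = self.embed(tokens) + self.pos(t)
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        h = self.backbone(x, mask=causal)                 # causal self-attention
        return self.head(h)

@torch.no_grad()
def generate_speechcodes(model, text_ids, max_new=500):
    seq = text_ids.clone()                                # (B, T_text)
    for _ in range(max_new):
        next_tok = model(seq)[:, -1].argmax(-1, keepdim=True)  # greedy step
        seq = torch.cat([seq, next_tok], dim=1)
    return seq[:, text_ids.size(1):]                      # speechcodes only
```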

The dataset used for training BASE TTS, consisting of 100,000 hours of public domain speech data, is significantly more extensive than those used in prior studies, helping the model learn from a diverse set of linguistic and prosodic patterns. Notably, BASE TTS applies byte-pair encoding (BPE) to the speechcodes to shorten sequence lengths and thus improve modeling of longer audio.
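
As a toy illustration of the BPE step (not the paper's learned vocabulary or merge rules), the function below repeatedly merges the most frequent adjacent pair of speechcode IDs into a new symbol, shortening the sequence the Transformer must model:

```python
# Toy byte-pair encoding over integer speechcode IDs: each merge replaces
# the most frequent adjacent pair with a fresh symbol.
from collections import Counter

def bpe_compress(codes, n_merges=100):
    merges = {}
    next_id = max(codes) + 1
    for _ in range(n_merges):
        pairs = Counter(zip(codes, codes[1:]))
        if not pairs:
            break
        (a, b), freq = pairs.most_common(1)[0]
        if freq < 2:                      # no pair worth merging
            break
        merges[(a, b)] = next_id
        merged, i = [], 0
        while i < len(codes):             # greedily apply the new merge
            if i + 1 < len(codes) and (codes[i], codes[i + 1]) == (a, b):
                merged.append(next_id)
                i += 2
            else:
                merged.append(codes[i])
                i += 1
        codes = merged
        next_id += 1
    return codes, merges

# A repetitive speechcode stream compresses well:
codes, merges = bpe_compress([7, 3, 7, 3, 7, 3, 9], n_merges=5)
print(codes)  # [11, 10, 9]: three merged symbols instead of seven codes
```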

Theoretical Implications and Future Prospects

The implications of this research extend beyond mere improvement in TTS quality; it explores the potential emergence of new capabilities as TTS models scale. The phenomenon observed in LLMs, where qualitative leaps in capability occur beyond certain scale thresholds, is hypothesized to apply to large-scale TTS (LTTS) as well. BASE TTS's performance on the emergent-abilities benchmark underscores the impact of model and data scaling on TTS quality and the handling of textually complex inputs.

Future directions highlighted by this work include exploring the scalability of BASE TTS further and integrating text-only LLM knowledge to close the performance gaps in syntactic complexity and emotional expression. Additionally, addressing limitations such as occasional hallucinations or synthesis cutoffs emerging from autoregressive modeling is pivotal. Coupled with ethical considerations around misuse and biases within speech models, these form critical avenues for ongoing research.

Conclusion

BASE TTS's achievements herald a new era in TTS research, promising significantly more natural and expressive synthetic speech. By combining innovative speech tokenization methods with the power of large-scale datasets and models, BASE TTS paves the way for advancements in speech synthesis that could have wide-ranging applications, from enhancing communication aids to creating more immersive interactive systems.

Authors (19)
  1. Guillermo Cámbara
  2. Yang Li
  3. Fatih Beyhan
  4. Arent van Korlaar
  5. Fan Yang
  6. Arnaud Joly
  7. Álvaro Martín-Cortinas
  8. Ammar Abbas
  9. Adam Michalski
  10. Alexis Moinet
  11. Sri Karlapati
  12. Haohan Guo
  13. Bartosz Putrycz
  14. Soledad López Gambino
  15. Kayeon Yoo
  16. Elena Sokolova
  17. Thomas Drugman
  18. Mateusz Łajszczak
  19. Ewa Muszyńska