BASE TTS: Lessons from building a billion-parameter Text-to-Speech model on 100K hours of data (2402.08093v2)

Published 12 Feb 2024 in cs.LG, cs.CL, and eess.AS

Abstract: We introduce a text-to-speech (TTS) model called BASE TTS, which stands for $\textbf{B}$ig $\textbf{A}$daptive $\textbf{S}$treamable TTS with $\textbf{E}$mergent abilities. BASE TTS is the largest TTS model to-date, trained on 100K hours of public domain speech data, achieving a new state-of-the-art in speech naturalness. It deploys a 1-billion-parameter autoregressive Transformer that converts raw texts into discrete codes ("speechcodes") followed by a convolution-based decoder which converts these speechcodes into waveforms in an incremental, streamable manner. Further, our speechcodes are built using a novel speech tokenization technique that features speaker ID disentanglement and compression with byte-pair encoding. Echoing the widely-reported "emergent abilities" of LLMs when trained on increasing volume of data, we show that BASE TTS variants built with 10K+ hours and 500M+ parameters begin to demonstrate natural prosody on textually complex sentences. We design and share a specialized dataset to measure these emergent abilities for text-to-speech. We showcase state-of-the-art naturalness of BASE TTS by evaluating against baselines that include publicly available large-scale text-to-speech systems: YourTTS, Bark and TortoiseTTS. Audio samples generated by the model can be heard at https://amazon-ltts-paper.com/.

Building a Billion-Parameter Text-to-Speech Model: Insights from BASE TTS

Introduction to BASE TTS

BASE TTS takes a new direction in text-to-speech (TTS) by combining the scaling approach of LLMs with a new speech tokenization technique. The paper reports state-of-the-art speech naturalness from a billion-parameter model trained on 100,000 hours of public domain speech. The model, named Big Adaptive Streamable TTS with Emergent abilities (BASE TTS), aims at more human-like synthesis, particularly natural prosody on textually complex sentences.

Novel Contributions

The main contributions of this work are threefold:

Largest TTS Model: BASE TTS sets a new benchmark in the field by being the largest model to date, with 1 billion parameters. It outperforms existing large-scale TTS models in subjective evaluations, providing more natural speech synthesis.
Emergent Abilities and Benchmark: By scaling the model and dataset size, BASE TTS exhibits emergent abilities, allowing it to render complex prosodic patterns and textual nuances. A specialized dataset and subjective evaluation benchmark for "emergent abilities" in TTS are also introduced, enabling systematic study of model performance against challenging linguistic phenomena.
Novel Speech Representations: The introduction of speaker-disentangled speechcodes, built atop a WavLM Self-Supervised Learning model, demonstrates a sophisticated method to capture only the essential phonemic and prosodic information, achieving high-quality waveform synthesis even at significant compression rates.

Technical Overview

BASE TTS follows an LLM-based paradigm, treating TTS as a next-token-prediction problem. The architecture pairs a Transformer-based autoregressive model with discrete speech representations termed speechcodes. These speechcodes come from a novel tokenization technique that features speaker ID disentanglement and compression. A convolution-based speechcode decoder then converts the speechcodes into waveforms incrementally, which makes the output streamable and improves computational efficiency without sacrificing speech quality.
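The two-stage flow described above can be illustrated with a toy sketch. It is not the paper's implementation: `toy_next_code` and `toy_decode_chunk` are hypothetical stand-ins for the autoregressive Transformer and the convolutional decoder, and the `STOP` code, `max_len`, and `chunk_size` values are assumptions chosen only to make the example runnable.

```python
from typing import Callable, Iterable, Iterator, List

STOP = -1  # hypothetical end-of-speech code (assumption, not from the paper)


def generate_speechcodes(
    text_tokens: List[int],
    next_code: Callable[[List[int], List[int]], int],
    max_len: int = 1000,
) -> Iterator[int]:
    """Autoregressive loop: each code is predicted from the text and all prior codes."""
    codes: List[int] = []
    for _ in range(max_len):  # hard cap so a model that never stops cannot loop forever
        code = next_code(text_tokens, codes)
        if code == STOP:
            return
        codes.append(code)
        yield code


def stream_decode(
    code_stream: Iterable[int],
    decode_chunk: Callable[[List[int]], List[float]],
    chunk_size: int = 4,
) -> Iterator[List[float]]:
    """Decode codes in small chunks so audio can be emitted before generation ends."""
    buffer: List[int] = []
    for code in code_stream:
        buffer.append(code)
        if len(buffer) == chunk_size:
            yield decode_chunk(buffer)
            buffer = []
    if buffer:  # flush the final partial chunk
        yield decode_chunk(buffer)


def toy_next_code(text_tokens: List[int], codes: List[int]) -> int:
    """Stub model: emits two codes per text token, then STOP."""
    if len(codes) >= 2 * len(text_tokens):
        return STOP
    return (sum(text_tokens) + len(codes)) % 8


def toy_decode_chunk(chunk: List[int]) -> List[float]:
    """Stub decoder: maps each code to one fake waveform sample."""
    return [c / 8 for c in chunk]


if __name__ == "__main__":
    text = [3, 1, 4]
    for audio in stream_decode(generate_speechcodes(text, toy_next_code), toy_decode_chunk):
        print(audio)
```

Because both stages are generators, the first audio chunk is available as soon as `chunk_size` codes exist, which is the property the word "streamable" refers to.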

The training dataset, 100,000 hours of public domain speech data, is significantly larger than those used in prior studies and exposes the model to a diverse set of linguistic and prosodic patterns. BASE TTS also applies Byte-Pair Encoding (BPE) to the speechcodes, which shortens the token sequences and so helps the model handle longer audio.
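To show why BPE shortens sequences, here is a minimal sketch of byte-pair merging applied to sequences of integer codes. It illustrates the general BPE idea only; the function names, the `first_new_id` parameter, and the example codes are assumptions, and the paper's actual vocabulary size and merge procedure are not given in this summary.

```python
from collections import Counter
from typing import List, Optional, Tuple

Pair = Tuple[int, int]


def most_frequent_pair(seqs: List[List[int]]) -> Optional[Tuple[Pair, int]]:
    """Return the most common adjacent pair and its count, or None if there are no pairs."""
    counts: Counter = Counter()
    for seq in seqs:
        counts.update(zip(seq, seq[1:]))
    if not counts:
        return None
    return counts.most_common(1)[0]


def merge_pair(seq: List[int], pair: Pair, new_id: int) -> List[int]:
    """Replace every non-overlapping occurrence of `pair` with `new_id`, left to right."""
    out: List[int] = []
    i = 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out


def learn_bpe(seqs: List[List[int]], num_merges: int, first_new_id: int):
    """Greedily learn up to `num_merges` merges; returns (merges, compressed sequences)."""
    seqs = [list(s) for s in seqs]
    merges: List[Tuple[Pair, int]] = []
    for _ in range(num_merges):
        best = most_frequent_pair(seqs)
        if best is None or best[1] < 2:  # stop when no pair repeats
            break
        pair = best[0]
        new_id = first_new_id + len(merges)
        merges.append((pair, new_id))
        seqs = [merge_pair(s, pair, new_id) for s in seqs]
    return merges, seqs


if __name__ == "__main__":
    codes = [[1, 2, 1, 2, 3], [1, 2, 3, 3]]
    merges, compressed = learn_bpe(codes, num_merges=2, first_new_id=100)
    print(merges)
    print(compressed)
```

In this toy input, two merges reduce nine codes to four, so the autoregressive model has fewer steps to predict for the same stretch of audio.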

Theoretical Implications and Future Prospects

The implications of this research extend beyond improved TTS quality: the work explores whether new capabilities emerge as TTS models scale. The phenomenon observed in LLMs, where qualitative leaps in capability occur beyond certain scale thresholds, is hypothesized to apply to large TTS (LTTS) models as well. BASE TTS's performance on the emergent abilities benchmark supports the view that model and data scaling improve both TTS quality and the handling of complex text.

Future directions highlighted by this work include exploring the scalability of BASE TTS further and integrating text-only LLM knowledge to close the performance gaps in syntactic complexity and emotional expression. Additionally, addressing limitations such as occasional hallucinations or synthesis cutoffs emerging from autoregressive modeling is pivotal. Coupled with ethical considerations around misuse and biases within speech models, these form critical avenues for ongoing research.

Conclusion

BASE TTS's achievements herald a new era in TTS research, promising significantly more natural and expressive synthetic speech. By combining innovative speech tokenization methods with the power of large-scale datasets and models, BASE TTS paves the way for advancements in speech synthesis that could have wide-ranging applications, from enhancing communication aids to creating more immersive interactive systems.

Authors (19)

Guillermo Cámbara (9 papers)
Yang Li (1140 papers)
Fatih Beyhan (4 papers)
Arent van Korlaar (4 papers)
Fan Yang (877 papers)
Arnaud Joly (14 papers)
Álvaro Martín-Cortinas (3 papers)
Ammar Abbas (12 papers)
Adam Michalski (2 papers)
Alexis Moinet (22 papers)
Sri Karlapati (13 papers)
Haohan Guo (22 papers)
Bartosz Putrycz (8 papers)
Soledad López Gambino (1 paper)
Kayeon Yoo (1 paper)
Elena Sokolova (6 papers)
Thomas Drugman (61 papers)
Mateusz Łajszczak (4 papers)
Ewa Muszyńska (1 paper)
