- The paper introduces Spark-TTS, an efficient LLM-based text-to-speech model built on BiCodec, a single-stream decoupled speech tokenization method that improves both efficiency and controllability.
- BiCodec separates speech into semantic tokens (linguistic content) and global tokens (speaker attributes), achieving state-of-the-art reconstruction quality and enabling both coarse-grained and fine-grained voice control.
- The authors also release VoxBox, a meticulously annotated 100,000-hour speech dataset, and show that Spark-TTS attains high intelligibility in zero-shot synthesis along with strong perceptual quality (UTMOS 4.35).
The paper introduces Spark-TTS, a novel and efficient text-to-speech (TTS) model leveraging LLMs with a single-stream decoupled speech tokenization method. The system uses BiCodec, a speech codec that decomposes speech into semantic tokens (for linguistic content) and global tokens (for speaker attributes). It also presents VoxBox, a meticulously annotated 100,000-hour dataset designed to facilitate controllable TTS research.
The key contributions of this work can be summarized as:
- BiCodec Tokenization: The paper introduces a novel tokenization scheme, BiCodec, that produces a single token stream combining semantic and global tokens.
- Comprehensive Voice Control: Spark-TTS enables both coarse-grained (e.g., gender, speaking style) and fine-grained (e.g., pitch, speaking rate) voice control, integrated within a text LLM-compatible architecture.
- Benchmark Dataset: The paper introduces VoxBox, a large speech corpus with systematic data collection, cleaning, and attribute annotation.
The BiCodec architecture consists of a global tokenizer and a semantic tokenizer: the global tokenizer extracts global tokens from Mel spectrograms, while the semantic tokenizer extracts semantic tokens at 50 tokens per second (TPS) from wav2vec 2.0 features. The codec is trained end-to-end within a GAN framework, jointly minimizing reconstruction loss and optimizing the Vector Quantization (VQ) codebook. The losses comprise an L1 loss on multi-scale mel-spectrograms, a multi-period discriminator loss, a multi-band multi-scale Short-Time Fourier Transform (STFT) discriminator loss, a codebook loss, and a commitment loss.
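To make the training objective concrete, here is a minimal PyTorch sketch of the reconstruction and VQ terms; the FFT sizes, mel-band count, and commitment weight `beta` are placeholder assumptions, and the adversarial (discriminator) terms are omitted:

```python
import torch.nn.functional as F
import torchaudio

def multi_scale_mel_loss(pred_wav, target_wav, sample_rate=16000,
                         n_ffts=(512, 1024, 2048)):
    """L1 loss between mel spectrograms at several FFT resolutions."""
    loss = 0.0
    for n_fft in n_ffts:
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=n_fft,
            hop_length=n_fft // 4, n_mels=80)
        loss = loss + F.l1_loss(mel(pred_wav), mel(target_wav))
    return loss / len(n_ffts)

def vq_losses(encoder_out, quantized, beta=0.25):
    """Codebook loss pulls code vectors toward (stop-gradient) encoder
    outputs; the commitment loss keeps encoder outputs near their codes."""
    codebook_loss = F.mse_loss(quantized, encoder_out.detach())
    commitment_loss = beta * F.mse_loss(encoder_out, quantized.detach())
    return codebook_loss, commitment_loss
```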
The Spark-TTS speech LLM employs a decoder-only transformer architecture and uses the pre-trained Qwen2.5-0.5B LLM as its backbone. The model supports zero-shot TTS and voice creation using attribute labels. It encodes attribute information at two levels: coarse-grained (attribute labels) and fine-grained (attribute values). During inference, the model predicts fine-grained pitch values, speed values, global tokens, and semantic tokens through a chain-of-thought (CoT) approach. The LLM is trained by minimizing the negative log-likelihood of token predictions.
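As an illustration of the CoT ordering (coarse attribute labels, then fine-grained values, then global and semantic tokens), the following hypothetical sketch assembles a target sequence and the NLL training loss; the helper names and the flat concatenation are assumptions, not the paper's actual prompt format:

```python
import torch
import torch.nn.functional as F

def build_target_sequence(text_ids, attr_label_ids, pitch_value_id,
                          speed_value_id, global_ids, semantic_ids):
    """CoT ordering: coarse attribute labels -> fine-grained pitch/speed
    values -> global tokens -> semantic tokens, conditioned on the text."""
    fine_ids = torch.tensor([pitch_value_id, speed_value_id])
    return torch.cat([text_ids, attr_label_ids, fine_ids,
                      global_ids, semantic_ids])

def nll_loss(logits, targets):
    """Next-token negative log-likelihood over the predicted tokens."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```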
The VoxBox dataset comprises 4.7 million audio files from 29 open datasets, totaling 102.5k hours of speech data. Each audio file is annotated with gender, pitch, and speed. Gender annotation is performed using a fine-tuned WavLM-large model, while pitch annotation involves extracting the average pitch value using PyWorld and converting it to the Mel scale. Speed annotation is based on syllable-per-second (SPS) measurements. Data cleaning is performed using Whisper-based Automatic Speech Recognition (ASR) systems and FunASR.
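A minimal sketch of the pitch and speed annotations, assuming PyWorld's DIO + StoneMask estimators and the common 1127·ln(1 + f/700) Hz-to-mel conversion (the paper's exact settings may differ):

```python
import numpy as np
import pyworld as pw

def annotate_pitch_mel(wav, fs=16000):
    """Average voiced-frame F0, converted from Hz to the mel scale."""
    x = wav.astype(np.float64)
    f0, t = pw.dio(x, fs)            # raw F0 contour
    f0 = pw.stonemask(x, f0, t, fs)  # refined estimate
    voiced = f0[f0 > 0]
    mean_hz = float(voiced.mean()) if voiced.size else 0.0
    return 1127.0 * np.log(1.0 + mean_hz / 700.0)

def annotate_speed_sps(n_syllables, duration_sec):
    """Speaking rate as syllables per second (SPS)."""
    return n_syllables / duration_sec
```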
Experiments evaluate the reconstruction performance of BiCodec as well as the control capabilities and zero-shot TTS performance of Spark-TTS. BiCodec achieves state-of-the-art (SOTA) reconstruction quality, operating at 50 TPS with a bit rate of 0.65 kbps. Spark-TTS significantly outperforms other controllable TTS systems in gender control, reaching 99.77% accuracy, and demonstrates high intelligibility in zero-shot TTS, with competitive Character Error Rate (CER) and Word Error Rate (WER) scores. A quality evaluation using UTMOS, an automatic mean opinion score (MOS) predictor, shows that Spark-TTS achieves higher predicted speech quality (UTMOS 4.35) than both the ground truth (4.08) and CosyVoice2 (4.23).
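As a quick sanity check on the reported rate: at 50 TPS, 0.65 kbps corresponds to 13 bits per token, which implies a semantic codebook of 2^13 = 8192 entries (assuming the quoted bit rate counts the semantic stream alone):

```python
tokens_per_sec = 50
bits_per_token = 650 / tokens_per_sec      # 0.65 kbps -> 13.0 bits/token
codebook_size = 2 ** int(bits_per_token)   # 8192 entries
print(bits_per_token, codebook_size)
```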
Compared with Llasa, a single-stage model built on one auto-regressive (AR) LLM and a single codebook, Spark-TTS achieves better zero-shot TTS performance and controllable voice creation with fewer model parameters. The paper notes a limitation: Spark-TTS exhibits relatively lower speaker-similarity metrics in zero-shot TTS than multi-stage or non-autoregressive (NAR) methods, likely because the AR LLM introduces greater speaker variability during inference. As future work, the authors suggest strengthening the global tokens' control over timbre by perturbing formants or pitch in the semantic-token input, encouraging better disentanglement of timbre information.