- The paper introduces Spark-TTS, an efficient LLM-based text-to-speech model built on BiCodec, a single-stream decoupled speech tokenization method that improves both efficiency and controllability.
- BiCodec separates speech into semantic tokens (linguistic content) and global tokens (speaker attributes), achieving state-of-the-art reconstruction quality and enabling both coarse-grained and fine-grained voice control.
- The authors also release VoxBox, a meticulously annotated 100,000-hour speech dataset, and show that Spark-TTS attains high intelligibility in zero-shot synthesis along with strong perceptual quality (UTMOS 4.35).
The paper introduces Spark-TTS, a novel and efficient text-to-speech (TTS) model leveraging LLMs with a single-stream decoupled speech tokenization method. The system uses BiCodec, a speech codec that decomposes speech into semantic tokens (for linguistic content) and global tokens (for speaker attributes). It also presents VoxBox, a meticulously annotated 100,000-hour dataset designed to facilitate controllable TTS research.
The key contributions of this work can be summarized as:
- BiCodec Tokenization: The paper introduces a novel tokenization scheme, BiCodec, that produces a single token stream combining semantic and global tokens.
- Comprehensive Voice Control: Spark-TTS enables both coarse-grained (e.g., gender, speaking style) and fine-grained (e.g., pitch, speaking rate) voice control, integrated within a text LLM-compatible architecture.
- Benchmark Dataset: The paper introduces VoxBox, a large speech corpus with systematic data collection, cleaning, and attribute annotation.
The BiCodec architecture consists of a global tokenizer and a semantic tokenizer: the global tokenizer extracts global tokens from Mel spectrograms, while the semantic tokenizer extracts semantic tokens at 50 tokens per second (TPS) from wav2vec 2.0 features. The codec is trained end-to-end within a GAN framework, jointly minimizing reconstruction loss and optimizing the Vector Quantization (VQ) codebook. The losses comprise an L1 loss on multi-scale mel-spectrograms, a multi-period discriminator loss, a multi-band multi-scale Short-Time Fourier Transform (STFT) discriminator loss, a codebook loss, and a commitment loss.
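To make the training objective concrete, here is a minimal PyTorch sketch of the reconstruction and VQ terms; the FFT sizes, mel-band count, and commitment weight `beta` are placeholder assumptions, and the adversarial (discriminator) terms are omitted:

```python
import torch.nn.functional as F
import torchaudio

def multi_scale_mel_loss(pred_wav, target_wav, sample_rate=16000,
                         n_ffts=(512, 1024, 2048)):
    """L1 loss between mel spectrograms at several FFT resolutions."""
    loss = 0.0
    for n_fft in n_ffts:
        mel = torchaudio.transforms.MelSpectrogram(
            sample_rate=sample_rate, n_fft=n_fft,
            hop_length=n_fft // 4, n_mels=80)
        loss = loss + F.l1_loss(mel(pred_wav), mel(target_wav))
    return loss / len(n_ffts)

def vq_losses(encoder_out, quantized, beta=0.25):
    """Codebook loss pulls code vectors toward (stop-gradient) encoder
    outputs; the commitment loss keeps encoder outputs near their codes."""
    codebook_loss = F.mse_loss(quantized, encoder_out.detach())
    commitment_loss = beta * F.mse_loss(encoder_out, quantized.detach())
    return codebook_loss, commitment_loss
```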
The Spark-TTS speech LLM employs a decoder-only transformer architecture and uses the pre-trained Qwen2.5-0.5B LLM as its backbone. The model supports zero-shot TTS and voice creation using attribute labels. It encodes attribute information at two levels: coarse-grained (attribute labels) and fine-grained (attribute values). During inference, the model predicts fine-grained pitch values, speed values, global tokens, and semantic tokens through a chain-of-thought (CoT) approach. The LLM is trained by minimizing the negative log-likelihood of token predictions.
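As an illustration of the CoT ordering (coarse attribute labels, then fine-grained values, then global and semantic tokens), the following hypothetical sketch assembles a target sequence and the NLL training loss; the helper names and the flat concatenation are assumptions, not the paper's actual prompt format:

```python
import torch
import torch.nn.functional as F

def build_target_sequence(text_ids, attr_label_ids, pitch_value_id,
                          speed_value_id, global_ids, semantic_ids):
    """CoT ordering: coarse attribute labels -> fine-grained pitch/speed
    values -> global tokens -> semantic tokens, conditioned on the text."""
    fine_ids = torch.tensor([pitch_value_id, speed_value_id])
    return torch.cat([text_ids, attr_label_ids, fine_ids,
                      global_ids, semantic_ids])

def nll_loss(logits, targets):
    """Next-token negative log-likelihood over the predicted tokens."""
    return F.cross_entropy(logits.reshape(-1, logits.size(-1)),
                           targets.reshape(-1))
```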
The VoxBox dataset comprises 4.7 million audio files from 29 open datasets, totaling 102.5k hours of speech data. Each audio file is annotated with gender, pitch, and speed. Gender annotation is performed using a fine-tuned WavLM-large model, while pitch annotation involves extracting the average pitch value using PyWorld and converting it to the Mel scale. Speed annotation is based on syllable-per-second (SPS) measurements. Data cleaning is performed using Whisper-based Automatic Speech Recognition (ASR) systems and FunASR.
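A minimal sketch of the pitch and speed annotations, assuming PyWorld's DIO + StoneMask estimators and the common 1127·ln(1 + f/700) Hz-to-mel conversion (the paper's exact settings may differ):

```python
import numpy as np
import pyworld as pw

def annotate_pitch_mel(wav, fs=16000):
    """Average voiced-frame F0, converted from Hz to the mel scale."""
    x = wav.astype(np.float64)
    f0, t = pw.dio(x, fs)            # raw F0 contour
    f0 = pw.stonemask(x, f0, t, fs)  # refined estimate
    voiced = f0[f0 > 0]
    mean_hz = float(voiced.mean()) if voiced.size else 0.0
    return 1127.0 * np.log(1.0 + mean_hz / 700.0)

def annotate_speed_sps(n_syllables, duration_sec):
    """Speaking rate as syllables per second (SPS)."""
    return n_syllables / duration_sec
```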
Experiments evaluate the reconstruction performance of BiCodec as well as the control capabilities and zero-shot TTS performance of Spark-TTS. BiCodec achieves state-of-the-art (SOTA) reconstruction quality, operating at 50 TPS with a bit rate of 0.65 kbps. Spark-TTS significantly outperforms other controllable TTS systems in gender control, reaching 99.77% accuracy, and demonstrates high intelligibility in zero-shot TTS, with competitive Character Error Rate (CER) and Word Error Rate (WER) scores. A quality evaluation using UTMOS, an automatic mean opinion score (MOS) predictor, shows that Spark-TTS achieves higher predicted speech quality (UTMOS 4.35) than both the ground truth (4.08) and CosyVoice2 (4.23).
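As a quick sanity check on the reported rate: at 50 TPS, 0.65 kbps corresponds to 13 bits per token, which implies a semantic codebook of 2^13 = 8192 entries (assuming the quoted bit rate counts the semantic stream alone):

```python
tokens_per_sec = 50
bits_per_token = 650 / tokens_per_sec      # 0.65 kbps -> 13.0 bits/token
codebook_size = 2 ** int(bits_per_token)   # 8192 entries
print(bits_per_token, codebook_size)
```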
Compared with Llasa, a single-stage model built on one auto-regressive (AR) LLM and a single codebook, Spark-TTS achieves better zero-shot TTS performance and controllable voice creation with fewer model parameters. The paper notes a limitation: Spark-TTS exhibits relatively lower speaker-similarity metrics in zero-shot TTS than multi-stage or non-autoregressive (NAR) methods, likely because the AR LLM introduces greater speaker variability during inference. As future work, the authors suggest strengthening the global tokens' control over timbre by perturbing formants or pitch in the semantic-token input, encouraging better disentanglement of timbre information.