- The paper introduces LLaSA, a unified single-stage TTS framework that simplifies traditional multi-stage architectures and studies scaling of both train-time and inference-time compute.
- It pairs a single-layer VQ codec with a transformer model that tokenizes speech and is trained via next-token prediction, improving naturalness and prosody.
- Increased compute yields significant gains in zero-shot performance, higher speaker similarity, and lower word error rates on benchmark datasets.
An Evaluation of LLaSA: Enhancing Speech Synthesis Through Computation Scaling
The paper presents LLaSA, a framework for improving text-to-speech (TTS) systems by scaling train-time and inference-time compute, inspired by recent successes of text-based LLMs such as GPT and the LLaMA series. It targets the complexity of conventional multi-stage TTS pipelines, which often chain additional models, such as diffusion models, after an LLM. Instead, the work explores a simplified architecture that aligns closely with standard LLMs, pairing a single-layer vector quantizer (VQ) codec with a single transformer for TTS.
Methodological Innovations
LLaSA aligns the TTS framework with the standard text-LLM paradigm, employing:
- Speech Tokenization: X-codec2, a codec with a single vector-quantizer layer, converts raw speech waveforms into a single stream of discrete tokens, aiming to preserve the content of the speech signal without multiple codebook levels.
- Training and Inference: The model, initialized from LLaMA, is trained with next-token prediction on a combined sequence of text and speech tokens, learning the conditional probabilities of speech tokens given the textual context (a minimal sketch follows this list).
- Scaling Compute: The paper systematically scales both train-time and inference-time compute to assess improvements along two key TTS capabilities: text understanding and in-context learning.
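The training setup can be pictured as ordinary language modeling over a mixed token sequence. Below is a minimal sketch under stated assumptions: the speech tokens come from a single-VQ codec, `model` is a hypothetical HuggingFace-style causal LM (e.g. initialized from LLaMA), and the special-token IDs and helper names are illustrative rather than taken from the paper's code.

```python
# Minimal sketch of single-stage TTS training as next-token prediction.
# `model` is assumed to be a HuggingFace-style causal LM; special-token
# IDs and helper names are illustrative, not the paper's actual API.
import torch
import torch.nn.functional as F

def build_sequence(text_ids, speech_ids, bos_id, boa_id, eos_id):
    """<bos> text tokens <begin_of_audio> speech tokens <eos>, all drawn
    from one shared vocabulary; speech_ids would come from the codec,
    e.g. a hypothetical codec.encode(waveform)."""
    return torch.cat([
        torch.tensor([bos_id]), text_ids,
        torch.tensor([boa_id]), speech_ids,
        torch.tensor([eos_id]),
    ])

def speech_lm_loss(model, seq, text_len):
    """Next-token prediction, supervised only on the speech span so the
    model learns p(speech tokens | text context)."""
    logits = model(seq[:-1].unsqueeze(0)).logits        # (1, T-1, vocab)
    targets = seq[1:].unsqueeze(0)                      # shifted by one
    per_tok = F.cross_entropy(logits.transpose(1, 2), targets,
                              reduction="none")
    mask = torch.zeros_like(targets, dtype=torch.bool)
    mask[:, text_len + 1:] = True   # speech tokens and <eos> only
    return per_tok[mask].mean()
```

Masking the loss to the speech span is one common way to realize the conditional objective; the paper frames training as learning exactly this conditional distribution of speech tokens given text.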
Experimental Results
Increasing train-time compute, through larger model sizes and more extensive training data, consistently improves the naturalness and prosody of the synthesized speech. Evaluations on test sets including LibriSpeech and Seed-TTS-Eval show clear gains in speech quality, particularly in emotional expressiveness and intelligibility, and markedly stronger in-context learning, making the model robust in zero-shot TTS for unseen speakers and emotions.
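The WER and speaker-similarity numbers on such test sets typically follow a standard zero-shot evaluation recipe: transcribe the synthesized audio with an off-the-shelf ASR model and score it against the input text, and compare speaker embeddings of generated and reference audio. A minimal sketch, assuming hypothetical `asr_transcribe` and `speaker_embedding` helpers (the paper's exact evaluation tooling may differ):

```python
# Sketch of the common zero-shot TTS evaluation loop. The helpers
# `asr_transcribe(wav)` and `speaker_embedding(wav)` are assumptions,
# standing in for e.g. an ASR model and a speaker-verification model.
import jiwer
import torch.nn.functional as F

def evaluate(samples, asr_transcribe, speaker_embedding):
    """samples: list of (input_text, generated_wav, reference_wav)."""
    refs, hyps, sims = [], [], []
    for text, gen_wav, ref_wav in samples:
        refs.append(text)
        hyps.append(asr_transcribe(gen_wav))        # intelligibility
        sims.append(F.cosine_similarity(            # speaker similarity
            speaker_embedding(gen_wav),
            speaker_embedding(ref_wav), dim=-1).item())
    return jiwer.wer(refs, hyps), sum(sims) / len(sims)
```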
For test-time compute scaling, LLaSA employs off-the-shelf speech understanding models as verifiers during inference. Spending more compute at inference can bias generation toward outputs with higher speaker similarity and lower word error rate (WER). Techniques such as process reward models applied to partial generations prove instrumental in controlling speech attributes like speaker identity and emotion while preserving content accuracy.
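One simple instantiation of verifier-guided test-time scaling is best-of-N sampling: draw several candidates from the stochastic decoder and keep the one the verifiers score highest. The sketch below is an illustration under stated assumptions, not the paper's exact procedure; `generate_candidate` and the two scorers are hypothetical, and the linear score combination is a design choice of this sketch.

```python
# Best-of-N test-time scaling sketch: sample N candidate utterances and
# keep the one an external verifier scores highest. Scorer names and
# the weighting scheme are illustrative assumptions.
def best_of_n(text, prompt_wav, generate_candidate, score_similarity,
              score_intelligibility, n=16, alpha=0.5):
    best_wav, best_score = None, float("-inf")
    for _ in range(n):
        wav = generate_candidate(text, prompt_wav)   # stochastic sampling
        # Combine verifier signals: speaker similarity to the prompt and
        # an intelligibility score (e.g. negative WER from an ASR model).
        score = (alpha * score_similarity(wav, prompt_wav)
                 + (1 - alpha) * score_intelligibility(wav, text))
        if score > best_score:
            best_wav, best_score = wav, score
    return best_wav
```

Process-reward-style verification extends this idea by scoring partial token sequences during decoding rather than only finished utterances, which is how attributes like speaker identity can be steered mid-generation.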
LLaSA achieves competitive performance against state-of-the-art TTS systems such as Seed-TTS across multiple metrics. Notably, it reaches strong speaker similarity and intelligibility on zero-shot tasks, rivaling more complex multi-stage architectures at considerably lower resource cost.
Implications and Future Directions
The LLaSA framework not only provides a scalable and flexible TTS solution but also introduces a unified approach that bypasses the traditional complexity of TTS systems. Open-sourcing the model and framework paves the way for broader exploration into simplified TTS architectures, potentially encouraging the TTS community to focus on foundational scaling laws and inference strategies akin to advancements seen in text LLMs.
Practically, the findings bear on real-time speech synthesis in varied settings, from virtual assistants to content generation platforms. Theoretically, they motivate further study of the scalability of LLM-based TTS systems, laying groundwork for future work on more refined tokenization techniques and deeper integration with speech understanding models. Aligning TTS with LLM practice also opens avenues for cross-domain transfer, where insights from language modeling further inform audio and speech technologies, ultimately advancing AI's capabilities in human-computer interaction.