Interleaved Speech-Text LLMs for Streaming Text-to-Speech Synthesis
The paper introduces the Interleaved Speech-Text LLM (IST-LM), a novel approach to streaming zero-shot text-to-speech (TTS) synthesis. The method integrates speech and text processing into a unified framework, eliminating separate stages such as duration prediction and grapheme-to-phoneme alignment that traditional TTS pipelines rely on.
Core Contributions
The IST-LM is trained on sequences of interleaved text and speech tokens. A central focus of the research is the ratio between text and speech chunk sizes, which proves crucial to the model's performance. The authors conducted statistical analyses of the training data, measuring the distance between speech tokens and their corresponding text tokens, the number of future text tokens visible to each speech token, and how often a speech token precedes its corresponding text token. A minimal sketch of the interleaving follows.
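To make the interleaving concrete, here is a minimal sketch of fixed-ratio chunk interleaving. The function name, placeholder tokens, and the assumption that the two streams are already consistently ordered are illustrative; this is not the paper's code.

```python
# Hypothetical sketch of fixed-ratio text/speech token interleaving.
def interleave(text_tokens, speech_tokens, text_chunk=1, speech_chunk=3):
    """Merge two token streams into one sequence of alternating chunks.

    With text_chunk=1 and speech_chunk=3 this yields the 1:3 ratio the
    paper reports to work best: t1 s1 s2 s3 t2 s4 s5 s6 ...
    """
    out, ti, si = [], 0, 0
    while ti < len(text_tokens) or si < len(speech_tokens):
        out.extend(text_tokens[ti:ti + text_chunk])
        ti += text_chunk
        out.extend(speech_tokens[si:si + speech_chunk])
        si += speech_chunk
    return out

text = ["t1", "t2", "t3"]
speech = [f"s{i}" for i in range(1, 10)]
print(interleave(text, speech))
# ['t1', 's1', 's2', 's3', 't2', 's4', 's5', 's6', 't3', 's7', 's8', 's9']
```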
Notably, the experimental results show that the streaming IST-LM comes close to more sophisticated non-streaming systems. With a 1:3 ratio of text to speech chunk size, the model achieves a relative Word Error Rate (WER) gap of just 8% against its non-streaming counterpart while maintaining comparable speaker similarity. The model was trained on LibriTTS and evaluated on the LibriSpeech test-clean set in zero-shot TTS scenarios.
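For clarity, the relative gap quoted above presumably follows the standard definition of a relative difference (the absolute WER values themselves are not restated here):

$$\text{relative WER gap} = \frac{\mathrm{WER}_{\text{streaming}} - \mathrm{WER}_{\text{non-streaming}}}{\mathrm{WER}_{\text{non-streaming}}} \approx 8\%$$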
Methodological Innovations
The research examines how fixed interleaving ratios affect performance, leading to IST-LM, which directly models an alternating sequence of text and speech tokens. The model uses a unidirectional Transformer decoder for interleaved token prediction, with the choice of ratio guided by the word-level, position-aware statistical measures described above.
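At inference time, this alternation suggests a simple streaming loop: ingest a text chunk, decode a speech chunk, repeat. The sketch below assumes a hypothetical `model.next_speech_tokens` interface standing in for one autoregressive decoding step; it is not the paper's actual API.

```python
# Illustrative streaming-inference loop for an interleaved decoder.
def stream_tts(model, text_tokens, text_chunk=1, speech_chunk=3):
    """Consume text tokens as they arrive; emit speech tokens incrementally.

    The context grows as text_chunk text tokens followed by speech_chunk
    speech tokens, matching the interleaved order seen during training.
    """
    context = []
    for i in range(0, len(text_tokens), text_chunk):
        context.extend(text_tokens[i:i + text_chunk])   # new text arrives
        # Hypothetical call: decode speech_chunk tokens autoregressively.
        new_speech = model.next_speech_tokens(context, n=speech_chunk)
        context.extend(new_speech)   # condition later steps on the output
        yield from new_speech        # hand tokens to a (streaming) vocoder
```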
Evaluation
IST-LM outperforms existing systems such as VALL-E on both continuation and cross-sentence TTS tasks. It improves both WER and speaker similarity, which the authors attribute to its use of semantic tokens and an inference design that handles the alternating structure of text and speech inputs.
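As a reference for these two headline metrics, here is a hedged sketch of how they are typically computed in zero-shot TTS evaluation; the ASR system and speaker-embedding model that would produce the inputs are placeholders, since the paper's exact toolchain is not restated here.

```python
import numpy as np

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution
    return d[len(r), len(h)] / max(len(r), 1)

def speaker_similarity(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Cosine similarity between speaker embeddings of prompt and output."""
    return float(np.dot(emb_ref, emb_gen) /
                 (np.linalg.norm(emb_ref) * np.linalg.norm(emb_gen)))
```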
Practical and Theoretical Implications
Practically, IST-LM offers a streamlined, less computationally intensive approach to streaming TTS. Theoretically, it points toward LLMs that integrate multiple modalities within a single model, opening avenues for real-time speech applications and other multimodal AI systems.
Future Directions
The development of IST-LM is poised to influence future work on streaming TTS and multimodal AI integration. One pertinent next step is incorporating a streaming vocoder to achieve end-to-end waveform synthesis in real-time applications; a hypothetical chunked interface is sketched below. Further investigation of alternative interleaving strategies could improve performance and extend such models to other languages and domains.
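The following sketch illustrates one way such a streaming vocoder could be driven. The `vocoder.decode` call, the overlap handling, and the samples-per-token figure are all assumptions for illustration, not details from the paper.

```python
# Hypothetical chunked decoding loop for a streaming vocoder.
def stream_waveform(vocoder, speech_tokens, chunk=3, overlap=1,
                    samples_per_token=320):
    """Decode speech tokens to audio chunk by chunk.

    A few previously decoded tokens are kept as left context to reduce
    boundary artifacts, a common trick in chunked neural vocoding.
    """
    context = []
    for start in range(0, len(speech_tokens), chunk):
        window = context + speech_tokens[start:start + chunk]
        audio = vocoder.decode(window)                  # hypothetical call
        yield audio[len(context) * samples_per_token:]  # emit only new audio
        context = window[-overlap:]                     # keep left context
```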
Overall, the paper provides a detailed assessment of how interleaving speech and text can be used effectively within a language modeling framework, offering new insights and strategies for efficient, robust TTS in streaming contexts.