Interleaved Speech-Text LLMs for Streaming Text-to-Speech Synthesis
The paper introduces the Interleaved Speech-Text LLM (IST-LM), a novel approach to streaming zero-shot text-to-speech (TTS) synthesis. The method integrates speech and text processing into a unified framework, eliminating separate stages such as duration prediction and grapheme-to-phoneme alignment that traditional TTS pipelines rely on.
Core Contributions
The IST-LM is trained on sequences of interleaved text and speech tokens. A central focus of the research is the ratio between text and speech chunk sizes, which proves crucial to the model's performance. The authors conducted statistical analyses of the training data, measuring the distance between speech tokens and their corresponding text tokens, the number of future text tokens visible to each speech token, and how often a speech token precedes its corresponding text token. A minimal sketch of the interleaving follows.
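To make the interleaving concrete, here is a minimal sketch of fixed-ratio chunk interleaving. The function name, placeholder tokens, and the assumption that the two streams are already consistently ordered are illustrative; this is not the paper's code.

```python
# Hypothetical sketch of fixed-ratio text/speech token interleaving.
def interleave(text_tokens, speech_tokens, text_chunk=1, speech_chunk=3):
    """Merge two token streams into one sequence of alternating chunks.

    With text_chunk=1 and speech_chunk=3 this yields the 1:3 ratio the
    paper reports to work best: t1 s1 s2 s3 t2 s4 s5 s6 ...
    """
    out, ti, si = [], 0, 0
    while ti < len(text_tokens) or si < len(speech_tokens):
        out.extend(text_tokens[ti:ti + text_chunk])
        ti += text_chunk
        out.extend(speech_tokens[si:si + speech_chunk])
        si += speech_chunk
    return out

text = ["t1", "t2", "t3"]
speech = [f"s{i}" for i in range(1, 10)]
print(interleave(text, speech))
# ['t1', 's1', 's2', 's3', 't2', 's4', 's5', 's6', 't3', 's7', 's8', 's9']
```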
Notably, the experimental results show that the streaming IST-LM comes close to more sophisticated non-streaming systems. With a 1:3 ratio of text to speech chunk size, the model achieves a relative Word Error Rate (WER) gap of just 8% against its non-streaming counterpart while maintaining comparable speaker similarity. The model was trained on LibriTTS and evaluated on the LibriSpeech test-clean set in zero-shot TTS scenarios.
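For clarity, the relative gap quoted above presumably follows the standard definition of a relative difference (the absolute WER values themselves are not restated here):

$$\text{relative WER gap} = \frac{\mathrm{WER}_{\text{streaming}} - \mathrm{WER}_{\text{non-streaming}}}{\mathrm{WER}_{\text{non-streaming}}} \approx 8\%$$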
Methodological Innovations
The research examines how fixed interleaving ratios affect performance, leading to IST-LM, which directly models an alternating sequence of text and speech tokens. The model uses a unidirectional Transformer decoder for interleaved token prediction, with the choice of ratio guided by the word-level, position-aware statistical measures described above.
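At inference time, this alternation suggests a simple streaming loop: ingest a text chunk, decode a speech chunk, repeat. The sketch below assumes a hypothetical `model.next_speech_tokens` interface standing in for one autoregressive decoding step; it is not the paper's actual API.

```python
# Illustrative streaming-inference loop for an interleaved decoder.
def stream_tts(model, text_tokens, text_chunk=1, speech_chunk=3):
    """Consume text tokens as they arrive; emit speech tokens incrementally.

    The context grows as text_chunk text tokens followed by speech_chunk
    speech tokens, matching the interleaved order seen during training.
    """
    context = []
    for i in range(0, len(text_tokens), text_chunk):
        context.extend(text_tokens[i:i + text_chunk])   # new text arrives
        # Hypothetical call: decode speech_chunk tokens autoregressively.
        new_speech = model.next_speech_tokens(context, n=speech_chunk)
        context.extend(new_speech)   # condition later steps on the output
        yield from new_speech        # hand tokens to a (streaming) vocoder
```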
Evaluation
IST-LM outperforms existing systems such as VALL-E on both continuation and cross-sentence TTS tasks. It improves both WER and speaker similarity, which the authors attribute to its use of semantic tokens and an inference design that handles the alternating structure of text and speech inputs.
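As a reference for these two headline metrics, here is a hedged sketch of how they are typically computed in zero-shot TTS evaluation; the ASR system and speaker-embedding model that would produce the inputs are placeholders, since the paper's exact toolchain is not restated here.

```python
import numpy as np

def wer(ref: str, hyp: str) -> float:
    """Word error rate: word-level edit distance / reference length."""
    r, h = ref.split(), hyp.split()
    d = np.zeros((len(r) + 1, len(h) + 1), dtype=int)
    d[:, 0] = np.arange(len(r) + 1)
    d[0, :] = np.arange(len(h) + 1)
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i, j] = min(d[i - 1, j] + 1,          # deletion
                          d[i, j - 1] + 1,          # insertion
                          d[i - 1, j - 1] + cost)   # substitution
    return d[len(r), len(h)] / max(len(r), 1)

def speaker_similarity(emb_ref: np.ndarray, emb_gen: np.ndarray) -> float:
    """Cosine similarity between speaker embeddings of prompt and output."""
    return float(np.dot(emb_ref, emb_gen) /
                 (np.linalg.norm(emb_ref) * np.linalg.norm(emb_gen)))
```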
Practical and Theoretical Implications
Practically, IST-LM offers a streamlined, less computationally intensive approach to streaming TTS. Theoretically, it points toward LLMs that integrate multiple modalities within a single model, opening avenues for real-time speech applications and other multimodal AI systems.
Future Directions
The development of IST-LM is poised to influence future work on streaming TTS and multimodal AI integration. One pertinent next step is incorporating a streaming vocoder to achieve end-to-end waveform synthesis in real-time applications; a hypothetical chunked interface is sketched below. Further investigation of alternative interleaving strategies could improve performance and extend such models to other languages and domains.
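The following sketch illustrates one way such a streaming vocoder could be driven. The `vocoder.decode` call, the overlap handling, and the samples-per-token figure are all assumptions for illustration, not details from the paper.

```python
# Hypothetical chunked decoding loop for a streaming vocoder.
def stream_waveform(vocoder, speech_tokens, chunk=3, overlap=1,
                    samples_per_token=320):
    """Decode speech tokens to audio chunk by chunk.

    A few previously decoded tokens are kept as left context to reduce
    boundary artifacts, a common trick in chunked neural vocoding.
    """
    context = []
    for start in range(0, len(speech_tokens), chunk):
        window = context + speech_tokens[start:start + chunk]
        audio = vocoder.decode(window)                  # hypothetical call
        yield audio[len(context) * samples_per_token:]  # emit only new audio
        context = window[-overlap:]                     # keep left context
```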
Overall, the paper provides a detailed assessment of how interleaving speech and text can be used effectively within a language modeling framework, offering new insights and strategies for efficient, robust TTS in streaming contexts.