VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation (2505.19462v2)

Published 26 May 2025 in eess.AS and cs.SD

Abstract: We present VoiceStar, the first zero-shot TTS model that achieves both output duration control and extrapolation. VoiceStar is an autoregressive encoder-decoder neural codec LLM that leverages a novel Progress-Monitoring Rotary Position Embedding (PM-RoPE) and is trained with Continuation-Prompt Mixed (CPM) training. PM-RoPE enables the model to better align text and speech tokens, indicates the target duration for the generated speech, and also allows the model to generate speech waveforms much longer in duration than those seen during training. CPM training also helps to mitigate the training/inference mismatch, and significantly improves the quality of the generated speech in terms of speaker similarity and intelligibility. VoiceStar outperforms or is on par with current state-of-the-art models on short-form benchmarks such as Librispeech and Seed-TTS, and significantly outperforms these models on long-form/extrapolation benchmarks (20-50s) in terms of intelligibility and naturalness. Code and models: https://github.com/jasonppy/VoiceStar. Audio samples: https://jasonppy.github.io/VoiceStar_web

Summary

Summary of "VoiceStar: Robust Zero-Shot Autoregressive TTS with Duration Control and Extrapolation"

The paper presents VoiceStar, an advanced autoregressive text-to-speech (TTS) model designed to overcome existing limitations in Neural Codec LLMs (NCLMs) for voice synthesis. This work addresses critical challenges, including duration control and sequence length extrapolation, offering solutions that enhance the robustness and intelligibility of generated speech.

Key Innovations and Contributions

  1. Progress-Monitoring Rotary Position Embedding (PM-RoPE):
    • VoiceStar introduces PM-RoPE, a positional-embedding mechanism that improves text-speech alignment and enables precise control over the duration of generated speech. The technique also aids long-form generation, allowing the model to produce outputs exceeding the lengths encountered during training (a rough sketch of the positional scheme appears after this list).
  2. Continuation-Prompt Mixed (CPM) Training:
    • The authors propose CPM training to mitigate the mismatch between training and inference conditions: at inference, the voice prompt is a different utterance from the target, whereas plain continuation training only ever continues the prompt itself. By randomly mixing in prompts drawn from other utterances in the same speaker's voice during training, the model learns to separate vocal characteristics from prosodic features, enhancing intelligibility and allowing more natural emotional delivery in synthesized speech (see the data-mixing sketch after this list).
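
To make the PM-RoPE idea concrete, here is a minimal PyTorch sketch. It is not the paper's implementation; it assumes the essential mechanism is standard RoPE applied to fractional position indices, with speech-token positions spread evenly over the text index range so that the remaining distance to the final index signals progress toward the target duration. The function names (rope_angles, apply_rope, progress_positions) are illustrative.

    import torch

    def rope_angles(positions: torch.Tensor, dim: int, base: float = 10000.0) -> torch.Tensor:
        # Standard RoPE frequency table; accepts fractional position indices.
        inv_freq = 1.0 / (base ** (torch.arange(0, dim, 2, dtype=torch.float32) / dim))
        return positions[:, None] * inv_freq  # shape: (seq, dim/2)

    def apply_rope(x: torch.Tensor, angles: torch.Tensor) -> torch.Tensor:
        # Rotate consecutive feature pairs of x by the given angles; x: (seq, dim).
        x1, x2 = x[..., 0::2], x[..., 1::2]
        cos, sin = angles.cos(), angles.sin()
        return torch.stack((x1 * cos - x2 * sin, x1 * sin + x2 * cos), dim=-1).flatten(-2)

    def progress_positions(num_speech_tokens: int, num_text_tokens: int) -> torch.Tensor:
        # Hypothetical progress-monitoring indices: speech positions are spread
        # evenly over the text index range [0, num_text_tokens - 1], so the gap
        # to the final index tells the model how close generation is to the
        # target duration, independent of absolute sequence length.
        return torch.linspace(0.0, float(num_text_tokens - 1), num_speech_tokens)

    # Example: 40 text tokens, target of 300 speech tokens.
    speech_pos = progress_positions(300, 40)
    q = torch.randn(300, 64)                            # speech-token queries
    q_rot = apply_rope(q, rope_angles(speech_pos, 64))  # duration-aware rotation

Because positions are expressed relative to a fixed text index range rather than as absolute token counts, the same scheme applies unchanged to target durations longer than any seen in training, which is one way such an embedding could support the extrapolation behavior the paper reports.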
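
Likewise, here is a toy illustration of how CPM-style example construction might look. This is an assumption-laden sketch, not the authors' pipeline; make_cpm_example and p_prompt are hypothetical names.

    import random

    def make_cpm_example(utterances, p_prompt=0.5):
        # Build one (prompt, target) training pair under a hypothetical CPM
        # scheme. `utterances` holds token sequences from a single speaker.
        target = random.choice(utterances)
        if len(utterances) > 1 and random.random() < p_prompt:
            # Prompt-style: imitate inference by conditioning on a different
            # utterance from the same speaker.
            prompt = random.choice([u for u in utterances if u is not target])
            return prompt, target
        # Continuation-style: the prompt is a prefix of the target utterance
        # and the model simply continues it.
        cut = random.randint(1, max(1, len(target) - 1))
        return target[:cut], target[cut:]

Mixing the two example types exposes the model during training to the prompt-style conditioning it will face at inference, which is the mismatch CPM training is designed to close.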

Experimental Insights

VoiceStar outperforms or matches current state-of-the-art models on short-form benchmarks such as Librispeech and Seed-TTS, and significantly surpasses them on long-form/extrapolation benchmarks (20-50 seconds) in intelligibility and naturalness. The empirical evidence highlights VoiceStar's capability in zero-shot voice cloning, with robust duration control and successful extrapolation to sequence lengths beyond those seen in training.

Implications and Future Directions

VoiceStar points to practical applications in real-time conversational systems where precise control over speech duration and enhanced intelligibility are paramount. The theoretical advancements in PM-RoPE and CPM training suggest pathways for further refining alignment mechanisms and training paradigms. With neural TTS systems becoming increasingly sophisticated, future work might explore integrating improved neural audio codecs and advanced data augmentation techniques.

This work opens avenues in scalable multilingual voice synthesis and potentially bridges performance gaps between distinct TTS frameworks, inviting exploration into hybrid models that seamlessly combine autoregressive and diffusion-based techniques for richer voice synthesis capabilities. Researchers could focus on enhancing computational efficiency and extending these methods to broader audio generation tasks while ensuring real-time adaptability in interactive systems.
