- The paper introduces FireRedTTS-2, a system that generates long-form conversational speech sentence by sentence for interactive podcasts and chatbots.
- It leverages a low-frame-rate speech tokenizer and a dual-transformer architecture to enable stable speaker transitions and contextually coherent prosody.
- Experimental results show lower WER, higher speaker similarity, and greater naturalness than existing dialogue TTS models.
Introduction
FireRedTTS-2 is a streaming, multi-speaker text-to-speech (TTS) system designed for long-form conversational speech generation, targeting applications such as podcasts and interactive chatbots. It addresses two limitations of prior dialogue TTS approaches: they typically require the entire dialogue text before synthesis, and they produce a single, inseparable speech track. The result is inflexible, non-interactive output prone to unstable speaker transitions and incoherent prosody. FireRedTTS-2 instead combines a novel low-frame-rate speech tokenizer with a dual-transformer architecture operating on interleaved text–speech sequences, enabling sentence-by-sentence generation, stable speaker switching, and contextually coherent prosody.
Figure 1: An overview of FireRedTTS-2, including: (a) a new speech tokenizer with a 12.5Hz frame rate and enhanced semantic information, and (b) a text-to-speech model using a dual-transformer architecture with interleaved text–speech input, enabling sentence-by-sentence generation and contextually coherent prosody.
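To make the interleaved text–speech representation concrete, here is a minimal Python sketch of how a dialogue might be flattened into the chronological speaker–text–speech order described above. The tokenizer interfaces, the speaker-tag format, and the single shared token ID space are illustrative assumptions, not details from the paper.

```python
from typing import List

def build_interleaved_sequence(turns, text_tokenizer, speech_tokenizer) -> List[int]:
    """Flatten dialogue turns into chronological [speaker][text][speech] order.

    Each turn is (speaker, text, waveform). A shared token ID space for text
    and speech tokens is assumed here for brevity.
    """
    sequence: List[int] = []
    for speaker, text, waveform in turns:
        # Hypothetical speaker tag, e.g. "[S1]", mapped to dedicated token IDs.
        sequence += text_tokenizer.encode(f"[{speaker}]")
        sequence += text_tokenizer.encode(text)
        # First-layer speech tokens for this turn (12.5 tokens per second).
        sequence += speech_tokenizer.encode(waveform)[0]
    return sequence
```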
System Architecture
Speech Tokenizer
The FireRedTTS-2 speech tokenizer operates at 12.5Hz, half the frame rate of most open-source tokenizers, which significantly reduces sequence length and computational cost for long dialogues. Semantic injection and explicit supervision stabilize text-to-token modeling and improve synthesis reliability. A pretrained Whisper encoder extracts semantic features, which are concatenated with acoustic features from a trainable encoder; the combined features are downsampled and discretized by a 16-layer residual vector quantizer (RVQ) with 2048 code entries per layer. The quantized features are then upsampled and passed to a Vocos-based acoustic decoder that reconstructs the waveform in both streaming and non-streaming modes.
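The following PyTorch sketch illustrates the residual vector quantization scheme described above, with 16 layers of 2048 codes each; the feature dimension and module interfaces are assumptions, and this is not the released implementation.

```python
import torch
import torch.nn as nn

class ResidualVQ(nn.Module):
    """Minimal residual vector quantizer sketch: 16 layers x 2048 codes,
    matching the configuration above (the 512-dim feature size is assumed)."""

    def __init__(self, dim: int = 512, num_layers: int = 16, codebook_size: int = 2048):
        super().__init__()
        self.codebooks = nn.ModuleList(
            nn.Embedding(codebook_size, dim) for _ in range(num_layers)
        )

    def forward(self, x: torch.Tensor):
        # x: (batch, frames, dim) downsampled encoder features.
        residual, quantized, codes = x, torch.zeros_like(x), []
        for codebook in self.codebooks:
            # Nearest-neighbor lookup against this layer's codebook.
            dists = torch.cdist(residual, codebook.weight[None].expand(x.size(0), -1, -1))
            idx = dists.argmin(dim=-1)               # (batch, frames)
            layer_q = codebook(idx)                  # (batch, frames, dim)
            quantized = quantized + layer_q
            residual = residual - layer_q            # next layer quantizes what's left
            codes.append(idx)
        return quantized, torch.stack(codes, dim=1)  # codes: (batch, 16, frames)
```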
Training is performed in two stages: initial non-streaming optimization on 500k hours of speech, followed by streaming adaptation on 60k hours of high-fidelity data. The final model supports real-time, high-fidelity streaming generation, making it suitable for interactive applications.
Dual-Transformer Text-to-Speech Model
FireRedTTS-2 models dialogue as an interleaved sequence of speaker-labeled text and corresponding speech tokens, concatenated in chronological order. The dual-transformer architecture consists of a large backbone transformer that predicts first-layer speech tokens and a smaller decoder transformer that generates subsequent token layers, conditioned on both the backbone’s hidden states and the predicted tokens. This design overcomes the limitations of delay-pattern multi-layer token modeling, providing full contextual access at each timestep and reducing first-packet latency.
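A minimal sketch of one generation timestep under this dual-transformer scheme is shown below; the backbone and decoder call signatures are hypothetical, and greedy decoding is used only for brevity.

```python
import torch

@torch.no_grad()
def generate_frame(backbone, decoder, context_embeds, num_layers: int = 16):
    """One-timestep sketch of the dual-transformer scheme described above.
    `backbone` and `decoder` are assumed to be causal transformers that
    return (hidden_state, logits); these interfaces are illustrative."""
    # The backbone consumes the interleaved context and predicts the
    # first-layer speech token for this frame.
    hidden, logits = backbone(context_embeds)        # hidden: (batch, dim)
    tokens = [logits.argmax(dim=-1)]                 # greedy, for simplicity

    # The smaller decoder fills in layers 2..num_layers, conditioned on the
    # backbone hidden state plus the tokens predicted so far at this frame.
    state = hidden
    for _ in range(num_layers - 1):
        state, layer_logits = decoder(state, tokens[-1])
        tokens.append(layer_logits.argmax(dim=-1))
    return torch.stack(tokens, dim=-1)               # (batch, num_layers)
```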
The model is trained with a composite loss that combines cross-entropy losses for both transformers with an auxiliary text loss added for stability. To improve efficiency, the decoder is optimized on only a subset of speech segments. Training follows a curriculum: large-scale monologue pretraining (1.1M hours), post-training on multi-speaker dialogue (300k hours), and supervised fine-tuning for speaker adaptation.
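The composite objective might be sketched as follows; the 0.1 weight on the auxiliary text loss is an assumption, as the weighting is not specified above.

```python
import torch.nn.functional as F

def composite_loss(backbone_logits, layer1_targets,
                   decoder_logits, rest_targets,
                   text_logits, text_targets,
                   text_weight: float = 0.1):
    """Sketch of the composite objective described above; the text-loss
    weight is an assumed value. Logits are (batch, time, vocab)."""
    # Cross-entropy on first-layer speech tokens (backbone transformer).
    loss_backbone = F.cross_entropy(backbone_logits.transpose(1, 2), layer1_targets)
    # Cross-entropy on the remaining token layers (decoder transformer),
    # computed only on the sampled subset of speech segments.
    loss_decoder = F.cross_entropy(decoder_logits.transpose(1, 2), rest_targets)
    # Auxiliary text-prediction loss for training stability.
    loss_text = F.cross_entropy(text_logits.transpose(1, 2), text_targets)
    return loss_backbone + loss_decoder + text_weight * loss_text
```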
Downstream Applications
Interactive Chat Integration
FireRedTTS-2 is natively compatible with interactive chat frameworks, supporting sentence-by-sentence generation and real-time streaming. Fine-tuning on a small corpus (15 hours) of emotional speech enables the model to infer and adjust emotion and prosody from implicit context, without explicit emotion labels or LLM modifications. This results in dynamic, contextually appropriate emotional responses, enhancing the naturalness of chatbot interactions.
Figure 2: Integration of FireRedTTS-2 into interactive chat scenarios.
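A minimal sketch of this integration pattern follows, assuming hypothetical `llm.stream` and `tts.synthesize_turn` interfaces: the LLM's reply is flushed to the TTS one complete sentence at a time, so audio playback can begin before the full response is generated.

```python
import re

def stream_chat_response(llm, tts, dialogue_history: str):
    """Sentence-by-sentence chat synthesis sketch. `llm.stream` yields
    incremental text chunks; `tts.synthesize_turn` yields audio chunks.
    Both interfaces are illustrative assumptions."""
    buffer = ""
    for chunk in llm.stream(dialogue_history):
        buffer += chunk
        # Flush every complete sentence to the TTS as soon as it appears.
        while (match := re.search(r"[.!?。！？]", buffer)):
            sentence, buffer = buffer[:match.end()], buffer[match.end():]
            # Passing the dialogue context lets the model infer emotion
            # and prosody implicitly, without explicit emotion labels.
            yield from tts.synthesize_turn(sentence, context=dialogue_history)
    if buffer.strip():  # trailing partial sentence, if any
        yield from tts.synthesize_turn(buffer, context=dialogue_history)
```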
Podcast Generation
For podcast synthesis, FireRedTTS-2 generates dialogue speech turn-by-turn, supporting flexible editing and post-processing. Zero-shot podcast generation is achieved by prompting with initial dialogue turns and generating subsequent turns sequentially. The system supports multi-speaker dialogues (up to 4 speakers, 3 minutes) and can be scaled to longer conversations with more speakers via corpus extension. Fine-tuning on 50 hours of podcast data yields stable synthesis, accurate speaker transitions, and prosody matching the hosts’ styles.
Figure 3: Zero-shot podcast generation of FireRedTTS-2.
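A sketch of the zero-shot generation loop described above; the turn format and the `tts.synthesize_turn` interface are illustrative assumptions.

```python
def generate_podcast(tts, prompt_turns, script):
    """Zero-shot podcast generation sketch: prompt text-speech turns fix
    each host's voice, then scripted turns are synthesized sequentially
    with the growing dialogue as context."""
    # Context starts from the prompt turns, e.g. [("S1", text, audio), ...].
    context = list(prompt_turns)
    audio_out = []
    for speaker, text in script:  # remaining turns, in chronological order
        audio = tts.synthesize_turn(speaker, text, context=context)
        audio_out.append(audio)
        # Feed the synthesized turn back so later turns keep coherent
        # prosody and stable speaker identities.
        context.append((speaker, text, audio))
    return audio_out
```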
Experimental Results
Speech Tokenizer Evaluation
On the LibriSpeech test-clean set, the FireRedTTS-2 tokenizer achieves the lowest WER (2.16%) among semantic-injection tokenizers, and ranks first or second in speaker similarity and speech quality metrics, despite operating at 12.5Hz. The large quantizer and Vocos-based decoder contribute to reduced quantization error and high-fidelity output. The model trails slightly on PESQ and UTMOS compared to systems trained on larger, language-matched corpora, but overall demonstrates robust performance for long-form streaming synthesis.
Voice Cloning
On the Seed-TTS-eval benchmark, FireRedTTS-2 achieves competitive CER (1.14%, Mandarin) and WER (1.95%, English), closely matching state-of-the-art monologue TTS systems. Speaker similarity is strong in Mandarin but slightly lower in English, attributed to differences in training data diversity and the absence of a dedicated timbre module. The model's expressive prosody may also come at a small cost in objective intelligibility scores.
Interactive Chat Evaluation
Fine-tuned FireRedTTS-2 achieves high emotion control accuracy (76.7–93.3%) across six emotions, demonstrating effective inference of emotional cues from context. This validates the model’s ability to deliver human-like, emotionally expressive chat responses without explicit emotion conditioning.
Podcast Generation
FireRedTTS-2 outperforms MoonCast, ZipVoice-Dialog, and MOSS-TTSD in zero-shot podcast generation, achieving the lowest WER/CER, highest speaker similarity, and lowest MCD on both Mandarin and English test sets. Subjective CMOS scores confirm its superior naturalness and context coherence. Fine-tuning further improves intelligibility (CER 1.66%) and naturalness, with the synthesized speech preferred over, or judged indistinguishable from, ground truth in 56% of cases.
Figure 4: Subjective preference results between FireRedTTS-2 fine-tuned on two podcast speakers and ground truth recordings. "Win": FireRedTTS-2 synthesis is more natural than ground truth dialogue speech; "Even": indistinguishable; "Fail": ground truth is more natural.
Implications and Future Directions
FireRedTTS-2 demonstrates that low-frame-rate, semantically enriched speech tokenization combined with dual-transformer modeling enables scalable, high-quality conversational speech synthesis. The system’s streaming capability and sentence-by-sentence generation are well-suited for real-time interactive applications and long-form content creation. The architecture supports efficient adaptation to new speakers and emotional styles with minimal data, facilitating rapid deployment in diverse scenarios.
Future research may focus on further reducing latency, improving cross-lingual speaker similarity, and extending the system to handle overlapping speech and more complex conversational structures. Integration with multimodal dialogue agents and end-to-end spoken chatbot frameworks is a promising direction, leveraging FireRedTTS-2’s context-aware prosody and flexible generation capabilities.
Conclusion
FireRedTTS-2 advances conversational TTS by combining a low-frame-rate, semantically rich speech tokenizer with a dual-transformer architecture operating on interleaved text–speech sequences. The system delivers stable, natural, and contextually coherent multi-speaker speech, supporting both interactive chat and podcast generation. Experimental results confirm its superiority over existing dialogue TTS systems in intelligibility, speaker-turn reliability, and perceived naturalness, with fine-tuned synthesis matching or surpassing human recordings. FireRedTTS-2 sets a new standard for scalable, real-time conversational speech generation and provides a robust foundation for future dialogue-centric AI applications.