CosyVoice2: Scalable Real-Time Multilingual TTS

Updated 4 July 2025
  • CosyVoice2 is a scalable, LLM-based text-to-speech system that integrates FSQ tokenization and chunk-aware causal modeling for unified streaming and non-streaming synthesis.
  • It employs innovative quantization techniques and transformer-based flow matching to achieve near-human speech quality with low latency and high speaker similarity.
  • The system supports instructed TTS with fine-grained control over roles, emotions, and dialects, enabling versatile applications in real-time voice chat and multilingual media production.

CosyVoice2 is a scalable, streaming-capable, LLM-based text-to-speech (TTS) system that advances the CosyVoice family by introducing architectural, quantization, and causal modeling innovations for high-fidelity, low-latency, multilingual speech synthesis. It is positioned as a unified solution bridging real-time, interactive, and high-quality TTS with flexibility in both instructed (role, emotion, dialect) and zero-shot synthesis workflows.

1. System Architecture and Key Innovations

CosyVoice2 rearchitects its predecessor by decoupling the semantic and acoustic modeling stages and enabling unified streaming and non-streaming synthesis in a single model. The system comprises the following principal components:

  • Speech Tokenizer with Finite Scalar Quantization (FSQ): Instead of conventional vector quantization (VQ), CosyVoice2 employs FSQ, which projects encoder outputs into a low-dimensional space, independently quantizes each scalar into discrete bins, and reconstitutes the quantized representation for speech token generation. This results in full codebook coverage (e.g., 6,561/6,561 tokens used), maximizing token diversity and improving representational richness compared to the 23% utilization rate for VQ (963/4,096 codes) in prior approaches (a minimal code sketch of the FSQ steps follows this list).

$\bar{H} = \mathrm{ROUND}(\mathrm{Proj}_{down}(H))$

$\hat{H} = \mathrm{Proj}_{up}(\bar{H})$

$\mu_i = \sum_{j=0}^{D-1} \bar{h}_{i,j}\,(2K+1)^j$

where $H$ is the intermediate encoder representation, $D$ is the dimensionality of the down-projected space, each quantized value is rounded into $\{-K, \dots, K\}$, and $\mu_i$ is the resulting speech token index, giving $(2K+1)^D$ possible codes (6,561 for $D = 8$, $K = 1$).

  • LLM Backbone: CosyVoice2 removes the explicit speaker embedding and text encoder of its predecessor, allowing a pre-trained LLM (such as Qwen2.5-0.5B) to be adopted directly as the text-to-speech language model. This simplification improves scalability, leverages the LLM's native language understanding, and enables direct prediction of mixed sequences of text and speech tokens.
  • Chunk-Aware Causal Flow Matching (CFM): For waveform generation, CosyVoice2 introduces a chunk-aware causal flow matching module. This design supports multiple context (masking) modes and enables both causal (streaming) and offline (non-streaming) synthesis through mask-based transformer blocks, upsampling, and future-context lookahead convolutions.
  • Instructed TTS and Unified Control: Tags and instruction-style natural language prompts can be embedded at any position in the input sequence, allowing fine-grained control over speaker identity, emotion, dialect, vocal style, and nonverbal cues (laughter, breath).
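
As a concrete illustration of the FSQ steps above, the following is a minimal sketch rather than the authors' implementation: the encoder dimension and random projection matrices are stand-ins, and K = 1 with D = 8 is chosen so that (2K+1)^D = 6,561 matches the reported codebook size; token indices are shifted by +K so they are non-negative.

```python
import numpy as np

# Illustrative FSQ sketch. ENC_DIM and the random projections are stand-ins;
# D = 8 and K = 1 are chosen so that (2K+1)**D = 3**8 = 6561 matches the
# reported codebook size.
ENC_DIM, D, K = 512, 8, 1

rng = np.random.default_rng(0)
proj_down = rng.standard_normal((ENC_DIM, D)) / np.sqrt(ENC_DIM)  # Proj_down
proj_up = rng.standard_normal((D, ENC_DIM)) / np.sqrt(D)          # Proj_up

def fsq(H: np.ndarray):
    """Finite scalar quantization of encoder outputs H with shape (T, ENC_DIM)."""
    # Project into the low-dimensional space, then round each scalar into one
    # of the 2K+1 integer bins {-K, ..., K}.
    H_bar = np.clip(np.round(H @ proj_down), -K, K)   # \bar{H}
    # Reconstitute the quantized representation for later encoder modules.
    H_hat = H_bar @ proj_up                           # \hat{H}
    # Token index per frame: shift bins by +K so digits are non-negative, then
    # read them as a base-(2K+1) number, mirroring mu_i in the formula above.
    digits = (H_bar + K).astype(int)
    mu = (digits * (2 * K + 1) ** np.arange(D)).sum(axis=1)
    return H_hat, mu

H = rng.standard_normal((50, ENC_DIM))     # 50 frames of encoder output
H_hat, tokens = fsq(H)
print(tokens.min(), tokens.max())          # all ids fall in [0, 6560]
```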

2. Performance and Quality Metrics

CosyVoice2 attains “human-parity” speech synthesis with low latency and high speaker similarity. Objective results, as evaluated on large multilingual datasets, are as follows:

Model                   WER (%)   NMOS   Speaker Sim (SS)
Human                   2.66      3.84   0.697
CosyVoice               2.89      3.93   0.743
CosyVoice2              2.47      3.96   0.745
CosyVoice2-S (stream)   2.45      3.90   0.751
OpenVoice               3.47      3.87   0.299

  • WER (Word Error Rate): Reflects content consistency; CosyVoice2 matches or surpasses leading systems in both streaming and non-streaming scenarios.
  • NMOS (Non-intrusive MOS): Approaches or exceeds human-level quality.
  • Speaker Similarity (SS): Close to or above human ground truth, with robust speaker preservation across languages.

Low first-package latency $L_{TTS}$ enables real-time streaming:

$L_{TTS} = M \cdot d_{lm} + M \cdot d_{fm} + M \cdot d_{voc}$

where $M$ is the number of speech tokens generated before the first audio package can be emitted, and $d_{lm}$, $d_{fm}$, $d_{voc}$ are the per-token processing costs of the language model, flow-matching module, and vocoder, respectively.
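
A trivial arithmetic sketch of this bound follows; the chunk size and per-stage timings are hypothetical numbers for illustration only, not measurements from the paper.

```python
def first_package_latency(M: int, d_lm: float, d_fm: float, d_voc: float) -> float:
    """L_TTS = M*d_lm + M*d_fm + M*d_voc: time until the first audio package."""
    return M * (d_lm + d_fm + d_voc)

# Hypothetical per-token costs in seconds, chosen only to show the arithmetic.
print(first_package_latency(M=15, d_lm=0.008, d_fm=0.004, d_voc=0.002))  # 0.21
```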

Streaming mode incurs only modest degradation: for instance, on test-ja, CosyVoice2-S achieves CER/SS/NMOS of 21.41/0.629/3.35, versus 18.79/0.630/3.42 for the non-streaming model.

3. Multilingual and Streaming Capabilities

CosyVoice2 expands the language coverage of the CosyVoice line by training its tokenizer on 200,000 hours of Chinese and English, and subsequently including Japanese (4,600 hours) and Korean (2,200 hours) in fine-tuning and evaluation. It demonstrates:

  • Zero-shot capabilities in Japanese and Korean, attributed to overlapping character sets and language-universal feature encoding in the FSQ tokenizer.
  • Support for multiple Chinese dialects (Sichuan, Shanghai, Zhengzhou, etc.) as well as role, emotion, and prosodic instruction control in a single deployed model.
  • Streaming synthesis via chunk-based CFM and interleaved sequence construction, providing seamless low-latency interactive experiences for applications such as LLM-based voice chat.
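
To make the interleaved sequence construction concrete, here is a minimal sketch; the 5:15 text-to-speech token ratio and the filling-token name are illustrative assumptions rather than the paper's exact recipe.

```python
from typing import List

FILL = "<fill>"   # hypothetical marker for "text stream exhausted"

def interleave(text_tokens: List[str], speech_tokens: List[str],
               n_text: int = 5, n_speech: int = 15) -> List[str]:
    """Alternate blocks of N text tokens and M speech tokens for streaming LM
    training; once the text runs out, emit the filling marker and then the
    remaining speech tokens."""
    seq: List[str] = []
    ti = si = 0
    while ti < len(text_tokens):
        seq += text_tokens[ti:ti + n_text];     ti += n_text
        seq += speech_tokens[si:si + n_speech]; si += n_speech
    seq.append(FILL)
    seq += speech_tokens[si:]
    return seq

# Toy usage: 12 text tokens and 40 speech tokens.
txt = [f"t{i}" for i in range(12)]
sph = [f"s{i}" for i in range(40)]
print(interleave(txt, sph)[:8])   # ['t0', ..., 't4', 's0', 's1', 's2']
```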

The chunk-aware CFM module is trained to operate under various attention masking strategies (non-causal, full-causal, chunk-M, chunk-2M), enabling dynamic adjustment between lowest latency and highest fidelity, with training-time mask sampling fostering self-distillation from more “look-ahead” to less “look-ahead” contexts.
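
A minimal sketch of the four masking strategies is given below; the chunk size and sequence length are illustrative, and the mask semantics are an interpretation of the paper's description rather than its code.

```python
from typing import Optional
import numpy as np

def chunk_mask(T: int, chunk: Optional[int], causal_only: bool = False) -> np.ndarray:
    """Boolean (T, T) attention mask where True means frame i may attend to frame j.

    chunk=None, causal_only=False -> non-causal (full) mask
    chunk=None, causal_only=True  -> fully causal mask
    chunk=M or 2M                 -> causal across chunks, full within a chunk,
                                     i.e. bounded look-ahead to the chunk end.
    """
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    if chunk is None:
        return (j <= i) if causal_only else np.ones((T, T), dtype=bool)
    chunk_end = (i // chunk + 1) * chunk   # exclusive end of frame i's chunk
    return j < chunk_end

T, M = 16, 4
for name, m in [("non-causal", chunk_mask(T, None)),
                ("full-causal", chunk_mask(T, None, causal_only=True)),
                ("chunk-M", chunk_mask(T, M)),
                ("chunk-2M", chunk_mask(T, 2 * M))]:
    print(name, int(m.sum()))   # more look-ahead => more attendable positions
```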

4. Technical Design and Implementation Details

  • Tokenization uses FSQ with rounding and bin-index conversion; codebook utilization and t-SNE visualization confirm rich, disentangled representations.
  • Unified LM input alternates between full and chunked (interleaved text and speech tokens) formats, allowing the same architecture to serve both streaming and non-streaming calls without retraining.
  • Conditional flow-matching for waveform generation leverages a continuous ODE:

$\frac{d}{dt} \phi_t(X) = v_t(\phi_t(X), t)$

where $v_t$ is parameterized as a neural network and conditioned on the upsampled speech token sequence and context.

  • Classifier-free guidance, random masking, and time-step schedules are used for improved diversity and denoising. The final waveform is generated via a HiFi-GAN vocoder.
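
As an illustration of how a conditional flow-matching ODE of this form is typically integrated at inference time, here is a minimal Euler sampler with classifier-free guidance; the velocity-network interface, step count, guidance scale, and feature shapes are assumptions, not the paper's implementation (which additionally uses random masking and specific time-step schedules).

```python
import torch

@torch.no_grad()
def sample_cfm(v_net, cond, steps: int = 10, cfg_scale: float = 0.7,
               mel_bins: int = 80, frames: int = 200) -> torch.Tensor:
    """Integrate d/dt phi_t(X) = v_t(phi_t(X), t) from noise (t=0) toward mel
    features (t=1) with a plain Euler scheme and classifier-free guidance.

    v_net(x, t, cond) is a hypothetical velocity-field predictor; cond=None
    stands in for the unconditional branch used by CFG.
    """
    x = torch.randn(1, mel_bins, frames)                 # phi_0 ~ N(0, I)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((1,), i * dt)
        v_cond = v_net(x, t, cond)                       # conditioned velocity
        v_uncond = v_net(x, t, None)                     # unconditioned velocity
        v = (1 + cfg_scale) * v_cond - cfg_scale * v_uncond   # CFG mixture
        x = x + dt * v                                   # Euler step along the ODE
    return x                                             # mel features for the vocoder

# Toy usage with a dummy velocity net, just to show the calling convention.
dummy_v = lambda x, t, c: -x        # hypothetical stand-in, not a real model
print(sample_cfm(dummy_v, cond=None).shape)   # torch.Size([1, 80, 200])
```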

5. Applications and Practical Implications

CosyVoice2’s design enables practical deployment for:

  • Real-time, low-latency, and high-quality voice chat systems (e.g., LLM-based assistants) on commodity hardware.
  • Multilingual, emotional, and role-specific TTS for accessibility, dubbing, audiobooks, and media production, using a unified model architecture.
  • Customization and control for advanced TTS developers, including support for paralinguistic elements (emotions, laughter, etc.) embedded within natural language or markup.
  • Foundation speech models for downstream adaptation, as the modular design supports both fine-tuning and instruction-based specialization.

6. Limitations and Future Directions

While CosyVoice2 provides substantial improvements over earlier systems, several specific challenges remain:

  • Limited cross-lingual robustness for overlapping writing systems: Degradation is noted for languages whose character sets substantially overlap (e.g., Chinese and Japanese), suggesting future work on linguistic context augmentation and data balancing.
  • Acoustic control via natural language: Explicit control over timbre and more nuanced style via plain text is not yet realized.
  • Singing synthesis is not addressed: Application to melodic or music-related TTS remains an open domain.

A future version may integrate multilingual text frontends, move toward fully end-to-end modeling that eliminates cascaded components, and apply further model compression for even broader deployment.

7. Position Within the CosyVoice Family and Benchmarking

CosyVoice2 is a direct improvement over CosyVoice (arXiv:2407.05407), specifically introducing:

  • FSQ for 100% codebook utilization,
  • unified streaming/non-streaming synthesis,
  • simplified LLM integration without explicit speaker/text encoders,
  • chunk-aware CFM for causal streaming and knowledge distillation across context windows.

CosyVoice2’s empirical results benchmark favorably against contemporary open-source and commercial TTS models (e.g., ChatTTS, OpenVoice, GPT-SoVITS), with low error rates, high speaker similarity, and low latency in both batch and streaming setups. As a foundation model, CosyVoice2 underpins downstream systems, such as JoyTTS (an LLM-based chatbot), for high-performance voice-driven conversational AI.


In summary, CosyVoice2 delivers a unified, low-latency, scalable, controllable, and high-quality text-to-speech architecture, positioning it as a foundation model for interactive TTS applications with broad multilingual, prosodic, and speaker-cloning coverage.
