CosyVoice 2 TTS Synthesis
- CosyVoice 2 is a multilingual hybrid TTS system that employs LLM-based semantic token generation and Finite Scalar Quantization for full (100%) codebook utilization.
- It features a two-stage design with a text-to-semantic module and a flow-matching acoustic synthesizer, achieving high-fidelity, low-latency streaming and offline synthesis.
- The model supports zero-shot speaker adaptation via external embeddings and attains competitive objective and subjective metrics for scalable on-device applications.
CosyVoice 2 is a large-scale, multilingual, hybrid text-to-speech (TTS) synthesis model that achieves human-parity naturalness and minimal response latency, supporting both streaming and offline synthesis within a unified architecture. Developed as a successor to CosyVoice, it provides the neural TTS backbone for high-fidelity synthetic speech datasets such as SYNTTS-COMMANDS, which target on-device keyword spotting (KWS) and other TinyML applications. CosyVoice 2 systematically advances core TTS modules with Finite Scalar Quantization (FSQ) for 100% codebook utilization, an LLM backbone for semantic token generation, and a chunk-aware causal flow-matching network for acoustic modeling, enabling effective, scalable, and privacy-preserving speech synthesis across English, Chinese, and additional languages (Du et al., 13 Dec 2024, Gan et al., 11 Nov 2025).
1. System Architecture and Design
CosyVoice 2 adopts a two-stage hybrid TTS design:
- Text-to-Semantic Module: Utilizes a large transformer-style LLM (e.g., Qwen2.5–0.5B) to autoregressively generate sequences of discrete “semantic” speech tokens from input text. In streaming mode, it emits tokens incrementally, facilitating low-latency synthesis.
- Flow-Matching Acoustic Synthesizer: Converts semantic tokens into mel-spectrogram frames via a chunk-aware causal flow-matching model (CFM). This module up-samples speech tokens (matching a 50 Hz frame rate), applies a look-ahead convolution for contextualization, and employs stacked U-Net–like transformer blocks to model the optimal transport flow between Gaussian latent variables and true Mel distributions.
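A minimal PyTorch sketch of the token up-sampling and look-ahead convolution step is given below; the model width, up-sampling factor, and look-ahead length are illustrative assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LookAheadUpsampler(nn.Module):
    """Up-samples semantic-token states to the mel frame rate and applies a
    right-padded ("look-ahead") 1-D convolution that injects a small amount
    of future context. All hyperparameters here are illustrative."""

    def __init__(self, d_model=512, upsample_factor=2, lookahead=3):
        super().__init__()
        self.upsample_factor = upsample_factor
        self.lookahead = lookahead
        # Kernel spans the current frame plus `lookahead` future frames.
        self.conv = nn.Conv1d(d_model, d_model, kernel_size=lookahead + 1)

    def forward(self, token_states):              # (batch, T_tokens, d_model)
        x = token_states.transpose(1, 2)           # (batch, d_model, T_tokens)
        # Nearest-neighbour repetition up to the mel frame rate.
        x = torch.repeat_interleave(x, self.upsample_factor, dim=2)
        # Pad only on the right so each output frame sees a few future frames.
        x = F.pad(x, (0, self.lookahead))
        return self.conv(x).transpose(1, 2)        # (batch, T_frames, d_model)

frames = LookAheadUpsampler()(torch.randn(1, 25, 512))
print(frames.shape)  # torch.Size([1, 50, 512])
```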
The final speech waveform is obtained via a HiFi-GAN-style neural vocoder operating on the generated mel-spectrograms. Distinctively, CosyVoice 2 replaces classical vector quantization (VQ) with Finite Scalar Quantization (FSQ), ensuring perfect codebook utilization and improved detail preservation, overcoming the token collapse commonly observed with prior VQ schemes (Gan et al., 11 Nov 2025, Du et al., 13 Dec 2024).
Speaker identity is introduced via fixed-dimension (typically 256d) embeddings, derived from external speaker recognition networks. These embeddings are integrated at multiple locations—using FiLM-style affine transformations—within both the semantic token generator and the flow-matching acoustic modules, allowing fine-grained control over timbre and speaker style. The architecture supports flexible zero-shot speaker adaptation without explicit re-training.
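The FiLM-style injection of speaker embeddings can be illustrated with a short sketch; the 256-dimensional embedding matches the description above, while the hidden width and the placement of the module are assumptions.

```python
import torch
import torch.nn as nn

class SpeakerFiLM(nn.Module):
    """FiLM-style conditioning: a speaker embedding is projected to a
    per-channel scale (gamma) and shift (beta) applied to hidden states.
    Dimensions are illustrative, not the published configuration."""

    def __init__(self, spk_dim=256, d_model=512):
        super().__init__()
        self.to_scale_shift = nn.Linear(spk_dim, 2 * d_model)

    def forward(self, hidden, spk_emb):   # hidden: (B, T, d_model), spk_emb: (B, spk_dim)
        gamma, beta = self.to_scale_shift(spk_emb).chunk(2, dim=-1)
        # Broadcast the affine transform over the time axis.
        return hidden * (1.0 + gamma.unsqueeze(1)) + beta.unsqueeze(1)

film = SpeakerFiLM()
out = film(torch.randn(2, 100, 512), torch.randn(2, 256))
print(out.shape)  # torch.Size([2, 100, 512])
```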
2. Finite Scalar Quantization for Tokenization
FSQ constitutes a central innovation in CosyVoice 2’s speech tokenization pipeline. Speech features are encoded using a supervised ASR encoder, after which linear projections and scalar rounding are applied to form discrete vector representations:

$$\bar{H} = \mathrm{ROUND}\big(\mathrm{Proj}_{\downarrow}(H)\big), \qquad \hat{H} = \mathrm{Proj}_{\uparrow}(\bar{H}),$$

where each of the $D$ projected dimensions is rounded into the bounded integer range $[-K, K]$. This quantization yields a large, perfectly-utilized token codebook of size $(2K+1)^{D}$, contrasting with the sub-25% utilization of legacy VQ approaches.
Each quantized row $\bar{h}_i$ is mapped to a single integer speech token by enumerative encoding:

$$\mu_i = \sum_{j=0}^{D-1} \big(\bar{h}_{i,j} + K\big)\,(2K+1)^{j}.$$
FSQ is enabled by the straight-through estimator and trained end-to-end without a separate codebook commitment loss; gradients backpropagate through the quantization operation in the ASR task. This approach yields maximal efficiency and fidelity in token-based speech representations, crucial for high-quality, controllable synthesis (Du et al., 13 Dec 2024, Gan et al., 11 Nov 2025).
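A minimal sketch of the FSQ operation follows, assuming $D = 8$ and $K = 1$ (a 6561-token codebook) purely for illustration; the encoder width and the tanh-based bounding are likewise assumptions.

```python
import torch
import torch.nn as nn

class FSQ(nn.Module):
    """Finite Scalar Quantization sketch: project encoder features to a
    low-dimensional space, round each coordinate into [-K, K] with a
    straight-through estimator, and enumerate the result as one integer
    token. D and K here are illustrative choices."""

    def __init__(self, enc_dim=512, D=8, K=1):
        super().__init__()
        self.D, self.K = D, K
        self.proj_down = nn.Linear(enc_dim, D)
        self.proj_up = nn.Linear(D, enc_dim)

    def forward(self, h):                          # (B, T, enc_dim)
        z = torch.tanh(self.proj_down(h)) * self.K  # bound to [-K, K]
        z_q = torch.round(z)
        z_q = z + (z_q - z).detach()                # straight-through estimator
        # Enumerative encoding: mu = sum_j (z_q_j + K) * (2K+1)^j
        base = (2 * self.K + 1) ** torch.arange(self.D)
        tokens = ((z_q + self.K).long() * base).sum(dim=-1)
        return self.proj_up(z_q), tokens            # reconstructed features, token ids

fsq = FSQ()
feats, toks = fsq(torch.randn(1, 50, 512))
print(toks.shape, toks.max().item() < 3 ** 8)       # codebook size (2K+1)^D = 6561
```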
3. Training Regimen and Multilingual Data
Both the FSQ tokenizer and core TTS components are pretrained on extensive multilingual, multi-speaker datasets. The FSQ tokenizer is trained on approximately 200,000 hours of supervised ASR data (111k hours Chinese, 100k hours English), while the TTS model is optimized on 167,000 hours of TTS-style data, spanning 130k hours Chinese, 30k hours English, as well as Japanese and Korean resources. An additional 1.5k hours provide fine-grained prosody, emotion, and dialect supervision.
CosyVoice 2’s training loss is a composite objective comprising:
- a spectrogram reconstruction term (L1/L2 distances between predicted and reference mel-spectrograms);
- an FSQ codebook alignment loss (applied when FSQ is used in non-streaming TTS and dataset generation);
- multi-scale adversarial losses on mel-spectrograms.
All modules are optimized end-to-end using AdamW, with batch sizes of 32 and audio sequences up to 10 seconds. Language balancing is performed by sampling each language in proportion to its command frequency in the data, with no explicit loss re-weighting required. The resulting models generalize to unseen languages in zero-shot settings and support joint as well as language-specific synthesis front-ends (e.g., ARPAbet for English, pinyin with tone for Chinese) (Du et al., 13 Dec 2024, Gan et al., 11 Nov 2025).
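The composite objective can be sketched as follows; the loss weighting, the discriminator interface, and the batch structure are assumptions made for illustration, not the published recipe.

```python
import torch
import torch.nn.functional as F

def tts_training_step(model, discriminators, batch, optimizer):
    """One illustrative optimization step combining L1/L2 mel reconstruction
    with multi-scale adversarial terms; the 0.1 weight and the batch keys
    are assumed, not the published values."""
    mel_pred = model(batch["text_tokens"], batch["speaker_emb"])
    mel_true = batch["mel"]

    # Spectrogram reconstruction (L1 + L2).
    rec = F.l1_loss(mel_pred, mel_true) + F.mse_loss(mel_pred, mel_true)

    # Multi-scale adversarial loss on mels (least-squares GAN form assumed).
    adv = sum(((d(mel_pred) - 1.0) ** 2).mean() for d in discriminators)

    loss = rec + 0.1 * adv
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Optimizer setup as described above (learning rate assumed):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
```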
4. Speaker Embeddings and Zero-Shot Conditioning
CosyVoice 2 enables wide timbral and inter-speaker variability through external speaker embeddings drawn from publicly available corpora, such as VoxCeleb1/2 and the Free ST Chinese Mandarin Corpus. Embeddings are extracted from best representative utterances using pretrained ResNet-based speaker encoders and injected at inference time via FiLM mechanisms across the synthesis stack.
This protocol allows the system to render speech conditioned on arbitrary speaker styles in a zero-shot manner, eliminating the need for TTS re-training for new speakers. For dataset generation in SYNTTS-COMMANDS, embeddings are randomly sampled, ensuring rich speaker diversity and faithful simulation of inter-speaker acoustic variance (Gan et al., 11 Nov 2025).
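A minimal sketch of this zero-shot protocol as used for dataset generation; the file layout, the embedding bank, and the `tts.synthesize` interface are hypothetical.

```python
import random
import numpy as np

def load_speaker_bank(path="speaker_embeddings.npy"):
    """Load a (num_speakers, 256) matrix of pre-extracted embeddings,
    e.g. produced by a ResNet speaker encoder over VoxCeleb utterances.
    The file name and layout are hypothetical."""
    return np.load(path)

def synthesize_command(tts, text, speaker_bank, rng=random):
    """Render one command with a randomly sampled speaker identity.
    `tts` is assumed to expose a synthesize(text, speaker_emb) method."""
    spk_emb = speaker_bank[rng.randrange(len(speaker_bank))]
    return tts.synthesize(text, speaker_emb=spk_emb)
```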
5. Synthesis Pipeline and Practical Applications
CosyVoice 2 underpins large-scale, synthetic dataset generation for applications such as on-device keyword spotting (KWS) and TinyML. The pipeline, as instantiated in SYNTTS-COMMANDS, is:
- Text normalization (see the front-end sketch after this list):
- English: Lowercasing, numeral expansion, lexicon lookup, G2P to ARPAbet.
- Chinese: UTF8 normalization, punctuation removal, word segmentation (Jieba), character-to-pinyin with tone marks.
- Inference:
- LLM-based semantic decoding outputs phone-level attributes (duration, pitch, energy).
- Flow-matching acoustic synthesizer generates 80-bin mels (12.5 ms hop, 50 ms window).
- HiFi-GAN-like vocoder reconstructs waveform at 24 kHz.
- Post-processing includes de-essing, highpass/peak normalization.
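The text front-end can be sketched with standard Python tooling, assuming the jieba and pypinyin packages for Chinese and a hypothetical ARPAbet lexicon for English; numeral expansion and full G2P are omitted for brevity.

```python
import re
import jieba                          # Chinese word segmentation
from pypinyin import pinyin, Style    # character-to-pinyin conversion

def normalize_english(text, g2p_lexicon):
    """Lowercase, strip punctuation, and map words to ARPAbet via a
    lexicon lookup; `g2p_lexicon` is a hypothetical dict such as
    {"stop": ["S", "T", "AA1", "P"]}."""
    words = re.sub(r"[^\w\s]", "", text.lower()).split()
    return [phone for w in words for phone in g2p_lexicon.get(w, [w])]

def normalize_chinese(text):
    """Strip punctuation, segment with Jieba, and convert to pinyin
    with tone marks."""
    text = re.sub(r"[^\u4e00-\u9fff]", "", text)
    words = jieba.lcut(text)
    return [syl[0] for w in words for syl in pinyin(w, style=Style.TONE)]

print(normalize_english("Stop!", {"stop": ["S", "T", "AA1", "P"]}))
print(normalize_chinese("打开灯"))   # e.g. ['dǎ', 'kāi', 'dēng']
```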
This synthetic pipeline enables the construction of massive, richly labeled multilingual datasets (e.g., 384,000+ utterances in SYNTTS-COMMANDS) while bypassing the costs, privacy concerns, and latency of human recordings. Empirically, KWS models trained on CosyVoice 2–synthesized data achieve up to 99.5% accuracy (English, CRNN) and 97.9% (Chinese, EfficientNet/MobileNet-V1), establishing that synthetic speech can replace human data for downstream on-device classifiers (Gan et al., 11 Nov 2025).
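For context, a compact CRNN keyword-spotting classifier of the kind trained on such synthetic data might look as follows; the layer sizes and the 35-class output are illustrative assumptions, not the configurations evaluated in the cited work.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    """Illustrative CRNN keyword-spotting classifier operating on 80-bin
    mel-spectrograms such as those produced by the pipeline above; all
    sizes are assumptions."""

    def __init__(self, n_mels=80, n_classes=35):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
        )
        self.gru = nn.GRU(32 * (n_mels // 4), 64, batch_first=True)
        self.head = nn.Linear(64, n_classes)

    def forward(self, mel):                      # (B, T, n_mels)
        x = self.conv(mel.unsqueeze(1))          # (B, 32, T/4, n_mels/4)
        x = x.permute(0, 2, 1, 3).flatten(2)     # (B, T/4, 32 * n_mels/4)
        _, h = self.gru(x)
        return self.head(h[-1])                  # keyword logits

logits = TinyCRNN()(torch.randn(4, 100, 80))
print(logits.shape)  # torch.Size([4, 35])
```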
6. Streaming Capabilities and Latency
CosyVoice 2 integrates a chunk-aware causal processing design. In streaming mode, the text-token LM and CFM process tokens and acoustic frames incrementally, using causal or chunk-based attention masks that enable both ultra-low-latency and full-context synthesis within a single network. This design allows for:
- First-package TTS latency (with the input text already available): $L_{\mathrm{TTS}} = M\,(d_{\mathrm{lm}} + d_{\mathrm{fm}} + d_{\mathrm{voc}})$, where $M$ is the number of speech tokens in the first synthesis chunk and $d_{\mathrm{lm}}$, $d_{\mathrm{fm}}$, and $d_{\mathrm{voc}}$ are the per-token costs of the token LM, the flow-matching model, and the vocoder, respectively.
- Real-time factors consistently below 1, supporting interactive applications.
- Near-lossless streaming synthesis quality: e.g., WER 2.45%, NMOS 3.90, SS 0.751 on LibriSpeech test-clean—essentially parity with offline mode (Du et al., 13 Dec 2024).
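The chunk-aware masking idea can be illustrated with a short sketch that builds a block-causal attention mask; the chunk size is an illustrative choice, and a sufficiently large chunk recovers full-context (offline) attention.

```python
import torch

def chunk_causal_mask(seq_len, chunk_size):
    """Boolean attention mask where position i may attend to every position
    in its own chunk and all earlier chunks (True = attend). With
    chunk_size == 1 this reduces to a strictly causal mask; a chunk size
    covering the whole sequence recovers full-context attention."""
    idx = torch.arange(seq_len)
    chunk_id = idx // chunk_size
    return chunk_id.unsqueeze(1) >= chunk_id.unsqueeze(0)

print(chunk_causal_mask(6, 2).int())
# tensor([[1, 1, 0, 0, 0, 0],
#         [1, 1, 0, 0, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 0, 0],
#         [1, 1, 1, 1, 1, 1],
#         [1, 1, 1, 1, 1, 1]])
```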
These characteristics make CosyVoice 2 particularly suitable for interactive conversational systems and resource-constrained edge deployments.
7. Evaluation Metrics and Comparative Results
CosyVoice 2 has been evaluated on standard benchmarks using objective (WER, CER, speaker similarity) and subjective (naturalness MOS, NMOS) measures:
| Model | LibriSpeech WER (%) | NMOS | Speaker Similarity (SS) |
|---|---|---|---|
| Human | 2.66 | 3.84 | 0.697 |
| CosyVoice 2 | 2.47 | 3.96 | 0.745 |
| CosyVoice 2-S | 2.45 | 3.90 | 0.751 |
On SEED (Chinese/English):
- CER (zh): 1.45% (SS 0.812 offline/0.806 streaming)
- WER (en): 2.57% (SS 0.743 streaming/0.736 offline)
Downstream KWS models trained on CosyVoice 2–generated data reach or exceed the accuracy of human-recorded data on English and Chinese commands (up to 99.5%/97.9% for best configurations) (Gan et al., 11 Nov 2025, Du et al., 13 Dec 2024).
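A minimal sketch of how such objective metrics are commonly computed, assuming the jiwer package for WER/CER and cosine similarity between speaker embeddings for SS; the embedding inputs are placeholders.

```python
import numpy as np
from jiwer import wer, cer          # word / character error rate

def objective_metrics(ref_texts, hyp_texts, ref_emb, syn_emb):
    """WER/CER between reference transcripts and ASR output on synthesized
    audio, plus speaker similarity (SS) as cosine similarity between
    speaker embeddings of the prompt and the synthesized speech."""
    ss = float(np.dot(ref_emb, syn_emb) /
               (np.linalg.norm(ref_emb) * np.linalg.norm(syn_emb)))
    return {"wer": wer(ref_texts, hyp_texts),
            "cer": cer(ref_texts, hyp_texts),
            "ss": ss}

print(objective_metrics(["turn on the light"], ["turn on the lights"],
                        np.ones(256), np.ones(256)))
```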
8. Multilingual Generalization and Extensions
CosyVoice 2’s architecture, loss, and data regime generalize across languages. No language-specific acoustic modules are required—switching front-ends and lexicons suffices for synthesis in ARPAbet-encoded English or pinyin-toned Chinese. The joint multilingual training leads to shared internal representations of prosody and phonetic attributes across languages. For new command sets, a brief 10k-step adaptation on mixed-language data ensures cross-lingual consistency in prosody and style (Gan et al., 11 Nov 2025).
A plausible implication is that extending the system to further languages should require only corpus expansion and minor front-end modifications, leveraging the architecture’s inherent language-agnosticism.
CosyVoice 2 represents a convergence of modern LLM-based semantic modeling, robust discrete tokenization (via FSQ), causal flow-matching synthesis, practical zero-shot speaker conditioning, and seamless streaming deployment. It serves as both a state-of-the-art academic baseline and an industrially relevant engine for private, scalable, and high-quality speech synthesis supporting KWS, dialogue, and interactive AI applications (Du et al., 13 Dec 2024, Gan et al., 11 Nov 2025).