CosyVoice 2: Multilingual Low-Latency TTS
- CosyVoice 2 is a multilingual text-to-speech system that uses finite-scalar quantization and chunk-aware causal flow matching to enable scalable, low-latency synthesis.
- The system follows a four-stage pipeline—from a supervised semantic speech tokenizer to a unified LM, flow matching module, and HiFi-GAN vocoder—to generate natural audio.
- It achieves human-parity naturalness with robust speaker similarity by training on hundreds of thousands of hours of diverse multilingual audio.
CosyVoice 2 is a multilingual speech synthesis system designed for scalable, low-latency, streaming text-to-speech (TTS), leveraging LLMs and generative flow modeling. It introduces architectural and algorithmic advances, including finite-scalar quantization and chunk-aware causal flow matching, to support both streaming and non-streaming synthesis with minimal latency and virtually no quality degradation relative to offline operation. The model is trained on hundreds of thousands of hours of diverse multilingual audio and achieves human-parity naturalness, robust speaker similarity, and high content fidelity across languages in both interactive and batch-processing scenarios.
1. System Architecture
CosyVoice 2 comprises a four-stage pipeline, systematically optimized for streaming deployment:
Stage 1: Supervised Semantic Speech Tokenizer
- Inputs: speech features, supervised by BPE-tokenized multilingual text transcripts through the ASR decoder.
- Encoder₁: 6 Transformer blocks with rotary embeddings at 25 Hz.
- Finite-Scalar Quantization (FSQ): Projects intermediate features to a low-rank space and quantizes each scalar to an integer in $[-K, K]$, achieving 100% codebook utilization.
- Encoder₂ + ASR Decoder (SenseVoice-Large): Optimizes posterior text-token loss.
Stage 2: Unified Text-Speech LLM (LM)
- Backbone: Pre-trained Qwen2.5-0.5B, decoder-only, with text encoder & speaker embedding modules removed from the original CosyVoice.
- The LM is trained to autoregressively predict supervised semantic speech tokens.
- Supports two sequence-construction schemes (non-streaming: sequential; streaming: interleaved chunks of text and speech tokens), as sketched below.
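A minimal sketch of the two sequence-construction schemes, assuming illustrative special-token names ([S], [T], [E], FILL, as used elsewhere in this summary) and pre-tokenized inputs; the helpers below are a plausible construction, not the repository's actual data-pipeline code.

```python
SOS, TOS, EOS, FILL = "[S]", "[T]", "[E]", "[FILL]"  # illustrative special tokens

def build_nonstreaming_sequence(text_tokens, speech_tokens):
    # [S] text ... [T] speech ... [E]; the LM loss is computed on the speech tokens and [E].
    return [SOS, *text_tokens, TOS, *speech_tokens, EOS]

def build_streaming_sequence(text_tokens, speech_tokens, n, m):
    # Interleave chunks of n text tokens with chunks of m speech tokens; once the
    # text is exhausted, append [T], the remaining speech tokens, and [E].
    seq, ti, si = [SOS], 0, 0
    while ti < len(text_tokens):
        seq.extend(text_tokens[ti:ti + n]); ti += n
        seq.extend(speech_tokens[si:si + m]); si += m
    seq.append(TOS)
    seq.extend(speech_tokens[si:])
    seq.append(EOS)
    return seq
```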
Stage 3: Chunk-Aware Causal Flow Matching (CFM)
- Upsamples semantic tokens to the 50 Hz Mel-spectrogram frame rate.
- Adds look-ahead convolution and passes through N stacked causal Transformer-UNet blocks.
- The model is conditioned on the speaker embedding $\mathbf{v}$, the upsampled token sequence $\mu$, a potentially masked reference Mel spectrogram $\tilde{X}_1$, and the time step $t$.
- Learns an OT-flow ODE with path $\phi_t(X_0, X_1) = (1-t)X_0 + tX_1$, $X_0 \sim N(0, I)$, and target velocity $\omega_t = X_1 - X_0$, minimizing $\mathbb{E}_{X_0, X_1, t}\,\big\|\nu_t\big(\phi_t(X_0, X_1) \mid \mathbf{v}, \mu, \tilde{X}_1\big) - \omega_t\big\|_1$.
- Inference uses cosine time rescaling ($t := 1 - \cos\tfrac{\pi t}{2}$), 10 NFE, and classifier-free guidance.
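To make the flow-matching objective above concrete, here is a minimal PyTorch-style sketch of one training step under the stated assumptions (OT path $(1-t)X_0 + tX_1$, target velocity $X_1 - X_0$, and an L1 regression loss); `cfm_unet` stands in for the causal Transformer-UNet and its signature is hypothetical.

```python
import torch
import torch.nn.functional as F

def cfm_training_step(cfm_unet, mel_target, cond):
    """One conditional flow-matching step on a batch of Mel spectrograms.

    mel_target: (B, T, n_mels) ground-truth Mel X_1
    cond: conditioning dict (speaker embedding, upsampled tokens, masked Mel)
    """
    x1 = mel_target
    x0 = torch.randn_like(x1)                    # X_0 ~ N(0, I)
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                 # OT path phi_t(X_0, X_1)
    target = x1 - x0                             # target velocity omega_t
    pred = cfm_unet(xt, t.view(-1), **cond)      # predicted velocity nu_t
    return F.l1_loss(pred, target)
```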
Stage 4: HiFi-GAN Vocoder
- Converts Mel spectrograms to waveforms.
Data Flow Overview:
```
text → BPE → [S] + text tokens + [T] → LM → μ (speech tokens) → upsample + look-ahead conv + causal Transformer blocks → CFM → Mel → vocoder → waveform
```
Inference Pseudocode:
Non-streaming LM
```
# Non-streaming decoding: condition on [S] + text + [T], then sample
# speech tokens autoregressively until the end-of-sequence token [E].
tokens = [S] + text_tokens + [T]
while True:
    next_token = sample(QwenLM(tokens))
    if next_token == [E]:
        break
    tokens.append(next_token)
μ = tokens after [T] and before [E]   # the semantic speech tokens
```
Streaming LM

```
# Streaming decoding: text arrives in chunks of N tokens; speech tokens are
# emitted in chunks of M. FILL asks for the next text chunk; [T] marks the
# end of the text; [E] ends the utterance.
prompt = [S] + first N text_tokens
emitted = []
while True:
    out = sample(QwenLM(prompt))
    if out == FILL_TOKEN:
        prompt.extend(next N text_tokens)
    elif out == [T]:
        prompt.append([T])
    elif out == [E]:
        break                      # any trailing partial chunk is flushed afterwards
    else:
        prompt.append(out)
        emitted.append(out)
        if len(emitted) % M == 0:
            flush(emitted[-M:])    # send a chunk of M speech tokens downstream
```
2. Finite-Scalar Quantization (FSQ)
FSQ addresses codebook utilization and quantization efficiency in the speech tokenizer:
- For intermediate activations $H$ of Encoder₁:
- Project to reduced dimension: $\bar{H} = \mathrm{Proj}_{\mathrm{down}}(H)$
- Quantize each scalar: $\bar{h}^{q}_{i,j} = \mathrm{ROUND}(\bar{h}_{i,j}) \in \{-K, \dots, K\}$
- Reproject: $\hat{H} = \mathrm{Proj}_{\mathrm{up}}(\bar{H}^{q})$
- Index codebook: $\mu_i = \sum_{j=0}^{D-1} \bar{h}^{q}_{i,j}\,(2K+1)^j$
- Training uses a straight-through estimator for the ROUND operation, with the cross-entropy loss on the ASR decoder's posterior as the supervision signal.
- FSQ codebook: $(2K+1)^D$ entries with 100% utilization, compared to 23% utilization for the VQ codebook of the original CosyVoice (Table 4.1).
- Per-scalar quantization error is bounded by $1/2$ after rounding, so the per-token error grows at most linearly in the low-rank dimension $D$.
This quantization strategy underpins maximal efficiency in representing supervised speech tokens, facilitating low-latency and compact autoregressive modeling within the LM.
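A minimal sketch of the FSQ forward pass under the definitions above, written in PyTorch; the clamp-then-round formulation and the module names (`proj_down`, `proj_up`) are illustrative choices, not the repository's exact implementation.

```python
import torch
import torch.nn as nn

class FSQ(nn.Module):
    """Finite-scalar quantization: project down, round each scalar into [-K, K],
    project back up, and read the rounded vector as a base-(2K+1) token index."""

    def __init__(self, d_model, d_low, K):
        super().__init__()
        self.K = K
        self.proj_down = nn.Linear(d_model, d_low)
        self.proj_up = nn.Linear(d_low, d_model)

    def forward(self, h):                              # h: (B, T, d_model)
        z = torch.clamp(self.proj_down(h), -self.K, self.K)
        codes = torch.round(z)                         # integers in [-K, K]
        z_q = z + (codes - z).detach()                 # straight-through estimator
        h_hat = self.proj_up(z_q)
        # Token index in [0, (2K+1)^D): shift codes to [0, 2K] (an illustrative
        # choice to keep indices non-negative) and treat them as base-(2K+1) digits.
        base = (2 * self.K + 1) ** torch.arange(codes.size(-1), device=h.device)
        idx = ((codes + self.K).long() * base).sum(dim=-1)
        return h_hat, idx
```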
3. Chunk-Aware Causal Flow Matching
CFM is central to high-quality, low-latency Mel synthesis and enables both streaming and offline modes:
- Generative path: $\phi_t(X_0, X_1) = (1-t)X_0 + tX_1$ with $X_0 \sim N(0, I)$, $X_1$ the target Mel spectrogram, and target velocity $\omega_t = X_1 - X_0$.
- Loss: $\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{X_0, X_1, t}\,\big\|\nu_t\big(\phi_t(X_0, X_1) \mid \mathbf{v}, \mu, \tilde{X}_1\big) - \omega_t\big\|_1$
- Inference ODE: $\frac{dX_t}{dt} = \nu_t(X_t \mid \mathbf{v}, \mu, \tilde{X}_1)$, integrated from $t = 0$ to $t = 1$ in a fixed number of steps (typically 10 NFE).
- During training, attention masks for the Transformer-UNet are sampled among:
- non-causal (full),
- full-causal (only past),
- chunk-$M$ (attend to the past plus $M$ future frames),
- chunk-$2M$ (past plus $2M$ future frames, emulating high-quality offline inference).
- This enables a single model to be deployed in streaming (restricted look-ahead for low latency) or offline (full look-ahead for quality) contexts.
Chunk-aware masking strategies are pivotal for balancing real-time requirements with synthesis accuracy, as measured by objective and subjective quality scores.
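As a concrete illustration of the four masking patterns above, here is a minimal sketch that builds frame-level attention masks; the block-wise reading of chunk-$M$ / chunk-$2M$ (each frame sees all past frames plus the frames up to the end of its own, or the next, chunk) is one plausible interpretation of the chunk-aware scheme.

```python
import torch

def chunk_aware_mask(seq_len, mode, chunk_size=None):
    """Boolean (seq_len, seq_len) attention mask; True = key may be attended to.

    mode: "non_causal" (full context), "full_causal" (past only),
          "chunk_M" (past + current chunk), "chunk_2M" (past + one extra chunk).
    """
    i = torch.arange(seq_len).unsqueeze(1)     # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)     # key positions, row vector
    if mode == "non_causal":
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    if mode == "full_causal":
        return j <= i
    chunks_visible = {"chunk_M": 1, "chunk_2M": 2}[mode]
    # Keys are visible up to the end of the query's chunk (plus one more chunk for chunk-2M).
    visible_until = (i // chunk_size + chunks_visible) * chunk_size - 1
    return j <= visible_until
```

During training one of the four modes would be sampled per example, so a single model covers both the streaming (restricted look-ahead) and offline (full-context) deployments described above.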
4. Streaming and Non-Streaming Synthesis
CosyVoice 2 provides explicit modes:
- Non-Streaming: The sequence is $[S]$, text tokens, $[T]$, speech tokens, $[E]$, with the cross-entropy loss computed on the speech tokens and $[E]$.
- Streaming: Interleaves chunks of $N$ text tokens with chunks of $M$ speech tokens. The LM predicts a special FILL token to request the next text chunk, and $[T]$ once the text is exhausted.
- Latency Formulation:
- First-package latency: $L_{\mathrm{TTS}} = M \cdot d_{\mathrm{lm}} + M \cdot d_{\mathrm{fm}} + M \cdot d_{\mathrm{voc}}$, where $d_{\mathrm{lm}}$, $d_{\mathrm{fm}}$, and $d_{\mathrm{voc}}$ are the per-token costs of the LM, flow-matching module, and vocoder.
- For chat: $L_{\mathrm{Chat}} \leq N \cdot d_{\mathrm{llm}} + L_{\mathrm{TTS}}$, where $d_{\mathrm{llm}}$ is the per-token cost of the upstream text LLM.
- Typical settings fix the text-to-speech token ratio $N{:}M$ used for every interleaved chunk.
- Quality-Latency Trade-Off:
- Decreasing $M$ yields lower first-package latency but increases computational overhead per second of synthesized audio.
- The choice of chunk mask determines the trade-off between latency and Mel-spectrogram fidelity (see Fig. 3 for empirical trade-offs).
The streaming regime maintains near-lossless quality, with only a marginal WER gap between streaming and offline synthesis; a worked latency example follows.
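A small worked example of the first-package latency bound above; the per-token costs used here are placeholder numbers for illustration, not measurements from the paper.

```python
def first_package_latency(M, d_lm, d_fm, d_voc):
    """L_TTS = M*d_lm + M*d_fm + M*d_voc: generate the first M speech tokens,
    run flow matching on them, and vocode the first Mel chunk."""
    return M * (d_lm + d_fm + d_voc)

def chat_latency(N, d_llm, M, d_lm, d_fm, d_voc):
    """L_Chat <= N*d_llm + L_TTS: wait for the first N text tokens from the
    upstream chat LLM, then pay the TTS first-package latency."""
    return N * d_llm + first_package_latency(M, d_lm, d_fm, d_voc)

# Placeholder per-token costs (seconds), purely illustrative:
print(first_package_latency(M=15, d_lm=0.002, d_fm=0.001, d_voc=0.0005))           # ≈ 0.0525 s
print(chat_latency(N=5, d_llm=0.01, M=15, d_lm=0.002, d_fm=0.001, d_voc=0.0005))   # ≈ 0.1025 s
```

Halving $M$ halves $L_{\mathrm{TTS}}$ in this model, which is exactly the quality-latency knob discussed above.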
5. Training Methodology and Evaluation
Extensive multilingual and instructional tuning datasets underpin CosyVoice 2:
- Training Volumes:
- Speech tokenizer (FSQ): 200k hours ASR (110.9k h Chinese, 99.9k h English)
- LM + CFM: ~166.8k hours multi-speaker (130k h Chinese, 30k h English, 4.6k h Japanese, 2.2k h Korean)
- Instruction-tuning: +1,500 h with prompt targets and fine-grained labels (e.g., [laughter]).
- Evaluation Metrics:
- Content: WER/CER via Whisper-Large V3 (English), Paraformer (Chinese)
- Speaker similarity: Cosine similarity of ERes2Net embeddings (see the sketch at the end of this section)
- Speech quality: NMOS (objective), MOS/MOS-I (subjective)
- Latency: First-package time, RTF
- Performance Highlights:
| Benchmark | Content Consistency | Speaker Similarity | Quality |
|---|---|---|---|
| LibriSpeech test-clean | WER = 2.47% (vs. 2.66% human) | SS = 0.745 | NMOS = 3.96 (vs. 3.84 human) |
| SEED test-zh (Chinese) | CER = 1.45%; robust on test-hard (WER = 6.83%) | SS = 0.806 | — |
| Japanese / Korean | CER = 18.8% / 7.98% | SS = 0.63 / 0.71 | NMOS = 3.42 / 3.73 |
| Instruction control | CER = 1.52% | SS = 0.804 | MOS-I = 4.11/5, NMOS = 3.94 |
Streaming mode is nearly lossless, indicating successful chunk-aware adaptation.
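For concreteness, the speaker-similarity (SS) numbers above are cosine similarities between speaker embeddings of the enrollment prompt and the synthesized speech; a minimal sketch, assuming a hypothetical `eres2net_embed` function that maps a waveform to an embedding vector:

```python
import torch.nn.functional as F

def speaker_similarity(prompt_wav, synth_wav, eres2net_embed):
    """SS = cosine similarity between ERes2Net speaker embeddings of the
    reference prompt and the synthesized utterance."""
    e_ref = eres2net_embed(prompt_wav)    # (D,) embedding of the prompt speaker
    e_syn = eres2net_embed(synth_wav)     # (D,) embedding of the synthesized audio
    return F.cosine_similarity(e_ref.unsqueeze(0), e_syn.unsqueeze(0)).item()
```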
6. Implementation and Reproducibility
Key implementation details and resources:
- Codebase and Models:
https://github.com/FunAudioLLM/CosyVoice (For trained checkpoints and configuration).
- Audio Demos:
https://funaudiollm.github.io/cosyvoice2/
- Backbone: Initialize LM from Qwen2.5-0.5B; remove speaker embedding; tie text and speech token embeddings.
- FSQ Settings: Down-projection to a low-rank dimension $D$, with each scalar bounded and rounded to an integer in $[-K, K]$; codebook size $(2K+1)^D$; employs an STE for backpropagation through ROUND.
- CFM: 10 stacked UNet blocks, each with causal convolution + attention; the chunk mask is sampled uniformly per training example from {non-causal, full-causal, chunk-$M$, chunk-$2M$}; classifier-free guidance with NFE = 10 at inference.
- Streaming LM: interleaved chunks of $N$ text tokens and $M$ speech tokens; a dedicated FILL token id; speech tokens are emitted/flushed every $M$ tokens.
- Vocoder: HiFi-GAN, standard configuration.
Minimum requirements for reproduction:
- Train the FSQ-based speech tokenizer on aligned ASR data.
- Fine-tune Qwen2.5 LLM for next-token prediction on combined text and speech token sequences.
- Train the flow-matching UNet with chunk-aware masking.
- Inference with the described chunked or full-sequence regimes.
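To tie the reproduction steps together, here is a hedged sketch of the non-streaming inference path implied by the data-flow overview; all component names (`tokenizer`, `qwen_lm`, `flow_matcher`, `hifigan`) are placeholders, not the actual module names in the CosyVoice repository.

```python
def synthesize(text, prompt_wav, tokenizer, qwen_lm, flow_matcher, hifigan):
    """Non-streaming CosyVoice-2-style pipeline sketch:
    text -> BPE -> LM speech tokens -> chunk-aware flow matching -> Mel -> waveform."""
    text_tokens = tokenizer.encode(text)                       # BPE text tokens
    speech_tokens = qwen_lm.generate(text_tokens)              # autoregressive semantic tokens
    mel = flow_matcher(speech_tokens, prompt_wav, n_steps=10)  # 10 NFE, full (offline) mask
    return hifigan(mel)                                        # Mel -> waveform
```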
This modular, unified TTS design achieves sub-100 ms streaming latency and human-parity naturalness in multiple languages, with the latency-quality balance set by the chunk size and masking configuration (Du et al., 13 Dec 2024).