
CosyVoice 2: Multilingual Low-Latency TTS

Updated 13 November 2025
  • CosyVoice 2 is a multilingual text-to-speech system that uses finite-scalar quantization and chunk-aware causal flow matching to enable scalable, low-latency synthesis.
  • The system follows a four-stage pipeline—from a supervised semantic speech tokenizer to a unified LM, flow matching module, and HiFi-GAN vocoder—to generate natural audio.
  • It achieves human-parity naturalness with robust speaker similarity by training on hundreds of thousands of hours of diverse multilingual audio.

CosyVoice 2 is a multilingual speech synthesis system designed for scalable, low-latency, streaming text-to-speech (TTS), leveraging LLMs and generative flow modeling. It introduces architectural and algorithmic advances, including finite-scalar quantization and chunk-aware causal flow matching, to support both streaming and non-streaming synthesis with minimal latency and negligible quality degradation relative to offline operation. The model is trained on hundreds of thousands of hours of multilingual audio, achieving human-parity naturalness, robust speaker similarity, and high content fidelity across languages in both interactive and batch processing scenarios.

1. System Architecture

CosyVoice 2 comprises a four-stage pipeline, systematically optimized for streaming deployment:

Stage 1: Supervised Semantic Speech Tokenizer

  • Inputs: speech features, with BPE-tokenized multilingual text transcripts serving as the ASR supervision target.
  • Encoder₁: 6 Transformer blocks with rotary positional embeddings, operating at a 25 Hz frame rate.
  • Finite-Scalar Quantization (FSQ): Projects intermediate features to low-rank, quantizes each scalar to an integer in [−K,K], resulting in 100% codebook utilization.
  • Encoder₂ + ASR Decoder (SenseVoice-Large): Optimizes posterior text-token loss.

Stage 2: Unified Text-Speech LLM (LM)

  • Backbone: Pre-trained Qwen2.5-0.5B, decoder-only, with text encoder & speaker embedding modules removed from the original CosyVoice.
  • The LM is trained to autoregressively predict supervised semantic speech tokens.
  • Supports two sequence-construction schemes (non-streaming: sequential; streaming: interleaved chunks of text and speech tokens).
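
As an illustration, here is a minimal sketch of the two sequence-construction schemes. The special-token names and helper functions are assumptions for exposition, not the released implementation:

# Illustrative construction of LM training sequences (special-token names are assumed).
S, T, E, FILL = "<sos>", "<turn_of_speech>", "<eos>", "<fill>"

def non_streaming_sequence(text_tokens, speech_tokens):
    # [S] text ... [T] speech ... [E]; cross-entropy is computed on the speech span.
    return [S] + text_tokens + [T] + speech_tokens + [E]

def streaming_sequence(text_tokens, speech_tokens, N=15, M=15):
    # Interleave N text tokens with M speech tokens; once text runs out,
    # the turn-of-speech token and the remaining speech tokens follow.
    seq, ti, si = [S], 0, 0
    while ti < len(text_tokens) and si < len(speech_tokens):
        seq += text_tokens[ti:ti + N]; ti += N
        seq += speech_tokens[si:si + M]; si += M
    seq += text_tokens[ti:] + [T] + speech_tokens[si:] + [E]
    return seq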

Stage 3: Chunk-Aware Causal Flow Matching (CFM)

  • Upsamples semantic tokens to match the 50 Hz Mel-spectrogram frame rate.
  • Adds look-ahead convolution and passes through N stacked causal Transformer-UNet blocks.
  • The model is conditioned on the speaker embedding $v$, the upsampled token sequence $\mu$, a (possibly masked) reference Mel spectrogram $\tilde{X}_1$, and the time step $t$.
  • Learns an OT-flow ODE $\dot{X}_t = \omega_t(X_t \mid X_1)$ with loss $\mathcal{L}(\theta) = \mathbb{E}_{X_0 \sim \mathcal{N}(0,I),\, X_1 \sim q(X),\, t \sim U[0,1]} \left[ \left\| \omega_t\!\left(\varphi_t^{OT}(X_0, X_1)\right) - \nu_t\!\left(\varphi_t^{OT}(X_0, X_1) \mid \theta\right) \right\|_1 \right]$, where $\varphi_t^{OT}(X_0, X_1) = (1-t)X_0 + tX_1$ and $\omega_t = X_1 - X_0$.
  • Inference uses cosine time rescaling ($t \leftarrow 1 - \cos(\pi t / 2)$), 10 NFE, and classifier-free guidance ($\tilde{\nu}_t = (1+\beta)\,\nu_t(\text{cond}) - \beta\,\nu_t(\text{uncond})$, $\beta = 0.7$).
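
For intuition, a minimal Euler-style sampler sketch with cosine time rescaling and classifier-free guidance; `v_theta` stands in for the Transformer-UNet velocity predictor, and the plain Euler update is an assumption rather than the paper's exact ODE parameterization:

import numpy as np

def cfm_sample(v_theta, cond, shape, nfe=10, beta=0.7, seed=0):
    # v_theta(x, t, cond) -> predicted flow velocity; cond=None means unconditional.
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(shape)                                   # X_0 ~ N(0, I)
    ts = 1.0 - np.cos(np.pi * np.linspace(0.0, 1.0, nfe + 1) / 2.0)  # cosine-rescaled time steps
    for t0, t1 in zip(ts[:-1], ts[1:]):
        v_cond = v_theta(x, t0, cond)
        v_uncond = v_theta(x, t0, None)
        v = (1 + beta) * v_cond - beta * v_uncond                    # classifier-free guidance
        x = x + (t1 - t0) * v                                        # Euler step toward X_1 (Mel)
    return x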

Stage 4: HiFi-GAN Vocoder

  • Converts Mel spectrograms to waveforms.

Data Flow Overview:

text → BPE → [S]+tokens+[T] → LM → μ (speech tokens) → upsample + look-ahead conv + causal Transformer blocks → CFM → Mel → vocoder → waveform

Inference Pseudocode:

Non-streaming LM

def generate_speech_tokens(text_tokens):
    # Non-streaming: generate the full speech-token sequence before synthesis.
    tokens = [S] + text_tokens + [T]
    while True:
        next_token = sample(QwenLM(tokens))      # autoregressive LM step
        if next_token == E:                      # end-of-sequence token
            break
        tokens.append(next_token)
    return tokens[tokens.index(T) + 1:]          # speech tokens μ between [T] and [E]
Streaming LM (N text : M speech)

def stream_speech_tokens(text_tokens, N=15, M=15):
    # Generator: yields chunks of M speech tokens as they become available.
    prompt = [S] + text_tokens[:N]
    consumed, emitted = N, []
    while True:
        out = sample(QwenLM(prompt))
        if out == FILL_TOKEN:                      # LM requests the next N text tokens
            prompt += text_tokens[consumed:consumed + N]
            consumed += N
        elif out == T:                             # turn-of-speech: text is exhausted
            prompt.append(T)
        elif out == E:                             # end of utterance
            if len(emitted) % M:
                yield emitted[-(len(emitted) % M):]    # flush the final partial chunk
            break
        else:                                      # speech token
            prompt.append(out)
            emitted.append(out)
            if len(emitted) % M == 0:
                yield emitted[-M:]                 # hand a chunk of M tokens downstream

2. Finite-Scalar Quantization (FSQ)

FSQ addresses codebook utilization and quantization efficiency in the speech tokenizer:

  • For intermediate activations $H \in \mathbb{R}^{L \times d}$:
    • Project to a reduced dimension: $H_{\text{down}} = \operatorname{Proj}_{\text{down}}(H) \in \mathbb{R}^{L \times D}$
    • Quantize each scalar: $\bar{H} = \operatorname{ROUND}(H_{\text{down}})$, with each $\bar{h}_{i,j}$ an integer in $[-K, K]$
    • Re-project: $\hat{H} = \operatorname{Proj}_{\text{up}}(\bar{H})$
    • Codebook index: $\mu_i = \sum_{j=0}^{D-1} \bar{h}_{i,j} \cdot (2K+1)^j$
  • Training uses a straight-through estimator for the ROUND operation, optimizing the cross-entropy loss of the ASR decoder posterior.
  • FSQ codebook: $(2K+1)^D$ entries with 100% utilization, versus 23% for the VQ codebook of the original CosyVoice (Table 4.1).
  • Per-scalar quantization error is $\leq 0.5$; the global $\ell_2$ error is $\leq \frac{\sqrt{D}}{2}$ per token.

This quantization strategy ensures full use of the codebook, yielding compact supervised speech tokens that support low-latency autoregressive modeling in the LM.
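
A minimal NumPy sketch of the FSQ forward pass described above; the projection matrices and sizes are toy placeholders (in the real tokenizer they are learned layers inside the SenseVoice-Large encoder), and the non-negative index offset is an assumption:

import numpy as np

d_model, D, K = 512, 8, 1                            # toy model width; low-rank dim and bound
rng = np.random.default_rng(0)
W_down = 0.05 * rng.standard_normal((d_model, D))    # placeholder for Proj_down
W_up   = 0.05 * rng.standard_normal((D, d_model))    # placeholder for Proj_up

def fsq_forward(H):
    # H: (L, d_model) intermediate activations -> (re-projected features, token indices)
    H_down = H @ W_down                              # project to low-rank space
    H_bar = np.clip(np.round(H_down), -K, K)         # quantize each scalar to an integer in [-K, K]
    H_hat = H_bar @ W_up                             # re-project to the model dimension
    # Index as in the formula above; shifting by +K keeps indices in [0, (2K+1)^D).
    idx = ((H_bar + K) * (2 * K + 1) ** np.arange(D)).sum(axis=-1).astype(int)
    return H_hat, idx

H = rng.standard_normal((50, d_model))               # 2 s of 25 Hz frames (toy input)
_, mu = fsq_forward(H)                               # mu: one speech-token id per frame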

3. Chunk-Aware Causal Flow Matching

CFM is central to high-quality, low-latency Mel synthesis and enables both streaming and offline modes:

  • Generative path: $X_0 \sim \mathcal{N}(0,I)$, $X_1 \sim q(\text{Mel})$, $\varphi_t^{OT}(X_0, X_1) = (1-t)X_0 + tX_1$, $\omega_t = X_1 - X_0$
  • Loss: $\mathcal{L}(\theta) = \mathbb{E}_{X_0, X_1, t}\left[\left\|\omega_t(\varphi_t) - \nu_t(\varphi_t \mid \theta; \mu, \tilde{X}_1, v)\right\|_1\right]$
  • Inference ODE: $\frac{dX_t}{dt} = \frac{\nu_t(X_t \mid \theta)}{1-t}$, solved in $\mathrm{NFE}$ steps (typically 10).
  • During training, attention masks for the Transformer-UNet are sampled among:
    • non-causal (full),
    • full-causal (only past),
    • chunk-$M$ (attend to the past plus $M$ future frames),
    • chunk-$2M$ (past plus $2M$ future frames, emulating high-quality offline inference).
  • This enables a single model to be deployed in streaming (restricted look-ahead for low latency) or offline (full look-ahead for quality) contexts.

Chunk-aware masking strategies are pivotal for balancing real-time requirements with synthesis accuracy, as measured by objective and subjective quality scores.
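
The following sketch builds the four attention-mask variants on toy sizes, under the assumption that "chunk-$M$" means each frame may attend to all past frames plus future frames up to the end of its own chunk of length $M$:

import numpy as np

def chunk_attention_mask(T, chunk=None):
    # Returns a boolean (T, T) mask; True means frame i may attend to frame j.
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    if chunk is None:
        return np.ones((T, T), dtype=bool)     # non-causal (full attention, offline quality)
    if chunk == 0:
        return j <= i                          # fully causal (past only)
    return j < (i // chunk + 1) * chunk        # chunk-M / chunk-2M: past + bounded look-ahead

T_frames, M = 60, 15
masks = {
    "non-causal": chunk_attention_mask(T_frames, None),
    "full-causal": chunk_attention_mask(T_frames, 0),
    "chunk-M": chunk_attention_mask(T_frames, M),
    "chunk-2M": chunk_attention_mask(T_frames, 2 * M),
}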

4. Streaming and Non-Streaming Synthesis

CosyVoice 2 provides explicit modes:

  • Non-Streaming: the sequence format is $[S]$ text $[T]$ $\mu$ $[E]$, with cross-entropy loss computed on $\mu$.
  • Streaming: interleaves $N$ text tokens with $M$ speech tokens. The LM predicts a special FILL token to request the next text chunk.
  • Latency Formulation:
    • First-package latency: $L_{\text{TTS}} = M \cdot d_{\text{lm}} + M \cdot d_{\text{fm}} + M \cdot d_{\text{voc}}$
    • For chat applications: $L_{\text{Chat}} \leq N \cdot d_{\text{LLM}} + L_{\text{TTS}}$
  • Typical settings: $N = 15$, $M = 15$ per chunk.
  • Quality-Latency Trade-Off:
    • Decreasing $M$ yields lower latency but increases computational overhead per second of synthesized audio.
    • The choice of chunk mask determines the trade-off between latency and Mel-spectrogram fidelity (see Fig. 3 for empirical trade-offs).

The streaming regime maintains near-lossless quality, with a WER delta of $< 0.1\%$ between streaming and offline synthesis.
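
To make the latency formulas concrete, here is a toy calculation with hypothetical per-token costs (the millisecond values below are placeholders, not measurements from the paper):

# Hypothetical per-token processing costs in milliseconds (placeholders, not reported numbers).
d_lm, d_fm, d_voc, d_llm = 2.0, 1.0, 0.5, 3.0
M, N = 15, 15

L_tts = M * d_lm + M * d_fm + M * d_voc        # first-package TTS latency
L_chat = N * d_llm + L_tts                     # upper bound when a text LLM feeds the TTS
print(f"L_TTS = {L_tts:.1f} ms, L_Chat <= {L_chat:.1f} ms for the first chunk")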

5. Training Methodology and Evaluation

Extensive multilingual and instructional tuning datasets underpin CosyVoice 2:

  • Training Volumes:
    • Speech tokenizer (FSQ): 200k hours ASR (110.9k h Chinese, 99.9k h English)
    • LM + CFM: ~166.8k hours multi-speaker (130k h Chinese, 30k h English, 4.6k h Japanese, 2.2k h Korean)
    • Instruction-tuning: +1,500 h with prompt targets and fine-grained labels (e.g., [laughter]).
  • Evaluation Metrics:
    • Content: WER/CER via Whisper-Large V3 (English), Paraformer (Chinese)
    • Speaker similarity: Cosine similarity of ERes2Net embeddings
    • Speech quality: NMOS (objective), MOS/MOS-I (subjective)
    • Latency: First-package time, RTF
  • Performance Highlights:
    • Librispeech test-clean: WER = 2.47% (vs. 2.66% for human speech); SS = 0.745; NMOS = 3.96 (vs. 3.84 for human speech)
    • SEED test-zh (Chinese): CER = 1.45%, remaining robust on the "hard" subset (WER = 6.83%); SS = 0.806
    • Japanese / Korean: CER = 18.8% / 7.98%; SS = 0.63 / 0.71; NMOS = 3.42 / 3.73
    • Instruction control: CER = 1.52%; SS = 0.804; NMOS = 3.94; MOS-I = 4.11/5

Streaming mode is nearly lossless, indicating successful chunk-aware adaptation.
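
As an example of how the speaker-similarity (SS) metric is computed, here is cosine similarity between two speaker embeddings; the vectors below are random placeholders rather than real ERes2Net outputs:

import numpy as np

def speaker_similarity(emb_ref, emb_gen):
    # Cosine similarity between reference and generated-speech speaker embeddings.
    return float(emb_ref @ emb_gen / (np.linalg.norm(emb_ref) * np.linalg.norm(emb_gen)))

rng = np.random.default_rng(0)
emb_ref, emb_gen = rng.standard_normal(192), rng.standard_normal(192)  # toy embedding vectors
print(f"SS = {speaker_similarity(emb_ref, emb_gen):.3f}")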

6. Implementation and Reproducibility

Key implementation details and resources:

  • Codebase and models: https://github.com/FunAudioLLM/CosyVoice (trained checkpoints and configuration).
  • Audio demos: https://funaudiollm.github.io/cosyvoice2/

  • Backbone: Initialize LM from Qwen2.5-0.5B; remove speaker embedding; tie text and speech token embeddings.
  • FSQ Settings: down-projection to $D = 8$ dimensions with $K = 1$, giving a codebook of $(2K+1)^D = 3^8 = 6561$ entries; a straight-through estimator is used for backpropagation.
  • CFM: 10 stacked causal Transformer-UNet blocks (causal convolution + attention); chunk mask $\lambda$ sampled uniformly from $\{0, M, 2M, \infty\}$; CFG $\beta = 0.7$; NFE = 10 per example.
  • Streaming LM: chunk sizes $N = 15$, $M = 15$; a special FILL token id; emit/flush every $M$ speech tokens.
  • Vocoder: HiFi-GAN, standard configuration.
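
For convenience, the hyperparameters listed above are gathered into a single configuration sketch; the field names are illustrative and do not correspond to the released configuration files:

# Illustrative configuration summary (field names are assumptions, values from the text above).
cosyvoice2_config = {
    "lm": {"backbone": "Qwen2.5-0.5B", "tied_text_speech_embeddings": True},
    "fsq": {"low_rank_dim": 8, "K": 1, "codebook_size": (2 * 1 + 1) ** 8},  # 6561 entries
    "cfm": {"unet_blocks": 10, "chunk_masks": ["0", "M", "2M", "inf"],
            "cfg_beta": 0.7, "nfe": 10},
    "streaming": {"text_chunk_N": 15, "speech_chunk_M": 15},
    "rates": {"speech_token_hz": 25, "mel_hz": 50},
    "vocoder": "HiFi-GAN",
}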

Minimum requirements for reproduction:

  1. Train the FSQ-based speech tokenizer on aligned ASR data.
  2. Fine-tune Qwen2.5 LLM for next-token prediction on combined text and speech token sequences.
  3. Train the flow-matching UNet with chunk-aware masking.
  4. Inference with the described chunked or full-sequence regimes.

This modular, unified TTS design achieves sub-100 ms streaming latency and human-parity naturalness in multiple languages, with balanced trade-offs between latency and output quality as determined by chunk and masking configuration (Du et al., 2024).

References

  1. Du, Z., et al. (2024). CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models.
