CosyVoice 2: Multilingual Low-Latency TTS

Updated 13 November 2025
  • CosyVoice 2 is a multilingual text-to-speech system that uses finite-scalar quantization and chunk-aware causal flow matching to enable scalable, low-latency synthesis.
  • The system follows a four-stage pipeline—from a supervised semantic speech tokenizer to a unified LM, flow matching module, and HiFi-GAN vocoder—to generate natural audio.
  • It achieves human-parity naturalness with robust speaker similarity by training on hundreds of thousands of hours of diverse multilingual audio.

CosyVoice 2 is a multilingual speech synthesis system designed for scalable, low-latency, streaming text-to-speech (TTS), combining a large language model (LLM) backbone with generative flow matching. It introduces architectural and algorithmic advances, including finite-scalar quantization and chunk-aware causal flow matching, that support both streaming and non-streaming synthesis with minimal latency and near-lossless quality relative to offline synthesis. The model is trained on hundreds of thousands of hours of multilingual audio and achieves human-parity naturalness, robust speaker similarity, and high content fidelity across languages in both interactive and batch scenarios.

1. System Architecture

CosyVoice 2 comprises a four-stage pipeline, systematically optimized for streaming deployment:

Stage 1: Supervised Semantic Speech Tokenizer

  • Inputs: speech (Mel features), paired with BPE-tokenized multilingual transcripts that supervise the ASR objective.
  • Encoder₁: 6 Transformer blocks with rotary embeddings at 25 Hz.
  • Finite-Scalar Quantization (FSQ): Projects intermediate features into a low-dimensional space and quantizes each scalar to an integer in $[-K, K]$, yielding 100% codebook utilization.
  • Encoder₂ + ASR Decoder (SenseVoice-Large): Optimizes posterior text-token loss.

Stage 2: Unified Text-Speech LLM (LM)

  • Backbone: Pre-trained Qwen2.5-0.5B, decoder-only, with text encoder & speaker embedding modules removed from the original CosyVoice.
  • The LM is trained to autoregressively predict supervised semantic speech tokens.
  • Supports two sequence-construction schemes: non-streaming (text followed by speech tokens) and streaming (interleaved chunks of text and speech tokens); both are sketched below.
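
As a rough illustration, the two layouts can be constructed as follows; SOS, TOS, and EOS are placeholder ids for [S], [T], and [E], and the handling of ragged chunk boundaries (filling and turn-of-speech tokens) is simplified rather than the paper's exact recipe.

# Hedged sketch of the two LM training-sequence layouts.
def build_nonstreaming(text_tokens, speech_tokens, SOS, TOS, EOS):
    # [S] text [T] speech [E]; cross-entropy is computed on the speech part.
    return [SOS] + text_tokens + [TOS] + speech_tokens + [EOS]

def build_streaming(text_tokens, speech_tokens, SOS, EOS, N=15, M=15):
    # Interleave chunks of N text tokens with chunks of M speech tokens.
    seq, t, s = [SOS], 0, 0
    while t < len(text_tokens) or s < len(speech_tokens):
        seq += text_tokens[t:t + N] + speech_tokens[s:s + M]
        t, s = t + N, s + M
    return seq + [EOS]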

Stage 3: Chunk-Aware Causal Flow Matching (CFM)

  • Upsamples semantic tokens to the 50 Hz Mel-spectrogram frame rate.
  • Adds a look-ahead convolution and passes the result through N stacked causal Transformer-UNet blocks.
  • The model is conditioned on the speaker embedding $v$, the upsampled token sequence $\mu$, a (potentially masked) reference Mel $\tilde{X}_1$, and the time step $t$.
  • Learns an OT-flow ODE $\dot{X}_t = \omega_t(X_t|X_1)$ with loss $\mathcal{L}(\theta) = \mathbb{E}_{X_0 \sim \mathcal{N}(0,I),\, X_1 \sim q(X),\, t \sim U[0,1]} \left[ \| \omega_t(\varphi_t^{OT}(X_0, X_1)) - \nu_t(\varphi_t^{OT}(X_0, X_1)\,|\,\theta) \|_1 \right]$, where $\varphi_t^{OT} = (1-t)X_0 + tX_1$ and $\omega_t = X_1 - X_0$.
  • Inference uses cosine time rescaling ($t \leftarrow 1 - \cos(\pi t/2)$), 10 NFE, and classifier-free guidance $\tilde{\nu}_t = (1+\beta)\,\nu_t(\text{cond}) - \beta\,\nu_t(\text{uncond})$ with $\beta = 0.7$. A training-step sketch follows this list.
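
For concreteness, a single training step of this objective might look like the sketch below; `model` stands in for the causal Transformer-UNet, and its call signature is an assumption rather than the repository's actual API.

# Sketch of one conditional flow-matching training step (assumed model signature).
import torch

def cfm_loss(model, x1, mu, x1_ref, spk_emb):
    """x1: target Mel [B, T, n_mels]; mu: upsampled tokens; x1_ref: masked/reference Mel; spk_emb: v."""
    x0 = torch.randn_like(x1)                            # X0 ~ N(0, I)
    t = torch.rand(x1.shape[0], 1, 1, device=x1.device)  # t ~ U[0, 1]
    phi_t = (1 - t) * x0 + t * x1                        # OT path φ_t^OT(X0, X1)
    omega_t = x1 - x0                                    # target vector field ω_t
    v_pred = model(phi_t, t.flatten(), mu, x1_ref, spk_emb)  # ν_t(φ_t | θ; μ, X̃1, v)
    return (omega_t - v_pred).abs().mean()               # L1 flow-matching loss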

Stage 4: HiFi-GAN Vocoder

  • Converts Mel spectrograms to waveforms.

Data Flow Overview:

text → BPE → [S]+tokens+[T] → LM → μ (speech tokens) → upsample + look-ahead conv + causal Transformer blocks → CFM → Mel → vocoder → waveform

Inference Pseudocode:

Non-streaming LM

tokens = [S] + text_tokens + [T]         # S, T, E are special-token ids
while True:
    logits = QwenLM(tokens)              # decoder-only LM forward pass
    next_token = sample(logits)
    if next_token == E:                  # stop at the end-of-sequence token
        break
    tokens.append(next_token)
mu = tokens[tokens.index(T) + 1:]        # speech tokens μ: everything after [T]
Streaming LM (N text : M speech)
prompt = [S] + text_tokens[:N]                    # start with the first N text tokens
consumed, chunk = N, []
while True:
    out = sample(QwenLM(prompt))
    if out == FILL_TOKEN:                         # LM requests the next text chunk
        prompt += text_tokens[consumed:consumed + N]
        consumed += N
        continue
    prompt.append(out)                            # feed generated tokens back (autoregressive)
    if out == T:                                  # turn-of-speech marker, not emitted
        continue
    if out == E:                                  # end of utterance: flush the remainder
        yield chunk
        break
    chunk.append(out)
    if len(chunk) == M:                           # hand a chunk of M speech tokens to the CFM stage
        yield chunk
        chunk = []

2. Finite-Scalar Quantization (FSQ)

FSQ addresses codebook utilization and quantization efficiency in the speech tokenizer:

  • For intermediate activations $H \in \mathbb{R}^{L \times d}$:
    • Project to a reduced dimension: $H_{\text{down}} = \operatorname{Proj}_{\text{down}}(H) \in \mathbb{R}^{L \times D}$
    • Quantize each scalar: $\bar{H} = \operatorname{ROUND}(H_{\text{down}})$ with $\bar{h}_{i,j} \in [-K, K]$
    • Reproject: $\hat{H} = \operatorname{Proj}_{\text{up}}(\bar{H})$
    • Compute the codebook index: $\mu_i = \sum_{j=0}^{D-1} \bar{h}_{i,j} \cdot (2K+1)^j$
  • Training uses a straight-through estimator for ROUND and the cross-entropy loss of the ASR decoder posterior.
  • FSQ codebook: $(2K+1)^D$ entries with 100% utilization, compared to 23% for VQ (Table 4.1).
  • Per-scalar quantization error $\leq 0.5$; global $\ell_2$ error $\leq \frac{\sqrt{D}}{2}$ per token.

This quantization strategy maximizes codebook efficiency for the supervised speech tokens, supporting compact, low-latency autoregressive modeling in the LM.
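
A minimal NumPy sketch of the rounding and indexing steps is shown below; the learned up/down projections and the straight-through estimator are omitted, the +K shift that yields non-negative indices is an implementation convenience rather than the formula above, and K = 1 is chosen only to match the roughly 6561-entry codebook reported in Section 6.

# Minimal FSQ sketch (inference path only): round to integer levels and form a token index.
import numpy as np

def fsq_quantize(h_down, K=1):
    """h_down: [L, D] low-dimensional features after Proj_down."""
    h_bar = np.clip(np.round(h_down), -K, K)              # each scalar to an integer in [-K, K]
    weights = (2 * K + 1) ** np.arange(h_bar.shape[1])    # base-(2K+1) place values
    mu = ((h_bar + K) * weights).sum(axis=1).astype(int)  # +K shift gives non-negative indices
    return h_bar, mu

h_bar, mu = fsq_quantize(np.random.randn(4, 8))           # 4 frames, D=8 -> indices in [0, 6560]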

3. Chunk-Aware Causal Flow Matching

CFM is central to high-quality, low-latency Mel synthesis and enables both streaming and offline modes:

  • Generative path: $X_0 \sim \mathcal{N}(0,I)$, $X_1 \sim q(\text{Mel})$, $\varphi_t^{OT} = (1-t)X_0 + tX_1$, $\omega_t = X_1 - X_0$
  • Loss: $\mathcal{L}(\theta) = \mathbb{E}_{X_0,X_1,t} \left[\|\omega_t(\varphi_t) - \nu_t(\varphi_t\,|\,\theta;\mu, \tilde X_1, v)\|_1\right]$
  • Inference ODE: $\frac{dX}{dt} = \frac{\nu_t(X|\theta)}{1-t}$, solved in NFE steps (typically 10).
  • During training, attention masks for the Transformer-UNet are sampled among:
    • non-causal (full),
    • full-causal (only past),
    • chunk-$M$ (attend to the past plus $M$ future frames),
    • chunk-$2M$ (past plus $2M$ future frames, emulating high-quality offline inference).
  • This enables a single model to be deployed in streaming (restricted look-ahead for low latency) or offline (full look-ahead for quality) contexts.

Chunk-aware masking strategies are pivotal for balancing real-time requirements with synthesis accuracy, as measured by objective and subjective quality scores.
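
The sketch below shows one plausible block-wise realization of these masks; whether the released model uses block-wise or sliding look-ahead windows is an implementation detail assumed here, with chunk-M and chunk-2M corresponding to chunk_mask(T, M) and chunk_mask(T, 2*M).

# Sketch of chunk-aware attention masks for the Transformer-UNet (block-wise variant).
import numpy as np

def chunk_mask(T, M=None):
    """Return a [T, T] boolean mask; True means frame i may attend to frame j."""
    if M is None:                        # non-causal: full attention (offline mode)
        return np.ones((T, T), dtype=bool)
    i = np.arange(T)[:, None]
    j = np.arange(T)[None, :]
    if M == 0:                           # full-causal: past frames only
        return j <= i
    return (j // M) <= (i // M)          # past chunks plus the current M-frame chunk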

4. Streaming and Non-Streaming Synthesis

CosyVoice 2 provides explicit modes:

  • Non-Streaming: Sequence format is $[S]$ text $[T]$ $\mu$ $[E]$, with cross-entropy loss computed on $\mu$.
  • Streaming: Interleaves $N$ text tokens with $M$ speech tokens. The LM predicts a special FILL token to request the next text chunk.
  • Latency Formulation (a worked example follows at the end of this section):
    • First-package latency: $L_{\text{TTS}} = M \cdot d_{\text{lm}} + M \cdot d_{\text{fm}} + M \cdot d_{\text{voc}}$
    • For chat applications: $L_{\text{Chat}} \leq N \cdot d_{\text{LLM}} + L_{\text{TTS}}$
  • Typical settings: $N = 15$, $M = 15$ per chunk.
  • Quality-Latency Trade-Off:
    • Decreasing $M$ yields lower latency but increases computational overhead per second of synthesized audio.
    • The choice of chunk mask determines the trade-off between latency and Mel fidelity (see Fig. 3 for empirical trade-offs).

The streaming regime maintains near-lossless quality, with a WER delta of less than 0.1% between streaming and offline synthesis.
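
The latency formulation reduces to simple arithmetic, as in the sketch below; the per-token and per-frame costs are illustrative placeholders, not measurements from the paper.

# First-package latency from the formula above; the costs below are made-up placeholders.
def first_package_latency(M, d_lm, d_fm, d_voc, N=0, d_llm=0.0):
    l_tts = M * (d_lm + d_fm + d_voc)       # L_TTS = M*(d_lm + d_fm + d_voc)
    return N * d_llm + l_tts                # chat setting adds N upstream LLM text tokens

print(first_package_latency(M=15, d_lm=0.004, d_fm=0.002, d_voc=0.001))  # ~0.105 s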

5. Training Methodology and Evaluation

Extensive multilingual and instructional tuning datasets underpin CosyVoice 2:

  • Training Volumes:
    • Speech tokenizer (FSQ): 200k hours ASR (110.9k h Chinese, 99.9k h English)
    • LM + CFM: ~166.8k hours multi-speaker (130k h Chinese, 30k h English, 4.6k h Japanese, 2.2k h Korean)
    • Instruction-tuning: +1,500 h with prompt targets and fine-grained labels (e.g., [laughter]).
  • Evaluation Metrics:
    • Content: WER/CER via Whisper-Large V3 (English), Paraformer (Chinese)
    • Speaker similarity: Cosine similarity of ERes2Net embeddings
    • Speech quality: NMOS (objective), MOS/MOS-I (subjective)
    • Latency: First-package time, RTF
  • Performance Highlights:
    • LibriSpeech test-clean: WER = 2.47% (vs. 2.66% human), SS = 0.745, NMOS = 3.96 (vs. 3.84); subjective MOS/MOS-I also reported.
    • SEED test-zh (Chinese): CER = 1.45%, SS = 0.806; remains robust on the "hard" subset (WER = 6.83%).
    • Japanese / Korean: CER = 18.8% / 7.98%, SS = 0.63 / 0.71, NMOS = 3.42 / 3.73.
    • Instruction control: MOS-I = 4.11/5, CER = 1.52%, SS = 0.804, NMOS = 3.94.

Streaming mode is nearly lossless, indicating successful chunk-aware adaptation.
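
As a concrete note on the speaker-similarity (SS) metric listed above, it is the cosine similarity between speaker embeddings of the prompt and the generated audio; the embedding extractor (ERes2Net) is treated as a black box in this sketch.

# Cosine speaker similarity between two embedding vectors (embedding extraction not shown).
import numpy as np

def speaker_similarity(emb_ref, emb_gen):
    a = emb_ref / np.linalg.norm(emb_ref)
    b = emb_gen / np.linalg.norm(emb_gen)
    return float(a @ b)                      # 1.0 = identical direction, 0.0 = orthogonal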

6. Implementation and Reproducibility

Key implementation details and resources:

  • Codebase and models: https://github.com/FunAudioLLM/CosyVoice (trained checkpoints and configuration)
  • Audio demos: https://funaudiollm.github.io/cosyvoice2/

  • Backbone: Initialize LM from Qwen2.5-0.5B; remove speaker embedding; tie text and speech token embeddings.
  • FSQ Settings: Down-projection to $D = 8$ dimensions; codebook size $(2K+1)^D = 3^8 = 6561$ (i.e., $K = 1$); a straight-through estimator is used for backpropagation.
  • CFM: 10 stacked causal UNet blocks, each with causal convolution and attention; chunk mask $\lambda$ sampled uniformly from $\{0, M, 2M, \infty\}$; CFG $\beta = 0.7$; NFE = 10.
  • Streaming LM: chunk sizes $N = 15$, $M = 15$; a dedicated FILL token id; emit/flush every $M$ speech tokens.
  • Vocoder: HiFi-GAN, standard configuration.
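
The settings listed above can be collected into a single illustrative configuration; the field names below are placeholders and do not mirror the repository's actual config schema.

# Illustrative consolidation of the hyperparameters above (placeholder field names).
COSYVOICE2_CONFIG = {
    "lm_backbone": "Qwen2.5-0.5B",
    "fsq": {"down_proj_dim": 8, "codebook_size": 6561},     # (2K+1)^D = 3^8
    "cfm": {"unet_blocks": 10, "nfe": 10, "cfg_beta": 0.7,
            "chunk_masks": [0, "M", "2M", "inf"]},
    "streaming": {"text_chunk_N": 15, "speech_chunk_M": 15},
    "vocoder": "HiFi-GAN",
}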

Minimum requirements for reproduction:

  1. Train the FSQ-based speech tokenizer on aligned ASR data.
  2. Fine-tune Qwen2.5 LLM for next-token prediction on combined text and speech token sequences.
  3. Train the flow-matching UNet with chunk-aware masking.
  4. Inference with the described chunked or full-sequence regimes.

This modular, unified TTS design achieves low first-package latency in streaming mode and human-parity naturalness across multiple languages, with the latency-quality trade-off governed by the chunk size and masking configuration (Du et al., 13 Dec 2024).

References
1. Du et al., "CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models," 13 Dec 2024.