CosyVoice: Advanced Zero-Shot TTS

Updated 16 June 2026

CosyVoice is a family of large language model-based zero-shot text-to-speech systems that utilize supervised semantic tokens and autoregressive synthesis for natural speech generation.
It employs an autoregressive text-to-token language model alongside a conditional flow matching decoder to achieve sub-50ms latency and robust multilingual performance.
CosyVoice reports human-parity naturalness in voice cloning, supports low-latency streaming and fine-grained speech editing, and shows significant improvements in WER and speaker similarity metrics.

CosyVoice is a family of LLM-based zero-shot text-to-speech (TTS) systems characterized by supervised semantic speech tokens, autoregressive or parallel text-to-token synthesis, and a conditional flow-matching decoder. Its design enables human-parity naturalness, speaker similarity, multilingual and cross-lingual voice cloning, fine-grained style control, low-latency streaming, and robust instruction following. The architecture has evolved rapidly, with major public releases culminating in CosyVoice 3, which incorporates scaling, post-training, and unified speech-editing capabilities.

1. Architectural Foundations

CosyVoice factorizes TTS into a discrete sequence transduction pipeline. The principal architectural stages are:

Supervised Speech Tokenizer: Converts speech audio into discrete, text-aligned semantic tokens using either vector quantization (VQ) or, in CosyVoice 2+, finite-scalar quantization (FSQ). Supervision derives from ASR (and, in CosyVoice 3, multi-task objectives including emotion recognition, language ID, and speaker analysis).
Text-Speech Language Model (LM): An LLM (e.g., Qwen2.5-0.5B, scaled to 1.5B in CosyVoice 3) generates supervised speech tokens from the input text (BPE- or phone-based), optionally conditioned on prompt speech tokens or a speaker embedding. Generation is autoregressive in the core models and bidirectional in later variants.
Conditional Flow Matching (CFM) Decoder: A non-autoregressive, chunk-aware model that transforms semantic tokens (μ), speaker embedding (v), and masked Mel-spectrograms into acoustic features by learning a probability flow defined by an ordinary differential equation (ODE).
Vocoder: A neural waveform synthesizer (e.g., HiFi-GAN, HiFTNet) that produces the final waveform, with support for streaming and chunked upsampling.

This design supports a streaming architecture in which token generation and acoustic synthesis proceed chunk by chunk, keeping end-to-end latency low with minimal loss in naturalness (Du et al., 2024, Du et al., 23 May 2025).
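The chunk-aware behaviour mentioned above can be illustrated with a block-causal attention mask. The following is a minimal sketch, not the CosyVoice implementation: the function name `chunk_causal_mask` and the rule "a frame sees its own chunk and all earlier chunks" are assumptions chosen for illustration.

```python
import numpy as np


def chunk_causal_mask(num_frames: int, chunk_size: int) -> np.ndarray:
    """Boolean mask where entry [i, j] is True when frame i may attend to frame j.

    Assumed rule (illustrative): a frame attends to every frame in its own
    chunk and in all earlier chunks, never to frames in later chunks.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    # Chunk index of every frame, e.g. [0, 0, 1, 1, 2, 2] for chunk_size=2.
    chunk_id = np.arange(num_frames) // chunk_size
    # Frame i (row) sees frame j (column) when j's chunk is not later than i's.
    return chunk_id[None, :] <= chunk_id[:, None]


mask = chunk_causal_mask(num_frames=6, chunk_size=2)
print(mask.astype(int))
```

Setting `chunk_size` equal to `num_frames` recovers full (offline) attention, and `chunk_size=1` recovers an ordinary causal mask, which is why a single model can in principle serve both streaming and offline modes.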

2. Supervised Semantic Tokenization and Quantization

CosyVoice is distinguished by its use of supervised, text-aligned semantic speech tokens derived from multilingual ASR networks with inserted quantization layers. The progression of quantization approaches reflects careful attention to codebook utilization and semantic expressivity:

Vector Quantization (VQ): Used in CosyVoice and early CosyVoice 2; known to underutilize codebooks (e.g., 23% usage in VQ-4096).
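Finite-Scalar Quantization (FSQ): Introduced in CosyVoice 2. The encoder output H is projected to a low-rank D-dimensional space, each dimension is rounded to an integer in [−K, K], and the result is projected back to the original dimension. The token μₖ for frame k is the index of the quantized vector $\bar{h}_k$ in the (2K+1)-ary system: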

$\mu_k = \sum_{j=0}^{D-1} \bar{h}_{k,j}\,(2K+1)^j$

This yields 100% codebook utilization (e.g., all 6561 codes are used, which matches $(2K+1)^D$ for K = 1 and D = 8), significantly reducing ASR WER and improving semantic efficiency.
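The index computation can be sketched in NumPy. This is a sketch under stated assumptions, not the released tokenizer: the input `h_low` stands for the already down-projected features, `np.clip` is a stand-in for whatever bounding the real model applies, and the shift by K (so that token ids are non-negative) is an added convention not stated in the formula above.

```python
import numpy as np


def fsq_tokens(h_low: np.ndarray, K: int) -> np.ndarray:
    """Map low-rank features of shape (frames, D) to integer token ids.

    Assumptions (illustrative): values are bounded with np.clip, and each
    rounded value is shifted from [-K, K] to [0, 2K] before it is used as a
    base-(2K+1) digit, so ids fall in [0, (2K+1)**D - 1].
    """
    # Bound every dimension to [-K, K] and round to the nearest integer.
    h_bar = np.rint(np.clip(h_low, -K, K)).astype(np.int64)
    digits = h_bar + K                      # shift to {0, ..., 2K}
    base = 2 * K + 1
    weights = base ** np.arange(h_low.shape[1], dtype=np.int64)
    return digits @ weights                 # sum_j digit_j * (2K+1)^j


def codebook_size(D: int, K: int) -> int:
    """Number of distinct tokens for D dimensions with levels -K..K."""
    return (2 * K + 1) ** D


rng = np.random.default_rng(0)
features = rng.normal(size=(5, 8))          # 5 frames, D = 8 (hypothetical)
print(fsq_tokens(features, K=1))
print(codebook_size(D=8, K=1))              # 3**8 = 6561
```

Because every combination of digits is a valid code, no codebook entry is unreachable by construction, which is the intuition behind the full utilization reported above.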

CosyVoice 3 further extends the tokenizer to multi-task supervision: besides ASR, losses for language identification (LID), speech emotion recognition (SER), audio event detection (AED), and speaker analysis (SA) are combined:

$L_{\text{tokenizer}} = \lambda_{\text{ASR}}L_{\text{ASR}} + \lambda_{\text{LID}}L_{\text{LID}} + \lambda_{\text{SER}}L_{\text{SER}} + \lambda_{\text{AED}}L_{\text{AED}} + \lambda_{\text{SA}}L_{\text{SA}}$

This multi-task regimen injects paralinguistic information directly into the tokens, yielding gains in prosody and in-the-wild expressivity (Du et al., 2024, Du et al., 23 May 2025, An et al., 2024).
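As a worked instance of the weighted sum, the snippet below combines per-task losses with per-task weights. All numeric values are hypothetical placeholders; the source does not state the actual weights.

```python
from typing import Mapping


def tokenizer_loss(losses: Mapping[str, float], weights: Mapping[str, float]) -> float:
    """Weighted sum of per-task losses: sum over tasks of lambda_task * L_task."""
    missing = set(weights) - set(losses)
    if missing:
        raise KeyError(f"no loss value for tasks: {sorted(missing)}")
    return sum(weights[task] * losses[task] for task in weights)


# Hypothetical values, chosen only to make the arithmetic easy to follow.
weights = {"ASR": 1.0, "LID": 0.1, "SER": 0.1, "AED": 0.1, "SA": 0.1}
losses = {"ASR": 2.0, "LID": 0.5, "SER": 0.8, "AED": 0.3, "SA": 0.4}

total = tokenizer_loss(losses, weights)
print(round(total, 4))
```

With these placeholder numbers the ASR term dominates (2.0 of the total), which illustrates how the weights control how strongly the auxiliary tasks shape the tokens.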

3. Semantic Decoding: LLMs and Flow Matching

Text-to-Speech LM:

CosyVoice and CosyVoice 2 use an autoregressive LLM to generate speech tokens, conditioned on text, prompt tokens, and/or speaker embedding.
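CosyVoice 2 uses sequence formats that let one model serve both modes: offline (“∘S, T₁…Tₙ, ∘T, μ₁…μ_L, ∘E”, where ∘S, ∘T, and ∘E mark start of sequence, turn of speech, and end of sequence) and streaming (interleaved text and speech tokens with control and filling tokens).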
CosyVoice 3 scales the LLM to 1.5B parameters and adopts a chunk-aware, bi-streaming approach, enabling sub-50 ms E2E latency.
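The two sequence layouts can be sketched as simple list constructions. This is a simplified illustration, not the training code: the marker strings, the helper names, and the block sizes `n_text` and `m_speech` are assumptions, and the exact placement of the filling token in the real system may differ.

```python
from typing import List, Sequence

# Hypothetical marker names standing in for the start, turn, end,
# and filling control tokens.
SOS, TURN, EOS, FILL = "<S>", "<T>", "<E>", "<FILL>"


def offline_sequence(text: Sequence[str], speech: Sequence[str]) -> List[str]:
    """Offline layout: all text tokens, then all speech tokens."""
    return [SOS, *text, TURN, *speech, EOS]


def streaming_sequence(text: Sequence[str], speech: Sequence[str],
                       n_text: int, m_speech: int) -> List[str]:
    """Streaming layout: alternate blocks of n_text text and m_speech speech tokens."""
    seq: List[str] = [SOS]
    t = s = 0
    while t < len(text):
        seq.extend(text[t:t + n_text])
        t += n_text
        if t < len(text):
            # More text is still to come: emit one block of speech tokens,
            # then a filling token that marks the hand-back to text.
            seq.extend(speech[s:s + m_speech])
            s += m_speech
            seq.append(FILL)
    # Text is exhausted: mark the turn and append the remaining speech tokens.
    seq.append(TURN)
    seq.extend(speech[s:])
    seq.append(EOS)
    return seq


text = ["t1", "t2", "t3"]
speech = ["m1", "m2", "m3", "m4"]
print(offline_sequence(text, speech))
print(streaming_sequence(text, speech, n_text=2, m_speech=1))
```

In the streaming layout the model can begin emitting speech tokens after the first text block arrives, which is what makes synthesis possible before the full input text is known.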

Conditional Flow Matching (CFM):

Models the transformation of random noise to target Mel-spectrogram frames along the optimal transport path:

$\varphi_t = (1-t)x_0 + t x_1;\quad \omega_t = x_1 - x_0$

A neural network $\nu_t$ (a U-Net in earlier versions, a DiT backbone in CosyVoice 3) is trained to match the velocity field via an L₁ or L₂ loss, shown here with the L₁ norm:

$L_{FM} = \mathbb{E}_{x_0,x_1,t} \|\omega_t(\varphi_t) - \nu_t(\varphi_t \mid \mu, \hat{I}, v)\|_1$

Conditioning is flexible: upsampled tokens μ, masked Mel-spectrogram $\hat{I}$, speaker embedding v, and time t.
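The path and loss above can be written out in a few lines of NumPy. This is a minimal sketch under assumptions: `velocity_fn` is a hypothetical stand-in for the trained network, the conditioning is passed as an opaque `cond` object, and the L₁ norm is averaged over elements rather than summed.

```python
import numpy as np


def ot_path(x0: np.ndarray, x1: np.ndarray, t):
    """Point on the straight-line path and its (constant) target velocity."""
    phi_t = (1.0 - t) * x0 + t * x1   # interpolate noise x0 -> data x1
    omega_t = x1 - x0                 # velocity the network should predict
    return phi_t, omega_t


def flow_matching_l1(velocity_fn, x0, x1, t, cond) -> float:
    """Mean absolute error between target and predicted velocity."""
    phi_t, omega_t = ot_path(x0, x1, t)
    predicted = velocity_fn(phi_t, t, cond)
    return float(np.mean(np.abs(omega_t - predicted)))


rng = np.random.default_rng(0)
x1 = rng.normal(size=(2, 80, 10))       # hypothetical Mel batch: (batch, bins, frames)
x0 = rng.normal(size=x1.shape)          # Gaussian noise sample
t = rng.uniform(size=(2, 1, 1))         # one time value per batch item


def zero_velocity(phi_t, t, cond):
    """Untrained stand-in that ignores its inputs and predicts zero velocity."""
    return np.zeros_like(phi_t)


print(flow_matching_l1(zero_velocity, x0, x1, t, cond=None))
```

Because the target velocity `x1 - x0` does not depend on t along this path, the network only has to learn a direction per sample pair, which is one reason few sampling steps can suffice at inference.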

Classifier-free guidance and causal/chunked attention masks further refine the generation dynamics and training stability (Du et al., 2024, An et al., 2024, Du et al., 23 May 2025).
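At inference, the learned velocity field is integrated from noise to a Mel-spectrogram. The sketch below uses a fixed-step Euler solver with one common form of classifier-free guidance. The guidance formula, the step count, and the toy velocity function are illustrative assumptions and are not taken from the source.

```python
import numpy as np


def guided_velocity(velocity_fn, x, t, cond, guidance_scale: float):
    """Blend conditional and unconditional predictions (one common CFG form)."""
    v_cond = velocity_fn(x, t, cond)
    v_uncond = velocity_fn(x, t, None)    # conditioning dropped
    return (1.0 + guidance_scale) * v_cond - guidance_scale * v_uncond


def euler_sample(velocity_fn, x0, cond, num_steps: int = 10,
                 guidance_scale: float = 0.0):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for step in range(num_steps):
        t = step * dt                     # t never reaches 1 inside the loop
        x = x + dt * guided_velocity(velocity_fn, x, t, cond, guidance_scale)
    return x


def toy_velocity(x, t, cond):
    """Toy field that points straight at a fixed target (zeros if unconditioned)."""
    target = np.zeros_like(x) if cond is None else cond
    return (target - x) / (1.0 - t)


rng = np.random.default_rng(0)
target = rng.normal(size=(80, 10))        # hypothetical target Mel-spectrogram
noise = rng.normal(size=target.shape)
sample = euler_sample(toy_velocity, noise, cond=target, num_steps=8)
print(np.abs(sample - target).max())
```

With `guidance_scale=0.0` the sampler follows the conditional field alone; larger values push the sample further toward the conditional prediction and away from the unconditional one.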

4. Training Regimes, Scaling, and Post-Training

Data Scaling:

CosyVoice 3 scales training data by two orders of magnitude, from 10k hours (CosyVoice 2) to 1M hours covering 9 languages, 18 Chinese dialects, and diverse domains (e.g., e-commerce, podcasts, singing) (Du et al., 23 May 2025).

Model Scaling:

CosyVoice 3 increases the text-to-speech LLM from 0.5B to 1.5B parameters and the CFM decoder from 100M to 300M parameters.

Post-Training Optimization:

DiffRO introduces differentiable reward optimization. Rather than running RL with full speech rollouts, it optimizes directly over the discrete speech tokens μ: token sampling is relaxed with a Gumbel-Softmax proxy so that gradients from ASR, SER, and MOS-prediction-based reward models can flow back to the LM, with KL regularization against a reference policy:

$\pi_\theta^* = \arg\max_{\pi_\theta}\; \mathbb{E}_{Y\sim \text{data}}\big[R(Y)\big] - \beta\, D_{\mathrm{KL}}\big[\pi_\theta(\mu \mid Y)\,\|\,\pi_{\text{ref}}(\mu \mid Y)\big]$

Relative WER/CER reductions of up to 35%, along with improvements in cross-lingual and emotional synthesis, are reported (Du et al., 23 May 2025).
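The two ingredients of this objective, a Gumbel-Softmax relaxation of token sampling and a KL penalty against a reference policy, can be sketched numerically. This is forward computation only: NumPy carries no gradients, so a real implementation would use an autodiff framework. The function names, the temperature `tau`, the scalar `reward`, and the value of `beta` are all hypothetical.

```python
import numpy as np


def log_softmax(z: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    z = z - z.max(axis=-1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=-1, keepdims=True))


def gumbel_softmax(logits: np.ndarray, tau: float, rng) -> np.ndarray:
    """Soft, near-one-hot sample over the token vocabulary."""
    u = rng.uniform(low=1e-9, high=1.0, size=logits.shape)
    gumbel = -np.log(-np.log(u))          # Gumbel(0, 1) noise
    return np.exp(log_softmax((logits + gumbel) / tau))


def kl_divergence(policy_logits: np.ndarray, ref_logits: np.ndarray) -> float:
    """KL(policy || reference), summed over vocabulary, averaged over positions."""
    log_p = log_softmax(policy_logits)
    log_q = log_softmax(ref_logits)
    return float(np.mean(np.sum(np.exp(log_p) * (log_p - log_q), axis=-1)))


def regularized_objective(reward: float, policy_logits, ref_logits, beta: float) -> float:
    """Reward minus beta times the KL penalty, the quantity to be maximized."""
    return reward - beta * kl_divergence(policy_logits, ref_logits)


rng = np.random.default_rng(0)
ref_logits = rng.normal(size=(4, 6))      # 4 positions, vocabulary of 6 (toy sizes)
policy_logits = ref_logits + 0.5 * rng.normal(size=ref_logits.shape)

soft_tokens = gumbel_softmax(policy_logits, tau=0.5, rng=rng)
print(soft_tokens.sum(axis=-1))           # each row is a distribution
print(regularized_objective(1.0, policy_logits, ref_logits, beta=0.1))
```

Lower `tau` makes the soft samples closer to one-hot tokens, while the KL term keeps the tuned policy from drifting far from the reference model in pursuit of reward.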

5. Applications: Multilingual Zero-Shot Synthesis, Voice Cloning, and Editing

CosyVoice is deployed in diverse use cases:

Multilingual and Cross-Lingual TTS: Supports nine major languages and multiple dialects (CosyVoice 3), robust to noise, accent, and domain variation (Du et al., 23 May 2025).
Zero-Shot and Polyglot Voice Cloning: Given a brief target speaker reference (raw audio or tokens) and optional text, CosyVoice achieves strong speaker similarity, with mean subjective similarity (SS) approaching or surpassing human reference on LibriTTS and SEED-TTS-Eval (Du et al., 2024, Du et al., 23 May 2025).
Instruction-Following and Expressive Control: By augmenting training data with speaker/style/emotion instructions, natural language prompts guide emotion, prosody, and even fine-grained paralinguistic events during synthesis, leading to high accuracy on emotion transfer tasks and notable subjective improvements (Du et al., 2024, Du et al., 23 May 2025).
Editing: CosyEdit and LLaDA-TTS demonstrate end-to-end, alignment-free word- and sentence-level speech editing, using either optimized inference with AR models or parallel masked diffusion with bidirectional attention. Both enable insertion, deletion, and substitution without re-training (Chen et al., 8 Jan 2026, Fan et al., 27 Mar 2026).

6. Quantitative Performance and Benchmarks

CosyVoice models report state-of-the-art or near state-of-the-art results on public TTS benchmarks:

Content Consistency and Speaker Similarity:
- CosyVoice 2 achieves test-zh CER 1.45%, WER_en 2.57%, with average SS ≈ 0.75; CosyVoice 3 improves these figures (CER_zh 0.71%, WER_en 1.45%, similar SS) (Du et al., 23 May 2025).
- LLaDA-TTS (parallel masked diffusion) matches or surpasses AR baselines (CER_zh 0.98%, WER_en 1.96% at 64 steps), at ∼2× LLM-stage speedup (Fan et al., 27 Mar 2026).
- LibriTTS test-clean: CosyVoice reaches WER 2.89% ± 0.18 and SS 0.743; CosyVoice 3 reaches WER 2.02% and SS ≈ 0.780 (Du et al., 23 May 2025, An et al., 2024, Du et al., 2024).
Emotion and Style Transfer: Emotion classification accuracy 0.83–1.00 after instruction tuning; fine-grained paralinguistics controllable via instruction (An et al., 2024).
Syntactic and Pronunciation Inpainting: 100% correction rate on monophonic inpainting, robust handling of polyphonic edge cases (Du et al., 23 May 2025).
Editing: CosyEdit, with 250 h data, matches or outperforms billion-parameter cascaded baselines, supporting one-shot, alignment-free editing (Chen et al., 8 Jan 2026, Fan et al., 27 Mar 2026).
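For readers unfamiliar with the content-consistency metrics, WER and CER are edit distances between a reference transcript and the ASR transcript of the synthesized audio, normalized by reference length. The sketch below is a generic implementation rather than the evaluation code used in the cited papers. It also works out the relative improvement implied by the CosyVoice 2 and CosyVoice 3 figures quoted above.

```python
from typing import Sequence


def edit_distance(ref: Sequence, hyp: Sequence) -> int:
    """Levenshtein distance counting substitutions, insertions, and deletions."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(ref: Sequence, hyp: Sequence) -> float:
    """Edit distance normalized by reference length (WER on words, CER on characters)."""
    if len(ref) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(ref, hyp) / len(ref)


def relative_reduction(old: float, new: float) -> float:
    """Fractional improvement when an error rate drops from old to new."""
    return (old - new) / old


ref = "the cat sat on the mat".split()
hyp = "the cat sit on mat".split()
print(error_rate(ref, hyp))                        # one substitution, one deletion

# Figures quoted in this section: CosyVoice 2 versus CosyVoice 3.
print(round(relative_reduction(1.45, 0.71), 3))    # test-zh CER
print(round(relative_reduction(2.57, 1.45), 3))    # English WER
```

On the quoted figures, the move from CosyVoice 2 to CosyVoice 3 corresponds to roughly a 51% relative CER reduction on test-zh and roughly a 44% relative WER reduction on English.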

7. Broader Impact, Datasets, and Limitations

CosyVoice has catalyzed several ecosystem advances:

Synthetic Data for KWS and ASR: CosyVoice 2 synthesizes SYNTTS-COMMANDS, a multilingual keyword spotting set with ∼400k commands, supporting on-device models (e.g., DS-CNN, CRNN) at near-human accuracy (up to 99.5% English/98% Chinese) (Gan et al., 11 Nov 2025).
Speaker Verification Augmentation: CosyVoice-generated utterances, when fused with bona fide samples, reduce speaker verification EER by 10–16% for 0.5–1 s utterances, which benefits real-world short-utterance scenarios (Zhao et al., 17 Jun 2025).
Open-Source Releases: CosyVoice base, instruct, and SFT variants (∼300M–1.5B params), inference/fine-tuning code, and datasets are public on ModelScope, Huggingface, and GitHub (An et al., 2024, Du et al., 23 May 2025).
Limitations: Despite scaling, some aspects of language coverage (e.g., Japanese–Chinese pronunciation) and paralinguistic control (timbre, singing) remain open problems. Adding synthetic audio duration alone does not substitute for long recordings of real speakers; careful design of token and phoneme diversity is essential (Zhao et al., 17 Jun 2025, Du et al., 2024, Du et al., 23 May 2025).