
CosyVoice2: Zero-Shot Multilingual TTS Model

Updated 6 January 2026
  • CosyVoice2 is a zero-shot multilingual TTS model that integrates advanced supervised semantic tokenization and strong text–speech alignment.
  • The model employs a transformer-based language model and a chunk-aware causal flow matching decoder to achieve high-fidelity streaming and offline synthesis.
  • Empirical evaluations show CosyVoice2 achieving human-parity naturalness and state-of-the-art results across languages, enhancing ASR data augmentation.

CosyVoice2 is a large-scale, zero-shot multilingual text-to-speech (TTS) synthesis model with a unified streaming and offline speech generation architecture. Developed as the successor to CosyVoice, CosyVoice2 incorporates advanced supervised semantic tokenization, a transformer-based LLM backbone, and chunk-aware causal flow matching to support scalable, high-fidelity speech synthesis across multiple languages and speaker identities. Its design specifically targets robustness in scenarios such as code-switching conversational speech, data augmentation for automatic speech recognition (ASR), and low-latency real-time applications.

1. Architectural Foundations and Semantic Tokenization

CosyVoice2 is composed of four tightly integrated modules:

  1. Text Encoder: Utilizes language-agnostic byte-pair encoding (BPE) to convert multilingual input text (Chinese, English, Japanese, Korean) into token sequences that are aligned to the corresponding speech-token timeline.
  2. Supervised Semantic Speech Tokenizer: Built atop a multispeaker ASR encoder (SenseVoice-Large, six Transformer blocks with RoPE), this component applies Finite Scalar Quantization (FSQ) for token discretization. Unlike traditional vector-quantization (VQ) methods that suffer from low codebook utilization (~23%), FSQ achieves full codebook coverage (100%), quantizing encoder activations to integer space and mapping each quantized vector to a unique semantic token at 25 Hz. This ensures that speech tokens encode both fine phonetic and high-level semantic information.
  3. Unified Text–Speech LLM (LM): The backbone is a pre-trained autoregressive transformer (Qwen2.5-0.5B, 0.5B parameters) that sequentially predicts semantic speech tokens from concatenated text and optional prompt tokens. Special markers denote the start ("S"), turn-to-speech ("T"), end ("E"), and streaming "fill" tokens. The model supports both streaming and non-streaming synthesis within a unified codebase.
  4. Chunk-Aware Causal Flow Matching Decoder + Vocoder: Acoustic synthesis is performed by a chunk-aware causal flow matching (CFM) network (a U-Net with causal/self-attention and look-ahead convolution blocks). The system solves an optimal transport path from Gaussian noise to empirical Mel spectrograms and is trained to minimize an $L_1$ flow-matching loss. A lightweight HiFi-GAN vocoder converts these Mel features to the waveform.

Pretraining uses large-scale aligned multilingual speech: roughly 200,000 hours for the tokenizer (about 110,000 h Chinese and 100,000 h English) and about 166,000 h for the overall TTS stack, providing strong generalization and enabling fluent intra-sentence code-switching, rich prosodic variation, and natural speaker identity consistency (Du et al., 2024).
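As a concrete illustration of this data flow, the minimal Python sketch below strings the four stages together with stub modules; the class names, tensor shapes, and the synthesize helper are illustrative assumptions rather than the official CosyVoice2 interface.

```python
# Illustrative sketch of the CosyVoice2-style pipeline described above.
# All classes are stand-in stubs with made-up shapes, not the real modules.
import torch
import torch.nn as nn

class StubTextSpeechLM(nn.Module):
    """Stands in for the Qwen2.5-0.5B text-speech LM (autoregressive)."""
    def generate(self, tokens: torch.Tensor) -> torch.Tensor:
        # Predict 25 Hz semantic speech tokens from text (+ prompt) tokens.
        return torch.randint(0, 6561, (tokens.shape[0], 50))

class StubFlowMatchingDecoder(nn.Module):
    """Stands in for the chunk-aware causal flow matching decoder."""
    def forward(self, speech_tokens, spk_emb):
        # 25 Hz tokens -> 50 Hz Mel frames (80 Mel bins assumed).
        return torch.randn(speech_tokens.shape[0], 80, speech_tokens.shape[1] * 2)

class StubVocoder(nn.Module):
    """Stands in for the lightweight HiFi-GAN vocoder."""
    def forward(self, mel):
        return torch.randn(mel.shape[0], mel.shape[-1] * 256)   # Mel -> waveform

def synthesize(text_tokens, prompt_speech_tokens, spk_emb, lm, fm, vocoder):
    speech_tokens = lm.generate(torch.cat([text_tokens, prompt_speech_tokens], dim=-1))
    mel = fm(speech_tokens, spk_emb)
    return vocoder(mel)

wave = synthesize(torch.randint(0, 1000, (1, 20)),     # BPE text tokens
                  torch.randint(0, 6561, (1, 30)),     # prompt speech tokens
                  torch.randn(1, 192),                 # speaker embedding (x-vector-like)
                  StubTextSpeechLM(), StubFlowMatchingDecoder(), StubVocoder())
```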

2. Streaming and Offline Synthesis: Chunk-Aware Flow Matching

The chunk-aware CFM model underlies streaming and non-streaming capabilities—a critical optimization for interactive AI applications. The flow-matching process follows the deterministic optimal transport trajectory:

\phi^{OT}_t(X_0, X_1) = (1-t) X_0 + t X_1, \qquad \omega_t(\phi^{OT}_t) = X_1 - X_0

Training minimizes:

\theta^* = \arg\min_\theta \mathbb{E}_{X_0, X_1, t} \Bigl\| \omega_t(\phi_t^{OT}) - \nu_t(\phi_t^{OT} \mid \theta; \mu, \tilde X_1, \mathbf{v}) \Bigr\|_1

where $\nu_t$ is the learned U-Net flow field, $\mu$ denotes the semantic tokens, $\tilde X_1$ the masked Mel frames, and $\mathbf{v}$ the speaker embedding.
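For concreteness, the following PyTorch sketch implements one training step of this objective; flow_net is a placeholder for the causal U-Net, and its argument order is an assumption.

```python
# A minimal flow-matching training step for the objective above.
import torch
import torch.nn.functional as F

def cfm_training_step(flow_net, mel_target, semantic_tokens, masked_mel, spk_emb):
    """mel_target X1: (B, 80, F); masked_mel is the partially masked Mel prompt."""
    x0 = torch.randn_like(mel_target)                   # X0 ~ N(0, I)
    t = torch.rand(mel_target.shape[0], 1, 1,
                   device=mel_target.device)            # t ~ U(0, 1)

    phi_t = (1.0 - t) * x0 + t * mel_target             # OT path phi_t^OT(X0, X1)
    target_velocity = mel_target - x0                   # omega_t = X1 - X0

    # nu_t(phi_t | theta; mu, X~1, v): learned flow field conditioned on
    # semantic tokens, masked Mel frames, and the speaker embedding.
    pred_velocity = flow_net(phi_t, t, semantic_tokens, masked_mel, spk_emb)
    return F.l1_loss(pred_velocity, target_velocity)    # L1 flow-matching loss
```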

To support streaming, attention masks are randomly selected during training to cover non-causal (offline), full-causal (only past frames), chunk-$M$ (past plus $M$ future frames), and chunk-$2M$ scenarios. At inference, the mask is chosen to match the latency constraints. The self-distillation effect of these hybrid context masks stabilizes performance under varying conditions. First-package latency on an A100 GPU (typical chunk size $M=5$) is approximately 45 ms, partitioned among the LM (1.2 ms/token), the flow-matching decoder (5.8 ms/Mel-chunk), and the vocoder (2.0 ms/chunk) (Du et al., 2024).
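The training-time masks can be sketched as simple look-ahead masks following the "past plus M future frames" description above; the exact chunking logic in CosyVoice2 may differ.

```python
# Illustrative construction of the four attention-mask regimes.
import random
import torch

def attention_mask(num_frames: int, lookahead) -> torch.Tensor:
    """Boolean mask: True where query frame i may attend to key frame j."""
    if lookahead is None:                                # non-causal (offline)
        return torch.ones(num_frames, num_frames, dtype=torch.bool)
    i = torch.arange(num_frames).unsqueeze(1)
    j = torch.arange(num_frames).unsqueeze(0)
    return j <= i + lookahead                            # all past + limited future

M = 5  # chunk size used in the latency figures above
mask_choices = {
    "non-causal": None,    # offline synthesis, full context
    "full-causal": 0,      # only past frames
    "chunk-M": M,          # past plus M future frames
    "chunk-2M": 2 * M,     # past plus 2M future frames
}

# One regime is sampled per training batch so a single model serves both
# streaming and offline inference; at inference the mask is fixed by latency.
name, lookahead = random.choice(list(mask_choices.items()))
mask = attention_mask(100, lookahead)
```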

3. Supervised Tokenization and Alignment Stability

CosyVoice2 employs supervised speech tokens to enforce strong text–speech alignment and robust semantic encoding. Tokenization proceeds by projecting continuous encoder activations into low-dimensional integer space, rounding each value, and reconstructing with a projection head:

z = \mathrm{Proj}_{\downarrow}(H), \qquad \bar{z}_{i,j} = \mathrm{ROUND}(z_{i,j}), \qquad \mu_i = \sum_{j=0}^{D-1} \bar{z}_{i,j}\,(2K+1)^j

FSQ achieves 100% codebook utilization. Empirical ASR results demonstrate lower word error rates (WER) using FSQ tokens versus VQ (e.g., 10.67% vs. 18.26% on CV-EN), as shown in the following table:

| Method | Codebook Size | Utilization | WER@CV-EN | WER@CV-CN |
|--------|---------------|-------------|-----------|-----------|
| VQ     | 4,096         | 23%         | 18.26%    | 11.56%    |
| FSQ    | 6,561         | 100%        | 10.67%    | 7.29%     |

This quantization paradigm, coupled with ASR loss regularization, preserves content–speaker consistency in zero-shot and code-switching scenarios (Du et al., 2024).
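The FSQ index computation can be sketched as below, assuming $D = 8$ dimensions and $K = 1$ so that $(2K+1)^D = 6{,}561$ matches the codebook size in the table; the tanh bounding, digit shift, and projection dimensions are illustrative assumptions.

```python
# Illustrative FSQ tokenization following mu_i = sum_j z_bar_{i,j} (2K+1)^j.
import torch
import torch.nn as nn

D, K = 8, 1                            # (2K+1)^D = 3^8 = 6561 codebook entries
proj_down = nn.Linear(512, D)          # Proj_down: encoder dim -> D (dims assumed)
proj_up = nn.Linear(D, 512)            # projection head for reconstruction

def fsq_tokenize(H: torch.Tensor):
    """H: (T, 512) encoder activations -> (T,) integer semantic tokens."""
    z = torch.tanh(proj_down(H)) * K               # bound each dim to [-K, K]
    z_bar = torch.round(z)                         # quantize (no straight-through here)
    digits = (z_bar + K).long()                    # shift to {0, ..., 2K}
    weights = (2 * K + 1) ** torch.arange(D)       # (2K+1)^j
    mu = (digits * weights).sum(dim=-1)            # unique token index in [0, 6561)
    return mu, proj_up(z_bar)                      # tokens + reconstructed features

tokens, recon = fsq_tokenize(torch.randn(25, 512)) # 25 frames ~ 1 s of 25 Hz tokens
```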

Stability hallucinations (repetitions and omissions) present in LLM-based decoder-only TTS systems are mitigated in CosyVoice2 by integrating the Optimal Alignment Score (OAS) loss and chain-of-thought (CoT) attention guidance (Wang et al., 24 Sep 2025). OAS uses Viterbi dynamic programming to enforce monotonic text–speech token alignments:

\mathrm{OAS} = \frac{\sum_{i=1}^{L_s} A_{i,P_i}}{\sum_{i=1}^{L_s}\sum_{j=1}^{L_t} A_{i,j}}

where $A$ is the attention matrix and $P_i$ traces the aligned text position for each speech token. This score regularizes designated heads within the transformer, dramatically lowering WER in challenging contexts.
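A minimal sketch of this score: a Viterbi-style dynamic program finds the best monotonic (non-decreasing) text position for each speech token, and OAS is the attention mass on that path divided by the total mass. Tie-breaking and padding/masking details are assumptions.

```python
# Optimal Alignment Score over a speech-to-text attention matrix.
import torch

def optimal_alignment_score(A: torch.Tensor) -> torch.Tensor:
    """A: (L_s, L_t) attention from speech-token queries to text-token keys."""
    L_s, L_t = A.shape
    score = torch.empty(L_s, L_t)
    back = torch.zeros(L_s, L_t, dtype=torch.long)

    score[0] = A[0]
    for i in range(1, L_s):
        # Best monotonic predecessor for column j is the running max over k <= j.
        prev_best, prev_idx = torch.cummax(score[i - 1], dim=0)
        score[i] = A[i] + prev_best
        back[i] = prev_idx

    # Backtrack the monotonic path P_i.
    path = torch.zeros(L_s, dtype=torch.long)
    path[-1] = torch.argmax(score[-1])
    for i in range(L_s - 1, 0, -1):
        path[i - 1] = back[i, path[i]]

    on_path = A[torch.arange(L_s), path].sum()     # sum_i A_{i, P_i}
    return on_path / A.sum()                       # normalize by total attention mass

A = torch.softmax(torch.randn(40, 12), dim=-1)     # toy attention matrix
print(float(optimal_alignment_score(A)))
```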

4. Training, Fine-Tuning, and Data Augmentation for Code-Switching ASR

CosyVoice2 can be fine-tuned for domain-adaptive data augmentation, including code-switching speech. Typically, only the LLM backbone (QwenLM, 0.5B parameters) is updated, with the tokenizer, flow-matching decoder, and vocoder frozen to retain acoustic fidelity. The fine-tuning regimen includes Adam optimization (initial learning rate $1\times 10^{-4}$), a 10,000-step linear warm-up, and 200 epochs over the target corpus (e.g., SEAME: 100 h, 156 speakers). The training objective combines autoregressive cross-entropy over semantic tokens with a fixed flow-matching spectrogram reconstruction loss:

\mathcal{L}_\mathrm{TTS} = -\sum_{t}\log p_\theta(s_t \mid s_{<t}, x) + \lambda\, \mathbb{E}\bigl\lVert F_\phi(z) - \mathrm{Mel}(x)\bigr\rVert^2

where $\lambda$ balances the two objectives.
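A hedged PyTorch sketch of this combined objective, with placeholder shapes for the trainable LM logits and the frozen flow-matching decoder output:

```python
# Combined fine-tuning loss: token cross-entropy + lambda * Mel reconstruction.
import torch
import torch.nn.functional as F

def tts_finetune_loss(lm_logits: torch.Tensor,           # (B, T, V) next-token logits
                      target_tokens: torch.Tensor,        # (B, T) semantic speech tokens
                      frozen_decoder_mel: torch.Tensor,   # (B, 80, F) F_phi(z), frozen
                      mel_target: torch.Tensor,           # (B, 80, F) Mel(x)
                      lam: float = 1.0) -> torch.Tensor:
    ce = F.cross_entropy(lm_logits.transpose(1, 2), target_tokens)  # -sum log p(s_t | s_<t, x)
    recon = F.mse_loss(frozen_decoder_mel, mel_target)              # ||F_phi(z) - Mel(x)||^2
    return ce + lam * recon
```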

Synthetic code-switching speech is generated by resynthesizing each utterance with multiple sampled x-vector speaker embeddings, producing a parallel corpus with expanded speaker, pitch, and rate variation. This data is mixed (ideally 1:1) with real speech for ASR model training, yielding substantial reductions in mixed error rate (MER):

\mathrm{MER} = \frac{S + D + I}{N}

where $S$, $D$, and $I$ are substitutions, deletions, and insertions, and $N$ is the reference token count.
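A self-contained sketch of this metric via a standard Levenshtein alignment; the mixed Mandarin-character / English-word tokenization in the example is illustrative.

```python
# Mixed error rate: (substitutions + deletions + insertions) / reference length.
def mixed_error_rate(reference: list, hypothesis: list) -> float:
    R, H = len(reference), len(hypothesis)
    # dp[i][j] = minimum edit operations between reference[:i] and hypothesis[:j]
    dp = [[0] * (H + 1) for _ in range(R + 1)]
    for i in range(R + 1):
        dp[i][0] = i                                  # deletions
    for j in range(H + 1):
        dp[0][j] = j                                  # insertions
    for i in range(1, R + 1):
        for j in range(1, H + 1):
            sub = dp[i - 1][j - 1] + (reference[i - 1] != hypothesis[j - 1])
            dp[i][j] = min(sub, dp[i - 1][j] + 1, dp[i][j - 1] + 1)
    return dp[R][H] / R                               # (S + D + I) / N

ref = ["我", "想", "去", "library", "看", "书"]
hyp = ["我", "想", "去", "library", "看"]
print(f"MER = {mixed_error_rate(ref, hyp):.3f}")      # one deletion -> 1/6 ≈ 0.167
```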

Empirically, adding TTS-generated data with random speaker embeddings (TTS-R) achieves the largest MER reduction (DevMan 10.1%, DevSGE 16.0%) compared to baseline or original embedding augmentation, emphasizing the importance of synthetic speaker diversity (Yeo et al., 2 Jan 2026). However, pure synthetic data underperforms real data, indicating that TTS-augmented datasets should complement, not replace, natural speech.

5. Evaluation Metrics, Benchmarks, and Empirical Results

CosyVoice2 and its streaming variant (CosyVoice2-S) are evaluated on Chinese (zh), English (en), and challenging "hard" test sets; the table reports character/word error rate (CER/WER, %) and speaker similarity (SS):

| Model         | zh CER | zh SS | en WER | en SS | hard WER | hard SS |
|---------------|--------|-------|--------|-------|----------|---------|
| CosyVoice     | 3.63   | 0.775 | 4.29   | 0.699 | 11.75    | 0.755   |
| CosyVoice 2   | 1.45   | 0.806 | 2.57   | 0.736 | 6.83     | 0.776   |
| CosyVoice 2-S | 1.45   | 0.812 | 2.38   | 0.743 | 8.08     | 0.785   |

CosyVoice2 achieves human-parity on Chinese and "hard" cases (challenging test sets) and state-of-the-art results among open-source TTS systems. Streaming synthesis is virtually lossless, with negligible impact on naturalness and similarity metrics. Similar benchmarks are reported for Japanese and Korean.

In augmentation experiments, adding synthetic CosyVoice2 speech data provides greater ASR gains than traditional speed perturbation, highlighting the effectiveness of increased speaker and prosody diversity over temporal modifications. Doubling synthetic data (from 100 to 200 h) produces the largest relative MER decrease, with diminishing returns thereafter (Yeo et al., 2 Jan 2026).

6. Alignment Optimization and Stability Hallucination Mitigation

CosyVoice2's decoder-only transformer architecture risks instability because it lacks explicit duration or alignment mechanisms. Recent work introduces attention-based regularization to counter hallucination phenomena:

  • Optimal Alignment Score (OAS) loss encourages transformer heads to focus on monotonic, left-to-right paths between text and speech tokens. Heads with the highest OAS (typically in layers 8–9) are selected for explicit alignment regularization.
  • Chain-of-Thought (CoT) Attention Guidance: Student models receive teacher-derived sparse token "breadcrumb" sequences and normalized progress-bar supervision during training. This compels the network to maintain continuous advancement through the input text and reduces the likelihood of repetitions or omissions.

Ablation studies indicate that sparse attention guidance with progress-bar signals achieves the greatest reduction in WER without penalizing speaker similarity or naturalness. These optimizations impose modest computational overhead, and the alignment-guidance techniques are integral to the stability of current CosyVoice2 deployments (Wang et al., 24 Sep 2025).
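One plausible, simplified reading of the progress-bar supervision is sketched below: the teacher-derived monotonic path gives each speech token a normalized position in the text, and the student's attention-weighted expected position is regressed toward it. The exact formulation in (Wang et al., 24 Sep 2025) may differ, and the interfaces are assumptions.

```python
# Simplified progress-bar guidance on a selected attention head (illustrative).
import torch
import torch.nn.functional as F

def progress_bar_loss(student_attention: torch.Tensor,   # (L_s, L_t), rows sum to 1
                      teacher_path: torch.Tensor          # (L_s,) text index per speech token
                      ) -> torch.Tensor:
    L_s, L_t = student_attention.shape
    positions = torch.arange(L_t, dtype=torch.float32) / max(L_t - 1, 1)
    student_progress = student_attention @ positions       # expected progress in [0, 1]
    teacher_progress = teacher_path.float() / max(L_t - 1, 1)
    return F.mse_loss(student_progress, teacher_progress)
```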

7. Limitations, Recommendations, and Future Directions

Limitations include reliance on large proprietary corpora for pretraining, runtime latency dominated by transformer autoregression and flow-matching ODE solves, and current alignment solutions tailored to Mandarin and decoder-only architectures. Cross-lingual generalization and adaptation to other modalities (multi-speaker, singing) remain open research directions.

Recommendations for practitioners are as follows:

  1. Fine-tune CosyVoice2 on 50–100 h of in-domain code-switching or target speech for optimal prosodic and code-switch pattern modeling.
  2. When generating synthetic datasets, sample a large pool of distinct speaker embeddings (≥2× real data hours) to maximize diversity.
  3. Mix real and synthetic speech (ideally 1:1) for robust ASR fine-tuning (see the sketch after this list).
  4. Use standard augmentations (SpecAugment, speed perturbation) alongside TTS-generated data.
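A minimal sketch of recommendations 2 and 3 above: resynthesize each utterance with randomly sampled speaker embeddings and mix real and synthetic entries 1:1 for ASR training. The synthesize_with_speaker callable and the manifest format are hypothetical placeholders.

```python
# Illustrative 1:1 real/synthetic data mixing with random speaker embeddings.
import random

def build_augmented_manifest(real_utts, speaker_pool, synthesize_with_speaker):
    """real_utts: list of dicts with 'text' and 'audio'; speaker_pool: x-vectors."""
    synthetic = []
    for utt in real_utts:
        spk = random.choice(speaker_pool)            # random speaker per utterance (TTS-R)
        synthetic.append({"text": utt["text"],
                          "audio": synthesize_with_speaker(utt["text"], spk)})
    mixed = real_utts + synthetic[: len(real_utts)]  # 1:1 real-to-synthetic ratio
    random.shuffle(mixed)
    return mixed
```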

Continued advances in multi-codebook semantic quantization, non-autoregressive LM decoding, joint training protocols, and real-time streaming architectures are anticipated to further enhance CosyVoice2's coverage of languages, speaker/stylistic contrasts, and computational efficiency.


CosyVoice2 establishes a new benchmark in zero-shot, streaming, multilingual TTS synthesis, achieving human-parity naturalness, robust code-switching capabilities, and rapid, stable speech generation via sophisticated alignment-guidance mechanisms and large-scale supervised tokenization (Du et al., 2024, Yeo et al., 2 Jan 2026, Wang et al., 24 Sep 2025).
