CosyVoice 2: Multilingual Low-Latency TTS
- CosyVoice 2 is a multilingual text-to-speech system that uses finite-scalar quantization and chunk-aware causal flow matching to enable scalable, low-latency synthesis.
- The system follows a four-stage pipeline—from a supervised semantic speech tokenizer to a unified LM, flow matching module, and HiFi-GAN vocoder—to generate natural audio.
- It achieves human-parity naturalness with robust speaker similarity by training on hundreds of thousands of hours of diverse multilingual audio.
CosyVoice 2 is a multilingual speech synthesis system designed for scalable, low-latency, streaming text-to-speech (TTS), leveraging LLMs and generative flow modeling. It introduces architectural and algorithmic advances, including finite-scalar quantization and chunk-aware causal flow matching, to support both streaming and non-streaming synthesis with minimal latency and virtually no quality degradation relative to offline operation. The model is trained on hundreds of thousands of hours of diverse multilingual audio and achieves human-parity naturalness, robust speaker similarity, and high content fidelity across languages in both interactive and batch-processing scenarios.
1. System Architecture
CosyVoice 2 comprises a four-stage pipeline, systematically optimized for streaming deployment:
Stage 1: Supervised Semantic Speech Tokenizer
- Inputs: speech features, supervised by BPE-tokenized multilingual text transcripts through the ASR decoder.
- Encoder₁: 6 Transformer blocks with rotary embeddings at 25 Hz.
- Finite-Scalar Quantization (FSQ): Projects intermediate features to a low-rank space and quantizes each scalar to an integer in $[-K, K]$, achieving 100% codebook utilization.
- Encoder₂ + ASR Decoder (SenseVoice-Large): Optimizes posterior text-token loss.
Stage 2: Unified Text-Speech LLM (LM)
- Backbone: Pre-trained Qwen2.5-0.5B, decoder-only, with text encoder & speaker embedding modules removed from the original CosyVoice.
- The LM is trained to autoregressively predict supervised semantic speech tokens.
- Supports two sequence-construction schemes (non-streaming: sequential; streaming: interleaved chunks of text and speech tokens), as sketched below.
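A minimal sketch of the two sequence-construction schemes, assuming illustrative special-token names ([S], [T], [E], FILL, as used elsewhere in this summary) and pre-tokenized inputs; the helpers below are a plausible construction, not the repository's actual data-pipeline code.

```python
SOS, TOS, EOS, FILL = "[S]", "[T]", "[E]", "[FILL]"  # illustrative special tokens

def build_nonstreaming_sequence(text_tokens, speech_tokens):
    # [S] text ... [T] speech ... [E]; the LM loss is computed on the speech tokens and [E].
    return [SOS, *text_tokens, TOS, *speech_tokens, EOS]

def build_streaming_sequence(text_tokens, speech_tokens, n, m):
    # Interleave chunks of n text tokens with chunks of m speech tokens; once the
    # text is exhausted, append [T], the remaining speech tokens, and [E].
    seq, ti, si = [SOS], 0, 0
    while ti < len(text_tokens):
        seq.extend(text_tokens[ti:ti + n]); ti += n
        seq.extend(speech_tokens[si:si + m]); si += m
    seq.append(TOS)
    seq.extend(speech_tokens[si:])
    seq.append(EOS)
    return seq
```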
Stage 3: Chunk-Aware Causal Flow Matching (CFM)
- Upsamples semantic tokens to the 50 Hz Mel-spectrogram frame rate.
- Adds look-ahead convolution and passes through N stacked causal Transformer-UNet blocks.
- The model is conditioned on the speaker embedding $\mathbf{v}$, the upsampled token sequence $\mu$, a potentially masked reference Mel spectrogram $\tilde{X}_1$, and the time step $t$.
- Learns an OT-flow ODE with path $\phi_t(X_0, X_1) = (1-t)X_0 + tX_1$, $X_0 \sim N(0, I)$, and target velocity $\omega_t = X_1 - X_0$, minimizing $\mathbb{E}_{X_0, X_1, t}\,\big\|\nu_t\big(\phi_t(X_0, X_1) \mid \mathbf{v}, \mu, \tilde{X}_1\big) - \omega_t\big\|_1$.
- Inference uses cosine time rescaling ($t := 1 - \cos\tfrac{\pi t}{2}$), 10 NFE, and classifier-free guidance.
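To make the flow-matching objective above concrete, here is a minimal PyTorch-style sketch of one training step under the stated assumptions (OT path $(1-t)X_0 + tX_1$, target velocity $X_1 - X_0$, and an L1 regression loss); `cfm_unet` stands in for the causal Transformer-UNet and its signature is hypothetical.

```python
import torch
import torch.nn.functional as F

def cfm_training_step(cfm_unet, mel_target, cond):
    """One conditional flow-matching step on a batch of Mel spectrograms.

    mel_target: (B, T, n_mels) ground-truth Mel X_1
    cond: conditioning dict (speaker embedding, upsampled tokens, masked Mel)
    """
    x1 = mel_target
    x0 = torch.randn_like(x1)                    # X_0 ~ N(0, I)
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)
    xt = (1.0 - t) * x0 + t * x1                 # OT path phi_t(X_0, X_1)
    target = x1 - x0                             # target velocity omega_t
    pred = cfm_unet(xt, t.view(-1), **cond)      # predicted velocity nu_t
    return F.l1_loss(pred, target)
```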
Stage 4: HiFi-GAN Vocoder
- Converts Mel spectrograms to waveforms.
Data Flow Overview:
```
text → BPE → [S] + text tokens + [T] → LM → μ (speech tokens) → upsample + look-ahead conv + causal Transformer blocks → CFM → Mel → vocoder → waveform
```
Inference Pseudocode:
Non-streaming LM
```
# Non-streaming decoding: condition on [S] + text + [T], then sample
# speech tokens autoregressively until the end-of-sequence token [E].
tokens = [S] + text_tokens + [T]
while True:
    next_token = sample(QwenLM(tokens))
    if next_token == [E]:
        break
    tokens.append(next_token)
μ = tokens after [T] and before [E]   # the semantic speech tokens
```
Streaming LM

```
# Streaming decoding: text arrives in chunks of N tokens; speech tokens are
# emitted in chunks of M. FILL asks for the next text chunk; [T] marks the
# end of the text; [E] ends the utterance.
prompt = [S] + first N text_tokens
emitted = []
while True:
    out = sample(QwenLM(prompt))
    if out == FILL_TOKEN:
        prompt.extend(next N text_tokens)
    elif out == [T]:
        prompt.append([T])
    elif out == [E]:
        break                      # any trailing partial chunk is flushed afterwards
    else:
        prompt.append(out)
        emitted.append(out)
        if len(emitted) % M == 0:
            flush(emitted[-M:])    # send a chunk of M speech tokens downstream
```
2. Finite-Scalar Quantization (FSQ)
FSQ addresses codebook utilization and quantization efficiency in the speech tokenizer:
- For intermediate activations $H$ of Encoder₁:
- Project to reduced dimension: $\bar{H} = \mathrm{Proj}_{\mathrm{down}}(H)$
- Quantize each scalar: $\bar{h}^{q}_{i,j} = \mathrm{ROUND}(\bar{h}_{i,j}) \in \{-K, \dots, K\}$
- Reproject: $\hat{H} = \mathrm{Proj}_{\mathrm{up}}(\bar{H}^{q})$
- Index codebook: $\mu_i = \sum_{j=0}^{D-1} \bar{h}^{q}_{i,j}\,(2K+1)^j$
- Training uses a straight-through estimator for the ROUND operation, with the cross-entropy loss on the ASR decoder's posterior as the supervision signal.
- FSQ codebook: $(2K+1)^D$ entries with 100% utilization, compared to 23% utilization for the VQ codebook of the original CosyVoice (Table 4.1).
- Per-scalar quantization error is bounded by $1/2$ after rounding, so the per-token error grows at most linearly in the low-rank dimension $D$.
This quantization strategy underpins maximal efficiency in representing supervised speech tokens, facilitating low-latency and compact autoregressive modeling within the LM.
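A minimal sketch of the FSQ forward pass under the definitions above, written in PyTorch; the clamp-then-round formulation and the module names (`proj_down`, `proj_up`) are illustrative choices, not the repository's exact implementation.

```python
import torch
import torch.nn as nn

class FSQ(nn.Module):
    """Finite-scalar quantization: project down, round each scalar into [-K, K],
    project back up, and read the rounded vector as a base-(2K+1) token index."""

    def __init__(self, d_model, d_low, K):
        super().__init__()
        self.K = K
        self.proj_down = nn.Linear(d_model, d_low)
        self.proj_up = nn.Linear(d_low, d_model)

    def forward(self, h):                              # h: (B, T, d_model)
        z = torch.clamp(self.proj_down(h), -self.K, self.K)
        codes = torch.round(z)                         # integers in [-K, K]
        z_q = z + (codes - z).detach()                 # straight-through estimator
        h_hat = self.proj_up(z_q)
        # Token index in [0, (2K+1)^D): shift codes to [0, 2K] (an illustrative
        # choice to keep indices non-negative) and treat them as base-(2K+1) digits.
        base = (2 * self.K + 1) ** torch.arange(codes.size(-1), device=h.device)
        idx = ((codes + self.K).long() * base).sum(dim=-1)
        return h_hat, idx
```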
3. Chunk-Aware Causal Flow Matching
CFM is central to high-quality, low-latency Mel synthesis and enables both streaming and offline modes:
- Generative path: $\phi_t(X_0, X_1) = (1-t)X_0 + tX_1$ with $X_0 \sim N(0, I)$, $X_1$ the target Mel spectrogram, and target velocity $\omega_t = X_1 - X_0$.
- Loss: $\mathcal{L}_{\mathrm{CFM}} = \mathbb{E}_{X_0, X_1, t}\,\big\|\nu_t\big(\phi_t(X_0, X_1) \mid \mathbf{v}, \mu, \tilde{X}_1\big) - \omega_t\big\|_1$
- Inference ODE: $\frac{dX_t}{dt} = \nu_t(X_t \mid \mathbf{v}, \mu, \tilde{X}_1)$, integrated from $t = 0$ to $t = 1$ in a fixed number of steps (typically 10 NFE).
- During training, attention masks for the Transformer-UNet are sampled among:
- non-causal (full),
- full-causal (only past),
- chunk-$M$ (attend to the past plus $M$ future frames),
- chunk-$2M$ (past plus $2M$ future frames, emulating high-quality offline inference).
- This enables a single model to be deployed in streaming (restricted look-ahead for low latency) or offline (full look-ahead for quality) contexts.
Chunk-aware masking strategies are pivotal for balancing real-time requirements with synthesis accuracy, as measured by objective and subjective quality scores.
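As a concrete illustration of the four masking patterns above, here is a minimal sketch that builds frame-level attention masks; the block-wise reading of chunk-$M$ / chunk-$2M$ (each frame sees all past frames plus the frames up to the end of its own, or the next, chunk) is one plausible interpretation of the chunk-aware scheme.

```python
import torch

def chunk_aware_mask(seq_len, mode, chunk_size=None):
    """Boolean (seq_len, seq_len) attention mask; True = key may be attended to.

    mode: "non_causal" (full context), "full_causal" (past only),
          "chunk_M" (past + current chunk), "chunk_2M" (past + one extra chunk).
    """
    i = torch.arange(seq_len).unsqueeze(1)     # query positions, column vector
    j = torch.arange(seq_len).unsqueeze(0)     # key positions, row vector
    if mode == "non_causal":
        return torch.ones(seq_len, seq_len, dtype=torch.bool)
    if mode == "full_causal":
        return j <= i
    chunks_visible = {"chunk_M": 1, "chunk_2M": 2}[mode]
    # Keys are visible up to the end of the query's chunk (plus one more chunk for chunk-2M).
    visible_until = (i // chunk_size + chunks_visible) * chunk_size - 1
    return j <= visible_until
```

During training one of the four modes would be sampled per example, so a single model covers both the streaming (restricted look-ahead) and offline (full-context) deployments described above.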
4. Streaming and Non-Streaming Synthesis
CosyVoice 2 provides explicit modes:
- Non-Streaming: The sequence is $[S]$, text tokens, $[T]$, speech tokens, $[E]$, with the cross-entropy loss computed on the speech tokens and $[E]$.
- Streaming: Interleaves chunks of $N$ text tokens with chunks of $M$ speech tokens. The LM predicts a special FILL token to request the next text chunk, and $[T]$ once the text is exhausted.
- Latency Formulation:
- First-package latency: $L_{\mathrm{TTS}} = M \cdot d_{\mathrm{lm}} + M \cdot d_{\mathrm{fm}} + M \cdot d_{\mathrm{voc}}$, where $d_{\mathrm{lm}}$, $d_{\mathrm{fm}}$, and $d_{\mathrm{voc}}$ are the per-token costs of the LM, flow-matching module, and vocoder.
- For chat: $L_{\mathrm{Chat}} \leq N \cdot d_{\mathrm{llm}} + L_{\mathrm{TTS}}$, where $d_{\mathrm{llm}}$ is the per-token cost of the upstream text LLM.
- Typical settings fix the text-to-speech token ratio $N{:}M$ used for every interleaved chunk.
- Quality-Latency Trade-Off:
- Decreasing $M$ yields lower first-package latency but increases computational overhead per second of synthesized audio.
- The choice of chunk mask determines the trade-off between latency and Mel-spectrogram fidelity (see Fig. 3 for empirical trade-offs).
The streaming regime maintains near-lossless quality, with only a marginal WER gap between streaming and offline synthesis; a worked latency example follows.
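A small worked example of the first-package latency bound above; the per-token costs used here are placeholder numbers for illustration, not measurements from the paper.

```python
def first_package_latency(M, d_lm, d_fm, d_voc):
    """L_TTS = M*d_lm + M*d_fm + M*d_voc: generate the first M speech tokens,
    run flow matching on them, and vocode the first Mel chunk."""
    return M * (d_lm + d_fm + d_voc)

def chat_latency(N, d_llm, M, d_lm, d_fm, d_voc):
    """L_Chat <= N*d_llm + L_TTS: wait for the first N text tokens from the
    upstream chat LLM, then pay the TTS first-package latency."""
    return N * d_llm + first_package_latency(M, d_lm, d_fm, d_voc)

# Placeholder per-token costs (seconds), purely illustrative:
print(first_package_latency(M=15, d_lm=0.002, d_fm=0.001, d_voc=0.0005))           # ≈ 0.0525 s
print(chat_latency(N=5, d_llm=0.01, M=15, d_lm=0.002, d_fm=0.001, d_voc=0.0005))   # ≈ 0.1025 s
```

Halving $M$ halves $L_{\mathrm{TTS}}$ in this model, which is exactly the quality-latency knob discussed above.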
5. Training Methodology and Evaluation
Extensive multilingual and instructional tuning datasets underpin CosyVoice 2:
- Training Volumes:
- Speech tokenizer (FSQ): 200k hours ASR (110.9k h Chinese, 99.9k h English)
- LM + CFM: ~166.8k hours multi-speaker (130k h Chinese, 30k h English, 4.6k h Japanese, 2.2k h Korean)
- Instruction-tuning: +1,500 h with prompt targets and fine-grained labels (e.g., [laughter]).
- Evaluation Metrics:
- Content: WER/CER via Whisper-Large V3 (English), Paraformer (Chinese)
- Speaker similarity: Cosine similarity of ERes2Net embeddings (see the sketch at the end of this section)
- Speech quality: NMOS (objective), MOS/MOS-I (subjective)
- Latency: First-package time, RTF
- Performance Highlights:
| Benchmark | Content Consistency | Speaker Similarity | Quality |
|---|---|---|---|
| LibriSpeech test-clean | WER = 2.47% (vs. 2.66% human) | SS = 0.745 | NMOS = 3.96 (vs. 3.84 human) |
| SEED test-zh (Chinese) | CER = 1.45%; robust on test-hard (WER = 6.83%) | SS = 0.806 | — |
| Japanese / Korean | CER = 18.8% / 7.98% | SS = 0.63 / 0.71 | NMOS = 3.42 / 3.73 |
| Instruction control | CER = 1.52% | SS = 0.804 | MOS-I = 4.11/5, NMOS = 3.94 |
Streaming mode is nearly lossless, indicating successful chunk-aware adaptation.
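For concreteness, the speaker-similarity (SS) numbers above are cosine similarities between speaker embeddings of the enrollment prompt and the synthesized speech; a minimal sketch, assuming a hypothetical `eres2net_embed` function that maps a waveform to an embedding vector:

```python
import torch.nn.functional as F

def speaker_similarity(prompt_wav, synth_wav, eres2net_embed):
    """SS = cosine similarity between ERes2Net speaker embeddings of the
    reference prompt and the synthesized utterance."""
    e_ref = eres2net_embed(prompt_wav)    # (D,) embedding of the prompt speaker
    e_syn = eres2net_embed(synth_wav)     # (D,) embedding of the synthesized audio
    return F.cosine_similarity(e_ref.unsqueeze(0), e_syn.unsqueeze(0)).item()
```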
6. Implementation and Reproducibility
Key implementation details and resources:
- Codebase and Models:
https://github.com/FunAudioLLM/CosyVoice (For trained checkpoints and configuration).
- Audio Demos:
https://funaudiollm.github.io/cosyvoice2/
- Backbone: Initialize LM from Qwen2.5-0.5B; remove speaker embedding; tie text and speech token embeddings.
- FSQ Settings: Down-projection to a low-rank dimension $D$, with each scalar bounded and rounded to an integer in $[-K, K]$; codebook size $(2K+1)^D$; employs an STE for backpropagation through ROUND.
- CFM: 10 stacked UNet blocks, each with causal convolution + attention; the chunk mask is sampled uniformly per training example from {non-causal, full-causal, chunk-$M$, chunk-$2M$}; classifier-free guidance with NFE = 10 at inference.
- Streaming LM: interleaved chunks of $N$ text tokens and $M$ speech tokens; a dedicated FILL token id; speech tokens are emitted/flushed every $M$ tokens.
- Vocoder: HiFi-GAN, standard configuration.
Minimum requirements for reproduction:
- Train the FSQ-based speech tokenizer on aligned ASR data.
- Fine-tune Qwen2.5 LLM for next-token prediction on combined text and speech token sequences.
- Train the flow-matching UNet with chunk-aware masking.
- Inference with the described chunked or full-sequence regimes.
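To tie the reproduction steps together, here is a hedged sketch of the non-streaming inference path implied by the data-flow overview; all component names (`tokenizer`, `qwen_lm`, `flow_matcher`, `hifigan`) are placeholders, not the actual module names in the CosyVoice repository.

```python
def synthesize(text, prompt_wav, tokenizer, qwen_lm, flow_matcher, hifigan):
    """Non-streaming CosyVoice-2-style pipeline sketch:
    text -> BPE -> LM speech tokens -> chunk-aware flow matching -> Mel -> waveform."""
    text_tokens = tokenizer.encode(text)                       # BPE text tokens
    speech_tokens = qwen_lm.generate(text_tokens)              # autoregressive semantic tokens
    mel = flow_matcher(speech_tokens, prompt_wav, n_steps=10)  # 10 NFE, full (offline) mask
    return hifigan(mel)                                        # Mel -> waveform
```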
This modular, unified TTS design achieves sub-100 ms streaming latency and human-parity naturalness in multiple languages, with the latency-quality balance set by the chunk size and masking configuration (Du et al., 13 Dec 2024).