JoyVoice: Multi-Speaker Speech Synthesis
- JoyVoice is an anthropomorphic foundation model for multi-speaker, long-context conversational speech synthesis that leverages both autoregressive and diffusion transformers.
- It integrates a unified architecture with a global causal diffusion transformer and MM-Tokenizer to maintain prosodic continuity and high content fidelity.
- The system employs robust data perturbation and multilingual training to support zero-shot voice cloning and synthesis for up to eight speakers without explicit boundaries.
JoyVoice is an anthropomorphic foundation model for multi-speaker, long-context conversational speech synthesis, designed to overcome the segmentation and context-length constraints of previous speech generation architectures. The system employs a unified end-to-end architecture integrating an autoregressive transformer with a global causal Diffusion Transformer (DiT) to achieve holistic optimization over both linguistic and acoustic representations. JoyVoice is capable of synthesizing fluid, boundary-free dialogue involving up to eight speakers and supports fine-grained multilingual generation and zero-shot voice cloning, establishing new benchmarks in content fidelity, prosodic continuity, and paralinguistic expressiveness (Yu et al., 22 Dec 2025).
1. Unified E2E-Transformer-DiT Model Architecture
JoyVoice synthesizes conversational speech through two interlinked modules:
- An autoregressive (AR) transformer based on the Qwen2.5-0.5B backbone, which ingests system prompts, natural language text, and speaker tags, and predicts discrete speech tokens.
- A global causal Diffusion Transformer (DiT) employing flow-matching, which converts AR hidden representations directly to mel-spectrograms.
Unlike preceding cascaded TTS designs, JoyVoice feeds the continuous hidden states $h$ from the AR module as direct input to the DiT, thus transmitting prosodic and speaker information without lossy discretization. This direct conditioning supports full end-to-end gradient flow, with the joint objective

$$\mathcal{L} = \mathcal{L}_{\text{CE}} + \mathcal{L}_{\text{FM}},$$

where $\mathcal{L}_{\text{CE}}$ is a next-token cross-entropy loss for speech-token prediction and $\mathcal{L}_{\text{FM}}$ is a flow-matching loss on continuous mel-spectrogram reconstruction.
This tightly coupled architecture enables JoyVoice to handle up to eight speakers per segment without the need for explicit dialogue boundary demarcation. The design is robust to substantial tokenizer downsampling (25 Hz→12.5 Hz) with minimal Signal-to-Noise Ratio (SNR) loss, retaining continuity in speaker and prosodic features (Yu et al., 22 Dec 2025).
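The coupling can be made concrete with a minimal PyTorch-style sketch; the module sizes, the TransformerEncoder stand-ins, and the flow-matching parameterization below are illustrative assumptions rather than the released JoyVoice implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyJoyVoice(nn.Module):
    """Sketch of the AR-transformer + causal-DiT coupling: the AR module predicts
    discrete speech tokens, and its continuous hidden states (not the quantized
    tokens) condition a flow-matching mel decoder, so gradients flow end to end."""
    def __init__(self, vocab=4096, d_model=256, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)
        ar_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.ar = nn.TransformerEncoder(ar_layer, num_layers=2)    # stand-in for the Qwen2.5-0.5B AR backbone
        dit_layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.dit = nn.TransformerEncoder(dit_layer, num_layers=2)  # stand-in for the causal DiT
        self.token_head = nn.Linear(d_model, vocab)                # discrete speech-token logits
        self.mel_in = nn.Linear(n_mels, d_model)
        self.time_proj = nn.Linear(1, d_model)
        self.mel_head = nn.Linear(d_model, n_mels)

    def forward(self, text_ids, target_tokens, target_mel):
        B, T = text_ids.shape
        causal = nn.Transformer.generate_square_subsequent_mask(T).to(text_ids.device)
        h = self.ar(self.embed(text_ids), mask=causal)             # continuous hidden states h
        l_ce = F.cross_entropy(self.token_head(h).transpose(1, 2), target_tokens)

        # Conditional flow matching: interpolate noise -> mel and regress the
        # velocity, conditioning the DiT on h rather than on re-discretized tokens.
        t = torch.rand(B, 1, 1, device=target_mel.device)
        noise = torch.randn_like(target_mel)
        x_t = (1 - t) * noise + t * target_mel
        cond = h + self.time_proj(t.expand(-1, T, -1))
        v_pred = self.mel_head(self.dit(self.mel_in(x_t) + cond, mask=causal))
        l_fm = F.mse_loss(v_pred, target_mel - noise)              # target velocity = mel - noise
        return l_ce + l_fm                                          # joint objective

model = ToyJoyVoice()
loss = model(torch.randint(0, 4096, (2, 10)),   # text/token ids
             torch.randint(0, 4096, (2, 10)),   # target speech tokens
             torch.randn(2, 10, 80))            # target mel frames
loss.backward()
```

The key point the sketch illustrates is that `h`, not the quantized token sequence, conditions the DiT, so both loss terms backpropagate into the AR backbone.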
2. MM-Tokenizer: Semantic & Acoustic Joint Modeling
Speech discretization and understanding are implemented via the MM-Tokenizer, an audio tokenizer operating at 12.5 Hz, derived from the Whisper-large-v3 encoder architecture. The MM-Tokenizer features several innovations:
- An 8× CNN downsampling stack reduces the frame rate before encoding.
- A Finite Scalar Quantization (FSQ) layer inserted into the encoder produces low-bitrate discrete tokens.
- The token vocabulary multiplexes ASR outputs and special tokens for auxiliary tasks such as Speaker Emotion Recognition (SER), Audio Event Detection (AED), Automatic Echo Cancellation (AEC), Speaker Verification (SV), Audio Diarization (AD), and Gender Classification (GC).
Model optimization uses a sum of multi-task cross-entropy losses over these audio-understanding tasks, plus a minimum mean squared error (MMSE) criterion for acoustic (mel-spectrogram) reconstruction:

$$\mathcal{L}_{\text{tok}} = \sum_{k} \mathcal{L}_{\text{CE}}^{(k)} + \mathcal{L}_{\text{MMSE}},$$

where $k$ ranges over the audio-understanding tasks listed above.
This design ensures that each token captures both semantic and low-level acoustic cues, preserving intelligibility and expressiveness across diverse speakers and languages.
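Finite Scalar Quantization can be sketched as rounding each bounded encoder channel to a small integer grid with a straight-through gradient; the channel count and level configuration below are illustrative assumptions, since the summary does not specify MM-Tokenizer's FSQ settings:

```python
import torch

def fsq_quantize(z: torch.Tensor, levels=(7, 5, 5, 5)) -> torch.Tensor:
    """Finite Scalar Quantization sketch: bound each channel, round it to a fixed
    number of integer levels, and pass gradients straight through so the encoder
    still trains. Implicit vocabulary size = prod(levels)."""
    L = torch.tensor(levels, dtype=z.dtype, device=z.device)
    half = (L - 1) / 2
    z_bounded = torch.tanh(z) * half                 # squash to (-half, half) per channel
    z_q = torch.round(z_bounded)                     # snap to the integer grid
    return z_bounded + (z_q - z_bounded).detach()    # straight-through estimator

def fsq_codes_to_ids(z_q: torch.Tensor, levels=(7, 5, 5, 5)) -> torch.Tensor:
    """Fold the per-channel codes into a single token id in [0, prod(levels))."""
    L = torch.tensor(levels, dtype=torch.long, device=z_q.device)
    digits = (z_q + (L - 1) / 2).round().long()      # shift each channel to [0, L-1]
    basis = torch.cumprod(
        torch.cat([torch.ones(1, dtype=torch.long, device=z_q.device), L[:-1]]), dim=0)
    return (digits * basis).sum(dim=-1)

latents = torch.randn(2, 50, 4, requires_grad=True)  # (batch, frames @ 12.5 Hz, FSQ channels)
tokens = fsq_codes_to_ids(fsq_quantize(latents))     # discrete MM-Tokenizer-style ids
```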
3. Text Front-End-Free Processing via Data Perturbation
To eliminate the need for a brittle, language-specific TTS front-end, JoyVoice introduces aggressive data perturbation at the text level. Training text is randomly subjected to:
- Text normalization (TN) and inverse text normalization (ITN)
- Polyphonic character substitution
- Synthetic perturbations for numbers, dates, and punctuation
- Rare-character replacement
This approach enhances JoyVoice's generalization to out-of-domain and user-generated texts, as well as robustness across orthographic variance. As a result, the architecture accommodates multi-lingual, cross-dialectal synthesis without the constraints of conventional text-front-end engineering (Yu et al., 22 Dec 2025).
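A minimal sketch of this kind of perturbation is shown below; the substitution tables, probabilities, and tokenization are illustrative assumptions, not the paper's actual recipe:

```python
import random

# Illustrative substitution tables; real tables would be far larger and cover
# polyphones, dates, rare characters, etc.
NUMBER_VARIANTS = {"2024": ["2024", "twenty twenty-four", "two thousand and twenty-four"]}
PUNCT_VARIANTS = {",": [",", "，", ";"], ".": [".", "。", "!"]}

def perturb_text(text: str, p: float = 0.3, seed=None) -> str:
    """Randomly rewrite numbers and punctuation so the model is exposed to both
    normalized (TN) and unnormalized (ITN-style) spellings of the same content."""
    rng = random.Random(seed)
    words = []
    for tok in text.split(" "):
        core = tok.rstrip(".,")
        tail = tok[len(core):]
        if core in NUMBER_VARIANTS and rng.random() < p:
            core = rng.choice(NUMBER_VARIANTS[core])
        if tail in PUNCT_VARIANTS and rng.random() < p:
            tail = rng.choice(PUNCT_VARIANTS[tail])
        words.append(core + tail)
    return " ".join(words)

print(perturb_text("The meeting is in 2024, at noon.", p=1.0, seed=0))
```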
4. Training Dataset, Optimization, and Inference
JoyVoice is trained on approximately 1.3 million hours of scrutinized “in-the-wild” audio content, comprising single-speaker clips (≤30 s) and multi-speaker long-form segments (30 s–5 min) from public audiobooks, podcasts, and video corpora. The language composition prioritizes Chinese (Mandarin, dialectal variants) and English (>90%), but also includes Japanese and Korean.
Training proceeds in two phases:
- Pretraining on single-speaker segments ≤1 min
- Fine-tuning on a curriculum mixing short, long single-speaker, and multi-speaker samples (up to 8 speakers/5 min)
Optimization employs AdamW with linear warmup (10 k steps) and cosine LR annealing. Inference supports streaming through dynamic-chunk causal attention in the DiT module (e.g., 48-frame chunks), enabling low-latency, chunk-by-chunk generation.
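The chunked causal attention can be sketched as a block mask in which each frame attends bidirectionally within its own chunk and causally to earlier chunks; the masking convention is a generic sketch, not necessarily the exact DiT implementation:

```python
import torch

def chunk_causal_mask(n_frames: int, chunk: int = 48) -> torch.Tensor:
    """Boolean mask where entry [i, j] is True iff frame i may attend to frame j,
    i.e. j lies in the same chunk as i or in an earlier chunk. Attention is
    bidirectional within a chunk and causal across chunks, so audio can be
    emitted chunk by chunk during streaming."""
    idx = torch.arange(n_frames) // chunk          # chunk index per frame
    return idx.unsqueeze(1) >= idx.unsqueeze(0)

mask = chunk_causal_mask(144, chunk=48)            # three 48-frame chunks
# Usable as `attn_mask` in torch.nn.functional.scaled_dot_product_attention.
```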
Post-training, JoyVoice applies two preference-based reinforcement learning (RL) schemes:
- Direct Preference Optimization (DPO) for text-level output preference
- Acoustic Preference Optimization (APO), which uses character error rate (CER) to define preferences in a token-level Bradley–Terry objective, improving intelligibility and naturalness (Yu et al., 22 Dec 2025).
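The preference objectives can be illustrated with the standard DPO / Bradley–Terry formulation; this is a generic sketch rather than JoyVoice's exact APO variant, and assigning preference by comparing the CER of two sampled syntheses is an assumption based on the description above:

```python
import torch
import torch.nn.functional as F

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta: float = 0.1):
    """Standard DPO / Bradley-Terry preference loss.

    logp_* are summed token log-probs of the preferred (w) and dispreferred (l)
    responses under the policy; ref_logp_* are the same under a frozen reference
    model. For APO-style training, the preferred sample could be the synthesis
    with the lower CER (assumption)."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -F.logsigmoid(margin).mean()

# Toy usage with made-up per-sample log-probabilities.
loss = dpo_loss(torch.tensor([-42.0]), torch.tensor([-47.0]),
                torch.tensor([-44.0]), torch.tensor([-46.0]))
```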
5. Experimental Results and Comparative Performance
5.1 Zero-Shot Single-Speaker Evaluation (Seed-TTS-Eval Benchmark)
JoyVoice achieves state-of-the-art content and speaker-similarity metrics in zero-shot mode for both Chinese (CER = 0.97%) and English (WER = 1.69%), outperforming CosyVoice 3-0.5B and reducing relative error against its own two-stage (cascade) variant by 14% (Chinese CER) and 3.4% (English WER); a worked check follows the table below. Speaker similarity (SS) is measured with deep speaker embeddings (WavLM/ERes2Net), with JoyVoice (E2E) scoring 0.836/0.790 on the Chinese/English test sets.
| Model | test-zh CER↓/SS↑ | test-en WER↓/SS↑ |
|---|---|---|
| CosyVoice 3-0.5B | 1.16 / 0.825 | 2.02 / 0.789 |
| JoyVoice (Cascade) | 1.13 / 0.780 | 1.75 / 0.710 |
| JoyVoice (E2E) | 0.97 / 0.836 | 1.69 / 0.790 |
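These relative-error figures follow directly from the table, taking the cascade variant as the reference:

$$\frac{1.13 - 0.97}{1.13} \approx 14.2\%\ \text{(zh CER)}, \qquad \frac{1.75 - 1.69}{1.75} \approx 3.4\%\ \text{(en WER)}.$$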
Streaming inference with low-latency 48-frame chunks matches full-context performance, demonstrating viability for real-time deployment.
5.2 Multi-Speaker Long-Form (JoyVoice-MSMT-eval)
On multi-speaker evaluation, JoyVoice yields the lowest error in both content accuracy and speaker-attributed continuity (cpWER/cpCER), with the strongest results for up to four speakers. For instance, JoyVoice achieves 1.44% CER / 1.88% cpCER (Chinese, 2-speaker) and 3.36% WER / 3.61% cpWER (English, 2-speaker), outperforming VibeVoice-7B.
| Model | zh 2-spk CER/cpCER↓ | en 2-spk WER/cpWER↓ |
|---|---|---|
| VibeVoice-7B | 1.80 / 7.57 | 4.19 / 6.58 |
| JoyVoice | 1.44 / 1.88 | 3.36 / 3.61 |
Performance degrades for more than four speakers, attributable to reduced data coverage in this regime, suggesting that further data curation is needed for conversations with larger numbers of speakers.
5.3 Qualitative Findings
Subjective listening tests and A/B preference studies (no explicit MOS) validate enhanced prosodic continuity, rhythm, paralinguistic expressiveness, and speech intelligibility for JoyVoice-generated outputs. These subjective measures underpin RL-based fine-tuning.
6. Zero-Shot Voice Cloning Methodology and Results
JoyVoice supports zero-shot speaker adaptation by conditioning on a short vocal prompt from which a speaker embedding is extracted. The prompt and tagged text are processed by the AR transformer, followed by mel-spectrogram generation via the DiT. On the Seed-TTS-Eval benchmark, JoyVoice achieves 0.97% CER for Chinese and 1.69% WER for English in zero-shot mode, surpassing prior single- and multi-speaker models with no speaker-specific fine-tuning. This demonstrates generalization to unseen voices, multilingual robustness, and flexible speaker conditioning (Yu et al., 22 Dec 2025).
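A minimal sketch of how the conditioning sequence might be assembled is given below; the tag strings, token markers, and ordering are illustrative assumptions, since the paper's exact prompt format is not reproduced in this summary:

```python
def build_cloning_input(prompt_tokens, prompt_text, target_text, speaker_tag="[S1]"):
    """Assemble an AR input for zero-shot cloning: the vocal prompt's transcript
    and its discrete speech tokens act as an in-context example of the target
    voice, followed by the tagged text whose speech tokens should be generated.
    Layout is illustrative, not the official JoyVoice prompt format."""
    return (
        [f"{speaker_tag} {prompt_text}"]                 # transcript of the vocal prompt
        + [f"<speech_{tok}>" for tok in prompt_tokens]   # MM-Tokenizer ids of the prompt audio
        + [f"{speaker_tag} {target_text}"]               # text to synthesize in the prompt voice
    )

sequence = build_cloning_input([101, 57, 998], "Hello there.", "Welcome to the show.")
```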
7. Contributions, Known Limitations, and Prospects
Major contributions include: (1) the integration of AR hidden-state conditioning within a Transformer–DiT framework; (2) the development of a low-bitrate, joint semantic–acoustic MM-Tokenizer; (3) removal of text front-end dependencies through large-scale data perturbation; and (4) leading performance on single- and multi-speaker long-form voice synthesis tasks.
Limitations of the current system include: (i) degradation in synthesis quality beyond four simultaneous speakers, attributed to data scarcity; (ii) reinforcement learning modules, particularly for long-term context stability and emotional control, that remain under development; and (iii) a scope confined to speech, excluding music and general audio-event synthesis. Planned directions include dataset expansion for dense multi-speaker coverage, refinement of APO to optimize conversational coherence and affect, and exploration of universal audio tokenization for broader audio modalities (Yu et al., 22 Dec 2025).