ZipVoice-Dialog: NAR Dialogue Synthesis

Updated 17 July 2025
  • ZipVoice-Dialog is a non-autoregressive, zero-shot spoken dialogue generation model that leverages flow matching to synthesize realistic multi-speaker interactions.
  • Its architecture integrates a text encoder, conditional flow matching, and a pre-trained vocoder, ensuring explicit turn-taking and distinct speaker timbres.
  • A curriculum learning strategy combined with the OpenDialog dataset enables high-quality synthesis with faster inference, improved intelligibility, and robust benchmarking.

ZipVoice-Dialog is a non-autoregressive, zero-shot spoken dialogue generation model employing flow matching for high-fidelity, efficient, and accurate multi-speaker speech synthesis. Developed to address the unique challenges of generating realistic spoken dialogues—particularly the need for explicit turn-taking and distinct speaker timbres—ZipVoice-Dialog departs from prior autoregressive speech generation approaches, offering substantial improvements in inference speed, intelligibility, and speaker coherence. Central to its development is a large-scale, curated dataset for open spoken dialogue, as well as a robust benchmarking suite against leading state-of-the-art models (Zhu et al., 12 Jul 2025).

1. Architectural Foundations

ZipVoice-Dialog extends the ZipVoice NAR monologue text-to-speech (TTS) model to the domain of spoken dialogue. The architecture is organized around three key components:

  • Text Encoder: Utilizes the Zipformer backbone to embed the tokenized input text, producing a feature vector $\bar{y}_i$ for each token $y_i$.
  • Conditional Flow Matching (CFM): Implements a vector field estimator trained to map interpolated noisy features $x_t = (1-t)\,x_0 + t\,x_1$ back to clean speech features, conditioned on the text and speaker-turn information. The core training loss over masked regions (see the sketch after this list) is

$$\mathcal{L}_{\text{CFM-TTS}} = \mathbb{E}_{t,\, q(x_1),\, p_0(x_0)} \left\| \left( v_t\big(x_t,\, z,\, (1-m) \odot x_1 ;\, \theta\big) - (x_1 - x_0) \right) \odot m \right\|^2$$

where $z$ is the upsampled text condition and $m$ is the binary mask selecting the regions to be generated.

  • Pre-trained Vocoder: Employs Vocos for high-quality waveform synthesis from predicted speech features.
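
To make the masked objective concrete, the following is a minimal PyTorch sketch of the loss above; the tensor shapes and the normalization by the number of masked entries are illustrative assumptions, not details taken from the paper.

```python
import torch

def cfm_tts_loss(v_pred: torch.Tensor,
                 x0: torch.Tensor,
                 x1: torch.Tensor,
                 mask: torch.Tensor) -> torch.Tensor:
    """Masked flow-matching objective corresponding to the formula above.

    v_pred: (B, T, D) vector field predicted at time t, already conditioned on
            the upsampled text z and the unmasked speech (1 - m) * x1.
    x0:     (B, T, D) noise sample drawn from p0.
    x1:     (B, T, D) clean speech features.
    mask:   (B, T, 1) binary mask m marking the frames to be generated.
    """
    target = x1 - x0                              # straight-line flow target x1 - x0
    sq_err = ((v_pred - target) * mask).pow(2)    # restrict the error to masked frames
    # Normalize by the number of masked entries (one possible convention).
    return sq_err.sum() / mask.expand_as(sq_err).sum().clamp(min=1.0)

# Example interpolation the estimator would receive at a random time t in [0, 1]:
# x_t = (1 - t) * x0 + t * x1
```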

Input text is provided as a single interleaved sequence annotated with turn-level speaker labels (e.g., [S1], [S2]), enabling the model to synthesize each turn with the correct speaker identity.

2. Speaker-Turn Embeddings and Turn-Taking Mechanisms

Realistic dialogue generation depends critically on distinguishing speakers and enforcing robust turn-taking. ZipVoice-Dialog achieves this through:

  • Learnable Speaker-Turn Embeddings: For each token $y_i$, an embedding $e_{\text{speaker}(i)}$ for its speaker is added to the encoded feature (a minimal sketch appears at the end of this section):

$$\tilde{y}_i = \bar{y}_i + e_{\text{speaker}(i)}$$

This mechanism encodes explicit speaker identity into the text features, guiding the model to assign the correct timbre and vocal characteristics to each turn.

  • Explicit Turn Markers: Input sequences include special tokens for each speaker, ensuring that the model is provided unambiguous turn boundaries during training and inference.

Objective assessments indicate that this explicit encoding of speaker information leads to highly accurate speaker turn-taking and improved speaker similarity performance.
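
The sketch below shows one way the speaker-turn embedding and turn markers could be wired up, assuming two speakers and that the markers map to integer speaker indices; the real tokenizer and whether marker tokens receive embeddings themselves are assumptions for illustration.

```python
import torch
import torch.nn as nn

class SpeakerTurnEmbedding(nn.Module):
    """Adds a learnable speaker-turn embedding to each encoded token:
    y_tilde_i = y_bar_i + e_speaker(i)."""

    def __init__(self, dim: int, num_speakers: int = 2):
        super().__init__()
        self.embed = nn.Embedding(num_speakers, dim)

    def forward(self, y_bar: torch.Tensor, speaker_ids: torch.Tensor) -> torch.Tensor:
        # y_bar: (B, N, D) text-encoder outputs; speaker_ids: (B, N) integers in {0, 1}
        return y_bar + self.embed(speaker_ids)

def speaker_ids_from_markers(tokens: list) -> torch.Tensor:
    """Map an interleaved token sequence such as
    ["[S1]", "hi", "there", "[S2]", "hello"] to per-token speaker indices."""
    ids, current = [], 0
    for tok in tokens:
        if tok == "[S1]":
            current = 0
        elif tok == "[S2]":
            current = 1
        ids.append(current)
    return torch.tensor(ids)

# Usage on a single (unbatched) sequence:
tokens = ["[S1]", "hi", "there", "[S2]", "hello"]
speaker_ids = speaker_ids_from_markers(tokens).unsqueeze(0)   # (1, N)
y_bar = torch.randn(1, len(tokens), 256)                      # stand-in encoder output
y_tilde = SpeakerTurnEmbedding(dim=256)(y_bar, speaker_ids)   # (1, N, 256)
```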

3. Training Paradigm: Curriculum Strategy

Direct training on dialogue data with multiple, distinct speaker timbres presents alignment difficulties. ZipVoice-Dialog addresses these with a curriculum learning strategy:

  1. Monologue Pre-Training: The ZipVoice-Dialog model is initialized with weights from the ZipVoice monologue TTS system, which was trained on hundreds of thousands of hours of single-speaker speech and therefore provides robust speech–text alignment.
  2. Dialogue Fine-Tuning: The pre-trained model undergoes supervised fine-tuning on single-channel dialogue data (with explicit turns) to acquire conversational dynamics and speaker alternation.

This staged learning yields stable convergence and high intelligibility while minimizing common multi-speaker synthesis errors.
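
A minimal sketch of how such curriculum initialization could be implemented is shown below. The class names are illustrative stand-ins (not the actual ZipVoice code), and using PyTorch's `strict=False` checkpoint loading to leave only the new speaker-turn parameters freshly initialized is an assumption about the wiring, not a detail from the paper.

```python
import torch
import torch.nn as nn

# Illustrative stand-ins: a monologue TTS backbone and a dialogue model
# that adds new speaker-turn parameters on top of it.
class MonologueTTS(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        self.encoder = nn.Linear(dim, dim)   # stand-in for the Zipformer backbone

class DialogueTTS(MonologueTTS):
    def __init__(self, dim: int = 256, num_speakers: int = 2):
        super().__init__(dim)
        self.speaker_turn_embed = nn.Embedding(num_speakers, dim)

# Stage 1: monologue pre-training yields a checkpoint (state_dict).
monologue_ckpt = MonologueTTS().state_dict()

# Stage 2: initialize the dialogue model from that checkpoint before
# fine-tuning on dialogue data. strict=False copies all shared weights and
# leaves the newly added speaker-turn parameters at their fresh initialization.
dialogue_model = DialogueTTS()
missing, unexpected = dialogue_model.load_state_dict(monologue_ckpt, strict=False)
print("newly initialized parameters:", missing)   # ['speaker_turn_embed.weight']
```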

4. Stereo Dialogue and Channel Separation

ZipVoice-Dialog-Stereo extends the model to two-channel (stereo) audio output, a requirement for telecommunication, simulation, and other dual-speaker contexts. Specialized strategies support this extension:

  • Weight Initialization: Input/output projection layers are doubled for stereo output, initialized by duplicating single-channel weights, avoiding instability from random initialization.
  • Dialogue Regularization: Single-channel branches are maintained during stereo fine-tuning; batch alternation between mono and stereo data mitigates overfitting given limited stereo data availability.
  • Speaker-Exclusive Loss: To discourage simultaneous speech in both channels (crosstalk and related artifacts), an additional loss term penalizes frames where both channels exceed an adaptive silence threshold:

$$\mathcal{L}^{\mathrm{SE}} = \frac{1}{T} \sum_{i} \mathbb{1}\left(E^0_i > \tau \,\wedge\, E^1_i > \tau\right) \left(E^0_i - \tau\right)\left(E^1_i - \tau\right)$$

Here, $E^c_i$ is the energy of channel $c$ at frame $i$, $\tau$ is the median frame energy, and $T$ is the number of frames.
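
A minimal PyTorch sketch of this penalty follows; the mean-square per-frame energy and the median pooled over both channels are assumptions made for illustration.

```python
import torch

def speaker_exclusive_loss(feats_ch0: torch.Tensor, feats_ch1: torch.Tensor) -> torch.Tensor:
    """Penalize frames where both stereo channels are simultaneously active.

    feats_ch0, feats_ch1: (T, D) per-channel feature frames.
    """
    e0 = feats_ch0.pow(2).mean(dim=-1)        # per-frame energy E^0_i
    e1 = feats_ch1.pow(2).mean(dim=-1)        # per-frame energy E^1_i
    tau = torch.cat([e0, e1]).median()        # adaptive silence threshold tau
    both_active = (e0 > tau) & (e1 > tau)     # indicator 1(E^0_i > tau and E^1_i > tau)
    penalty = (e0 - tau) * (e1 - tau) * both_active
    return penalty.mean()                     # (1/T) * sum over frames
```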

During inference, ambient/environmental noise is placed in inactive channels to match the training distribution, supporting natural background modeling.

5. OpenDialog Dataset Construction

The lack of a publicly available large-scale spoken dialogue dataset motivated the introduction of OpenDialog:

  • Scale: 6.8k hours of natural spoken dialogue, comprising 5,074 hours in English and 1,759 hours in Chinese.
  • Curation Pipeline:
    • Voice Activity Detection (VAD) to segment speech regions.
    • Speaker diarization to separate speakers in recordings.
    • ASR transcription and LLM-based dialogue classification.
    • Rule-based and DNSMOS quality filtering, removing segments falling below designated MOS thresholds (e.g., DNSMOS < 2.8).

OpenDialog provides critical resources for both model training and rigorous benchmarking of spoken dialogue systems.
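
As a rough illustration of the final quality-filtering step in such a pipeline, the sketch below applies a DNSMOS threshold and simple rules to candidate segments. The field names, the scoring values, and the two-speaker rule are hypothetical; only the DNSMOS < 2.8 cutoff comes from the description above.

```python
# Illustrative filtering step; `segments` stands in for the output of the
# earlier VAD / diarization / ASR stages.
DNSMOS_THRESHOLD = 2.8

def keep_segment(segment: dict) -> bool:
    """Apply simple rule-based and quality filters to one candidate segment."""
    if segment["dnsmos"] < DNSMOS_THRESHOLD:     # perceptual quality filter
        return False
    if segment["num_speakers"] != 2:             # keep two-speaker dialogues only (assumed rule)
        return False
    return bool(segment["transcript"].strip())   # drop empty transcriptions

segments = [
    {"dnsmos": 3.1, "num_speakers": 2, "transcript": "[S1] hi [S2] hello"},
    {"dnsmos": 2.4, "num_speakers": 2, "transcript": "[S1] noisy clip"},
]
filtered = [s for s in segments if keep_segment(s)]
print(len(filtered))  # -> 1
```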

6. Empirical Benchmarking and Performance

ZipVoice-Dialog’s evaluation demonstrates advances over top autoregressive dialogue generation models (e.g., MoonCast, Dia):

  • Objective Metrics:
    • Intelligibility (WER): Substantially lower word error rates versus autoregressive baselines.
    • Speaker Turn-Taking (cpWER): Significant improvements, credited to explicit speaker-turn embeddings.
    • Speaker Similarity (cpSIM): High correlation between generated and reference speakers.
    • UTMOS: Objective mean opinion scores confirm overall quality improvements.
    • Inference Speed (RTF): Inference is over 15× faster than the best autoregressive systems.
  • Subjective Evaluation:
    • CMOS/SMOS: High scores for naturalness, coherent turn-taking, and perceived speaker consistency.

Results from both quantitative and qualitative metrics establish the efficiency and fidelity of the ZipVoice-Dialog architecture.
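
For readers unfamiliar with cpWER, the sketch below shows the metric as it is commonly defined (per-speaker concatenated WER minimized over speaker-assignment permutations), using the jiwer package; the paper's exact scoring script may differ.

```python
from itertools import permutations
import jiwer  # third-party WER toolkit (pip install jiwer)

def cp_wer(ref_by_speaker: dict, hyp_by_speaker: dict) -> float:
    """Concatenated minimum-permutation WER over speaker assignments."""
    ref_keys = sorted(ref_by_speaker)
    best = float("inf")
    for perm in permutations(sorted(hyp_by_speaker), len(ref_keys)):
        errors, words = 0.0, 0
        for rk, hk in zip(ref_keys, perm):
            n = len(ref_by_speaker[rk].split())
            errors += jiwer.wer(ref_by_speaker[rk], hyp_by_speaker[hk]) * n
            words += n
        best = min(best, errors / words)
    return best

ref = {"S1": "hi there", "S2": "hello how are you"}
hyp = {"A": "hello how are you", "B": "hi there"}
print(cp_wer(ref, hyp))   # -> 0.0 once the correct speaker mapping is found
```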

7. Availability and Research Resources

ZipVoice-Dialog is fully open-sourced, with the following public contributions:

  • GitHub Codebase: Complete training and inference pipelines, including configurations for both single-channel and stereo dialogue.
  • Model Checkpoints: Pre-trained weights ready for adaptation or fine-tuning.
  • Demo Samples: Audio examples demonstrating system capabilities.
  • OpenDialog Dataset: Downloadable resources for training and benchmarking alternative dialogue synthesis approaches.

These resources facilitate reproducibility, critical evaluation, and further research on spoken dialogue modeling.


ZipVoice-Dialog represents a substantive advance in non-autoregressive, turn-aware spoken dialogue synthesis, combining efficient flow matching with curriculum learning, explicit modeling of speaker turns, and extensibility to stereo generation. Its development is bolstered by the construction of the OpenDialog dataset and validated by comprehensive benchmarking against state-of-the-art models in both speed and realism (Zhu et al., 12 Jul 2025).
