
ZipVoice-Dialog: NAR Dialogue Synthesis

Updated 17 July 2025
  • ZipVoice-Dialog is a non-autoregressive, zero-shot spoken dialogue generation model that leverages flow matching to synthesize realistic multi-speaker interactions.
  • Its architecture integrates a text encoder, conditional flow matching, and a pre-trained vocoder, with learnable speaker-turn embeddings supporting explicit turn-taking and distinct speaker timbres.
  • A curriculum learning strategy combined with the OpenDialog dataset enables high-quality synthesis with faster inference, improved intelligibility, and robust benchmarking.

ZipVoice-Dialog is a non-autoregressive, zero-shot spoken dialogue generation model employing flow matching for high-fidelity, efficient, and accurate multi-speaker speech synthesis. Developed to address the unique challenges of generating realistic spoken dialogues—particularly the need for explicit turn-taking and distinct speaker timbres—ZipVoice-Dialog departs from prior autoregressive speech generation approaches, offering substantial improvements in inference speed, intelligibility, and speaker coherence. Central to its development is a large-scale, curated dataset for open spoken dialogue, as well as a robust benchmarking suite against leading state-of-the-art models (Zhu et al., 12 Jul 2025).

1. Architectural Foundations

ZipVoice-Dialog extends the ZipVoice NAR monologue text-to-speech (TTS) model to the domain of spoken dialogue. The architecture is organized around three key components:

  • Text Encoder: Utilizes the Zipformer backbone to embed input tokenized text, producing a feature vector $\bar{y}_i$ for each token $y_i$.
  • Conditional Flow Matching (CFM): Implements a vector field estimator trained to map interpolated noisy features $x_t = (1-t)\,x_0 + t\,x_1$ back to clean speech features, conditional on text and speaker-turn information. The core training loss over masked regions is

$$\mathcal{L}_{\text{CFM-TTS}} = \mathbb{E}_{t,\, q(x_1),\, p_0(x_0)} \left\| \left( v_t(x_t, z, (1-m) \odot x_1 ; \theta) - (x_1 - x_0) \right) \odot m \right\|^2$$

where $z$ is the upsampled text condition and $m$ is a binary mask selecting the regions to be reconstructed; a minimal sketch of this masked objective follows the component list below.

  • Pre-trained Vocoder: Employs Vocos for high-quality waveform synthesis from predicted speech features.
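
The masked CFM objective above can be made concrete with a short sketch. This is a minimal illustration, not the authors' released implementation; the tensor shapes, the estimator interface, and all variable names are assumptions.

```python
import torch

def cfm_tts_loss(v_estimator, x1, z, mask):
    """Masked CFM-TTS objective as described above (shapes and names are assumptions).

    x1   : clean speech features, shape (B, T, D)
    z    : upsampled text condition aligned with x1, shape (B, T, D)
    mask : binary mask m (1 = masked region to reconstruct), shape (B, T, 1)
    """
    x0 = torch.randn_like(x1)                              # noise sample from p0
    t = torch.rand(x1.size(0), 1, 1, device=x1.device)     # flow step t in [0, 1)
    xt = (1.0 - t) * x0 + t * x1                            # interpolated noisy features
    cond = (1.0 - mask) * x1                                # unmasked features as acoustic context
    v = v_estimator(xt, z, cond, t)                         # vector field estimate v_t(.; theta)
    return (((v - (x1 - x0)) * mask) ** 2).mean()           # squared error on masked frames only
```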

Input text is provided as a single interleaved sequence, marked with turn-level speaker labels (e.g., [S1], [S2]), enabling the model to assign each turn to the correct speaker during synthesis.
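
As an illustration, a two-speaker transcript might be serialized into such a tagged sequence as follows; the exact tokenization and the example utterances here are assumptions, not the released interface.

```python
# Hypothetical two-speaker transcript; the [S1]/[S2] tags mark speaker turns.
turns = [
    ("S1", "Did you listen to the stereo samples yet?"),
    ("S2", "I did, the channel separation sounds clean."),
    ("S1", "Great, let's run the benchmark next."),
]

# Serialize into a single interleaved text sequence with explicit turn labels.
dialogue_text = " ".join(f"[{speaker}] {utterance}" for speaker, utterance in turns)
print(dialogue_text)
# [S1] Did you listen to the stereo samples yet? [S2] I did, ... [S1] Great, ...
```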

2. Speaker-Turn Embeddings and Turn-Taking Mechanisms

Realistic dialogue generation depends critically on distinguishing speakers and enforcing robust turn-taking. ZipVoice-Dialog achieves this through:

  • Learnable Speaker-Turn Embeddings: For each token $y_i$, a corresponding embedding $e_{\text{speaker}(i)}$ is added:

$$\tilde{y}_i = \bar{y}_i + e_{\text{speaker}(i)}$$

This mechanism encodes explicit speaker identity into the text features, guiding the model to assign the correct timbre and vocal characteristics to each turn; a minimal sketch of this embedding step appears after the list below.

  • Explicit Turn Markers: Input sequences include special tokens for each speaker, ensuring that the model is provided unambiguous turn boundaries during training and inference.
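
The speaker-turn embedding above could be implemented roughly as follows. This is a minimal sketch; the module and argument names are assumptions and not taken from the released code.

```python
import torch
import torch.nn as nn

class SpeakerTurnEmbedding(nn.Module):
    """Adds a learnable per-speaker embedding to each token's encoder feature."""

    def __init__(self, num_speakers: int, dim: int):
        super().__init__()
        self.table = nn.Embedding(num_speakers, dim)

    def forward(self, token_feats: torch.Tensor, speaker_ids: torch.Tensor) -> torch.Tensor:
        # token_feats : (B, N, D) text-encoder outputs, i.e. the \bar{y}_i
        # speaker_ids : (B, N) integer speaker index per token (e.g. 0 for [S1], 1 for [S2])
        return token_feats + self.table(speaker_ids)   # \tilde{y}_i = \bar{y}_i + e_{speaker(i)}
```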

Objective assessments indicate that this explicit encoding of speaker information leads to highly accurate speaker turn-taking and improved speaker similarity performance.

3. Training Paradigm: Curriculum Strategy

Direct training on dialogue data with multiple, distinct speaker timbres presents alignment difficulties. ZipVoice-Dialog addresses these with a curriculum learning strategy:

  1. Monologue Pre-Training: The ZipVoice-Dialog model is initialized from the weights of the ZipVoice monologue TTS system. This phase, trained on large-scale single-speaker speech (on the order of 100k hours), establishes robust speech–text alignment.
  2. Dialogue Fine-Tuning: The pre-trained model undergoes supervised fine-tuning on single-channel dialogue data (with explicit turns) to acquire conversational dynamics and speaker alternation.

This staged learning yields stable convergence and high intelligibility while minimizing common multi-speaker synthesis errors.
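
As an illustration of the two-stage schedule, a compact fine-tuning sketch follows; every function, attribute, and file name here is a placeholder rather than the project's actual training interface.

```python
import torch

def fine_tune_on_dialogue(model, monologue_ckpt, dialogue_loader, optimizer, num_steps):
    """Stage-2 fine-tuning sketch: start from monologue TTS weights (stage 1),
    then train on turn-tagged single-channel dialogue. All names are placeholders."""
    model.load_state_dict(torch.load(monologue_ckpt))     # stage 1 result: monologue pre-training
    model.train()
    for step, batch in zip(range(num_steps), dialogue_loader):
        loss = model.compute_cfm_loss(batch)              # masked CFM loss as in Section 1
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```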

4. Stereo Dialogue and Channel Separation

ZipVoice-Dialog-Stereo extends the model to two-channel (stereo) audio output, a requirement for telecommunication, simulation, and other dual-speaker contexts. Specialized strategies support this extension:

  • Weight Initialization: The input and output projection layers are doubled for stereo output and initialized by duplicating the single-channel weights, avoiding the instability of random initialization.
  • Dialogue Regularization: Single-channel branches are maintained during stereo fine-tuning; batch alternation between mono and stereo data mitigates overfitting given limited stereo data availability.
  • Speaker Exclusive Loss: To disincentivize simultaneous speech (crosstalk/artifacts) in both channels, an additional loss term penalizes frames where both channels exceed an adaptive silence threshold:

$$\mathcal{L}^{\mathrm{SE}} = \frac{1}{T} \sum_{i} \mathbb{1}\left(E^0_i > \tau \wedge E^1_i > \tau\right) \left(E^0_i - \tau\right)\left(E^1_i - \tau\right)$$

Here, $E^c_i$ is the energy of channel $c$ at frame $i$, and $\tau$ is the median frame energy; a minimal sketch of this loss follows.
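
The speaker-exclusive penalty can be sketched directly from the formula above. Variable names and the exact statistic used for the threshold are assumptions.

```python
import torch

def speaker_exclusive_loss(energy_ch0: torch.Tensor, energy_ch1: torch.Tensor) -> torch.Tensor:
    """Sketch of the speaker-exclusive loss.

    energy_ch0, energy_ch1 : per-frame energies E^0_i and E^1_i, each of shape (T,).
    """
    tau = torch.cat([energy_ch0, energy_ch1]).median()        # median frame energy as threshold
    both_active = (energy_ch0 > tau) & (energy_ch1 > tau)     # indicator 1(E^0_i > tau and E^1_i > tau)
    penalty = (energy_ch0 - tau) * (energy_ch1 - tau) * both_active.float()
    return penalty.mean()                                     # (1/T) * sum over frames
```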

During inference, environmental/ambient noise is used in inactive channels to align with the training distribution, supporting natural background modeling.

5. OpenDialog Dataset Construction

The lack of a publicly available large-scale spoken dialogue dataset motivated the introduction of OpenDialog:

  • Scale: 6.8k hours of natural spoken dialogue, comprising 5,074 hours in English and 1,759 hours in Chinese.
  • Curation Pipeline:
    • Voice Activity Detection (VAD) to segment speech regions.
    • Speaker diarization to separate speakers in recordings.
    • ASR transcription and LLM-based dialogue classification.
    • Rule-based and DNSMOS quality filtering, removing segments that fall below designated MOS thresholds (e.g., DNSMOS < 2.8); a minimal filtering sketch follows this list.
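
The quality-filtering step can be illustrated with a small sketch; the segment fields and the assumption of precomputed DNSMOS scores are illustrative, not the actual pipeline code.

```python
def filter_by_dnsmos(segments, threshold=2.8):
    """Keep segments whose precomputed DNSMOS score clears the threshold.

    `segments` is assumed to be an iterable of dicts carrying a "dnsmos" field;
    this mirrors the filtering rule described above.
    """
    return [seg for seg in segments if seg["dnsmos"] >= threshold]

kept = filter_by_dnsmos([
    {"audio": "seg_001.wav", "dnsmos": 3.1},   # kept
    {"audio": "seg_002.wav", "dnsmos": 2.4},   # discarded: below the MOS threshold
])
```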

OpenDialog provides critical resources for both model training and rigorous benchmarking of spoken dialogue systems.

6. Empirical Benchmarking and Performance

ZipVoice-Dialog’s evaluation demonstrates advances over top autoregressive dialogue generation models (e.g., MoonCast, Dia):

  • Objective Metrics:
    • Intelligibility (WER): Substantially lower word error rates versus autoregressive baselines.
    • Speaker Turn-Taking (cpWER): Significant improvements, credited to explicit speaker-turn embeddings.
    • Speaker Similarity (cpSIM): High correlation between generated and reference speakers.
    • UTMOS: Objective mean opinion scores confirm overall quality improvements.
    • Inference Speed (RTF): Inference over 15× faster than the best AR systems; a short RTF helper is sketched after this list.
  • Subjective Evaluation:
    • CMOS/SMOS: High scores for naturalness, coherent turn-taking, and perceived speaker consistency.
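
For reference, the real-time factor (RTF) used in the speed comparison is conventionally the ratio of synthesis time to the duration of the generated audio; the helper below is an illustrative definition, not the paper's evaluation script.

```python
def real_time_factor(synthesis_seconds: float, audio_seconds: float) -> float:
    """RTF = synthesis time / duration of generated audio; lower means faster than real time."""
    return synthesis_seconds / audio_seconds

# Example: producing 10 s of dialogue audio in 0.5 s of compute gives RTF = 0.05.
print(real_time_factor(0.5, 10.0))
```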

Results from both quantitative and qualitative metrics establish the efficiency and fidelity of the ZipVoice-Dialog architecture.

7. Availability and Research Resources

ZipVoice-Dialog is fully open-sourced, with the following public contributions:

  • GitHub Codebase: Complete training and inference pipelines, including configurations for both single-channel and stereo dialogue.
  • Model Checkpoints: Pre-trained weights ready for adaptation or fine-tuning.
  • Demo Samples: Audio examples demonstrating system capabilities.
  • OpenDialog Dataset: Downloadable resources for training and benchmarking alternative dialogue synthesis approaches.

These resources facilitate reproducibility, critical evaluation, and further research on spoken dialogue modeling.


ZipVoice-Dialog represents a substantive advance in non-autoregressive, turn-aware spoken dialogue synthesis, combining efficient flow matching with curriculum learning, explicit modeling of speaker turns, and extensibility to stereo generation. Its development is bolstered by the construction of the OpenDialog dataset and validated by comprehensive benchmarking against state-of-the-art models in both speed and realism (Zhu et al., 12 Jul 2025).
