DiFlow-TTS: Discrete Flow Matching for Zero-Shot TTS
- The paper introduces a novel TTS model that performs flow matching directly in the discrete token space, eliminating the need for continuous embedding.
- It employs factorized attribute modeling with separate heads for prosody and acoustic details, enabling precise control and effective zero-shot voice cloning.
- Experimental evaluations show high naturalness, robust speaker style preservation, and low-latency inference, making it suitable for real-time applications.
DiFlow-TTS is a zero-shot text-to-speech synthesis model that implements purely discrete flow matching with explicit factorization of speech attributes. Designed for high-fidelity, low-latency speech generation, DiFlow-TTS advances prior speech synthesis paradigms by performing flow matching directly in the discrete token domain—rather than the conventional approach of embedding discrete tokens into a continuous space. The architecture explicitly models prosody and acoustic details with separate factorized heads, enabling precise control and effective attribute cloning from a short reference sample, which is crucial for zero-shot application scenarios.
1. Discrete Flow Matching Architecture
DiFlow-TTS constructs speech by processing multiple streams of quantized speech tokens—prosody, content, and acoustic—extracted by a neural codec (e.g., FaCodec). The codec provides token sequences $z^{p}$, $z^{c}$, $z^{a}$ for prosody, content, and acoustic details, alongside a speaker embedding $s$:

$$(z^{p},\, z^{c},\, z^{a},\, s) = \mathrm{Codec}(x),$$

where $x$ is the raw waveform and each token takes values in the discrete vocabulary $\mathcal{V}$.
The input text undergoes phoneme extraction and duration prediction, after which a Length Regulator (LR) upsamples the phoneme embeddings for time alignment. The upsampled embeddings are processed by the Phoneme-Content Mapper (PCM), which outputs hierarchical content embeddings and predicts content tokens via

$$\hat{z}^{c} = \mathrm{PCM}\big(\mathrm{LR}(E_{\mathrm{phn}})\big),$$

where $E_{\mathrm{phn}}$ denotes the phoneme embedding sequence.
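A minimal sketch of the Length Regulator step (duration-based upsampling of phoneme embeddings), assuming PyTorch tensors; the function name and shapes are illustrative rather than the paper's API:

```python
import torch

def length_regulate(phoneme_emb: torch.Tensor, durations: torch.Tensor) -> torch.Tensor:
    """Upsample phoneme embeddings to frame rate by repeating each phoneme
    embedding durations[i] times (illustrative Length Regulator sketch).

    phoneme_emb: [num_phonemes, dim] text-side embeddings
    durations:   [num_phonemes] predicted integer frame counts per phoneme
    returns:     [sum(durations), dim] frame-aligned embeddings
    """
    return torch.repeat_interleave(phoneme_emb, durations, dim=0)

# Example: 3 phonemes, 4-dim embeddings, predicted durations of 2/1/3 frames.
emb = torch.randn(3, 4)
dur = torch.tensor([2, 1, 3])
frames = length_regulate(emb, dur)   # shape [6, 4]
```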
The Factorized Discrete Flow Denoiser (FDFD) performs discrete flow matching between a fully masked source sequence and the target token sequence. The denoiser uses two explicit heads ($f_{\theta}^{p}$, $f_{\theta}^{a}$) for prosody and acoustic tokens, respectively. The transition probability at each step is governed by a monotonic scheduler $\kappa_t$:

$$p_t\big(x_t^{(k)} \mid x_1^{(k)}\big) = (1 - \kappa_t)\,\delta_{m}\big(x_t^{(k)}\big) + \kappa_t\,\delta_{x_1^{(k)}}\big(x_t^{(k)}\big), \qquad x_1^{(k)} \sim f_{\theta}^{(k)}\big(\cdot \mid x_t, c\big),$$

where $k \in \{p, a\}$ indexes each attribute stream, $f_{\theta}^{(k)}$ denotes the corresponding denoising head, $\delta$ is the Kronecker delta, $m$ is the mask token, and $c$ is the concatenated conditioning context (content embeddings, speaker embedding, reference prosody tokens, reference acoustic tokens, and the time step $t$).
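A minimal sketch of drawing an intermediate state $x_t$ from this mixture path, assuming a linear scheduler $\kappa_t = t$ and an illustrative mask-token id; names and shapes are assumptions, not the paper's implementation:

```python
import torch

MASK_ID = 1024  # assumed id of the mask token (illustrative)

def kappa(t: torch.Tensor) -> torch.Tensor:
    """Monotonic scheduler; a linear schedule kappa_t = t is assumed here."""
    return t

def sample_xt(x1: torch.Tensor, t: torch.Tensor) -> torch.Tensor:
    """Draw x_t from the mixture path
       p_t(x_t^i | x_1^i) = (1 - kappa_t) * delta_m(x_t^i) + kappa_t * delta_{x_1^i}(x_t^i),
    i.e. each target token is revealed with probability kappa_t, otherwise masked.

    x1: [B, L] target token ids;  t: [B] times in [0, 1]
    """
    keep = torch.rand_like(x1, dtype=torch.float) < kappa(t)[:, None]
    return torch.where(keep, x1, torch.full_like(x1, MASK_ID))
```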
2. Factorized Attribute Modeling
A central advance is the factorization of the TTS output into distinct prosodic and acoustic streams. The discrete tokens for prosody and acoustics are generated in parallel, with each head learning to predict its aspect-specific distribution. This design allows:
- Direct modeling of prosodic attributes (intonation, rhythm, F0 contour) for improved expressiveness.
- Fine-grained control of acoustic details (energy, spectral shape) for speaker style preservation.
- Unified non-autoregressive inference for all streams, avoiding entanglement between streams and autoregressive artifacts such as unnatural repetition.
The architecture supports in-context learning by conditioning the generation process on prosodic and acoustic tokens extracted from a reference speech sample, alongside the text-derived content. This enables zero-shot voice cloning from only a few seconds of reference audio.
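The two-head factorization can be sketched as follows, assuming a shared Transformer trunk feeding separate prosody and acoustic output heads; layer sizes, codebook sizes, and the conditioning interface are assumptions, not the paper's configuration:

```python
import torch
import torch.nn as nn

class FactorizedDenoiser(nn.Module):
    """Illustrative factorized denoiser: a shared Transformer trunk with two
    output heads, one over the prosody codebook and one over the acoustic
    codebook. All sizes and layer choices are assumptions."""
    def __init__(self, vocab_pros=1025, vocab_acou=1025, dim=512, n_layers=6):
        super().__init__()
        self.embed_pros = nn.Embedding(vocab_pros, dim)
        self.embed_acou = nn.Embedding(vocab_acou, dim)
        self.cond_proj = nn.Linear(dim, dim)            # projects conditioning context
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, n_layers)
        self.head_pros = nn.Linear(dim, vocab_pros)     # prosody logits
        self.head_acou = nn.Linear(dim, vocab_acou)     # acoustic logits

    def forward(self, xt_pros, xt_acou, cond):
        # xt_pros, xt_acou: [B, L] partially masked token ids
        # cond:             [B, L, dim] content / speaker / reference conditioning
        h = self.embed_pros(xt_pros) + self.embed_acou(xt_acou) + self.cond_proj(cond)
        h = self.trunk(h)
        return self.head_pros(h), self.head_acou(h)     # per-stream posteriors
```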
3. Purely Discrete Flow Matching Strategy
Prior flow-matching and diffusion TTS systems typically embed discrete quantizer tokens into a continuous space before generative modeling. DiFlow-TTS instead performs flow matching directly in discrete token space. The flow matching process defines a probability mixture path from the source to the target token sequence, and the denoiser is trained to estimate the aspect-specific transition distributions without continuous relaxation. For each token stream, at each time $t \in [0, 1]$:
- The source sequence consists of all tokens set to a mask token.
- The scheduler $\kappa_t$ governs the transition mixture between source and target.
- The denoiser predicts the posterior token distribution in the discrete space.
This strategy avoids the efficiency loss and mode collapse associated with continuous approximations and enables rapid, high-fidelity generation with a small number of function evaluations.
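A minimal single-stream sketch of this non-autoregressive sampling procedure, assuming a linear scheduler and an iterative mask-and-resample update; the exact unmasking rule and the `denoiser` interface are illustrative, not the paper's:

```python
import torch

@torch.no_grad()
def dfm_sample(denoiser, cond, length, mask_id, nfe=16, device="cpu"):
    """Illustrative discrete-flow sampling: start from an all-mask sequence and,
    over `nfe` steps, draw candidate tokens from the model posterior, then keep
    each position with probability kappa_{t_{k+1}} (linear schedule assumed),
    re-masking the rest."""
    x = torch.full((1, length), mask_id, dtype=torch.long, device=device)
    ts = torch.linspace(0.0, 1.0, nfe + 1, device=device)
    for k in range(nfe):
        logits = denoiser(x, cond)                                  # [1, L, vocab]
        probs = torch.softmax(logits, dim=-1)
        cand = torch.distributions.Categorical(probs).sample()      # candidate x_1
        keep = torch.rand(1, length, device=device) < ts[k + 1]     # kappa at next time
        x = torch.where(keep, cand, torch.full_like(x, mask_id))
    return x                                                        # fully unmasked at t = 1
```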
4. Experimental Evaluation and Low-Latency Inference
Empirical results on the LibriSpeech test-clean dataset and controlled speaker adaptation setups show that DiFlow-TTS achieves:
- Naturalness: UTMOS of $3.98$, within $0.11$ of ground truth and trailing only Spark-TTS ($4.31$).
- Prosody: F0 RMSE of $7.97$, with correspondingly low energy RMSE—setting state-of-the-art values for pitch and energy control.
- Speaker style: Both SIM-R ($0.54$) and SIM-O indicate robust preservation of speaker identity.
- Robustness: Word error rate (WER) of $0.05$, demonstrating strong linguistic fidelity.
Crucially, DiFlow-TTS is highly efficient. The FDFD module requires as few as $16$ non-autoregressive function evaluations (NFE), delivering a real-time factor (RTF) of about $0.066$—substantially faster than recent autoregressive and diffusion baselines. Concretely, an RTF of $0.066$ means roughly $10$ seconds of speech are synthesized in about $0.66$ seconds. This low-latency operation enables practical deployment in interactive or real-time systems.
The following table summarizes selected performance comparisons (per paper tables):
| Model | UTMOS | F0 RMSE | SIM-R | RTF |
|---|---|---|---|---|
| Spark-TTS | 4.31 | — | — | — |
| DiFlow-TTS | 3.98 | 7.97 | 0.54 | 0.066 |
| NaturalSpeech2 | — | — | — | 0.1 |
| VoiceCraft | — | — | — | 0.1 |
5. Applications and Implications
DiFlow-TTS demonstrates strong applicability for:
- Personalized assistant systems, which require rapid voice adaptation from minimal user data.
- Accessibility and language technology, especially for low-resource languages or speaker populations with limited training data.
- Content creation, dubbing, and virtual avatars—allowing flexible attribute cloning and style control.
- Real-time and edge deployment, leveraging the compact model size and fast inference.
The explicit factorization and purely discrete approach suggest further exploration of discrete generative strategies in speech and other sequential domains. Incorporating additional factorization dimensions (emotion, language-specific prosody, etc.) may yield further gains in controllability.
A plausible implication is that discrete flow matching provides an architecture for future TTS systems optimized for efficiency, attribute disentanglement, and robust zero-shot voice transfer.
6. Context Within Generative Speech Modeling
DiFlow-TTS builds on and contrasts with a series of innovations in speech synthesis:
- Discrete diffusion models (e.g., DCTTS (Wu et al., 2023)) compress spectrograms into discrete tokens and perform latent diffusion for resource efficiency.
- Factorized architectures such as E1 TTS (Liu et al., 14 Sep 2024), F5-TTS (Chen et al., 9 Oct 2024), and DPI-TTS (Qi et al., 18 Sep 2024) provide alternative formulations, but generally utilize continuous or hybrid token spaces.
- Prior flow-matching and diffusion models embed tokens before processing; DiFlow-TTS's purely discrete formulation removes this bottleneck.
This suggests that future research may increasingly focus on direct discrete generative modeling, informed by efficient factorization and advanced tokenization schemes.
7. Limitations and Future Directions
The present architecture (per paper) achieves strong performance but leaves open questions regarding further attribute factorization and scalability to even larger token vocabularies. Improvements in the granularity of attribute control, multilingual extension, and continued reduction of inference cost are ongoing areas of research.
Extensions to emotional speech synthesis, multimodal attribute conditioning, and deployment on resource-constrained hardware platforms are plausible next steps. Additionally, refining the factorized flow prediction heads and the in-context learning mechanisms may result in finer control over nuanced speaker and prosody attributes.
The DiFlow-TTS model represents a significant contribution to efficient, zero-shot text-to-speech synthesis, leveraging innovations in discrete flow matching, factorized token prediction, and compact non-autoregressive architecture to approach state-of-the-art naturalness, speaker style preservation, and inference latency (Nguyen et al., 11 Sep 2025).