OZSpeech: Zero-Shot TTS with One-Step Inference
- OZSpeech is a zero-shot text-to-speech synthesis framework that leverages optimal-transport conditional flow matching and factorized speech token representations for robust one-step inference.
- It decomposes speech into six discrete token streams from a pre-trained neural codec, enabling precise control over content, prosody, acoustics, and timbre.
- The system achieves faster-than-real-time synthesis with a single function evaluation, outperforming autoregressive and multi-step baselines in efficiency and intelligibility while remaining competitive in speaker cloning.
OZSpeech is a zero-shot text-to-speech (TTS) synthesis framework for efficient, high-quality, speaker-adaptive speech generation. It integrates a learned discrete prior, factorized speech token representations, and an optimal-transport conditional flow-matching (OT-CFM) mechanism to enable single-step inference. Unlike autoregressive codec-LLMs and multi-step diffusion or flow-matching methods, OZSpeech achieves robust zero-shot speaker cloning and accurate speech synthesis at significantly reduced compute and sampling cost by pairing a factorized neural codec with a novel application of OT-CFM that starts from a learned prior (Huynh-Nguyen et al., 19 May 2025).
1. Motivation and Model Overview
OZSpeech addresses the principal challenges of zero-shot TTS: speaker generalization from minimal prompts, robust disentanglement of speech attributes (content, prosody, acoustic detail, timbre), and sampling efficiency. Conventional models, whether autoregressive codec-LLMs (e.g., VALL-E) or multi-step flow-matching systems (e.g., E2 TTS, F5-TTS), suffer from inefficiencies due to tens to hundreds of iterative sampling steps and entangled attribute modeling, limiting real-time utility and speaker/style fidelity.
OZSpeech decomposes speech into six discrete token streams (from a pre-trained factorized neural codec "FACodec": two content, one prosody, three acoustic), and a timbre embedding. A neural prior generator predicts intermediate token codes from phoneme sequences, positioned to approximate the FACodec codes of target speech. A conditional vector field estimator then refines these tokens to FACodec-aligned codes in a single step via an OT-CFM objective, where all previous state dependencies are removed, and the learned prior acts as the starting distribution.
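The overall flow can be summarized at the tensor-shape level. The sketch below is illustrative only: module outputs are replaced by random stand-ins, and all dimensions (codebook size, timbre embedding size, frame counts) are assumptions rather than values from the paper.

```python
import torch

# Shape-level walk-through of the pipeline; all sizes are illustrative
# assumptions, not values from the paper.
B, T_ph, T_fr = 2, 50, 200      # batch, phoneme count, codec frame count
NUM_STREAMS, VOCAB = 6, 1024    # 1 prosody + 2 content + 3 acoustic streams; assumed codebook size

# 1) Phoneme IDs for the target utterance (dummy values).
phonemes = torch.randint(0, 100, (B, T_ph))

# 2) Prior codes generator: predicts intermediate codes approximating the
#    FACodec codes of the target speech (stand-in random output here).
prior_codes = torch.randint(0, VOCAB, (B, NUM_STREAMS, T_fr))

# 3) Speaker prompt: FACodec codes of a short reference clip (content streams
#    are masked downstream so only prosody/acoustic style is injected), plus a
#    global timbre embedding.
prompt_codes = torch.randint(0, VOCAB, (B, NUM_STREAMS, 75))
timbre = torch.randn(B, 256)    # assumed embedding dimension

# 4) Vector field estimator: one OT-CFM step refines the prior codes into
#    FACodec-aligned codes, which the codec decoder converts to a waveform.
refined_codes = prior_codes     # placeholder for the single refinement step
print(refined_codes.shape, timbre.shape)
```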
2. Factorized Codec Tokenization and Representation
OZSpeech utilizes FACodec, a pre-trained codec-based speech tokenization framework. FACodec maps an input waveform $y$ to an encoder output $h = \mathrm{Enc}(y)$, which is then quantized via three independently trained factorized vector quantizers (FVQs):
- Prosody: one code sequence $z^{p}$,
- Content: two code sequences $z^{c_1}, z^{c_2}$,
- Acoustic details: three code sequences $z^{a_1}, z^{a_2}, z^{a_3}$.
Concatenation yields $z = [z^{p}; z^{c_1}; z^{c_2}; z^{a_1}; z^{a_2}; z^{a_3}]$, six discrete codes per frame drawn from the FVQ codebooks. Timbre is modeled by a global embedding extracted from the reference speech. Quantizer track identifiers are added to distinguish the streams, and a folding operation projects the six parallel code streams into a single sequence, permitting efficient Transformer-based modeling of the joint code streams.
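A minimal sketch of the folding step, assuming each of the six streams is embedded, tagged with a learned quantizer-track identifier, and concatenated along the channel dimension per frame; the embedding and codebook sizes are assumed, and OZSpeech's actual folding may differ in detail.

```python
import torch
import torch.nn as nn

NUM_STREAMS, VOCAB, D = 6, 1024, 128   # assumed: six streams, codebook size, embedding dim

# One embedding table per quantizer stream, plus a learned track identifier,
# mirroring the "quantizer track identifiers" described above.
embeddings = nn.ModuleList(nn.Embedding(VOCAB, D) for _ in range(NUM_STREAMS))
track_id = nn.Parameter(torch.zeros(NUM_STREAMS, D))

def fold(codes: torch.Tensor) -> torch.Tensor:
    """Fold codes of shape (B, NUM_STREAMS, T) into one sequence (B, T, NUM_STREAMS * D)."""
    per_stream = [embeddings[q](codes[:, q]) + track_id[q] for q in range(NUM_STREAMS)]
    return torch.cat(per_stream, dim=-1)   # (B, T, NUM_STREAMS * D)

codes = torch.randint(0, VOCAB, (2, NUM_STREAMS, 200))
print(fold(codes).shape)   # torch.Size([2, 200, 768])
```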
3. Learned-Prior Conditional Flow Matching Framework
The TTS pipeline components are:
Prior Codes Generator
The prior generator models the joint distribution $p_\phi(\hat{x}_0 \mid c)$ over the six quantizer streams given the phoneme sequence $c$, with an associated token-prediction loss $\mathcal{L}_{\text{prior}}$ over the discrete codes. A duration-predictor loss $\mathcal{L}_{\text{dur}}$ aligns phoneme durations to the code-sequence length.
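For illustration, a plausible training objective for such a prior combines a per-stream token-prediction (cross-entropy) loss with a FastSpeech-style duration loss in the log domain; the exact formulation and weighting used by OZSpeech are not reproduced here.

```python
import torch
import torch.nn.functional as F

NUM_STREAMS, VOCAB = 6, 1024   # assumed codebook size

def prior_losses(code_logits, target_codes, log_dur_pred, dur_target):
    """Hypothetical training losses for the prior codes generator.

    code_logits:  (B, NUM_STREAMS, T, VOCAB) predicted code distributions
    target_codes: (B, NUM_STREAMS, T)        ground-truth FACodec codes
    log_dur_pred: (B, T_ph)                  predicted log-durations per phoneme
    dur_target:   (B, T_ph)                  ground-truth durations in frames
    """
    # Token-prediction loss over all six quantizer streams.
    l_prior = F.cross_entropy(code_logits.flatten(0, 2), target_codes.flatten())
    # Duration loss in the log domain (FastSpeech-style assumption).
    l_dur = F.mse_loss(log_dur_pred, torch.log(dur_target.float() + 1.0))
    return l_prior, l_dur

# Dummy usage with random tensors.
B, T, T_ph = 2, 200, 50
logits = torch.randn(B, NUM_STREAMS, T, VOCAB)
targets = torch.randint(0, VOCAB, (B, NUM_STREAMS, T))
print(prior_losses(logits, targets, torch.randn(B, T_ph), torch.randint(1, 20, (B, T_ph))))
```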
Conditional OT-CFM
Classical conditional flow matching considers a noise sample $x_0 \sim \mathcal{N}(0, I)$ and a data sample $x_1 \sim q(x_1)$, and linearly mixes states via $x_t = (1 - (1 - \sigma_{\min})\,t)\,x_0 + t\,x_1$, $t \in [0, 1]$. The OT-CFM target is the conditional vector field $u_t(x_t \mid x_1) = x_1 - (1 - \sigma_{\min})\,x_0$, giving
$$\mathcal{L}_{\text{CFM}} = \mathbb{E}_{t,\,x_1,\,x_0}\,\big\| v_\theta(x_t, t) - \big(x_1 - (1 - \sigma_{\min})\,x_0\big) \big\|^2 .$$
OZSpeech replaces the Gaussian start $x_0$ with the learned prior $\hat{x}_0$ produced by the prior codes generator, optimizing an implicit time $t$ per sample. An anchor loss $\mathcal{L}_{\text{anchor}}$ ties the refined output back to the ground-truth discrete FACodec codes, and the total loss sums the prior, duration, flow-matching, and anchor terms:
$$\mathcal{L} = \mathcal{L}_{\text{prior}} + \mathcal{L}_{\text{dur}} + \mathcal{L}_{\text{CFM}} + \mathcal{L}_{\text{anchor}} .$$
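The core of the modified objective can be sketched in the continuous embedding space, with the Gaussian source replaced by embeddings of the learned prior codes. The vector field network, the handling of the implicit time $t$, and the anchor loss are omitted or simplified here; shapes and $\sigma_{\min}$ are assumptions.

```python
import torch
import torch.nn.functional as F

sigma_min = 1e-4   # assumed small value, as in standard OT-CFM setups

def ot_cfm_loss(v_theta, x1, x0_prior, t):
    """OT-CFM regression loss with a learned prior as the source sample.

    v_theta:  callable (x_t, t) -> predicted vector field, same shape as x_t
    x1:       (B, T, D) embeddings of the target FACodec codes
    x0_prior: (B, T, D) embeddings of the prior-generator codes (replaces Gaussian noise)
    t:        (B,)      per-sample (implicit) time in [0, 1]
    """
    t_ = t.view(-1, 1, 1)
    x_t = (1.0 - (1.0 - sigma_min) * t_) * x0_prior + t_ * x1   # OT interpolation path
    target = x1 - (1.0 - sigma_min) * x0_prior                   # conditional OT target field
    return F.mse_loss(v_theta(x_t, t), target)

# Dummy usage with a linear stand-in for the vector field estimator.
B, T, D = 2, 200, 768
net = torch.nn.Linear(D, D)
loss = ot_cfm_loss(lambda x, t: net(x), torch.randn(B, T, D), torch.randn(B, T, D), torch.rand(B))
loss.backward()
```

In the full objective, the prior, duration, and anchor terms described above would be added to this flow-matching term.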
4. One-Step Inference and System Efficiency
The one-step generative process consists of:
- Generate prior codes $\hat{x}_0$ from the input phoneme sequence.
- Apply folding and quantizer encoding to the concatenation of the content-masked prompt codes and the prior codes.
- Feed the folded codes and the implicit time $t$ to the vector field estimator $v_\theta$.
- Refine the prior codes to FACodec-aligned codes $\hat{x}_1$ in a single integration step of the estimated vector field.
- Decode $\hat{x}_1$ to a waveform with the FACodec decoder.
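A hedged sketch of the refinement step, assuming the prior embedding is treated as the state at the implicit time $t$ and advanced to $t = 1$ with a single Euler step; the exact update rule and conditioning in OZSpeech may differ.

```python
import torch

def one_step_refine(v_theta, x0_prior_emb, t):
    """Single Euler step of the learned ODE from the implicit time t to 1.

    v_theta:      callable (x, t) -> estimated vector field
    x0_prior_emb: (B, T, D) folded embeddings of the prior codes (the content-masked
                  prompt codes are assumed to be concatenated/conditioned upstream)
    t:            (B,) implicit per-sample time in [0, 1]
    """
    step = (1.0 - t).view(-1, 1, 1)                      # remaining integration length
    return x0_prior_emb + step * v_theta(x0_prior_emb, t)

# Dummy usage; the refined embeddings would then be mapped back to discrete
# FACodec codes and decoded to audio by the codec decoder.
B, T, D = 1, 200, 768
net = torch.nn.Linear(D, D)
x1_hat = one_step_refine(lambda x, t: net(x), torch.randn(B, T, D), torch.full((B,), 0.3))
print(x1_hat.shape)   # torch.Size([1, 200, 768])
```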
The number of function evaluations (NFE) is reduced to one, enabling faster-than-real-time synthesis. The real-time factor (RTF) is 0.26 seconds of compute per second of generated speech, compared with 0.7–1.7 for prior art, with only 145M trainable parameters plus 102M for the pre-trained FACodec.
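As a quick sanity check on what this figure means, RTF relates compute time to the duration of the audio produced:

$$\mathrm{RTF} = \frac{\text{synthesis time}}{\text{audio duration}}, \qquad \mathrm{RTF} = 0.26 \;\Longrightarrow\; 10~\text{s of speech in} \approx 2.6~\text{s, i.e. about } 3.8\times \text{ real time}.$$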
5. Speaker Cloning, Robustness, and Evaluation
OZSpeech exploits "acoustic prompt injection" by masking content codes within the prompt, conditioning generation on prosody and acoustic tokens for speaker style preservation. The learned prior aligns generated codes with text duration, allowing the vector field to apply speaker-specific pitch and timbre adaptation. Word Error Rate (WER) remains stable (~0.05) down to 0 dB SNR, substantially surpassing baseline noise robustness.
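A minimal sketch of the content-masking idea behind acoustic prompt injection, assuming the two content streams sit at fixed positions among the six streams and that a reserved mask token exists; stream ordering, frame rate, and masking mechanics are illustrative assumptions.

```python
import torch

NUM_STREAMS, VOCAB = 6, 1024
MASK_ID = VOCAB                 # assumed extra token index reserved for masking
CONTENT_STREAMS = (1, 2)        # assumed positions of the two content streams
                                # (stream 0 = prosody, streams 3-5 = acoustic details)

def mask_prompt_content(prompt_codes: torch.Tensor) -> torch.Tensor:
    """Mask the content codes of the speaker prompt so that only prosody and
    acoustic-detail tokens condition generation (content comes from the text)."""
    masked = prompt_codes.clone()
    masked[:, list(CONTENT_STREAMS), :] = MASK_ID
    return masked

prompt = torch.randint(0, VOCAB, (1, NUM_STREAMS, 75))   # 3 s prompt at an assumed frame rate
print(mask_prompt_content(prompt)[0, :, 0])              # content rows show the mask token
```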
Key empirical results with 3 s prompt length:
| Metric | OZSpeech | Best Baseline Range |
|---|---|---|
| WER | 0.05 | 0.09–0.24 |
| UTMOS | 3.15 | 3.55–3.76 |
| SIM-O/SIM-R | 0.40/0.47 | 0.31–0.53/0.38–0.51 |
| F0 Accuracy | 0.81 | ≤0.69 |
| RMSE (F0) | 11.96 | ≥12.96 |
| NFE | 1 | 32–200 |
| RTF (sec/sec) | 0.26 | 0.7–1.7 |
| Model size (M) | 145 | 378–830 |
Ablation studies demonstrate that arbitrary segment prompting outperforms using only the first segment. OZSpeech-Small (100M parameters) matches the performance of the Base model (145M) on WER and UTMOS, with a minimal tradeoff in prosody.
6. Limitations and Prospects
OZSpeech exhibits slight temporal distortions from rounding in the FastSpeech duration predictor, and its UTMOS naturalness score trails some larger models. Prospective improvements include alternative alignment strategies (e.g., Monotonic Alignment Search) to enhance temporal fidelity and adaptive noise filtering. Multilingual and multimodal zero-shot synthesis constitute additional future directions.
7. Comparative Context and Significance
OZSpeech constitutes the first application of optimal-transport conditional flow matching with a learned discrete prior and single-step sampling in TTS, using a fully disentangled, factorized token representation. This approach departs from previous zero-resource and zero-shot paradigms by both eliminating multi-step sampling and enabling fine-grained attribute control within a compact architecture. The combination of state-of-the-art content accuracy (WER ≈ 0.05), robust prosody/style transfer, resistance to noise, and high sampling efficiency establishes a new practical benchmark for zero-shot TTS, with implications for scalable, real-time, speaker-adaptive speech synthesis (Huynh-Nguyen et al., 19 May 2025).