OZSpeech: Zero-Shot TTS with One-Step Inference

Updated 18 November 2025
  • OZSpeech is a zero-shot text-to-speech synthesis framework that leverages optimal-transport conditional flow matching and factorized speech token representations for robust one-step inference.
  • It decomposes speech into six discrete token streams from a pre-trained neural codec, enabling precise control over content, prosody, acoustics, and timbre.
  • The system achieves real-time synthesis with reduced computation, outperforming traditional autoregressive and multi-step methods in efficiency and speaker cloning.

OZSpeech is a zero-shot text-to-speech (TTS) synthesis framework that advances efficient, high-quality, and speaker-adaptive speech generation by integrating a learned discrete prior, factorized speech token representations, and an optimal-transport conditional flow-matching (OT-CFM) mechanism to enable single-step inference. Unlike autoregressive codec–LLMs and multi-step diffusion or flow-matching methods, OZSpeech achieves robust zero-shot speaker cloning and accurate speech synthesis with significantly reduced compute and sampling cost by leveraging a factorized neural codec and a novel application of OT-CFM with a learned prior (Huynh-Nguyen et al., 19 May 2025).

1. Motivation and Model Overview

OZSpeech addresses principal zero-shot TTS challenges, specifically speaker generalization from minimal prompts, robust disentanglement of speech attributes (content, prosody, acoustic detail, timbre), and sampling efficiency. Conventional models—either autoregressive codec-LLMs (e.g., VALL-E) or multi-step flow-matching systems (e.g., E2 TTS, F5-TTS)—suffer from inefficiencies due to hundreds of iterative sampling steps and entangled attribute modeling, limiting real-time utility and speaker/style fidelity.

OZSpeech decomposes speech into six discrete token streams (from a pre-trained factorized neural codec "FACodec": two content, one prosody, three acoustic), and a timbre embedding. A neural prior generator $f_\psi(\text{text})$ predicts intermediate token codes from phoneme sequences, positioned to approximate the FACodec codes of target speech. A conditional vector field estimator $v_\theta$ then refines these tokens to FACodec-aligned codes in a single step via an OT-CFM objective, where all previous state dependencies are removed, and the learned prior acts as the starting distribution.

2. Factorized Codec Tokenization and Representation

OZSpeech utilizes FACodec, a pre-trained codec-based speech tokenization framework. FACodec maps input waveforms $x$ to an encoder output $h = f_{\text{enc}}(x) \in \mathbb{R}^{T \times D}$, which is then quantized via three independently trained factorized vector quantizers (FVQs):

  • Prosody: $N_p = 1$ sequence, $f_p(h) \in \mathbb{R}^{T \times 1}$
  • Content: $N_c = 2$ sequences, $f_c(h) \in \mathbb{R}^{T \times 2}$
  • Acoustic details: $N_a = 3$ sequences, $f_a(h) \in \mathbb{R}^{T \times 3}$

Concatenation yields $z = \text{Concat}(f_p(h), f_c(h), f_a(h)) \in \mathbb{R}^{T \times 6}$, with discrete codes in $\{0, \dots, 1023\}$. Timbre is modeled by $z_t = \text{TemporalPooling}(\text{Conformer}(h)) \in \mathbb{R}^D$. Quantizer track identifiers $\omega_i \in \mathbb{R}^D$ are added, and a folding operation projects $6 \times L \times D \rightarrow L \times D'$, permitting efficient Transformer-based modeling of joint code streams.
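A minimal sketch of this token layout and folding step is shown below. The names (`fold_codes`, `track_embed`) and the dimensions are illustrative assumptions, not the authors' implementation; FACodec itself is treated as a black box that has already produced the six code streams.

```python
import torch
import torch.nn as nn

# Illustrative sketch, not the paper's code: embed each of the six discrete
# streams, add its quantizer-track identifier omega_i, and fold the result
# into a single sequence for a standard Transformer.

T, D, N_STREAMS, CODEBOOK = 200, 256, 6, 1024  # frames, embed dim, 1 prosody + 2 content + 3 acoustic (assumed sizes)

code_embed = nn.Embedding(CODEBOOK, D)                   # shared code embedding table
track_embed = nn.Parameter(torch.randn(N_STREAMS, D))    # track identifiers omega_i

def fold_codes(z: torch.Tensor) -> torch.Tensor:
    """Fold six discrete code streams (T, 6) into one sequence (T, 6*D)."""
    emb = code_embed(z)                                  # (T, 6, D)
    emb = emb + track_embed.unsqueeze(0)                 # add omega_i per stream
    return emb.reshape(z.shape[0], N_STREAMS * D)        # (T, D') with D' = 6*D

z = torch.randint(0, CODEBOOK, (T, N_STREAMS))           # concatenated prosody/content/acoustic codes
folded = fold_codes(z)
print(folded.shape)                                      # torch.Size([200, 1536])
```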

3. Learned-Prior Conditional Flow Matching Framework

The TTS pipeline components are:

Prior Codes Generator

The prior generator $f_\psi$ models the joint distribution over quantizer streams:

$p(q_{1:6} \mid p; \psi) = p(q_1 \mid p; f_\psi^1) \cdot \prod_{j=2}^{6} p(q_j \mid q_{j-1}; f_\psi^j)$

with associated loss $\mathcal{L}_{\text{prior}} = -\mathbb{E}_{(p, q_{1:6})} \log p(q_{1:6} \mid p; \psi)$. A duration-predictor loss $\mathcal{L}_{\text{dur}}$ aligns phonemes to code length.
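As a rough illustration of how this factorized negative log-likelihood could be computed, the sketch below sums a cross-entropy term per stream. `stream_logits` is a hypothetical stand-in for the per-stream heads of $f_\psi$ (the first conditioned on phonemes only, each subsequent one on the previous stream); the actual conditioning network is not shown.

```python
import torch
import torch.nn.functional as F

def prior_nll(stream_logits: list[torch.Tensor], q: torch.Tensor) -> torch.Tensor:
    """stream_logits: list of 6 tensors, each (T, codebook_size), produced by f_psi.
    q: ground-truth codes, shape (T, 6). Returns L_prior as a scalar."""
    nll = torch.zeros(())
    for j, logits in enumerate(stream_logits):
        nll = nll + F.cross_entropy(logits, q[:, j])   # -log p(q_j | q_{j-1}, phonemes)
    return nll
```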

Conditional OT-CFM

Classical conditional flow matching considers $x_0 \sim \mathcal{N}(0, I)$, $x_1 \sim p_1$, and linearly mixes states via $x_t = t x_1 + (1-t) x_0$, $t \sim \mathcal{U}[0,1]$. The OT-CFM target is:

$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{t, x_0, x_1} \left\| v_\theta(x_t, t) - \frac{x_1 - x_t}{1-t} \right\|^2$
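A minimal sketch of this standard Gaussian-source objective, assuming $x_1$ is a batch of target embeddings of shape (batch, dim) and `v_theta` is any callable vector-field estimator:

```python
import torch

def cfm_loss(v_theta, x1: torch.Tensor) -> torch.Tensor:
    """x1: (batch, dim) target samples; v_theta(x_t, t) -> (batch, dim)."""
    x0 = torch.randn_like(x1)                        # Gaussian source sample
    t = torch.rand(x1.shape[0], 1, device=x1.device) # t ~ U[0, 1], one per sample
    xt = t * x1 + (1.0 - t) * x0                     # linear (OT) interpolation
    target = (x1 - xt) / (1.0 - t).clamp(min=1e-4)   # conditional target velocity
    return ((v_theta(xt, t) - target) ** 2).mean()
```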

OZSpeech replaces the Gaussian start with the learned prior $x_{\text{pr}} \approx x_t$, optimizing an implicit $\tau$ per sample:

$\mathcal{L}_{\text{CFM}}(\theta) = \mathbb{E}_{x_{\text{pr}}, x_1} \left\| v_\theta(x_{\text{pr}}, \tau) - \frac{x_1 - x_{\text{pr}}}{1-\tau} \right\|^2$

An anchor loss is used for discrete codes:

$\mathcal{L}_{\text{anchor}} = -\mathbb{E}_{z_1, \tilde z_1} \log p(z_1 \mid \tilde z_1; \phi)$

with the total loss:

$\mathcal{L}_{\text{total}} = \mathcal{L}_{\text{prior}} + \mathcal{L}_{\text{dur}} + \mathcal{L}_{\text{CFM}} + \mathcal{L}_{\text{anchor}}$
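The sketch below combines the learned-prior CFM term and the anchor term under the notation above. The interface (`anchor_head`, tensor shapes, how $\tau$ is supplied) is an assumption for illustration; $\mathcal{L}_{\text{prior}}$ and $\mathcal{L}_{\text{dur}}$ would be added separately to form $\mathcal{L}_{\text{total}}$.

```python
import torch
import torch.nn.functional as F

def ozspeech_losses(v_theta, anchor_head, x_pr, x1, z1, tau):
    """x_pr: continuous embedding of prior codes (T, D'); x1: target embedding (T, D');
    z1: target discrete codes (T, 6); tau: implicit time in (0, 1)."""
    target = (x1 - x_pr) / (1.0 - tau)                   # OT-CFM target measured from the prior
    l_cfm = ((v_theta(x_pr, tau) - target) ** 2).mean()

    x1_hat = x_pr + (1.0 - tau) * v_theta(x_pr, tau)     # one-step estimate of x1
    logits = anchor_head(x1_hat)                         # (T, 6, codebook_size), hypothetical head phi
    l_anchor = F.cross_entropy(logits.reshape(-1, logits.shape[-1]), z1.reshape(-1))

    return l_cfm, l_anchor                               # add L_prior and L_dur for L_total
```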

4. One-Step Inference and System Efficiency

The one-step generative process consists of:

  1. Generate prior codes $x_{\text{pr}} = f_\psi(\text{phonemes})$.
  2. Fold and embed the concatenation of the content-masked prompt codes and the prior codes.
  3. Feed the folded codes and the implicit time $\tau$ to the vector field estimator $v_\theta$.
  4. Refine:

$\tilde z_1 = x_{\text{pr}} + (1-\tau) \cdot F^{-1}\big(v_\theta(\overline{z}_{\text{pr}}, \tau)\big)$

  5. Decode $\tilde z_1$ to waveform with FACodec.
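An end-to-end sketch of the five steps follows. The callables (`prior_generator`, `fold`, `unfold`, `vector_field`, `facodec_decode`) stand in for the components described in Sections 2–3 and are assumptions about the interfaces, not the authors' exact API.

```python
import torch

@torch.no_grad()
def synthesize_one_step(phonemes, prompt_codes, tau,
                        prior_generator, fold, unfold, vector_field, facodec_decode):
    x_pr = prior_generator(phonemes)               # 1. prior codes from text
    z_in = fold(prompt_codes, x_pr)                # 2. fold content-masked prompt + prior codes
    v = vector_field(z_in, tau)                    # 3. single call to the vector-field estimator
    z1_tilde = x_pr + (1.0 - tau) * unfold(v)      # 4. one-step refinement toward FACodec codes
    return facodec_decode(z1_tilde)                # 5. waveform synthesis with the codec decoder
```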

The number of function evaluations (NFE) is reduced to one, enabling real-time synthesis. The real-time factor (RTF) is 0.26 seconds of compute per second of generated speech, compared to 0.7–1.7 for prior art, with only 145M trainable parameters plus 102M for FACodec.

5. Speaker Cloning, Robustness, and Evaluation

OZSpeech exploits "acoustic prompt injection" by masking content codes within the prompt, conditioning generation on prosody and acoustic tokens for speaker style preservation. The learned prior aligns generated codes with text duration, allowing the vector field to apply speaker-specific pitch and timbre adaptation. Word Error Rate (WER) remains stable (~0.05) down to 0 dB SNR, substantially surpassing baseline noise robustness.
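A minimal sketch of this content masking is given below. The stream ordering (prosody, then two content, then three acoustic) and the use of a dedicated mask index outside the 0–1023 codebook are assumptions for illustration.

```python
import torch

MASK_ID = 1024                 # assumed special index outside the 0..1023 codebook
CONTENT_STREAMS = (1, 2)       # assumed positions of the two content streams in z

def mask_prompt_content(prompt_codes: torch.Tensor) -> torch.Tensor:
    """prompt_codes: (T_prompt, 6) discrete FACodec codes of the reference audio.
    Returns a copy with the content streams masked, so only prosody and
    acoustic tokens condition generation."""
    masked = prompt_codes.clone()
    masked[:, list(CONTENT_STREAMS)] = MASK_ID
    return masked
```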

Key empirical results with 3 s prompt length:

| Metric | OZSpeech | Baseline Range |
|---|---|---|
| WER | 0.05 | 0.09–0.24 |
| UTMOS | 3.15 | 3.55–3.76 |
| SIM-O / SIM-R | 0.40 / 0.47 | 0.31–0.53 / 0.38–0.51 |
| F0 Accuracy | 0.81 | ≤ 0.69 |
| F0 RMSE | 11.96 | ≥ 12.96 |
| NFE | 1 | 32–200 |
| RTF (sec/sec) | 0.26 | 0.7–1.7 |
| Model size (M params) | 145 | 378–830 |

Ablation studies demonstrate that arbitrary segment prompting outperforms using only the first segment. OZSpeech-Small (100M parameters) matches the performance of the Base model (145M) on WER and UTMOS, with a minimal tradeoff in prosody.

6. Limitations and Prospects

OZSpeech exhibits slight temporal distortions from rounding in the FastSpeech duration predictor, and its UTMOS naturalness score trails some larger models. Prospective improvements include alternative alignment strategies (e.g., Monotonic Alignment Search) to enhance temporal fidelity and adaptive noise filtering. Multilingual and multimodal zero-shot synthesis constitute additional future directions.

7. Comparative Context and Significance

OZSpeech constitutes the first application of optimal-transport conditional flow matching with a learned discrete prior and single-step sampling in TTS, using a fully disentangled, factorized token representation. This approach departs from previous zero-resource and zero-shot paradigms by both eliminating multi-step sampling and enabling fine-grained attribute control within a compact architecture. The combination of state-of-the-art content accuracy (WER ≈ 0.05), robust prosody/style transfer, resistance to noise, and high sampling efficiency establishes a new practical benchmark for zero-shot TTS, with implications for scalable, real-time, speaker-adaptive speech synthesis (Huynh-Nguyen et al., 19 May 2025).

