HierSpeech++: Hierarchical Zero-Shot Speech Synthesis

Updated 14 November 2025
  • The paper introduces HierSpeech++, a hierarchical variational inference framework that integrates a multi-level VAE with normalizing flows and adversarial decoding to achieve zero-shot speech synthesis.
  • It employs dual audio encoders and a Text-to-Vec module to disentangle semantic and acoustic features, ensuring robust voice conversion and text-to-speech synthesis even with limited or noisy data.
  • Experimental results demonstrate human-level naturalness and speaker similarity with fast inference and superior efficiency, setting new benchmarks in zero-shot TTS and VC.

HierSpeech++ is a hierarchical variational inference-based framework for zero-shot speech synthesis, targeting both text-to-speech (TTS) and voice conversion (VC), and designed to bridge the gap between self-supervised semantic speech representation and high-fidelity waveform generation. Distinct from autoregressive and LLM-based approaches, HierSpeech++ integrates a multi-level latent variable conditional variational autoencoder (VAE) equipped with normalizing flows, adversarial decoding, and an efficient super-resolution module to enable fast, robust, and high-quality zero-shot performance, even with limited or noisy training data. The system establishes new benchmarks in both naturalness and speaker similarity for zero-shot speech synthesis, achieving, for the first time, human-level quality according to subjective and objective metrics (Lee et al., 2023).

1. Hierarchical Variational Inference Architecture

HierSpeech++ implements a hierarchical latent-variable framework, combining a multi-level VAE with normalizing flows (specifically, bidirectional Transformer flows, or BiT-Flow) and adversarial decoders. The architecture explicitly models and disentangles semantic and acoustic information via two principal latent variables:

  • $z_{sr}$ (“semantic” latent): Encodes speaker-related and prosodic information inferred from perturbed self-supervised Wav2Vec-style features $x_{w2v}$ and a pitch representation $F_0$. This variable acts as the prior for subsequent acoustic adaptation.
  • $z_a$ (“acoustic” latent): Inferred from the raw waveform $x$ through a dual-audio posterior encoder, it serves as input to the neural waveform generator.

The probabilistic generative process is factorized as

$p_\theta(x, z_a, z_{sr} \mid c) = p_\theta(x \mid z_a)\, p_\theta(z_a \mid z_{sr})\, p_\theta(z_{sr} \mid c), \quad \text{where } c = [x_{w2v}, F_0]_{\mathrm{perturbed}}$

Training optimizes the evidence lower bound (ELBO):

$\log p_\theta(x \mid c) \geq \mathbb{E}_{q_\phi(z_{sr}, z_a \mid x, c)} \left[\log p_\theta(x \mid z_a)\right] - \mathrm{KL}\big(q_\phi(z_{sr} \mid x_{w2v}, F_0) \,\|\, p_\theta(z_{sr} \mid c)\big) - \mathbb{E}_{q_\phi(z_{sr} \mid x_{w2v}, F_0)} \big[\mathrm{KL}(q_\phi(z_a \mid x) \,\|\, p_\theta(z_a \mid z_{sr}))\big]$

Per-level decomposition:

  • Level 1 (Semantic): $\mathrm{KL}_1 = \mathrm{KL}\big(q(z_{sr} \mid x_{w2v}, F_0) \,\|\, p(z_{sr} \mid c)\big)$
  • Level 2 (Acoustic): $\mathrm{KL}_2 = \mathbb{E}_{q(z_{sr})}\big[\mathrm{KL}(q(z_a \mid x) \,\|\, p(z_a \mid z_{sr}))\big]$
  • Reconstruction: $L_{rec} = -\mathbb{E}\left[\log p(x \mid z_a)\right]$

Additional regularization terms include bidirectional flow regularization ($\lambda_{bi} \cdot \mathrm{KL}_{flow}$) to mitigate train–inference mismatch, prosody distillation ($\| \mathrm{Mel}_{1:20}(x) - \mathrm{Mel}_{1:20}(G_s(z_a, s)) \|_1$) to ensure $z_{sr}$ carries prosodic information, and a 10% unconditional generator drop-out to bolster zero-shot robustness.
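
For concreteness, the sketch below assembles these terms in PyTorch. It is a minimal sketch, assuming diagonal-Gaussian posteriors and priors passed as (mean, log-variance) pairs and a generic L1 reconstruction stand-in; the function names, shapes, and weightings are illustrative, not the authors' released implementation.

# Minimal two-level ELBO sketch (hypothetical names and shapes).
import torch
import torch.nn.functional as F

def kl_diag_gaussian(mu_q, logvar_q, mu_p, logvar_p):
    # KL(q || p) between diagonal Gaussians, summed over latent
    # dimensions and averaged over the batch.
    kl = 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    )
    return kl.sum(dim=-1).mean()

def hier_elbo_loss(x, x_hat, sem_post, sem_prior, ac_post, ac_prior,
                   lambda_bi=1.0, kl_flow=0.0):
    l_rec = F.l1_loss(x_hat, x)                    # stand-in for -log p(x|z_a)
    kl1 = kl_diag_gaussian(*sem_post, *sem_prior)  # KL(q(z_sr|x_w2v,F0) || p(z_sr|c))
    kl2 = kl_diag_gaussian(*ac_post, *ac_prior)    # KL(q(z_a|x) || p(z_a|z_sr))
    return l_rec + kl1 + kl2 + lambda_bi * kl_flow # + bidirectional-flow term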

2. Text-to-Vec Module for Zero-shot TTS

For TTS, HierSpeech++ employs a Text-to-Vec (TTV) module, itself structured as a VAE with monotonic alignment search (MAS). TTV takes as input a phoneme sequence and a prosody prompt, producing:

  • A semantic latent $z_t \approx x_{w2v}$, matching the self-supervised speech representation.
  • A predicted high-resolution F0 contour $\hat{f}$.

The TTV objective function is:

$\mathcal{L}_{ELBO}^{TTV} = \mathbb{E}_{q_\phi} \left[ \log p_\theta(x_{w2v} \mid z_t) \right] - \mathrm{KL}\big(q_\phi(z_t \mid T, s_p) \,\|\, p_\theta(z_t \mid T, s_p)\big) + \ell_{F0}(f, \hat{f}) + \ell_{CTC}$

Here, $\ell_{F0}(f, \hat{f})$ is the $L_1$ distance between the ground-truth F0 contour $f$ and the prediction $\hat{f}$; $\ell_{CTC}$ is a connectionist temporal classification loss promoting alignment between $z_t$ and the phoneme targets. At inference, TTV synthesizes $(z_{sr}, F_0)$ for the main speech synthesizer, conditioned on the text embedding $t_i$ and a prosody style $s_p$ derived from a reference utterance.
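
A hedged sketch of how these TTV terms might be combined, with the KL term computed as in the hierarchical loss above; tensor layouts and argument names are assumptions, and the CTC term uses PyTorch's standard ctc_loss:

# TTV objective sketch (hypothetical names).
import torch
import torch.nn.functional as F

def ttv_loss(w2v_true, w2v_pred, kl_term,
             f0_true, f0_pred,
             log_probs, phonemes, input_lens, target_lens):
    l_rec = F.l1_loss(w2v_pred, w2v_true)  # reconstruct x_w2v from z_t
    l_f0 = F.l1_loss(f0_pred, f0_true)     # ell_F0: L1 on the pitch contour
    # ell_CTC: log_probs is (T, batch, classes), log-softmaxed frame scores.
    l_ctc = F.ctc_loss(log_probs, phonemes, input_lens, target_lens)
    return l_rec + kl_term + l_f0 + l_ctc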

3. Neural Decoders and Speech Super-Resolution

HierSpeech++ synthesizes raw waveforms at 16 kHz using the Hierarchical Adaptive Generator (HAG):

  • Posterior Encoders:
    • Dual-audio (waveform via AMP blocks plus a linear-spectrogram encoder) produces $q(z_a \mid x)$.
    • Source–filter semantic encoder employs perturbed and unperturbed Wav2Vec layers plus an F0 encoder for $q(z_{sr} \mid x_{w2v}, F_0)$.
  • Priors ($p(z_{sr} \mid c)$, $p(z_a \mid z_{sr})$): parameterized by BiT-Flow with AdaLN-Zero conditioning on style $s$.
  • Neural Generators:
    • $G_s(z_a, s)$ produces a preliminary pitch $p_h$ ($\ell_{pitch} = \| p_x - p_h \|_1$).
    • $G_w(z_a, p_h, s)$ synthesizes the output waveform, optimized by $\ell_{STFT} = \|\mathrm{Sa}(x) - \mathrm{Sa}(\hat{y})\|_1$ and adversarial losses ($\mathcal{L}_{adv}(D)$, $\mathcal{L}_{adv}(G)$) with multi-period and multi-scale STFT discriminators; a multi-resolution STFT loss sketch follows this list.
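
The single-scale form of $\ell_{STFT}$ extends naturally to the multi-resolution magnitude loss commonly paired with such discriminators. A sketch, where the FFT sizes and Hann windows are typical choices rather than necessarily the paper's:

# Multi-resolution STFT magnitude loss sketch (assumed FFT sizes).
import torch

def multires_stft_loss(x, y, fft_sizes=(512, 1024, 2048)):
    # x, y: (batch, samples) real waveforms.
    loss = 0.0
    for n_fft in fft_sizes:
        win = torch.hann_window(n_fft, device=x.device)
        X = torch.stft(x, n_fft, hop_length=n_fft // 4, window=win,
                       return_complex=True).abs()
        Y = torch.stft(y, n_fft, hop_length=n_fft // 4, window=win,
                       return_complex=True).abs()
        loss = loss + (X - Y).abs().mean()
    return loss / len(fft_sizes)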

Speech super-resolution from 16 kHz to 48 kHz (SpeechSR) leverages a single AMP block and a nearest-neighbor upsampler; it is trained with multi-band discriminators (MPD, MS-STFTD, and a DWT-based discriminator) and evaluated with metrics such as log-spectral distance (LSD, LSD-HF), ViSQOL, and PESQ.
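
A minimal structural sketch of the SpeechSR idea, with a plain convolutional block standing in for the AMP block; channel counts and kernel sizes are illustrative assumptions:

# SpeechSR-style upsampler sketch (illustrative sizes, not the release).
import torch
import torch.nn as nn

class TinySpeechSR(nn.Module):
    def __init__(self, channels=32):
        super().__init__()
        self.up = nn.Upsample(scale_factor=3, mode="nearest")  # 16 kHz -> 48 kHz
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel_size=7, padding=3),
            nn.SiLU(),
            nn.Conv1d(channels, 1, kernel_size=7, padding=3),
        )

    def forward(self, wav_16k):            # (batch, 1, samples)
        return self.net(self.up(wav_16k))  # (batch, 1, 3 * samples)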

4. Zero-shot Voice Conversion and Inference Workflows

HierSpeech++ supports the following zero-shot inference protocols:

  • Voice Conversion (VC):
  1. Extract $x_{w2v}$ and $F0_{src}$ from the source speech.
  2. Normalize $F0_{src}$ and compute $z_{sr}$ from $q(z_{sr} \mid x_{w2v}, F_0)$.
  3. Obtain style $s$ from the target voice prompt's Mel-spectrogram.
  4. Generate $\hat{y}$ at 16 kHz via the hierarchical pipeline $z_{sr} \to G_s \to G_w$, conditioned on $s$.
  5. (Optional) Upsample with SpeechSR.
  • Text-to-Speech (TTS):
  1. Obtain $(z_{sr}, F_0)$ from TTV.
  2. Extract the style embedding $s$ from reference speech.
  3. Hierarchical synthesizer produces the waveform.

A simple TTS pseudocode:

for text in sentences:
    z_t, f0 = Text2Vec(text, prosody_ref)  # semantic latent + predicted F0
    s       = StyleEnc(voice_ref)          # global style from the voice prompt
    wav_16k = HAG.generate(z_t, f0, s)     # 16 kHz waveform
    wav_48k = SpeechSR(wav_16k)            # optional 48 kHz upsampling
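
A matching sketch for the VC path (steps 1–5 above), with hypothetical W2VEnc, F0Ext, and SemEnc helpers standing in for the feature extractors described earlier:

for src_wav in utterances:
    x_w2v   = W2VEnc(src_wav)              # self-supervised features
    f0      = normalize(F0Ext(src_wav))    # speaker-normalized pitch
    z_sr    = SemEnc(x_w2v, f0)            # sample from q(z_sr | x_w2v, F0)
    s       = StyleEnc(target_ref)         # target-speaker style
    wav_16k = HAG.generate(z_sr, f0, s)    # 16 kHz waveform
    wav_48k = SpeechSR(wav_16k)            # optional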

Style Prompt Replication (SPR), in which a one-second prompt is repeated $n$ times, empirically improves performance for brief style inputs.
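
SPR amounts to tiling the prompt along the time axis before style encoding; a one-function sketch, where the (channels, samples) layout and the default n=3 are assumptions:

# Style Prompt Replication sketch (layout and n are assumptions).
import torch

def replicate_prompt(prompt_wav: torch.Tensor, n: int = 3) -> torch.Tensor:
    # prompt_wav: (channels, samples); repeat n times along the time axis.
    return prompt_wav.repeat(1, n)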

5. Experimental Results and Benchmarking

HierSpeech++ demonstrates superior zero-shot synthesis performance over LLM-based, diffusion-based, and prior VAE-based methods. Key results:

  • Zero-shot VC on VCTK (unseen speakers):
    • LT-460: nMOS 4.54, sMOS 3.63, UTMOS 4.19, CER 0.90%, WER 2.99%, EER 2.50%, SECS 0.862.
    • With LT-960+data: EER drops to 1.27%, SECS increases to 0.875.
    • Outperforms AutoVC, VoiceMixer, Diff-VC, DDDM-VC, YourTTS, HierVST.
  • Zero-shot TTS (LibriTTS test-clean/other):
    • nMOS ≈ 4.55 (ground truth: 4.32), sMOS ≈ 3.74, UTMOS ≈ 4.36, CER ≈ 2.39%, WER ≈ 4.20%, SECS ≈ 0.907.
    • Baseline (YourTTS): nMOS ≈ 3.38, sMOS ≈ 3.15; LLM-based VALL-E-X: nMOS ≈ 3.50.
  • Speech Super-resolution (SpeechSR, VCTK):
    • LSD 0.82, LSD-HF 0.98, ViSQOL 3.34, PESQ_wb 4.63.
    • Preference ABX: 70% prefer SpeechSR to AudioSR.
    • Inference speed 742× faster than AudioSR; model 1,986× smaller (0.13 M vs 258 M parameters).
  • Inference Efficiency:
    • Fully parallel HAG + BiT-Flow: ~10× real-time on a single A6000.
    • Diffusion baselines: ~50–100× slower.

HierSpeech++ is the first system to claim human-level zero-shot TTS and VC, with robustness to noisy prompts and with moderate (∼2.8k hours, 16 kHz) open-source data requirements.

6. Component Choices and Ablation Analyses

Ablation studies confirm that the hierarchical VAE with multi-level latent structure and bidirectional flows outperforms the flat VAE baseline (HierVST) in both expressiveness and zero-shot robustness. Key findings:

  • AMP blocks (BigVGAN) over MRF: +0.05 Mel, +0.2 PESQ_wb, improved OOD robustness.
  • Source–filter semantic encoder (SFE): +15% F0 consistency.
  • Dual-audio encoder (DAE): +0.3 Mel, +0.2 PESQ; a small VC EER increase when the acoustic branch is weighted too heavily.
  • Transformer flow + AdaLN-Zero: +0.1 Mel, +0.1 PESQ, +0.04 SECS.
  • Bidirectional flow: +0.1 sMOS, minor reconstruction drop, VC EER ↓0.2%.
  • Prosody distillation: Stronger prosody control in TTS (+0.2 pMOS).
  • Unconditional generation (10% style drop): +0.05 sMOS, +0.04 EER.
  • Data scale-up: Stable nMOS; EER improves to 1.27%, and SECS increases from 0.862 to 0.883.

The cumulative effect is more expressive output (higher MOS, better F0 consistency) and robust adaptation under the zero-shot constraint.

7. Synthesis and Practical Implications

HierSpeech++ consolidates advances in conditional hierarchical VAEs, self-supervised speech representations, adversarial decoding, and efficient super-resolution. The architecture affords fast, robust, and high-fidelity speech generation in zero-shot contexts, rivaling or exceeding human naturalness and speaker similarity metrics with moderate data and resilience to real-world prompt noise. HierSpeech++ unifies a multi-level probabilistic modeling approach with pragmatic engineering choices to establish new standards in zero-shot TTS and VC, and offers a scalable, data-efficient solution for speech synthesis in diverse practical and research settings.
