
HierSpeech++: Hierarchical Zero-Shot Speech Synthesis

Updated 14 November 2025
  • The paper introduces HierSpeech++, a hierarchical variational inference framework that integrates a multi-level VAE with normalizing flows and adversarial decoding to achieve zero-shot speech synthesis.
  • It employs dual audio encoders and a Text-to-Vec module to disentangle semantic and acoustic features, ensuring robust voice conversion and text-to-speech synthesis even with limited or noisy data.
  • Experimental results demonstrate human-level naturalness and speaker similarity with fast inference and superior efficiency, setting new benchmarks in zero-shot TTS and VC.

HierSpeech++ is a hierarchical variational inference-based framework for zero-shot speech synthesis, targeting both text-to-speech (TTS) and voice conversion (VC), and designed to bridge the gap between self-supervised semantic speech representation and high-fidelity waveform generation. Distinct from autoregressive and LLM-based approaches, HierSpeech++ integrates a multi-level latent variable conditional variational autoencoder (VAE) equipped with normalizing flows, adversarial decoding, and an efficient super-resolution module to enable fast, robust, and high-quality zero-shot performance, even with limited or noisy training data. The system establishes new benchmarks in both naturalness and speaker similarity for zero-shot speech synthesis, achieving, for the first time, human-level quality according to subjective and objective metrics (Lee et al., 2023).

1. Hierarchical Variational Inference Architecture

HierSpeech++ implements a hierarchical latent-variable framework, combining a multi-level VAE with normalizing flows (specifically, bidirectional Transformer flows—BiT-Flow) and adversarial decoders. The architecture explicitly models and disentangles semantic and acoustic information via two principal latent variables:

  • $z_{sr}$ (“semantic” latent): Encodes speaker-related and prosodic information inferred from perturbed self-supervised Wav2Vec-style features $x_{w2v}$ and a pitch representation $F_0$. This variable acts as the prior for subsequent acoustic adaptation.
  • $z_a$ (“acoustic” latent): Inferred from the raw waveform $x$ through a dual-audio posterior encoder, it serves as input to the neural waveform generator.

The probabilistic generative process is factorized as

$p_\theta(x | z_a) \cdot p_\theta(z_a | z_{sr}) \cdot p_\theta(z_{sr} | c), \quad \text{where } c = [x_{w2v}, F_0]_{\mathrm{perturbed}}$

Training optimizes the evidence lower bound (ELBO):

$\log p_\theta(x|c) \geq \mathbb{E}_{q_\phi(z_{sr}, z_a | x, c)} \left[\log p_\theta(x | z_a)\right] - \mathrm{KL}(q_\phi(z_{sr} | x_{w2v}, F_0) \,\|\, p_\theta(z_{sr} | c)) - \mathbb{E}_{q_\phi(z_{sr} | x_{w2v}, F_0)} \left[\mathrm{KL}(q_\phi(z_a | x) \,\|\, p_\theta(z_a | z_{sr}))\right]$

Per-level decomposition:

  • Level 1 (Semantic): $\mathrm{KL}_1 = \mathrm{KL}(q(z_{sr}|x_{w2v},F_0) \,\|\, p(z_{sr}|c))$
  • Level 2 (Acoustic): $\mathrm{KL}_2 = \mathbb{E}_{q(z_{sr})}[\mathrm{KL}(q(z_a|x) \,\|\, p(z_a|z_{sr}))]$
  • Reconstruction: $L_{rec} = -\mathbb{E}[\log p(x|z_a)]$

Additional regularization terms include bidirectional flow regularization ($\lambda_{bi} \cdot \mathrm{KL}_{flow}$) to mitigate train–inference mismatch, prosody distillation ($\| \mathrm{Mel}_{1:20}(x) - \mathrm{Mel}_{1:20}(G_s(z_a, s)) \|_1$) to ensure $z_{sr}$ carries prosodic information, and a 10% unconditional generator drop-out to bolster zero-shot robustness.
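The per-level objective can be made concrete with a small sketch. The snippet below assembles the reconstruction term and the two KL terms in PyTorch, assuming Gaussian posteriors and priors; the module names (semantic_posterior, semantic_prior, acoustic_posterior, acoustic_prior, decoder) are hypothetical placeholders rather than the released implementation, and the flow, adversarial, and prosody-distillation terms are omitted for brevity.

import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def hierarchical_elbo_loss(x, x_w2v, f0, c, nets):
    # Level 1: posterior and prior over the semantic latent z_sr
    q_zsr = Normal(*nets.semantic_posterior(x_w2v, f0))   # q(z_sr | x_w2v, F0)
    p_zsr = Normal(*nets.semantic_prior(c))               # p(z_sr | c)
    z_sr = q_zsr.rsample()

    # Level 2: posterior and prior over the acoustic latent z_a
    q_za = Normal(*nets.acoustic_posterior(x))            # q(z_a | x)
    p_za = Normal(*nets.acoustic_prior(z_sr))             # p(z_a | z_sr)
    z_a = q_za.rsample()

    # Reconstruction: L1 on the decoded signal stands in for -log p(x | z_a)
    l_rec = F.l1_loss(nets.decoder(z_a), x)

    kl_1 = kl_divergence(q_zsr, p_zsr).mean()             # semantic-level KL
    kl_2 = kl_divergence(q_za, p_za).mean()               # acoustic-level KL
    return l_rec + kl_1 + kl_2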

2. Text-to-Vec Module for Zero-shot TTS

For TTS, HierSpeech++ employs a Text-to-Vec (TTV) module, itself structured as a VAE with monotonic alignment search (MAS). TTV takes as input a phoneme sequence and a prosody prompt, producing:

  • A semantic latent $z_t \approx x_{w2v}$, a self-supervised speech representation.
  • A predicted high-resolution F0 contour $f$.

The TTV objective function is:

$\mathcal{L}_{ELBO}^{TTV} = \mathbb{E}_{q_\phi} [ \log p_\theta(x_{w2v} | z_t) ] - \mathrm{KL}(q_\phi(z_t|T,s_p) \,\|\, p_\theta(z_t|T,s_p) ) + \ell_{F0}(f, \hat{f}) + \ell_{CTC}$

Here, $\ell_{F0}(f, \hat{f})$ is the $L_1$ distance between predicted and actual F0; $\ell_{CTC}$ is a connectionist temporal classification loss promoting alignment between $z_t$ and phoneme targets. At inference, TTV synthesizes $(z_{sr}, F_0)$ for the main speech synthesizer, conditioned on the text embedding $t_i$ and prosody style $s_p$ derived from a reference utterance.
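The two auxiliary TTV terms are straightforward to express. The sketch below uses a plain L1 distance for the F0 term and torch.nn.CTCLoss for the alignment term; the frame-level phoneme logits and tensor shapes are illustrative assumptions rather than details taken from the paper's code.

import torch
import torch.nn.functional as F

ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def ttv_aux_losses(f0_pred, f0_true, phoneme_logits, phonemes, frame_lens, phoneme_lens):
    # L1 distance between predicted and ground-truth F0 contours
    l_f0 = F.l1_loss(f0_pred, f0_true)
    # CTC alignment between frame-level predictions from z_t and phoneme targets
    # phoneme_logits: (frames, batch, num_phonemes)
    l_ctc = ctc_loss(phoneme_logits.log_softmax(-1), phonemes, frame_lens, phoneme_lens)
    return l_f0, l_ctc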

3. Neural Decoders and Speech Super-Resolution

HierSpeech++ synthesizes raw waveforms at 16 kHz using the Hierarchical Adaptive Generator (HAG):

  • Posterior Encoders:
    • Dual-audio (waveform via AMP blocks plus a linear spectrogram encoder) produces $q(z_a | x)$.
    • Source–filter semantic employs perturbed and unperturbed Wav2Vec layers plus an F0 encoder for $q(z_{sr} | x_{w2v}, F_0)$.
  • Priors ($p(z_{sr}|c)$, $p(z_a|z_{sr})$): parameterized by BiT-Flow with AdaLN-Zero conditioning on style $s$.
  • Neural Generators:
    • $G_s(z_a, s)$ produces a preliminary pitch $p_h$ ($\ell_{pitch} = \| p_x - p_h \|_1$).
    • $G_w(z_a, p_h, s)$ synthesizes the output waveform, optimized by $\ell_{STFT} = \|\mathrm{Sa}(x) - \mathrm{Sa}(\hat{y})\|_1$ and adversarial losses ($\mathcal{L}_{adv}(D)$, $\mathcal{L}_{adv}(G)$) with multi-period and multi-scale STFT discriminators; see the spectral-loss sketch below.
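As a minimal sketch of the spectral term above, the function below compares STFT magnitudes of the reference and generated waveforms with an L1 distance; the n_fft and hop settings are illustrative assumptions, and the multi-resolution and adversarial components are not shown.

import torch

def stft_l1_loss(x, y_hat, n_fft=1024, hop=256):
    # L1 between STFT magnitudes of reference and generated waveforms
    window = torch.hann_window(n_fft, device=x.device)
    spec = lambda w: torch.stft(w, n_fft, hop_length=hop, window=window,
                                return_complex=True).abs()
    return torch.mean(torch.abs(spec(x) - spec(y_hat)))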

Speech super-resolution from 16 kHz to 48 kHz (SpeechSR) leverages a single AMP block and a nearest-neighbor upsampler. Training uses multi-band discriminators (MPD, MS-STFTD, DWT-based), and evaluation reports log-spectral distance (LSD, LSD-HF), ViSQOL, and PESQ.
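The lightweight SpeechSR design can be illustrated as a nearest-neighbor upsampler (3× for 16 kHz → 48 kHz) followed by a small convolutional stage; the plain Conv1d stack below merely stands in for the AMP block, and all layer sizes are assumptions rather than the released architecture.

import torch.nn as nn

class SpeechSRSketch(nn.Module):
    # 16 kHz -> 48 kHz: 3x nearest-neighbor upsampling plus a small conv stack
    def __init__(self, channels=32, kernel=7):
        super().__init__()
        self.up = nn.Upsample(scale_factor=3, mode="nearest")
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel, padding=kernel // 2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(channels, 1, kernel, padding=kernel // 2),
        )

    def forward(self, wav_16k):  # wav_16k: (batch, 1, samples)
        return self.net(self.up(wav_16k))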

4. Zero-shot Voice Conversion and Inference Workflows

HierSpeech++ supports the following zero-shot inference protocols:

  • Voice Conversion (VC):
  1. Extract $x_{w2v}$ and the source pitch $F0_{src}$ from the source speech.
  2. Normalize $F0_{src}$ and compute $z_{sr}$ using $q(z_{sr} | \ldots)$.
  3. Obtain the style $s$ from the target voice prompt's Mel-spectrogram.
  4. Generate $\hat{y}$ via $G(z_a, z_{sr} \to G_s \to G_w, s)$ at 16 kHz.
  5. (Optional) Upsample with SpeechSR.
  • Text-to-Speech (TTS):
  1. Obtain $(z_{sr}, F_0)$ from TTV.
  2. Extract the style $s$ from reference speech.
  3. Hierarchical synthesizer produces the waveform.

A simple TTS pseudocode:

for text in sentences:
    z_t, f0 = Text2Vec(text, prosody_ref)   # semantic latent and F0 from TTV
    s       = StyleEnc(voice_ref)           # global style from reference speech
    y_hat   = HAG.generate(z_t, f0, s)      # 16 kHz waveform
    y_48k   = SpeechSR(y_hat)               # optional super-resolution
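A corresponding sketch for the zero-shot VC path, in the same pseudocode style; the helper names (extract_w2v, extract_f0, normalize_f0, SemanticPosterior) are illustrative placeholders rather than the released API.

x_w2v  = extract_w2v(src_wav)                 # perturbed Wav2Vec features
f0     = normalize_f0(extract_f0(src_wav))    # source F0, normalized
z_sr   = SemanticPosterior(x_w2v, f0)         # q(z_sr | x_w2v, F0)
s      = StyleEnc(target_prompt_mel)          # style from target voice prompt
y_hat  = HAG.generate(z_sr, f0, s)            # 16 kHz converted speech
y_48k  = SpeechSR(y_hat)                      # optional super-resolution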

Style Prompt Replication (SPR), in which a one-second prompt is repeated $n$ times, empirically improves performance for brief style inputs.
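SPR itself amounts to tiling the short prompt before style encoding; a minimal sketch, assuming a one-dimensional waveform array and a caller-chosen repetition count n:

import numpy as np

def replicate_style_prompt(prompt_wav, n):
    # Repeat a short (e.g., 1 s) prompt n times before passing it to the style encoder
    return np.tile(prompt_wav, n)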

5. Experimental Results and Benchmarking

HierSpeech++ demonstrates superior zero-shot synthesis performance over LLM-based, diffusion-based, and prior VAE-based methods. Key results:

  • Zero-shot VC on VCTK (unseen speakers):
    • LT-460: nMOS 4.54, sMOS 3.63, UTMOS 4.19, CER 0.90%, WER 2.99%, EER 2.50%, SECS 0.862.
    • With LT-960+data: EER drops to 1.27%, SECS increases to 0.875.
    • Outperforms AutoVC, VoiceMixer, Diff-VC, DDDM-VC, YourTTS, HierVST.
  • Zero-shot TTS (LibriTTS test-clean/other):
    • nMOS ≈ 4.55 (ground truth: 4.32), sMOS ≈ 3.74, UTMOS ≈ 4.36, CER ≈ 2.39%, WER ≈ 4.20%, SECS ≈ 0.907.
    • Baseline (YourTTS): nMOS ≈ 3.38, sMOS ≈ 3.15; LLM-based VALL-E-X: nMOS ≈ 3.50.
  • Speech Super-resolution (SpeechSR, VCTK):
    • LSD 0.82, LSD-HF 0.98, ViSQOL 3.34, PESQ_wb 4.63.
    • Preference ABX: 70% prefer SpeechSR to AudioSR.
    • Inference speed 742× faster than AudioSR; model 1,986× smaller (0.13 M vs 258 M parameters).
  • Inference Efficiency:
    • Fully parallel HAG + BiT-Flow: ≈10× real-time on a single A6000.
    • Diffusion baselines: ≈50–100× slower.

HierSpeech++ is the first system to claim human-level zero-shot TTS and VC, with robustness to noisy prompts and with moderate (∼2.8k hours, 16 kHz) open-source data requirements.

6. Component Choices and Ablation Analyses

Ablation studies confirm that the hierarchical VAE with multi-level latent structure and bidirectional flows outperforms the flat VAE baseline (HierVST) in both expressiveness and zero-shot robustness. Key findings:

  • AMP blocks (BigVGAN) over MRF: +0.05 Mel, +0.2 PESQ_wb, improved OOD robustness.
  • Source–filter semantic encoder (SFE): +15% F0 consistency.
  • Dual-audio encoder (DAE): +0.3 Mel, +0.2 PESQ, with a small VC EER increase if the acoustic path is overemphasized.
  • Transformer flow + AdaLN-Zero: +0.1 Mel, +0.1 PESQ, +0.04 SECS.
  • Bidirectional flow: +0.1 sMOS, minor reconstruction drop, VC EER ↓0.2%.
  • Prosody distillation: Stronger prosody control in TTS (+0.2 pMOS).
  • Unconditional generation (10% style drop): +0.05 sMOS, +0.04 EER.
  • Data scale-up: stable nMOS; EER improves to 1.27%, SECS increases from 0.862 to 0.883.

The cumulative effect is more expressive output (higher MOS, better F0 consistency) and robust adaptation under the zero-shot constraint.

7. Synthesis and Practical Implications

HierSpeech++ consolidates advances in conditional hierarchical VAEs, self-supervised speech representations, adversarial decoding, and efficient super-resolution. The architecture affords fast, robust, and high-fidelity speech generation in zero-shot contexts, rivaling or exceeding human naturalness and speaker similarity metrics with moderate data and resilience to real-world prompt noise. HierSpeech++ unifies a multi-level probabilistic modeling approach with pragmatic engineering choices to establish new standards in zero-shot TTS and VC, and offers a scalable, data-efficient solution for speech synthesis in diverse practical and research settings.
