
HierSpeech++: Hierarchical Zero-Shot Speech Synthesis

Updated 14 November 2025
  • The paper introduces HierSpeech++, a hierarchical variational inference framework that integrates a multi-level VAE with normalizing flows and adversarial decoding to achieve zero-shot speech synthesis.
  • It employs dual audio encoders and a Text-to-Vec module to disentangle semantic and acoustic features, ensuring robust voice conversion and text-to-speech synthesis even with limited or noisy data.
  • Experimental results demonstrate human-level naturalness and speaker similarity with fast inference and superior efficiency, setting new benchmarks in zero-shot TTS and VC.

HierSpeech++ is a hierarchical variational inference-based framework for zero-shot speech synthesis, targeting both text-to-speech (TTS) and voice conversion (VC), and designed to bridge the gap between self-supervised semantic speech representation and high-fidelity waveform generation. Distinct from autoregressive and LLM-based approaches, HierSpeech++ integrates a multi-level latent variable conditional variational autoencoder (VAE) equipped with normalizing flows, adversarial decoding, and an efficient super-resolution module to enable fast, robust, and high-quality zero-shot performance, even with limited or noisy training data. The system establishes new benchmarks in both naturalness and speaker similarity for zero-shot speech synthesis, achieving, for the first time, human-level quality according to subjective and objective metrics (Lee et al., 2023).

1. Hierarchical Variational Inference Architecture

HierSpeech++ implements a hierarchical latent-variable framework, combining a multi-level VAE with normalizing flows (specifically, bidirectional Transformer flows—BiT-Flow) and adversarial decoders. The architecture explicitly models and disentangles semantic and acoustic information via two principal latent variables:

  • $z_{sr}$ (“semantic” latent): Encodes speaker-related and prosodic information inferred from perturbed self-supervised Wav2Vec-style features $x_{w2v}$ and a pitch representation $F_0$. This variable acts as the prior for subsequent acoustic adaptation.
  • $z_a$ (“acoustic” latent): Inferred from the raw waveform $x$ through a dual-audio posterior encoder, it serves as input to the neural waveform generator.

The probabilistic generative process is factorized as

$p_\theta(x | z_a) \cdot p_\theta(z_a | z_{sr}) \cdot p_\theta(z_{sr} | c), \quad \text{where } c = [x_{w2v}, F_0]_{\mathrm{perturbed}}$

Training optimizes the evidence lower bound (ELBO):

$\log p_\theta(x|c) \geq \mathbb{E}_{q_\phi(z_{sr}, z_a | x, c)} \left[\log p_\theta(x | z_a)\right] - \mathrm{KL}(q_\phi(z_{sr} | x_{w2v}, F_0) \,\|\, p_\theta(z_{sr} | c)) - \mathbb{E}_{q_\phi(z_{sr} | x_{w2v}, F_0)} \left[\mathrm{KL}(q_\phi(z_a | x) \,\|\, p_\theta(z_a | z_{sr}))\right]$

Per-level decomposition:

  • Level 1 (Semantic): $\mathrm{KL}_1 = \mathrm{KL}(q(z_{sr}|x_{w2v},F_0) \,\|\, p(z_{sr}|c))$
  • Level 2 (Acoustic): $\mathrm{KL}_2 = \mathbb{E}_{q(z_{sr})}[\mathrm{KL}(q(z_a|x) \,\|\, p(z_a|z_{sr}))]$
  • Reconstruction: $L_{rec} = -\mathbb{E}[\log p(x|z_a)]$

Additional regularization terms include bidirectional flow regularization ($\lambda_{bi} \cdot \mathrm{KL}_{flow}$) to mitigate train–inference mismatch, prosody distillation ($\| \mathrm{Mel}_{1:20}(x) - \mathrm{Mel}_{1:20}(G_s(z_a, s)) \|_1$) to ensure $z_{sr}$ carries prosodic information, and a 10% unconditional generator drop-out to bolster zero-shot robustness.
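The per-level objective can be made concrete with a small sketch. The snippet below assembles the reconstruction term and the two KL terms in PyTorch, assuming Gaussian posteriors and priors; the module names (semantic_posterior, semantic_prior, acoustic_posterior, acoustic_prior, decoder) are hypothetical placeholders rather than the released implementation, and the flow, adversarial, and prosody-distillation terms are omitted for brevity.

import torch.nn.functional as F
from torch.distributions import Normal, kl_divergence

def hierarchical_elbo_loss(x, x_w2v, f0, c, nets):
    # Level 1: posterior and prior over the semantic latent z_sr
    q_zsr = Normal(*nets.semantic_posterior(x_w2v, f0))   # q(z_sr | x_w2v, F0)
    p_zsr = Normal(*nets.semantic_prior(c))               # p(z_sr | c)
    z_sr = q_zsr.rsample()

    # Level 2: posterior and prior over the acoustic latent z_a
    q_za = Normal(*nets.acoustic_posterior(x))            # q(z_a | x)
    p_za = Normal(*nets.acoustic_prior(z_sr))             # p(z_a | z_sr)
    z_a = q_za.rsample()

    # Reconstruction: L1 on the decoded signal stands in for -log p(x | z_a)
    l_rec = F.l1_loss(nets.decoder(z_a), x)

    kl_1 = kl_divergence(q_zsr, p_zsr).mean()             # semantic-level KL
    kl_2 = kl_divergence(q_za, p_za).mean()               # acoustic-level KL
    return l_rec + kl_1 + kl_2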

2. Text-to-Vec Module for Zero-shot TTS

For TTS, HierSpeech++ employs a Text-to-Vec (TTV) module, itself structured as a VAE with monotonic alignment search (MAS). TTV takes as input a phoneme sequence and a prosody prompt, producing:

  • A semantic latent $z_t \approx x_{w2v}$, a self-supervised speech representation.
  • A predicted high-resolution F0 contour $f$.

The TTV objective function is:

$\mathcal{L}_{ELBO}^{TTV} = \mathbb{E}_{q_\phi} [ \log p_\theta(x_{w2v} | z_t) ] - \mathrm{KL}(q_\phi(z_t|T,s_p) \,\|\, p_\theta(z_t|T,s_p) ) + \ell_{F0}(f, \hat{f}) + \ell_{CTC}$

Here, $\ell_{F0}(f, \hat{f})$ is the $L_1$ distance between predicted and actual F0; $\ell_{CTC}$ is a connectionist temporal classification loss promoting alignment between $z_t$ and phoneme targets. At inference, TTV synthesizes $(z_{sr}, F_0)$ for the main speech synthesizer, conditioned on the text embedding $t_i$ and prosody style $s_p$ derived from a reference utterance.
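The two auxiliary TTV terms are straightforward to express. The sketch below uses a plain L1 distance for the F0 term and torch.nn.CTCLoss for the alignment term; the frame-level phoneme logits and tensor shapes are illustrative assumptions rather than details taken from the paper's code.

import torch
import torch.nn.functional as F

ctc_loss = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def ttv_aux_losses(f0_pred, f0_true, phoneme_logits, phonemes, frame_lens, phoneme_lens):
    # L1 distance between predicted and ground-truth F0 contours
    l_f0 = F.l1_loss(f0_pred, f0_true)
    # CTC alignment between frame-level predictions from z_t and phoneme targets
    # phoneme_logits: (frames, batch, num_phonemes)
    l_ctc = ctc_loss(phoneme_logits.log_softmax(-1), phonemes, frame_lens, phoneme_lens)
    return l_f0, l_ctc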

3. Neural Decoders and Speech Super-Resolution

HierSpeech++ synthesizes raw waveforms at 16 kHz using the Hierarchical Adaptive Generator (HAG):

  • Posterior Encoders:
    • Dual-audio (waveform via AMP blocks plus a linear spectrogram encoder) produces $q(z_a | x)$.
    • Source–filter semantic employs perturbed and unperturbed Wav2Vec layers plus an F0 encoder for $q(z_{sr} | x_{w2v}, F_0)$.
  • Priors ($p(z_{sr}|c)$, $p(z_a|z_{sr})$): parameterized by BiT-Flow with AdaLN-Zero conditioning on style $s$.
  • Neural Generators:
    • $G_s(z_a, s)$ produces a preliminary pitch $p_h$ ($\ell_{pitch} = \| p_x - p_h \|_1$).
    • $G_w(z_a, p_h, s)$ synthesizes the output waveform, optimized by $\ell_{STFT} = \|\mathrm{Sa}(x) - \mathrm{Sa}(\hat{y})\|_1$ and adversarial losses ($\mathcal{L}_{adv}(D)$, $\mathcal{L}_{adv}(G)$) with multi-period and multi-scale STFT discriminators; see the spectral-loss sketch below.
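As a minimal sketch of the spectral term above, the function below compares STFT magnitudes of the reference and generated waveforms with an L1 distance; the n_fft and hop settings are illustrative assumptions, and the multi-resolution and adversarial components are not shown.

import torch

def stft_l1_loss(x, y_hat, n_fft=1024, hop=256):
    # L1 between STFT magnitudes of reference and generated waveforms
    window = torch.hann_window(n_fft, device=x.device)
    spec = lambda w: torch.stft(w, n_fft, hop_length=hop, window=window,
                                return_complex=True).abs()
    return torch.mean(torch.abs(spec(x) - spec(y_hat)))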

Speech super-resolution from 16 kHz to 48 kHz (SpeechSR) leverages a single AMP block and a nearest-neighbor upsampler. Training uses multi-band discriminators (MPD, MS-STFTD, DWT-based), and evaluation reports log-spectral distance (LSD, LSD-HF), ViSQOL, and PESQ.
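The lightweight SpeechSR design can be illustrated as a nearest-neighbor upsampler (3× for 16 kHz → 48 kHz) followed by a small convolutional stage; the plain Conv1d stack below merely stands in for the AMP block, and all layer sizes are assumptions rather than the released architecture.

import torch.nn as nn

class SpeechSRSketch(nn.Module):
    # 16 kHz -> 48 kHz: 3x nearest-neighbor upsampling plus a small conv stack
    def __init__(self, channels=32, kernel=7):
        super().__init__()
        self.up = nn.Upsample(scale_factor=3, mode="nearest")
        self.net = nn.Sequential(
            nn.Conv1d(1, channels, kernel, padding=kernel // 2),
            nn.LeakyReLU(0.1),
            nn.Conv1d(channels, 1, kernel, padding=kernel // 2),
        )

    def forward(self, wav_16k):  # wav_16k: (batch, 1, samples)
        return self.net(self.up(wav_16k))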

4. Zero-shot Voice Conversion and Inference Workflows

HierSpeech++ supports the following zero-shot inference protocols:

  • Voice Conversion (VC):
  1. Extract $x_{w2v}$ and the source pitch $F0_{src}$ from the source speech.
  2. Normalize $F0_{src}$ and compute $z_{sr}$ using $q(z_{sr} | \ldots)$.
  3. Obtain the style $s$ from the target voice prompt's Mel-spectrogram.
  4. Generate $\hat{y}$ via $G(z_a, z_{sr} \to G_s \to G_w, s)$ at 16 kHz.
  5. (Optional) Upsample with SpeechSR.
  • Text-to-Speech (TTS):
  1. Obtain $(z_{sr}, F_0)$ from TTV.
  2. Extract the style $s$ from reference speech.
  3. Hierarchical synthesizer produces the waveform.

A simple TTS pseudocode:

for text in sentences:
    z_t, f0 = Text2Vec(text, prosody_ref)   # semantic latent and F0 from TTV
    s       = StyleEnc(voice_ref)           # global style from reference speech
    y_hat   = HAG.generate(z_t, f0, s)      # 16 kHz waveform
    y_48k   = SpeechSR(y_hat)               # optional super-resolution
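A corresponding sketch for the zero-shot VC path, in the same pseudocode style; the helper names (extract_w2v, extract_f0, normalize_f0, SemanticPosterior) are illustrative placeholders rather than the released API.

x_w2v  = extract_w2v(src_wav)                 # perturbed Wav2Vec features
f0     = normalize_f0(extract_f0(src_wav))    # source F0, normalized
z_sr   = SemanticPosterior(x_w2v, f0)         # q(z_sr | x_w2v, F0)
s      = StyleEnc(target_prompt_mel)          # style from target voice prompt
y_hat  = HAG.generate(z_sr, f0, s)            # 16 kHz converted speech
y_48k  = SpeechSR(y_hat)                      # optional super-resolution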

Style Prompt Replication (SPR), in which a one-second prompt is repeated $n$ times, empirically improves performance for brief style inputs.
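SPR itself amounts to tiling the short prompt before style encoding; a minimal sketch, assuming a one-dimensional waveform array and a caller-chosen repetition count n:

import numpy as np

def replicate_style_prompt(prompt_wav, n):
    # Repeat a short (e.g., 1 s) prompt n times before passing it to the style encoder
    return np.tile(prompt_wav, n)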

5. Experimental Results and Benchmarking

HierSpeech++ demonstrates superior zero-shot synthesis performance over LLM-based, diffusion-based, and prior VAE-based methods. Key results:

  • Zero-shot VC on VCTK (unseen speakers):
    • LT-460: nMOS 4.54, sMOS 3.63, UTMOS 4.19, CER 0.90%, WER 2.99%, EER 2.50%, SECS 0.862.
    • With LT-960+data: EER drops to 1.27%, SECS increases to 0.875.
    • Outperforms AutoVC, VoiceMixer, Diff-VC, DDDM-VC, YourTTS, HierVST.
  • Zero-shot TTS (LibriTTS test-clean/other):
    • nMOS ≈ 4.55 (ground truth: 4.32), sMOS ≈ 3.74, UTMOS ≈ 4.36, CER ≈ 2.39%, WER ≈ 4.20%, SECS ≈ 0.907.
    • Baseline (YourTTS): nMOS ≈ 3.38, sMOS ≈ 3.15; LLM-based VALL-E-X: nMOS ≈ 3.50.
  • Speech Super-resolution (SpeechSR, VCTK):
    • LSD 0.82, LSD-HF 0.98, ViSQOL 3.34, PESQ_wb 4.63.
    • Preference ABX: 70% prefer SpeechSR to AudioSR.
    • Inference speed 742× faster than AudioSR; model 1,986× smaller (0.13 M vs 258 M parameters).
  • Inference Efficiency:
    • Fully parallel HAG + BiT-Flow: ≈10× real-time on a single A6000.
    • Diffusion baselines: ≈50–100× slower.

HierSpeech++ is the first system to claim human-level zero-shot TTS and VC, with robustness to noisy prompts and with moderate (∼2.8k hours, 16 kHz) open-source data requirements.

6. Component Choices and Ablation Analyses

Ablation studies confirm that the hierarchical VAE with multi-level latent structure and bidirectional flows outperforms the flat VAE baseline (HierVST) in both expressiveness and zero-shot robustness. Key findings:

  • AMP blocks (BigVGAN) over MRF: +0.05 Mel, +0.2 PESQ_wb, improved OOD robustness.
  • Source–filter semantic encoder (SFE): +15% F0 consistency.
  • Dual-audio encoder (DAE): +0.3 Mel, +0.2 PESQ, with a small VC EER increase if the acoustic path is overemphasized.
  • Transformer flow + AdaLN-Zero: +0.1 Mel, +0.1 PESQ, +0.04 SECS.
  • Bidirectional flow: +0.1 sMOS, minor reconstruction drop, VC EER ↓0.2%.
  • Prosody distillation: Stronger prosody control in TTS (+0.2 pMOS).
  • Unconditional generation (10% style drop): +0.05 sMOS, +0.04 EER.
  • Data scale-up: stable nMOS; EER improves to 1.27%, SECS increases from 0.862 to 0.883.

The cumulative effect is more expressive output (higher MOS, better F0 consistency) and robust adaptation under the zero-shot constraint.

7. Synthesis and Practical Implications

HierSpeech++ consolidates advances in conditional hierarchical VAEs, self-supervised speech representations, adversarial decoding, and efficient super-resolution. The architecture affords fast, robust, and high-fidelity speech generation in zero-shot contexts, rivaling or exceeding human naturalness and speaker similarity metrics with moderate data and resilience to real-world prompt noise. HierSpeech++ unifies a multi-level probabilistic modeling approach with pragmatic engineering choices to establish new standards in zero-shot TTS and VC, and offers a scalable, data-efficient solution for speech synthesis in diverse practical and research settings.
