HierSpeech++: Hierarchical Zero-Shot Speech Synthesis
- The paper introduces HierSpeech++, a hierarchical variational inference framework that integrates a multi-level VAE with normalizing flows and adversarial decoding to achieve zero-shot speech synthesis.
- It employs dual audio encoders and a Text-to-Vec module to disentangle semantic and acoustic features, ensuring robust voice conversion and text-to-speech synthesis even with limited or noisy data.
- Experimental results demonstrate human-level naturalness and speaker similarity with fast inference and superior efficiency, setting new benchmarks in zero-shot TTS and VC.
HierSpeech++ is a hierarchical variational inference-based framework for zero-shot speech synthesis, targeting both text-to-speech (TTS) and voice conversion (VC), and designed to bridge the gap between self-supervised semantic speech representation and high-fidelity waveform generation. Distinct from autoregressive and LLM-based approaches, HierSpeech++ integrates a multi-level latent variable conditional variational autoencoder (VAE) equipped with normalizing flows, adversarial decoding, and an efficient super-resolution module to enable fast, robust, and high-quality zero-shot performance, even with limited or noisy training data. The system establishes new benchmarks in both naturalness and speaker similarity for zero-shot speech synthesis, achieving, for the first time, human-level quality according to subjective and objective metrics (Lee et al., 2023).
1. Hierarchical Variational Inference Architecture
HierSpeech++ implements a hierarchical latent-variable framework, combining a multi-level VAE with normalizing flows (specifically, bidirectional Transformer flows—BiT-Flow) and adversarial decoders. The architecture explicitly models and disentangles semantic and acoustic information via two principal latent variables:
- $z_{sr}$ (“semantic” latent): Encodes speaker-related and prosodic information inferred from perturbed self-supervised Wav2Vec-style features $x_{w2v}$ and a pitch representation $F_0$. This variable acts as the prior for subsequent acoustic adaptation.
- $z_a$ (“acoustic” latent): Inferred from the raw waveform through a dual-audio posterior encoder, it serves as input to the neural waveform generator.
The probabilistic generative process is factorized as $p_\theta(x, z_a, z_{sr} | c) = p_\theta(x | z_a)\, p_\theta(z_a | z_{sr})\, p_\theta(z_{sr} | c)$.
Training optimizes the evidence lower bound (ELBO):
$\log p_\theta(x|c) \geq \mathbb{E}_{q_\phi(z_{sr}, z_a | x, c)} \left[\log p_\theta(x | z_a)\right] - \mathrm{KL}(q_\phi(z_{sr} | x_{w2v}, F_0) \| p_\theta(z_{sr} | c)) - \mathbb{E}_{q_\phi(z_{sr} | x_{w2v}, F_0)} [\mathrm{KL}(q_\phi(z_a | x) \| p_\theta(z_a | z_{sr}))]$
Per-level decomposition:
- Level 1 (Semantic): $\mathrm{KL}_1 = \mathrm{KL}(q(z_{sr}|x_{w2v}, F_0) \| p(z_{sr}|c))$
- Level 2 (Acoustic): $\mathrm{KL}_2 = \mathbb{E}_{q(z_{sr})}[\mathrm{KL}(q(z_a|x) \| p(z_a|z_{sr}))]$
- Reconstruction: $L_{rec} = -\mathbb{E}[ \log p(x|z_a) ]$
Additional regularization terms include a bidirectional flow regularization to mitigate train–inference mismatch, prosody distillation to ensure the semantic latent carries prosodic information, and a 10% unconditional generator drop-out to bolster zero-shot robustness.
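To make the objective concrete, the following minimal PyTorch sketch combines the reconstruction term with the two KL terms, assuming diagonal-Gaussian posteriors and priors; the function and argument names are illustrative, not the authors' implementation, and the L1 reconstruction distance is a stand-in.

```python
import torch
import torch.nn.functional as F

def gaussian_kl(mu_q, logvar_q, mu_p, logvar_p):
    """KL( N(mu_q, var_q) || N(mu_p, var_p) ) for diagonal Gaussians, averaged over the batch."""
    kl = 0.5 * (
        logvar_p - logvar_q
        + (logvar_q.exp() + (mu_q - mu_p) ** 2) / logvar_p.exp()
        - 1.0
    )
    return kl.sum(dim=-1).mean()

def hierarchical_elbo_loss(x, x_hat, sem_post, sem_prior, ac_post, ac_prior,
                           kl1_weight=1.0, kl2_weight=1.0):
    """Negative ELBO for the two-level latent hierarchy.
    Each *_post / *_prior argument is a (mu, logvar) pair of tensors."""
    # Level-1 KL: semantic posterior q(z_sr | x_w2v, F0) vs. prior p(z_sr | c)
    kl1 = gaussian_kl(*sem_post, *sem_prior)
    # Level-2 KL: acoustic posterior q(z_a | x) vs. conditional prior p(z_a | z_sr)
    kl2 = gaussian_kl(*ac_post, *ac_prior)
    # Reconstruction term; an L1 distance is used here as a simple stand-in
    rec = F.l1_loss(x_hat, x)
    return rec + kl1_weight * kl1 + kl2_weight * kl2
```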
2. Text-to-Vec Module for Zero-shot TTS
For TTS, HierSpeech++ employs a Text-to-Vec (TTV) module, itself structured as a VAE with monotonic alignment search (MAS). TTV takes as input a phoneme sequence and a prosody prompt, producing:
- A semantic representation $x_{w2v}$, i.e., the self-supervised speech representation.
- A predicted high-resolution F0 contour $\hat{f}$.
The TTV objective function is:
$\mathcal{L}_{ELBO}^{TTV} = \mathbb{E}_{q_\phi} [ \log p_\theta(x_{w2v} | z_t) ] - \mathrm{KL}(q_\phi(z_t|T,s_p) \| p_\theta(z_t|T,s_p) ) + \ell_{F0}(f, \hat{f}) + \ell_{CTC}$
Here, $\ell_{F0}(f, \hat{f})$ is the distance between the predicted and ground-truth F0 contours; $\ell_{CTC}$ is a connectionist temporal classification loss promoting alignment between the predicted semantic representation and the phoneme targets. At inference, TTV synthesizes $x_{w2v}$ and $\hat{f}$ for the main speech synthesizer, conditioned on the text embedding $T$ and a prosody style $s_p$ derived from a reference utterance.
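A minimal sketch of a TTV-style training objective follows; the helper names, the L1 distances, and the use of `torch.nn.CTCLoss` are assumptions for illustration rather than the paper's exact losses.

```python
import torch
import torch.nn.functional as F

ctc_criterion = torch.nn.CTCLoss(blank=0, zero_infinity=True)

def ttv_loss(w2v_pred, w2v_target,             # reconstructed vs. target semantic features
             f0_pred, f0_target,               # predicted vs. ground-truth F0 contours
             kl_term,                          # KL(q(z_t | T, s_p) || p(z_t | T, s_p)), precomputed
             phoneme_logits, phoneme_targets,  # (batch, time, classes), (batch, max_label_len)
             logit_lengths, target_lengths):
    rec = F.l1_loss(w2v_pred, w2v_target)      # reconstruction of x_w2v
    f0 = F.l1_loss(f0_pred, f0_target)         # F0 regression term
    # CTC expects log-probabilities shaped (time, batch, classes)
    log_probs = F.log_softmax(phoneme_logits, dim=-1).transpose(0, 1)
    ctc = ctc_criterion(log_probs, phoneme_targets, logit_lengths, target_lengths)
    return rec + kl_term + f0 + ctc
```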
3. Neural Decoders and Speech Super-Resolution
HierSpeech++ synthesizes raw waveforms at 16 kHz using the Hierarchical Adaptive Generator (HAG):
- Posterior Encoders:
- Dual-audio posterior encoder (waveform via AMP blocks plus a linear spectrogram encoder) produces $z_a$.
- Source–filter semantic encoder employs perturbed and unperturbed Wav2Vec layers plus an F0 encoder for $z_{sr}$.
- Priors ($p_\theta(z_{sr}|c)$, $p_\theta(z_a|z_{sr})$): parameterized by BiT-Flow with AdaLN-Zero conditioning on the style embedding $s$ (a generic sketch follows this list).
- Neural Generators:
- A source generator produces a preliminary pitch (excitation) signal.
- A waveform generator synthesizes the output waveform, optimized with reconstruction and adversarial losses using multi-period and multi-scale STFT discriminators.
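AdaLN-Zero conditioning, referenced above, can be illustrated with a generic block in which the style embedding predicts a zero-initialized shift, scale, and gate; this is a textbook-style sketch, not the authors' flow layer.

```python
import torch
import torch.nn as nn

class AdaLNZero(nn.Module):
    """Generic AdaLN-Zero wrapper: a style vector predicts shift/scale/gate for a
    sub-layer, with zero-initialized modulation so the block starts as an identity."""
    def __init__(self, dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_mod = nn.Linear(style_dim, 3 * dim)
        nn.init.zeros_(self.to_mod.weight)   # the "Zero" in AdaLN-Zero
        nn.init.zeros_(self.to_mod.bias)

    def forward(self, x, style, sublayer):
        # x: (batch, time, dim); style: (batch, style_dim); sublayer: callable on (batch, time, dim)
        shift, scale, gate = self.to_mod(style).unsqueeze(1).chunk(3, dim=-1)
        h = self.norm(x) * (1.0 + scale) + shift
        return x + gate * sublayer(h)
```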
Speech super-resolution from 16 kHz to 48 kHz (SpeechSR) leverages a single AMP block and a nearest-neighbor upsampler. It is trained with multi-band discriminators (MPD, MS-STFTD, DWT-based) and evaluated with metrics such as log-spectral distance (LSD, LSD-HF), ViSQOL, and PESQ.
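The core of such a super-resolution head can be sketched as nearest-neighbor repetition along time (a factor of 3 for 16 kHz to 48 kHz) followed by light smoothing; the module below is an illustrative stand-in that omits the AMP block and discriminators.

```python
import torch
import torch.nn as nn

class NearestNeighbourUpsampler(nn.Module):
    """Illustrative 16 kHz -> 48 kHz head: nearest-neighbour repetition along time
    (factor 3) followed by a light smoothing convolution."""
    def __init__(self, channels: int = 1, factor: int = 3):
        super().__init__()
        self.factor = factor
        self.smooth = nn.Conv1d(channels, channels, kernel_size=7, padding=3)

    def forward(self, x):
        # x: (batch, channels, time) at 16 kHz
        x = torch.repeat_interleave(x, self.factor, dim=-1)  # (batch, channels, factor * time)
        return self.smooth(x)

# usage: wav_48k = NearestNeighbourUpsampler()(wav_16k.unsqueeze(1))
```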
4. Zero-shot Voice Conversion and Inference Workflows
HierSpeech++ supports the following zero-shot inference protocols:
- Voice Conversion (VC):
- Extract $x_{w2v}$ and $F_0$ from the source speech.
- Normalize $F_0$, then compute $z_{sr}$ using the source–filter semantic encoder.
- Obtain the style embedding $s$ from the target voice prompt's Mel spectrogram.
- Generate the converted waveform via the HAG at 16 kHz.
- (Optional) Upsample with SpeechSR.
- Text-to-Speech (TTS):
- Obtain the semantic representation $x_{w2v}$ and predicted F0 $\hat{f}$ from TTV.
- Extract the style embedding $s$ from the reference speech.
- Hierarchical synthesizer produces the waveform.
A simple TTS pseudocode:
```
for sentence in text:
    z_t, f0_hat = Text2Vec(sentence, prosody_ref)  # semantic representation and predicted F0
    s = StyleEnc(voice_ref)                        # style embedding from the voice reference
    y = HAG.generate(z_t, f0_hat, s)               # 16 kHz waveform
    y_high = SpeechSR(y)                           # optional 16 kHz -> 48 kHz
```
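A matching sketch for the zero-shot VC path, in the same pseudocode style (function names are illustrative):

```
x_w2v, f0 = ContentEnc(source_wav)         # semantic features and F0 from the source
f0_norm = normalize_f0(f0)                 # per-utterance F0 normalization
s = StyleEnc(target_prompt_mel)            # style embedding from the target voice prompt
y = HAG.generate(x_w2v, f0_norm, s)        # converted 16 kHz waveform
y_48k = SpeechSR(y)                        # optional super-resolution
```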
Style Prompt Replication (SPR), in which a one-second prompt is repeated several times to form a longer style input, empirically improves performance for brief style prompts.
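As a concrete illustration of SPR, a short prompt can simply be tiled along the time axis before style extraction; the repeat count below is an arbitrary example.

```python
import torch

def replicate_style_prompt(prompt: torch.Tensor, repeats: int = 4) -> torch.Tensor:
    """Tile a short (e.g., one-second) prompt along the time axis so the style
    encoder sees a longer input; the repeat count here is an arbitrary example."""
    return prompt.repeat(1, repeats)   # (channels, time) -> (channels, repeats * time)
```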
5. Experimental Results and Benchmarking
HierSpeech++ demonstrates superior zero-shot synthesis performance over LLM-based, diffusion-based, and prior VAE-based methods. Key results:
- Zero-shot VC on VCTK (unseen speakers):
- LT-460: nMOS 4.54, sMOS 3.63, UTMOS 4.19, CER 0.90%, WER 2.99%, EER 2.50%, SECS 0.862.
- With LT-960+data: EER drops to 1.27%, SECS increases to 0.875.
- Outperforms AutoVC, VoiceMixer, Diff-VC, DDDM-VC, YourTTS, HierVST.
- Zero-shot TTS (LibriTTS test-clean/other):
- nMOS ≈ 4.55 (ground truth: 4.32), sMOS ≈ 3.74, UTMOS ≈ 4.36, CER ≈ 2.39%, WER ≈ 4.20%, SECS ≈ 0.907.
- Baseline (YourTTS): nMOS ≈ 3.38, sMOS ≈ 3.15; LLM-based VALL-E-X: nMOS ≈ 3.50.
- Speech Super-resolution (SpeechSR, VCTK):
- LSD 0.82, LSD-HF 0.98, ViSQOL 3.34, PESQ_wb 4.63.
- Preference ABX: 70% prefer SpeechSR to AudioSR.
- Inference speed 742× faster than AudioSR; model 1,986× smaller (0.13 M vs 258 M parameters).
- Inference Efficiency:
- Fully parallel HAG + BiT-Flow: 10× real-time on a single A6000.
- Diffusion baselines: 50–100× slower.
HierSpeech++ is the first system to claim human-level zero-shot TTS and VC, with robustness to noisy prompts and with moderate (∼2.8k hours, 16 kHz) open-source data requirements.
6. Component Choices and Ablation Analyses
Ablation studies confirm that the hierarchical VAE with multi-level latent structure and bidirectional flows outperforms the flat VAE baseline (HierVST) in both expressiveness and zero-shot robustness. Key findings:
- AMP blocks (BigVGAN) over MRF: +0.05 Mel, +0.2 PESQ_wb, improved OOD robustness.
- Source–filter semantic encoder (SFE): +15% F0 consistency.
- Dual-audio encoder (DAE): +0.3 Mel, +0.2 PESQ, small VC EER increase if overpowered.
- Transformer flow + AdaLN-Zero: +0.1 Mel, +0.1 PESQ, +0.04 SECS.
- Bidirectional flow: +0.1 sMOS, minor reconstruction drop, VC EER ↓0.2%.
- Prosody distillation: Stronger prosody control in TTS (+0.2 pMOS).
- Unconditional generation (10% style drop): +0.05 sMOS, +0.04 EER.
- Data scale-up: Stable nMOS; EER stays low (1.23%–1.27%), and SECS increases from 0.862 to 0.883.
The cumulative effect is more expressive output (higher MOS, better F0 consistency) and robust adaptation under the zero-shot constraint.
7. Synthesis and Practical Implications
HierSpeech++ consolidates advances in conditional hierarchical VAEs, self-supervised speech representations, adversarial decoding, and efficient super-resolution. The architecture affords fast, robust, and high-fidelity speech generation in zero-shot contexts, rivaling or exceeding human naturalness and speaker similarity metrics with moderate data and resilience to real-world prompt noise. HierSpeech++ unifies a multi-level probabilistic modeling approach with pragmatic engineering choices to establish new standards in zero-shot TTS and VC, and offers a scalable, data-efficient solution for speech synthesis in diverse practical and research settings.