WaveGrad 2: Diffusion-Based End-to-End TTS
- WaveGrad 2 is a diffusion-based text-to-speech model that directly refines Gaussian noise into high-fidelity audio from text-derived phoneme sequences.
- It employs a non-autoregressive architecture with a Tacotron-style encoder, duration-informed Gaussian upsampling, and an iterative WaveGrad decoder for efficient and end-to-end synthesis.
- The model attains MOS ratings approaching the state of the art, with a tunable speed-quality tradeoff that makes it suitable for real-time and on-device TTS applications.
WaveGrad 2 is a non-autoregressive, diffusion-based generative model for end-to-end text-to-speech (TTS) synthesis. Unlike the original WaveGrad, which conditions solely on mel-spectrograms, WaveGrad 2 synthesizes the raw audio waveform directly from text-level phoneme sequences via iterative gradient-based refinement, incorporating recent advances from score-based generative modeling and diffusion probabilistic models. The model combines a Tacotron-style encoder, a duration-informed Gaussian upsampling module, and a WaveGrad decoder that refines an initial Gaussian noise sequence into high-fidelity audio through a controlled denoising process. WaveGrad 2 approaches state-of-the-art audio quality, exhibits a natural speed-quality trade-off, and offers an efficient, end-to-end alternative to established TTS pipelines.
1. Architectural Foundations and Conditioning Strategy
WaveGrad 2 transitions from the classical two-stage TTS pipeline to a fully end-to-end, non-autoregressive synthesis framework. The architecture is composed of three principal components:
- Phoneme Sequence Encoder: Input phonemes (augmented with silence and EOS tokens) are embedded and processed through convolutional layers with batch normalization and dropout, followed by a bidirectional LSTM with ZoneOut regularization. This provides temporally contextualized hidden states aligned to the sequence of linguistic tokens.
- Gaussian Upsampling Layer: To bridge the temporal resolution mismatch between the encoded phoneme sequence and the high-rate audio waveform, a Gaussian upsampling mechanism (inspired by Non-Attentive Tacotron) relies on duration information for alignment. During training, ground-truth durations are used; at inference, durations are predicted by an external duration predictor. This upsampling produces conditioning features at the waveform frame rate (a minimal sketch appears at the end of this section).
- Iterative WaveGrad Decoder: The decoder (akin to the original WaveGrad architecture) comprises a stack of downsampling ("DBlock") and upsampling ("UBlock") convolutional blocks. It employs FiLM (Feature-wise Linear Modulation) conditioning, modulating internal activations with scale and shift parameters derived from the upsampled encoder output as well as the current noise level (a minimal FiLM sketch also appears at the end of this section). The decoder iteratively refines a noise-initialized waveform using the predicted gradient (score) of the target conditional density.
The entire synthesis pipeline bypasses the generation of intermediate spectrogram representations, allowing WaveGrad 2 to directly model $p(\mathbf{y} \mid \mathbf{x})$, where $\mathbf{y}$ is the waveform and $\mathbf{x}$ encodes linguistic information at audio resolution.
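To make the two most distinctive conditioning mechanisms concrete, the following minimal Python sketch implements duration-informed Gaussian upsampling and a FiLM layer. Shapes, names (`gaussian_upsample`, `FiLM`), and the per-token width input are illustrative assumptions, not the published implementation; in the actual model, token widths are predicted internally and FiLM parameters also depend on the noise level.

```python
import torch

def gaussian_upsample(h, durations, sigma):
    """Duration-informed Gaussian upsampling (sketch, not the exact paper code).

    h:         [B, T, C] encoder hidden states (one per phoneme token)
    durations: [B, T]    per-token durations, in output frames
    sigma:     [B, T]    per-token Gaussian widths (standard deviations)
    returns    [B, F, C] conditioning features at the waveform frame rate
    """
    ends = torch.cumsum(durations, dim=1)          # cumulative end frame of each token
    centers = ends - 0.5 * durations               # token centers on the frame axis

    num_frames = int(durations.sum(dim=1).max().item())
    t = torch.arange(num_frames, device=h.device, dtype=h.dtype) + 0.5

    # Gaussian responsibility of each token for each output frame,
    # normalized across tokens: w[b, f, i] ∝ exp(-(t_f - c_i)^2 / (2 σ_i^2)).
    dist = t[None, :, None] - centers[:, None, :]  # [B, F, T]
    weights = torch.softmax(-0.5 * (dist / sigma[:, None, :]) ** 2, dim=2)
    return weights @ h                             # [B, F, C]

class FiLM(torch.nn.Module):
    """Feature-wise Linear Modulation: scale/shift derived from conditioning.
    WaveGrad 2's decoder also folds the noise level into the modulation;
    this sketch conditions on the upsampled features alone."""
    def __init__(self, cond_channels, target_channels):
        super().__init__()
        self.proj = torch.nn.Conv1d(cond_channels, 2 * target_channels, kernel_size=1)

    def forward(self, activations, cond):
        # activations: [B, C, F]; cond: [B, C_cond, F]
        scale, shift = self.proj(cond).chunk(2, dim=1)
        return scale * activations + shift
```

During training the durations come from ground-truth alignments; at inference they are supplied by the duration predictor described above.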
2. Diffusion Process and Gradient-Based Inference
WaveGrad 2 leverages a score-matching and diffusion probabilistic model formulation to facilitate the progressive denoising of a raw signal initialized from Gaussian noise:
- Score Network Learning: The central task is to estimate the gradient of the log-conditional data density,

$$ s(\tilde{\mathbf{y}} \mid \mathbf{x}) = \nabla_{\tilde{\mathbf{y}}} \log p(\tilde{\mathbf{y}} \mid \mathbf{x}), $$

where $\mathbf{x}$ is the upsampled, aligned conditioning sequence. Rather than predicting clean waveforms, the network is trained via a denoising score matching objective to predict the additive noise $\boldsymbol{\epsilon}$ embedded in perturbed samples,

$$ \tilde{\mathbf{y}} = \sqrt{\bar{\alpha}}\, \mathbf{y}_0 + \sqrt{1 - \bar{\alpha}}\, \boldsymbol{\epsilon}, \qquad \boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, \mathbf{I}), $$

with the noise level $\sqrt{\bar{\alpha}}$ sampled from a predetermined noise schedule.
- Objective Function: The L₁ loss between the predicted and true noise is minimized:

$$ \mathcal{L} = \mathbb{E}_{\bar{\alpha}, \boldsymbol{\epsilon}} \Big[ \big\| \boldsymbol{\epsilon}_\theta\big(\tilde{\mathbf{y}}, \mathbf{x}, \sqrt{\bar{\alpha}}\big) - \boldsymbol{\epsilon} \big\|_1 \Big]. $$

This loss term provides robust and stable training (see the training-step sketch at the end of this section).
- Iterative Refinement Step: At inference, a Langevin-style update decomposes waveform recovery into sequential denoising steps. At each iteration $n = N, \dots, 1$, the model computes

$$ \mathbf{y}_{n-1} = \frac{1}{\sqrt{\alpha_n}} \Big( \mathbf{y}_n - \frac{1 - \alpha_n}{\sqrt{1 - \bar{\alpha}_n}}\, \boldsymbol{\epsilon}_\theta\big(\mathbf{y}_n, \mathbf{x}, \sqrt{\bar{\alpha}_n}\big) \Big) + \sigma_n \mathbf{z}, $$

where $\mathbf{z} \sim \mathcal{N}(\mathbf{0}, \mathbf{I})$, and $\alpha_n$, $\bar{\alpha}_n = \prod_{s \le n} \alpha_s$, and $\sigma_n$ are noise-schedule parameters. The process removes noise in a controlled, schedule-driven manner, steered by the learned gradient of the log-probability (see the sampling sketch at the end of this section).
Repeated application of this update guides the noise-initialized waveform toward a mode of the target conditional distribution defined by the linguistic condition $\mathbf{x}$.
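Put together, the perturbation and the L₁ objective amount to a short training step. The sketch below assumes a `model(noisy_audio, cond, noise_level)` call that returns the predicted noise, and samples a continuous noise level between adjacent points of a discretized $\sqrt{\bar{\alpha}}$ schedule, in the spirit of WaveGrad-style continuous-level training; the exact schedule handling is an assumption.

```python
import torch
import torch.nn.functional as F

def training_step(model, y0, cond, level_bounds):
    """One denoising score matching step with the L1 objective (sketch).

    y0:           [B, samples] clean waveform segment
    cond:         [B, F, C]    upsampled linguistic conditioning
    level_bounds: (low, high)  1-D tensors of adjacent sqrt(alpha_bar)
                  schedule points, so a continuous level can be drawn
                  uniformly from a randomly chosen segment.
    """
    low, high = level_bounds
    b = y0.shape[0]

    # Draw a continuous noise level sqrt(alpha_bar) per example.
    idx = torch.randint(0, low.numel(), (b,), device=y0.device)
    u = torch.rand(b, device=y0.device)
    sqrt_alpha_bar = low[idx] + u * (high[idx] - low[idx])       # [B]

    # Perturb: y_tilde = sqrt(a) * y0 + sqrt(1 - a) * eps.
    eps = torch.randn_like(y0)
    s = sqrt_alpha_bar[:, None]
    y_tilde = s * y0 + (1.0 - s**2).sqrt() * eps

    # Predict the injected noise and minimize its L1 distance to eps.
    return F.l1_loss(model(y_tilde, cond, sqrt_alpha_bar), eps)
```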
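The update rule likewise translates directly into a sampling loop. The sketch below uses standard DDPM-style ancestral sampling with the same assumed `model` signature as the training sketch; the length of `betas` is the iteration count behind the speed-quality trade-off discussed in the next section.

```python
import torch

@torch.no_grad()
def sample(model, cond, num_samples, betas):
    """Iterative refinement from Gaussian noise (ancestral-sampling sketch).

    cond:  [B, F, C] upsampled linguistic conditioning
    betas: [N] noise schedule; a shorter schedule trades fidelity for speed.
    """
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)

    y = torch.randn(cond.shape[0], num_samples, device=cond.device)
    for n in reversed(range(len(betas))):
        level = alpha_bars[n].sqrt().expand(y.shape[0])
        eps_hat = model(y, cond, level)

        # Posterior mean: subtract the predicted noise component, rescale.
        y = (y - (1 - alphas[n]) / (1 - alpha_bars[n]).sqrt() * eps_hat) \
            / alphas[n].sqrt()

        # Inject schedule-scaled Gaussian noise on all but the final step.
        if n > 0:
            sigma = ((1 - alpha_bars[n - 1]) / (1 - alpha_bars[n])
                     * betas[n]).sqrt()
            y = y + sigma * torch.randn_like(y)
    return y.clamp(-1.0, 1.0)
```

Running with `len(betas) == 50` instead of 1000 (using a schedule tuned for the shorter run) corresponds to the roughly 0.07 MOS degradation reported in the next section.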
3. Performance, Evaluation Metrics, and Trade-off Mechanisms
WaveGrad 2 achieves high subjective and objective synthesis quality under various evaluation scenarios:
- Mean Opinion Score (MOS): Configurations with wide encoders (1024 or 2048 channels) and the large WaveGrad decoder, run for 1000 refinement iterations, obtained MOS values of 4.37–4.43 (on a 5-point scale), closely matching Tacotron 2 + WaveRNN (MOS 4.49).
- Speed-Quality Tradeoff: By adjusting the number of denoising iterations at inference, practitioners can choose between faster generation and maximal fidelity. Experiments report only a 0.07 MOS degradation when reducing the number of refinement steps from 1000 to 50.
- Ablation Analysis: Larger sampling window sizes during training yield measurable quality improvements (e.g., increasing from 0.8s to 3.2s windows raised MOS from 3.80 to 3.88), and increasing decoder size improves synthesis quality more substantially than increasing encoder width. Regularization methods (SpecAugment-like masking on internal resampled features) and multi-task learning (e.g., predicting mel-spectrograms as an auxiliary target) provide minor or negligible robustness/fidelity gains.
4. Technical and Mathematical Formulation
WaveGrad 2 is grounded in the mathematical apparatus of modern diffusion TTS frameworks. Key formulations include:
- Score Definition:

$$ s(\tilde{\mathbf{y}} \mid \mathbf{x}) = \nabla_{\tilde{\mathbf{y}}} \log p(\tilde{\mathbf{y}} \mid \mathbf{x}) $$

- Diffusion Forward Process:

$$ q(\mathbf{y}_n \mid \mathbf{y}_0) = \mathcal{N}\big(\mathbf{y}_n;\ \sqrt{\bar{\alpha}_n}\,\mathbf{y}_0,\ (1 - \bar{\alpha}_n)\,\mathbf{I}\big) $$

- Training Loss:

$$ \mathcal{L} = \mathbb{E}_{\bar{\alpha}, \boldsymbol{\epsilon}} \Big[ \big\| \boldsymbol{\epsilon}_\theta\big(\tilde{\mathbf{y}}, \mathbf{x}, \sqrt{\bar{\alpha}}\big) - \boldsymbol{\epsilon} \big\|_1 \Big] $$

- Inference Update:

$$ \mathbf{y}_{n-1} = \frac{1}{\sqrt{\alpha_n}} \Big( \mathbf{y}_n - \frac{1 - \alpha_n}{\sqrt{1 - \bar{\alpha}_n}}\, \boldsymbol{\epsilon}_\theta\big(\mathbf{y}_n, \mathbf{x}, \sqrt{\bar{\alpha}_n}\big) \Big) + \sigma_n \mathbf{z} $$
These formulas define the iterative restoration and sampling procedure and encode the mechanisms of model training and generation.
5. Comparative Analysis and Relational Context
- Advances over WaveGrad: The principal innovation is direct conditioning on linguistic sequences, bypassing intermediate spectrogram prediction and folding acoustic modeling and vocoding into a single, non-autoregressive network. This reduces error propagation and removes the dependency on an external acoustic model.
- Comparison to Autoregressive and Non-Autoregressive Systems: WaveGrad 2 matches the audio quality of top-tier autoregressive models while offering substantially lower inference latency—a benefit previously restricted to adversarial non-autoregressive vocoders, which have historically underperformed in naturalness metrics.
- Ablation on Model Size: Decoder scaling provides tangible improvements, whereas aggressive encoder scaling returns only modest gains, confirming the decoder’s pivotal role in waveform refinement.
6. Extensions, Fine-Tuning, and Resource-Optimized TTS
Recent work explores reinforcement learning from human feedback (RLHF) to further align WaveGrad 2 outputs with perceived naturalness, while mitigating the inefficiency of many denoising steps (Chen et al., 5 Aug 2025):
- Diffusion Loss-Guided Policy Optimization (DLPO): Fine-tunes WaveGrad 2 with an RL objective that combines a human preference reward (measured with naturalness predictors such as UTMOS) and a regularization term reflecting the original denoising diffusion loss (a conceptual sketch follows this list). This joint optimization:
- Improves UTMOS from 2.9 to 3.65 and NISQA from 3.74 to 4.02.
- Maintains low word error rate (1.2%).
- Achieves a 67% subjective preference rate over standard WaveGrad 2.
- Resource-Constrained Deployment: The combination of non-autoregressive inference, adjustable iteration counts, and RLHF-based regularization in DLPO enables WaveGrad 2 deployments on real-time and on-device TTS systems without significant quality compromise.
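As a rough conceptual sketch of the DLPO objective described above: a preference reward on sampled audio is maximized while the original diffusion loss regularizes the fine-tuned model toward its pretrained behavior. The `reward_model` handle, the `beta` weight, and the reuse of the `training_step` and `sample` sketches from Section 2 are illustrative assumptions, and the policy-gradient estimator over the denoising trajectory used in practice is omitted.

```python
import torch

def dlpo_objective(model, reward_model, y0, cond, level_bounds,
                   betas, num_samples, beta=1.0):
    """Diffusion Loss-Guided Policy Optimization, conceptually (sketch).

    Combines a naturalness reward (e.g., a UTMOS-style predictor) on
    generated audio with the pretraining diffusion loss as a regularizer.
    """
    # Preference term: reward the predicted naturalness of sampled speech.
    # In practice this term is optimized with a policy-gradient estimator,
    # since sampling is not differentiated through; shown only conceptually.
    audio = sample(model, cond, num_samples, betas)   # sampler sketch above
    reward = reward_model(audio).mean()

    # Regularization term: the original denoising diffusion loss keeps
    # the fine-tuned model close to the pretrained WaveGrad 2.
    diff_loss = training_step(model, y0, cond, level_bounds)

    # Minimize negative reward plus the weighted diffusion loss.
    return -reward + beta * diff_loss
```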
7. Resources, Benchmarks, and Future Directions
- Experimental Materials: Audio samples produced by WaveGrad 2 are provided at https://wavegrad.github.io/v2 for qualitative assessment.
- Research Trajectory: Current trends involve integrating phase recovery and adaptation techniques (e.g., Griffin-Lim based post-processing as in GLA-Grad (Liu et al., 9 Feb 2024)), robust inference under reduced iteration counts (e.g., InferGrad (Chen et al., 2022)), and efficient adaptation/fine-tuning with human-in-the-loop or self-supervised objectives.
- Open Questions: The impact of conditioning granularity, further decoder architectural innovations, integration with more granular prosodic and expressive cues, and adaptive inference scheduling remain areas of active investigation.
WaveGrad 2 systematizes the integration of score-based diffusion synthesis into text-to-waveform TTS, achieving high-fidelity, efficient, and practical speech generation suitable for both research and real-world TTS applications.