FastSpeech 2s: End-to-End TTS Architecture

Updated 10 February 2026
  • The paper introduces a novel end-to-end TTS architecture that bypasses intermediate mel-spectrograms using parallel waveform synthesis and adversarial training.
  • It employs explicit conditioning on duration, pitch, and energy via dedicated neural predictors to achieve fine-grained prosody control.
  • The system delivers over 50× real-time speedup with competitive audio quality (MOS ≳ 3.7) compared to traditional cascaded TTS models.

FastSpeech 2s is a fully non-autoregressive, end-to-end text-to-speech (TTS) architecture that generates speech waveforms directly from input text, bypassing intermediate mel-spectrogram representations and vocoder cascades. Introduced as an extension of FastSpeech 2, FastSpeech 2s utilizes parallelized inference for accelerated synthesis and is designed to capture both magnitude and phase information in waveforms through a novel adversarial and spectral loss regime. The system incorporates explicit conditioning on duration, pitch, and energy, each estimated by dedicated neural predictors, enabling fine-grained prosodic control. FastSpeech 2s achieves high mean opinion score (MOS ≳ 3.7) and synthesis speedup exceeding 50× real time while maintaining output quality competitive with established cascaded systems (Ren et al., 2020).

1. Architectural Foundations and System Modules

FastSpeech 2s is built upon a modular stack comprising several interlocking components:

  • Encoder: A sequence of feed-forward Transformer blocks, each integrating multi-head self-attention and one-dimensional convolutional sublayers with kernel sizes of 9 and 1. The encoder processes input phoneme embeddings, producing a latent sequence $H_p \in \mathbb{R}^{L \times d}$.
  • Variance Adaptor: This module injects prosodic conditioning by predicting and adding duration, pitch, and energy attributes to the encoder output. It includes:
    • Duration Predictor: Outputs log-duration estimates per encoder frame, trained to match phoneme-level durations extracted via forced alignment. Length regulation expands $H_p$ by repeating frames based on predicted durations.
    • Pitch Predictor: Predicts a multi-resolution "pitch spectrogram" via continuous wavelet transform (CWT) targets; inverse CWT reconstructs frame-level $F_0$ estimates.
    • Energy Predictor: Outputs per-frame energy, regressed as the $L_2$-norm of the STFT magnitude and embedded after quantization.
  • Waveform Decoder: A non-causal, dilated, gated convolutional stack (WaveNet-style), preceded by a transposed convolutional upsampler, maps the adapted hidden-state sequence to waveform samples in parallel.
  • Mel-Spectrogram Decoder (Training Only): Mirror of the encoder used for auxiliary supervision via mean absolute error (MAE) on mel-spectrogram reconstruction.
  • Discriminator (Training Only): A 10-layer non-causal dilated convolutional network (identical to the Parallel WaveGAN discriminator) used for adversarial training via least squares GAN (LSGAN) objectives.

Key architectural departures from FastSpeech 2 include direct waveform generation (replacing the mel + vocoder stack), auxiliary use of the mel decoder during training, and the integration of adversarial and multi-resolution spectral losses (Ren et al., 2020).
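The composition of these modules can be illustrated with a brief PyTorch-style sketch. The layer sizes, head counts, and predictor topology below are assumptions chosen for readability, not the authors' reference implementation:

```python
# Illustrative sketch of two core building blocks (feed-forward Transformer block and
# a shared variance-predictor topology). Dimensions and layer counts are assumptions.
import torch
import torch.nn as nn


class FFTBlock(nn.Module):
    """Feed-forward Transformer block: self-attention followed by 1D convolutions."""

    def __init__(self, d_model: int = 256, n_heads: int = 2, d_conv: int = 1024):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.conv = nn.Sequential(
            nn.Conv1d(d_model, d_conv, kernel_size=9, padding=4), nn.ReLU(),
            nn.Conv1d(d_conv, d_model, kernel_size=1),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, L, d_model)
        attn_out, _ = self.attn(x, x, x)
        x = self.norm1(x + attn_out)
        conv_out = self.conv(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + conv_out)


class VariancePredictor(nn.Module):
    """Two 1D conv layers with ReLU plus a linear head, reused for duration/pitch/energy."""

    def __init__(self, d_model: int = 256, d_hidden: int = 256, d_out: int = 1):
        super().__init__()
        self.convs = nn.Sequential(
            nn.Conv1d(d_model, d_hidden, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv1d(d_hidden, d_hidden, kernel_size=3, padding=1), nn.ReLU(),
        )
        self.proj = nn.Linear(d_hidden, d_out)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, L, d_model) -> (B, L, d_out)
        h = self.convs(x.transpose(1, 2)).transpose(1, 2)
        return self.proj(h)


# A phoneme encoder would stack several FFTBlocks over phoneme embeddings:
encoder = nn.Sequential(*[FFTBlock() for _ in range(4)])
h_p = encoder(torch.randn(1, 12, 256))  # (batch, phonemes, hidden) -> same shape
```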

2. End-to-End Parallel Text-to-Waveform Inference

FastSpeech 2s implements a fully parallel text-to-waveform pipeline, executing the following steps:

  1. Text Analysis: Convert input text to a phoneme sequence using a grapheme-to-phoneme tool.
  2. Phoneme Encoding: Map the phoneme sequence to embeddings, and pass through the Transformer encoder for latent state extraction ($H_p$).
  3. Variance Adaptation:
    • Predict log-durations, exponentiate, and repeat encoder frames to align with duration targets.
    • Predict CWT-based pitch representations, invert to time-domain $F_0$, quantize, and embed.
    • Predict frame-wise energy, quantize, and embed.
    • Aggregate all embeddings for the expanded sequence ($H_{exp}$).
  4. Waveform Synthesis: Decode $H_{exp}$ into waveform samples via the parallel waveform decoder.
  5. Inference Output: The discriminator and mel-spectrogram decoder are bypassed at inference; only the encoder, variance adaptor, and waveform decoder are executed.

The full process enables entirely parallelized waveform output, distinct from autoregressive sequential models (Ren et al., 2020).
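A minimal sketch of the length-regulation step (step 3 above) follows; the single-utterance shapes, simple rounding rule, and example durations are assumptions made for illustration:

```python
# Length regulation: expand phoneme-level hidden states to frame level using
# predicted durations. Batch size of 1 and plain rounding are illustrative choices.
import torch


def length_regulate(h_p: torch.Tensor, log_d_hat: torch.Tensor) -> torch.Tensor:
    """h_p: (L, d) encoder outputs; log_d_hat: (L,) predicted log-durations."""
    durations = torch.clamp(torch.round(torch.exp(log_d_hat)), min=0).long()
    return torch.repeat_interleave(h_p, durations, dim=0)  # (sum(durations), d)


h_p = torch.randn(4, 256)                                   # 4 phonemes, hidden size 256
log_d_hat = torch.log(torch.tensor([3.0, 8.0, 2.0, 5.0]))   # stand-in predictor output
h_exp = length_regulate(h_p, log_d_hat)
print(h_exp.shape)  # torch.Size([18, 256])
```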

3. Conditioning on Duration, Pitch, and Energy

Variance adaptation in FastSpeech 2s explicitly models prosodic variations, resolving ambiguities in the one-to-many mapping from text to speech:

  • Duration: Extracted by forced alignment (Montreal Forced Aligner), represented as log-duration, and predicted with MSE loss.
  • Pitch: Frame-level $F_0$ traces (WORLD vocoder) are interpolated and CWT-transformed to ten-scale spectral representations. The predictor learns to output matching spectral coefficients, with iCWT used for time-domain inversion at inference. After quantization, pitch is embedded and added to the hidden state.
  • Energy: Calculated as the $L_2$-norm of the magnitude STFT for each frame, then quantized and embedded.

This explicit variance conditioning enables flexible prosody, robust synthesis across speaker variations, and improved alignment between predicted and ground-truth pitch/energy distributions (Ren et al., 2020). Ablation studies demonstrate that removing pitch conditioning results in the largest degradation of subjective audio quality.
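The quantize-and-embed step for pitch and energy can be sketched as follows. The bin count, value ranges, and log-spaced pitch bins are assumptions, since the description above only specifies that values are quantized and embedded before being added to the hidden states:

```python
# Quantize frame-level pitch/energy into bins and add learned embeddings to the
# expanded hidden states. Bin count and value ranges are illustrative assumptions.
import math
import torch
import torch.nn as nn

N_BINS, D_MODEL = 256, 256
pitch_bins = torch.exp(torch.linspace(math.log(60.0), math.log(500.0), N_BINS - 1))
energy_bins = torch.linspace(0.0, 100.0, N_BINS - 1)
pitch_embed = nn.Embedding(N_BINS, D_MODEL)
energy_embed = nn.Embedding(N_BINS, D_MODEL)


def add_variance_embeddings(h_exp, f0, energy):
    """h_exp: (T, d) frame-level hidden states; f0, energy: (T,) frame-level values."""
    h = h_exp + pitch_embed(torch.bucketize(f0, pitch_bins))
    h = h + energy_embed(torch.bucketize(energy, energy_bins))
    return h


h_exp = torch.randn(18, D_MODEL)
f0 = torch.rand(18) * 300.0 + 80.0   # stand-in F0 contour in Hz
energy = torch.rand(18) * 50.0       # stand-in per-frame energy
h_cond = add_variance_embeddings(h_exp, f0, energy)
```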

4. Loss Functions and Training Strategy

The training regime for FastSpeech 2s combines supervised and adversarial objectives:

  • Mel-Spectrogram Reconstruction (Auxiliary, Training Only): $L_{mel} = \frac{1}{T \times 80} \sum_{t=1}^{T} \| Y_{mel}(t) - \hat{Y}_{mel}(t) \|_1$
  • Duration Prediction: $L_{dur} = \frac{1}{L} \sum_{i=1}^{L} (\log d_i^* - \log \hat{d}_i)^2$
  • Pitch Prediction (CWT Domain): $L_{pitch} = \frac{1}{TC} \sum_{t=1}^{T} \sum_{c=1}^{C} (W_{t,c}^* - \hat{W}_{t,c})^2$
  • Energy Prediction: $L_{energy} = \frac{1}{T} \sum_{t=1}^{T} (e_t^* - \hat{e}_t)^2$
  • Multi-Resolution STFT Loss: Combines $L_1$ losses on both amplitude and log amplitude across multiple STFT resolutions.
  • LSGAN Adversarial Loss: Standard least-squares adversarial losses for generator and discriminator as in Parallel WaveGAN.

Joint optimization of these losses is performed using Adam with a Noam learning rate schedule. All modules are trained jointly on LJSpeech, with the waveform decoder converging after approximately 600k steps (Ren et al., 2020).
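A sketch of the multi-resolution STFT term, in the spirit of Parallel WaveGAN, is shown below; the specific FFT sizes, hop lengths, and window lengths are common choices assumed for illustration rather than values taken from the paper:

```python
# Multi-resolution STFT loss: L1 on magnitude and on log-magnitude, averaged over
# several STFT configurations. Resolution settings are assumed typical values.
import torch

RESOLUTIONS = [(512, 128, 512), (1024, 256, 1024), (2048, 512, 2048)]  # (fft, hop, win)


def stft_magnitude(x, n_fft, hop, win):
    window = torch.hann_window(win, device=x.device)
    spec = torch.stft(x, n_fft, hop_length=hop, win_length=win,
                      window=window, return_complex=True)
    return spec.abs().clamp(min=1e-7)


def multi_resolution_stft_loss(y_hat, y):
    """Average L1 distances on amplitude and log-amplitude spectrograms."""
    loss = 0.0
    for n_fft, hop, win in RESOLUTIONS:
        m_hat = stft_magnitude(y_hat, n_fft, hop, win)
        m = stft_magnitude(y, n_fft, hop, win)
        loss = loss + (m_hat - m).abs().mean() + (m_hat.log() - m.log()).abs().mean()
    return loss / len(RESOLUTIONS)


# Example with 1-second random waveforms at an assumed 22.05 kHz sampling rate.
y, y_hat = torch.randn(22050), torch.randn(22050)
print(multi_resolution_stft_loss(y_hat, y))
```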

5. Empirical Results and Performance Metrics

FastSpeech 2s is evaluated on the LJSpeech dataset with the following outcomes:

  • Subjective Audio Quality: MOS 3.71 ± 0.09 (FastSpeech 2s), competitive with cascaded FastSpeech 2 + vocoder and TransformerTTS, but slightly behind ground truth (GT mels + PWG) at MOS 3.92 ± 0.08.
  • Inference Speed: Real-time factor (RTF) of 1.80×10⁻² for FastSpeech 2s, a 51.8× speedup relative to the autoregressive Transformer TTS baseline and the fastest among the compared systems.
  • Objective Prosody Metrics: Closest match to ground-truth pitch distribution moments (standard deviation, skewness, kurtosis) and a lower dynamic time warping (DTW) distance to ground-truth pitch contours.
  • Ablation Analysis: Removal of pitch or energy predictors significantly degrades subjective quality (–1.13 CMOS for pitch; –0.16 CMOS for energy), confirming the importance of explicit variance conditioning.

A summary of the key MOS and RTF figures is shown below:

Model                         MOS (LJSpeech test)   RTF           Speedup
Ground truth (recorded)       4.30 ± 0.07           –             –
GT mels + PWG                 3.92 ± 0.08           –             –
FastSpeech 2 (mel + PWG)      3.83 ± 0.08           1.95 × 10⁻²   47.8×
FastSpeech 2s (end-to-end)    3.71 ± 0.09           1.80 × 10⁻²   51.8×

(Ren et al., 2020)
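
Note that the speedup column is measured against the autoregressive Transformer TTS baseline rather than against wall-clock real time; the table's own figures imply a baseline RTF of roughly $9.32 \times 10^{-1}$:

$1.95 \times 10^{-2} \times 47.8 \approx 1.80 \times 10^{-2} \times 51.8 \approx 9.32 \times 10^{-1}$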

6. Significance, Limitations, and Broader Impact

FastSpeech 2s constitutes the first fully non-autoregressive, parallel text-to-waveform TTS framework integrating explicit variance control, adversarial spectral losses, and auxiliary supervision in a single end-to-end trainable system. By modeling both magnitude and phase, FastSpeech 2s reduces reliance on hand-engineered intermediate representations and separate vocoder networks. Parallel inference and efficient training enable deployment in latency-sensitive applications.

A plausible implication is that while FastSpeech 2s demonstrates high performance on neutral speech benchmarks (e.g., LJSpeech), its efficacy on highly expressive, multi-speaker, or code-switched corpora may depend on further architectural enhancements—such as explicit modeling of residual multimodality, as addressed by subsequent research using mixture-of-Gaussians decoders (e.g., TVC-GMM) (Kögel et al., 2023). Adoption of FastSpeech 2s in production pipelines will require careful consideration of prosody modeling, phase recovery, and domain-specific adaptation.

FastSpeech 2s is directly descended from FastSpeech 2, which introduced explicit variance adaptation and removed dependency on teacher-student distillation. In contrast to TransformerTTS and Tacotron 2, which utilize autoregressive mel-spectrogram prediction followed by vocoding, FastSpeech 2s offers fully parallel, end-to-end waveform synthesis. The architecture is compatible with techniques for overcoming over-smoothness and residual multimodality, such as trivariate-chain Gaussian mixtures (TVC-GMM), which can be incorporated into non-autoregressive TTS decoders to address expressivity and naturalness in expressive speech domains (Kögel et al., 2023).
