FastPitch: Parallel TTS with Prosodic Control
- FastPitch is a fully-parallel, non-autoregressive TTS architecture that offers explicit prosodic control for efficient, high-quality speech synthesis.
- It utilizes modular components including a text encoder, a variance adaptor with pitch and duration predictors, a length regulator, and a Mel-spectrogram decoder, all of which operate in parallel across the input sequence.
- Incremental FastPitch builds on this design by incorporating chunk-based streaming synthesis with cached context, enabling low-latency real-time applications.
FastPitch is a fully-parallel, non-autoregressive text-to-speech (TTS) architecture designed to synthesize Mel-spectrograms with high efficiency and explicit prosodic control. It achieves state-of-the-art synthesis speed by decoupling duration and pitch prediction from sequence generation, allowing for controllable, high-fidelity speech. More recent variants, such as Incremental FastPitch, introduce modifications enabling low-latency, chunk-based streaming synthesis suitable for real-time applications by constraining decoder context and introducing per-chunk state caching (Łańcucki, 2020, Du et al., 3 Jan 2024).
1. Architectural Principles of FastPitch
FastPitch is structured to factorize TTS synthesis into distinct modules for rapid and controllable Mel-spectrogram generation. The main pipeline consists of the following stages:
- Text (phoneme/grapheme) encoder: Converts the input symbol sequence into embeddings augmented with sine-cosine positional encoding, forming the encoded sequence $h$.
- Variance adaptor: Composed of a duration predictor (1D-CNN), a pitch predictor (1D-CNN), an optional energy predictor, and associated embedding layers. Both duration and pitch are modeled per input symbol.
- Length regulator: Expands encoded representations by predicted durations, producing a frame-level sequence.
- Mel-spectrogram decoder: A feed-forward Transformer stack (FFTr) conditioned on the expanded, pitch/energy-augmented representations.
- Vocoder: Transforms output Mel-spectrograms into audio waveforms (e.g., HiFi-GAN or WaveGlow) (Łańcucki, 2020, Du et al., 3 Jan 2024).
This architecture is fully parallelizable because duration and pitch predictions are performed in one pass, allowing all downstream expansions and transformations to operate simultaneously across time.
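The following PyTorch sketch wires these stages together to make the data flow concrete. It uses stand-in submodules (`nn.TransformerEncoder` in place of the FFTr stacks, simple MLP predictors) rather than the reference implementation, and the class name `FastPitchSketch` is hypothetical:

```python
import torch
import torch.nn as nn

class FastPitchSketch(nn.Module):
    """Minimal, batch-size-1 sketch of the FastPitch pipeline. Stand-in modules
    (nn.TransformerEncoder for the FFTr stacks, MLP predictors) replace the
    reference implementation; sinusoidal positional encodings are omitted."""

    def __init__(self, n_symbols=100, d_model=384, n_heads=2, d_ffn=1536, n_mels=80):
        super().__init__()
        self.embed = nn.Embedding(n_symbols, d_model)

        def fftr():  # stand-in for a 6-layer feed-forward Transformer stack
            layer = nn.TransformerEncoderLayer(d_model, n_heads, d_ffn,
                                               dropout=0.1, batch_first=True)
            return nn.TransformerEncoder(layer, num_layers=6)

        self.encoder, self.decoder = fftr(), fftr()
        self.dur_predictor = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(),
                                           nn.Linear(256, 1))
        self.pitch_predictor = nn.Sequential(nn.Linear(d_model, 256), nn.ReLU(),
                                             nn.Linear(256, 1))
        self.pitch_embed = nn.Linear(1, d_model)   # scalar pitch -> d_model
        self.mel_proj = nn.Linear(d_model, n_mels)

    def forward(self, symbols):                      # symbols: [1, n] int64
        h = self.encoder(self.embed(symbols))        # encoded sequence h: [1, n, d]
        dur = self.dur_predictor(h).squeeze(-1)      # per-symbol duration: [1, n]
        pitch = self.pitch_predictor(h).squeeze(-1)  # per-symbol pitch: [1, n]
        g = h + self.pitch_embed(pitch.unsqueeze(-1))  # add pitch embedding
        # Length regulator: repeat each symbol's vector d_i times (T = sum d_i).
        repeats = dur.squeeze(0).round().clamp(min=1).long()
        frames = torch.repeat_interleave(g.squeeze(0), repeats, dim=0).unsqueeze(0)
        mel = self.mel_proj(self.decoder(frames))    # Mel-spectrogram: [1, T, n_mels]
        return mel, dur, pitch
```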
2. Core Modules and Data Flow
Each component is architected for parallelism and data transformation efficiency:
| Stage | Description | Example Dimensions |
|---|---|---|
| Phoneme Encoder | Embedding + sine-cosine positional encoding + FFTr | $[n, 384]$ |
| Duration/Pitch Predictor | Two-layer 1D-CNN (kernel size 3), ReLU, LayerNorm, Dropout(0.1), Linear head | $[n, 1]$ per predictor |
| Pitch Embedding | Linear projection of scalar pitch/energy | $1 \to 384$ |
| Length Regulator | Expands $h$ by predicted durations $d$ | $[T, 384]$ with $T = \sum_i d_i$ |
| Decoder FFTr | 6 transformer blocks, MHA + position-wise FFN | $[T, 384]$ |
The typical sequence is:
- Input symbol sequences are embedded and position-encoded.
- Variance adaptor predicts durations and pitch per symbol.
- Pitch is embedded and added to encoder output; length regulator expands the sequence to match predicted frame durations.
- Decoder processes the regulated sequence to produce Mel-spectrograms (Łańcucki, 2020).
- Output is synthesized to waveform via a neural vocoder.
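Continuing the hypothetical `FastPitchSketch` above, a dummy forward pass traces these steps end to end (the vocoder stage is only indicated in a comment):

```python
import torch

# Continuing the hypothetical FastPitchSketch defined above (no vocoder stage).
model = FastPitchSketch()
symbols = torch.randint(0, 100, (1, 12))   # a dummy sequence of 12 symbol IDs
mel, durations, pitch = model(symbols)
print(pitch.shape)   # per-symbol pitch: torch.Size([1, 12])
print(mel.shape)     # [1, T, 80], with T equal to the sum of rounded durations
# The Mel-spectrogram would then be passed to a neural vocoder
# (e.g. HiFi-GAN or WaveGlow) to produce the waveform.
```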
3. Pitch and Duration Prediction
Pitch and duration are predicted with specialized 1D convolutional networks, each independently applied to encoder outputs. The pitch predictor operates as follows:
- Two 1D conv layers (384 → 256, 256 → 256), each followed by ReLU, LayerNorm, and Dropout(0.1).
- A final linear projection outputs scalar pitch per symbol.
- Pitch prediction loss is computed as mean-squared error (MSE) against the ground-truth pitch $p$, and duration loss uses MSE against ground-truth durations $d$ extracted from Tacotron 2 attention alignments (Łańcucki, 2020).
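A self-contained sketch of such a predictor, following the layer sizes listed above (the class name `ConvPredictor` and the batch-first tensor layout are choices made here, not taken from the papers):

```python
import torch
import torch.nn as nn

class ConvPredictor(nn.Module):
    """Pitch/duration predictor as described: two 1D convs (384->256, 256->256),
    each followed by ReLU, LayerNorm and Dropout(0.1), then a scalar linear head."""
    def __init__(self, d_in=384, d_hidden=256, kernel=3, p_drop=0.1):
        super().__init__()
        self.conv1 = nn.Conv1d(d_in, d_hidden, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(d_hidden)
        self.norm2 = nn.LayerNorm(d_hidden)
        self.drop = nn.Dropout(p_drop)
        self.head = nn.Linear(d_hidden, 1)

    def forward(self, h):                                  # h: [B, n, d_in]
        x = self.conv1(h.transpose(1, 2)).transpose(1, 2)  # [B, n, d_hidden]
        x = self.drop(self.norm1(torch.relu(x)))
        x = self.conv2(x.transpose(1, 2)).transpose(1, 2)
        x = self.drop(self.norm2(torch.relu(x)))
        return self.head(x).squeeze(-1)                    # [B, n] scalar per symbol

# Example: predict one scalar (pitch or duration) per input symbol.
h = torch.randn(2, 17, 384)               # batch of 2, 17 symbols, d = 384
print(ConvPredictor()(h).shape)           # torch.Size([2, 17])
```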
For each input symbol $i$, the pitch embedding is added to the encoder output, $g_i = h_i + \mathrm{PitchEmbedding}(p_i)$, and the length regulator duplicates $g_i$ for $d_i$ frames, so the expanded sequence has length $T = \sum_i d_i$ (ground-truth values during training, predicted values at inference).
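A toy example of that duplication step, showing that the output length equals the sum of the predicted durations:

```python
import torch

# Toy length regulator: three symbols with durations d = [2, 1, 3] expand to
# T = 6 frames; each g_i is repeated d_i times.
g = torch.tensor([[1.0], [2.0], [3.0]])          # [n=3, d_model=1] for readability
d = torch.tensor([2, 1, 3])
frames = torch.repeat_interleave(g, d, dim=0)    # [T=6, 1]
print(frames.squeeze(-1))                        # tensor([1., 1., 2., 3., 3., 3.])
print(frames.shape[0] == d.sum().item())         # True
```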
4. Mel-Spectrogram Decoder and Non-Autoregressive Synthesis
The decoder consists of a 6-layer feed-forward Transformer stack:
- Each block contains multi-head self-attention (2 heads of size 192) with Dropout(0.1) and a position-wise FFN (1D conv 384 → 1536, ReLU, 1D conv 1536 → 384, Dropout(0.1), LayerNorm).
- Output is projected linearly to Mel bins (e.g., 80).
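A sketch of one such decoder block; the post-norm residual arrangement and the kernel size 3 for the FFN convolutions are assumptions, since they are not spelled out above:

```python
import torch
import torch.nn as nn

class FFTBlock(nn.Module):
    """One decoder FFT block: 2-head self-attention (d_model=384, head size 192)
    plus a position-wise FFN built from 1D convolutions (384 -> 1536 -> 384).
    Post-norm residuals and kernel size 3 are assumptions of this sketch."""
    def __init__(self, d_model=384, n_heads=2, d_ffn=1536, kernel=3, p_drop=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads, dropout=p_drop,
                                          batch_first=True)
        self.conv1 = nn.Conv1d(d_model, d_ffn, kernel, padding=kernel // 2)
        self.conv2 = nn.Conv1d(d_ffn, d_model, kernel, padding=kernel // 2)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.drop = nn.Dropout(p_drop)

    def forward(self, x):                          # x: [B, T, d_model]
        a, _ = self.attn(x, x, x, need_weights=False)
        x = self.norm1(x + self.drop(a))           # residual + post-norm
        f = self.conv2(torch.relu(self.conv1(x.transpose(1, 2)))).transpose(1, 2)
        return self.norm2(x + self.drop(f))

# A 6-layer decoder is a stack of these blocks followed by a linear Mel projection.
decoder = nn.Sequential(*[FFTBlock() for _ in range(6)])
mel_head = nn.Linear(384, 80)
frames = torch.randn(1, 120, 384)                 # [B, T, d_model]
print(mel_head(decoder(frames)).shape)            # torch.Size([1, 120, 80])
```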
All operations are non-autoregressive: given predicted durations and pitch, the entire frame sequence is synthesized simultaneously. On NVIDIA A100 GPUs, FastPitch reaches a real-time factor above 900× for Mel-spectrogram generation, and end-to-end speech generation (with WaveGlow) attains 63× real-time (Łańcucki, 2020).
5. Incremental FastPitch: Chunk-Based Streaming Synthesis
Incremental FastPitch modifies the decoder for truly incremental, low-latency streaming:
- Chunk-based FFT blocks: Each decoder block processes short, non-overlapping time chunks of length $S$ (e.g., $S = 30$ frames), using states cached from the most recent $P$ frames of the previous chunk (e.g., $P = 5$).
- Per-chunk multi-head attention: For each chunk, cached past keys/values are concatenated with the current chunk's keys/values, and current-chunk queries attend over this concatenation. Attention masks restrict each query to at most $P$ past frames plus the causal positions within its own chunk, enforced by a causal-with-lookback mask $M$.
- Causal convolutional FFNs: Within each chunk block, the FFN consists of two causal 1D convolutions, again with per-chunk caching of the convolutions' left context.
- Caching mechanism: Past key, value, and convolution states are truncated per layer to the past-context length $P$ (attention) or to the convolutions' receptive field (FFN), eliminating the need to recompute earlier output (Du et al., 3 Jan 2024).
The per-chunk mask for self-attention can be written, for a chunk starting at frame $c$, as

$$M_{t,\tau} = \begin{cases} 0, & c - P \le \tau \le t \\ -\infty, & \text{otherwise,} \end{cases}$$

so a query at frame $t$ attends to the $P$ cached frames and to the causal positions of its own chunk. This structure ensures temporal continuity without overlap while preserving computational efficiency: per-chunk attention cost scales as $O(S(S+P))$, independent of utterance length.
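The following single-head, unbatched sketch illustrates one per-chunk attention step with key/value caching and the causal-with-lookback mask described above (the function name and tensor layout are illustrative, not the paper's code):

```python
import torch

def chunk_self_attention(q, k, v, kv_cache, past=5):
    """One per-chunk attention step (single head, unbatched, illustrative only).
    q, k, v: [S, d] projections for the current chunk; kv_cache holds at most
    `past` key/value frames tailed from the previous chunk, or None."""
    k_past, v_past = kv_cache if kv_cache is not None else (k[:0], v[:0])
    K = torch.cat([k_past, k], dim=0)              # [P + S, d]
    V = torch.cat([v_past, v], dim=0)
    S, P = q.size(0), k_past.size(0)
    # Causal-with-lookback mask: query t sees all P cached frames plus
    # current-chunk positions <= t.
    key_pos = torch.arange(P + S)                  # 0..P-1 cached, P..P+S-1 current
    qry_pos = torch.arange(S).unsqueeze(1) + P
    mask = torch.zeros(S, P + S)
    mask[key_pos.unsqueeze(0) > qry_pos] = float("-inf")
    attn = torch.softmax(q @ K.t() / K.size(-1) ** 0.5 + mask, dim=-1)
    out = attn @ V                                 # [S, d]
    new_cache = (K[-past:], V[-past:])             # tail the cache for the next chunk
    return out, new_cache

# Stream an utterance in chunks of S=30 frames with P=5 cached frames.
d, S = 256, 30
cache = None
for chunk_idx in range(3):
    q = k = v = torch.randn(S, d)                  # stand-ins for learned projections
    y, cache = chunk_self_attention(q, k, v, cache, past=5)
    print(chunk_idx, y.shape, cache[0].shape)      # (30, 256) out, (5, 256) cache
```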
6. Training Objectives and Receptive Field Masking
All losses mirror those of parallel FastPitch, with the critical difference that the decoder attends only to chunk-limited context:
- Losses: $\mathcal{L} = \lVert \hat{y} - y \rVert_2^2 + \alpha \lVert \hat{p} - p \rVert_2^2 + \gamma \lVert \hat{d} - d \rVert_2^2$, i.e., MSE terms over Mel-spectrogram frames, per-symbol pitch, and per-symbol durations.
- The total loss may additionally include an energy MSE term with its own weight (Du et al., 3 Jan 2024).
- Masking during training: The decoder's mask mimics inference-time chunk constraints. Two modes are used:
- Static: Chunk size $S$ and past context $P$ are fixed for all examples.
- Dynamic: $S$ and $P$ are sampled randomly per batch, with the past context drawn from a set of fixed values or set to "all" (the entire history); this forces robustness to varying chunk/past horizons (Du et al., 3 Jan 2024). A sketch of both modes follows at the end of this section.
Losses are computed chunk-wise but summed over the full utterance.
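A sketch of how such receptive-field masks could be generated for full-utterance training, covering both modes; the candidate chunk/past values in the dynamic mode are placeholders, not the values used by Du et al.:

```python
import random
import torch

def chunked_attention_mask(T, chunk, past):
    """Boolean mask [T, T] (True = disallowed): frame t may attend to frames tau
    with chunk_start - past <= tau <= t, i.e. causal within its chunk plus `past`
    lookback frames. past=None means unlimited history ("all")."""
    t = torch.arange(T).unsqueeze(1)       # query index
    tau = torch.arange(T).unsqueeze(0)     # key index
    chunk_start = (t // chunk) * chunk
    allowed = tau <= t
    if past is not None:
        allowed &= tau >= chunk_start - past
    return ~allowed

# Static mode: one fixed (chunk, past) pair, e.g. 30 frames with 5 past frames.
static_mask = chunked_attention_mask(T=120, chunk=30, past=5)

# Dynamic mode: sample the horizon per batch so the model stays robust to the
# chunk/past setting chosen at inference (candidate sets below are placeholders,
# not the values from the paper).
chunk = random.choice([10, 30, 60])
past = random.choice([0, 5, 30, None])     # None stands for "all" past frames
dynamic_mask = chunked_attention_mask(T=120, chunk=chunk, past=past)
print(static_mask.shape, dynamic_mask.float().mean())
```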
7. Architectural Hyperparameters and Practical Considerations
Key architectural parameters for standard and incremental FastPitch are detailed below:
| Parameter | FastPitch (Łańcucki, 2020) | Incremental FastPitch (Du et al., 3 Jan 2024) |
|---|---|---|
| Decoder Layers | 6 | 6 |
| Model Dim. ($d$) | 384 | 256 |
| FFN Dim. | 1536 | 1024 |
| Attention Heads | 2 (head size 192) | 2 (head size 128) |
| Chunk Size ($S$) | N/A | 30 frames (120ms) |
| Past Cache ($P$) | N/A | 5 frames (20ms) |
| Causal FFN | N/A | 2 causal 1D convs with per-chunk state caching |
A crucial observation is that the only structural change enabling streaming is the replacement of full-sequence Transformer blocks with streaming, cache-based chunk FFT blocks. The encoder, pitch/duration predictors, and all loss formulations remain as in the original FastPitch (Du et al., 3 Jan 2024).
Evaluation demonstrates parity in synthesis quality (MOS 4.18 vs. 4.19), together with a substantial reduction in first-chunk latency, establishing Incremental FastPitch as a robust solution for high-quality real-time TTS.
References:
- Łańcucki, A. (2020). FastPitch: Parallel Text-to-Speech with Pitch Prediction.
- Du et al. (3 Jan 2024). Incremental FastPitch: Chunk-based High Quality Text to Speech.