FastPitch: Parallel TTS with Prosodic Control

Updated 2 December 2025
  • FastPitch is a fully-parallel, non-autoregressive TTS architecture that offers explicit prosodic control for efficient, high-quality speech synthesis.
  • It combines modular components, including a text encoder, a variance adaptor with pitch and duration predictors, a length regulator, and a mel-spectrogram decoder, all of which operate in parallel across the input sequence.
  • Incremental FastPitch builds on this design by incorporating chunk-based streaming synthesis with cached context, enabling low-latency real-time applications.

FastPitch is a fully-parallel, non-autoregressive text-to-speech (TTS) architecture designed to synthesize Mel-spectrograms with high efficiency and explicit prosodic control. It achieves state-of-the-art synthesis speed by decoupling duration and pitch prediction from sequence generation, allowing for controllable, high-fidelity speech. More recent variants, such as Incremental FastPitch, introduce modifications enabling low-latency, chunk-based streaming synthesis suitable for real-time applications by constraining decoder context and introducing per-chunk state caching (Łańcucki, 2020, Du et al., 3 Jan 2024).

1. Architectural Principles of FastPitch

FastPitch is structured to factorize TTS synthesis into distinct modules for rapid and controllable Mel-spectrogram generation. The main pipeline consists of the following stages:

  • Text (phoneme/grapheme) encoder: Converts sequence input into embeddings augmented by sine-cosine positional encoding, forming the encoded sequence $h \in \mathbb{R}^{n \times d}$.
  • Variance adaptor: Composed of a duration predictor (1D-CNN), a pitch predictor (1D-CNN), an optional energy predictor, and associated embedding layers. Both duration and pitch are modeled per input symbol.
  • Length regulator: Expands encoded representations by predicted durations, producing a frame-level sequence.
  • Mel-spectrogram decoder: A feed-forward Transformer stack (FFTr) conditioned on the expanded, pitch/energy-augmented representations.
  • Vocoder: Transforms output Mel-spectrograms into audio waveforms (e.g., HiFi-GAN or WaveGlow) (Łańcucki, 2020, Du et al., 3 Jan 2024).

This architecture is fully parallelizable because duration and pitch predictions are performed in one pass, allowing all downstream expansions and transformations to operate simultaneously across time.

2. Core Modules and Data Flow

Each component is architected for parallelism and data transformation efficiency:

| Stage | Description | Example Dimensions |
| --- | --- | --- |
| Phoneme Encoder | Embedding + positional encoding + FFTr | $d_\text{model} = 384$ |
| Duration/Pitch Predictor | Two-layer 1D-CNN (kernel size 3), ReLU, LayerNorm, Dropout(0.1), linear head | $384 \rightarrow 256 \rightarrow 256 \rightarrow 1$ |
| Pitch Embedding | Linear projection of scalar pitch/energy | $384 \times 1$ (to $d_\text{model}$) |
| Length Regulator | Expands $g_i$ by $\hat{d}_i \in \mathbb{N}$ | Output $[T, d_\text{model}]$ |
| Decoder FFTr | 6 Transformer blocks, MHA + position-wise FFN | $d_\text{model} = 384$ |

The typical sequence is:

  1. Input symbol sequences are embedded and position-encoded.
  2. Variance adaptor predicts durations and pitch per symbol.
  3. Pitch is embedded and added to encoder output; length regulator expands the sequence to match predicted frame durations.
  4. Decoder processes the regulated sequence to produce Mel-spectrograms (Łańcucki, 2020).
  5. Output is synthesized to waveform via a neural vocoder.
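
This flow can be made concrete with a short sketch. The following is an illustrative, minimal rendering of the pipeline in PyTorch, assuming batch size 1 and placeholder modules (`encoder`, `duration_predictor`, `pitch_predictor`, `pitch_embedding`, `decoder`, `mel_proj`); it is not the reference implementation.

```python
import torch

def fastpitch_forward(symbols, encoder, duration_predictor, pitch_predictor,
                      pitch_embedding, decoder, mel_proj):
    """Sketch of one parallel FastPitch inference pass (batch size 1)."""
    # 1. Embed and position-encode the input symbols: [1, n, d_model]
    h = encoder(symbols)

    # 2. Predict per-symbol durations (in frames) and pitch in a single pass
    d_hat = duration_predictor(h).round().clamp(min=0).long()    # [1, n]
    p_hat = pitch_predictor(h)                                    # [1, n]

    # 3. Embed pitch, add it to the encoder output, then expand each symbol
    #    by its predicted duration (length regulation)
    g = h + pitch_embedding(p_hat.unsqueeze(-1))                  # [1, n, d_model]
    frames = torch.repeat_interleave(g, d_hat[0], dim=1)          # [1, T, d_model]

    # 4. Decode all frames in parallel and project to Mel bins
    return mel_proj(decoder(frames))                              # [1, T, n_mels]
```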

3. Pitch and Duration Prediction

Pitch and duration are predicted with specialized 1D convolutional networks, each independently applied to encoder outputs. The pitch predictor operates as follows:

  • Two 1D conv layers (384 → 256, 256 → 256), each followed by ReLU, LayerNorm, and Dropout(0.1).
  • A final linear projection outputs scalar pitch per symbol.
  • Pitch prediction loss $L_\text{pitch}$ is computed as the mean-squared error (MSE) to ground-truth $F_0$, and duration loss $L_\text{dur}$ uses MSE to ground-truth durations as extracted from Tacotron 2 attention alignments (Łańcucki, 2020).

For each input $i$:

$$\hat{p}_i = w_\text{out}^\top\big(\text{Conv}_2(\text{ReLU}(\text{Conv}_1(h_i)))\big) + b_\text{out}$$

The length regulator duplicates the encoder-plus-pitch representation $g_i$ a total of $\hat{d}_i$ times, producing one copy per predicted frame.
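
A minimal PyTorch sketch of this predictor and of the length regulator is given below; the layer sizes follow the dimensions quoted above, but the class and function names are illustrative rather than taken from the official codebase.

```python
import torch
import torch.nn as nn

class ConvPredictor(nn.Module):
    """Per-symbol scalar predictor (used for both pitch and duration)."""
    def __init__(self, d_model=384, d_hidden=256, kernel_size=3, dropout=0.1):
        super().__init__()
        pad = kernel_size // 2
        self.conv1 = nn.Conv1d(d_model, d_hidden, kernel_size, padding=pad)
        self.conv2 = nn.Conv1d(d_hidden, d_hidden, kernel_size, padding=pad)
        self.norm1 = nn.LayerNorm(d_hidden)
        self.norm2 = nn.LayerNorm(d_hidden)
        self.dropout = nn.Dropout(dropout)
        self.out = nn.Linear(d_hidden, 1)

    def forward(self, h):                                 # h: [B, n, d_model]
        x = self.conv1(h.transpose(1, 2)).transpose(1, 2)
        x = self.dropout(self.norm1(torch.relu(x)))       # ReLU, LayerNorm, Dropout
        x = self.conv2(x.transpose(1, 2)).transpose(1, 2)
        x = self.dropout(self.norm2(torch.relu(x)))
        return self.out(x).squeeze(-1)                    # [B, n]: one scalar per symbol

def length_regulate(g, durations):
    """Repeat each symbol representation g_i for d_i frames (batch size 1)."""
    return torch.repeat_interleave(g, durations, dim=1)   # [1, T, d_model]
```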

4. Mel-Spectrogram Decoder and Non-Autoregressive Synthesis

The decoder consists of a 6-layer feed-forward Transformer stack:

  • Each block contains multi-head self-attention (2 heads of size 192) with Dropout(0.1) and a position-wise FFN (1D conv $384 \rightarrow 1536$, ReLU, 1D conv $1536 \rightarrow 384$, Dropout(0.1), LayerNorm).
  • Output is projected linearly to Mel bins (e.g., $M = 80$).

All operations are non-autoregressive: given predicted durations and pitch, the entire frame sequence can be synthesized simultaneously. On NVIDIA A100 GPUs, FastPitch achieves a real-time factor above $900\times$ for Mel-spectrogram generation; end-to-end speech generation (with WaveGlow) attains $\approx 63\times$ real time (Łańcucki, 2020).
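
A single decoder block of this kind might be sketched as follows in PyTorch; the head count, dimensions, and dropout follow the values above, while the FFN kernel size and other minor details are assumptions of this illustration.

```python
import torch.nn as nn

class FFTBlock(nn.Module):
    """One feed-forward Transformer block: self-attention + position-wise conv FFN."""
    def __init__(self, d_model=384, n_heads=2, d_ffn=1536,
                 kernel_size=3, dropout=0.1):
        super().__init__()
        self.attn = nn.MultiheadAttention(d_model, n_heads,
                                          dropout=dropout, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.ffn = nn.Sequential(
            nn.Conv1d(d_model, d_ffn, kernel_size, padding=kernel_size // 2),
            nn.ReLU(),
            nn.Conv1d(d_ffn, d_model, kernel_size, padding=kernel_size // 2),
        )
        self.dropout = nn.Dropout(dropout)
        self.norm2 = nn.LayerNorm(d_model)

    def forward(self, x, attn_mask=None):        # x: [B, T, d_model]
        a, _ = self.attn(x, x, x, attn_mask=attn_mask)
        x = self.norm1(x + self.dropout(a))
        f = self.ffn(x.transpose(1, 2)).transpose(1, 2)
        return self.norm2(x + self.dropout(f))
```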

5. Incremental FastPitch: Chunk-Based Streaming Synthesis

Incremental FastPitch modifies the decoder for truly incremental, low-latency streaming:

  • Chunk-based FFT blocks: Each decoder block processes short, non-overlapping time chunks of length $S_c$ (e.g., 30 frames) using states cached from the $S_p$ most recent frames of the previous chunk (e.g., $S_p = 5$).
  • Per-chunk multi-head attention: Each chunk concatenates current-chunk queries with cached past keys/values. Attention masks constrain queries to attend only to up to $S_p$ past frames and current/past chunk positions, enforced by a causal-with-lookback mask $M$.
  • Causal convolutional FFNs: Within each chunk block, the FFN consists of two 1D causal convolutions (kernel sizes $k_1, k_2$), again with per-chunk state caching.
  • Caching mechanism: Past key, value, and convolution states are truncated per layer to the last $S_p$ frames (or $k_i - 1$ samples for the convolutions), eliminating the need to recompute earlier output (Du et al., 3 Jan 2024).

The per-chunk mask for self-attention $M$ is:

$$M_{i,j} = 1 \iff j \leq i \ \text{ and } \ i - j \leq S_p$$

This structure ensures temporal continuity without overlap, preserving computational efficiency: per-chunk complexity is $O(S_c^2 + S_c S_p)$, independent of utterance length.
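
The per-chunk attention with key/value caching can be sketched as below. This is a simplified, single-head illustration in which the limit of $S_p$ past frames is enforced by tailing the cache rather than by the mask itself, and the convolutional state cache is omitted.

```python
import torch

def make_chunk_mask(S_c, n_cached):
    """Boolean mask [S_c, n_cached + S_c]; True marks disallowed positions.

    Each query in the current chunk may attend to all cached past frames
    and, causally, to positions up to itself within the current chunk.
    """
    q_pos = torch.arange(S_c).unsqueeze(1)                 # [S_c, 1]
    k_pos = torch.arange(n_cached + S_c).unsqueeze(0)      # [1, n_cached + S_c]
    return k_pos > q_pos + n_cached                        # future frames are masked

def chunk_attention_step(q, k_new, v_new, cache, S_p):
    """Single-head attention over one chunk with a rolling key/value cache."""
    k = k_new if cache is None else torch.cat([cache["k"], k_new], dim=1)
    v = v_new if cache is None else torch.cat([cache["v"], v_new], dim=1)
    n_cached = k.size(1) - q.size(1)

    scores = q @ k.transpose(1, 2) / q.size(-1) ** 0.5     # [B, S_c, n_cached + S_c]
    mask = make_chunk_mask(q.size(1), n_cached).to(scores.device)
    out = torch.softmax(scores.masked_fill(mask, float("-inf")), dim=-1) @ v

    # Tail the cache to the last S_p frames for the next chunk.
    new_cache = {"k": k[:, -S_p:], "v": v[:, -S_p:]}
    return out, new_cache                                  # out: [B, S_c, d]
```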

6. Training Objectives and Receptive Field Masking

All losses mirror those of parallel FastPitch, with the critical difference that the decoder attends only to chunk-limited context:

  • Losses: $L = \lambda_\text{dur} L_\text{dur} + \lambda_\text{pitch} L_\text{pitch} + \lambda_\text{energy} L_\text{energy} + \lambda_\text{mel} L_\text{mel}$
    • $L_\text{mel}$ may additionally include $L_\text{SSIM}(\hat{Y}, Y)$ with weight $\alpha$ (Du et al., 3 Jan 2024).
  • Masking during training: The decoder's mask mimics inference-time chunk constraints. Two modes are used:
    • Static: Fixed $(S_c, S_p)$ for all examples.
    • Dynamic: Random $(S_c, S_p)$ per batch, with $S_c \in [1, 50]$ and $S_p \in \{0, 0.25, 0.5, 1, 2, 3\} \cdot S_c$ or "all". This forces robustness to varying chunk/past horizons (Du et al., 3 Jan 2024).

Losses are computed chunk-wise but summed over the full utterance.
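
The dynamic masking mode can be illustrated with a small sampling helper; the candidate ratios follow the values listed above, while the helper name and the use of Python's `random` module are assumptions of this sketch.

```python
import random

def sample_chunk_config(max_chunk=50):
    """Sample a (S_c, S_p) pair for dynamic chunk masking, once per batch."""
    S_c = random.randint(1, max_chunk)                 # chunk size in frames
    ratio = random.choice([0, 0.25, 0.5, 1, 2, 3, "all"])
    if ratio == "all":
        S_p = None                                     # unrestricted past context
    else:
        S_p = int(ratio * S_c)                         # past horizon as a multiple of S_c
    return S_c, S_p
```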

7. Architectural Hyperparameters and Practical Considerations

Key architectural parameters for standard and incremental FastPitch are detailed below:

| Parameter | FastPitch (Łańcucki, 2020) | Incremental FastPitch (Du et al., 3 Jan 2024) |
| --- | --- | --- |
| Decoder Layers | 6 | 6 |
| Model Dim. ($d_\text{model}$) | 384 | 256 |
| FFN Dim. | 1536 | 1024 |
| Attention Heads | 2 ($d_k = 192$) | 2 ($d_k = d_v = 128$) |
| Chunk Size ($S_c$) | N/A | 30 frames ($\approx$ 120 ms) |
| Past Cache ($S_p$) | N/A | 5 frames ($\approx$ 20 ms) |
| Causal FFN Kernels | N/A | $k_1 = 3$, $k_2 = 3$ |

A crucial observation is that the only structural change enabling streaming is the replacement of full-sequence Transformer blocks with streaming, cache-based chunk FFT blocks. The encoder, pitch/duration predictors, and all loss formulations remain as in the original FastPitch (Du et al., 3 Jan 2024).
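
For reference, the table above can be condensed into an illustrative configuration object; the field names are invented for this sketch and do not correspond to any particular codebase.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DecoderConfig:
    """Decoder hyperparameters as summarized in the table above."""
    n_layers: int = 6
    d_model: int = 384
    d_ffn: int = 1536
    n_heads: int = 2
    chunk_size: Optional[int] = None   # S_c in frames; None means full-sequence
    past_cache: Optional[int] = None   # S_p in frames

FASTPITCH = DecoderConfig()
INCREMENTAL_FASTPITCH = DecoderConfig(d_model=256, d_ffn=1024,
                                      chunk_size=30, past_cache=5)
```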

Evaluations show comparable synthesis quality (MOS $\approx$ 4.18 vs. 4.19) with a more than $4\times$ reduction in first-chunk latency, establishing Incremental FastPitch as a robust solution for high-quality real-time TTS.


References:

  • Łańcucki (2020). FastPitch: Parallel Text-to-speech with Pitch Prediction.
  • Du et al. (2024). Incremental FastPitch: Chunk-based High Quality Text to Speech.

Get notified by email when new papers are published related to FastPitch Architecture.