Incremental FastPitch: Real-Time TTS
- Incremental FastPitch is a text-to-speech model variant that uses chunk-based transformer decoders for incremental, low-latency speech synthesis.
- It employs advanced attention masking and state-caching to maintain high speech fidelity while dramatically reducing end-to-end latency.
- The architecture scales linearly with utterance length, making it ideal for real-time streaming and interactive TTS applications.
Incremental FastPitch is a variant of the FastPitch text-to-speech (TTS) architecture that enables low-latency, high-quality, incremental speech synthesis by re-architecting the decoder to operate on fixed-size Mel-spectrogram chunks using chunk-based Feed-Forward Transformer (FFT) blocks, receptive-field-constrained attention masking, and a state-caching inference regime. The method delivers real-time response suitable for streaming and interactive speech applications, cutting end-to-end latency dramatically compared to fully parallel transformer decoders while preserving output speech fidelity (Du et al., 2024).
1. Architectural Modifications in Incremental FastPitch
The canonical FastPitch system synthesizes speech in a fully parallel manner using transformer FFT blocks that process the upsampled encoder features for the entire utterance in a single pass, yielding the full Mel-spectrogram $\hat y$. Incremental FastPitch preserves the encoder, duration, pitch, and energy predictors from FastPitch, but restructures the decoder to operate on non-overlapping chunks $\bar u_1, \dots, \bar u_N$, where each $\bar u_t \in \mathbb{R}^{S_c \times d}$ and $S_c$ is the chunk size.
The decoder comprises a stack of chunk-based FFT blocks, each with:
- Multi-Head Attention (MHA) Sub-layer: At chunk $t$, MHA computes query-key-value attention where keys and values are concatenated from a fixed-size cache of previous keys/values ($pk^{t-1}_i$, $pv^{t-1}_i$, of size $S_p$) and the current chunk ($\bar u_t$ projected by $W^K_i$, $W^V_i$). This design ensures the per-chunk inference cost is independent of chunk index and global sequence length, and it preserves continuity of the Mel-spectrogram across chunk boundaries.
- Two-Layer Causal Convolutional Feed-Forward Network (FFN): Each conv layer caches its most recent input frames ($pc^{t-1}_1$ for the first layer, $pc^{t-1}_2$ for the second) so the convolutions remain causal across the chunk sequence.
The chunk-based FFT computational diagram can be summarized as:
```
┌────────────────────────────────────────────────────────┐
│ Input chunk: \bar u_t ∈ R^{S_c×d}                      │
├─► [MHA w/ past-cache] ──► Add & Norm ──►               │
│     • consume pk_i^{t−1}, pv_i^{t−1}                   │
│     • produce o^t_M and update pk_i^t, pv_i^t          │
├─► [Causal-Conv FFN w/ past-state] ──► Add & Norm ──►   │
│     • consume pc^{t−1}_1, pc^{t−1}_2                   │
│     • produce o^t_{c2} and update pc^t_1, pc^t_2       │
└────────────────────────────────────────────────────────┘
```
Mathematically, for MHA head $i$ at chunk $t$:

$$k^t_i = \mathrm{concat}\!\left(pk^{t-1}_i,\; \bar u_t W^K_i\right), \qquad v^t_i = \mathrm{concat}\!\left(pv^{t-1}_i,\; \bar u_t W^V_i\right)$$

$$\mathrm{head}^t_i = \mathrm{Attention}\!\left(\bar u_t W^Q_i,\; k^t_i,\; v^t_i\right), \qquad pk^t_i = k^t_i[-S_p\!:], \quad pv^t_i = v^t_i[-S_p\!:]$$

$$o^t_M = \mathrm{concat}\!\left(\mathrm{head}^t_1, \dots, \mathrm{head}^t_h\right) W^O$$
The causal-conv FFN processes $o^t_M$ analogously: each conv layer prepends its cached past states to its input, applies the convolution, and updates the cache with the most recent input frames, yielding the block output $o^t_{c2}$.
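To make the caching concrete, here is a minimal PyTorch sketch of one chunk-based FFT block. The class name `ChunkFFTBlock`, the default hyperparameters, and the cache layout are illustrative assumptions rather than the paper's reference code; in particular, this sketch caches pre-projection frames and lets `nn.MultiheadAttention` re-project them, which is mathematically equivalent to caching $pk$/$pv$ (the projections are applied frame-wise) but recomputes projections that cached $pk$/$pv$ would avoid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChunkFFTBlock(nn.Module):
    """One chunk-based FFT block: cached MHA + causal-conv FFN (illustrative sketch)."""

    def __init__(self, d_model=384, n_heads=2, d_ffn=1536, kernel=3, cache_size=90):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Two 1-D convs; causality comes from prepending cached frames,
        # not from zero padding (assumes kernel > 1).
        self.conv1 = nn.Conv1d(d_model, d_ffn, kernel)
        self.conv2 = nn.Conv1d(d_ffn, d_model, kernel)
        self.kernel = kernel
        self.cache_size = cache_size  # S_p

    def forward(self, u, cache):
        # u: (B, S_c, d) -- the current chunk \bar u_t.
        kv = torch.cat([cache["kv"], u], dim=1)   # (B, <= S_p + S_c, d)
        attn, _ = self.mha(u, kv, kv)             # queries: current chunk only
        x = self.norm1(u + attn)
        cache["kv"] = kv[:, -self.cache_size:]    # keep the last S_p frames

        # Causal-conv FFN: prepend (kernel - 1) cached frames per conv layer.
        h = torch.cat([cache["pc1"], x.transpose(1, 2)], dim=2)
        cache["pc1"] = h[:, :, -(self.kernel - 1):]
        h = F.relu(self.conv1(h))                 # (B, d_ffn, S_c)
        h = torch.cat([cache["pc2"], h], dim=2)
        cache["pc2"] = h[:, :, -(self.kernel - 1):]
        h = self.conv2(h).transpose(1, 2)         # (B, S_c, d)
        return self.norm2(x + h), cache
```

A per-layer cache dict travels with the stream; the inference loop sketched in Section 3 shows how it is initialized and threaded through the chunks.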
2. Training with Receptive-Field-Constrained Chunk Attention Masks
To ensure the model trains under the same constraints it faces at inference, a receptive-field-constrained chunk attention mask is applied in each decoder layer. The local receptive window per chunk spans $S_p + S_c$ positions: the first $S_p$ positions cover the past cache and the remaining $S_c$ the current chunk. Writing $c_a = \lceil a / S_c \rceil$ for the chunk containing position $a$, the attention mask enforces

$$M_{a,b} = \begin{cases} 0, & (c_a - 1)\,S_c - S_p < b \le c_a\,S_c, \\ -\infty, & \text{otherwise.} \end{cases}$$
Two regimes are explored:
- Static Mask: $(S_c, S_p)$ fixed throughout training.
- Dynamic Mask: $(S_c, S_p)$ randomly sampled per batch from a predefined set of sizes, enabling generalization to varying chunk configurations at inference.
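A minimal sketch of how such a mask could be constructed, assuming the standard additive convention (0 for visible positions, $-\infty$ for masked ones); the function name and signature are mine, not the paper's:

```python
import torch


def chunk_attention_mask(T: int, S_c: int, S_p: int) -> torch.Tensor:
    """Additive attention mask: 0 where visible, -inf where masked (sketch).

    A query at position a may attend to every frame of its own chunk plus
    the S_p frames immediately preceding that chunk, matching the cache
    available at inference time.
    """
    pos = torch.arange(T)
    chunk_start = (pos // S_c) * S_c  # first frame of each query's chunk
    visible = (pos[None, :] >= (chunk_start - S_p)[:, None]) \
            & (pos[None, :] < (chunk_start + S_c)[:, None])
    # (T, T) mask to add to the attention logits before the softmax.
    return torch.zeros(T, T).masked_fill(~visible, float("-inf"))
```

Under the dynamic regime, the same function would simply be called with freshly sampled $(S_c, S_p)$ for every batch.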
3. State-Caching Inference Algorithm
The inference algorithm maintains, per decoder layer, fixed-size past-key/value and past-convolution caches that are updated after every processed chunk. For an input of $N$ chunks:
- For each chunk $t$, extract $\bar u_t$ from the upsampled encoder output.
- For every decoder layer:
  - Concatenate the past caches with the current chunk for MHA.
  - Update the caches by slicing after attention and after each convolution, keeping the last $S_p$ key/value frames and the trailing conv input frames.
  - Normalize and propagate to the next block.
- Project the final block output to a Mel chunk $\hat y_t$ and emit it immediately for audio synthesis.
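The loop below sketches this procedure end-to-end, reusing the hypothetical `ChunkFFTBlock` from Section 1; the cache initialization and the final projection `proj` are likewise assumptions for illustration:

```python
import torch


def synthesize_incrementally(u, blocks, proj, S_c=30):
    """Stream Mel chunks from upsampled encoder features u of shape (B, T, d).

    `blocks` is a list of ChunkFFTBlock instances (one per decoder layer) and
    `proj` a final linear layer to Mel bins; both mirror the earlier sketch
    and are illustrative, not the paper's reference implementation.
    """
    B, T, d = u.shape
    caches = [{"kv": u.new_zeros(B, 0, d),
               "pc1": u.new_zeros(B, d, blk.kernel - 1),
               "pc2": u.new_zeros(B, blk.conv1.out_channels, blk.kernel - 1)}
              for blk in blocks]
    for start in range(0, T, S_c):
        x = u[:, start:start + S_c]          # current chunk \bar u_t
        for blk, cache in zip(blocks, caches):
            x, cache = blk(x, cache)         # per-layer caches persist
        yield proj(x)                        # Mel chunk, ready for the vocoder

# Usage (illustrative):
#   blocks = [ChunkFFTBlock() for _ in range(6)]
#   proj = torch.nn.Linear(384, 80)
#   for mel_chunk in synthesize_incrementally(torch.randn(1, 300, 384), blocks, proj):
#       ...  # pass mel_chunk to a streaming vocoder
```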
Per-chunk runtime is constant, determined by $S_c$, $S_p$, and the model dimensions alone, and the total cost scales linearly with the number of chunks $N = \lceil T / S_c \rceil$; inference cost per chunk is independent of chunk index.
4. Training Objectives and Optimization
The multi-loss objective is retained from FastPitch:

$$\mathcal{L} = \mathcal{L}_{mel} + \alpha\,\mathcal{L}_{pitch} + \beta\,\mathcal{L}_{energy} + \gamma\,\mathcal{L}_{dur}$$

with:
- $\mathcal{L}_{mel} = \lVert \hat y - y \rVert_2^2$ (Mel-spectrogram MSE),
- $\mathcal{L}_{pitch} = \lVert \hat p - p \rVert_2^2$ (pitch MSE),
- $\mathcal{L}_{energy} = \lVert \hat e - e \rVert_2^2$ (energy MSE),
- $\mathcal{L}_{dur} = \lVert \hat d - d \rVert_2^2$ (duration MSE in the log domain).
No extra regularization or loss terms are used; the chunked receptive field is enforced purely by the attention mask.
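As a sketch, the composite objective can be computed directly from the four MSE terms; the function name and the weight values below are illustrative defaults, not the paper's settings:

```python
import torch.nn.functional as F


def incremental_fastpitch_loss(mel_hat, mel, p_hat, p, e_hat, e, d_hat, d,
                               alpha=0.1, beta=0.1, gamma=0.1):
    """Composite FastPitch-style objective; weights alpha/beta/gamma are illustrative."""
    return (F.mse_loss(mel_hat, mel)          # Mel-spectrogram term
            + alpha * F.mse_loss(p_hat, p)    # pitch term
            + beta * F.mse_loss(e_hat, e)     # energy term
            + gamma * F.mse_loss(d_hat, d))   # log-duration term
```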
5. Experimental Results: Speech Quality and Latency
Empirical evaluation demonstrates Incremental FastPitch produces speech of comparable quality to parallel FastPitch at substantially reduced latency.
Table 1: Comparative Metrics
| Model | MOS (±95% CI) | Latency (ms) | RTF |
|---|---|---|---|
| Parallel FastPitch | 4.185 ± 0.043 | 125.8 | 0.029 |
| Inc. FastPitch (Static Mask) | 4.178 ± 0.047 | 30.4 | 0.045 |
| Inc. FastPitch (Dynamic Mask) | 4.145 ± 0.052 | 30.4 | 0.045 |
| Ground Truth | 4.545 ± 0.039 | — | — |
- Mel-Spectrogram Distance (MSD): Lowest when the $(S_c, S_p)$ configuration matches between training and inference; dynamic masking generalizes across configurations, at the cost of roughly 8% higher MSD.
- MOS: Incremental FastPitch matches FastPitch (MOS 4.18) while cutting end-to-end latency by 4x (from 125.8 ms to 30.4 ms).
- RTF: At 0.045 ($\approx 1/22$), synthesis runs about 22x faster than real time, comfortably within real-time requirements.
6. Computational Complexity and Latency Profile
Let $L$ = number of decoder layers, $S_c$ = chunk size, $S_p$ = cache size, $d$ = model dimension, $h$ = attention heads, $d_f$ = FFN hidden dimension, and $k_1, k_2$ = conv kernel sizes.
- MHA Cost/Chunk: $O\!\left(S_c d^2 + S_c (S_p + S_c)\, d\right)$ (Q/K/V/output projections plus attention over the cache and the current chunk)
- Conv-FFN Cost/Chunk: $O\!\left(S_c\, d\, d_f (k_1 + k_2)\right)$
- Total Per-Chunk: $O\!\left(L \left[ S_c d^2 + S_c (S_p + S_c)\, d + S_c\, d\, d_f (k_1 + k_2) \right]\right)$
These costs are constant with respect to total utterance length $T$; only the number of chunks and the chunk size matter. Start-up latency is determined by the computation of the first chunk, i.e., by $S_c$: with a small chunk (e.g., $S_c = 30$ Mel frames at $22.05$ kHz), sub-$30$ ms startup latency is achievable, as the numeric sanity check below illustrates.
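A back-of-the-envelope script (hyperparameter values are assumed, roughly FastPitch-scale, and the function is mine) confirms that the per-chunk multiply-accumulate count depends only on the chunk parameters, never on $T$:

```python
def per_chunk_macs(S_c=30, S_p=90, d=384, d_f=1536, k1=3, k2=3, L=6):
    """Rough multiply-accumulate count for one decoded chunk (illustrative).

    MHA: Q/K/V/output projections over S_c frames plus attention scores and
    context over S_p + S_c keys; FFN: two 1-D convs producing S_c frames.
    """
    mha = 4 * S_c * d * d + 2 * S_c * (S_p + S_c) * d
    ffn = S_c * k1 * d * d_f + S_c * k2 * d_f * d
    return L * (mha + ffn)


print(per_chunk_macs())          # identical for every chunk of an utterance
print(per_chunk_macs(S_c=60))    # grows with chunk size, never with T
```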
7. Significance and Applications
Incremental FastPitch integrates chunked processing, cache-based attention, and tailored attention masking to deliver high-MOS, low-latency, scalable synthesis. Its design enables real-time, streaming, and interactive TTS workloads with reduced response times and memory requirements, while retaining the benefits of transformer-based parallel TTS models. The architecture provides a framework for incremental speech synthesis with complexity governed by chunk parameters rather than utterance length, pointing to new possibilities for low-latency, high-continuity TTS systems (Du et al., 2024).