
Incremental FastPitch: Real-Time TTS

Updated 12 December 2025
  • Incremental FastPitch is a text-to-speech model variant that uses chunk-based transformer decoders for incremental, low-latency speech synthesis.
  • It employs advanced attention masking and state-caching to maintain high speech fidelity while dramatically reducing end-to-end latency.
  • The architecture scales linearly with utterance length, making it ideal for real-time streaming and interactive TTS applications.

Incremental FastPitch is a variant of the FastPitch text-to-speech (TTS) architecture that enables low-latency, high-quality, incremental speech synthesis by re-architecting the decoder to operate on fixed-size Mel-spectrogram chunks using transformer-style chunk-based Feed-Forward Transformer (FFT) blocks, advanced attention masking for receptive-field control, and a state-caching inference regime. The method delivers real-time response suitable for streaming and interactive speech applications, cutting end-to-end latency dramatically compared to fully parallel transformer decoders while preserving output speech fidelity (Du et al., 3 Jan 2024).

1. Architectural Modifications in Incremental FastPitch

The canonical FastPitch system synthesizes speech in a fully parallel manner using transformer FFT blocks that process the upsampled encoder features $\bar u \in \mathbb{R}^{T \times d}$ for the entire utterance in a single pass, yielding the full Mel-spectrogram $Y \in \mathbb{R}^{T \times 80}$. Incremental FastPitch preserves the encoder, duration, pitch, and energy predictors from FastPitch, but restructures the decoder to operate on non-overlapping chunks $\bar u = [\bar u_1; \dots; \bar u_N]$, where each $\bar u_i \in \mathbb{R}^{S_c \times d}$ and $S_c$ is the chunk size.

Each decoder layer comprises a stack of chunk-based FFT blocks, each with:

  • Multi-Head Attention (MHA) Sub-layer: At chunk $t$, MHA computes query-key-value attention where keys and values are concatenated from a fixed-size cache of previous keys/values ($pk_i^{t-1}$, $pv_i^{t-1}$, each of length $S_p$) and the current chunk ($\bar u_t$ projected by $W^K_i$, $W^V_i$). This design ensures the per-chunk inference cost is independent of the chunk index $t$ and the global sequence length $T$, and supports Mel continuity across chunk boundaries.
  • Two-Layer Causal Convolutional Feed-Forward Network (FFN): Each convolutional layer caches its last $K_j - 1$ states ($pc^t_j$ for $j = 1, 2$, where $K_j$ is the kernel size) to preserve causality over the chunk sequence.

The computation in a chunk-based FFT block can be summarized as:

┌─ Chunk-based FFT block (chunk t) ──────────────────────────
│ Input chunk: \bar u_t ∈ R^{S_c × d}
│
│ [MHA w/ past KV cache] ──► Add & Norm ──►
│   • reads pk_i^{t−1}, pv_i^{t−1}
│   • produces o^t_M; updates pk_i^t, pv_i^t
│
│ [Causal-Conv FFN w/ past state] ──► Add & Norm ──►
│   • reads pc^{t−1}_1, pc^{t−1}_2
│   • produces o^t_{c2}; updates pc^t_1, pc^t_2
└────────────────────────────────────────────────────────────

Mathematically, for attention head $i$ at chunk $t$ (a code sketch of these updates follows the list):

  • $k^t_i = \mathrm{concat}(pk^{t-1}_i,\ \bar u_t W^K_i)$
  • $v^t_i = \mathrm{concat}(pv^{t-1}_i,\ \bar u_t W^V_i)$
  • $o^t_i = \mathrm{Attention}(Q = \bar u_t W^Q_i,\ K = k^t_i,\ V = v^t_i)$
  • $o^t_M = \mathrm{concat}_i(o^t_i)\, W^O$
  • $pk^t_i = \mathrm{tail\_slice}(k^t_i,\ S_p)$; $pv^t_i = \mathrm{tail\_slice}(v^t_i,\ S_p)$
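
A minimal single-head sketch of this cache-and-attend step in PyTorch; the helper names `chunk_mha_step` and `tail_slice`, and the omission of the multi-head output projection $W^O$, are illustrative simplifications rather than the paper's code:

```python
import torch

def tail_slice(x, size):
    # Keep only the last `size` time steps (an empty tensor if size == 0).
    return x[-size:] if size > 0 else x[:0]

def chunk_mha_step(u_t, W_q, W_k, W_v, pk, pv, S_p):
    """Single-head attention over one chunk with a fixed-size past KV cache.

    u_t:          current chunk, shape (S_c, d)
    W_q/W_k/W_v:  projection matrices, shape (d, d_k)
    pk, pv:       cached keys/values from earlier chunks, shape (<= S_p, d_k)
    Returns the chunk output (S_c, d_k) and the updated caches.
    """
    q = u_t @ W_q                                  # (S_c, d_k)
    k = torch.cat([pk, u_t @ W_k], dim=0)          # (S_p + S_c, d_k)
    v = torch.cat([pv, u_t @ W_v], dim=0)
    scores = q @ k.T / (q.shape[-1] ** 0.5)        # (S_c, S_p + S_c)
    out = torch.softmax(scores, dim=-1) @ v        # (S_c, d_k)
    return out, tail_slice(k, S_p), tail_slice(v, S_p)

# Example: the first chunk starts from empty caches.
d, d_k, S_c, S_p = 16, 16, 4, 4
u_t = torch.randn(S_c, d)
W_q, W_k, W_v = (torch.randn(d, d_k) for _ in range(3))
pk = pv = torch.zeros(0, d_k)
out, pk, pv = chunk_mha_step(u_t, W_q, W_k, W_v, pk, pv, S_p)
```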

The causal-conv FFN processes similarly, caching and updating past conv states.
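
A matching sketch for one causal convolutional layer of the FFN, carrying $K - 1$ past frames between chunks (again a hypothetical helper, not the paper's implementation):

```python
import torch
import torch.nn.functional as F

def causal_conv_step(u_t, weight, bias, past):
    """Causal 1-D convolution over one chunk using cached left context.

    u_t:    current chunk, shape (S_c, d_in)
    weight: conv kernel, shape (d_out, d_in, K)
    past:   last K-1 frames seen before this chunk, shape (K-1, d_in)
    Returns the chunk output (S_c, d_out) and the updated cache.
    """
    K = weight.shape[-1]
    x = torch.cat([past, u_t], dim=0)                  # (K-1 + S_c, d_in)
    y = F.conv1d(x.T.unsqueeze(0), weight, bias)       # (1, d_out, S_c)
    new_past = x[-(K - 1):] if K > 1 else x[:0]        # carry left context
    return y.squeeze(0).T, new_past
```

For the first chunk, `past` would simply be a zero tensor of shape `(K-1, d_in)`, mirroring the empty KV caches above.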

2. Training with Receptive-Field-Constrained Chunk Attention Masks

To ensure the model operates within its inference constraints during training, a strictly block-diagonal attention mask $M \in \{0, -\infty\}^{(S_p+S_c) \times (S_p+S_c)}$ is applied in each decoder layer. The local receptive window per chunk is $L = S_p + S_c$; positions $j = 1 \ldots S_p$ refer to the past cache and positions $j = S_p + 1 \ldots S_p + S_c$ are the current chunk. The attention mask enforces:

$$M_{q,k} = \begin{cases} 0, & \text{if } \max(1,\, q-L+1) \le k \le q \\ -\infty, & \text{otherwise} \end{cases}$$

Two regimes are explored:

  • Static Mask: $(S_c, S_p)$ fixed throughout training.
  • Dynamic Mask: $(S_c, S_p)$ randomly sampled per batch (e.g., $S_p \in \{0,\ 0.25\,S_c,\ 0.5\,S_c,\ 1\,S_c,\ 2\,S_c,\ 3\,S_c,\ \infty\}$), enabling generalization to varying chunk configurations; a mask-construction sketch follows this list.
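
A small sketch of how such a receptive-field mask might be built; the helper `chunk_attention_mask` is hypothetical, and indices are 0-based here whereas the formula above is 1-based:

```python
import torch

def chunk_attention_mask(S_c, S_p):
    """Build M in {0, -inf}^{(S_p+S_c) x (S_p+S_c)} per the formula above."""
    L = S_p + S_c
    q = torch.arange(L).unsqueeze(1)            # query positions (column)
    k = torch.arange(L).unsqueeze(0)            # key positions (row)
    allowed = (k >= (q - L + 1).clamp(min=0)) & (k <= q)
    zeros = torch.zeros(L, L)
    neg_inf = torch.full((L, L), float("-inf"))
    return torch.where(allowed, zeros, neg_inf)

# Static regime: one fixed (S_c, S_p). The dynamic regime would re-sample
# S_p per batch, e.g. from {0, S_c // 4, S_c // 2, S_c, 2 * S_c, 3 * S_c}.
mask = chunk_attention_mask(S_c=30, S_p=15)
```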

3. State-Caching Inference Algorithm

The inference algorithm maintains, per decoder layer $\ell$, fixed-size past-key/value and past-convolution caches that are updated as each chunk is processed. For an input of $N = \lceil T/S_c \rceil$ chunks:

  1. For each chunk $t$, extract $\bar u_t$.
  2. For every decoder layer:
    • Concatenate past caches with the current chunk for MHA.
    • Update caches using tail_slice\mathrm{tail\_slice} operations post-attention and post-convolution.
    • Normalize and propagate to the next block.
  3. Project the final output to yield the Mel chunk $y_t$ and emit it for audio synthesis.

Per-chunk runtime is $O(N_d S_c d^2)$, and total cost scales linearly with the number of chunks $\lceil T/S_c \rceil$; the inference cost per chunk is independent of the chunk index.
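
A high-level sketch of the streaming loop, assuming decoder layers expose a hypothetical `step(chunk, cache)` method built from pieces like those sketched above (an illustration of the state-caching loop, not the paper's API):

```python
import torch

def synthesize_incrementally(u_bar, decoder_layers, mel_proj, S_c):
    """Yield Mel chunks from upsampled encoder features u_bar of shape (T, d).

    Each layer.step(chunk, cache) is assumed to run chunk-based MHA plus the
    causal-conv FFN and return (output_chunk, updated_cache).
    """
    caches = [None] * len(decoder_layers)       # per-layer KV / conv state
    T = u_bar.shape[0]
    for start in range(0, T, S_c):
        chunk = u_bar[start:start + S_c]        # (<= S_c, d)
        for i, layer in enumerate(decoder_layers):
            chunk, caches[i] = layer.step(chunk, caches[i])
        yield chunk @ mel_proj                  # (<= S_c, 80) Mel frames
```

Each yielded chunk can be handed to the vocoder immediately, which is what produces the low startup latency discussed in the latency profile below.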

4. Training Objectives and Optimization

The multi-loss objective is retained from FastPitch:

$$\mathcal{L} = \mathcal{L}_{\mathrm{mel}} + \lambda_{\mathrm{dur}} \mathcal{L}_{\mathrm{dur}} + \lambda_{\mathrm{pitch}} \mathcal{L}_{\mathrm{pitch}} + \lambda_{\mathrm{energy}} \mathcal{L}_{\mathrm{energy}}$$

with:

  • $\mathcal{L}_{\mathrm{mel}} = \|Y_{\mathrm{pred}} - Y_{\mathrm{target}}\|_1$,
  • $\mathcal{L}_{\mathrm{dur}} = \mathrm{MSE}(\log d_{\mathrm{pred}}, \log d_{\mathrm{target}})$,
  • $\mathcal{L}_{\mathrm{pitch}} = \mathrm{MSE}(p_{\mathrm{pred}}, p_{\mathrm{target}})$,
  • $\mathcal{L}_{\mathrm{energy}} = \mathrm{MSE}(e_{\mathrm{pred}}, e_{\mathrm{target}})$.

No extra regularization or loss terms are used; the chunked receptive field is enforced purely by the attention mask.
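
In code, the composite objective amounts to a weighted sum of the four terms; a minimal sketch, where the $\lambda$ weights and the log-domain epsilon are placeholders since the article does not specify their values:

```python
import torch
import torch.nn.functional as F

def fastpitch_loss(mel_pred, mel_tgt, dur_pred, dur_tgt,
                   pitch_pred, pitch_tgt, energy_pred, energy_tgt,
                   lam_dur=0.1, lam_pitch=0.1, lam_energy=0.1):
    # L1 on Mel-spectrograms; MSE on the log-duration, pitch, and energy heads.
    l_mel = F.l1_loss(mel_pred, mel_tgt)
    l_dur = F.mse_loss(torch.log(dur_pred + 1e-8), torch.log(dur_tgt + 1e-8))
    l_pitch = F.mse_loss(pitch_pred, pitch_tgt)
    l_energy = F.mse_loss(energy_pred, energy_tgt)
    return l_mel + lam_dur * l_dur + lam_pitch * l_pitch + lam_energy * l_energy
```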

5. Experimental Results: Speech Quality and Latency

Empirical evaluation demonstrates Incremental FastPitch produces speech of comparable quality to parallel FastPitch at substantially reduced latency.

Table 1: Comparative Metrics

| Model | MOS ($\pm$ 95% CI) | Latency (ms) | RTF |
|---|---|---|---|
| Parallel FastPitch | 4.185 $\pm$ 0.043 | 125.8 | 0.029 |
| Inc. FastPitch (Static Mask) | 4.178 $\pm$ 0.047 | 30.4 | 0.045 |
| Inc. FastPitch (Dynamic Mask) | 4.145 $\pm$ 0.052 | 30.4 | 0.045 |
| Ground Truth | 4.545 $\pm$ 0.039 | – | – |

  • Mel-spectrogram distance (MSD): lowest when the $(S_c, S_p)$ configuration matches between training and inference; dynamic masking generalizes across configurations at the cost of an $\approx 8\%$ higher MSD.
  • MOS: Incremental FastPitch matches parallel FastPitch (MOS $\approx 4.18$) while cutting end-to-end latency by $\approx 4\times$ (from 125.8 ms to 30.4 ms).
  • RTF: at $\approx 0.045$ ($\approx 1/22$ of real time), synthesis remains comfortably faster than real-time requirements (a quick arithmetic check follows).
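
The headline ratios follow directly from Table 1; a quick check:

```python
# Sanity check of the ratios quoted above, using the Table 1 figures.
rtf = 0.045
print(1 / rtf)        # ~22.2: synthesis runs about 22x faster than real time
print(125.8 / 30.4)   # ~4.1: end-to-end latency reduction vs. parallel FastPitch
```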

6. Computational Complexity and Latency Profile

Let $N_d$ denote the number of decoder layers, $S_c$ the chunk size, $S_p$ the cache size, $d$ the model dimension, $h$ the number of attention heads, $d_k = d/h$ the per-head dimension, $d_{ff}$ the FFN hidden dimension, and $K_1, K_2$ the convolution kernel sizes.

  • MHA cost per chunk: $O(h\,(S_p+S_c)\,S_c\,d_k)$ per layer, i.e. $O(N_d\, d\, S_c\,(S_p+S_c))$ over all $N_d$ decoder layers
  • Conv-FFN cost per chunk: $O(N_d\,(K_1+K_2)\,S_c\,d_{ff})$
  • Total per chunk: $O\left(N_d\, S_c\,[\,d\,(S_p+S_c) + (K_1+K_2)\,d_{ff}\,]\right)$

These costs are constant with respect to the total utterance length $T$; only the number of chunks and the chunk size matter. Start-up latency is determined by $O(N_d S_c d / \mathrm{GPU\_FLOPS})$. With a small $S_c$ (e.g., $30$ frames, $\approx 1.3$ ms at $22.05$ kHz with hop size $256$), sub-$30$ ms startup latency is achievable.
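
For intuition, the per-chunk expression can be evaluated with concrete numbers; the hyperparameters below ($N_d = 6$, $d = 384$, $d_{ff} = 1536$, kernel sizes 3) are illustrative defaults, not necessarily the paper's configuration:

```python
def per_chunk_macs(N_d=6, S_c=30, S_p=30, d=384, d_ff=1536, K1=3, K2=3):
    # Follows O(N_d * S_c * [d * (S_p + S_c) + (K1 + K2) * d_ff]) from above;
    # note that the total utterance length T does not appear anywhere.
    mha = N_d * d * S_c * (S_p + S_c)
    ffn = N_d * (K1 + K2) * S_c * d_ff
    return mha + ffn

print(f"{per_chunk_macs():,} multiply-accumulates per chunk (order of magnitude)")
```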

7. Significance and Applications

Incremental FastPitch integrates chunked processing, cache-based attention, and tailored attention masking to deliver high-MOS, low-latency, scalable synthesis. Its design enables real-time, streaming, and interactive TTS workloads with reduced response times and memory requirements, while retaining the benefits of transformer-based parallel TTS models. The architecture provides a framework for incremental speech synthesis with complexity governed by chunk parameters rather than utterance length, pointing to new possibilities for low-latency, high-continuity TTS systems (Du et al., 3 Jan 2024).
