
Incremental FastPitch: Real-Time TTS

Updated 12 December 2025
  • Incremental FastPitch is a text-to-speech model variant that uses chunk-based transformer decoders for incremental, low-latency speech synthesis.
  • It employs advanced attention masking and state-caching to maintain high speech fidelity while dramatically reducing end-to-end latency.
  • The architecture scales linearly with utterance length, making it ideal for real-time streaming and interactive TTS applications.

Incremental FastPitch is a variant of the FastPitch text-to-speech (TTS) architecture that enables low-latency, high-quality, incremental speech synthesis. It re-architects the decoder to operate on fixed-size Mel-spectrogram chunks using chunk-based Feed-Forward Transformer (FFT) blocks, receptive-field-constrained attention masks, and a state-caching inference regime. The method delivers real-time responsiveness suitable for streaming and interactive speech applications, cutting end-to-end latency dramatically compared to fully parallel transformer decoders while preserving output speech fidelity (Du et al., 2024).

1. Architectural Modifications in Incremental FastPitch

The canonical FastPitch system synthesizes speech in a fully parallel manner: transformer FFT blocks process the upsampled encoder features $\bar u \in \mathbb{R}^{T \times d}$ for the entire utterance in a single pass, yielding the full Mel-spectrogram $Y \in \mathbb{R}^{T \times 80}$. Incremental FastPitch preserves the encoder and the duration, pitch, and energy predictors from FastPitch, but restructures the decoder to operate on non-overlapping chunks $\bar u = [\bar u_1; \dots; \bar u_N]$, where each $\bar u_i \in \mathbb{R}^{S_c \times d}$ and $S_c$ is the chunk size.

The decoder comprises a stack of chunk-based FFT blocks, each with:

  • Multi-Head Attention (MHA) sub-layer: at chunk $t$, MHA computes query-key-value attention in which keys and values are the concatenation of a fixed-size cache of previous keys/values ($pk_i^{t-1}$, $pv_i^{t-1}$, size $S_p$) with the current chunk ($\bar u_i$ projected by $W^K_i$, $W^V_i$). This design keeps the per-chunk inference cost independent of the chunk index $i$ and the global sequence length $T$, and supports Mel-spectrogram continuity across chunk boundaries.
  • Two-Layer Causal Convolutional Feed-Forward Network (FFN): each FFN layer caches its last $K_j - 1$ input states ($pc^t_j$ for $j = 1, 2$) to preserve causality across the chunk sequence.

The chunk-based FFT computational diagram can be summarized as:

┌─────────────────────────────────────────────────────────┐
│ Input chunk: \bar u_i ∈ R^{S_c × d}                     │
├─► [MHA w/ past cache] ──► Add & Norm ──►                │
│     • reads pk_i^{t−1}, pv_i^{t−1}                      │
│     • produces o^t_M; updates pk_i^t, pv_i^t            │
├─► [Causal-conv FFN w/ past state] ──► Add & Norm ──►    │
│     • reads pc^{t−1}_1, pc^{t−1}_2                      │
│     • produces o^t_{c2}; updates pc^t_1, pc^t_2         │
└─────────────────────────────────────────────────────────┘

Mathematically, for attention head $i$ at chunk $t$:

  • $k^t_i = \mathrm{concat}(pk^{t-1}_i,\ \bar u_i W^K_i)$
  • $v^t_i = \mathrm{concat}(pv^{t-1}_i,\ \bar u_i W^V_i)$
  • $o^t_i = \mathrm{Attention}(Q = \bar u_i W^Q_i,\ K = k^t_i,\ V = v^t_i)$
  • $o^t_M = \mathrm{concat}_i(o^t_i)\, W^O$
  • $pk^t_i = \mathrm{tail\_slice}(k^t_i,\ S_p)$; $pv^t_i = \mathrm{tail\_slice}(v^t_i,\ S_p)$
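The cached attention step above can be sketched in NumPy. This is a minimal single-head version: the random projection matrices stand in for trained weights, the names `mha_chunk_step` and `tail_slice` follow the text, and the receptive-field mask of §2 is omitted for brevity, so each chunk position here attends to the whole cache plus the whole current chunk.

```python
import numpy as np

rng = np.random.default_rng(0)
d, S_c, S_p = 16, 4, 8                     # model dim, chunk size, cache size

# random stand-ins for the trained projections W^Q, W^K, W^V
W_Q, W_K, W_V = (rng.standard_normal((d, d)) * d**-0.5 for _ in range(3))

def tail_slice(x, n):
    """Keep at most the last n rows (the fixed-size past cache)."""
    return x[-n:]

def mha_chunk_step(u_chunk, pk, pv):
    """One cached attention step for a chunk u_chunk of shape (S_c, d)."""
    q = u_chunk @ W_Q
    k = np.concatenate([pk, u_chunk @ W_K], axis=0)   # (<= S_p + S_c, d)
    v = np.concatenate([pv, u_chunk @ W_V], axis=0)
    att = q @ k.T / np.sqrt(d)                        # scaled dot-product
    att = np.exp(att - att.max(axis=-1, keepdims=True))
    att /= att.sum(axis=-1, keepdims=True)
    out = att @ v                                     # (S_c, d)
    return out, tail_slice(k, S_p), tail_slice(v, S_p)

pk = pv = np.zeros((0, d))                            # empty cache at t = 0
for t in range(3):                                    # three chunks
    u = rng.standard_normal((S_c, d))
    out, pk, pv = mha_chunk_step(u, pk, pv)

print(out.shape, pk.shape, pv.shape)   # (4, 16) (8, 16) (8, 16)
```

Note that the key/value cache saturates at $S_p$ rows, which is what makes the per-chunk cost independent of the chunk index.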

The causal-conv FFN processes similarly, caching and updating past conv states.
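The per-layer convolution cache works the same way. This toy single-channel sketch (the names `causal_conv_chunk` and `pc` are illustrative) shows that streaming chunk-by-chunk with a cached history of the last $K-1$ frames reproduces the offline causal convolution exactly:

```python
import numpy as np

def causal_conv_chunk(x, w, pc):
    """x: (S_c,) chunk, w: (K,) kernel, pc: (K-1,) cached past frames."""
    K = len(w)
    padded = np.concatenate([pc, x])               # prepend cached history
    y = np.array([padded[i:i + K] @ w for i in range(len(x))])
    return y, padded[-(K - 1):]                    # output + updated cache

w = np.array([0.25, 0.5, 0.25])                    # K = 3 smoothing kernel
pc = np.zeros(2)                                   # zero history at t = 0

y1, pc = causal_conv_chunk(np.ones(4), w, pc)      # first chunk: edge effect
y2, pc = causal_conv_chunk(np.ones(4), w, pc)      # second chunk: steady state
print(y1, y2)   # [0.25 0.75 1. 1.] [1. 1. 1. 1.]
```

The second chunk sees the cached tail of the first, so there is no boundary discontinuity between emitted chunks.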

2. Training with Receptive-Field-Constrained Chunk Attention Masks

To ensure the model trains under the same constraints it faces at inference, a receptive-field-constrained causal attention mask $M \in \{0, -\infty\}^{(S_p+S_c) \times (S_p+S_c)}$ is applied in each decoder layer. The local receptive window per chunk is $L = S_p + S_c$; positions $j = 1 \ldots S_p$ refer to the past cache and positions $j = S_p + 1 \ldots S_p + S_c$ to the current chunk. The mask enforces:

M_{q,k} = \begin{cases} 0, & \text{if } \max(1,\ q - L + 1) \le k \le q \\ -\infty, & \text{otherwise} \end{cases}

Two regimes are explored:

  • Static Mask: $(S_c, S_p)$ fixed throughout training.
  • Dynamic Mask: $(S_c, S_p)$ randomly sampled per batch (e.g., $S_p \in \{0,\ 0.25 S_c,\ 0.5 S_c,\ S_c,\ 2 S_c,\ 3 S_c,\ \infty\}$), enabling generalization to varying chunk configurations at inference.
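A sketch of the mask construction, using 0-indexed positions over the padded window (the function name is illustrative):

```python
import numpy as np

def chunk_attention_mask(S_c, S_p):
    """Build the (L, L) mask: 0 where key k is visible to query q, -inf else.
    With L = S_p + S_c and the window [max(0, q-L+1), q], this reduces to a
    lower-triangular (causal) mask over the padded window; only the last S_c
    rows (the current-chunk queries) are actually consumed by attention."""
    L = S_p + S_c
    M = np.full((L, L), -np.inf)
    for q in range(L):
        lo = max(0, q - L + 1)
        M[q, lo:q + 1] = 0.0
    return M

M = chunk_attention_mask(S_c=3, S_p=2)
print(M.shape)   # (5, 5)
```

The dynamic-mask regime simply resamples `S_c` and `S_p` per batch before rebuilding this matrix.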

3. State-Caching Inference Algorithm

The inference algorithm maintains, for each decoder layer $\ell$, fixed-size past-key/value and past-convolution caches that are updated after every processed chunk. For an input of $N = \lceil T/S_c \rceil$ chunks:

  1. For each chunk $t$, extract $\bar u_t$.
  2. For every decoder layer:
    • Concatenate the past caches with the current chunk for MHA.
    • Update the caches with $\mathrm{tail\_slice}$ operations after attention and after each convolution.
    • Normalize and propagate to the next block.
  3. Project the final output to the Mel chunk $y_t$ and emit it for audio synthesis.

Per-chunk runtime is $O(N_d S_c d^2)$ for fixed $S_p$, and total cost scales linearly with the number of chunks $\lceil T/S_c \rceil$; the inference cost per chunk is independent of the chunk index.
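The loop above can be sketched as follows; the `decoder_layer(chunk, cache) -> (chunk, cache)` interface and the `toy_layer` stand-in are hypothetical placeholders for the real cached FFT blocks:

```python
import numpy as np

def stream_decode(u_bar, S_c, layers):
    """Split upsampled features u_bar (T, d) into N = ceil(T/S_c) chunks
    and run them through `layers`, threading one cache per layer."""
    T = u_bar.shape[0]
    caches = [None] * len(layers)          # per-layer past-KV/conv state
    mel_chunks = []
    for start in range(0, T, S_c):
        x = u_bar[start:start + S_c]       # chunk t (last one may be short)
        for i, layer in enumerate(layers):
            x, caches[i] = layer(x, caches[i])
        mel_chunks.append(x)               # emit y_t for vocoding
    return np.concatenate(mel_chunks, axis=0)

# toy layer: identity transform with a chunk counter as its "cache",
# just to exercise the cache-threading control flow
def toy_layer(x, cache):
    return x, (0 if cache is None else cache) + 1

out = stream_decode(np.ones((10, 4)), S_c=4, layers=[toy_layer, toy_layer])
print(out.shape)   # (10, 4)
```

In a real deployment each emitted chunk would be handed to a streaming vocoder immediately rather than concatenated at the end.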

4. Training Objectives and Optimization

The multi-loss objective is retained from FastPitch:

\mathcal{L} = \mathcal{L}_{\mathrm{mel}} + \lambda_{\mathrm{dur}} \mathcal{L}_{\mathrm{dur}} + \lambda_{\mathrm{pitch}} \mathcal{L}_{\mathrm{pitch}} + \lambda_{\mathrm{energy}} \mathcal{L}_{\mathrm{energy}}

with:

  • $\mathcal{L}_{\mathrm{mel}} = \|Y_{\mathrm{pred}} - Y_{\mathrm{target}}\|_1$,
  • $\mathcal{L}_{\mathrm{dur}} = \mathrm{MSE}(\log d_{\mathrm{pred}},\ \log d_{\mathrm{target}})$,
  • $\mathcal{L}_{\mathrm{pitch}} = \mathrm{MSE}(p_{\mathrm{pred}},\ p_{\mathrm{target}})$,
  • $\mathcal{L}_{\mathrm{energy}} = \mathrm{MSE}(e_{\mathrm{pred}},\ e_{\mathrm{target}})$.

No extra regularization or loss terms are used; the chunked receptive field is enforced purely by the attention mask.
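A hedged sketch of this objective; the $\lambda$ weights, the dict layout, and the tensor shapes are illustrative defaults, not values from the paper:

```python
import numpy as np

def fastpitch_loss(pred, tgt, lam_dur=0.1, lam_pitch=0.1, lam_energy=0.1):
    """pred/tgt: dicts with 'mel' (T, 80) and per-token 'dur', 'pitch',
    'energy' arrays. Returns the weighted multi-term scalar loss."""
    mse = lambda a, b: np.mean((a - b) ** 2)
    l_mel = np.mean(np.abs(pred["mel"] - tgt["mel"]))       # L1 on Mel frames
    l_dur = mse(np.log(pred["dur"]), np.log(tgt["dur"]))    # MSE in log-duration
    l_pitch = mse(pred["pitch"], tgt["pitch"])
    l_energy = mse(pred["energy"], tgt["energy"])
    return l_mel + lam_dur * l_dur + lam_pitch * l_pitch + lam_energy * l_energy
```

Because the chunked receptive field is enforced by the attention mask alone, this loss is computed over the full utterance exactly as in parallel FastPitch.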

5. Experimental Results: Speech Quality and Latency

Empirical evaluation demonstrates Incremental FastPitch produces speech of comparable quality to parallel FastPitch at substantially reduced latency.

Table 1: Comparative Metrics

Model                            MOS (± 95% CI)    Latency (ms)   RTF
Parallel FastPitch               4.185 ± 0.043     125.8          0.029
Inc. FastPitch (static mask)     4.178 ± 0.047     30.4           0.045
Inc. FastPitch (dynamic mask)    4.145 ± 0.052     30.4           0.045
Ground truth                     4.545 ± 0.039     —              —

  • Mel-spectrogram distance (MSD): lowest when $(S_c, S_p)$ match between training and inference; dynamic masking generalizes across configurations at the cost of $\approx$8% higher MSD.
  • MOS: Incremental FastPitch matches parallel FastPitch (MOS $\approx$ 4.18) while cutting end-to-end latency by $\approx$4x (from 125.8 ms to 30.4 ms).
  • RTF: at $\approx$0.045 ($\approx$1/22), synthesis remains far faster than real time.

6. Computational Complexity and Latency Profile

Let $N_d$ denote the number of decoder layers, $S_c$ the chunk size, $S_p$ the cache size, $d$ the model dimension, $h$ the number of attention heads, $d_k = d/h$ the per-head dimension, $d_{ff}$ the FFN hidden dimension, and $K_1, K_2$ the convolution kernel sizes.

  • MHA cost per chunk: $O(h (S_p + S_c) S_c d_k) = O(d S_c (S_p + S_c))$ per layer, i.e. $O(N_d d S_c (S_p + S_c))$ across the decoder.
  • Conv-FFN cost per chunk: $O(N_d (K_1 + K_2) S_c d_{ff})$.
  • Total per chunk: $O\left(N_d S_c \left[ d (S_p + S_c) + (K_1 + K_2) d_{ff} \right]\right)$.

These costs are constant with respect to the total utterance length $T$; only the chunk size and cache size matter. Start-up latency is determined by the time to compute the first chunk, i.e. on the order of $O(N_d S_c d^2)$ operations divided by available GPU throughput. With a small $S_c$ (e.g., 30 frames, covering $\approx$0.35 s of audio at 22.05 kHz with hop size 256), sub-30 ms startup latency is achievable.
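As a sanity check on these numbers, the audio span of one chunk follows directly from the hop size and sample rate:

```python
# Chunk timing arithmetic for the configuration above (22.05 kHz, hop 256).
S_c, hop, sr = 30, 256, 22050
frame_ms = hop / sr * 1000          # duration of one Mel frame in ms
chunk_audio_ms = S_c * frame_ms     # audio covered by one 30-frame chunk
print(round(frame_ms, 1), round(chunk_audio_ms, 1))   # 11.6 348.3
```

So each chunk carries roughly a third of a second of audio, while the measured 30.4 ms latency reflects the compute time to produce that first chunk.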

7. Significance and Applications

Incremental FastPitch integrates chunked processing, cache-based attention, and tailored attention masking to deliver high-MOS, low-latency, scalable synthesis. Its design enables real-time, streaming, and interactive TTS workloads with reduced response times and memory requirements, while retaining the benefits of transformer-based parallel TTS models. The architecture provides a framework for incremental speech synthesis whose complexity is governed by chunk parameters rather than utterance length, pointing to new possibilities for low-latency, high-continuity TTS systems (Du et al., 2024).
