Incremental FastPitch: Real-Time TTS
- Incremental FastPitch is a text-to-speech model variant that uses chunk-based transformer decoders for incremental, low-latency speech synthesis.
- It employs advanced attention masking and state-caching to maintain high speech fidelity while dramatically reducing end-to-end latency.
- The architecture scales linearly with utterance length, making it ideal for real-time streaming and interactive TTS applications.
Incremental FastPitch is a variant of the FastPitch text-to-speech (TTS) architecture that enables low-latency, high-quality, incremental speech synthesis by re-architecting the decoder to operate on fixed-size Mel-spectrogram chunks using chunk-based Feed-Forward Transformer (FFT) blocks, receptive-field-constrained attention masking, and a state-caching inference regime. The method delivers real-time response suitable for streaming and interactive speech applications, cutting end-to-end latency dramatically compared to fully parallel transformer decoders while preserving output speech fidelity (Du et al., 2024).
1. Architectural Modifications in Incremental FastPitch
The canonical FastPitch system synthesizes speech in a fully parallel manner using transformer FFT blocks that process the upsampled encoder features for the entire utterance in a single pass, yielding the full Mel-spectrogram $\hat y$. Incremental FastPitch preserves the encoder, duration, pitch, and energy predictors from FastPitch, but restructures the decoder to operate on non-overlapping chunks $\bar u_1, \dots, \bar u_N$, where each $\bar u_t \in \mathbb{R}^{S_c \times d}$ and $S_c$ is the chunk size.
The decoder comprises a stack of chunk-based FFT blocks, each with:
- Multi-Head Attention (MHA) Sub-layer: At chunk $t$, MHA computes query-key-value attention where keys and values are concatenated from a fixed-size cache of previous keys/values ($pk^{t-1}_i$, $pv^{t-1}_i$, of size $S_p$) and the current chunk ($\bar u_t$ projected by $W^K_i$, $W^V_i$). This design ensures the per-chunk inference cost is independent of chunk index and global sequence length, and it preserves continuity of the Mel-spectrogram across chunk boundaries.
- Two-Layer Causal Convolutional Feed-Forward Network (FFN): Each conv layer caches its most recent input frames ($pc^{t-1}_1$ for the first layer, $pc^{t-1}_2$ for the second) so the convolutions remain causal across the chunk sequence.
The chunk-based FFT computational diagram can be summarized as:
```
┌────────────────────────────────────────────────────────┐
│ Input chunk: \bar u_t ∈ R^{S_c×d}                      │
├─► [MHA w/ past-cache] ──► Add & Norm ──►               │
│     • consume pk_i^{t−1}, pv_i^{t−1}                   │
│     • produce o^t_M and update pk_i^t, pv_i^t          │
├─► [Causal-Conv FFN w/ past-state] ──► Add & Norm ──►   │
│     • consume pc^{t−1}_1, pc^{t−1}_2                   │
│     • produce o^t_{c2} and update pc^t_1, pc^t_2       │
└────────────────────────────────────────────────────────┘
```
Mathematically, for MHA head $i$ at chunk $t$:

$$k^t_i = \mathrm{concat}\!\left(pk^{t-1}_i,\; \bar u_t W^K_i\right), \qquad v^t_i = \mathrm{concat}\!\left(pv^{t-1}_i,\; \bar u_t W^V_i\right)$$

$$\mathrm{head}^t_i = \mathrm{Attention}\!\left(\bar u_t W^Q_i,\; k^t_i,\; v^t_i\right), \qquad pk^t_i = k^t_i[-S_p\!:], \quad pv^t_i = v^t_i[-S_p\!:]$$

$$o^t_M = \mathrm{concat}\!\left(\mathrm{head}^t_1, \dots, \mathrm{head}^t_h\right) W^O$$
The causal-conv FFN processes $o^t_M$ analogously: each conv layer prepends its cached past states to its input, applies the convolution, and updates the cache with the most recent input frames, yielding the block output $o^t_{c2}$.
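To make the caching concrete, here is a minimal PyTorch sketch of one chunk-based FFT block. The class name `ChunkFFTBlock`, the default hyperparameters, and the cache layout are illustrative assumptions rather than the paper's reference code; in particular, this sketch caches pre-projection frames and lets `nn.MultiheadAttention` re-project them, which is mathematically equivalent to caching $pk$/$pv$ (the projections are applied frame-wise) but recomputes projections that cached $pk$/$pv$ would avoid.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class ChunkFFTBlock(nn.Module):
    """One chunk-based FFT block: cached MHA + causal-conv FFN (illustrative sketch)."""

    def __init__(self, d_model=384, n_heads=2, d_ffn=1536, kernel=3, cache_size=90):
        super().__init__()
        self.mha = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        # Two 1-D convs; causality comes from prepending cached frames,
        # not from zero padding (assumes kernel > 1).
        self.conv1 = nn.Conv1d(d_model, d_ffn, kernel)
        self.conv2 = nn.Conv1d(d_ffn, d_model, kernel)
        self.kernel = kernel
        self.cache_size = cache_size  # S_p

    def forward(self, u, cache):
        # u: (B, S_c, d) -- the current chunk \bar u_t.
        kv = torch.cat([cache["kv"], u], dim=1)   # (B, <= S_p + S_c, d)
        attn, _ = self.mha(u, kv, kv)             # queries: current chunk only
        x = self.norm1(u + attn)
        cache["kv"] = kv[:, -self.cache_size:]    # keep the last S_p frames

        # Causal-conv FFN: prepend (kernel - 1) cached frames per conv layer.
        h = torch.cat([cache["pc1"], x.transpose(1, 2)], dim=2)
        cache["pc1"] = h[:, :, -(self.kernel - 1):]
        h = F.relu(self.conv1(h))                 # (B, d_ffn, S_c)
        h = torch.cat([cache["pc2"], h], dim=2)
        cache["pc2"] = h[:, :, -(self.kernel - 1):]
        h = self.conv2(h).transpose(1, 2)         # (B, S_c, d)
        return self.norm2(x + h), cache
```

A per-layer cache dict travels with the stream; the inference loop sketched in Section 3 shows how it is initialized and threaded through the chunks.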
2. Training with Receptive-Field-Constrained Chunk Attention Masks
To ensure the model trains under the same constraints it faces at inference, a receptive-field-constrained chunk attention mask is applied in each decoder layer. The local receptive window per chunk spans $S_p + S_c$ positions: the first $S_p$ positions cover the past cache and the remaining $S_c$ the current chunk. Writing $c_a = \lceil a / S_c \rceil$ for the chunk containing position $a$, the attention mask enforces

$$M_{a,b} = \begin{cases} 0, & (c_a - 1)\,S_c - S_p < b \le c_a\,S_c, \\ -\infty, & \text{otherwise.} \end{cases}$$
Two regimes are explored:
- Static Mask: $(S_c, S_p)$ fixed throughout training.
- Dynamic Mask: $(S_c, S_p)$ randomly sampled per batch from a predefined set of sizes, enabling generalization to varying chunk configurations at inference.
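A minimal sketch of how such a mask could be constructed, assuming the standard additive convention (0 for visible positions, $-\infty$ for masked ones); the function name and signature are mine, not the paper's:

```python
import torch


def chunk_attention_mask(T: int, S_c: int, S_p: int) -> torch.Tensor:
    """Additive attention mask: 0 where visible, -inf where masked (sketch).

    A query at position a may attend to every frame of its own chunk plus
    the S_p frames immediately preceding that chunk, matching the cache
    available at inference time.
    """
    pos = torch.arange(T)
    chunk_start = (pos // S_c) * S_c  # first frame of each query's chunk
    visible = (pos[None, :] >= (chunk_start - S_p)[:, None]) \
            & (pos[None, :] < (chunk_start + S_c)[:, None])
    # (T, T) mask to add to the attention logits before the softmax.
    return torch.zeros(T, T).masked_fill(~visible, float("-inf"))
```

Under the dynamic regime, the same function would simply be called with freshly sampled $(S_c, S_p)$ for every batch.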
3. State-Caching Inference Algorithm
The inference algorithm maintains, per decoder layer, fixed-size past-key/value and past-convolution caches that are updated after every processed chunk. For an input of $N$ chunks:
- For each chunk $t$, extract $\bar u_t$ from the upsampled encoder output.
- For every decoder layer:
  - Concatenate the past caches with the current chunk for MHA.
  - Update the caches by slicing after attention and after each convolution, keeping the last $S_p$ key/value frames and the trailing conv input frames.
  - Normalize and propagate to the next block.
- Project the final block output to a Mel chunk $\hat y_t$ and emit it immediately for audio synthesis.
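The loop below sketches this procedure end-to-end, reusing the hypothetical `ChunkFFTBlock` from Section 1; the cache initialization and the final projection `proj` are likewise assumptions for illustration:

```python
import torch


def synthesize_incrementally(u, blocks, proj, S_c=30):
    """Stream Mel chunks from upsampled encoder features u of shape (B, T, d).

    `blocks` is a list of ChunkFFTBlock instances (one per decoder layer) and
    `proj` a final linear layer to Mel bins; both mirror the earlier sketch
    and are illustrative, not the paper's reference implementation.
    """
    B, T, d = u.shape
    caches = [{"kv": u.new_zeros(B, 0, d),
               "pc1": u.new_zeros(B, d, blk.kernel - 1),
               "pc2": u.new_zeros(B, blk.conv1.out_channels, blk.kernel - 1)}
              for blk in blocks]
    for start in range(0, T, S_c):
        x = u[:, start:start + S_c]          # current chunk \bar u_t
        for blk, cache in zip(blocks, caches):
            x, cache = blk(x, cache)         # per-layer caches persist
        yield proj(x)                        # Mel chunk, ready for the vocoder

# Usage (illustrative):
#   blocks = [ChunkFFTBlock() for _ in range(6)]
#   proj = torch.nn.Linear(384, 80)
#   for mel_chunk in synthesize_incrementally(torch.randn(1, 300, 384), blocks, proj):
#       ...  # pass mel_chunk to a streaming vocoder
```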
Per-chunk runtime is constant, determined by $S_c$, $S_p$, and the model dimensions alone, and the total cost scales linearly with the number of chunks $N = \lceil T / S_c \rceil$; inference cost per chunk is independent of chunk index.
4. Training Objectives and Optimization
The multi-loss objective is retained from FastPitch:

$$\mathcal{L} = \mathcal{L}_{mel} + \alpha\,\mathcal{L}_{pitch} + \beta\,\mathcal{L}_{energy} + \gamma\,\mathcal{L}_{dur}$$

with:
- $\mathcal{L}_{mel} = \lVert \hat y - y \rVert_2^2$ (Mel-spectrogram MSE),
- $\mathcal{L}_{pitch} = \lVert \hat p - p \rVert_2^2$ (pitch MSE),
- $\mathcal{L}_{energy} = \lVert \hat e - e \rVert_2^2$ (energy MSE),
- $\mathcal{L}_{dur} = \lVert \hat d - d \rVert_2^2$ (duration MSE in the log domain).
No extra regularization or loss terms are used; the chunked receptive field is enforced purely by the attention mask.
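As a sketch, the composite objective can be computed directly from the four MSE terms; the function name and the weight values below are illustrative defaults, not the paper's settings:

```python
import torch.nn.functional as F


def incremental_fastpitch_loss(mel_hat, mel, p_hat, p, e_hat, e, d_hat, d,
                               alpha=0.1, beta=0.1, gamma=0.1):
    """Composite FastPitch-style objective; weights alpha/beta/gamma are illustrative."""
    return (F.mse_loss(mel_hat, mel)          # Mel-spectrogram term
            + alpha * F.mse_loss(p_hat, p)    # pitch term
            + beta * F.mse_loss(e_hat, e)     # energy term
            + gamma * F.mse_loss(d_hat, d))   # log-duration term
```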
5. Experimental Results: Speech Quality and Latency
Empirical evaluation demonstrates Incremental FastPitch produces speech of comparable quality to parallel FastPitch at substantially reduced latency.
Table 1: Comparative Metrics
| Model | MOS (±95% CI) | Latency (ms) | RTF |
|---|---|---|---|
| Parallel FastPitch | 4.185 ± 0.043 | 125.8 | 0.029 |
| Inc. FastPitch (Static Mask) | 4.178 ± 0.047 | 30.4 | 0.045 |
| Inc. FastPitch (Dynamic Mask) | 4.145 ± 0.052 | 30.4 | 0.045 |
| Ground Truth | 4.545 ± 0.039 | — | — |
- Mel-Spectrogram Distance (MSD): Lowest when the $(S_c, S_p)$ configuration matches between training and inference; dynamic masking generalizes across configurations, at the cost of roughly 8% higher MSD.
- MOS: Incremental FastPitch matches FastPitch (MOS 4.18) while cutting end-to-end latency by 4x (from 125.8 ms to 30.4 ms).
- RTF: At 0.045 ($\approx 1/22$), synthesis runs about 22x faster than real time, comfortably within real-time requirements.
6. Computational Complexity and Latency Profile
Let $L$ = number of decoder layers, $S_c$ = chunk size, $S_p$ = cache size, $d$ = model dimension, $h$ = attention heads, $d_f$ = FFN hidden dimension, and $k_1, k_2$ = conv kernel sizes.
- MHA Cost/Chunk: $O\!\left(S_c d^2 + S_c (S_p + S_c)\, d\right)$ (Q/K/V/output projections plus attention over the cache and the current chunk)
- Conv-FFN Cost/Chunk: $O\!\left(S_c\, d\, d_f (k_1 + k_2)\right)$
- Total Per-Chunk: $O\!\left(L \left[ S_c d^2 + S_c (S_p + S_c)\, d + S_c\, d\, d_f (k_1 + k_2) \right]\right)$
These costs are constant with respect to total utterance length $T$; only the number of chunks and the chunk size matter. Start-up latency is determined by the computation of the first chunk, i.e., by $S_c$: with a small chunk (e.g., $S_c = 30$ Mel frames at $22.05$ kHz), sub-$30$ ms startup latency is achievable, as the numeric sanity check below illustrates.
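A back-of-the-envelope script (hyperparameter values are assumed, roughly FastPitch-scale, and the function is mine) confirms that the per-chunk multiply-accumulate count depends only on the chunk parameters, never on $T$:

```python
def per_chunk_macs(S_c=30, S_p=90, d=384, d_f=1536, k1=3, k2=3, L=6):
    """Rough multiply-accumulate count for one decoded chunk (illustrative).

    MHA: Q/K/V/output projections over S_c frames plus attention scores and
    context over S_p + S_c keys; FFN: two 1-D convs producing S_c frames.
    """
    mha = 4 * S_c * d * d + 2 * S_c * (S_p + S_c) * d
    ffn = S_c * k1 * d * d_f + S_c * k2 * d_f * d
    return L * (mha + ffn)


print(per_chunk_macs())          # identical for every chunk of an utterance
print(per_chunk_macs(S_c=60))    # grows with chunk size, never with T
```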
7. Significance and Applications
Incremental FastPitch integrates chunked processing, cache-based attention, and tailored attention masking to deliver high-MOS, low-latency, scalable synthesis. Its design enables real-time, streaming, and interactive TTS workloads with reduced response times and memory requirements, while retaining the benefits of transformer-based parallel TTS models. The architecture provides a framework for incremental speech synthesis with complexity governed by chunk parameters rather than utterance length, pointing to new possibilities for low-latency, high-continuity TTS systems (Du et al., 2024).