Incremental FastPitch: Real-Time TTS
- Incremental FastPitch is a text-to-speech model variant that uses chunk-based transformer decoders for incremental, low-latency speech synthesis.
- It employs advanced attention masking and state-caching to maintain high speech fidelity while dramatically reducing end-to-end latency.
- The architecture scales linearly with utterance length, making it ideal for real-time streaming and interactive TTS applications.
Incremental FastPitch is a variant of the FastPitch text-to-speech (TTS) architecture that enables low-latency, high-quality, incremental speech synthesis by re-architecting the decoder to operate on fixed-size Mel-spectrogram chunks, using chunk-based Feed-Forward Transformer (FFT) blocks, attention masking for receptive-field control, and a state-caching inference regime. The method delivers real-time responsiveness suitable for streaming and interactive speech applications, cutting end-to-end latency dramatically compared to fully parallel transformer decoders while preserving output speech fidelity (Du et al., 3 Jan 2024).
1. Architectural Modifications in Incremental FastPitch
The canonical FastPitch system synthesizes speech in a fully parallel manner using transformer FFT blocks that process the upsampled encoder features for the entire utterance in a single pass, yielding the full Mel-spectrogram $\hat y$ at once. Incremental FastPitch preserves the encoder, duration, pitch, and energy predictors from FastPitch, but restructures the decoder to operate on non-overlapping chunks $\bar u^1, \dots, \bar u^N$, where each $\bar u^t \in \mathbb{R}^{S_c \times d}$ and $S_c$ is the chunk size.
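For concreteness, here is a minimal PyTorch sketch of the chunking step; `split_chunks` is an illustrative helper name, not the authors' API:

```python
import torch

def split_chunks(u: torch.Tensor, s_c: int) -> list[torch.Tensor]:
    """Split upsampled encoder features u of shape (B, T, d) into
    non-overlapping chunks of at most s_c frames along the time axis."""
    return [u[:, t:t + s_c] for t in range(0, u.shape[1], s_c)]

# Example: a 100-frame utterance with chunk size 30 yields chunks of
# 30, 30, 30, and 10 frames.
chunks = split_chunks(torch.randn(1, 100, 384), s_c=30)
```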
The decoder comprises a stack of chunk-based FFT blocks, each with:
- Multi-Head Attention (MHA) Sub-layer: At chunk $t$, MHA computes query-key-value attention in which the keys and values are the concatenation of a fixed-size cache of past keys/values ($pk^{t-1}$, $pv^{t-1}$, each of length $S_p$) with the current chunk's projections ($\bar u^t W^K$, $\bar u^t W^V$). This design makes the per-chunk inference cost independent of the chunk index and the global sequence length, and preserves Mel continuity across chunk boundaries.
- Two-Layer Causal Convolutional Feed-Forward Network (FFN): Each conv layer $j \in \{1, 2\}$ caches its last $k_j - 1$ input states ($pc^{t-1}_j$, where $k_j$ is the kernel size) so the convolutions remain causal across the chunk sequence.
The computation of a chunk-based FFT block can be summarized as:

```
┌─ Chunk-based FFT block (decoder layer i, chunk t) ─────┐
│ Input chunk: \bar u^t ∈ R^{S_c×d}
├─► [MHA w/ past-cache] ──► Add & Norm
│     • reads pk_i^{t−1}, pv_i^{t−1}
│     • produces o^t_M, updates pk_i^t, pv_i^t
├─► [Causal-Conv FFN w/ past-state] ──► Add & Norm
│     • reads pc^{t−1}_1, pc^{t−1}_2
│     • produces o^t_{c2}, updates pc^t_1, pc^t_2
└────────────────────────────────────────────────────────┘
```
Mathematically, for MHA in head $h$ at chunk $t$:
- $q^t_h = \bar u^t W^Q_h,\quad k^t_h = [\,pk^{t-1}_h;\ \bar u^t W^K_h\,],\quad v^t_h = [\,pv^{t-1}_h;\ \bar u^t W^V_h\,]$;
- $o^t_h = \operatorname{softmax}\!\big(q^t_h (k^t_h)^{\top} / \sqrt{d_k}\big)\, v^t_h$, with cache updates $pk^t_h = k^t_h[-S_p\!:]$ and $pv^t_h = v^t_h[-S_p\!:]$.
The causal-conv FFN processes the attention output analogously, caching and updating its past conv states $pc^t_1, pc^t_2$.
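A minimal PyTorch sketch of one such block under the definitions above; the class name `ChunkFFTBlock`, single-head attention, default sizes, and cache-handling details are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ChunkFFTBlock(nn.Module):
    """One chunk-based FFT block: cached (single-head) attention plus a
    two-layer causal-conv FFN, with fixed-size past-state caches."""
    def __init__(self, d=384, d_ffn=1536, kernel=3, s_p=5):
        super().__init__()
        self.wq, self.wk, self.wv = nn.Linear(d, d), nn.Linear(d, d), nn.Linear(d, d)
        self.conv1 = nn.Conv1d(d, d_ffn, kernel)   # "valid" conv; causality via cache
        self.conv2 = nn.Conv1d(d_ffn, d, kernel)
        self.norm1, self.norm2 = nn.LayerNorm(d), nn.LayerNorm(d)
        self.s_p, self.kernel = s_p, kernel

    def forward(self, u, pk, pv, pc1, pc2):
        # u: (B, S_c, d) current chunk; pk, pv: (B, <=S_p, d) cached keys/values;
        # pc1: (B, d, kernel-1) and pc2: (B, d_ffn, kernel-1) cached conv inputs.
        qry = self.wq(u)
        key = torch.cat([pk, self.wk(u)], dim=1)        # past cache + current chunk
        val = torch.cat([pv, self.wv(u)], dim=1)
        att = torch.softmax(qry @ key.transpose(1, 2) / qry.shape[-1] ** 0.5, dim=-1)
        x = self.norm1(u + att @ val)                   # Add & Norm
        # Causal-conv FFN: prepend cached past frames, emit only current frames.
        c1_in = torch.cat([pc1, x.transpose(1, 2)], dim=2)
        h = F.relu(self.conv1(c1_in))                   # (B, d_ffn, S_c)
        c2_in = torch.cat([pc2, h], dim=2)
        y = self.conv2(c2_in).transpose(1, 2)           # (B, S_c, d)
        out = self.norm2(x + y)
        # Return output plus updated fixed-size caches for chunk t+1.
        return (out, key[:, -self.s_p:], val[:, -self.s_p:],
                c1_in[:, :, -(self.kernel - 1):], c2_in[:, :, -(self.kernel - 1):])
```

At the first chunk the key/value caches can start empty, while the conv caches start as zeros so that the first convolution sees ordinary causal zero-padding.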
2. Training with Receptive-Field-Constrained Chunk Attention Masks
To ensure the model operates within its inference constraints during training, a block-diagonal attention mask with a bounded $S_p$-frame look-back is applied in each decoder layer. The local receptive window per chunk is $S_p + S_c$; the first $S_p$ positions of the window refer to the past cache and the remaining $S_c$ are the current chunk. For query position $a$ and key position $b$, with chunk start $s_a = \lfloor a / S_c \rfloor S_c$, the attention mask enforces $M_{a,b} = 0$ if $s_a - S_p \le b < s_a + S_c$ and $M_{a,b} = -\infty$ otherwise. A minimal mask-construction sketch follows the two regimes below.
Two regimes are explored:
- Static Mask: a single $(S_c, S_p)$ configuration fixed throughout training.
- Dynamic Mask: $(S_c, S_p)$ randomly re-sampled for each training batch, enabling generalization to varying chunk configurations at inference.
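A minimal sketch of constructing such a mask in PyTorch (`chunk_attention_mask` is an illustrative name; `True` marks allowed attention):

```python
import torch

def chunk_attention_mask(T: int, s_c: int, s_p: int) -> torch.Tensor:
    """(T, T) boolean mask: query position a may attend to key position b
    iff b lies in a's own chunk or in the s_p frames just before it."""
    idx = torch.arange(T)
    start = (idx // s_c) * s_c                 # start frame of each query's chunk
    lo = (start - s_p).clamp(min=0)            # earliest visible key position
    hi = start + s_c                           # exclusive end of the chunk
    b = idx.unsqueeze(0)                       # (1, T) key positions
    return (b >= lo.unsqueeze(1)) & (b < hi.unsqueeze(1))

# Static regime: one fixed (s_c, s_p); dynamic regime: resample them per batch.
mask = chunk_attention_mask(T=120, s_c=30, s_p=5)
```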
3. State-Caching Inference Algorithm
The inference algorithm maintains, per decoder layer $i$, fixed-size past-key/value and past-convolution caches that are updated after every processed chunk, as sketched below. For an input of $N$ chunks:
- For each chunk $t = 1, \dots, N$, extract $\bar u^t$ from the upsampled encoder features.
- For every decoder layer:
  - Concatenate the past key/value caches with the current chunk's projections for MHA.
  - Update the caches by retaining the last $S_p$ keys/values after attention and the trailing conv inputs after each convolution.
  - Normalize and propagate to the next block.
- Project the final output to yield the Mel chunk $\hat y^t$ and emit it for immediate audio synthesis.
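A sketch of this loop, reusing the `ChunkFFTBlock` sketch from Section 1 (the helper name and cache initialization are assumptions):

```python
import torch

@torch.no_grad()
def incremental_decode(blocks, proj, u, s_c):
    """Stream Mel chunks from upsampled encoder features u: (B, T, d).
    blocks: list of ChunkFFTBlock; proj: linear map to Mel bins."""
    B, T, d = u.shape
    km1 = blocks[0].kernel - 1
    pk  = [torch.zeros(B, 0, d) for _ in blocks]                       # empty KV caches
    pv  = [torch.zeros(B, 0, d) for _ in blocks]
    pc1 = [torch.zeros(B, b.conv1.in_channels,  km1) for b in blocks]  # zero conv pads
    pc2 = [torch.zeros(B, b.conv1.out_channels, km1) for b in blocks]
    mel = []
    for t in range(0, T, s_c):
        x = u[:, t:t + s_c]                      # current chunk \bar u^t
        for i, blk in enumerate(blocks):
            x, pk[i], pv[i], pc1[i], pc2[i] = blk(x, pk[i], pv[i], pc1[i], pc2[i])
        mel.append(proj(x))                      # emit immediately to the vocoder
    return torch.cat(mel, dim=1)                 # (B, T, n_mel) when run offline
```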
Per-chunk runtime is constant with respect to the chunk index, and the total cost scales linearly with the number of chunks, $O(N)$; see Section 6 for the exact per-chunk cost.
4. Training Objectives and Optimization
The multi-loss objective is retained from FastPitch (a minimal code sketch follows the definitions):

$\mathcal{L} = \mathcal{L}_{\text{mel}} + \alpha\,\mathcal{L}_{\text{pitch}} + \beta\,\mathcal{L}_{\text{energy}} + \gamma\,\mathcal{L}_{\text{dur}}$

with:
- $\mathcal{L}_{\text{mel}} = \lVert \hat y - y \rVert_2^2$ (predicted vs. target Mel-spectrogram),
- $\mathcal{L}_{\text{pitch}} = \lVert \hat p - p \rVert_2^2$,
- $\mathcal{L}_{\text{energy}} = \lVert \hat e - e \rVert_2^2$,
- $\mathcal{L}_{\text{dur}} = \lVert \hat d - d \rVert_2^2$.
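As a sketch, the combined objective in PyTorch; the weights `alpha`, `beta`, `gamma` are illustrative placeholders, not values from the paper:

```python
import torch.nn.functional as F

def fastpitch_loss(mel_hat, mel, p_hat, p, e_hat, e, d_hat, d,
                   alpha=0.1, beta=0.1, gamma=0.1):
    """Sum of plain MSE terms over Mel, pitch, energy, and duration."""
    return (F.mse_loss(mel_hat, mel)
            + alpha * F.mse_loss(p_hat, p)
            + beta  * F.mse_loss(e_hat, e)
            + gamma * F.mse_loss(d_hat, d))
```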
No extra regularization or loss terms are used; the chunked receptive field is enforced purely by the attention mask.
5. Experimental Results: Speech Quality and Latency
Empirical evaluation demonstrates Incremental FastPitch produces speech of comparable quality to parallel FastPitch at substantially reduced latency.
Table 1: Comparative Metrics
| Model | MOS (± 95% CI) | Latency (ms) | RTF |
|---|---|---|---|
| Parallel FastPitch | 4.185 ± 0.043 | 125.8 | 0.029 |
| Inc. FastPitch (Static Mask) | 4.178 ± 0.047 | 30.4 | 0.045 |
| Inc. FastPitch (Dynamic Mask) | 4.145 ± 0.052 | 30.4 | 0.045 |
| Ground Truth | 4.545 ± 0.039 | — | — |
- Mel-Spectrogram Distance (MSD): lowest when the chunk configuration $(S_c, S_p)$ used at inference matches the one used in training; the dynamic mask generalizes to unseen configurations at the cost of roughly 8% higher MSD.
- MOS: Incremental FastPitch matches FastPitch (MOS 4.18) while cutting end-to-end latency by 4x (from 125.8 ms to 30.4 ms).
- RTF: At 0.045 ($\approx 1/22$), synthesis remains far faster than real time.
6. Computational Complexity and Latency Profile
Let $L$ = number of decoder layers, $S_c$ = chunk size, $S_p$ = cache size, $d$ = model dimension, $h$ = number of attention heads, $d_k = d/h$, $d_f$ = FFN hidden dimension, and $k_1, k_2$ = conv kernel sizes.
- MHA Cost/Chunk: $O\big(S_c d^2 + S_c (S_p + S_c) d\big)$ (projections plus attention over the $S_p + S_c$ window)
- Conv-FFN Cost/Chunk: $O\big(S_c d\, d_f (k_1 + k_2)\big)$
- Total Per-Chunk: $O\big(L\,[\,S_c d^2 + S_c (S_p + S_c) d + S_c d\, d_f (k_1 + k_2)\,]\big)$
These costs are constant with respect to total utterance length $T$; only the number of chunks and the chunk size matter. Start-up latency is determined by the time to compute the first chunk. With small $S_c$ (e.g., $30$ frames, $\approx 348$ ms of audio at $22.05$ kHz assuming a $256$-sample hop), startup latency of roughly $30$ ms is achievable.
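A quick back-of-envelope check of the chunk timing, assuming the $256$-sample hop noted above:

```python
# Audio duration covered by one chunk vs. the compute latency to produce it.
sr, hop, s_c = 22050, 256, 30                  # sample rate, hop size, chunk frames
audio_ms = 1000 * s_c * hop / sr               # ≈ 348 ms of audio per 30-frame chunk
print(f"one chunk covers {audio_ms:.0f} ms of audio")
# Playback of ~348 ms of audio begins after ~30 ms of compute (Table 1),
# so the compute latency of the first chunk bounds the perceived startup delay.
```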
7. Significance and Applications
Incremental FastPitch integrates chunked processing, cache-based attention, and tailored attention masking to deliver high-MOS, low-latency, scalable synthesis. Its design enables real-time, streaming, and interactive TTS workloads with reduced response times and memory requirements, while retaining the benefits of transformer-based parallel TTS models. The architecture provides a framework for incremental speech synthesis whose complexity is governed by chunk parameters rather than utterance length, pointing to new possibilities for low-latency, high-continuity TTS systems (Du et al., 3 Jan 2024).