FLY-TTS: Efficient End-to-End TTS
- FLY-TTS is a fast, lightweight end-to-end TTS system that achieves an 8.8× CPU speedup and 1.6× model compression compared to VITS while maintaining naturalness.
- It employs a ConvNeXt-based decoder with iSTFT for efficient waveform reconstruction and uses grouped parameter sharing in both the text encoder and flow model.
- Incorporating a pre-trained WavLM discriminator with adversarial training, FLY-TTS delivers competitive quality metrics such as MOS, MCD, and WER.
FLY-TTS is a fast, lightweight, and high-quality end-to-end text-to-speech (TTS) system that significantly improves computational efficiency and model compression without sacrificing naturalness or synthesis quality. FLY-TTS is architecturally based on the conditional variational autoencoder (VAE) plus flow framework of VITS, incorporating a ConvNeXt block–driven decoder that predicts Fourier coefficients and leverages the inverse short-time Fourier transform (iSTFT) for waveform reconstruction. The system introduces grouped parameter sharing in the text encoder and flow model, and utilizes adversarial loss from a frozen, large-scale pre-trained WavLM model to boost output quality. FLY-TTS achieves a real-time factor (RTF) of 0.0139 on Intel Core i9 CPUs, representing an 8.8× speedup and 1.6× model compression in comparison to the VITS baseline, while maintaining comparable mean opinion scores (MOS) and objective metrics (Guo et al., 2024).
1. System Architecture and Core Components
FLY-TTS retains the core VITS framework (a conditional VAE, a flow-based prior, and a neural decoder) while introducing extensive architectural modifications. The model comprises the following components:
- Prior Encoder: Maps the input phoneme sequence c to a latent code z. Consists of a text encoder (Transformer layers with grouped parameter sharing) and a normalizing flow f.
- Decoder: Converts the latent z to the target waveform using stacked ConvNeXt blocks followed by a fast iSTFT operation.
- Discriminators: Includes the original HiFi-GAN–style multi-period and multi-scale discriminator set as well as a frozen, pre-trained WavLM model with a lightweight prediction head, providing adversarial feedback during training.
Grouped Parameter Sharing: The Transformer layers in the text encoder are partitioned into groups of consecutive layers, with parameters shared within each group. The steps of the normalizing flow are likewise partitioned into groups, sharing parameters only in the WaveNet-based projection sub-modules within each group.
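The effect of grouped parameter sharing on model size can be illustrated with a minimal Python sketch (the `Layer` class and the layer/group sizes below are hypothetical stand-ins, not the paper's actual configuration):

```python
class Layer:
    """Stand-in for a Transformer layer holding `size` parameters."""
    def __init__(self, size):
        self.params = [0.0] * size

def build_grouped_stack(n_layers, group_size, layer_size):
    """Return n_layers layer slots; slots within a group reuse one Layer object."""
    shared = [Layer(layer_size) for _ in range(n_layers // group_size)]
    return [shared[i // group_size] for i in range(n_layers)]

def unique_param_count(stack):
    # de-duplicate shared modules by identity before counting parameters
    return sum(len(l.params) for l in {id(l): l for l in stack}.values())

stack = build_grouped_stack(n_layers=6, group_size=2, layer_size=1000)
assert len(stack) == 6                    # still 6 layer applications at inference
assert unique_param_count(stack) == 3000  # but only 3 distinct parameter sets
```

The depth of the network (and hence its compute cost) is unchanged; only the stored parameter count shrinks by the group size.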
ConvNeXt-Based Decoder + iSTFT:
Waveform synthesis proceeds by having the decoder generate frame-wise Fourier coefficients:
- A latent sample z is mapped to a frame-level hidden sequence h.
- h is projected to match the FFT bin size (n_fft / 2 + 1).
- Stacked ConvNeXt blocks produce an amplitude matrix A and a phase matrix P.
- The time-domain signal is reconstructed via iSTFT:
x = iSTFT(A ⊙ e^{jP}; hop, w, n_fft)
where hop is the hop size, w is the window function, and n_fft is the FFT length; batched FFT/iFFT is used for computational efficiency.
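The amplitude/phase recombination at the heart of this decoder can be illustrated on a single frame with a pure-Python inverse real DFT (a toy stand-in for the batched iSTFT; the frame size and test signal are illustrative):

```python
import cmath, math

def irdft(spec, n):
    """Inverse real DFT: n//2 + 1 complex bins -> n real samples."""
    out = []
    for t in range(n):
        val = 0.0
        for k, c in enumerate(spec):
            w = 2 if 0 < k < n // 2 else 1  # conjugate-symmetric bins count twice
            val += w * (c * cmath.exp(2j * math.pi * k * t / n)).real
        out.append(val / n)
    return out

n_fft = 8
frame = [math.sin(2 * math.pi * t / n_fft) for t in range(n_fft)]  # one-cycle sine
# forward real DFT of the frame
spec = [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n_fft)
            for t in range(n_fft))
        for k in range(n_fft // 2 + 1)]
# the decoder predicts amplitude A and phase P; recombine as A * e^{jP}
amp = [abs(c) for c in spec]
phase = [cmath.phase(c) for c in spec]
recon = irdft([a * cmath.exp(1j * p) for a, p in zip(amp, phase)], n_fft)
assert all(abs(r - f) < 1e-9 for r, f in zip(recon, frame))
```

In the real system this per-frame inversion is replaced by a fast batched iFFT plus windowed overlap-add across frames, which is where the CPU speedup comes from.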
2. Training Objectives and Optimization
VAE + Flow Variational Training
FLY-TTS maximizes the evidence lower bound (ELBO) as in VITS:
log p(x | c) ≥ E_{q(z|x)}[ log p(x | z) ] − KL( q(z | x) ‖ p(z | c) )
The likelihood term log p(x | z) is approximated by an L1 mel-spectrogram reconstruction loss, combined with adversarial and feature-matching losses from the HiFi-GAN discriminators.
WavLM-Based Adversarial Training
The system incorporates a frozen pre-trained WavLM encoder, to which a lightweight CNN prediction head is attached to form an additional discriminator. The adversarial losses are least-squares GAN objectives:
L_adv(D) = E_{(x,z)}[ (D(x) − 1)² + D(G(z))² ],  L_adv(G) = E_z[ (D(G(z)) − 1)² ]
The total generator loss aggregates the VAE + flow terms, the HiFi-GAN adversarial/feature-matching losses, the WavLM adversarial loss, and the auxiliary duration/variance prediction losses, as in VITS. All training hyperparameters mirror the VITS settings: AdamW with β₁ = 0.8, β₂ = 0.99, weight decay 0.01, learning rate 2 × 10⁻⁴, and an exponential learning-rate decay of 0.999875 per epoch.
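The least-squares GAN objectives can be sketched numerically as follows (the scalar scores here are illustrative; the real discriminators operate on waveforms and WavLM features):

```python
def lsgan_d_loss(d_real, d_fake):
    """Discriminator pushes real scores toward 1 and fake scores toward 0."""
    return (sum((r - 1.0) ** 2 for r in d_real) / len(d_real)
            + sum(f ** 2 for f in d_fake) / len(d_fake))

def lsgan_g_loss(d_fake):
    """Generator pushes the discriminator's fake scores toward 1."""
    return sum((f - 1.0) ** 2 for f in d_fake) / len(d_fake)

d_real = [0.9, 0.8]  # scores for real waveforms
d_fake = [0.2, 0.1]  # scores for synthesized waveforms
print(lsgan_d_loss(d_real, d_fake), lsgan_g_loss(d_fake))
# a perfect outcome drives both losses to zero
assert lsgan_d_loss([1.0], [0.0]) == 0.0
assert lsgan_g_loss([1.0]) == 0.0
```

Because the WavLM encoder is frozen, only the small prediction head receives gradient updates on the discriminator side.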
3. Computational Efficiency and Compression
FLY-TTS achieves substantial gains in CPU inference speed and model size:
| Model | #Params (M) | RTF-CPU | Speedup vs. VITS-base | Compression Ratio |
|---|---|---|---|---|
| VITS-base | 28.11 | 0.1221 | 1.0× | 1.0× |
| FLY-TTS | 17.89 | 0.0139 | 8.8× | 1.6× |
| Mini FLY-TTS | 10.92 | 0.0127 | 9.6× | 2.6× |
RTF is defined as the ratio of wall-clock synthesis time to output audio duration, measured on an Intel Core i9-10920X @ 3.5 GHz (no further optimizations). FLY-TTS eliminates transposed-convolution upsampling by shifting the majority of computation to fast FFT/iFFT routines. Grouped parameter sharing enables a 36%–61% reduction in model size with negligible impact on synthesized speech quality.
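The speedup and compression figures in the table follow directly from the reported RTF values and parameter counts; a quick check:

```python
# RTF (lower is faster) and parameter counts (millions) from the table above
rtf = {"VITS-base": 0.1221, "FLY-TTS": 0.0139, "Mini FLY-TTS": 0.0127}
params_m = {"VITS-base": 28.11, "FLY-TTS": 17.89, "Mini FLY-TTS": 10.92}

speedup = {m: rtf["VITS-base"] / v for m, v in rtf.items()}
compression = {m: params_m["VITS-base"] / p for m, p in params_m.items()}

assert round(speedup["FLY-TTS"], 1) == 8.8
assert round(compression["FLY-TTS"], 1) == 1.6
assert round(speedup["Mini FLY-TTS"], 1) == 9.6
assert round(compression["Mini FLY-TTS"], 1) == 2.6
# parameter reduction range quoted in the text: 36%-61%
assert round(100 * (1 - params_m["FLY-TTS"] / params_m["VITS-base"])) == 36
assert round(100 * (1 - params_m["Mini FLY-TTS"] / params_m["VITS-base"])) == 61
```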
4. Quality Evaluation and Ablation Analysis
Objective and subjective performance is established through standard metrics:
| Model | #Params | RTF-CPU | MCD | WER (%) | MOS ±95% CI |
|---|---|---|---|---|---|
| Ground truth | – | – | – | 1.56 | 4.21 ± 0.10 |
| VITS-base | 28.11 M | 0.1221 | 5.49 | 1.71 | 4.15 ± 0.09 |
| MB-iSTFT-base | 27.49 M | 0.0274 | 5.57 | 1.89 | 4.08 ± 0.11 |
| FLY-TTS | 17.89 M | 0.0139 | 5.56 | 1.77 | 4.12 ± 0.09 |
| Mini FLY-TTS | 10.92 M | 0.0127 | 5.63 | 2.09 | 4.05 ± 0.09 |
- MCD: Mel-cepstral distortion.
- WER: Word error rate of ASR transcriptions of the synthesized speech.
- MOS: Mean opinion score (naturalness).
FLY-TTS matches or narrowly trails VITS-base in MCD and WER, while maintaining a comparable MOS (4.12 ± 0.09 vs. 4.15 ± 0.09), indicating minimal loss in perceptual quality. Mini FLY-TTS (further compressed) shows only a minor MOS reduction.
Ablation studies show that replacing the ConvNeXt + iSTFT decoder with a multi-band upsampler doubles the RTF and slightly reduces MOS (from 4.12 ± 0.09 to 4.01 ± 0.11). Removing the WavLM discriminator results in a MOS drop of ~0.14, substantiating its impact on output quality.
5. Distinctive Contributions and Methodological Insights
FLY-TTS introduces several impactful design elements:
- ConvNeXt + iSTFT Decoder: Removes the dependency on computationally heavy transposed-convolution upsamplers, shifting decoding computation to efficient FFT/iFFT operations and enabling an 8.8× CPU speedup.
- Grouped Parameter Sharing: Empirical results show that parameter sharing in contiguous Transformer and flow-step groups is effective, supporting compression rates of up to 61% with marginal performance loss.
- WavLM-Based Discriminator: Leveraging a large, fixed self-supervised speech representation via WavLM (with a lightweight head) yields adversarial gradients that significantly enhance speech naturalness, without impacting inference cost.
6. Limitations and Potential Research Directions
FLY-TTS is evaluated exclusively in a single-speaker setting (LJSpeech). Extending grouped parameter sharing, ConvNeXt+iSTFT decoding, and WavLM-guided adversarial training to multi-speaker, cross-lingual, or style-conditioned scenarios remains unexplored. Further investigation into more aggressive parameter tying or lighter self-supervised discriminators could provide additional efficiency gains and broader applicability. A plausible implication is that these design elements are orthogonal and can be combined with other lightweight architectures in TTS research (Guo et al., 2024).