
FLY-TTS: Efficient End-to-End TTS

Updated 5 April 2026
  • FLY-TTS is a fast, lightweight end-to-end TTS system that achieves an 8.8× CPU speedup and 1.6× model compression compared to VITS while maintaining naturalness.
  • It employs a ConvNeXt-based decoder with iSTFT for efficient waveform reconstruction and uses grouped parameter sharing in both the text encoder and flow model.
  • With adversarial training against a frozen, pre-trained WavLM discriminator, FLY-TTS delivers quality competitive with VITS on MOS, MCD, and WER.

FLY-TTS is a fast, lightweight, and high-quality end-to-end text-to-speech (TTS) system that significantly improves computational efficiency and model compression without sacrificing naturalness or synthesis quality. FLY-TTS is architecturally based on the conditional variational autoencoder (VAE) plus flow framework of VITS, incorporating a ConvNeXt block–driven decoder that predicts Fourier coefficients and leverages the inverse short-time Fourier transform (iSTFT) for waveform reconstruction. The system introduces grouped parameter sharing in the text encoder and flow model, and utilizes adversarial loss from a frozen, large-scale pre-trained WavLM model to boost output quality. FLY-TTS achieves a real-time factor (RTF) of 0.0139 on Intel Core i9 CPUs, representing an 8.8× speedup and 1.6× model compression in comparison to the VITS baseline, while maintaining comparable mean opinion scores (MOS) and objective metrics (Guo et al., 2024).

1. System Architecture and Core Components

FLY-TTS retains the key components of the VITS framework (a conditional VAE, a normalizing-flow prior, and a neural decoder) while applying extensive architectural modifications. At inference, the model comprises the following stages:

  • Prior Encoder ($E_{\rm prior}$): Maps the input phoneme sequence $c$ to a latent code $z$. Consists of a text encoder (Transformer layers with grouped parameter sharing) and a normalizing flow $f_\theta$.
  • Decoder ($G$): Converts the latent $z$ to the target waveform $\hat{y}$ using stacked ConvNeXt blocks followed by a fast iSTFT operation.
  • Discriminators: Includes the original HiFi-GAN–style multi-period and multi-scale discriminator set ($D$) as well as a fixed, pre-trained WavLM model ($D_{\rm W}$) with a lightweight prediction head, providing adversarial feedback.

Grouped Parameter Sharing: Transformer layers in the encoder are organized into $g_1$ groups of consecutive layers, with parameters shared within each group. The steps of the normalizing flow are likewise partitioned into groups, sharing parameters only in the WaveNet-based projection sub-modules within each group.
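The sharing scheme can be sketched in plain Python: layer slots in the same group point at one underlying parameter set, so the stack's depth exceeds its parameter count. All names below (`Layer`, `build_shared_stack`, the group/layer counts) are illustrative, not from the FLY-TTS code.

```python
# Sketch of grouped parameter sharing: consecutive layer slots in the same
# group reuse a single parameter set. Names and sizes are illustrative.

class Layer:
    """A stand-in for a Transformer layer with one parameter blob."""
    def __init__(self, n_params):
        self.weights = [0.0] * n_params  # placeholder parameter storage

def build_shared_stack(n_layers, n_groups, n_params):
    """Create n_layers layer slots backed by only n_groups parameter sets."""
    assert n_layers % n_groups == 0
    per_group = n_layers // n_groups
    groups = [Layer(n_params) for _ in range(n_groups)]
    # Consecutive layer slots map to the same underlying Layer object.
    return [groups[i // per_group] for i in range(n_layers)]

stack = build_shared_stack(n_layers=6, n_groups=3, n_params=10)
unique = len({id(layer) for layer in stack})
print(unique)                 # 3: three parameter sets serve six layer slots
print(stack[0] is stack[1])   # True: layers 0 and 1 share parameters
```

A 6-layer stack with 3 groups thus stores only half the per-layer parameters, which is the mechanism behind the compression figures reported below.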

ConvNeXt-Based Decoder + iSTFT:

Waveform synthesis proceeds by having the decoder generate frame-wise Fourier coefficients:

  1. A latent sample $z$ is drawn from the prior and passed to the decoder.
  2. The decoder's hidden representation is projected to match the number of FFT bins.
  3. Stacked ConvNeXt blocks produce amplitude ($A$) and phase ($\phi$) matrices.
  4. The time-domain signal is reconstructed via iSTFT:

$\hat{y} = \mathrm{iSTFT}\!\left(A \cdot e^{j\phi};\; h,\, w,\, N_{\rm fft}\right)$

where $h$ is the hop size, $w$ is the window function, and $N_{\rm fft}$ is the FFT length; batched FFT/iFFT routines are used for computational efficiency.
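As a rough sketch of the final reconstruction step, the following numpy-only overlap-add iSTFT rebuilds a waveform from amplitude and phase matrices; the frame, hop, and window choices are illustrative, not FLY-TTS's actual configuration.

```python
# Minimal overlap-add iSTFT from amplitude and phase matrices (numpy only).
# A toy stand-in for the decoder output; sizes are illustrative.
import numpy as np

def istft(amplitude, phase, n_fft, hop, window):
    """Overlap-add inverse STFT. amplitude/phase have shape (n_bins, n_frames)."""
    spec = amplitude * np.exp(1j * phase)          # complex spectrogram A * e^{j phi}
    n_frames = spec.shape[1]
    out = np.zeros(n_fft + hop * (n_frames - 1))
    norm = np.zeros_like(out)                      # window-overlap normalizer
    for t in range(n_frames):
        frame = np.fft.irfft(spec[:, t], n=n_fft) * window
        out[t * hop : t * hop + n_fft] += frame
        norm[t * hop : t * hop + n_fft] += window ** 2
    return out / np.maximum(norm, 1e-8)

# Round-trip check: analyze a sine with a windowed STFT, then reconstruct it.
n_fft, hop = 16, 8
window = np.hanning(n_fft)
x = np.sin(2 * np.pi * np.arange(128) / 16.0)
frames = np.stack([np.fft.rfft(x[i : i + n_fft] * window)
                   for i in range(0, len(x) - n_fft + 1, hop)], axis=1)
y = istft(np.abs(frames), np.angle(frames), n_fft, hop, window)
```

In the paper's decoder the amplitude and phase matrices come from the ConvNeXt blocks rather than an analysis STFT, but the synthesis step is the same batched inverse transform.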

2. Training Objectives and Optimization

VAE + Flow Variational Training

FLY-TTS maximizes the evidence lower bound (ELBO) as in VITS:

$\log p_\theta(x \mid c) \;\geq\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[\log p_\theta(x \mid z) - \log \frac{q_\phi(z \mid x)}{p_\theta(z \mid c)}\right]$

The likelihood term $p_\theta(x \mid z)$ is approximated by an L1 mel-spectrogram reconstruction loss, combined with adversarial and feature-matching losses from the HiFi-GAN discriminators.
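The KL term in the ELBO compares the approximate posterior $q_\phi(z \mid x)$ with the flow-transformed prior $p_\theta(z \mid c)$; for diagonal Gaussians, as in VITS, it has a closed form. A numpy sketch with illustrative variable names:

```python
# Closed-form KL(q || p) for diagonal Gaussians, as appears in the ELBO
# above. Variable names (mu_q, logs_q, ...) are illustrative.
import numpy as np

def kl_diag_gaussians(mu_q, logs_q, mu_p, logs_p):
    """KL(N(mu_q, e^{2*logs_q}) || N(mu_p, e^{2*logs_p})), summed over dims."""
    var_q = np.exp(2.0 * logs_q)
    var_p = np.exp(2.0 * logs_p)
    kl = logs_p - logs_q + (var_q + (mu_q - mu_p) ** 2) / (2.0 * var_p) - 0.5
    return kl.sum()

mu = np.array([0.1, -0.3, 0.7])
logs = np.array([0.0, -0.5, 0.2])
print(kl_diag_gaussians(mu, logs, mu, logs))  # 0.0: identical distributions
```

Minimizing this term pulls the posterior latents toward the prior predicted from text, which is what lets the prior encoder alone drive synthesis at inference time.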

WavLM-Based Adversarial Training

The system incorporates a frozen pre-trained WavLM encoder ($D_{\rm W}$), to which a lightweight CNN prediction head is attached. The adversarial losses used are least-squares GAN objectives:

$\mathcal{L}_{\rm adv}(D_{\rm W}) = \mathbb{E}_{(y,\,z)}\!\left[\left(D_{\rm W}(y) - 1\right)^2 + D_{\rm W}(G(z))^2\right], \qquad \mathcal{L}_{\rm adv}(G) = \mathbb{E}_{z}\!\left[\left(D_{\rm W}(G(z)) - 1\right)^2\right]$

The total generator loss $\mathcal{L}_G$ aggregates the VAE + flow terms, the HiFi-GAN-based adversarial/feature-matching losses, the WavLM adversarial loss, and the auxiliary duration/variance prediction losses as in VITS. All training hyperparameters mirror the VITS settings: AdamW with $\beta_1 = 0.8$, $\beta_2 = 0.99$, weight decay 0.01, learning rate $2 \times 10^{-4}$, and exponential decay by a factor of 0.999875 per epoch.
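The least-squares objectives above can be sketched with scalar discriminator scores; `d_loss` and `g_loss` below are toy stand-ins for the adversarial terms, not the actual WavLM head or training code.

```python
# Least-squares GAN objectives of the form used with the WavLM discriminator
# above, sketched on scalar scores. Toy values, not real model outputs.
import numpy as np

def d_loss(real_scores, fake_scores):
    """Discriminator: push scores on real audio to 1, on fakes to 0."""
    return np.mean((real_scores - 1.0) ** 2) + np.mean(fake_scores ** 2)

def g_loss(fake_scores):
    """Generator: push the discriminator's scores on fakes toward 1."""
    return np.mean((fake_scores - 1.0) ** 2)

real = np.array([0.9, 1.1, 1.0])   # scores on real audio
fake = np.array([0.2, 0.1, 0.0])   # scores on generated audio
print(round(float(d_loss(real, fake)), 4))  # → 0.0233
print(round(float(g_loss(fake)), 4))        # → 0.8167
```

Because the WavLM backbone is frozen, only the lightweight head contributes trainable discriminator parameters, and the generator's inference cost is unaffected.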

3. Computational Efficiency and Compression

FLY-TTS achieves substantial gains in CPU inference speed and model size:

Model          #Params (M)   RTF-CPU   Speedup vs. VITS-base   Compression ratio
VITS-base      28.11         0.1221    1.0×                    1.0×
FLY-TTS        17.89         0.0139    8.8×                    1.6×
Mini FLY-TTS   10.92         0.0127    9.6×                    2.6×

RTF is defined as the ratio of wall-clock synthesis time to output audio duration, measured on Intel Core i9-10920X @ 3.5 GHz (no further optimizations). FLY-TTS eliminates transposed-convolution upsampling by shifting the majority of computation to fast FFT/iFFT routines. Grouped parameter sharing enables 36%–61% reduction in model size with negligible impact on synthesized speech quality.
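The RTF definition and the reported speedup can be checked directly from the table's numbers; `rtf` below is a trivial helper for illustration, not a function from the paper's code.

```python
# RTF = synthesis wall-clock time / output audio duration. The speedup is
# the ratio of the two reported RTF values from the table above.

def rtf(synthesis_seconds, audio_seconds):
    return synthesis_seconds / audio_seconds

rtf_vits, rtf_fly = 0.1221, 0.0139
speedup = rtf_vits / rtf_fly
print(round(speedup, 1))  # 8.8, matching the reported speedup

# e.g. an RTF of 0.0139 means 10 s of audio is synthesized in ~0.139 s.
print(abs(rtf(0.139, 10.0) - rtf_fly) < 1e-12)  # True
```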

4. Quality Evaluation and Ablation Analysis

Objective and subjective performance is established through standard metrics:

Model           #Params   RTF-CPU   MCD    WER (%)   MOS ± 95% CI
Ground truth    –         –         –      1.56      4.21 ± 0.10
VITS-base       28.11 M   0.1221    5.49   1.71      4.15 ± 0.09
MB-iSTFT-base   27.49 M   0.0274    5.57   1.89      4.08 ± 0.11
FLY-TTS         17.89 M   0.0139    5.56   1.77      4.12 ± 0.09
Mini FLY-TTS    10.92 M   0.0127    5.63   2.09      4.05 ± 0.09

FLY-TTS matches or narrowly trails VITS on MCD and WER, while maintaining a comparable MOS (4.12 ± 0.09 vs. 4.15 ± 0.09), indicating minimal loss in perceptual quality. Mini FLY-TTS (further compressed) shows only a minor additional MOS reduction.

Ablation studies show that replacing the ConvNeXt + iSTFT decoder with a multi-band upsampler doubles the RTF and slightly reduces MOS (from 4.12 ± 0.09 to 4.01 ± 0.11). Removing the WavLM discriminator results in a MOS drop of ~0.14, substantiating its impact on output quality.

5. Distinctive Contributions and Methodological Insights

FLY-TTS introduces several impactful design elements:

  • ConvNeXt + iSTFT Decoder: Removes the dependency on computationally heavy transposed-convolution upsamplers, shifting decoding computation to efficient FFT/iFFT operations and enabling an 8.8× CPU speedup.
  • Grouped Parameter Sharing: Empirical results show that parameter sharing in contiguous Transformer and flow-step groups is effective, supporting compression rates of up to 61% with marginal performance loss.
  • WavLM-Based Discriminator: Leveraging a large, fixed self-supervised speech representation via WavLM (with a lightweight head) yields adversarial gradients that significantly enhance speech naturalness, without impacting inference cost.

6. Limitations and Potential Research Directions

FLY-TTS is evaluated exclusively in a single-speaker setting (LJSpeech). Extending grouped parameter sharing, ConvNeXt+iSTFT decoding, and WavLM-guided adversarial training to multi-speaker, cross-lingual, or style-conditioned scenarios remains unexplored. Further investigation into more aggressive parameter tying or lighter self-supervised discriminators could provide additional efficiency gains and broader applicability. A plausible implication is that these design elements are orthogonal and can be combined with other lightweight architectures in TTS research (Guo et al., 2024).
