FLY-TTS: Efficient End-to-End TTS
- FLY-TTS is a fast, lightweight end-to-end TTS system that achieves an 8.8× CPU speedup and 1.6× model compression compared to VITS while maintaining naturalness.
- It employs a ConvNeXt-based decoder with iSTFT for efficient waveform reconstruction and uses grouped parameter sharing in both the text encoder and flow model.
- Incorporating a pre-trained WavLM discriminator with adversarial training, FLY-TTS delivers competitive quality metrics such as MOS, MCD, and WER.
FLY-TTS is a fast, lightweight, and high-quality end-to-end text-to-speech (TTS) system that significantly improves computational efficiency and model compression without sacrificing naturalness or synthesis quality. FLY-TTS is architecturally based on the conditional variational autoencoder (VAE) plus flow framework of VITS, incorporating a ConvNeXt block–driven decoder that predicts Fourier coefficients and leverages the inverse short-time Fourier transform (iSTFT) for waveform reconstruction. The system introduces grouped parameter sharing in the text encoder and flow model, and utilizes adversarial loss from a frozen, large-scale pre-trained WavLM model to boost output quality. FLY-TTS achieves a real-time factor (RTF) of 0.0139 on Intel Core i9 CPUs, representing an 8.8× speedup and 1.6× model compression in comparison to the VITS baseline, while maintaining comparable mean opinion scores (MOS) and objective metrics (Guo et al., 2024).
1. System Architecture and Core Components
FLY-TTS retains the core VITS framework (a conditional VAE, a flow-based prior, and a neural decoder) while introducing extensive architectural modifications. The model comprises the following components:
- Prior Encoder: Maps the input phoneme sequence c to a latent code z. Consists of a text encoder (Transformer layers with grouped parameter sharing) and a normalizing flow f.
- Decoder: Converts the latent z to the target waveform using stacked ConvNeXt blocks followed by a fast iSTFT operation.
- Discriminators: Includes the original HiFi-GAN–style multi-period and multi-scale discriminator set as well as a frozen, pre-trained WavLM model with a lightweight prediction head, providing adversarial feedback during training.
Grouped Parameter Sharing: The Transformer layers in the text encoder are partitioned into groups of consecutive layers, with parameters shared within each group. The steps of the normalizing flow are likewise partitioned into groups, sharing parameters only in the WaveNet-based projection sub-modules within each group.
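The effect of grouped parameter sharing on model size can be illustrated with a minimal Python sketch (the `Layer` class and the layer/group sizes below are hypothetical stand-ins, not the paper's actual configuration):

```python
class Layer:
    """Stand-in for a Transformer layer holding `size` parameters."""
    def __init__(self, size):
        self.params = [0.0] * size

def build_grouped_stack(n_layers, group_size, layer_size):
    """Return n_layers layer slots; slots within a group reuse one Layer object."""
    shared = [Layer(layer_size) for _ in range(n_layers // group_size)]
    return [shared[i // group_size] for i in range(n_layers)]

def unique_param_count(stack):
    # de-duplicate shared modules by identity before counting parameters
    return sum(len(l.params) for l in {id(l): l for l in stack}.values())

stack = build_grouped_stack(n_layers=6, group_size=2, layer_size=1000)
assert len(stack) == 6                    # still 6 layer applications at inference
assert unique_param_count(stack) == 3000  # but only 3 distinct parameter sets
```

The depth of the network (and hence its compute cost) is unchanged; only the stored parameter count shrinks by the group size.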
ConvNeXt-Based Decoder + iSTFT:
Waveform synthesis proceeds by having the decoder generate frame-wise Fourier coefficients:
- A latent sample z is mapped to a frame-level hidden sequence h.
- h is projected to match the FFT bin size (n_fft / 2 + 1).
- Stacked ConvNeXt blocks produce an amplitude matrix A and a phase matrix P.
- The time-domain signal is reconstructed via iSTFT:
x = iSTFT(A ⊙ e^{jP}; hop, w, n_fft)
where hop is the hop size, w is the window function, and n_fft is the FFT length; batched FFT/iFFT is used for computational efficiency.
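The amplitude/phase recombination at the heart of this decoder can be illustrated on a single frame with a pure-Python inverse real DFT (a toy stand-in for the batched iSTFT; the frame size and test signal are illustrative):

```python
import cmath, math

def irdft(spec, n):
    """Inverse real DFT: n//2 + 1 complex bins -> n real samples."""
    out = []
    for t in range(n):
        val = 0.0
        for k, c in enumerate(spec):
            w = 2 if 0 < k < n // 2 else 1  # conjugate-symmetric bins count twice
            val += w * (c * cmath.exp(2j * math.pi * k * t / n)).real
        out.append(val / n)
    return out

n_fft = 8
frame = [math.sin(2 * math.pi * t / n_fft) for t in range(n_fft)]  # one-cycle sine
# forward real DFT of the frame
spec = [sum(frame[t] * cmath.exp(-2j * math.pi * k * t / n_fft)
            for t in range(n_fft))
        for k in range(n_fft // 2 + 1)]
# the decoder predicts amplitude A and phase P; recombine as A * e^{jP}
amp = [abs(c) for c in spec]
phase = [cmath.phase(c) for c in spec]
recon = irdft([a * cmath.exp(1j * p) for a, p in zip(amp, phase)], n_fft)
assert all(abs(r - f) < 1e-9 for r, f in zip(recon, frame))
```

In the real system this per-frame inversion is replaced by a fast batched iFFT plus windowed overlap-add across frames, which is where the CPU speedup comes from.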
2. Training Objectives and Optimization
VAE + Flow Variational Training
FLY-TTS maximizes the evidence lower bound (ELBO) as in VITS:
log p(x | c) ≥ E_{q(z|x)}[ log p(x | z) ] − KL( q(z | x) ‖ p(z | c) )
The likelihood term log p(x | z) is approximated by an L1 mel-spectrogram reconstruction loss, combined with adversarial and feature-matching losses from the HiFi-GAN discriminators.
WavLM-Based Adversarial Training
The system incorporates a frozen pre-trained WavLM encoder, to which a lightweight CNN prediction head is attached to form an additional discriminator. The adversarial losses are least-squares GAN objectives:
L_adv(D) = E_{(x,z)}[ (D(x) − 1)² + D(G(z))² ],  L_adv(G) = E_z[ (D(G(z)) − 1)² ]
The total generator loss aggregates the VAE + flow terms, the HiFi-GAN adversarial/feature-matching losses, the WavLM adversarial loss, and the auxiliary duration/variance prediction losses, as in VITS. All training hyperparameters mirror the VITS settings: AdamW with β₁ = 0.8, β₂ = 0.99, weight decay 0.01, learning rate 2 × 10⁻⁴, and an exponential learning-rate decay of 0.999875 per epoch.
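The least-squares GAN objectives can be sketched numerically as follows (the scalar scores here are illustrative; the real discriminators operate on waveforms and WavLM features):

```python
def lsgan_d_loss(d_real, d_fake):
    """Discriminator pushes real scores toward 1 and fake scores toward 0."""
    return (sum((r - 1.0) ** 2 for r in d_real) / len(d_real)
            + sum(f ** 2 for f in d_fake) / len(d_fake))

def lsgan_g_loss(d_fake):
    """Generator pushes the discriminator's fake scores toward 1."""
    return sum((f - 1.0) ** 2 for f in d_fake) / len(d_fake)

d_real = [0.9, 0.8]  # scores for real waveforms
d_fake = [0.2, 0.1]  # scores for synthesized waveforms
print(lsgan_d_loss(d_real, d_fake), lsgan_g_loss(d_fake))
# a perfect outcome drives both losses to zero
assert lsgan_d_loss([1.0], [0.0]) == 0.0
assert lsgan_g_loss([1.0]) == 0.0
```

Because the WavLM encoder is frozen, only the small prediction head receives gradient updates on the discriminator side.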
3. Computational Efficiency and Compression
FLY-TTS achieves substantial gains in CPU inference speed and model size:
| Model | #Params (M) | RTF-CPU | Speedup vs. VITS-base | Compression Ratio |
|---|---|---|---|---|
| VITS-base | 28.11 | 0.1221 | 1.0× | 1.0× |
| FLY-TTS | 17.89 | 0.0139 | 8.8× | 1.6× |
| Mini FLY-TTS | 10.92 | 0.0127 | 9.6× | 2.6× |
RTF is defined as the ratio of wall-clock synthesis time to output audio duration, measured on an Intel Core i9-10920X @ 3.5 GHz (no further optimizations). FLY-TTS eliminates transposed-convolution upsampling by shifting the majority of computation to fast FFT/iFFT routines. Grouped parameter sharing enables a 36%–61% reduction in model size with negligible impact on synthesized speech quality.
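The speedup and compression figures in the table follow directly from the reported RTF values and parameter counts; a quick check:

```python
# RTF (lower is faster) and parameter counts (millions) from the table above
rtf = {"VITS-base": 0.1221, "FLY-TTS": 0.0139, "Mini FLY-TTS": 0.0127}
params_m = {"VITS-base": 28.11, "FLY-TTS": 17.89, "Mini FLY-TTS": 10.92}

speedup = {m: rtf["VITS-base"] / v for m, v in rtf.items()}
compression = {m: params_m["VITS-base"] / p for m, p in params_m.items()}

assert round(speedup["FLY-TTS"], 1) == 8.8
assert round(compression["FLY-TTS"], 1) == 1.6
assert round(speedup["Mini FLY-TTS"], 1) == 9.6
assert round(compression["Mini FLY-TTS"], 1) == 2.6
# parameter reduction range quoted in the text: 36%-61%
assert round(100 * (1 - params_m["FLY-TTS"] / params_m["VITS-base"])) == 36
assert round(100 * (1 - params_m["Mini FLY-TTS"] / params_m["VITS-base"])) == 61
```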
4. Quality Evaluation and Ablation Analysis
Objective and subjective performance is established through standard metrics:
| Model | #Params | RTF-CPU | MCD | WER (%) | MOS ±95% CI |
|---|---|---|---|---|---|
| Ground truth | – | – | – | 1.56 | 4.21 ± 0.10 |
| VITS-base | 28.11 M | 0.1221 | 5.49 | 1.71 | 4.15 ± 0.09 |
| MB-iSTFT-base | 27.49 M | 0.0274 | 5.57 | 1.89 | 4.08 ± 0.11 |
| FLY-TTS | 17.89 M | 0.0139 | 5.56 | 1.77 | 4.12 ± 0.09 |
| Mini FLY-TTS | 10.92 M | 0.0127 | 5.63 | 2.09 | 4.05 ± 0.09 |
- MCD: Mel-cepstral distortion.
- WER: Word error rate of ASR transcriptions of the synthesized speech.
- MOS: Mean opinion score (naturalness).
FLY-TTS matches or narrowly trails VITS-base in MCD and WER, while maintaining a comparable MOS (4.12 ± 0.09 vs. 4.15 ± 0.09), indicating minimal loss in perceptual quality. Mini FLY-TTS (further compressed) shows only a minor MOS reduction.
Ablation studies show that replacing the ConvNeXt + iSTFT decoder with a multi-band upsampler doubles the RTF and slightly reduces MOS (from 4.12 ± 0.09 to 4.01 ± 0.11). Removing the WavLM discriminator results in a MOS drop of ~0.14, substantiating its impact on output quality.
5. Distinctive Contributions and Methodological Insights
FLY-TTS introduces several impactful design elements:
- ConvNeXt + iSTFT Decoder: Removes the dependency on computationally heavy transposed-convolution upsamplers, shifting decoding computation to efficient FFT/iFFT operations and enabling an 8.8× CPU speedup.
- Grouped Parameter Sharing: Empirical results show that parameter sharing in contiguous Transformer and flow-step groups is effective, supporting compression rates of up to 61% with marginal performance loss.
- WavLM-Based Discriminator: Leveraging a large, fixed self-supervised speech representation via WavLM (with a lightweight head) yields adversarial gradients that significantly enhance speech naturalness, without impacting inference cost.
6. Limitations and Potential Research Directions
FLY-TTS is evaluated exclusively in a single-speaker setting (LJSpeech). Extending grouped parameter sharing, ConvNeXt+iSTFT decoding, and WavLM-guided adversarial training to multi-speaker, cross-lingual, or style-conditioned scenarios remains unexplored. Further investigation into more aggressive parameter tying or lighter self-supervised discriminators could provide additional efficiency gains and broader applicability. A plausible implication is that these design elements are orthogonal and can be combined with other lightweight architectures in TTS research (Guo et al., 2024).