FNH-TTS System: Optimized NAR TTS
- The paper introduces the FNH-TTS system, integrating a Mixture-of-Experts duration predictor and an enhanced VOCOS vocoder to improve prosody and synthesis quality.
- It employs a Switch-Transformer gating mechanism and multi-scale discriminators to accurately model speaker-dependent prosody and suppress synthesis artifacts.
- Empirical results show higher MOS scores, improved phoneme duration accuracy, and faster inference on datasets like LJSpeech, VCTK, and LibriTTS.
The FNH-TTS system is a non-autoregressive (NAR) text-to-speech model designed to produce fast, natural, and human-like synthetic speech by advancing prosody modeling and addressing synthesis artifacts prevalent in prior architectures. It integrates a novel mixture-of-experts duration predictor and an enhanced vocoder with multi-scale discriminators into the VITS backbone, and sets new benchmarks for synthesis quality, phoneme duration accuracy, and inference speed across multiple standard datasets, including LJSpeech, VCTK, and LibriTTS (Meng et al., 16 Aug 2025).
1. System Architecture and Module Integration
FNH-TTS builds on the VITS end-to-end generative framework, preserving components central to alignment and latent variable modeling—specifically, the Text Encoder, Speaker Encoder, Posterior Encoder, and normalizing flow. Two core innovations differentiate the system:
- Mixture of Experts Duration Predictor (MoE-DP): A Switch-Transformer-based module for context- and speaker-dependent prosody modeling.
- VOCOS Vocoder: An adversarially-trained neural vocoder based on ConvNeXt blocks and equipped with two new discriminators for increased synthesis fidelity and artifact suppression.
This modular integration retains VITS’s benefits in monotonic alignment and sampling-based inference while providing new mechanisms for capturing prosodic diversity and spectral harmony.
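The composition described above can be sketched structurally. This is a hypothetical skeleton (names and signatures are illustrative, not the authors' code) that models each module as an opaque callable purely to show the data flow:

```python
from dataclasses import dataclass
from typing import Callable, List

Tensor = List[float]  # stand-in tensor type for the sketch

@dataclass
class FNHTTS:
    # Retained from the VITS backbone:
    text_encoder: Callable[[str], Tensor]
    speaker_encoder: Callable[[str], Tensor]
    # The two innovations described above:
    moe_duration_predictor: Callable[[Tensor, Tensor], Tensor]  # Switch-Transformer MoE
    vocos_vocoder: Callable[[Tensor, Tensor, Tensor], Tensor]   # ConvNeXt + iSTFT head

    def synthesize(self, text: str, speaker: str) -> Tensor:
        h = self.text_encoder(text)
        s = self.speaker_encoder(speaker)
        durations = self.moe_duration_predictor(h, s)
        # Posterior encoder / normalizing flow omitted; alignment expands h by durations.
        return self.vocos_vocoder(h, s, durations)

# Demo wiring with trivial stand-ins:
demo = FNHTTS(
    text_encoder=lambda text: [0.0],
    speaker_encoder=lambda spk: [0.0],
    moe_duration_predictor=lambda h, s: [1.0],
    vocos_vocoder=lambda h, s, d: [0.0] * 4,
)
print(len(demo.synthesize("hello world", "speaker_0")))  # 4
```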
2. Advanced Prosody Modeling with MoE-DP
FNH-TTS introduces the Mixture of Experts Duration Predictor to address the under-specification and bias problems of conventional duration predictors. The process is as follows:
- Text encoding and speaker embedding are concatenated and processed by a lightweight 1D convolutional block.
- The feature map is routed, via a Switch-Transformer-based gating mechanism, to a dynamically selected top-$k$ subset of expert submodules $\{E_i\}$. Each expert specializes in modeling a particular prosodic context.
- The output duration prediction is computed as:

$$\hat{d} = \sum_{i \in S} p_i \, E_i(x)$$

where $p_i$ is the routing probability for expert $i$, and $S$ is the selected expert set.
To enforce balanced usage of all experts, FNH-TTS applies an auxiliary load-balancing loss:

$$\mathcal{L}_{\text{aux}} = \alpha \, N \sum_{i=1}^{N} f_i \, P_i$$

where $f_i$ is the fraction of samples routed to expert $i$, $P_i$ is the normalized expert selection probability, $N$ is the number of experts, and $\alpha$ is a scaling factor.
Both the main Monotonic Alignment Search (MAS) loss and the auxiliary load-balancing loss comprise the total duration prediction loss:

$$\mathcal{L}_{\text{dur}} = \mathcal{L}_{\text{MAS}} + \mathcal{L}_{\text{aux}}$$
This MoE-based modeling enables FNH-TTS to represent speaker- and context-dependent prosodic variation with significantly higher fidelity than conventional duration predictors.
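A minimal NumPy sketch of this routing scheme, assuming a top-1 Switch-style router with linear experts (the dimensions and the $\alpha$ weight are illustrative, not the paper's values):

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

# Illustrative sizes: T phoneme positions, D feature dim, N experts.
T, D, N = 6, 8, 4
h = rng.normal(size=(T, D))                          # text encoding + speaker embedding (fused upstream)
W_gate = rng.normal(size=(D, N))                     # router weights
experts = [rng.normal(size=(D,)) for _ in range(N)]  # each expert: a linear duration head

# Switch-style routing: pick the top-1 expert per position, weight by its gate probability.
p = softmax(h @ W_gate)                              # (T, N) routing probabilities
top1 = p.argmax(axis=-1)                             # selected expert per position
dur = np.array([p[t, top1[t]] * (h[t] @ experts[top1[t]]) for t in range(T)])

# Auxiliary load-balancing loss: L_aux = alpha * N * sum_i f_i * P_i
f = np.bincount(top1, minlength=N) / T               # fraction of positions routed to expert i
P = p.mean(axis=0)                                   # mean routing probability of expert i
alpha = 0.01                                         # illustrative scaling factor
L_aux = alpha * N * float(np.sum(f * P))
print(dur.shape, L_aux > 0)                          # (6,) True
```

The load-balancing term is minimized when routing is uniform across experts, which is what discourages the router from collapsing onto a single expert.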
3. Neural Vocoder and Multi-Scale Discrimination
The VOCOS vocoder is designed to resolve spectral artifacts commonly introduced by NAR architectures, especially when richer prosody is modeled. Key aspects include:
- Backbone: Constructed from 8 ConvNeXt blocks, the vocoder receives $z$ and $g$ (the latent variable and speaker embedding) as input.
- Signal Reconstruction: The final waveform is produced via:

$$\hat{y} = \mathrm{iSTFT}\big(\mathrm{VOCOS}(z, g)\big)$$

which applies the inverse short-time Fourier transform to the vocoder output.
- Collaborative Multi-Band Discriminator (CMoBD): Uses multi-scale discriminators to analyze waveform coherence at multiple resolutions (global and local).
- Sub-Band Discriminator (SBD): Applies PQMF to split the signal into frequency sub-bands, using dilated convolutions per sub-band for detailed spectral analysis and artifact reduction.
These discriminators are trained adversarially alongside the generator, with adversarial losses ($\mathcal{L}_{\text{adv}}^{G}$, $\mathcal{L}_{\text{adv}}^{D}$) jointly optimized to enforce waveform realism.
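Two of the mechanisms above can be illustrated with a small NumPy/SciPy sketch: the iSTFT reconstruction step, faking the vocoder's complex-spectrogram output with the STFT of a test tone (using the paper's FFT size 1024 and hop length 256), and a sub-band split of the kind the SBD consumes, with a simple FFT band mask standing in for PQMF:

```python
import numpy as np
from scipy.signal import stft, istft

# --- Signal reconstruction via iSTFT (hop 256, FFT size 1024, as in the paper) ---
fs, n_fft, hop = 22050, 1024, 256
t = np.arange(fs) / fs
x = 0.5 * np.sin(2 * np.pi * 220.0 * t)              # stand-in for real speech

_, _, Z = stft(x, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
mag, phase = np.abs(Z), np.angle(Z)                   # what a VOCOS-style head would predict
_, x_hat = istft(mag * np.exp(1j * phase), fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
x_hat = x_hat[: len(x)]                               # trim boundary padding

# --- Sub-band split for SBD-style analysis (FFT masking as a PQMF stand-in) ---
def band_split(sig, n_bands):
    """Split a signal into equal-width frequency bands that sum back to it."""
    X = np.fft.rfft(sig)
    edges = np.linspace(0, len(X), n_bands + 1, dtype=int)
    bands = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = np.zeros_like(X)
        mask[lo:hi] = X[lo:hi]
        bands.append(np.fft.irfft(mask, n=len(sig)))
    return bands

sub_bands = band_split(x, n_bands=4)
print(np.max(np.abs(x - x_hat)) < 1e-6, np.allclose(sum(sub_bands), x))  # True True
```

The round trip is near-exact because the Hann window at 75% overlap satisfies the COLA condition, and the band split is lossless by construction, so each sub-band discriminator sees a faithful slice of the spectrum.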
4. Training Objectives and Optimization
FNH-TTS is optimized using a composite loss:
- $\mathcal{L}_{\text{recon}}$ and $\mathcal{L}_{\text{KL}}$: Reconstruction and KL divergence losses inherited from VITS, ensuring latent variables encode meaningful speech variation.
- $\mathcal{L}_{\text{dur}}$: Joint duration loss as above.
- $\mathcal{L}_{\text{adv}}^{G}$, $\mathcal{L}_{\text{adv}}^{D}$: GAN-based adversarial losses for the VOCOS vocoder.
Hyperparameters include AdamW optimization with a decaying learning-rate schedule, batch size 24 on 4 NVIDIA RTX 3090 GPUs, hop length 256, and FFT size 1024.
5. Empirical Results and Benchmarking
FNH-TTS demonstrates statistically significant improvements in multiple quantitative and qualitative metrics:
| Dataset | Metric | FNH-TTS | Best Baseline (e.g., VITS, FastSpeech2, StyleTTS2) |
|---|---|---|---|
| LJSpeech | MOS | 4.48 | Lower (e.g., VITS/HiFiGAN) |
| VCTK | MOS | 4.63 | Lower |
| Libri460 | Phoneme Duration Accuracy (%) | 67.07 | Markedly lower |
- Synthesis Speed: VOCOS vocoder achieves faster real-time factors (RTF) on both CPU and GPU versus HiFiGAN-based systems.
- Phoneme Duration Prediction: FNH-TTS matches the ground truth duration distribution for both single- and multi-speaker setups, avoiding the peaked or over-smoothed patterns typical of older models. On multispeaker data, FNH-TTS reliably separates speaker-specific prosodic profiles.
- Robustness: Integrating MoE-DP with superior vocoding is critical; richer prosodic predictions degrade synthesis quality when paired with baseline vocoders, but the enhanced adversarial training rectifies this.
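The distribution-matching behavior can be illustrated on synthetic data (purely hypothetical durations, not the paper's): a predictor that collapses to a single modal duration lies much farther, in 1-D Wasserstein distance, from a log-normal ground-truth duration distribution than one matching its spread:

```python
import numpy as np

rng = np.random.default_rng(0)
gt = rng.lognormal(mean=2.0, sigma=0.5, size=5000)      # ground-truth durations (frames)
peaked = np.full(5000, np.exp(2.0))                     # over-smoothed predictor: one mode
varied = rng.lognormal(mean=2.0, sigma=0.5, size=5000)  # distribution-matching predictor

def wasserstein_1d(a, b):
    """W1 distance between equal-size empirical distributions via sorted samples."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))

# The peaked predictor matches the mean yet misses the distribution's shape.
print(wasserstein_1d(gt, peaked) > wasserstein_1d(gt, varied))  # True
```

This is the sense in which matching "average prosody" is insufficient: both predictors here are centered correctly, but only the second reproduces the spread of natural durations.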
6. Visualization and Interpretation of Prosodic Patterns
Prosody visualization in FNH-TTS reveals:
- For single speakers, predicted duration histograms align closely with those from ground truth audio, reflecting the model’s capacity to learn the genuine range of natural prosodic variation.
- In multispeaker environments, the model's predictions exhibit characteristics unique to each speaker, demonstrating that the mixture-of-experts mechanism conditions effectively on speaker embeddings and context.
Interpretation of these findings indicates that FNH-TTS not only matches average prosody, but also captures higher-order distributional structure necessary for human-likeness in synthetic speech.
7. Implications, Limitations, and Future Directions
FNH-TTS marks a substantive advance in NAR text-to-speech research by jointly addressing prosody coverage and vocoding artifact limitations endemic to prior models:
- The MoE-DP equips the system to model prosodic variation that is contextually relevant and speaker-specific, resolving the oversmoothing and mode-collapse pitfalls of conventional duration prediction.
- The VOCOS vocoder with multi-scale and sub-band discriminators delivers synthesis quality commensurate with the increased prosodic complexity, mitigating waveform artifacts and improving inference speed.
- Empirical validation across datasets and speaker settings corroborates FNH-TTS’s improvement in naturalness and phoneme alignment.
A plausible implication is that further exploration of expert routing strategies, joint learning with expressive prosody control, and adaptation to code-switched or cross-lingual data could further enhance system flexibility and robustness. Additionally, expanding discriminative adversarial architectures may allow even finer spectral and temporal control.
FNH-TTS exemplifies the integration of advanced duration modeling and adversarial vocoder design, aligning non-autoregressive TTS with human prosodic standards and providing a versatile basis for high-quality speech synthesis research and deployment.