- The paper achieves roughly 4x faster synthesis than VITS, in part by replacing a computationally expensive portion of the decoder with an inverse short-time Fourier transform (iSTFT).
- The paper employs a multi-band generation approach that enables parallel waveform processing; a smaller variant of the model outperforms Nix-TTS in both speed and quality.
- The paper maintains human-level naturalness in speech synthesis, making it highly suitable for on-device applications with limited resources.
Lightweight and High-Fidelity End-to-End Text-to-Speech
The paper presents a lightweight end-to-end text-to-speech (TTS) model based on multi-band generation and the inverse short-time Fourier transform (iSTFT). It builds on VITS, a high-quality end-to-end TTS framework, and modifies the architecture to improve inference speed without compromising synthesis quality.
Key Contributions
The primary contribution of the paper is a significantly faster TTS model achieved through two major modifications:
- Inverse STFT Integration: The most computationally expensive portion of the VITS decoder is replaced with an iSTFT operation. This simplifies the frequency-to-time domain transformation and cuts processing time.
- Multi-Band Generation: The waveform is generated as several sub-band signals, which are combined using either fixed or trainable synthesis filters. Because the sub-bands can be produced in parallel, this reduces inference time further. The idea is borrowed from existing vocoders, but here the whole model is still optimized end to end, unlike conventional pipelines that train the acoustic model and the vocoder separately.
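To make the iSTFT step concrete, here is a minimal NumPy sketch of overlap-add inverse STFT applied to a magnitude and a phase spectrogram, the kind of output an iSTFT-based decoder would predict. It is an illustration only, not the paper's implementation: the function names `stft` and `istft`, the FFT size of 16, the hop of 4, and the Hann window are all assumptions chosen for a small example.

```python
import numpy as np

def stft(signal, n_fft=16, hop=4):
    """Forward STFT, used here only to build a test spectrogram."""
    window = np.hanning(n_fft + 1)[:-1]  # periodic Hann window
    n_frames = 1 + (len(signal) - n_fft) // hop
    frames = np.stack(
        [signal[t * hop:t * hop + n_fft] * window for t in range(n_frames)],
        axis=1,
    )
    return np.fft.rfft(frames, axis=0)  # shape: (n_fft // 2 + 1, n_frames)

def istft(magnitude, phase, n_fft=16, hop=4):
    """Overlap-add inverse STFT from magnitude and phase spectrograms."""
    spec = magnitude * np.exp(1j * phase)
    window = np.hanning(n_fft + 1)[:-1]
    # One inverse FFT per frame, then apply the synthesis window.
    frames = np.fft.irfft(spec, n=n_fft, axis=0) * window[:, None]
    n_frames = spec.shape[1]
    length = n_fft + hop * (n_frames - 1)
    out = np.zeros(length)
    norm = np.zeros(length)
    for t in range(n_frames):
        start = t * hop
        out[start:start + n_fft] += frames[:, t]
        norm[start:start + n_fft] += window ** 2
    # Normalize by the summed squared window (guarding against zeros).
    return out / np.maximum(norm, 1e-8)

# Round trip on a toy signal (hypothetical input, not real speech).
x = np.sin(2 * np.pi * 0.05 * np.arange(64))
spec = stft(x)
y = istft(np.abs(spec), np.angle(spec))
```

The inverse transform is a fixed, cheap operation (one small inverse FFT per frame plus overlap-add), which is why moving the final stage of waveform generation into it can save computation compared with producing every sample through learned upsampling layers. In this sketch the very first sample is not recovered because the periodic Hann window is zero there.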
Numerical Results
The experimental results affirm the model's capabilities:
- Real-Time Factor (RTF): The proposed model achieves an RTF of 0.066 on an Intel Core i7 CPU, which is 4.1 times faster than VITS.
- Naturalness: Speech synthesized by the proposed model is as natural as that of VITS, as measured by mean opinion scores (MOS) collected from listeners.
- Comparison with Nix-TTS: A smaller version of the proposed model surpasses Nix-TTS in both speed and quality, obtaining an RTF of 0.028 (versus 0.062 for Nix-TTS) and a superior MOS of 4.43 compared to 3.69.
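The speed figures above can be related with a few lines of arithmetic. The sketch below assumes the usual definition of RTF as synthesis time divided by the duration of the generated audio (lower is faster); the helper names `real_time_factor` and `speedup` are hypothetical, and the VITS value is only what the two reported numbers imply, not a figure quoted in this summary.

```python
def real_time_factor(synthesis_seconds, audio_seconds):
    """RTF = time spent synthesizing / duration of the audio produced."""
    return synthesis_seconds / audio_seconds

def speedup(baseline_rtf, model_rtf):
    """How many times faster the model is than the baseline."""
    return baseline_rtf / model_rtf

# An RTF of 0.066 means 10 s of audio takes about 0.66 s to synthesize.
example_rtf = real_time_factor(0.66, 10.0)

# VITS RTF implied by "0.066, 4.1 times faster than VITS": about 0.27.
implied_vits_rtf = 0.066 * 4.1

# Smaller model versus Nix-TTS: 0.062 / 0.028, about 2.2x faster.
mini_vs_nix = speedup(0.062, 0.028)
```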
Methodological Insights
The proposed method retains the end-to-end architecture of VITS while targeting its main computational bottleneck, the decoder. The decoder generates samples through iSTFT and multi-band processing, which reduces the amount of computation per output sample while preserving synthesis quality.
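As a rough illustration of the multi-band step, the sketch below combines several low-rate sub-band signals into one full-rate waveform: each sub-band is upsampled by zero insertion, passed through its synthesis filter, and the results are summed. This is a generic filter-bank synthesis under stated assumptions, not the paper's code. The function name `synthesize_full_band` is hypothetical, and the one-tap identity filters in the demo are placeholders standing in for either fixed or trained synthesis filters.

```python
import numpy as np

def synthesize_full_band(subbands, filters):
    """Combine sub-band signals into a full-band waveform.

    subbands: array of shape (n_bands, n_samples), each at 1/n_bands
              of the output sampling rate.
    filters:  array of shape (n_bands, n_taps), one synthesis filter
              per band (fixed or learned; placeholders in the demo).
    """
    n_bands, n_samples = subbands.shape
    # Upsample by zero insertion; scale by n_bands to compensate gain.
    upsampled = np.zeros((n_bands, n_samples * n_bands))
    upsampled[:, ::n_bands] = subbands * n_bands
    full = np.zeros(n_samples * n_bands)
    for band in range(n_bands):
        full += np.convolve(upsampled[band], filters[band], mode="same")
    return full

# Demo with 2 bands of 3 samples each and placeholder identity filters.
demo_subbands = np.ones((2, 3))
demo_filters = np.ones((2, 1))
demo_wave = synthesize_full_band(demo_subbands, demo_filters)
```

Because each sub-band runs at a fraction of the output sampling rate, the network only has to produce short sequences in parallel, and the comparatively cheap filtering step restores the full rate.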
Implications and Future Directions
This research matters most for on-device speech synthesis, where computational resources are constrained. A TTS model that achieves human-level naturalness at a fraction of the processing time makes practical deployment in such settings more viable.
More broadly, the work shows that end-to-end optimization can be combined with architectural changes such as multi-band generation and iSTFT to remove the speed limitations of earlier TTS systems.
Future research could extend this approach to multi-speaker models, which would broaden the application scope of lightweight TTS systems, and could further optimize the synthesis filters to improve adaptability and efficiency across diverse linguistic datasets.
In summary, the paper makes an end-to-end TTS model substantially faster without sacrificing naturalness, a step toward more resource-efficient speech synthesis.