
Lightweight and High-Fidelity End-to-End Text-to-Speech with Multi-Band Generation and Inverse Short-Time Fourier Transform (2210.15975v2)

Published 28 Oct 2022 in eess.AS, cs.LG, cs.SD, and eess.SP

Abstract: We propose a lightweight end-to-end text-to-speech model using multi-band generation and inverse short-time Fourier transform. Our model is based on VITS, a high-quality end-to-end text-to-speech model, but adopts two changes for more efficient inference: 1) the most computationally expensive component is partially replaced with a simple inverse short-time Fourier transform, and 2) multi-band generation, with fixed or trainable synthesis filters, is used to generate waveforms. Unlike conventional lightweight models, which employ optimization or knowledge distillation separately to train two cascaded components, our method enjoys the full benefits of end-to-end optimization. Experimental results show that our model synthesized speech as natural as that synthesized by VITS, while achieving a real-time factor of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS. Moreover, a smaller version of the model significantly outperformed a lightweight baseline model with respect to both naturalness and inference speed. Code and audio samples are available from https://github.com/MasayaKawamura/MB-iSTFT-VITS.


Summary

  • The paper achieves 4x faster synthesis by integrating iSTFT in the TTS decoder, significantly reducing computational expense.
  • The paper employs a multi-band generation approach that enables parallel waveform processing; a smaller variant of the model outperforms Nix-TTS in both speed and quality.
  • The paper maintains human-level naturalness in speech synthesis, making it highly suitable for on-device applications with limited resources.

Lightweight and High-Fidelity End-to-End Text-to-Speech

The paper presents a lightweight end-to-end text-to-speech (TTS) model built on multi-band generation and the inverse short-time Fourier transform (iSTFT). The work builds on VITS, a high-quality end-to-end TTS framework, and proposes targeted modifications that improve inference speed without compromising synthesis quality.

Key Contributions

The primary contribution of the paper is a significantly faster TTS model achieved through two major modifications:

  1. Inverse STFT Integration: A portion of the VITS decoder, known for its computational expense, is substituted with an iSTFT operation. This change simplifies the frequency-to-time domain transformation, cutting down processing time.
  2. Multi-Band Generation: By employing a multi-band approach, where waveforms are generated using either fixed or trainable synthesis filters, the model utilizes parallel processing effectively. This approach capitalizes on existing vocoder strategies but maintains end-to-end optimization, unlike conventional models that use separate optimization processes for acoustic models and vocoders.
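The first modification hinges on the fact that an iSTFT is far cheaper than a stack of learned upsampling convolutions. As a minimal sketch (not the paper's implementation), the final waveform stage reduces to windowed overlap-add over predicted magnitude and phase frames; the function name and toy frame sizes below are illustrative:

```python
import numpy as np

def istft(magnitude, phase, n_fft=16, hop=8):
    """Reconstruct a waveform from magnitude/phase frames via inverse STFT
    with windowed overlap-add -- the kind of cheap, fixed operation the
    paper substitutes for part of the learned decoder."""
    spec = magnitude * np.exp(1j * phase)      # (n_fft//2 + 1, num_frames)
    window = np.hanning(n_fft)
    num_frames = spec.shape[1]
    out_len = n_fft + hop * (num_frames - 1)
    out = np.zeros(out_len)
    norm = np.zeros(out_len)
    for t in range(num_frames):
        # inverse FFT of one frame, re-windowed for overlap-add
        frame = np.fft.irfft(spec[:, t], n=n_fft) * window
        out[t * hop : t * hop + n_fft] += frame
        norm[t * hop : t * hop + n_fft] += window ** 2
    # normalize by the accumulated squared window to undo the double windowing
    return out / np.maximum(norm, 1e-8)
```

Because the operation is fixed and differentiable, gradients still flow into the network that predicts the magnitude and phase, preserving end-to-end training.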

Numerical Results

The experimental results affirm the model's capabilities:

  • Real-Time Factor (RTF): The proposed model achieves an RTF of 0.066 on an Intel Core i7 CPU, 4.1 times faster than VITS.
  • Naturalness: Speech synthesized by the proposed model is as natural as that of VITS, verified through mean opinion scores (MOS) that reflect listener assessments.
  • Comparison with Nix-TTS: A smaller version of the proposed model surpasses Nix-TTS in both speed and quality, obtaining an RTF of 0.028 (versus 0.062 for Nix-TTS) and a superior MOS of 4.43 compared to 3.69.
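For readers unfamiliar with the metric, the real-time factor is simply synthesis time divided by the duration of the audio produced; an RTF below 1 means faster-than-real-time synthesis. A one-line sketch (function name ours):

```python
def real_time_factor(synthesis_time_s, num_samples, sample_rate):
    """RTF = wall-clock synthesis time / duration of generated audio.
    RTF < 1.0 means the model runs faster than real time."""
    return synthesis_time_s / (num_samples / sample_rate)

# Example: taking 0.066 s to synthesize 1 s of audio gives RTF = 0.066,
# i.e. roughly 15x faster than real time.
```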

Methodological Insights

The proposed methodology retains the end-to-end architecture of VITS while removing its main computational bottleneck, which lies in the decoder. Rather than upsampling latent features all the way to waveform samples with learned convolutions, the model predicts magnitude and phase spectra and applies an iSTFT, while multi-band generation splits the remaining decoding work across parallel sub-band streams that are recombined by fixed or trainable synthesis filters.
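The sub-band recombination step can be sketched as follows. This is a simplified stand-in for the paper's fixed/trainable synthesis filters, not a perfect-reconstruction PQMF: each sub-band (running at 1/N of the output rate) is upsampled by zero-insertion, passed through a cosine-modulated FIR filter, and the bands are summed. All names and filter parameters here are illustrative:

```python
import numpy as np

def synthesize_multiband(subbands, taps=32):
    """Merge N sub-band signals, each at fs/N, into one full-rate waveform.
    Sketch of a fixed synthesis filterbank: zero-stuff, bandpass-filter,
    and sum (not a perfect-reconstruction PQMF)."""
    n_bands, frames = subbands.shape
    n = np.arange(taps)
    # lowpass prototype: windowless sinc with cutoff ~ fs / (4 * n_bands)
    proto = np.sinc((n - taps / 2 + 0.5) / (2 * n_bands)) / (2 * n_bands)
    out = np.zeros(frames * n_bands + taps - 1)
    for k in range(n_bands):
        up = np.zeros(frames * n_bands)
        up[::n_bands] = subbands[k]            # upsample by zero-insertion
        # cosine-modulate the prototype up to band k's centre frequency
        h = proto * np.cos((2 * k + 1) * np.pi / (2 * n_bands)
                           * (n - taps / 2 + 0.5))
        out += np.convolve(up, h)
    return out * n_bands                       # compensate upsampling gain
```

Because each sub-band stream runs at a fraction of the output sample rate, the per-sample cost of the decoder drops roughly in proportion to the number of bands, which is where the parallel-processing gain comes from.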

Implications and Future Directions

The implications of this research are significant for on-device speech synthesis applications where computational resources are constrained. By providing a TTS model that achieves human-level naturalness at a fraction of the processing time, practical deployment in real-world scenarios becomes more viable.

Theoretically, this work underscores the potential of combining end-to-end optimization with innovative architectural modifications—such as multi-band and iSTFT techniques—to overcome traditional limitations in TTS systems.

Future research could extend this approach to multi-speaker models, which would significantly broaden the application scope of lightweight TTS systems, and further optimize the synthesis filters for adaptability and efficiency across diverse linguistic datasets.

In summary, this paper offers substantial advancements in TTS model efficiency, paving the way for faster and more resource-efficient speech synthesis technologies.