Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech (2106.06103v1)

Published 11 Jun 2021 in cs.SD and eess.AS

Abstract: Several recent end-to-end text-to-speech (TTS) models enabling single-stage training and parallel sampling have been proposed, but their sample quality does not match that of two-stage TTS systems. In this work, we present a parallel end-to-end TTS method that generates more natural sounding audio than current two-stage models. Our method adopts variational inference augmented with normalizing flows and an adversarial training process, which improves the expressive power of generative modeling. We also propose a stochastic duration predictor to synthesize speech with diverse rhythms from input text. With the uncertainty modeling over latent variables and the stochastic duration predictor, our method expresses the natural one-to-many relationship in which a text input can be spoken in multiple ways with different pitches and rhythms. A subjective human evaluation (mean opinion score, or MOS) on the LJ Speech, a single speaker dataset, shows that our method outperforms the best publicly available TTS systems and achieves a MOS comparable to ground truth.

Authors (3)
  1. Jaehyeon Kim (16 papers)
  2. Jungil Kong (5 papers)
  3. Juhee Son (4 papers)
Citations (750)

Summary

Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

This paper, authored by Jaehyeon Kim, Jungil Kong, and Juhee Son, presents an end-to-end text-to-speech (TTS) approach built on a conditional variational autoencoder (VAE) augmented with adversarial learning. The proposed model, Variational Inference with adversarial learning for end-to-end Text-to-Speech (VITS), outperforms existing two-stage TTS systems and produces audio whose naturalness approaches that of ground-truth recordings.

Methodological Advancements

VITS integrates several methodological components to enhance TTS outputs (illustrative sketches of how they fit together follow this list):

  1. Conditional VAE Architecture: The paper leverages a conditional VAE to model the one-to-many relationship inherent in speech synthesis. This allows the system to generate diverse speech variations in terms of rhythm and pitch, addressing the challenge that a single text input can yield multiple valid speech outputs.
  2. Normalizing Flows: These are applied to the conditional prior distribution of the VAE, increasing its flexibility and expressive power and thereby improving audio quality.
  3. Adversarial Training: By incorporating adversarial learning, the model enhances the realism of generated waveforms, utilizing discriminators to distinguish between real and generated audio.
  4. Stochastic Duration Predictor: This component models phoneme duration as a distribution rather than a point estimate, capturing natural variation in speaking rate and distinguishing VITS from deterministic predictors that produce a fixed rhythm for a given input.
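
To make the composition of these pieces concrete, the following is a minimal, illustrative PyTorch-style sketch of how a VITS-like training step combines the conditional-VAE reconstruction and KL terms, a flow-mapped latent, and an adversarial loss. All module definitions, tensor shapes, and the simplified KL expression below are placeholder assumptions for illustration, not the authors' implementation (which uses WaveNet-style posterior encoders, a HiFi-GAN-style decoder, and multi-period/multi-scale discriminators).

```python
# Illustrative sketch only: toy stand-ins for the real VITS networks.
import torch
import torch.nn as nn
import torch.nn.functional as F

latent_dim, spec_dim, text_dim, frames = 16, 80, 192, 100

posterior_enc = nn.Conv1d(spec_dim, 2 * latent_dim, 1)       # q(z|x): mean, log-std from spectrogram
text_prior    = nn.Conv1d(text_dim, 2 * latent_dim, 1)       # p(z|c): mean, log-std from text
flow          = nn.Conv1d(latent_dim, latent_dim, 1)         # stand-in for the invertible normalizing flow
decoder       = nn.ConvTranspose1d(latent_dim, 1, 256, 256)  # z -> raw waveform (HiFi-GAN-like in the paper)
discriminator = nn.Conv1d(1, 1, 15, stride=4)                # real/fake scores over the waveform

spec = torch.randn(1, spec_dim, frames)       # linear spectrogram of the target speech (toy data)
text = torch.randn(1, text_dim, frames)       # aligned text/phoneme features (toy data)
wave = torch.randn(1, 1, frames * 256)        # ground-truth waveform (toy data)

# 1. Conditional VAE: sample the latent z from the posterior q(z|x).
m_q, logs_q = posterior_enc(spec).chunk(2, dim=1)
z = m_q + torch.randn_like(m_q) * logs_q.exp()

# 2. Normalizing flow: map z toward the text-conditioned prior p(z|c).
#    (A real flow is invertible and contributes a log-determinant term, omitted here.)
z_p = flow(z)
m_p, logs_p = text_prior(text).chunk(2, dim=1)
kl = (logs_p - logs_q - 0.5
      + 0.5 * (z_p - m_p) ** 2 * torch.exp(-2.0 * logs_p)).mean()

# Reconstruction term (the paper uses an L1 loss on mel-spectrograms; raw waveforms here for brevity).
wave_hat = decoder(z)
recon = F.l1_loss(wave_hat, wave)

# 3. Adversarial training with a least-squares GAN objective (HiFi-GAN style).
d_real = discriminator(wave)
d_fake = discriminator(wave_hat.detach())
loss_disc = ((d_real - 1) ** 2).mean() + (d_fake ** 2).mean()          # discriminator update
loss_gen  = recon + kl + ((discriminator(wave_hat) - 1) ** 2).mean()   # generator (TTS model) update
```

In the actual system these terms are optimized jointly with a duration loss and a feature-matching loss; the sketch only shows how the variational, flow, and adversarial pieces interact.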

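The stochastic duration predictor can likewise be illustrated with a toy sketch. The paper's predictor is a flow-based model trained with a variational lower bound on the log-likelihood of phoneme durations; the version below simply samples a log-duration per phoneme from a learned Gaussian, an assumption made to keep the example short, but it conveys the key behavior: resampling yields a different, equally valid rhythm for the same text.

```python
# Illustrative sketch only: not the paper's flow-based duration predictor.
import torch
import torch.nn as nn

num_phonemes, hidden = 12, 192
text_hidden = torch.randn(1, hidden, num_phonemes)    # encoded phoneme sequence (toy data)

# Toy head predicting mean and log-std of the log-duration for each phoneme.
dur_head = nn.Conv1d(hidden, 2, kernel_size=1)
m, logs = dur_head(text_hidden).chunk(2, dim=1)

noise_scale = 0.8   # scales the injected noise, trading rhythm diversity for stability at inference
log_dur = m + torch.randn_like(m) * logs.exp() * noise_scale
durations = torch.clamp_min(torch.ceil(log_dur.exp()), 1.0)   # frames assigned to each phoneme

# Sampling again with fresh noise produces a different rhythm for the same input text.
```
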
Experimental Insights

The experimental evaluation reports notable improvements in mean opinion score (MOS) over strong two-stage baselines such as Tacotron 2 and Glow-TTS, each paired with HiFi-GAN for waveform generation. On the LJ Speech dataset, VITS received a MOS of 4.43, closely approaching the ground-truth score of 4.46.

Implications and Future Work

The implications of this research are significant for the evolution of TTS systems. By removing the reliance on predefined intermediate representations such as mel-spectrograms, VITS simplifies training and deployment into a single stage. This direct text-to-waveform synthesis represents a shift towards fully integrated TTS solutions.

Additionally, the paper suggests future directions, including the exploration of self-supervised language representations to remove the text preprocessing that the current pipeline still requires.

Conclusion

VITS marks a substantial contribution to the field of TTS, setting a high benchmark for audio quality in end-to-end models. It opens opportunities for further research into even more nuanced and versatile speech synthesis systems, driving advancements in accessibility and interaction technology through improved natural language interfaces. The paper successfully establishes a framework that not only competes with existing paradigms but also paves the way for future exploration in voice synthesis applications.
