Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
43 tokens/sec
GPT-4o
13 tokens/sec
Gemini 2.5 Pro Pro
37 tokens/sec
o3 Pro
6 tokens/sec
GPT-4.1 Pro
4 tokens/sec
DeepSeek R1 via Azure Pro
33 tokens/sec
2000 character limit reached

VITS2: Improving Quality and Efficiency of Single-Stage Text-to-Speech with Adversarial Learning and Architecture Design (2307.16430v1)

Published 31 Jul 2023 in cs.SD, cs.LG, and eess.AS

Abstract: Single-stage text-to-speech models have been actively studied recently, and their results have outperformed two-stage pipeline systems. Although the previous single-stage model has made great progress, there is room for improvement in terms of its intermittent unnaturalness, computational efficiency, and strong dependence on phoneme conversion. In this work, we introduce VITS2, a single-stage text-to-speech model that efficiently synthesizes a more natural speech by improving several aspects of the previous work. We propose improved structures and training mechanisms and present that the proposed methods are effective in improving naturalness, similarity of speech characteristics in a multi-speaker model, and efficiency of training and inference. Furthermore, we demonstrate that the strong dependence on phoneme conversion in previous works can be significantly reduced with our method, which allows a fully end-to-end single-stage approach.

Citations (25)

Summary

We haven't generated a summary for this paper yet.

Dice Question Streamline Icon: https://streamlinehq.com

Follow-up Questions

We haven't generated follow-up questions for this paper yet.