Papers
Topics
Authors
Recent
Gemini 2.5 Flash
Gemini 2.5 Flash
80 tokens/sec
GPT-4o
59 tokens/sec
Gemini 2.5 Pro Pro
43 tokens/sec
o3 Pro
7 tokens/sec
GPT-4.1 Pro
50 tokens/sec
DeepSeek R1 via Azure Pro
28 tokens/sec
2000 character limit reached

Interpretable Style Transfer for Text-to-Speech with ControlVAE and Diffusion Bridge (2306.04301v2)

Published 7 Jun 2023 in cs.SD and eess.AS

Abstract: With the demand for autonomous control and personalized speech generation, the style control and transfer in Text-to-Speech (TTS) is becoming more and more important. In this paper, we propose a new TTS system that can perform style transfer with interpretability and high fidelity. Firstly, we design a TTS system that combines variational autoencoder (VAE) and diffusion refiner to get refined mel-spectrograms. Specifically, a two-stage and a one-stage system are designed respectively, to improve the audio quality and the performance of style transfer. Secondly, a diffusion bridge of quantized VAE is designed to efficiently learn complex discrete style representations and improve the performance of style transfer. To have a better ability of style transfer, we introduce ControlVAE to improve the reconstruction quality and have good interpretability simultaneously. Experiments on LibriTTS dataset demonstrate that our method is more effective than baseline models.

User Edit Pencil Streamline Icon: https://streamlinehq.com
Authors (6)
  1. Wenhao Guan (13 papers)
  2. Tao Li (440 papers)
  3. Yishuang Li (3 papers)
  4. Hukai Huang (8 papers)
  5. Qingyang Hong (29 papers)
  6. Lin Li (329 papers)
Citations (5)