High-Fidelity Speech Synthesis with Minimal Supervision: All Using Diffusion Models (2309.15512v2)
Abstract: Text-to-speech (TTS) methods have shown promising results in voice cloning, but they require large amounts of labeled text-speech pairs. Minimally-supervised speech synthesis decouples TTS by combining two types of discrete speech representations (semantic and acoustic) and using two sequence-to-sequence tasks, enabling training with minimal supervision. However, existing methods suffer from information redundancy and dimensionality explosion in their semantic representations, and from high-frequency waveform distortion in their discrete acoustic representations. Autoregressive frameworks exhibit instability and limited controllability, while non-autoregressive frameworks suffer from prosodic averaging caused by duration prediction models. To address these issues, we propose a minimally-supervised high-fidelity speech synthesis method in which every module is built on diffusion models. The non-autoregressive framework improves controllability, and the duration diffusion model enables diverse prosodic expression. Contrastive Token-Acoustic Pretraining (CTAP) serves as the intermediate semantic representation, resolving the information redundancy and dimensionality explosion of existing semantic coding methods. The mel-spectrogram serves as the acoustic representation. Both the semantic and acoustic representations are predicted by continuous regression tasks, avoiding the high-frequency, fine-grained waveform distortion introduced by discrete acoustic codecs. Experimental results show that our proposed method outperforms the baseline. We provide audio samples on our website.
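The approach rests on denoising diffusion probabilistic models (DDPMs) that regress continuous targets such as mel-spectrograms rather than discrete codec tokens. The sketch below is a minimal, hypothetical illustration of one DDPM training step conditioned on semantic features, not the authors' implementation; the `denoiser` network, tensor shapes, and schedule hyperparameters are all assumptions.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of one DDPM training step (Ho et al., 2020) for
# regressing a continuous acoustic representation (e.g. a mel-spectrogram)
# conditioned on semantic features. The denoiser network, shapes, and
# hyperparameters below are assumptions, not the paper's configuration.

T = 1000                                    # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)       # linear noise schedule
alphas_cumprod = torch.cumprod(1.0 - betas, dim=0)

def ddpm_training_step(denoiser, mel, semantic):
    """mel: (batch, n_mels, frames); semantic: conditioning features."""
    b = mel.size(0)
    t = torch.randint(0, T, (b,), device=mel.device)        # random timestep
    noise = torch.randn_like(mel)                           # eps ~ N(0, I)
    a_bar = alphas_cumprod.to(mel.device)[t].view(b, 1, 1)  # broadcast shape
    # forward process: x_t = sqrt(a_bar) * x_0 + sqrt(1 - a_bar) * eps
    x_t = a_bar.sqrt() * mel + (1.0 - a_bar).sqrt() * noise
    # the network predicts the added noise, conditioned on semantics and t
    eps_pred = denoiser(x_t, t, semantic)
    return F.mse_loss(eps_pred, noise)                      # simple L2 objective
```

Because the regression target here is a continuous mel-spectrogram rather than discrete codec tokens, an objective of this form sidesteps the quantization-induced high-frequency distortion that the abstract attributes to discrete acoustic representations.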
Authors: Chunyu Qiang, Hao Li, Yixin Tian, Yi Zhao, Ying Zhang, Longbiao Wang, Jianwu Dang