ZMM-TTS: Zero-shot Multilingual and Multispeaker Speech Synthesis Conditioned on Self-supervised Discrete Speech Representations (2312.14398v2)
Abstract: Neural text-to-speech (TTS) has achieved human-like synthetic speech for single-speaker, single-language synthesis. However, multilingual TTS systems remain limited to resource-rich languages due to the scarcity of large paired corpora of text and studio-quality audio. TTS systems are also typically built from a single speaker's voice, but there is growing interest in synthesizing voices for new speakers using only a few seconds of their speech. This paper presents ZMM-TTS, a multilingual and multispeaker framework that uses quantized latent speech representations from a large-scale, pre-trained, self-supervised model. Our approach combines text-based and speech-based self-supervised learning models for multilingual speech synthesis. The proposed model generalizes zero-shot not only to unseen speakers but also to unseen languages. We conducted comprehensive subjective and objective evaluations across a series of experiments. Our model proves effective in terms of speech naturalness and speaker similarity for both seen and unseen speakers in six high-resource languages. We also tested our method on two hypothetically low-resource languages; the results are promising, indicating that the proposed approach can synthesize intelligible audio with a high degree of similarity to the target speaker's voice, even without any training data for the new, unseen language.
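To make the core idea of "quantized latent speech representations from a pre-trained self-supervised model" concrete, below is a minimal sketch, not the authors' exact pipeline: it extracts frame-level features from a multilingual wav2vec 2.0 (XLSR) encoder via Hugging Face `transformers` and discretizes them with k-means. The model name, layer index, codebook size, and file names are illustrative assumptions.

```python
# Sketch: turn raw audio into discrete self-supervised speech units.
# Assumptions (not from the paper): XLS-R 300M encoder, layer 12, 512-way codebook.
import torch
import torchaudio
from sklearn.cluster import KMeans
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model

MODEL_NAME = "facebook/wav2vec2-xls-r-300m"  # assumed multilingual SSL encoder
LAYER = 12                                   # assumed intermediate layer
N_UNITS = 512                                # assumed codebook size

extractor = Wav2Vec2FeatureExtractor.from_pretrained(MODEL_NAME)
encoder = Wav2Vec2Model.from_pretrained(MODEL_NAME).eval()

def ssl_features(wav_path: str) -> torch.Tensor:
    """Return frame-level hidden states (frames, dim) from one SSL layer."""
    wav, sr = torchaudio.load(wav_path)
    wav = torchaudio.functional.resample(wav, sr, 16_000).mean(dim=0)
    inputs = extractor(wav.numpy(), sampling_rate=16_000, return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**inputs, output_hidden_states=True).hidden_states[LAYER]
    return hidden.squeeze(0)

# Fit a codebook on features pooled over a small, illustrative set of files,
# then map each frame of an utterance to its nearest centroid (unit ID).
feats = torch.cat([ssl_features(p) for p in ["a.wav", "b.wav"]]).numpy()
kmeans = KMeans(n_clusters=N_UNITS, n_init=4, random_state=0).fit(feats)
units = kmeans.predict(ssl_features("a.wav").numpy())
print(units[:20])  # discrete unit sequence used as an intermediate TTS target
```

In a ZMM-TTS-style system, a text encoder would predict such unit sequences, and a separate decoder/vocoder would map them, conditioned on a speaker embedding, back to a waveform; the sketch only covers unit extraction.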
Authors: Cheng Gong, Xin Wang, Erica Cooper, Dan Wells, Longbiao Wang, Jianwu Dang, Korin Richmond, Junichi Yamagishi