ControlSpeech: Simultaneous Zero-shot Speaker Cloning and Style Control in TTS
The paper introduces ControlSpeech, a text-to-speech (TTS) system that tackles a demanding task in speech synthesis: zero-shot speaker cloning combined with control over speaking style, such as prosody, accent, and emotion. The task is addressed with a decoupled codec framework in which synthesis is conditioned on only a few seconds of an audio prompt together with a textual style description, enabling cloning and control simultaneously. This capability is largely absent from existing TTS models, which tend either to replicate voice characteristics or to manipulate speaking style, but rarely both.
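To make the task setup concrete, the sketch below shows a hypothetical inference interface: the system takes the text to speak, a short voice prompt that supplies the target timbre, and a free-form style description. The class name, fields, and example values are illustrative assumptions, not the released API.

```python
# Hypothetical request structure illustrating the ControlSpeech task setup:
# content text + a few-second timbre prompt + a natural-language style prompt.
# Names and signature are assumptions for illustration, not the actual API.
from dataclasses import dataclass

@dataclass
class SynthesisRequest:
    text: str                 # content to synthesize
    voice_prompt_path: str    # ~3 s reference audio providing the target timbre
    style_description: str    # free-form style prompt (prosody, emotion, ...)

request = SynthesisRequest(
    text="The quick brown fox jumps over the lazy dog.",
    voice_prompt_path="speaker_ref.wav",
    style_description="A cheerful female voice, speaking quickly with high pitch.",
)
```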
Methodological Innovations
ControlSpeech builds on recent advances in neural codecs and generative modeling. A key innovation is a decoupled codec that represents timbre, content, and style separately, allowing each speech attribute to be manipulated independently. The model combines bidirectional attention with mask-based parallel decoding to generate efficiently in the discrete codec space, improving both voice and style controllability.
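The sketch below shows what mask-based parallel decoding over discrete codec tokens can look like, in a confidence-based, MaskGIT-style variant. The `predict_logits` interface, the mask-token id, and the cosine unmasking schedule are assumptions for illustration, not the paper's exact decoder.

```python
# Minimal sketch of confidence-based mask-and-predict parallel decoding over
# discrete codec tokens. Mask id, vocabulary size, and the cosine schedule are
# illustrative assumptions rather than the authors' implementation.
import math
import torch

MASK_ID = 1024     # assumed id of a special [MASK] codec token
VOCAB = 1025       # assumed codec codebook size plus the mask token

def parallel_decode(predict_logits, seq_len, num_steps=8, device="cpu"):
    """Iteratively fill a fully masked codec-token sequence.

    predict_logits: callable mapping a LongTensor of shape [1, T] to logits of
    shape [1, T, VOCAB]; in the full system this role would be played by a
    bidirectional Transformer conditioned on content, timbre, and style.
    """
    tokens = torch.full((1, seq_len), MASK_ID, dtype=torch.long, device=device)
    for step in range(num_steps):
        logits = predict_logits(tokens)                  # [1, T, VOCAB]
        conf, pred = logits.softmax(dim=-1).max(dim=-1)  # per-position confidence
        still_masked = tokens == MASK_ID
        tokens = torch.where(still_masked, pred, tokens)      # commit new predictions
        conf = conf.masked_fill(~still_masked, float("inf"))  # never re-mask committed tokens

        # Cosine schedule: fraction of positions left masked after this step.
        mask_ratio = math.cos(math.pi / 2 * (step + 1) / num_steps)
        num_remask = int(mask_ratio * seq_len)
        if num_remask > 0:
            _, lowest = conf.topk(num_remask, dim=-1, largest=False)
            tokens[0, lowest[0]] = MASK_ID               # re-mask least confident positions
    return tokens
```

A toy `predict_logits` returning random logits is enough to exercise the loop; because every position is predicted in parallel at each step, the number of network calls is fixed by `num_steps` rather than by sequence length.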
A central challenge in style control is addressed with the Style Mixture Semantic Density (SMSD) model, which targets the many-to-many mapping problem: a single textual style description can correspond to many valid speech styles. Built on Gaussian mixture density networks, SMSD partitions and samples style semantic information at a finer granularity, yielding more diverse and expressive speech generation.
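The following sketch shows a generic mixture-density-network style head in the spirit of SMSD: a style-description embedding is mapped to the parameters of a Gaussian mixture over a latent style space, and a style vector is sampled from it so that one description can yield different concrete styles. Dimensions, the number of mixtures, and the single-projection design are illustrative choices, not the paper's implementation.

```python
# Illustrative mixture-density-network head over a latent style space,
# in the spirit of SMSD. Layer sizes and the diagonal-variance choice
# are assumptions for illustration.
import torch
import torch.nn as nn

class StyleMixtureHead(nn.Module):
    def __init__(self, text_dim=512, style_dim=128, num_mixtures=4):
        super().__init__()
        self.num_mixtures = num_mixtures
        self.style_dim = style_dim
        # One projection predicts mixture-weight logits plus per-component
        # means and log-variances of the style latent.
        self.proj = nn.Linear(text_dim, num_mixtures * (1 + 2 * style_dim))

    def forward(self, desc_emb):                          # desc_emb: [B, text_dim]
        B = desc_emb.size(0)
        params = self.proj(desc_emb).view(B, self.num_mixtures, 1 + 2 * self.style_dim)
        logit_pi = params[..., 0]                         # [B, K] mixture logits
        mu = params[..., 1:1 + self.style_dim]            # [B, K, D] component means
        log_var = params[..., 1 + self.style_dim:]        # [B, K, D] component log-variances

        # Sample a component per utterance, then sample a style vector from it.
        k = torch.distributions.Categorical(logits=logit_pi).sample()   # [B]
        idx = k.view(B, 1, 1).expand(-1, 1, self.style_dim)
        mu_k = mu.gather(1, idx).squeeze(1)                              # [B, D]
        std_k = (0.5 * log_var.gather(1, idx).squeeze(1)).exp()
        style = mu_k + std_k * torch.randn_like(std_k)
        return style, (logit_pi, mu, log_var)             # latent + params for an NLL loss
```

During training, the returned mixture parameters would typically feed a mixture negative-log-likelihood against a reference style embedding; at inference, sampling different components supplies the diversity that a single description leaves unspecified.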
Experimental Evaluation
The experimental framework introduces ControlToolkit, a suite of resources comprising a new dataset (VccmDataset), replicated baseline models, and new metrics for evaluating control capability and synthesized speech quality. ControlSpeech is compared with prior systems along several axes, including style accuracy, timbre similarity, audio quality, diversity, and generalization across the evaluated datasets.
Quantitative results on pitch, speed, energy, and emotion accuracy show that ControlSpeech outperforms existing systems in style controllability. The results also validate the necessity of each component of the model, providing empirical support for its architecture. The contrast is sharpest against traditional TTS approaches, which typically do not address voice cloning and style modulation at the same time.
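As a rough illustration of how a control-accuracy metric of this kind can be computed, the sketch below buckets a simple signal measurement (RMS energy) into coarse classes and scores it against the requested level. The thresholds and the choice of attribute are illustrative assumptions, not the paper's evaluation protocol.

```python
# Hedged sketch of a style-control accuracy measure: map each synthesized
# utterance to a coarse energy class and compare it with the requested level.
# Thresholds are arbitrary placeholders for illustration.
import numpy as np

def energy_level(wav: np.ndarray, low_thresh=0.02, high_thresh=0.08) -> str:
    """Map a mono waveform in [-1, 1] to a coarse energy class."""
    rms = float(np.sqrt(np.mean(wav ** 2)))
    if rms < low_thresh:
        return "low"
    if rms > high_thresh:
        return "high"
    return "normal"

def energy_accuracy(wavs, target_levels):
    """Fraction of utterances whose measured level matches the requested one."""
    hits = sum(energy_level(w) == t for w, t in zip(wavs, target_levels))
    return hits / len(target_levels)
```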
Implications and Future Directions
ControlSpeech has significant implications for domains where personalized TTS is crucial, such as content creation, assistive technologies for people with communication impairments, and AI-driven human-computer interaction. Its ability to deliver personalized voice interfaces while preserving naturalness and expressiveness stands to improve user experience considerably.
Future research could focus on enriching the diversity of style descriptions to more closely approach human-level expressiveness. Optimizing the decoupled codec with more sophisticated vector quantization methods might further improve the naturalness of synthesized speech, and expanding training data to tens of thousands of hours may yield additional performance gains. Exploring other generative architectures could also enhance robustness and flexibility.
By integrating zero-shot speaker cloning with dynamic style controllability, backed by a solid methodological framework and promising experimental results, ControlSpeech closes a notable gap in TTS technology and sets a benchmark for future work in speech synthesis.