Voice Attribute Editing with Text Prompt

Published 13 Apr 2024 in cs.SD, cs.AI, and eess.AS | (2404.08857v2)

Abstract: Despite recent advancements in speech generation with text prompt providing control over speech style, voice attributes in synthesized speech remain elusive and challenging to control. This paper introduces a novel task: voice attribute editing with text prompt, with the goal of making relative modifications to voice attributes according to the actions described in the text prompt. To solve this task, VoxEditor, an end-to-end generative model, is proposed. In VoxEditor, addressing the insufficiency of text prompt, a Residual Memory (ResMem) block is designed, that efficiently maps voice attributes and these descriptors into the shared feature space. Additionally, the ResMem block is enhanced with a voice attribute degree prediction (VADP) block to align voice attributes with corresponding descriptors, addressing the imprecision of text prompt caused by non-quantitative descriptions of voice attributes. We also establish the open-source VCTK-RVA dataset, which leads the way in manual annotations detailing voice characteristic differences among different speakers. Extensive experiments demonstrate the effectiveness and generalizability of our proposed method in terms of both objective and subjective metrics. The dataset and audio samples are available on the website.

Summary

The paper introduces VoxEditor, an end-to-end model for editing voice attributes via text prompts.
It employs innovative modules like the Residual Memory block and VADP to accurately map and adjust qualitative voice features.
Experimental evaluations on the VCTK-RVA dataset demonstrate significant enhancements in target voice attribute similarity.

Introduction to Voice Attribute Editing

The paper "Voice Attribute Editing with Text Prompt" introduces a new task: adjusting voice characteristics in synthesized speech through natural language cues. The goal is to make relative modifications to voice attributes (qualitative elements such as "husky" or "bright") as directed by a text prompt. The task differs from traditional voice conversion (VC) in that it relies on text rather than reference audio to control voice attributes. This makes it practical for applications such as personalized voice creation, where suitable reference audio is often hard to find.

Figure 1: Illustration of voice attribute editing with text prompt.

Methodology: VoxEditor

The proposed solution, VoxEditor, is an end-to-end generative model designed to address the insufficiency and imprecision inherent in text prompts. It pairs a Residual Memory (ResMem) block with a voice attribute degree prediction (VADP) block so that voice attributes and the descriptors used in the prompt are aligned in a shared feature space.

Residual Memory (ResMem) Block: This component is key to mapping voice attributes into a shared feature space, compensating for aspects difficult to describe in text. It consists of a main memory which quantizes common characteristics and a residual memory that captures subtle nuances.

VADP Block: This module predicts the degree of difference in voice attributes between speakers, thus addressing the qualitative nature of voice attribute descriptors drawn from text.

Figure 2: The overall flowchart of our proposed VoxEditor. During the training process, two speech segments (Speech^A and Speech^B) are used, along with voice attribute descriptor x. In the inference process, the model takes source speech and the text prompt as inputs to generate edited speech. Here Mel denotes the Mel spectrograms and Linear denotes the linear spectrograms.
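The two-level lookup described for the ResMem block (a main memory for common characteristics plus a residual memory for subtle nuances) can be illustrated with a small sketch. This is only one plausible reading, not the paper's architecture: the dot-product attention, the subtraction used to form the residual query, and the names `attend` and `resmem_lookup` are all assumptions made for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, slots):
    """Soft lookup: weight each memory slot by its dot-product score with the query."""
    scores = [sum(q * s for q, s in zip(query, slot)) for slot in slots]
    weights = softmax(scores)
    return [
        sum(w * slot[i] for w, slot in zip(weights, slots))
        for i in range(len(query))
    ]


def resmem_lookup(query, main_memory, residual_memory):
    """Hypothetical two-stage lookup (an assumption, not the published method).

    1. The main memory returns a coarse representation of the query.
    2. Whatever the coarse representation misses (query - coarse) is used
       to query the residual memory for finer detail.
    3. The two read-outs are summed.
    """
    coarse = attend(query, main_memory)
    leftover = [q - c for q, c in zip(query, coarse)]
    fine = attend(leftover, residual_memory)
    return [c + f for c, f in zip(coarse, fine)]


# Toy 2-dimensional example with made-up memory slots.
main_memory = [[1.0, 0.0], [0.0, 1.0]]
residual_memory = [[0.1, 0.0], [0.0, 0.1]]
embedding = resmem_lookup([1.0, 0.0], main_memory, residual_memory)
```

In this toy setup the main memory alone can only return blends of its slots; the residual stage adds a small correction on top, which mirrors the division of labor described above.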

Dataset and Experimental Validation

An essential contribution of this research is the creation of the VCTK-RVA dataset, which includes manually annotated voice characteristic differences between speakers. This dataset facilitates the alignment of qualitative voice attributes with quantitative descriptors.

Extensive experiments, utilizing both objective and subjective metrics, demonstrate VoxEditor's effectiveness. Results revealed that VoxEditor can produce high-quality speech that aligns closely with input text prompts while retaining the source speech's voice characteristics.

Evaluation and Results

Numerical evaluations showed significant improvements in metrics such as TVAS (Target Voice Attribute Similarity) when using VoxEditor compared to existing methods like PromptStyle. These results validate the model's ability to generate speech with precisely edited voice attributes corresponding to text prompts.

Figure 3: The variation of the TVAS metric for generated speech edited with different attributes under various values of editing degree alpha.
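This summary does not state how TVAS is computed. A common way to score how close two voices are is cosine similarity between speaker embeddings, so the sketch below assumes a metric of that kind purely for illustration. The function names and the averaging over reference embeddings are hypothetical and should not be read as the paper's definition.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


def attribute_similarity(edited_embedding, reference_embeddings):
    """Assumed TVAS-style score: mean cosine similarity between the edited
    speech's embedding and embeddings of voices that carry the target attribute."""
    if not reference_embeddings:
        raise ValueError("at least one reference embedding is required")
    sims = [cosine_similarity(edited_embedding, ref) for ref in reference_embeddings]
    return sum(sims) / len(sims)


# Toy embeddings: one reference matches the edited voice, one is orthogonal to it.
score = attribute_similarity([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

Under this assumption, a higher score means the edited voice sits closer to voices that exhibit the requested attribute.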

Visualizations and User Study

Visual analyses, such as t-SNE visualizations of speaker embeddings, further demonstrated VoxEditor's ability to cluster edited speech with distinct voice attributes. Additionally, user studies confirmed that an editing degree (alpha) between 0.6 and 0.8 yields the most compelling balance of attribute modification and voice characteristic retention.

Figure 4: The t-SNE visualization of the speaker embeddings extracted from generated speech edited with different attributes under various values of alpha.

Figure 5: MOS-Cons and MOS-Corr scores with varying editing degrees alpha. Edited speech tends to match both the source speech and text prompt in the highlighted area.
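The role of the editing degree alpha can be pictured as a blend between the source voice representation and a representation carrying the target attribute. The sketch below assumes simple linear interpolation between two embedding vectors; the function name `edit_embedding`, the vectors, and the interpolation rule itself are illustrative assumptions rather than the model's confirmed mechanism.

```python
def edit_embedding(source, target, alpha):
    """Blend two embeddings (assumed linear interpolation).

    alpha = 0.0 keeps the source voice unchanged;
    alpha = 1.0 moves fully to the target-attribute representation.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if len(source) != len(target):
        raise ValueError("embeddings must have the same length")
    return [(1.0 - alpha) * s + alpha * t for s, t in zip(source, target)]


# Toy vectors standing in for speaker embeddings.
source_voice = [1.0, 0.0]
target_voice = [3.0, 4.0]
halfway = edit_embedding(source_voice, target_voice, 0.5)
```

Seen this way, small values of alpha leave the voice close to the source and large values pull it toward the target attribute, which is the trade-off between consistency with the source and correspondence with the prompt that the user study explores.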

Conclusion

VoxEditor represents a significant advancement in the field of AI-driven voice synthesis, offering a robust system for editing voice attributes via textual commands. Future research directions could include expanding the dataset and enhancing model capabilities to encompass a wider range of voice attribute adjustments. The outlined limitations suggest pathways for further refinement, particularly in addressing decreased performance in unseen conditions and improving dataset annotations.

The development of VoxEditor not only enhances the flexibility of voice editing tasks but also sets a precedent for employing natural language as a control mechanism in voice synthesis, highlighting the intersection of AI and linguistic descriptions in sophisticated audio processing tasks.
