
StableVC: Style Controllable Zero-Shot Voice Conversion with Conditional Flow Matching

Published 6 Dec 2024 in eess.AS and cs.SD | (2412.04724v2)

Abstract: Zero-shot voice conversion (VC) aims to transfer the timbre from the source speaker to an arbitrary unseen speaker while preserving the original linguistic content. Despite recent advancements in zero-shot VC using LLM-based or diffusion-based approaches, several challenges remain: 1) current approaches primarily focus on adapting timbre from unseen speakers and are unable to transfer style and timbre to different unseen speakers independently; 2) these approaches often suffer from slower inference speeds due to the autoregressive modeling methods or the need for numerous sampling steps; 3) the quality and similarity of the converted samples are still not fully satisfactory. To address these challenges, we propose a style controllable zero-shot VC approach named StableVC, which aims to transfer timbre and style from source speech to different unseen target speakers. Specifically, we decompose speech into linguistic content, timbre, and style, and then employ a conditional flow matching module to reconstruct the high-quality mel-spectrogram based on these decomposed features. To effectively capture timbre and style in a zero-shot manner, we introduce a novel dual attention mechanism with an adaptive gate, rather than using conventional feature concatenation. With this non-autoregressive design, StableVC can efficiently capture the intricate timbre and style from different unseen speakers and generate high-quality speech significantly faster than real-time. Experiments demonstrate that our proposed StableVC outperforms state-of-the-art baseline systems in zero-shot VC and achieves flexible control over timbre and style from different unseen speakers. Moreover, StableVC offers approximately 25x and 1.65x faster sampling compared to autoregressive and diffusion-based baselines.

Summary

  • The paper introduces a novel conditional flow matching module that reduces sampling requirements and accelerates inference by 25×.
  • The dual attention mechanism with adaptive gate control effectively separates style and timbre conversion for unseen speakers.
  • Superior performance on the LibriLight dataset validates StableVC's advancements in quality and efficiency over state-of-the-art methods.

Introduction

StableVC addresses the challenges of zero-shot voice conversion (VC), particularly style-controllable VC, by leveraging conditional flow matching. Unlike existing methods that focus primarily on adapting timbre, StableVC enables independent transfer of timbre and style to unseen speakers. The paper identifies shortcomings of current techniques, including slow inference caused by autoregressive modeling and unsatisfactory quality in converted samples.

Figure 1: The concept of style-controllable zero-shot voice conversion. It aims to build a VC system capable of adapting timbre to unseen speakers and transferring style independently.

Zero-shot capabilities in speech generation are largely inspired by the success of LLMs such as GPT. Despite this progress, most methods lack independent style control, require complex setups, and are limited in output quality or inference speed.

StableVC Overview

StableVC comprises three core components: linguistic content extraction, style representation via a factorized codec, and mel-spectrogram extraction for timbre modeling. A key innovation is the conditional flow matching module, whose non-autoregressive design enables efficient conversion.

Figure 2: Details of DualAGC in the DiT block.

The extraction pipeline applies K-means clustering to obtain discrete linguistic content tokens and uses multiple reference utterances for timbre modeling. Rather than conventional feature concatenation, a dual attention mechanism with adaptive gate control injects timbre and style information, enabling generation significantly faster than real time.
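The K-means step for linguistic content can be sketched as nearest-centroid assignment over frame-level features. This is a minimal illustration: the feature extractor, feature dimension, and cluster count below are placeholders, not values stated in the paper.

```python
import numpy as np

def quantize_content_features(features, centroids):
    """Assign each frame-level feature vector to its nearest K-means
    centroid, yielding discrete content tokens that discard most
    timbre/style detail (illustrative sketch only)."""
    # features: (T, D) frames; centroids: (K, D) cluster centers
    dists = np.linalg.norm(features[:, None, :] - centroids[None, :, :], axis=-1)
    return dists.argmin(axis=1)  # (T,) token ids in [0, K)

rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 16))   # 100 frames of 16-dim features
cents = rng.standard_normal((8, 16))     # 8 hypothetical clusters
tokens = quantize_content_features(feats, cents)
```

In practice the centroids are fit once on features from a pretrained speech encoder, and the resulting token sequence serves as the content condition for the generator.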

Dual Attention Mechanism and Conditional Flow Matching

DualAGC captures style and timbre simultaneously within each DiT block. It combines dual cross-attention over timbre and style references with FiLM layers and an adaptive gate that controls how strongly style information is injected alongside timbre.
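The fusion idea can be sketched as follows: content states attend separately to timbre and style references, the style branch is scaled by a learned gate, and FiLM (gamma * x + beta) modulates the result. This is a simplified single-head NumPy sketch of the general pattern; the block's exact wiring, projections, and normalization in StableVC are not reproduced here.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, k, v):
    """Single-head scaled dot-product attention."""
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v

def dual_attention_with_gate(h, timbre_ref, style_ref, gate_logit, gamma, beta):
    """Illustrative fusion in the spirit of DualAGC: two attention
    streams, an adaptive gate on the style stream, then FiLM."""
    timbre_out = attend(h, timbre_ref, timbre_ref)
    style_out = attend(h, style_ref, style_ref)
    gate = 1.0 / (1.0 + np.exp(-gate_logit))   # adaptive gate in (0, 1)
    fused = h + timbre_out + gate * style_out  # gated dual attention
    return gamma * fused + beta                # FiLM modulation

rng = np.random.default_rng(1)
h = rng.standard_normal((10, 8))       # 10 content frames, dim 8
timbre = rng.standard_normal((20, 8))  # timbre reference frames
style = rng.standard_normal((6, 8))    # style reference frames
out = dual_attention_with_gate(h, timbre, style, gate_logit=0.0,
                               gamma=1.0, beta=0.0)
```

Driving the gate logit strongly negative suppresses the style branch entirely, which is the mechanism that lets timbre and style be controlled independently.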

Conditional flow matching reduces the number of sampling steps, and hence inference time, by leveraging optimal-transport probability paths between the noise and data distributions, yielding efficient, high-quality voice conversion.
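The optimal-transport conditional flow matching objective can be sketched as a straight-line interpolation between noise and data, with the network trained to regress the constant velocity along that path. This follows the generic OT-CFM formulation (Lipman et al.-style), not necessarily StableVC's exact parameterization.

```python
import numpy as np

def ot_cfm_pair(x0, x1, t, sigma_min=1e-4):
    """Optimal-transport conditional flow matching training pair:
    interpolate noise x0 toward data x1 along a straight path and
    return the sample x_t plus the target velocity the network
    should regress at time t."""
    x_t = (1.0 - (1.0 - sigma_min) * t) * x0 + t * x1
    target_v = x1 - (1.0 - sigma_min) * x0
    return x_t, target_v

rng = np.random.default_rng(2)
x0 = rng.standard_normal((80, 100))  # noise, shaped like a mel-spectrogram
x1 = rng.standard_normal((80, 100))  # target mel-spectrogram
x_t, v = ot_cfm_pair(x0, x1, t=0.5)
```

Because the target paths are straight, an ODE solver can traverse them in very few steps at inference, which is the source of the speedup over diffusion baselines that need many denoising iterations.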

Experimental Setup and Results

Experiments on the LibriLight dataset validate StableVC's performance, using nMOS for subjective evaluation and SECS for objective speaker-similarity evaluation. StableVC significantly outperforms state-of-the-art systems in both quality and efficiency, with approximately 25× faster sampling than the autoregressive baseline and 1.65× faster than the diffusion baseline.

Figure 3: Violin plots for timbre and style similarity of speech generated by baseline systems and StableVC.
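The SECS metric used in the evaluation is conventionally the cosine similarity between speaker embeddings of the converted and reference utterances. A minimal sketch, assuming embeddings from some pretrained speaker verification model (random placeholders below):

```python
import numpy as np

def secs(emb_a, emb_b):
    """Speaker Embedding Cosine Similarity between two utterance-level
    speaker embeddings; higher means more similar timbre."""
    num = float(emb_a @ emb_b)
    den = float(np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return num / den

rng = np.random.default_rng(3)
ref = rng.standard_normal(192)               # reference speaker embedding
conv = ref + 0.1 * rng.standard_normal(192)  # converted speech, close to ref
score = secs(conv, ref)
```

The embedding dimensionality (192) and noise level here are illustrative; only the cosine-similarity definition is the metric itself.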

Ablation studies quantify each component's contribution to capturing timbre and style separately, confirming that the adaptive gate and the style prior improve style robustness.

Conclusion

StableVC demonstrates substantial improvements in style-controllable zero-shot VC. Its architecture, combining the dual attention mechanism with conditional flow matching, achieves superior conversion quality and efficiency over existing approaches.

The study invites further exploration of these results across diverse languages and speaker profiles.
