Zero-shot Voice Conversion with Diffusion Transformers (2411.09943v1)

Published 15 Nov 2024 in cs.SD, cs.LG, and eess.AS

Abstract: Zero-shot voice conversion aims to transform a source speech utterance to match the timbre of a reference speech from an unseen speaker. Traditional approaches struggle with timbre leakage, insufficient timbre representation, and mismatches between training and inference tasks. We propose Seed-VC, a novel framework that addresses these issues by introducing an external timbre shifter during training to perturb the source speech timbre, mitigating leakage and aligning training with inference. Additionally, we employ a diffusion transformer that leverages the entire reference speech context, capturing fine-grained timbre features through in-context learning. Experiments demonstrate that Seed-VC outperforms strong baselines like OpenVoice and CosyVoice, achieving higher speaker similarity and lower word error rates in zero-shot voice conversion tasks. We further extend our approach to zero-shot singing voice conversion by incorporating fundamental frequency (F0) conditioning, resulting in comparative performance to current state-of-the-art methods. Our findings highlight the effectiveness of Seed-VC in overcoming core challenges, paving the way for more accurate and versatile voice conversion systems.

Summary

The paper introduces Seed-VC, a framework that mitigates timbre leakage by using an external timbre shifter during training for zero-shot voice conversion.
It employs a diffusion transformer architecture that leverages full reference speech context to capture fine-grained speaker characteristics and enhance conversion quality.
Experimental results show higher speaker similarity and lower word and character error rates than baselines such as OpenVoice and CosyVoice, indicating that linguistic content is preserved while the target timbre is matched more closely.

Zero-Shot Voice Conversion with Diffusion Transformers

The paper introduces Seed-VC, a zero-shot voice conversion (VC) framework that addresses three persistent challenges in the field: timbre leakage, insufficient timbre representation, and the mismatch between training and inference tasks. The approach combines a diffusion transformer with an external timbre shifter applied during training, and it converts speech to an unseen speaker's voice from a reference utterance alone, with no training data from that speaker.

Key Innovations and Methodology

Seed-VC's framework is structured around two primary innovations:

Timbre Shifter for Training: An external timbre shifter perturbs the source speech timbre during training. The source speech is transformed into a timbre-shifted version, and the content extractor operates on that version, so the content features carry less of the original speaker's timbre. This reduces timbre leakage and aligns training with inference, where content and timbre come from different speakers.
Diffusion Transformer Architecture: The framework employs a diffusion transformer that captures nuanced speaker characteristics through in-context learning. Rather than relying on a single timbre vector, this approach leverages the full reference speech context, enabling detailed timbre representation. The use of the diffusion transformer allows for capturing both global and fine-grained timbre features, enhancing the generalization capability in zero-shot scenarios.
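
The two ideas above can be sketched as a single training-example builder. This is a minimal sketch under stated assumptions, not the authors' implementation: `toy_timbre_shift` and `toy_extract_content` are hypothetical stand-ins for the external timbre shifter and the content extractor, and using a random prefix of the same utterance as the in-context reference is an illustrative assumption that this summary does not confirm.

```python
import numpy as np


def toy_timbre_shift(mel, rng):
    """Hypothetical stand-in for the external timbre shifter.

    Rolls the frequency axis by a random number of bins so that the
    spectral envelope changes while the frame timing stays the same.
    """
    shift = int(rng.integers(1, 4))
    return np.roll(mel, shift, axis=1)


def toy_extract_content(mel):
    """Hypothetical stand-in for the content extractor.

    Uses per-frame mean energy, which is unaffected by the roll above,
    to mimic a feature that ignores the timbre perturbation.
    """
    return mel.mean(axis=1, keepdims=True)


def make_training_example(mel, timbre_shift, extract_content, rng, min_prompt=1):
    """Build one training example from a mel spectrogram of shape (T, n_mels).

    Content comes from the timbre-shifted speech, while the generation
    target and the in-context reference come from the original speech.
    """
    n_frames = mel.shape[0]
    if n_frames <= min_prompt:
        raise ValueError("utterance too short to split into prompt and target")
    shifted = timbre_shift(mel, rng)
    content = extract_content(shifted)
    # Assumption: frames [0, prompt_len) are given as clean reference context,
    # and the model would be trained to generate only the remaining frames.
    prompt_len = int(rng.integers(min_prompt, n_frames))
    is_prompt = np.arange(n_frames) < prompt_len
    return {"content": content, "target": mel, "is_prompt": is_prompt}


rng = np.random.default_rng(0)
mel = rng.random((10, 8))
example = make_training_example(mel, toy_timbre_shift, toy_extract_content, rng)
```

In this arrangement the model never sees content features computed from speech that has the target timbre, which mirrors inference, where content and reference come from different speakers.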

Experimental Results

The empirical evaluation of Seed-VC reveals considerable improvements over existing baselines like OpenVoice and CosyVoice. Key results from the experiments include:

Speaker Similarity: Seed-VC achieves a higher speaker similarity score, indicating its superior ability to mimic unseen speakers' timbre in zero-shot scenarios. This improvement is attributed to the enhanced timbre representation strategy facilitated by the diffusion transformer.
Word Error Rate (WER) and Character Error Rate (CER): The framework demonstrates a lower WER and CER compared to baselines, suggesting better preservation of linguistic content during conversion. The novel training strategy effectively mitigates the trade-off between timbre similarity and speech intelligibility.
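
For readers unfamiliar with these metrics, the sketch below shows how they are commonly computed. WER and CER are the edit distance between a reference transcript and a recognizer's transcript of the converted speech, divided by the reference length. Speaker similarity is shown here as cosine similarity between two speaker embeddings, which is an assumption: this summary does not state which similarity measure or embedding model the paper uses. All function names are illustrative.

```python
import numpy as np


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edit distance over reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edit distance over reference length."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(reference, hypothesis) / len(reference)


def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors (assumed metric)."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


# One substitution ("sat" -> "sit") and one deletion ("the"): 2 errors in 6 words.
score = wer("the cat sat on the mat", "the cat sit on mat")
```

Lower WER and CER mean the converted speech is still transcribed as the original words, while a higher similarity score means the converted voice sits closer to the reference speaker.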

Additionally, Seed-VC extends to zero-shot singing voice conversion by incorporating fundamental frequency (F0) conditioning, which supplies the pitch information that singing depends on. The authors report performance comparable to current state-of-the-art singing voice conversion methods.
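
One common way to turn an F0 contour into a conditioning signal is to quantize it on a log-frequency scale and reserve one bin for unvoiced frames. The sketch below illustrates that idea only. The bin count, frequency range, and function names are assumptions chosen for the example, and this summary does not describe how Seed-VC itself encodes F0.

```python
import numpy as np


def quantize_f0(f0_hz, n_bins=256, f0_min=50.0, f0_max=1100.0):
    """Map an F0 contour in Hz to integer bins on a log-frequency scale.

    Bin 0 is reserved for unvoiced frames (F0 == 0). Voiced frames map to
    bins 1 .. n_bins - 1. All defaults are illustrative assumptions.
    """
    f0 = np.asarray(f0_hz, dtype=float)
    voiced = f0 > 0
    bins = np.zeros(f0.shape, dtype=np.int64)
    clipped = np.clip(f0[voiced], f0_min, f0_max)
    # Position of each voiced frame within [f0_min, f0_max] on a log scale.
    pos = (np.log(clipped) - np.log(f0_min)) / (np.log(f0_max) - np.log(f0_min))
    bins[voiced] = 1 + np.round(pos * (n_bins - 2)).astype(np.int64)
    return bins


def shift_semitones(f0_hz, semitones):
    """Transpose an F0 contour by a number of semitones; unvoiced zeros stay zero."""
    return np.asarray(f0_hz, dtype=float) * 2.0 ** (semitones / 12.0)


contour = [0.0, 220.0, 440.0, 0.0]
bins = quantize_f0(shift_semitones(contour, 12))
```

The resulting bin indices could be passed through an embedding layer and added to the content features. A semitone shift like the one shown is one way to move a source melody into a range that suits the target voice.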

Theoretical and Practical Implications

The theoretical contribution of this work lies in applying a diffusion transformer with in-context conditioning on the full reference utterance, rather than a single timbre vector, to zero-shot voice conversion. Practically, Seed-VC points toward more versatile VC systems that can convert to speakers unseen during training, including for singing.

Future Directions

Future research could involve extensive ablation studies to dissect the impact of various elements within the Seed-VC framework. Exploring alternative datasets for training and evaluating Seed-VC in different linguistic contexts could further validate its robustness and generalizability. Additionally, optimizing computational efficiency to facilitate real-time applications represents a promising avenue for extending the practical applicability of this framework in live voice conversion scenarios.

In conclusion, Seed-VC addresses timbre leakage, limited timbre representation, and the training-inference mismatch in zero-shot voice conversion, and it extends the same approach to singing voice conversion.
