REF-VC: Robust, Expressive and Fast Zero-Shot Voice Conversion with Diffusion Transformers

Published 7 Aug 2025 in eess.AS | (2508.04996v2)

Abstract: In real-world voice conversion applications, environmental noise in source speech and user demands for expressive output pose critical challenges. Traditional ASR-based methods ensure noise robustness but suppress prosody richness, while SSL-based models improve expressiveness but suffer from timbre leakage and noise sensitivity. This paper proposes REF-VC, a noise-robust expressive voice conversion system. Key innovations include: (1) A random erasing strategy to mitigate the information redundancy inherent in SSL features, enhancing noise robustness and expressiveness; (2) Implicit alignment inspired by E2TTS to suppress non-essential feature reconstruction; (3) Integration of Shortcut Models to accelerate flow matching inference, significantly reducing to 4 steps. Experimental results demonstrate that REF-VC outperforms baselines such as Seed-VC in zero-shot scenarios on the noisy set, while also performing comparably to Seed-VC on the clean set. In addition, REF-VC can be compatible with singing voice conversion within one model.


Summary

The paper presents a novel zero-shot voice conversion system, REF-VC, leveraging diffusion transformers to fuse ASR and SSL features.
The model employs a random erasing strategy and implicit alignment to enhance noise robustness and preserve natural expressiveness.
Shortcut Models reduce inference to four steps, and REF-VC achieves higher NMOS and SMOS than state-of-the-art baselines in both clean and noisy environments.


Introduction

Voice Conversion (VC) involves transforming the voice of a source speaker into that of a target speaker while preserving the linguistic content. In practical applications, two main challenges arise: robustness to environmental noise and the demand for expressive and natural output. Previous methods, such as those based on Automatic Speech Recognition (ASR), suppress noise but at the cost of expressiveness. Self-supervised learning (SSL) approaches capture expressiveness but suffer from timbre leakage and noise sensitivity.

REF-VC is proposed to address these challenges by combining the strengths of ASR and SSL frameworks. This novel VC system leverages diffusion transformers (DiT) as its backbone to enhance robustness and expressive performance.

Model Architecture and Techniques

As shown in the architecture overview (Figure 1), REF-VC comprises an input encoder, a fusion module, and a DiT-based estimator. Pretrained WeNet and WavLM models extract bottleneck features (BNF) and SSL representations, respectively. These features are then projected into a lower-dimensional latent space for content conditioning.

Figure 1: Architecture overview of REF-VC.

Random Erasing Strategy

The paper introduces a random erasing strategy to handle redundancy in SSL features. The method involves randomly erasing parts of SSL features during training, which forces the model to rely on BNF for noise-robust content modeling, while still leveraging SSL features for paralinguistic information.
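The erasing step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: it assumes erasing means zeroing contiguous spans of frames, and the names `random_erase`, `erase_prob`, and `max_span` are hypothetical. A real training pipeline would apply the same idea to tensors rather than nested lists.

```python
import random


def random_erase(ssl_frames, erase_prob=0.5, max_span=10, rng=None):
    """Return a copy of ssl_frames with random contiguous spans zeroed out.

    ssl_frames: list of frames, each a list of floats (shape T x D).
    erase_prob, max_span: hypothetical hyperparameters, not taken from the paper.
    """
    rng = rng or random.Random()
    erased = [list(frame) for frame in ssl_frames]  # never mutate the input
    t = 0
    while t < len(erased):
        span = rng.randint(1, max_span)  # length of the next segment
        if rng.random() < erase_prob:
            # Zero every frame in this segment so the model cannot read
            # SSL information here and must fall back on the BNF stream.
            for i in range(t, min(t + span, len(erased))):
                erased[i] = [0.0] * len(erased[i])
        t += span
    return erased


# Toy usage: 20 frames of 3-dimensional "SSL features".
frames = [[1.0, 2.0, 3.0] for _ in range(20)]
erased_frames = random_erase(frames, erase_prob=0.5, max_span=4,
                             rng=random.Random(0))
```

Erasing is applied only during training. At inference the full SSL sequence would be passed through unchanged, which under this sketch corresponds to `erase_prob=0.0`.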

Implicit Alignment

Strict frame-to-frame conversion can compromise audio clarity. REF-VC instead adopts an implicit alignment inspired by E2TTS: inputs are padded with blank frames and the alignment is left to the model, which suppresses the reconstruction of non-essential features and improves robustness and fidelity. The fusion module combines the features into the estimator's input (Figure 2).

Figure 2: Detail of fusion module.
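The blank-frame padding behind implicit alignment can be illustrated as follows. The summary does not say where the blank frames are placed or what length they pad to, so this sketch assumes the content sequence is right-padded with all-zero frames up to the target length, in the spirit of the filler tokens E2TTS appends to its text input. The function name and `blank_value` are hypothetical.

```python
def pad_with_blank_frames(content, target_len, blank_value=0.0):
    """Right-pad a T x D feature sequence with blank frames up to target_len.

    The padded sequence has the same length as the target, so the estimator
    can attend over both without an explicit frame-to-frame mapping.
    """
    if not content:
        raise ValueError("content must contain at least one frame")
    if target_len < len(content):
        raise ValueError("target_len is shorter than the content sequence")
    dim = len(content[0])
    n_blank = target_len - len(content)
    padded = [list(frame) for frame in content]      # copy the real frames
    padded += [[blank_value] * dim for _ in range(n_blank)]  # append blanks
    return padded


# Toy usage: 2 content frames padded to a 5-frame target.
padded = pad_with_blank_frames([[1.0, 2.0], [3.0, 4.0]], target_len=5)
```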

Shortcut Models

To speed up inference, REF-VC adopts Shortcut Models, which allow sampling in just four steps. The estimator is conditioned on the step size $d$ in addition to the timestep, so that a single large step approximates what several small flow-matching steps would produce, reducing the error of few-step sampling (Figure 3).

Figure 3: Comparison of shortcut models and flow matching.
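A few-step sampler of this kind can be sketched as a plain Euler loop in which the step size $d$ is passed to the estimator. This is a toy under stated assumptions: `velocity_fn` is a hypothetical stand-in for the DiT estimator, the convention that $t = 0$ is noise and $t = 1$ is data is assumed, and the toy velocity below is an analytic function chosen so the result can be checked, not a trained model.

```python
def shortcut_sample(velocity_fn, x0, num_steps=4):
    """Integrate from t=0 to t=1 using num_steps steps of size d = 1/num_steps.

    velocity_fn(x, t, d) returns the direction to move from x at time t when
    the sampler intends to jump ahead by d.
    """
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    d = 1.0 / num_steps
    x = list(x0)
    t = 0.0
    for _ in range(num_steps):
        v = velocity_fn(x, t, d)                  # estimator sees the step size
        x = [xi + d * vi for xi, vi in zip(x, v)]  # one Euler step of size d
        t += d
    return x


def make_toy_velocity(target):
    """Ideal velocity for a straight path ending at a single fixed target.

    This toy ignores d because a straight path has the same direction for
    every step size; a trained shortcut model would use d.
    """
    def velocity(x, t, d):
        return [(ti - xi) / (1.0 - t) for xi, ti in zip(x, target)]
    return velocity


# Toy usage: four steps carry the start point onto the target.
sample = shortcut_sample(make_toy_velocity([1.0, -2.0, 0.5]),
                         [0.0, 0.0, 0.0], num_steps=4)
```

With a plain flow-matching estimator, the same loop at `num_steps=4` would usually accumulate discretization error on curved paths. Conditioning on $d$ is what lets a shortcut model compensate for large steps.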

Experiments and Evaluation

Objective and Subjective Evaluation

REF-VC is evaluated against baselines such as Seed-VC and VITS-VC. On clean audio, REF-VC shows comparable speaker similarity, and it excels in challenging noisy conditions thanks to its robust architecture. It achieves higher Naturalness Mean Opinion Score (NMOS) and Similarity Mean Opinion Score (SMOS) in both environments.

Performance metrics such as Character Error Rate (CER) and Speaker Embedding Cosine Similarity (SECS) reaffirm REF-VC's efficacy in maintaining intelligibility and speaker consistency under varying conditions (Table 1).
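For reference, both objective metrics reduce to short computations. The sketch below assumes the usual definitions: CER is the character-level edit distance between a reference transcript and an ASR transcript of the converted speech, divided by the reference length, and SECS is the cosine similarity between two speaker embeddings. The ASR system and speaker encoder that produce the inputs are outside this sketch, and the function names are illustrative.

```python
import math


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate: edit distance normalized by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


def secs(emb_a, emb_b):
    """Cosine similarity between two speaker embeddings of equal dimension."""
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same dimension")
    dot = sum(a * b for a, b in zip(emb_a, emb_b))
    norm_a = math.sqrt(sum(a * a for a in emb_a))
    norm_b = math.sqrt(sum(b * b for b in emb_b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)
```

Lower CER indicates better intelligibility, and SECS closer to 1 indicates that the converted speech is closer to the target speaker in embedding space.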

Ablation Studies

Ablation studies validate the contributions of key components. The absence of random erasing led to a severe degradation in both naturalness and speaker similarity due to the unchecked reliance on SSL features. The implicit alignment was critical in preserving audio fidelity (Figure 4).

Figure 4: Spectrogram visualization of ablation experiments.

Conclusion

REF-VC integrates BNF and SSL features through a random erasing strategy to enhance noise robustness while preserving expressiveness. It offers significant improvements over state-of-the-art systems in both subjective and objective evaluations, particularly under noisy conditions. Shortcut Models reduce inference to four steps without compromising quality, highlighting REF-VC's practical applicability in real-world VC tasks.

Future Work

Future work aims to strike a better balance between prosody preservation and style transfer, since users often want to convert timbre and style together. A further focus is the current difficulty of generating arbitrarily long speech, a limitation akin to that of TTS systems, which stems from inherent alignment constraints.
