RVCBench: Robust Voice Cloning Benchmark
- RVCBench is a comprehensive benchmark suite measuring voice cloning robustness under realistic input, generation, output, and adversarial conditions.
- It standardizes evaluation with ten targeted tasks, drawing on diverse datasets that span 225 speakers and multiple languages.
- The benchmark enables systematic comparison of 11 state-of-the-art VC models, highlighting failure modes and guiding future improvements.
RVCBench is a comprehensive benchmark suite for evaluating the robustness of modern voice cloning (VC) systems under realistic input, generation, output, and adversarial perturbation conditions. It provides a unified, open-source testbed with ten systematically designed robustness tasks, encompassing 225 speakers, 14,370 utterances, and eleven state-of-the-art VC models. RVCBench establishes a foundational resource for in-depth comparative study of VC model stability under deployment-induced distribution shifts, addressing a gap left by prior quality-focused evaluation frameworks (Liao et al., 31 Jan 2026).
1. Motivation and Design Principles
RVCBench is motivated by the discrepancy between high-fidelity performance of zero-shot VC systems under laboratory conditions and their diminished robustness during real-world deployment. These deficiencies stem from sensitivity to reference audio demographics (accent, age, gender), environmental noise, prompt irregularities, language switching, output post-processing, and the presence of adversarial perturbations. Prior benchmarks largely omitted robustness dimensions, focusing instead on quality metrics under sanitized conditions.
The benchmark design adheres to several principles:
- Pipeline coverage: Tasks are grouped according to points of vulnerability in the VC pipeline—input variation (reference and prompt), generation (cross-linguality, expressiveness, long context), output (compression, detectability), and perturbation (passive and proactive distortions).
- Standardized data and protocol: Tasks are tied to publicly available or curated datasets, each converted into speaker-level JSON manifests containing reference and ground-truth waveforms with paired text, to assure reproducibility.
- Cross-model comparability: Consistent evaluation metrics and a fixed input-output protocol are enforced across all models and conditions.
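The speaker-level manifest convention described above might look like the following sketch. The field names (`speaker_id`, `reference_wav`, `prompt_text`, `target_wav`, `task`) are illustrative assumptions, not the published schema; the source specifies only that each entry pairs reference and ground-truth waveforms with text:

```python
import json

# Hypothetical manifest entry; field names are assumptions for illustration,
# not the schema actually shipped with RVCBench.
entry = {
    "speaker_id": "p225",
    "task": "RVC-AudioShift",
    "reference_wav": "vctk/p225/p225_001.wav",
    "prompt_text": "Please call Stella.",
    "target_wav": "vctk/p225/p225_002.wav",
}

def validate(entry: dict) -> bool:
    """Check that a manifest entry carries the fields a fixed protocol needs."""
    required = {"speaker_id", "task", "reference_wav", "prompt_text", "target_wav"}
    return required <= entry.keys()

# Manifests serialize one JSON object per entry; round-trip and re-validate.
manifest_line = json.dumps(entry)
restored = json.loads(manifest_line)
```

Tying every reference/prompt/target triple to an explicit speaker ID is what makes speaker-level train/test separation checkable by script rather than by convention.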
2. Robustness Task Suite
RVCBench consists of ten targeted robustness tasks, each isolating a critical aspect of practical VC robustness:
| Task | Dimension Assessed | Example Sources/Variations |
|---|---|---|
| RVC-AudioShift | Demographic variation in reference audio | VCTK (12 accents, gender, age) |
| RVC-TextShift | Prompt irregularity and hallucination | LLM-gen, robocall scripts |
| RVC-Multilingual | Cross-lingual and monolingual synthesis | VCTK, LibriTTS, AISHELL-1, EMIME |
| RVC-LongContext | Reference/prompt duration scaling | LibriSpeech-Long, variable ref lengths |
| RVC-Expression | Expressive content and emotional alignment | Robocall, benign VCTK prompts |
| RVC-Compression | Output post-processing/compression artifacts | MP3/AAC/Opus, telephone-bandwidth |
| RVC-Detectability | Deepfake detector exposure | SpeechLLM-as-Judge |
| RVC-PassiveNoise | Environmental/multispeaker noise | VoiceBank+DEMAND |
| RVC-AdvNoise | Proactive adversarial audio perturbation | SafeSpeech, SPEC, POP, Enkidu |
| RVC-AntiProtect | Denoising attack against protection | DEMUCS |
This multidimensional structure allows for systematic robustness profiling, revealing failure modes specific to each pipeline segment.
3. Dataset Composition and Model Set
The evaluation corpus aggregates over 14,000 utterances from 225 speakers, spanning:
- English: VCTK (accents), LibriTTS, LibriSpeech-Long (long-form)
- Mandarin: AISHELL-1
- French: Common Voice
- Cross-lingual: EMIME (English ↔ Mandarin)
- Controlled prompt sets: LLM-generated, robocall
- Perturbation benchmarks: VoiceBank+DEMAND, adversarial defense/attack sets
JSON manifests standardize the partitioning, aligning each input reference and prompt with its ground-truth target for each robustness scenario. Speaker-level train/test splits are enforced to preclude contamination.
Eleven VC models covering three architectural paradigms are benchmarked:
- Autoregressive codec-token LMs: FishSpeech, SparkTTS, MOSS-TTSD, Higgs-Audio v2
- Diffusion/flow-matching: StyleTTS-2, OZSpeech, PlayDiffusion
- Hybrid (LM + refinement): CosyVoice 2, GLM-TTS, VibeVoice, MGM-Omni
4. Evaluation Metrics
Performance is measured via a suite of objective metrics, each capturing domain-relevant aspects of VC robustness:
- Speaker Identity Consistency: SIM (cosine similarity of ECAPA-TDNN embeddings), SVA (boolean speaker verification)
- Content Accuracy: Word Error Rate (WER)
- Perceptual Naturalness: UTMOS (SpeechMOS), DNSMOS (OVRL, SIG, BAK)
- Intelligibility: STOI (Short-Time Objective Intelligibility)
- Protection Fidelity: SNR (signal-to-noise ratio)
- Expressiveness: EMC (wav2vec2-IEMOCAP emotion consistency), EmTXT (0–3 audio vs. text emotion alignment, Audio-LLM judge)
- Generation Efficiency: RTF (real-time factor)
This metric suite realizes a holistic assessment, spanning signal-level, semantic, and attack-vulnerability dimensions.
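Two of these metrics reduce to a few lines of arithmetic. The sketch below shows SIM as plain cosine similarity and RTF as generation time over audio duration; the ECAPA-TDNN extractor is replaced with placeholder vectors (commonly 192-dimensional, though that size is an assumption here), so this illustrates the computation, not the full pipeline:

```python
import numpy as np

def sim(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """SIM metric: cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent generating divided by audio produced."""
    return synthesis_seconds / audio_seconds

# Placeholder embeddings standing in for ECAPA-TDNN outputs.
rng = np.random.default_rng(0)
e1 = rng.standard_normal(192)
e2 = rng.standard_normal(192)
same_speaker_score = sim(e1, e1)   # identical embeddings score 1.0
cross_score = sim(e1, e2)          # unrelated embeddings score near 0.0
```

An RTF below 1.0 means the model synthesizes faster than real time, e.g. `rtf(2.0, 10.0)` is 0.2 for ten seconds of audio generated in two.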
5. Experimental Insights and Model Robustness Analysis
Quantitative evaluation across tasks reveals characteristic patterns:
Input Robustness: RVC-AudioShift indicates accent-specific failures; Indian English references consistently yield the highest spectral distortion (MCD) and lowest speaker similarity (SIM), while Canadian and Australian accents pose the least challenge.
Prompt Shift Sensitivity: RVC-TextShift shows hallucination prompts increase WER by ~50% in half of the models, and "scam" prompts elevate error rates, albeit less drastically.
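The WER figures throughout are the standard word-level edit-distance rate; a minimal self-contained sketch of that computation (dynamic-programming Levenshtein distance over word sequences, normalized by reference length):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / ref words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(r)][len(h)] / len(r)
```

For example, `wer("the cat sat", "the bat sat")` is one substitution over three reference words, i.e. about 0.33; a ~50% relative WER increase under hallucination prompts means this ratio grows half again over its benign-prompt baseline.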
Cross-Lingual and Long-Context Limits: In RVC-Multilingual, English→Mandarin cloning quadruples WER and produces SIM drops of 5–15%. For long context (RVC-LongContext), reference extension improves SIM/MCD up to 8–12s, but long text prompts cause linearly increasing WER and 1–2 dB MCD deterioration across systems. No model maintains <5% WER over multi-minute texts; hybrid models degrade more gracefully.
Expressive Content Preservation: Under scam prompts (RVC-Expression), emotion-alignment drops by 20–30% relative to benign prompts, with large model-dependent variance.
Post-Processing Robustness: RVC-Compression demonstrates that low bitrate or narrowband (telephone) codecs increase MCD by 1–4 dB and degrade STOI by up to 0.1.
Detectability: RVC-Detectability (deepfake detection) yields variable results: ACC spans 62.5% (SparkTTS) to 96.3% (OZSpeech), with EER correspondingly ranging from ~40.7% down to ~1.7%. SparkTTS is least detectable but sacrifices content accuracy.
Perturbation Effects: RVC-PassiveNoise shows 0.1–0.3 SIM drops at 10 dB SNR; multi-speaker interference (–5 dB SNR) further depresses SIM and MOS. Proactive perturbations (RVC-AdvNoise, e.g., SafeSpeech, SPEC) can reduce SIM by 30–60%. DEMUCS denoising (RVC-AntiProtect) recovers only 10–20% of the lost fidelity.
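The passive-noise conditions above amount to mixing a noise signal into the reference at a target SNR. A minimal sketch of that scaling, independent of any particular dataset (white noise stands in for DEMAND recordings):

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Placeholder signals: 1 s of white "speech" and white noise at 16 kHz.
rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=10.0)

# Verify the achieved SNR against the target.
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

A negative target (e.g. the −5 dB multi-speaker condition) simply makes the interference louder than the reference speech, which is why SIM and MOS degrade further there.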
Model Robustness Variation: CosyVoice 2, Higgs-Audio v2, and GLM-TTS are most robust to input noise and accent variation; MGM-Omni and CosyVoice 2 best handle cross-lingual tasks. MOSS-TTSD exhibits increased resilience under adversarial conditions, especially when denoising is applied. This suggests architectural choices mediate robustness across task axes.
6. Limitations, Gaps, and Future Directions
Principal robustness gaps are documented:
- Input-level: Accent, demographic, and prompt irregularities induce large fidelity and content errors.
- Generation: Cross-lingual and long-context scenarios expose language modeling and identity-drift limitations.
- Output: Codec degradation and deepfake detectability remain unsolved; outputs are systematically flagged by modern detectors.
- Perturbation: Proactive adversarial perturbations undermine all tested pipelines; existing denoisers partly recover, but losses persist.
Recommended strategies include:
- Augmenting training with diverse, accented, and noisy references.
- Developing robust speaker encoders tuned on multi-lingual, multi-condition corpora.
- Fine-tuning TTS modules on atypical prompts/sequence lengths.
- Joint adversarial training against both perturbation and post-processing artifacts.
- Introducing anti-detection countermeasures to mediate the fidelity/stealth trade-off.
A plausible implication is that future VC models will require pipeline-level robustness augmentation, not simply incremental refinement to existing architectures.
7. Resource Availability
The RVCBench dataset, protocols, evaluation metrics, codebase, and integration scripts are open-sourced at https://github.com/Nanboy-Ronan/RVCBench for further research, benchmarking, and model development targeting robust, deployable voice cloning systems (Liao et al., 31 Jan 2026).