RVCBench: Robust Voice Cloning Benchmark
- RVCBench is a comprehensive benchmark suite measuring voice cloning robustness under realistic input, generation, output, and adversarial conditions.
- It standardizes evaluation with ten targeted tasks, drawing on diverse datasets that span 225 speakers and multiple languages.
- The benchmark enables systematic comparison of 11 state-of-the-art VC models, highlighting failure modes and guiding future improvements.
RVCBench is a comprehensive benchmark suite for evaluating the robustness of modern voice cloning (VC) systems under realistic input, generation, output, and adversarial perturbation conditions. It provides a unified, open-source testbed with ten systematically designed robustness tasks, encompassing 225 speakers, 14,370 utterances, and eleven state-of-the-art VC models. RVCBench establishes a foundational resource for in-depth comparative study of VC model stability under deployment-induced distribution shifts, addressing a gap left by prior quality-focused evaluation frameworks (Liao et al., 31 Jan 2026).
1. Motivation and Design Principles
RVCBench is motivated by the discrepancy between high-fidelity performance of zero-shot VC systems under laboratory conditions and their diminished robustness during real-world deployment. These deficiencies stem from sensitivity to reference audio demographics (accent, age, gender), environmental noise, prompt irregularities, language switching, output post-processing, and the presence of adversarial perturbations. Prior benchmarks largely omitted robustness dimensions, focusing instead on quality metrics under sanitized conditions.
The benchmark design adheres to several principles:
- Pipeline coverage: Tasks are grouped according to points of vulnerability in the VC pipeline—input variation (reference and prompt), generation (cross-linguality, expressiveness, long context), output (compression, detectability), and perturbation (passive and proactive distortions).
- Standardized data and protocol: Tasks are tied to publicly available or curated datasets, each converted into speaker-level JSON manifests containing reference and ground-truth waveforms with paired text, to assure reproducibility.
- Cross-model comparability: Consistent evaluation metrics and a fixed input-output protocol are enforced across all models and conditions.
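The speaker-level manifest convention described above might look like the following sketch. The field names (`speaker_id`, `reference_wav`, `prompt_text`, `target_wav`, `task`) are illustrative assumptions, not the published schema; the source specifies only that each entry pairs reference and ground-truth waveforms with text:

```python
import json

# Hypothetical manifest entry; field names are assumptions for illustration,
# not the schema actually shipped with RVCBench.
entry = {
    "speaker_id": "p225",
    "task": "RVC-AudioShift",
    "reference_wav": "vctk/p225/p225_001.wav",
    "prompt_text": "Please call Stella.",
    "target_wav": "vctk/p225/p225_002.wav",
}

def validate(entry: dict) -> bool:
    """Check that a manifest entry carries the fields a fixed protocol needs."""
    required = {"speaker_id", "task", "reference_wav", "prompt_text", "target_wav"}
    return required <= entry.keys()

# Manifests serialize one JSON object per entry; round-trip and re-validate.
manifest_line = json.dumps(entry)
restored = json.loads(manifest_line)
```

Tying every reference/prompt/target triple to an explicit speaker ID is what makes speaker-level train/test separation checkable by script rather than by convention.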
2. Robustness Task Suite
RVCBench consists of ten targeted robustness tasks, each isolating a critical aspect of practical VC robustness:
| Task | Dimension Assessed | Example Sources/Variations |
|---|---|---|
| RVC-AudioShift | Demographic variation in reference audio | VCTK (12 accents, gender, age) |
| RVC-TextShift | Prompt irregularity and hallucination | LLM-gen, robocall scripts |
| RVC-Multilingual | Cross-lingual and monolingual synthesis | VCTK, LibriTTS, AISHELL-1, EMIME |
| RVC-LongContext | Reference/prompt duration scaling | LibriSpeech-Long, variable ref lengths |
| RVC-Expression | Expressive content and emotional alignment | Robocall, benign VCTK prompts |
| RVC-Compression | Output post-processing/compression artifacts | MP3/AAC/Opus, telephone-bandwidth |
| RVC-Detectability | Deepfake detector exposure | SpeechLLM-as-Judge |
| RVC-PassiveNoise | Environmental/multispeaker noise | VoiceBank+DEMAND |
| RVC-AdvNoise | Proactive adversarial audio perturbation | SafeSpeech, SPEC, POP, Enkidu |
| RVC-AntiProtect | Denoising attack against protection | DEMUCS |
This multidimensional structure allows for systematic robustness profiling, revealing failure modes specific to each pipeline segment.
3. Dataset Composition and Model Set
The evaluation corpus aggregates over 14,000 utterances from 225 speakers, spanning:
- English: VCTK (accents), LibriTTS, LibriSpeech-Long (long-form)
- Mandarin: AISHELL-1
- French: Common Voice
- Cross-lingual: EMIME (English ↔ Mandarin)
- Controlled prompt sets: LLM-generated, robocall
- Perturbation benchmarks: VoiceBank+DEMAND, adversarial defense/attack sets
JSON manifests standardize the partitioning, aligning each input reference and prompt with its ground-truth target for each robustness scenario. Speaker-level train/test splits are enforced to preclude contamination.
Eleven VC models covering three architectural paradigms are benchmarked:
- Autoregressive codec-token LMs: FishSpeech, SparkTTS, MOSS-TTSD, Higgs-Audio v2
- Diffusion/flow-matching: StyleTTS-2, OZSpeech, PlayDiffusion
- Hybrid (LM + refinement): CosyVoice 2, GLM-TTS, VibeVoice, MGM-Omni
4. Evaluation Metrics
Performance is measured via a suite of objective metrics, each capturing domain-relevant aspects of VC robustness:
- Speaker Identity Consistency: SIM (cosine similarity of ECAPA-TDNN embeddings), SVA (boolean speaker verification)
- Content Accuracy: Word Error Rate (WER)
- Perceptual Naturalness: UTMOS (SpeechMOS), DNSMOS (OVRL, SIG, BAK)
- Intelligibility: STOI (Short-Time Objective Intelligibility)
- Protection Fidelity: SNR (signal-to-noise ratio)
- Expressiveness: EMC (wav2vec2-IEMOCAP emotion consistency), EmTXT (0–3 audio vs. text emotion alignment, Audio-LLM judge)
- Generation Efficiency: RTF (real-time factor)
This metric suite realizes a holistic assessment, spanning signal-level, semantic, and attack-vulnerability dimensions.
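Two of these metrics reduce to a few lines of arithmetic. The sketch below shows SIM as plain cosine similarity and RTF as generation time over audio duration; the ECAPA-TDNN extractor is replaced with placeholder vectors (commonly 192-dimensional, though that size is an assumption here), so this illustrates the computation, not the full pipeline:

```python
import numpy as np

def sim(emb_a: np.ndarray, emb_b: np.ndarray) -> float:
    """SIM metric: cosine similarity between two speaker embeddings."""
    return float(np.dot(emb_a, emb_b) /
                 (np.linalg.norm(emb_a) * np.linalg.norm(emb_b)))

def rtf(synthesis_seconds: float, audio_seconds: float) -> float:
    """Real-time factor: time spent generating divided by audio produced."""
    return synthesis_seconds / audio_seconds

# Placeholder embeddings standing in for ECAPA-TDNN outputs.
rng = np.random.default_rng(0)
e1 = rng.standard_normal(192)
e2 = rng.standard_normal(192)
same_speaker_score = sim(e1, e1)   # identical embeddings score 1.0
cross_score = sim(e1, e2)          # unrelated embeddings score near 0.0
```

An RTF below 1.0 means the model synthesizes faster than real time, e.g. `rtf(2.0, 10.0)` is 0.2 for ten seconds of audio generated in two.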
5. Experimental Insights and Model Robustness Analysis
Quantitative evaluation across tasks reveals characteristic patterns:
Input Robustness: RVC-AudioShift indicates accent-specific failures; Indian English references consistently yield the highest spectral distortion (MCD) and lowest speaker similarity (SIM), while Canadian and Australian accents pose the least challenge.
Prompt Shift Sensitivity: RVC-TextShift shows hallucination prompts increase WER by ~50% in half of the models, and "scam" prompts elevate error rates, albeit less drastically.
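The WER figures throughout are the standard word-level edit-distance rate; a minimal self-contained sketch of that computation (dynamic-programming Levenshtein distance over word sequences, normalized by reference length):

```python
def wer(ref: str, hyp: str) -> float:
    """Word error rate: (substitutions + insertions + deletions) / ref words."""
    r, h = ref.split(), hyp.split()
    # d[i][j] = edit distance between r[:i] and h[:j]
    d = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        d[i][0] = i
    for j in range(len(h) + 1):
        d[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + cost) # substitution / match
    return d[len(r)][len(h)] / len(r)
```

For example, `wer("the cat sat", "the bat sat")` is one substitution over three reference words, i.e. about 0.33; a ~50% relative WER increase under hallucination prompts means this ratio grows half again over its benign-prompt baseline.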
Cross-Lingual and Long-Context Limits: In RVC-Multilingual, English→Mandarin cloning quadruples WER and produces SIM drops of 5–15%. For long context (RVC-LongContext), reference extension improves SIM/MCD up to 8–12s, but long text prompts cause linearly increasing WER and 1–2 dB MCD deterioration across systems. No model maintains <5% WER over multi-minute texts; hybrid models degrade more gracefully.
Expressive Content Preservation: Under scam prompts (RVC-Expression), emotion-alignment drops by 20–30% relative to benign prompts, with large model-dependent variance.
Post-Processing Robustness: RVC-Compression demonstrates that low bitrate or narrowband (telephone) codecs increase MCD by 1–4 dB and degrade STOI by up to 0.1.
Detectability: RVC-Detectability (deepfake detection) yields variable results: ACC spans 62.5% (SparkTTS) to 96.3% (OZSpeech), with EER correspondingly ranging from ~40.7% down to ~1.7%. SparkTTS is least detectable but sacrifices content accuracy.
Perturbation Effects: RVC-PassiveNoise shows 0.1–0.3 SIM drops at 10 dB SNR; multi-speaker interference (–5 dB SNR) further depresses SIM and MOS. Proactive perturbations (RVC-AdvNoise, e.g., SafeSpeech, SPEC) can reduce SIM by 30–60%. DEMUCS denoising (RVC-AntiProtect) recovers only 10–20% of the lost fidelity.
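The passive-noise conditions above amount to mixing a noise signal into the reference at a target SNR. A minimal sketch of that scaling, independent of any particular dataset (white noise stands in for DEMAND recordings):

```python
import numpy as np

def mix_at_snr(clean: np.ndarray, noise: np.ndarray, snr_db: float) -> np.ndarray:
    """Scale `noise` so the clean-to-noise power ratio equals `snr_db`, then add it."""
    p_clean = np.mean(clean ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
    return clean + scale * noise

# Placeholder signals: 1 s of white "speech" and white noise at 16 kHz.
rng = np.random.default_rng(1)
clean = rng.standard_normal(16000)
noise = rng.standard_normal(16000)
noisy = mix_at_snr(clean, noise, snr_db=10.0)

# Verify the achieved SNR against the target.
achieved = 10 * np.log10(np.mean(clean ** 2) / np.mean((noisy - clean) ** 2))
```

A negative target (e.g. the −5 dB multi-speaker condition) simply makes the interference louder than the reference speech, which is why SIM and MOS degrade further there.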
Model Robustness Variation: CosyVoice 2, Higgs-Audio v2, and GLM-TTS are most robust to input noise and accent variation; MGM-Omni and CosyVoice 2 best handle cross-lingual tasks. MOSS-TTSD exhibits increased resilience under adversarial conditions, especially when denoising is applied. This suggests architectural choices mediate robustness across task axes.
6. Limitations, Gaps, and Future Directions
Principal robustness gaps are documented:
- Input-level: Accent, demographic, and prompt irregularities induce large fidelity and content errors.
- Generation: Cross-lingual and long-context scenarios expose language modeling and identity-drift limitations.
- Output: Codec degradation and deepfake detectability remain unsolved; outputs are systematically flagged by modern detectors.
- Perturbation: Proactive adversarial perturbations undermine all tested pipelines; existing denoisers partly recover, but losses persist.
Recommended strategies include:
- Augmenting training with diverse, accented, and noisy references.
- Developing robust speaker encoders tuned on multi-lingual, multi-condition corpora.
- Fine-tuning TTS modules on atypical prompts/sequence lengths.
- Joint adversarial training against both perturbation and post-processing artifacts.
- Introducing anti-detection countermeasures to mediate the fidelity/stealth trade-off.
A plausible implication is that future VC models will require pipeline-level robustness augmentation, not simply incremental refinement to existing architectures.
7. Resource Availability
The RVCBench dataset, protocols, evaluation metrics, codebase, and integration scripts are open-sourced at https://github.com/Nanboy-Ronan/RVCBench for further research, benchmarking, and model development targeting robust, deployable voice cloning systems (Liao et al., 31 Jan 2026).