
ClonEval: Evaluating Voice Cloning

Updated 26 March 2026
  • ClonEval is an open, reproducible benchmarking framework that objectively assesses voice cloning TTS systems using automated similarity metrics.
  • It standardizes input data and employs rigorous methods like cosine similarity of WavLM embeddings to compare system performance.
  • Analogous frameworks in software clone detection and tumor clonality assessment share its emphasis on reproducible, objective evaluation across domains.

ClonEval refers to several distinct, domain-specific benchmarking and evaluation frameworks developed to assess "cloning" in diverse fields: voice cloning for text-to-speech (TTS), code-clone detection, and tumor clonality in cancer genomics. Each system known as "ClonEval" or with closely related nomenclature embodies rigorous methodology for quantifying resemblance—whether of synthesized voices, software code fragments, or somatic mutation patterns. The following exposition systematically details the primary incarnation of ClonEval in open voice-cloning evaluation, while contextualizing its relationship to analogous frameworks in software engineering and oncogenomics.

1. Objective and Motivation

ClonEval is an open, reproducible, and fully automated benchmark expressly designed for assessing the performance of voice-cloning and speaker-adaptive TTS models as black boxes. Its goals are threefold:

  • To enable direct, fair comparison among diverse voice-cloning systems by standardizing input data, evaluation metrics, and tooling.
  • To eliminate subjective human scoring (notably mean opinion score, MOS) in favor of scalable, objective, and fully automated metrics.
  • To provide a public leaderboard enabling researchers and vendors to submit models and view comparative results in near real-time (Christop et al., 29 Apr 2025).

ClonEval is motivated by recent advances in neural codec language models (e.g., VALL-E, VoiceBox) that make zero-shot and few-shot voice cloning widely accessible, exposing the limitations of earlier evaluation strategies that rely heavily on expensive, non-reproducible human ratings.

2. Benchmark Design and Protocol

Model Interface and Data Conditions

ClonEval treats any voice-cloning TTS system as a black box with minimal interface requirements:

  • Input: a single audio file from the target speaker, and a separate text prompt (non-matching with the reference transcript).
  • Output: a single, cloned wav file.

Evaluation is performed on the following corpora:

  • LibriSpeech test-clean (∼2500 neutral, non-emotional English utterances)
  • Four emotion-rich English datasets: CREMA-D, RAVDESS, SAVEE, TESS

Reference–prompt pairs are mapped so that generated speech cannot trivially parrot the reference, enforcing true speaker adaptation.
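The non-matching constraint can be sketched as follows; `build_pairs` is a hypothetical helper shown only for illustration (the benchmark ships its own fixed mappings):

```python
import random

def build_pairs(utterances, seed=42):
    """Pair each reference recording with a prompt text drawn from a
    *different* utterance, so cloned speech cannot parrot the reference."""
    rng = random.Random(seed)  # fixed seed keeps the mapping reproducible
    pairs = []
    for i, utt in enumerate(utterances):
        # pick any utterance other than i to supply the text prompt
        j = rng.choice([k for k in range(len(utterances)) if k != i])
        pairs.append({"reference_wav": utt["wav"],
                      "prompt_text": utterances[j]["text"]})
    return pairs
```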

Evaluation Workflow

The evaluation follows a two-stage pipeline:

  1. Synthesis: Candidate models synthesize cloned audio (wav format) for each test pair.
  2. Scoring:

    a. Resample all audio to 16 kHz.
    b. Extract WavLM speaker-verification embeddings using the pretrained microsoft/wavlm-base-plus-sv model.
    c. Compute the primary metric: cosine similarity between reference and cloned embeddings, averaged across the dataset.
    d. Compute secondary metrics: per-feature cosine similarity (MFCCs, pitch, spectral centroid, LPC coefficients, etc.), using feature vectors extracted with Librosa.

Pseudocode API:

from clonEval import Evaluator

# Point the evaluator at matched directories of reference and cloned audio;
# files in the two directories share names so that pairs can be aligned.
ev = Evaluator(
    original_dir="orig/",
    cloned_dir="clone/",
    use_emotions=True,  # also aggregate scores per emotion label
)
per_file_df, agg_df = ev.run()  # per-file and aggregated metric tables
print(agg_df)

3. Objective Metrics and Statistical Formulas

ClonEval eschews subjective metrics in favor of the following pipeline:

Speaker Embedding Similarity (Primary Metric)

Let $u_\text{ref}$ and $u_\text{gen}$ be the WavLM embeddings of the reference and cloned files. The cosine similarity is

$$\mathrm{cosine\_sim}(u_\text{ref}, u_\text{gen}) = \frac{u_\text{ref} \cdot u_\text{gen}}{\|u_\text{ref}\| \, \|u_\text{gen}\|}$$

The aggregate metric reported is the mean similarity across all evaluation pairs in a dataset.
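Numerically, the primary metric reduces to a few lines once embeddings are in hand; a NumPy sketch (embedding extraction itself is done by the WavLM model and is not reproduced here):

```python
import numpy as np

def cosine_sim(u_ref, u_gen):
    """Cosine similarity between two embedding vectors."""
    u_ref = np.asarray(u_ref, dtype=float)
    u_gen = np.asarray(u_gen, dtype=float)
    return float(u_ref @ u_gen / (np.linalg.norm(u_ref) * np.linalg.norm(u_gen)))

def dataset_score(pairs):
    """Aggregate metric: mean similarity over (reference, generated) pairs."""
    return float(np.mean([cosine_sim(ref, gen) for ref, gen in pairs]))
```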

Acoustic-Feature Similarity

For each acoustic feature (e.g., MFCCs, spectral bandwidth), the same cosine similarity formula is applied to the respective feature vectors.
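ClonEval extracts these features with Librosa; as a self-contained illustration, here is one such feature (spectral centroid) computed with plain NumPy, plus the shared cosine comparison applied to feature vectors:

```python
import numpy as np

def spectral_centroid(signal, sr):
    """Magnitude-weighted mean frequency of the signal's spectrum."""
    mag = np.abs(np.fft.rfft(signal))
    freqs = np.fft.rfftfreq(len(signal), d=1.0 / sr)
    return float(np.sum(freqs * mag) / np.sum(mag))

def feature_similarity(ref_vec, gen_vec):
    """Same cosine formula as the primary metric, applied to feature vectors."""
    ref_vec = np.asarray(ref_vec, dtype=float)
    gen_vec = np.asarray(gen_vec, dtype=float)
    return float(ref_vec @ gen_vec /
                 (np.linalg.norm(ref_vec) * np.linalg.norm(gen_vec)))
```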

Classic Metrics (not used in the main pipeline)

  • Mel-Cepstral Distortion (MCD): $\mathrm{MCD} = \frac{10}{\ln 10} \sqrt{2 \sum_{d=1}^{D} \left(c^{\mathrm{ref}}_{t,d} - c^{\mathrm{hyp}}_{t,d}\right)^2}$
  • MOS: Not used in ClonEval for the sake of reproducibility and scalability.
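Although MCD is excluded from the main pipeline, the formula is straightforward to implement for time-aligned cepstral frames; a NumPy sketch:

```python
import numpy as np

def mcd(ref_ceps, hyp_ceps):
    """Mel-cepstral distortion in dB, averaged over time-aligned frames.

    ref_ceps, hyp_ceps: arrays of shape (frames, D) holding mel-cepstral
    coefficients c_{t,1..D} (c0, the energy term, is conventionally excluded).
    """
    ref_ceps = np.asarray(ref_ceps, dtype=float)
    hyp_ceps = np.asarray(hyp_ceps, dtype=float)
    diff = ref_ceps - hyp_ceps
    per_frame = (10.0 / np.log(10.0)) * np.sqrt(2.0 * np.sum(diff ** 2, axis=1))
    return float(np.mean(per_frame))
```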

The output consists of both fine-grained (per-file) and aggregate (per-dataset, per-emotion) CSVs.

4. Software Library and Reproducible Evaluation

ClonEval is implemented as an open-source Python library, structured for strict reproducibility:

  • data_loader.py: Aligns and validates reference vs. generated files.
  • wavlm_embedder.py: Wraps inference for the fixed pretrained WavLM speaker-verification embedding model.
  • feature_extractor.py: Uses Librosa with fixed parameters (16 kHz, hop_length=512, n_fft=2048).
  • evaluator.py: Computes all metrics, outputs results as CSV.
  • CLI (eval.py): Argument parsing and pipeline orchestration.

Usage convention requires parallel directories of original (reference) and cloned wav files with identical filenames, enabling fully automated, script-driven evaluation.

Standard procedure:

  1. Clone repository, install dependencies.
  2. Download WavLM checkpoint.
  3. Run eval.py with appropriate input directories.
  4. Obtain results.csv (per-file) and aggregated_results.csv (per-dataset/emotion).
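The per-file output can then be post-processed with ordinary tooling; a sketch that recomputes per-dataset means from results.csv, where the column names `dataset` and `cosine_similarity` are illustrative assumptions rather than the library's documented schema:

```python
import csv
import statistics
from collections import defaultdict

def aggregate(per_file_csv):
    """Recompute per-dataset mean similarity from the per-file results.

    Column names 'dataset' and 'cosine_similarity' are illustrative
    assumptions, not the library's documented schema.
    """
    groups = defaultdict(list)
    with open(per_file_csv, newline="") as f:
        for row in csv.DictReader(f):
            groups[row["dataset"]].append(float(row["cosine_similarity"]))
    return {name: statistics.mean(vals) for name, vals in groups.items()}
```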

5. Leaderboard Structure and Baseline Results

ClonEval operates a public, continuously updated leaderboard, organized as follows:

  • Main: Average speaker-embedding cosine similarity for each submitted system.
  • Emotions: Results stratified by emotion × dataset.
  • Features: Per-feature similarity results on LibriSpeech.
  • Submission: Simple upload of .wav outputs triggers automatic evaluation and ranking.

Representative results (partial, Table 1 (Christop et al., 29 Apr 2025)):

Dataset         OuteTTS   SpeechT5   VALL-E X   Whisper   XTTS-v2
LS test-clean    0.8836     0.9099     0.9010    0.9014    0.8881
CREMA-D          0.7359     0.7618     0.7412    0.7284    0.8060
RAVDESS          0.7696     0.8265     0.7382    0.6972    0.8539
SAVEE            0.5394     0.7987     0.7674    0.7725    0.8135
TESS             0.8207     0.8521     0.7832    0.8188    0.8167
Average          0.7499     0.8298     0.7862    0.7837    0.8356
  • XTTS-v2 achieves the highest average similarity (0.8356) overall.
  • SpeechT5 leads on LibriSpeech (0.9099).
  • XTTS-v2 leads on three of the four emotional datasets (CREMA-D, RAVDESS, SAVEE); SpeechT5 is strongest on TESS.
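The Average row and the headline ranking can be checked directly from the per-dataset scores (values transcribed from Table 1; the computed means match the printed averages to within rounding):

```python
# Per-dataset similarities from Table 1, in the order:
# LS test-clean, CREMA-D, RAVDESS, SAVEE, TESS
scores = {
    "OuteTTS":  [0.8836, 0.7359, 0.7696, 0.5394, 0.8207],
    "SpeechT5": [0.9099, 0.7618, 0.8265, 0.7987, 0.8521],
    "VALL-E X": [0.9010, 0.7412, 0.7382, 0.7674, 0.7832],
    "Whisper":  [0.9014, 0.7284, 0.6972, 0.7725, 0.8188],
    "XTTS-v2":  [0.8881, 0.8060, 0.8539, 0.8135, 0.8167],
}
averages = {model: sum(vals) / len(vals) for model, vals in scores.items()}
best = max(averages, key=averages.get)  # XTTS-v2, per the leaderboard
```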

6. Best Practices and Extensibility

ClonEval prescribes the following standards for robust benchmarking:

  • Always treat the model as a pure black box; no parameter tuning per evaluation set.
  • Use the provided, pre-defined sample → prompt mappings to ensure reproducibility.
  • Employ the distributed pretrained WavLM checkpoint without modification.
  • Resample all audio to 16 kHz prior to evaluation.
  • Provide both ClonEval and embedding model versions when submitting results.
  • Share synthesized .wav outputs, permitting verification and future benchmarking.
  • To expand the benchmark (new language, emotion), insert new data into the appropriate input directories and rerun the pipeline; no further modification is needed.
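Any standard resampler (e.g., librosa.resample or torchaudio) satisfies the 16 kHz rule; as a dependency-free illustration only, a naive linear-interpolation resampler in NumPy (real downsampling pipelines need anti-aliasing):

```python
import numpy as np

def resample_linear(signal, sr_in, sr_out=16000):
    """Naive linear-interpolation resampler (illustration only: a real
    pipeline should use a proper resampler with anti-aliasing when
    downsampling, e.g. librosa.resample or torchaudio)."""
    signal = np.asarray(signal, dtype=float)
    n_out = int(round(len(signal) * sr_out / sr_in))
    t_in = np.arange(len(signal)) / sr_in    # original sample instants
    t_out = np.arange(n_out) / sr_out        # target sample instants
    return np.interp(t_out, t_in, signal)
```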

7. Relation to Other "ClonEval" Paradigms

Software Clone Detection (EvaClone)

The name "ClonEval" is also associated with the EvaClone meta-framework for clone-detection tool evaluation in software engineering. EvaClone decouples the effects of tool configuration via genetic algorithm-based optimization, maximizing inter-tool agreement or other user-specified fitness metrics (Gauci, 2015). Unlike the voice cloning ClonEval, EvaClone focuses on optimizing detection threshold parameters and harmonization of multiple detectors' outputs on software codebases. The underlying philosophy—objective, reproducible, tool-neutral evaluation—remains consistent, suggesting a convergence of methodology across domains.

Tumor Clonality Assessment

In cancer genomics, "ClonEval"-style tests refer to statistical frameworks for evaluating the clonal relatedness of tumor pairs. These rely on likelihood models comparing patterns of somatic mutations, with exact conditioning on observed loci and simulation-based p-value calibration (Ostrovnaya et al., 2015). The focus is rigorous reproducibility and transparent weighting of evidence, concepts closely aligned to those that motivate ClonEval in TTS.

Summary

ClonEval defines a domain standard for rigorous, fully automated, and reproducible benchmarking in voice-cloning TTS, replacing subjective ratings with objective, scientifically grounded similarity metrics and transparent protocols. Its guiding principles of objectivity, transparency, extensibility, and comparability have close analogs in code-clone detection and somatic mutation analysis. ClonEval thus epitomizes the current direction in machine learning evaluation: benchmarking frameworks that maximize both fairness and practical utility while remaining extensible to future modalities and domains (Christop et al., 29 Apr 2025; Gauci, 2015; Ostrovnaya et al., 2015).
