
VCTK-2mix Speech Separation Test Set

Updated 4 March 2026
  • VCTK-2mix is a two-speaker mixture dataset offering an out-of-distribution benchmark for evaluating speech separation models’ generalization across accents and recording conditions.
  • The dataset is constructed by pairing utterances from 108 speakers using energy-based VAD and loudness normalization with varying SNR protocols.
  • Benchmark results reveal significant SI-SNR performance drops when models trained on in-domain data are tested on VCTK-2mix, highlighting challenges of domain mismatch.

VCTK-2mix is a two-speaker mixture dataset derived from the CSTR VCTK corpus of read English speech. It is designed as a test-only benchmark to evaluate the generalization ability of single-channel speech separation models, particularly when these models are trained on different domains or speaker populations. VCTK-2mix has emerged as a critical out-of-distribution evaluation set, supplementing established corpora such as WSJ0-2mix, LibriMix, and WHAM!, by representing speech with broader accent diversity and more conversational scenarios (Kadioglu et al., 2020, Cosentino et al., 2020). The dataset is used extensively for benchmarking time-domain separation architectures, notably Conv-TasNet and its enhanced variants.

1. Dataset Construction Protocols

VCTK-2mix is prepared from the CSTR VCTK corpus, which includes recordings from 108–109 native English speakers of varying genders and accents. The construction process comprises the following steps:

  • Speaker Selection and Preprocessing: All available speakers are included; extended silences are removed using an energy-based VAD (threshold: 20 dB), resulting in 108 speakers in the final set (Cosentino et al., 2020). No gender or accent stratification is imposed.
  • Speaker Pairing and Mixing Procedure: For each mixture, two utterances from different speakers are randomly sampled and loudness-normalized. In one protocol, SNR for mixing is sampled uniformly between –5 dB and +5 dB (Kadioglu et al., 2020); in another, according to LUFS (ITU-R BS.1770-4), with speech and noise loudness sampled as $U_s \sim \mathrm{Uniform}(-33, -25)\,\mathrm{LUFS}$ and $U_n \sim \mathrm{Uniform}(-38, -30)\,\mathrm{LUFS}$, respectively (Cosentino et al., 2020).
  • Mixing Equation: The mixture is created as $x = s_1 + \alpha s_2$, with $\alpha$ selected such that $10 \log_{10} \frac{\|s_1\|^2}{\|\alpha s_2\|^2} = \mathrm{SNR}$ (Kadioglu et al., 2020).
  • Temporal Alignment and Modes: Both sources are fully overlapped (synchronous start). Mixtures are generated in both "min" (truncate to shortest utterance) and "max" (zero-pad to longest) modes, with audio resampled at either 8 kHz (Kadioglu et al., 2020) or 16 kHz (Cosentino et al., 2020).
  • Dataset Split and Size: VCTK-2mix is strictly a test set, with no training or validation splits. Sets contain either 3,000 (Cosentino et al., 2020) or 4,000 (Kadioglu et al., 2020) mixtures. Each mixture is 4 s long in (Kadioglu et al., 2020), yielding approximately 4.44 hours of evaluation data.
  • File Structure: The directory structure and file naming are identical to LibriMix, with zero-padded indices and per-source reference tracks.
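The SNR-based mixing rule and the "min"/"max" alignment modes described above can be illustrated with a minimal NumPy sketch. The function names (`align`, `mix_at_snr`), the random stand-in signals, and the seed are assumptions made for illustration; this is not the authors' generation script, and VAD and loudness normalization are left out.

```python
import numpy as np

def align(s1, s2, mode="min"):
    """Bring two fully overlapped sources to a common length."""
    if mode == "min":  # truncate to the shortest utterance
        n = min(len(s1), len(s2))
        return s1[:n], s2[:n]
    if mode == "max":  # zero-pad to the longest utterance
        n = max(len(s1), len(s2))
        return np.pad(s1, (0, n - len(s1))), np.pad(s2, (0, n - len(s2)))
    raise ValueError("mode must be 'min' or 'max'")

def mix_at_snr(s1, s2, snr_db):
    """Return (x, alpha) with x = s1 + alpha * s2 and
    10 * log10(||s1||^2 / ||alpha * s2||^2) = snr_db."""
    p1 = np.sum(s1 ** 2)
    p2 = np.sum(s2 ** 2)
    alpha = np.sqrt(p1 / (p2 * 10 ** (snr_db / 10)))
    return s1 + alpha * s2, alpha

# Stand-in signals (white noise) in place of real VCTK utterances.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(32000)
s2 = rng.standard_normal(24000)

snr_db = rng.uniform(-5.0, 5.0)   # Uniform(-5, +5) dB protocol
a, b = align(s1, s2, mode="min")
x, alpha = mix_at_snr(a, b, snr_db)
```

Because zero-padding adds no energy, the same gain `alpha` satisfies the SNR constraint in either mode; only the mixture length differs.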

A summary table of construction parameters from major protocols follows:

| Protocol Source | # Mixtures | Speakers | Sampling Rate | SNR Sampling | Modes |
|---|---|---|---|---|---|
| (Kadioglu et al., 2020) | 4,000 | Not specified | 8 kHz | Uniform(–5, +5) dB | Only max |
| (Cosentino et al., 2020) | 3,000 | 108 | 16 kHz, 8 kHz | LUFS-based as above | min, max |

2. Objective Evaluation Metrics

VCTK-2mix benchmarking primarily uses scale-invariant signal-to-noise ratio improvement (SI-SNRᵢ) as the main metric, alongside (optionally) source-to-distortion ratio improvement (SDRᵢ).

  • Scale-Invariant SNR (SI-SNR):

\begin{align*}
\alpha &= \frac{\langle \hat{s}, s \rangle}{\|s\|^2} \\
s_\mathrm{target} &= \alpha s \\
e_\mathrm{noise} &= \hat{s} - s_\mathrm{target} \\
\mathrm{SI\text{-}SNR}(\hat{s}, s) &= 10 \log_{10} \frac{\|s_\mathrm{target}\|^2}{\|e_\mathrm{noise}\|^2}
\end{align*}

(Kadioglu et al., 2020, Cosentino et al., 2020)

  • SI-SNR Improvement:

$$\mathrm{SI\text{-}SNR}_i = \mathrm{SI\text{-}SNR}(\hat{s}, s) - \mathrm{SI\text{-}SNR}(x, s)$$

  • Source-to-Distortion Ratio (SDR): Defined with BSS-Eval conventions; not always reported for VCTK-2mix (Cosentino et al., 2020).
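The SI-SNR and SI-SNR improvement definitions above translate directly into a few lines of NumPy. The sketch below follows the formulas as written; the small `eps` stabilizer, the function names, and the synthetic test signals are assumptions for illustration. Many toolkits also remove the mean of each signal before the projection, which is omitted here because the definition above does not include it.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and a reference."""
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)  # projection scale
    s_target = alpha * ref
    e_noise = est - s_target
    return 10 * np.log10((np.sum(s_target ** 2) + eps) /
                         (np.sum(e_noise ** 2) + eps))

def si_snr_improvement(est, ref, mix):
    """SI-SNR of the estimate minus SI-SNR of the unprocessed mixture."""
    return si_snr(est, ref) - si_snr(mix, ref)

# Synthetic example: a sine "source" plus a white-noise interferer.
rng = np.random.default_rng(0)
t = np.arange(16000) / 16000.0
ref = np.sin(2 * np.pi * 220.0 * t)
interferer = rng.standard_normal(16000)

mix = ref + interferer          # unprocessed mixture
est = ref + 0.1 * interferer    # interference attenuated 10x in amplitude

gain = si_snr_improvement(est, ref, mix)
```

Rescaling `est` by any non-zero constant leaves `si_snr(est, ref)` essentially unchanged, which is the property that distinguishes SI-SNR from plain SNR.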

Perceptual metrics such as PESQ and STOI are not reported for VCTK-2mix in the surveyed studies.

3. Use in Benchmarking and Experimental Protocols

VCTK-2mix serves as an out-of-distribution (OOD) test set for models trained on other speech separation corpora (WSJ0-2mix, LibriMix, LibriTTS):

  • No Training/Validation: VCTK-2mix is never used for fitting or tuning hyperparameters.
  • Cross-Dataset Generalization: The set quantifies degradation in SI-SNRᵢ when source separation models are tested outside their training distribution.
  • Mixture Types: Both clean (2spk-C, speech only) and noisy (2spk-N, speech + WHAM! noise) mixtures are available (Cosentino et al., 2020).

This benchmarking context is critical for advancing speech separation systems towards domain robustness and real-world applicability.

4. Empirical Results and Model Generalization

Key experimental findings across major studies consistently show significant generalization gaps when models are evaluated on VCTK-2mix versus their training corpus:

  • Conv-TasNet Trained on WSJ0-2mix: SI-SNRᵢ ≈ 15.4 dB on in-domain test, but ≈ 9.1 dB on VCTK-2mix (Δ ≈ –6.3 dB) (Kadioglu et al., 2020).
  • Conv-TasNet Trained on LibriTTS: SI-SNRᵢ ≈ 17–17.5 dB in-domain, but 9.8–11.5 dB on VCTK-2mix, depending on encoder/decoder depth and loss (Kadioglu et al., 2020).
  • Conv-TasNet Trained on LibriMix Train-360 (16 kHz): SI-SNRᵢ ≈ 12.1 dB (clean), ≈10.8 dB (noisy) (Cosentino et al., 2020).
  • Conv-TasNet Trained on WHAM!: SI-SNRᵢ ≈ 8.2 dB to ≈ 6.8 dB on VCTK-2mix; a further drop with noisy mixtures.
  • Effect of Enhanced Architectures: Deep (non-linear, multi-layer) encoder/decoder yields 0.3–1.0 dB gain; addition of an augmented "power-law" loss term yields up to 1.7 dB gain (Kadioglu et al., 2020).

Performance drops of 4–7 dB SI-SNRᵢ relative to in-domain test sets are typical, reflecting the challenge of robust speech separation in real-world speaker conditions.

5. Architectural and Training Recommendations

Research leveraging VCTK-2mix highlights several protocols and design enhancements to improve cross-corpus robustness:

  • Encoder/Decoder Complexity: Increasing the depth and non-linearity of the (time-domain) encoder/decoder (e.g., via stacked 1-D convolutional layers with PReLU or GLU activations, with or without dilation) enhances generalizability (Kadioglu et al., 2020).
  • Augmented Loss Functions: Combining SI-SNR with a power-law magnitude loss in the STFT domain ($\| |\mathrm{STFT}(\hat{s})|^{\alpha} - |\mathrm{STFT}(s)|^{\alpha} \|_1$, with $\alpha = 0.5$, $\beta = 0.01$) stabilizes training and further improves SI-SNRᵢ both in- and out-of-domain (Kadioglu et al., 2020).
  • Training Data Diversity: Larger speaker pools, expanded vocabulary, and more varied recording conditions in LibriMix/LibriTTS translate to smaller generalization error on VCTK-2mix compared to systems trained on WSJ0-2mix/WHAM! (Cosentino et al., 2020).
  • Loudness Normalization: LUFS-based loudness normalization yields more realistic audio level variability than simple power-scaling, improving robustness (Cosentino et al., 2020).
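The augmented loss mentioned above can be sketched in NumPy as follows. Several details are assumptions, since the text does not specify them: β is taken to be the weight on the power-law term, the combined objective is taken to be negative SI-SNR plus β times the L1 term, and the STFT settings (Hann window, `n_fft=256`, `hop=64`) and all function names are illustrative. A training setup would use a differentiable framework; this version only shows the arithmetic.

```python
import numpy as np

def si_snr(est, ref, eps=1e-8):
    """Scale-invariant SNR in dB."""
    alpha = np.dot(est, ref) / (np.dot(ref, ref) + eps)
    s_target = alpha * ref
    e_noise = est - s_target
    return 10 * np.log10((np.sum(s_target ** 2) + eps) /
                         (np.sum(e_noise ** 2) + eps))

def stft_mag(x, n_fft=256, hop=64):
    """Magnitude STFT with a Hann window (assumed settings)."""
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=-1))

def power_law_l1(est, ref, alpha=0.5):
    """|| |STFT(est)|^alpha - |STFT(ref)|^alpha ||_1"""
    return np.sum(np.abs(stft_mag(est) ** alpha - stft_mag(ref) ** alpha))

def augmented_loss(est, ref, alpha=0.5, beta=0.01):
    """Assumed combination: -SI-SNR + beta * power-law L1 term."""
    return -si_snr(est, ref) + beta * power_law_l1(est, ref, alpha)

# Synthetic check: a cleaner estimate should yield a lower loss.
rng = np.random.default_rng(0)
ref = np.sin(2 * np.pi * 0.01 * np.arange(4000))
noise = rng.standard_normal(4000)
loss_good = augmented_loss(ref + 0.01 * noise, ref)
loss_bad = augmented_loss(ref + noise, ref)
```

The exponent α = 0.5 compresses the magnitude spectrum, so low-energy time-frequency bins contribute more to the penalty than they would under a plain magnitude difference.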

The following procedural recommendations derive from these empirical findings:

  1. Use LibriMix train-360 (clean, noisy) for training to maximize generalization.
  2. Adopt LUFS-based level normalization in data preparation.
  3. Apply noise speed-perturbation to further diversify noise instances.
  4. Use SI-SNR loss (possibly augmented) and standardized evaluation splits.

6. Directory Structure and Benchmarking Best Practices

VCTK-2mix's distribution closely follows that of LibriMix for reproducibility and interoperability with existing speech separation pipelines. Directory structure is as follows (at 16 kHz, max-mode):

VCTK-2mix/
  └─ test/
      ├─ 2spk-C/
      │   ├─ mixture/
      │   ├─ s1/
      │   └─ s2/
      └─ 2spk-N/
          ├─ mixture/
          ├─ s1/
          └─ s2/
File naming is uniform: mixture_0001.wav, s1_0001.wav, s2_0001.wav, etc.

Benchmarking is performed on both clean (2spk-C) and noise-added (2spk-N) modes, using standardized evaluation scripts and segmentations. Observed generalization errors on VCTK-2mix should be reported alongside results for WSJ0-2mix and LibriMix to provide a comprehensive view of system robustness.
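The layout and naming scheme above can be traversed with the standard library alone. The sketch below assumes that each file sits in the subfolder matching its prefix (for example `mixture/mixture_0001.wav`, `s1/s1_0001.wav`); the helper name `list_triples` and the throwaway directory built for the demonstration are hypothetical and use empty placeholder files rather than real audio.

```python
import tempfile
from pathlib import Path

def list_triples(root, condition="2spk-C"):
    """Collect (mixture, s1, s2) paths that share the same index."""
    base = Path(root) / "test" / condition
    triples = []
    for mix_path in sorted((base / "mixture").glob("mixture_*.wav")):
        index = mix_path.stem.split("_", 1)[1]   # e.g. "0001"
        s1 = base / "s1" / f"s1_{index}.wav"
        s2 = base / "s2" / f"s2_{index}.wav"
        if s1.exists() and s2.exists():          # skip incomplete entries
            triples.append((mix_path, s1, s2))
    return triples

# Demonstration on a temporary tree with empty placeholder files.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "VCTK-2mix"
    for kind in ("mixture", "s1", "s2"):
        folder = root / "test" / "2spk-C" / kind
        folder.mkdir(parents=True)
        for i in (1, 2):
            (folder / f"{kind}_{i:04d}.wav").touch()
    triples = list_triples(root)
    names = [tuple(p.name for p in triple) for triple in triples]
```

Passing `condition="2spk-N"` would walk the noisy subset in the same way, which keeps clean and noisy evaluations on identical index sets.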


VCTK-2mix serves as a reproducible, challenging cross-dataset evaluation set for single-channel speech separation. Its diversity and public availability make it a principal resource for reporting generalization performance and for guiding future methods in architecture design, dataset construction, and loss formulation (Kadioglu et al., 2020, Cosentino et al., 2020).
