VCTK-2mix Speech Separation Test Set
- VCTK-2mix is a two-speaker mixture dataset offering an out-of-distribution benchmark for evaluating speech separation models’ generalization across accents and recording conditions.
- The dataset is constructed by pairing utterances from 108 speakers using energy-based VAD and loudness normalization with varying SNR protocols.
- Benchmark results reveal significant SI-SNR performance drops when models trained on in-domain data are tested on VCTK-2mix, highlighting challenges of domain mismatch.
VCTK-2mix is a two-speaker mixture dataset derived from the CSTR VCTK corpus of read English speech. It is designed as a test-only benchmark to evaluate the generalization ability of single-channel speech separation models, particularly when these models are trained on different domains or speaker populations. VCTK-2mix has emerged as a critical out-of-distribution evaluation set, supplementing established corpora such as WSJ0-2mix, LibriMix, and WHAM! by representing speech with broader accent diversity and different recording conditions (Kadioglu et al., 2020, Cosentino et al., 2020). The dataset is used extensively for benchmarking time-domain separation architectures, notably Conv-TasNet and its enhanced variants.
1. Dataset Construction Protocols
VCTK-2mix is prepared from the CSTR VCTK corpus, which includes recordings from 108–109 native English speakers of varying genders and accents. The construction process comprises the following steps:
- Speaker Selection and Preprocessing: All available speakers are included, and extended silences are removed using an energy-based VAD (threshold: 20 dB). The final set contains 108 speakers (Cosentino et al., 2020). No gender or accent stratification is imposed.
- Speaker Pairing and Mixing Procedure: For each mixture, two utterances from different speakers are randomly sampled and loudness-normalized. In one protocol, the mixing SNR is sampled uniformly between –5 dB and +5 dB (Kadioglu et al., 2020); in the other, levels are set in LUFS (ITU-R BS.1770-4), with speech and noise loudness sampled from $\mathcal{U}(-33, -25)$ LUFS and $\mathcal{U}(-38, -30)$ LUFS, respectively (Cosentino et al., 2020).
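- Mixing Equation: The mixture is created as $x = s_1 + \alpha s_2$, with the gain $\alpha$ selected such that $10 \log_{10}\left(\|s_1\|^2 / \|\alpha s_2\|^2\right)$ equals the sampled SNR (Kadioglu et al., 2020).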
- Temporal Alignment and Modes: Both sources are fully overlapped (synchronous start). Mixtures are generated in both "min" (truncate to shortest utterance) and "max" (zero-pad to longest) modes, with audio resampled at either 8 kHz (Kadioglu et al., 2020) or 16 kHz (Cosentino et al., 2020).
- Dataset Split and Size: VCTK-2mix is strictly a test set, with no training or validation splits. Sets contain either 3,000 (Cosentino et al., 2020) or 4,000 (Kadioglu et al., 2020) mixtures. In the protocol of Kadioglu et al. (2020), each mixture is 4 s long, so the 4,000 mixtures yield approximately 4.44 hours of evaluation data.
- File Structure: The directory structure and file naming are identical to LibriMix, with zero-padded indices and per-source reference tracks.
A summary table of construction parameters from major protocols follows:
| Protocol Source | # Mixtures | Speakers | Sampling Rate | SNR Sampling | Modes |
|---|---|---|---|---|---|
| (Kadioglu et al., 2020) | 4,000 | Not specified | 8 kHz | Uniform(–5, +5) dB | Only max |
| (Cosentino et al., 2020) | 3,000 | 108 | 16 kHz, 8 kHz | LUFS-based as above | min, max |
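The SNR-based mixing protocol described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' released script: the function name `mix_at_snr`, the choice to scale only the second source, and the use of white noise as stand-in audio are all assumptions made for the example.

```python
import numpy as np

def mix_at_snr(s1, s2, snr_db, mode="max"):
    """Mix two sources so that s1 sits snr_db above the scaled s2.

    Hypothetical helper: only s2 is rescaled, which is one of several
    equivalent ways to realize a target SNR.
    """
    s1 = np.asarray(s1, dtype=np.float64)
    s2 = np.asarray(s2, dtype=np.float64)
    if mode == "min":      # truncate both sources to the shorter one
        n = min(len(s1), len(s2))
        s1, s2 = s1[:n], s2[:n]
    elif mode == "max":    # zero-pad both sources to the longer one
        n = max(len(s1), len(s2))
        s1 = np.pad(s1, (0, n - len(s1)))
        s2 = np.pad(s2, (0, n - len(s2)))
    else:
        raise ValueError("mode must be 'min' or 'max'")
    e1, e2 = np.sum(s1 ** 2), np.sum(s2 ** 2)
    if e1 == 0 or e2 == 0:
        raise ValueError("sources must have non-zero energy")
    # Solve 10*log10(e1 / (alpha^2 * e2)) = snr_db for alpha.
    alpha = np.sqrt(e1 / (e2 * 10 ** (snr_db / 10)))
    s2_scaled = alpha * s2
    return s1 + s2_scaled, s1, s2_scaled, alpha

# Stand-in signals (white noise) at 8 kHz: 4 s and 3 s long.
rng = np.random.default_rng(0)
s1 = rng.standard_normal(8000 * 4)
s2 = rng.standard_normal(8000 * 3)
snr_db = rng.uniform(-5.0, 5.0)   # Uniform(-5, +5) dB, as in the SNR protocol
mixture, ref1, ref2, alpha = mix_at_snr(s1, s2, snr_db, mode="max")
```

The returned references are the signals actually present in the mixture, so they can be used directly as ground truth when scoring a separator.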
2. Objective Evaluation Metrics
VCTK-2mix benchmarking uses scale-invariant signal-to-noise ratio improvement (SI-SNRᵢ) as the main metric, optionally alongside source-to-distortion ratio improvement (SDRᵢ).
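- Scale-Invariant SNR (SI-SNR): For an estimate $\hat{s}$ and a reference $s$,
$$s_{\text{target}} = \frac{\langle \hat{s}, s \rangle}{\|s\|^2}\, s, \qquad e_{\text{noise}} = \hat{s} - s_{\text{target}}, \qquad \text{SI-SNR}(\hat{s}, s) = 10 \log_{10} \frac{\|s_{\text{target}}\|^2}{\|e_{\text{noise}}\|^2}$$
(Kadioglu et al., 2020, Cosentino et al., 2020)
- SI-SNR Improvement: $\text{SI-SNR}_i = \text{SI-SNR}(\hat{s}, s) - \text{SI-SNR}(x, s)$, where $x$ is the unprocessed input mixture.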
- Source-to-Distortion Ratio (SDR): Defined with BSS-Eval conventions; not always reported for VCTK-2mix (Cosentino et al., 2020).
Perceptual metrics such as PESQ and STOI are not reported for VCTK-2mix in the surveyed studies.
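As a concrete reference for the metrics in this section, the following is a minimal NumPy sketch of SI-SNR and SI-SNRᵢ using the standard projection-based definition. The function names, the mean removal before projection (a common convention), and the small `eps` stabilizer are assumptions of this sketch rather than details taken from the cited studies.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """Scale-invariant SNR in dB between an estimate and its reference."""
    estimate = np.asarray(estimate, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    # Remove DC offset (assumed convention, not stated in the text).
    estimate = estimate - estimate.mean()
    target = target - target.mean()
    # Project the estimate onto the target to get the scaled reference.
    scale = np.dot(estimate, target) / (np.dot(target, target) + eps)
    s_target = scale * target
    e_noise = estimate - s_target
    ratio = (np.dot(s_target, s_target) + eps) / (np.dot(e_noise, e_noise) + eps)
    return 10 * np.log10(ratio)

def si_snr_improvement(estimate, target, mixture):
    """SI-SNR of the estimate minus SI-SNR of the unprocessed mixture."""
    return si_snr(estimate, target) - si_snr(mixture, target)

# Toy example: the "separator" attenuates the interferer by a factor of 10.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interferer = rng.standard_normal(16000)
mixture = target + interferer
estimate = target + 0.1 * interferer
improvement = si_snr_improvement(estimate, target, mixture)
```

For a two-speaker output, the metric is normally computed for the best assignment of estimates to references (permutation-invariant evaluation), which this sketch leaves out.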
3. Use in Benchmarking and Experimental Protocols
VCTK-2mix serves as an out-of-distribution (OOD) test set for models trained on other speech separation corpora (WSJ0-2mix, LibriMix, LibriTTS):
- No Training/Validation: VCTK-2mix is never used for fitting or tuning hyperparameters.
- Cross-Dataset Generalization: The set quantifies degradation in SI-SNRᵢ when source separation models are tested outside their training distribution.
- Mixture Types: Both clean (2spk-C, speech only) and noisy (2spk-N, speech + WHAM! noise) mixtures are available (Cosentino et al., 2020).
This benchmarking context is critical for advancing speech separation systems towards domain robustness and real-world applicability.
4. Empirical Results and Model Generalization
Key experimental findings across major studies consistently show significant generalization gaps when models are evaluated on VCTK-2mix versus their training corpus:
- Conv-TasNet Trained on WSJ0-2mix: SI-SNRᵢ ≈ 15.4 dB on in-domain test, but ≈ 9.1 dB on VCTK-2mix (Δ ≈ –6.3 dB) (Kadioglu et al., 2020).
- Conv-TasNet Trained on LibriTTS: SI-SNRᵢ ≈ 17–17.5 dB in-domain, but 9.8–11.5 dB on VCTK-2mix, depending on encoder/decoder depth and loss (Kadioglu et al., 2020).
- Conv-TasNet Trained on LibriMix Train-360 (16 kHz): SI-SNRᵢ ≈ 12.1 dB (clean), ≈10.8 dB (noisy) (Cosentino et al., 2020).
- Conv-TasNet Trained on WHAM!: SI-SNRᵢ ≈ 8.2 dB to ≈ 6.8 dB on VCTK-2mix; a further drop with noisy mixtures.
- Effect of Enhanced Architectures: Deep (non-linear, multi-layer) encoder/decoder yields 0.3–1.0 dB gain; addition of an augmented "power-law" loss term yields up to 1.7 dB gain (Kadioglu et al., 2020).
Performance drops of 4–7 dB SI-SNRᵢ relative to in-domain test sets are typical, reflecting the challenge of robust speech separation in real-world speaker conditions.
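The gaps quoted above are simple differences between out-of-domain and in-domain SI-SNRᵢ. The short sketch below reproduces that arithmetic from the reported figures; the helper name `generalization_gap` is hypothetical, and the LibriTTS values are only best-case and worst-case bounds, since the text gives ranges rather than paired per-model results.

```python
def generalization_gap(in_domain_db, ood_db):
    """Out-of-domain minus in-domain SI-SNRi in dB (negative means a drop)."""
    return round(ood_db - in_domain_db, 1)

# Conv-TasNet trained on WSJ0-2mix: 15.4 dB in-domain, 9.1 dB on VCTK-2mix.
wsj0_gap = generalization_gap(15.4, 9.1)

# Conv-TasNet trained on LibriTTS: bounds from the reported ranges only.
libritts_smallest_drop = generalization_gap(17.0, 11.5)
libritts_largest_drop = generalization_gap(17.5, 9.8)

print(f"WSJ0-2mix -> VCTK-2mix: {wsj0_gap} dB")
print(f"LibriTTS -> VCTK-2mix: between {libritts_smallest_drop} and {libritts_largest_drop} dB")
```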
5. Architectural and Training Recommendations
Research leveraging VCTK-2mix highlights several protocols and design enhancements to improve cross-corpus robustness:
- Encoder/Decoder Complexity: Increasing the depth and non-linearity of the (time-domain) encoder/decoder (e.g., via stacked 1-D convolutional layers with PReLU or GLU activations, with or without dilation) enhances generalizability (Kadioglu et al., 2020).
- Augmented Loss Functions: Combining SI-SNR with a power-law magnitude loss in the STFT domain stabilizes training and further improves SI-SNRᵢ both in- and out-of-domain (Kadioglu et al., 2020).
- Training Data Diversity: Larger speaker pools, expanded vocabulary, and more varied recording conditions in LibriMix/LibriTTS translate to smaller generalization error on VCTK-2mix compared to systems trained on WSJ0-2mix/WHAM! (Cosentino et al., 2020).
- Loudness Normalization: LUFS-based loudness normalization yields more realistic audio level variability than simple power-scaling, improving robustness (Cosentino et al., 2020).
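To make the idea of a power-law magnitude term in the STFT domain more concrete, here is a rough NumPy sketch of one plausible form: a mean squared error between power-compressed STFT magnitudes. The exact formulation and hyperparameters used by Kadioglu et al. are not given here, so the exponent `power=0.3`, the frame settings, and both function names are illustrative assumptions. NumPy is used only to show the computation; training would require an autodiff framework.

```python
import numpy as np

def stft_magnitude(x, n_fft=512, hop=128):
    """Magnitude STFT with a Hann window (no centering or padding at edges)."""
    x = np.asarray(x, dtype=np.float64)
    if len(x) < n_fft:
        x = np.pad(x, (0, n_fft - len(x)))
    window = np.hanning(n_fft)
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack(
        [x[i * hop : i * hop + n_fft] * window for i in range(n_frames)]
    )
    return np.abs(np.fft.rfft(frames, axis=1))

def power_law_loss(estimate, target, power=0.3, n_fft=512, hop=128):
    """MSE between power-compressed magnitudes (assumed form, see lead-in)."""
    mag_est = stft_magnitude(estimate, n_fft, hop) ** power
    mag_ref = stft_magnitude(target, n_fft, hop) ** power
    return float(np.mean((mag_est - mag_ref) ** 2))

rng = np.random.default_rng(0)
reference = rng.standard_normal(8000)
degraded = reference + 0.5 * rng.standard_normal(8000)
loss_same = power_law_loss(reference, reference)
loss_degraded = power_law_loss(degraded, reference)
```

In a combined objective, a term like this would be added to the negative SI-SNR with a weighting factor, which is another hyperparameter not specified in the text above.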
The following procedural recommendations derive from these empirical findings:
- Use LibriMix train-360 (clean, noisy) for training to maximize generalization.
- Adopt LUFS-based level normalization in data preparation.
- Apply noise speed-perturbation to further diversify noise instances.
- Use SI-SNR loss (possibly augmented) and standardized evaluation splits.
6. Directory Structure and Benchmarking Best Practices
VCTK-2mix's distribution closely follows that of LibriMix for reproducibility and interoperability with existing speech separation pipelines. Directory structure is as follows (at 16 kHz, max-mode):
    VCTK-2mix/
    └─ test/
       ├─ 2spk-C/
       │  ├─ mixture/
       │  ├─ s1/
       │  └─ s2/
       └─ 2spk-N/
          ├─ mixture/
          ├─ s1/
          └─ s2/

Files are named with zero-padded indices, e.g., mixture_0001.wav, s1_0001.wav, s2_0001.wav.
Benchmarking is performed on both the clean (2spk-C) and noise-added (2spk-N) conditions, using standardized evaluation scripts and segmentations. Observed generalization errors on VCTK-2mix should be reported alongside results for WSJ0-2mix and LibriMix to provide a comprehensive view of system robustness.
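Given the layout and naming scheme described in this section, an evaluation script mainly needs to pair each mixture with its two reference tracks. The sketch below shows one way to do that with `pathlib`; the helper name `list_triplets` is hypothetical, the `mixture_XXXX.wav` / `s1_XXXX.wav` / `s2_XXXX.wav` pattern is taken from the listing above, and the demo builds an empty toy tree in a temporary directory rather than reading real audio.

```python
import tempfile
from pathlib import Path

def list_triplets(root, condition="2spk-C"):
    """Return (mixture, s1, s2) paths for every index with all three files."""
    base = Path(root) / "test" / condition
    triplets = []
    for mix_path in sorted((base / "mixture").glob("mixture_*.wav")):
        index = mix_path.stem.split("_", 1)[1]      # e.g. "0001"
        s1_path = base / "s1" / f"s1_{index}.wav"
        s2_path = base / "s2" / f"s2_{index}.wav"
        if s1_path.exists() and s2_path.exists():   # skip incomplete entries
            triplets.append((mix_path, s1_path, s2_path))
    return triplets

# Demo on a toy tree with two empty placeholder files per folder.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp) / "VCTK-2mix"
    for sub in ("mixture", "s1", "s2"):
        folder = root / "test" / "2spk-C" / sub
        folder.mkdir(parents=True)
        for i in (1, 2):
            (folder / f"{sub}_{i:04d}.wav").touch()
    demo_names = [tuple(p.name for p in t) for t in list_triplets(root)]

print(demo_names)
```

Passing `condition="2spk-N"` would walk the noisy subset in the same way.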
VCTK-2mix serves as a reproducible, challenging cross-dataset evaluation set for single-channel speech separation. Its diversity and public availability make it a principal resource for reporting generalization performance and for guiding future methods in architecture design, dataset construction, and loss formulation (Kadioglu et al., 2020, Cosentino et al., 2020).