Libri2Mix: Benchmark Dataset for Speech Separation
- Libri2Mix is a large-scale, open-source corpus featuring two-speaker mixtures, diverse speaker profiles, and realistic noise from varied environments.
- It employs LUFS loudness normalization and augmentation with non-stationary WHAM! noise to generate both clean and noisy mixtures, making training conditions more realistic.
- The dataset offers standard splits, cross-dataset evaluations (e.g., VCTK-2mix), and improved SI-SDR benchmarks, supporting robust speech separation research.
Libri2Mix is a large-scale, fully open-source corpus designed for single-channel two-speaker speech separation, constructed from LibriSpeech “clean” subsets and non-stationary ambient noise from WHAM!. Developed to address generalization issues identified with models trained solely on wsj0-2mix, Libri2Mix provides comprehensive clean and noisy mixture sets, broad speaker and vocabulary diversity, per-mixture metadata, and additional cross-dataset and sparsely overlapping test sets. The resource supports robust benchmarking and cross-domain evaluation for modern speech separation architectures, and models trained on it generalize better to naturalistic, conversation-like scenarios (Cosentino et al., 2020).
1. Dataset Composition and Structure
Libri2Mix is derived from LibriSpeech “clean” subsets (train-360, train-100, dev-clean, test-clean), comprising approximately 470 hours of read English speech from 1,252 speakers, sampled at 16 kHz. Ambient noise is sourced from the WHAM! corpus, which includes background environments such as coffee shops, restaurants, and bars, and is distributed under the CC BY-NC 4.0 license. The corpus includes paired two-speaker mixtures in both clean (speech only) and noisy (speech plus WHAM! noise) variants.
Standard splits and their approximate scale are as follows:
| Split | # Mixtures | Duration (hours) | Speakers* |
|---|---|---|---|
| train-360 | 50,800 | ~212 | 921 |
| train-100 | 13,900 | ~58 | 251 |
| dev | 3,000 | ~11 | 40 |
| test | 3,000 | ~11 | 40 |
*Unique speakers per split, inherited from the corresponding LibriSpeech subset; the four counts sum to the 1,252 speakers noted above.
Complementary resources include VCTK-2mix (3,000 mixtures, 9 h, 108 speakers, with matched WHAM! noise) for cross-dataset evaluation and SparseLibri2Mix (3,000 mixtures, 6 h) for benchmarking under varying degrees of inter-speaker overlap.
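As a quick sanity check on the figures above, the average mixture length implied by each set can be derived from its mixture count and total duration. The sketch below uses only the approximate numbers quoted in this section; the dictionary and function names are illustrative, not part of any Libri2Mix tooling.

```python
# Approximate (mixtures, hours) per set, copied from the figures quoted above.
SETS = {
    "train-360": (50_800, 212),
    "train-100": (13_900, 58),
    "dev": (3_000, 11),
    "test": (3_000, 11),
    "VCTK-2mix": (3_000, 9),
    "SparseLibri2Mix": (3_000, 6),
}


def mean_mixture_seconds(n_mixtures: int, hours: float) -> float:
    """Average mixture length in seconds implied by a count and a total duration."""
    return hours * 3600.0 / n_mixtures


if __name__ == "__main__":
    for name, (n, h) in SETS.items():
        print(f"{name:>16}: ~{mean_mixture_seconds(n, h):.1f} s per mixture")
```

With these rounded inputs the training splits come out at roughly 15 s per mixture and dev/test at roughly 13 s. Because the hour totals are approximate, the results are only indicative.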
2. Mixture Generation and Loudness Normalization
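Mixtures are created using the formula:
$$x(t) = \sum_{i=1}^{N} \alpha_i\, s_i(t) + \alpha_n\, n(t)$$
where $s_i$ is the $i$-th source utterance, $n$ is the noise signal (omitted for clean mixtures), $N = 2$ for Libri2Mix, and each gain $\alpha_i$ is selected to achieve a target signal-to-noise ratio (SNR) or, more specifically in Libri2Mix, to ensure auditory realism via perceptual loudness normalization. Unlike classical signal-power scaling (as in wsj0-2mix), which sets $\alpha_i$ from signal power to match an SNR drawn from a uniform distribution, Libri2Mix employs LUFS (Loudness Units relative to Full Scale, ITU-R BS.1770-4) normalization. Each utterance is normalized to a random target loudness $L_i^{\text{target}}$, drawn uniformly between $-33$ and $-25$ LUFS, and the gain
$$\alpha_i = 10^{(L_i^{\text{target}} - L_i)/20}$$
is applied, where $L_i$ is the measured loudness of $s_i$ in LUFS. Final mixtures are constructed by summing the scaled sources. This strategy yields an empirical SNR distribution in “clean” mixtures of approximately $\mathcal{N}(0, 4.1^2)$ dB.
For “noisy” mixtures, noise loudness is drawn uniformly between $-38$ and $-30$ LUFS, resulting in SNRs (speech vs. noise) of approximately $\mathcal{N}(-2, 3.6^2)$ dB. To prevent digital overflow, all outputs are peak-normalized so that the maximum absolute amplitude does not exceed 0.9.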
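The loudness-based mixing described above can be sketched in a few lines. This is a minimal illustration and not the corpus generation script. It uses plain RMS level in dB as a stand-in for a true ITU-R BS.1770 loudness meter (no K-weighting or gating), and the target range and the 0.9 peak ceiling are assumptions of the sketch. All function names are hypothetical.

```python
import math
import random


def level_db(x):
    """RMS level in dB relative to full scale.

    Stand-in for a loudness meter: real LUFS (ITU-R BS.1770) adds K-weighting
    and gating, which this sketch deliberately omits.
    """
    rms = math.sqrt(sum(v * v for v in x) / len(x))
    return 20.0 * math.log10(max(rms, 1e-12))


def gain_to_target(measured_db, target_db):
    """Linear gain that moves a signal from measured_db to target_db."""
    return 10.0 ** ((target_db - measured_db) / 20.0)


def mix(sources, targets_db, max_amp=0.9):
    """Scale each source to its target level, sum, and limit the peak.

    max_amp is an assumed peak ceiling used to avoid clipping on export.
    Returns (mixture, scaled_sources); both are rescaled together if the
    mixture peak exceeds max_amp, so the mixture stays the sum of its sources.
    """
    scaled = [
        [gain_to_target(level_db(s), t) * v for v in s]
        for s, t in zip(sources, targets_db)
    ]
    mixture = [sum(vals) for vals in zip(*scaled)]
    peak = max(abs(v) for v in mixture)
    if peak > max_amp:
        k = max_amp / peak
        mixture = [k * v for v in mixture]
        scaled = [[k * v for v in s] for s in scaled]
    return mixture, scaled


if __name__ == "__main__":
    rng = random.Random(0)
    # Synthetic stand-ins for two utterances at different recording levels.
    s1 = [rng.uniform(-0.5, 0.5) for _ in range(16000)]
    s2 = [rng.uniform(-0.1, 0.1) for _ in range(16000)]
    # Hypothetical per-utterance targets, drawn uniformly from an assumed range.
    targets = [rng.uniform(-33.0, -25.0) for _ in range(2)]
    mixture, (a, b) = mix([s1, s2], targets)
    print(f"SNR of s1 vs s2: {level_db(a) - level_db(b):.2f} dB")
```

Because each source receives an independent random target, the resulting SNR is simply the difference of the two targets. This is why independent loudness draws give a spread of SNRs rather than a single fixed value.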
3. Noise Augmentation and Background Scenarios
Noisy Libri2Mix mixtures leverage WHAM! noise segments, selected at random for each mixture from splits corresponding to train/dev/test. For the large train-360 set, unique noise coverage is increased by speed-perturbing WHAM! noises by factors 0.8 and 1.2. WHAM! noises are dynamic and non-stationary, yielding more realistic and variable background environments than stationary alternatives. Each noise segment is gain-scaled to the target LUFS and added to the speech mixture to achieve the sampled loudness ratio.
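The speed-perturbation step for the train-360 noises can be illustrated as below. This is a simplified sketch under stated assumptions: it uses linear interpolation in place of a proper band-limited resampler, and `speed_perturb` and `augment_noise` are hypothetical helper names rather than functions from the Libri2Mix scripts.

```python
def speed_perturb(x, factor):
    """Resample x by linear interpolation so it plays `factor` times faster.

    factor > 1 shortens the signal, factor < 1 lengthens it. Played back at
    the original sampling rate this changes both tempo and pitch. Linear
    interpolation is a simplification of a band-limited resampler.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not x:
        return []
    n_out = max(1, int(round(len(x) / factor)))
    out = []
    for i in range(n_out):
        pos = min(i * factor, len(x) - 1)  # clamp to the last valid sample
        lo = int(pos)
        hi = min(lo + 1, len(x) - 1)
        frac = pos - lo
        out.append((1.0 - frac) * x[lo] + frac * x[hi])
    return out


def augment_noise(noise, factors=(0.8, 1.2)):
    """Return the original noise plus one speed-perturbed copy per factor."""
    return [noise] + [speed_perturb(noise, f) for f in factors]
```

Applied to one noise clip, `augment_noise` returns three variants (original, slower, faster), which triples the pool of distinct noise material without recording anything new.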
4. Data Splits, Special Test Sets, and Resampling Modes
Both Libri2Mix and its three-speaker sibling Libri3Mix provide four standard splits: train-360, train-100, dev, and test, available in both clean and noisy variants. Each split is distributed at 16 kHz (native) and 8 kHz (resampled), and two duration modes are provided: “min” (truncation to the shortest source) and “max” (zero-padding to the longest). Two special test sets complement these splits:
- VCTK-2mix: Designed for cross-corpus generalization evaluation, this set consists of 3,000 mixtures produced by combining VCTK speech and WHAM! noise under the same mixing protocol.
- SparseLibri2Mix: Contains six nominal overlap levels (0%, 20%, ..., 100%), built by concatenating sub-utterances (up to 15s per speaker) aligned at the frame level using the Montreal Forced Aligner, to better reflect naturally alternating conversational speech overlaps.
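The “min” and “max” duration modes mentioned above differ only in how sources of unequal length are aligned before summation. A minimal sketch, with a hypothetical helper name:

```python
def align_sources(sources, mode="min"):
    """Bring all sources to a common length.

    mode="min": truncate every source to the shortest one.
    mode="max": zero-pad every source to the longest one.
    """
    if mode == "min":
        n = min(len(s) for s in sources)
        return [list(s[:n]) for s in sources]
    if mode == "max":
        n = max(len(s) for s in sources)
        return [list(s) + [0.0] * (n - len(s)) for s in sources]
    raise ValueError(f"unknown mode: {mode!r}")
```

In “min” mode the speakers overlap for the whole mixture, whereas “max” mode leaves a tail in which only the longer utterance is active.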
5. Organization, Format, and Metadata
Libri2Mix employs a systematic directory structure reflecting mixture type and split.
Audio is stored as 16-bit PCM WAV files. Each example is named “{utt1ID}_{utt2ID}.wav” in each respective folder. Associated JSON manifests (e.g., train-360_clean.json) provide file paths for “mix”, “s1”, “s2”, and, for noisy splits, “noise”. All audio is normalized so that the maximum absolute amplitude does not exceed 0.9.
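To make the naming convention and manifest keys concrete, the sketch below builds and checks one manifest record. The folder layout it assumes (one subfolder per key under a split root) and the helper names are hypothetical and chosen only for illustration, so the actual on-disk structure may differ.

```python
import json
from pathlib import PurePosixPath


def manifest_entry(root, utt1_id, utt2_id, noisy=False):
    """Build one manifest record for a two-speaker mixture.

    Assumes one subfolder per key ("mix", "s1", "s2", "noise") under `root`;
    this layout is an assumption of the sketch, not a documented structure.
    """
    name = f"{utt1_id}_{utt2_id}.wav"
    keys = ["mix", "s1", "s2"] + (["noise"] if noisy else [])
    return {k: str(PurePosixPath(root) / k / name) for k in keys}


def validate_entry(entry, noisy=False):
    """Check that a record carries exactly the expected keys."""
    expected = {"mix", "s1", "s2"} | ({"noise"} if noisy else set())
    return set(entry) == expected


if __name__ == "__main__":
    # "uttA" and "uttB" are placeholder utterance IDs.
    entry = manifest_entry("train-360", "uttA", "uttB", noisy=True)
    print(json.dumps(entry, indent=2))
```

Checking the key set before training catches the common mistake of pointing a noisy-condition recipe at a clean manifest, which lacks the “noise” entry.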
LibriSpeech-derived material is under CC BY 4.0; WHAM! noise is under CC BY-NC 4.0. Libri2Mix is distributed under the most restrictive of the combined sources: CC BY 4.0 (clean), CC BY-NC 4.0 (noisy).
6. Evaluation, Baselines, and Generalization
Baseline evaluation with Conv-TasNet on Libri2Mix demonstrates substantial improvements in scale-invariant signal-to-distortion ratio (SI-SDR):
- At 16 kHz, SI-SDR improvement (SI-SDRi) reaches up to 16 dB (2 speakers, clean), 13.5 dB (2 speakers, noisy), ≈13 dB (3 speakers, clean), and ≈10.9 dB (3 speakers, noisy).
- Models trained on LibriMix train-360 exhibit only an ≈0.8 dB SI-SDR drop on the WHAM! test set, compared to an ≈4 dB loss for WHAM!-trained models evaluated on Libri2Mix.
- On VCTK-2mix, LibriMix-trained models outperform WHAM!-trained models by ≈3–4 dB SI-SDR.
The larger pool of training mixtures, speaker diversity, and increased vocabulary coverage (60k word types vs. 5k in WHAM!) are primary drivers of improved generalization.
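For reference, the SI-SDR metric and its improvement over the unprocessed mixture can be computed as sketched below. This follows the standard definition (projection of the estimate onto the reference, then the energy ratio of target to residual). Mean removal, which is often applied first, is omitted for brevity, and the function names are illustrative.

```python
import math


def si_sdr(estimate, reference, eps=1e-12):
    """Scale-invariant SDR in dB between an estimate and a reference signal."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    scale = dot / (ref_energy + eps)            # optimal scaling of the reference
    target = [scale * r for r in reference]     # part of the estimate explained by it
    residual = [e - t for e, t in zip(estimate, target)]
    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    return 10.0 * math.log10((target_energy + eps) / (residual_energy + eps))


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDRi: gain of the estimate over using the raw mixture as the estimate."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Because the reference is rescaled before the comparison, multiplying the estimate by any non-zero constant leaves the score unchanged, which is the property that distinguishes SI-SDR from plain SDR.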
7. Significance and Research Utility
Libri2Mix addresses major limitations of prior speech separation benchmarks—particularly the poor generalization of models trained on wsj0-2mix or WHAM! to new domains. Its open-source, large-scale design, enriched speaker and noise diversity, and multiple test sets (including cross-dataset and sparse overlap) enable rigorous, reproducible evaluation of modern deep learning-based source separation systems. The adoption of LUFS normalization and dynamic noise augmentation further advances the realism and variability of training conditions, making Libri2Mix a central resource for research targeting speech separation model robustness and cross-corpus generalization (Cosentino et al., 2020).