Libri2Mix Corpus for Speech Separation
- Libri2Mix is a specialized dataset for two-speaker single-channel speech separation, integrating clean LibriSpeech utterances with WHAM! noise samples.
- It employs precise loudness scaling and mixture generation, ensuring consistent benchmarks across training, validation, and test splits.
- The corpus offers diverse audio characteristics and cross-corpus evaluations, showcasing improved generalization over older datasets like WSJ0-2mix.
Libri2Mix is a subset of the LibriMix corpus specifically tailored for two-speaker single-channel speech separation tasks. It is constructed from the “clean” portions of LibriSpeech and augmented with ambient noise samples from WHAM! to support both clean and noisy mixture scenarios. Libri2Mix is designed to address the generalization limitations observed in models trained on datasets like WSJ0-2mix, providing large-scale, diverse, and open-source benchmarks for contemporary speech separation systems (Cosentino et al., 2020).
1. Dataset Composition and Splits
Libri2Mix sources its clean speech data from the LibriSpeech train-100, train-360, dev-clean, and test-clean subsets, with ambient noise clips provided by WHAM!. The corpus replicates the organization of WSJ0-2mix and WHAM! to maintain evaluation and training consistency, featuring two separate training sets, one validation (dev) set, and one test set:
| Split | Mixtures | Duration (h) |
|---|---|---|
| train-360 | 50,800 | ≈212 |
| train-100 | 13,900 | ≈58 |
| dev | 3,000 | ≈11 |
| test | 3,000 | ≈11 |
For each mixture, two speakers are selected at random, each contributing one utterance; utterances are drawn without replacement so that each one appears exactly once within a training split. The selected utterances are then loudness scaled and summed. An additional test set, VCTK-2mix, consists of 3,000 mixtures (≈9 h) built analogously from 108 VCTK speakers and WHAM! test-set noise, enabling fair cross-corpus generalization evaluation.
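A minimal sketch of this speaker-pairing step (the real LibriMix scripts work from precomputed metadata CSV files; the helper below and its inputs are hypothetical):

```python
import random
from collections import defaultdict

def pair_utterances(utterances, seed=0):
    """Pair utterances from two distinct speakers, using each utterance once.

    `utterances` is a list of (speaker_id, utterance_path) tuples, e.g. parsed
    from a LibriSpeech split; names and structure here are illustrative only.
    """
    rng = random.Random(seed)
    by_speaker = defaultdict(list)
    for spk, path in utterances:
        by_speaker[spk].append(path)
    for paths in by_speaker.values():
        rng.shuffle(paths)

    pairs = []
    # Repeatedly draw two different speakers that still have unused utterances.
    while True:
        available = [s for s, paths in by_speaker.items() if paths]
        if len(available) < 2:
            break
        spk1, spk2 = rng.sample(available, 2)
        pairs.append((by_speaker[spk1].pop(), by_speaker[spk2].pop()))
    return pairs
```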
SparseLibri2Mix provides a variant test set (3,000 mixtures, ≈6 h) for the two-speaker case, in which sub-utterances are concatenated into 15-second mixtures at six target overlap ratios: 0%, 20%, 40%, 60%, 80%, and 100% (500 mixtures per ratio). This design enables assessment of performance across conversational overlap scenarios.
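A toy sketch of how a target overlap ratio translates into a relative placement of the two sources (illustrative only, not the SparseLibriMix generation code):

```python
import numpy as np

def mix_with_overlap(s1, s2, overlap_ratio):
    """Place s2 relative to s1 so that a given fraction of the shorter source
    overlaps with the other, then sum the two into one mixture (toy example)."""
    shorter = min(len(s1), len(s2))
    overlap = int(round(overlap_ratio * shorter))  # number of overlapping samples
    offset = len(s1) - overlap                     # start of s2 in the mixture
    total = max(len(s1), offset + len(s2))
    mix = np.zeros(total)
    mix[: len(s1)] += s1
    mix[offset : offset + len(s2)] += s2
    return mix

# 40% of the shorter utterance overlaps with the longer one.
mixture = mix_with_overlap(np.random.randn(32000), np.random.randn(24000), 0.4)
```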
2. Data Generation Methodology
Libri2Mix mixtures follow a simple signal model:
- Clean mixture: $x_{\text{clean}}(t) = s_1(t) + s_2(t)$
- Noisy mixture: $x_{\text{noisy}}(t) = s_1(t) + s_2(t) + n(t)$

where $s_1$ and $s_2$ are the loudness-scaled utterances of the two speakers and $n$ is the loudness-scaled WHAM! noise clip.
All component utterances and noise clips undergo loudness normalization according to ITU-R BS.1770-4: the target loudness of each speech utterance is drawn uniformly from [−33, −25] LUFS, and that of each noise clip from [−38, −30] LUFS. For "clean" mixtures, the noise term is simply omitted. The empirical SNR (dB) of a source with respect to the rest of the mixture, defined as

$$\mathrm{SNR} = 10 \log_{10} \frac{\sum_t s_1(t)^2}{\sum_t \big(x(t) - s_1(t)\big)^2},$$

approximately follows a Normal(0 dB, 4.1 dB) distribution for clean mixtures and a Normal(−2 dB, 3.6 dB) distribution for noisy mixtures. Noise clips for the dev and test sets come from the WHAM! test split; for train-360, noise clips are speed-perturbed (by factors of 0.8 and 1.2) to avoid reusing the same clips. The final mixture is clipped at ±0.9 to prevent digital overflow.
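A minimal sketch of this loudness-normalize-and-sum step using pyloudnorm, assuming the LUFS ranges stated above and illustrative file paths; the "min"-mode truncation and hard ±0.9 clip are simplifications of the real scripts:

```python
import numpy as np
import soundfile as sf
import pyloudnorm as pyln

def make_mixture(s1_path, s2_path, noise_path, rate=16000, rng=None):
    """Loudness-scale two utterances and a noise clip (ITU-R BS.1770), then sum.

    Illustrative sketch of a LibriMix-style pipeline; the official scripts also
    handle min/max modes, resampling, and metadata bookkeeping.
    """
    rng = rng or np.random.default_rng()
    meter = pyln.Meter(rate)  # BS.1770 loudness meter

    def load_and_scale(path, lo, hi):
        audio, sr = sf.read(path)
        assert sr == rate, "resample beforehand in a real pipeline"
        target = rng.uniform(lo, hi)                  # target loudness in LUFS
        loudness = meter.integrated_loudness(audio)   # measured loudness in LUFS
        return pyln.normalize.loudness(audio, loudness, target)

    s1 = load_and_scale(s1_path, -33, -25)     # speech loudness range (LUFS)
    s2 = load_and_scale(s2_path, -33, -25)
    n = load_and_scale(noise_path, -38, -30)   # noise loudness range (LUFS)

    length = min(len(s1), len(s2), len(n))     # "min" mode: truncate to shortest
    mix = s1[:length] + s2[:length] + n[:length]
    return np.clip(mix, -0.9, 0.9)             # guard against digital clipping
```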
3. Audio Characteristics
Libri2Mix is delivered at two sampling rates, 16 kHz and 8 kHz, covering both bandwidths common in the contemporary speech separation literature, and audio is stored as 16-bit PCM WAV files. Each condition is additionally released in two length modes: the mixture either spans the longer component utterance ("max" mode) or is truncated to the shorter one ("min" mode), yielding average durations of 13–15 seconds. All component utterances are derived from LibriSpeech read speech, typically 5–15 seconds in length.
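A small sketch of the two length modes, assuming two already-loaded source arrays (illustrative only; the real scripts also apply loudness scaling and track metadata):

```python
import numpy as np

def align_sources(s1, s2, mode="min"):
    """Return the two sources aligned to a common length for mixing.

    "min": truncate both to the shorter source; "max": zero-pad to the longer.
    """
    if mode == "min":
        n = min(len(s1), len(s2))
        return s1[:n], s2[:n]
    n = max(len(s1), len(s2))
    return np.pad(s1, (0, n - len(s1))), np.pad(s2, (0, n - len(s2)))

a, b = align_sources(np.random.randn(16000), np.random.randn(24000), mode="max")
mixture = a + b  # both sources now share the mixture length
```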
4. Evaluation Protocols and Benchmarks
Benchmarks employ canonical separation and quality metrics:
- SI-SDR (Scale-Invariant Signal-to-Distortion Ratio), for an estimate $\hat{s}$ of a reference source $s$:
  $$\mathrm{SI\text{-}SDR}(s, \hat{s}) = 10 \log_{10} \frac{\lVert \alpha s \rVert^2}{\lVert \alpha s - \hat{s} \rVert^2}, \qquad \alpha = \frac{\hat{s}^{\top} s}{\lVert s \rVert^2}$$
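A minimal NumPy implementation of this metric (a sketch for reference; toolkits such as Asteroid ship their own tested versions):

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-8):
    """Scale-Invariant SDR in dB between a reference source and its estimate."""
    reference = reference - reference.mean()  # zero-mean, as is common practice
    estimate = estimate - estimate.mean()
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference                # optimally scaled reference
    noise = estimate - target                 # residual not explained by target
    return 10 * np.log10((np.dot(target, target) + eps) / (np.dot(noise, noise) + eps))
```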
Using the Conv-TasNet model (Asteroid implementation), with ideal ratio mask (IRM) and ideal binary mask (IBM) oracle results shown for reference:

| Condition | Conv-TasNet SI-SDR (dB) | IRM oracle (dB) | IBM oracle (dB) |
|---|---|---|---|
| 2-spk clean (16 kHz) | 16.0 | 14.1 | 14.5 |
| 2-spk noisy (16 kHz) | 13.5 | 13.4 | 13.7 |
| 3-spk clean | 13.0 | — | — |
| 3-spk noisy | 10.9 | — | — |
Performance on the sparsely overlapping sets improves as overlap decreases: at 0% overlap, Conv-TasNet reaches ≈31.9 dB SI-SDR (clean, 8 kHz), falling toward ≈15 dB for fully overlapped mixtures.
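As an illustration, a pretrained Asteroid Conv-TasNet checkpoint can be applied to a mixture file in a few lines; the checkpoint identifier below is an assumption (several Libri2Mix-trained models are published on the Hugging Face Hub under similar names):

```python
from asteroid.models import BaseModel

# Assumed checkpoint name; substitute any Libri2Mix-trained separation model.
model = BaseModel.from_pretrained("JorisCos/ConvTasNet_Libri2Mix_sepclean_16k")

# Separates a hypothetical mixture file; Asteroid writes the estimated sources
# (e.g. mixture_est1.wav, mixture_est2.wav) next to the input file.
model.separate("mixture.wav")
```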
5. Generalization Findings
Extensive cross-dataset evaluation reveals:
- Models trained on WHAM! exhibit ≈4 dB SI-SDR drop when tested on Libri2Mix compared to models trained on Libri2Mix.
- Libri2Mix train-360 models lose only ≈0.8 dB SI-SDR on WHAM! test.
- WHAM!-trained models tested on VCTK-2mix are ≈3–4 dB worse than Libri2Mix train-360 models, for both clean and noisy variants.
- Training on Libri2Mix train-360 (212 h, ~1,000 speakers, 60k-word vocabulary) provides better generalization than training on train-100 or on WHAM!/WSJ0-2mix (30 h, 100 speakers, 5k-word vocabulary).
SparseLibri2Mix evaluations demonstrate that separation quality degrades gracefully as overlap increases, with the model maintaining high SI-SDR improvements where overlap is minimal.
6. Usage and Licensing
LibriMix and Libri2Mix are fully open source. Clean speech originates from LibriSpeech (released under CC BY 4.0, derived from public-domain LibriVox recordings), and noise clips derive from WHAM! (CC BY-NC 4.0). All data generation scripts and metadata are hosted on GitHub:
- https://github.com/JorisCos/LibriMix
- https://github.com/JorisCos/VCTK-2Mix
- https://github.com/popcornell/SparseLibriMix
To reproduce Libri2Mix, users download the LibriSpeech clean subsets and the WHAM! noise files, install the Python dependencies (including pyloudnorm), and run the provided generation scripts. The output comprises per-split directories containing both 16 kHz and 8 kHz WAV files for mixtures, sources, and noise, together with metadata files for scripting compatibility.
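Once generated, the corpus can be consumed directly from the WAV files; a minimal loading sketch (directory layout and file names are illustrative and may differ from the scripts' exact output):

```python
from pathlib import Path
import soundfile as sf

# Illustrative layout; adjust to the paths produced by the generation scripts.
root = Path("Libri2Mix/wav16k/min/test")

for mix_path in sorted((root / "mix_clean").glob("*.wav"))[:5]:
    mixture, rate = sf.read(mix_path)
    s1, _ = sf.read(root / "s1" / mix_path.name)  # first source, same file name
    s2, _ = sf.read(root / "s2" / mix_path.name)  # second source
    print(mix_path.name, rate, mixture.shape, s1.shape, s2.shape)
```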
Libri2Mix is established as a robust and generalizable corpus for speech separation benchmarking and research (Cosentino et al., 2020).