Libri2Mix Corpus for Speech Separation

Updated 24 December 2025
  • Libri2Mix is a specialized dataset for two-speaker single-channel speech separation, integrating clean LibriSpeech utterances with WHAM! noise samples.
  • It employs precise loudness scaling and mixture generation, ensuring consistent benchmarks across training, validation, and test splits.
  • The corpus offers diverse audio characteristics and cross-corpus evaluations, showcasing improved generalization over older datasets like WSJ0-2mix.

Libri2Mix is a subset of the LibriMix corpus specifically tailored for two-speaker single-channel speech separation tasks. It is constructed from the “clean” portions of LibriSpeech and augmented with ambient noise samples from WHAM! to support both clean and noisy mixture scenarios. Libri2Mix is designed to address the generalization limitations observed in models trained on datasets like WSJ0-2mix, providing large-scale, diverse, and open-source benchmarks for contemporary speech separation systems (Cosentino et al., 2020).

1. Dataset Composition and Splits

Libri2Mix sources its clean speech data from the LibriSpeech train-clean-100, train-clean-360, dev-clean, and test-clean subsets, with ambient noise clips provided by WHAM!. The corpus replicates the organization of WSJ0-2mix and WHAM! to keep training and evaluation consistent, featuring two separate training sets, one validation (dev) set, and one test set:

| Split | Mixtures | Duration (h) |
|---|---|---|
| train-360 | 50,800 | ≈212 |
| train-100 | 13,900 | ≈58 |
| dev | 3,000 | ≈11 |
| test | 3,000 | ≈11 |

For each mixture, two randomly selected speakers are paired, each contributing one utterance, sampled without replacement to ensure each utterance is used exactly once in train splits. These utterances are subsequently loudness scaled and summed. An additional test set, VCTK-2mix, consists of 3,000 mixtures (≈9 h) from 108 VCTK speakers mixed analogously, using WHAM! test-set noise, facilitating fair cross-corpus generalization evaluation.
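The pairing step described above can be sketched as follows. This is a minimal illustration, not the LibriMix generation script: the function name `pair_utterances`, the `utt2spk` mapping, and the greedy partner search are all assumptions made for the example, and the sketch only guarantees that each utterance is used at most once.

```python
import random

def pair_utterances(utt2spk, seed=0):
    """Pair utterances from two different speakers without reusing any utterance.

    utt2spk is a hypothetical dict mapping an utterance ID to its speaker ID.
    Returns a list of (utterance_a, utterance_b) tuples.
    """
    rng = random.Random(seed)
    pool = list(utt2spk)
    rng.shuffle(pool)
    pairs = []
    while len(pool) >= 2:
        first = pool.pop()
        # Look for any remaining utterance spoken by a different speaker.
        partner_idx = next(
            (i for i, utt in enumerate(pool) if utt2spk[utt] != utt2spk[first]),
            None,
        )
        if partner_idx is None:
            # Every remaining utterance shares one speaker; nothing left to pair.
            break
        pairs.append((first, pool.pop(partner_idx)))
    return pairs
```

Because utterances are removed from the pool as soon as they are paired, a few utterances may be left over when only one speaker remains; a real pipeline would need its own policy for those.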

SparseLibri2Mix provides a variant test set (3,000 mixtures, ≈6 h) for the two-speaker case, in which sub-utterances are concatenated to 15 seconds per mixture under six target overlap ratios: 0%, 20%, 40%, 60%, 80%, 100% (500 mixtures per ratio). This design enables assessment of performance under various conversational overlap scenarios.
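To make the notion of a target overlap ratio concrete, the sketch below places a second segment relative to a first one. It assumes the ratio is measured against the shorter of the two segments; this definition and the helper name `second_start` are illustrative assumptions and may differ from how SparseLibri2Mix actually computes overlap.

```python
def second_start(len_a, len_b, overlap_ratio):
    """Start sample for segment B so that it overlaps segment A by the given ratio.

    Segment A is assumed to start at sample 0. The ratio is taken relative to
    the shorter segment (an assumption made for this illustration).
    """
    if not 0.0 <= overlap_ratio <= 1.0:
        raise ValueError("overlap_ratio must lie in [0, 1]")
    overlap = round(overlap_ratio * min(len_a, len_b))
    return len_a - overlap

def realized_overlap(len_a, len_b, start_b):
    """Number of samples shared by A = [0, len_a) and B = [start_b, start_b + len_b)."""
    return max(0, min(len_a, start_b + len_b) - max(0, start_b))
```

With a ratio of 0 the second segment starts exactly where the first ends, and with a ratio of 1 the shorter segment lies entirely inside the longer one.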

2. Data Generation Methodology

Libri2Mix mixtures adhere to a clear signal model:

  • Clean mixture: $x(t) = s_1(t) + s_2(t)$
  • Noisy mixture: $x(t) = s_1(t) + s_2(t) + n(t)$

All component utterances and noise clips undergo loudness scaling according to ITU-R BS.1770-4 LUFS: each speech utterance loudness $\ell_s$ is drawn uniformly from $[-33, -25]$ LUFS, and each noise clip loudness $\ell_n$ from $[-38, -30]$ LUFS. For “clean” mixtures, $\ell_n$ is set to $-300$ LUFS (effectively zero noise). The empirical SNR (dB), defined as

$$\mathrm{SNR_{dB}} = 10\log_{10}\left(\frac{\mathbb{E}[s(t)^2]}{\mathbb{E}[n(t)^2]}\right),$$

follows an approximately normal distribution with mean 0 dB and standard deviation 4.1 dB for clean mixtures, and mean −2 dB and standard deviation 3.6 dB for noisy mixtures. Noise clips for the dev and test splits are sourced from the WHAM! test split; for train-360, noise clips are speed-perturbed (by factors 0.8 and 1.2) to prevent reuse. The final mixture is clipped at ±0.9 to prevent digital overflow.
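The generation steps above (level scaling, summation, clipping, and the SNR definition) can be sketched in plain Python. This is a simplified illustration under stated assumptions: a mean-power level in dB stands in for ITU-R BS.1770-4 loudness, which in practice involves K-weighting and gating and is what a toolkit such as pyloudnorm computes. All function names here are hypothetical, and inputs are assumed to be non-silent and of equal length.

```python
import math
import random

def rms_db(x):
    """Mean-power level of a signal in dB (a crude stand-in for LUFS)."""
    power = sum(v * v for v in x) / len(x)
    return 10.0 * math.log10(power)

def scale_to_level(x, target_db):
    """Apply a constant gain so that rms_db(x) equals target_db."""
    gain = 10.0 ** ((target_db - rms_db(x)) / 20.0)
    return [gain * v for v in x]

def mix(s1, s2, noise=None, clip=0.9, rng=None):
    """Scale the sources to random levels, sum them, and limit the result to +/- clip."""
    rng = rng or random.Random(0)
    sources = [
        scale_to_level(s1, rng.uniform(-33.0, -25.0)),
        scale_to_level(s2, rng.uniform(-33.0, -25.0)),
    ]
    if noise is not None:
        sources.append(scale_to_level(noise, rng.uniform(-38.0, -30.0)))
    # Sample-wise sum of all components, hard-limited as described in the text.
    mixture = [max(-clip, min(clip, sum(vals))) for vals in zip(*sources)]
    return mixture, sources

def snr_db(signal, noise):
    """10 * log10(E[s^2] / E[n^2]), computed as a difference of levels in dB."""
    return rms_db(signal) - rms_db(noise)
```

Passing `noise=None` corresponds to the clean condition; the text instead models it as a noise level of −300 LUFS, which has the same practical effect.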

3. Audio Characteristics

Libri2Mix is delivered at two sampling rates, 16 kHz and 8 kHz, supporting both bandwidths common in the contemporary speech separation literature. Audio is stored as 16-bit PCM WAV files. Independently of the sampling rate, the mixture length either matches the longer component utterance (“max” mode) or is truncated to the shorter one (“min” mode), yielding average durations of 13–15 seconds. All component utterances are derived from LibriSpeech, comprising read speech typically 5–15 seconds in length.
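The two length modes can be illustrated with a short helper. This is a sketch only: the name `align_lengths` is hypothetical, and zero-padding the shorter source at its end in “max” mode is an assumption about how the longer length is reached.

```python
def align_lengths(s1, s2, mode="min"):
    """Bring two sources (lists of samples) to a common length before summing.

    "min": truncate both to the shorter source.
    "max": zero-pad the shorter source at the end (assumed padding scheme).
    """
    if mode == "min":
        n = min(len(s1), len(s2))
        return list(s1[:n]), list(s2[:n])
    if mode == "max":
        n = max(len(s1), len(s2))

        def pad(x):
            return list(x) + [0.0] * (n - len(x))

        return pad(s1), pad(s2)
    raise ValueError("mode must be 'min' or 'max'")
```

In “min” mode both speakers are active for the whole mixture, whereas “max” mode leaves a tail in which only one speaker is present.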

4. Evaluation Protocols and Benchmarks

Benchmarks employ canonical separation and quality metrics:

  • SI-SDR (Scale-Invariant Signal-to-Distortion Ratio):

$$\mathrm{SI{-}SDR} = 10\log_{10} \left( \frac{ \| \alpha s \|^2 }{ \| \alpha s - \hat{s} \|^2 } \right), \quad \alpha = \frac{ \langle \hat{s}, s \rangle }{ \|s\|^2 }$$

  • SDR (BSS_eval)
  • PESQ (ITU-T P.862) for perceptual evaluation
  • STOI for intelligibility
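The SI-SDR definition above translates directly into a few lines of code. The sketch below is a plain-Python illustration for a single estimate and reference of equal length; the function name `si_sdr` is chosen for the example, and it does not handle a perfect estimate (zero residual) or the speaker permutation search that a full evaluation needs.

```python
import math

def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between an estimate and a reference signal."""
    dot = sum(e * r for e, r in zip(estimate, reference))
    ref_energy = sum(r * r for r in reference)
    alpha = dot / ref_energy                      # optimal scaling of the reference
    target = [alpha * r for r in reference]       # alpha * s
    target_energy = sum(t * t for t in target)
    residual_energy = sum((t - e) ** 2 for t, e in zip(target, estimate))
    return 10.0 * math.log10(target_energy / residual_energy)
```

Because the reference is rescaled by $\alpha$ before comparison, multiplying the estimate by any nonzero constant leaves the score unchanged, which is the scale invariance the metric is named for.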

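Results for the Conv-TasNet model (Asteroid implementation), alongside ideal ratio mask (IRM) and ideal binary mask (IBM) scores:

| Condition | Conv-TasNet SI-SDR (dB) | IRM (dB) | IBM (dB) |
|---|---|---|---|
| 2-spk clean (16 kHz) | 16.0 | 14.1 | 14.5 |
| 2-spk noisy (16 kHz) | 13.5 | 13.4 | 13.7 |
| 3-spk clean | 13.0 | | |
| 3-spk noisy | 10.9 | | |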

Performance on the sparsely overlapping sets declines smoothly as overlap increases: at 0% overlap, Conv-TasNet reaches ≈31.9 dB SI-SDR (clean, 8 kHz), falling toward ≈15 dB for fully overlapped mixtures.

5. Generalization Findings

Extensive cross-dataset evaluation reveals:

  • Models trained on WHAM! exhibit ≈4 dB SI-SDR drop when tested on Libri2Mix compared to models trained on Libri2Mix.
  • Libri2Mix train-360 models lose only ≈0.8 dB SI-SDR on WHAM! test.
  • WHAM!-trained models tested on VCTK-2mix are ≈3–4 dB worse than Libri2Mix train-360 models, for both clean and noisy variants.
  • Training on Libri2Mix train-360 (212 h, ~1,000 speakers, 60k words) generalizes better than training on train-100 or on WHAM!/WSJ0-2mix (30 h, 100 speakers, 5k words).

SparseLibri2Mix evaluations show that separation quality degrades gracefully as overlap increases, with the model achieving its highest SI-SDR where overlap is minimal.

6. Usage and Licensing

LibriMix and Libri2Mix are fully open source. Clean speech originates from LibriSpeech (CC BY 4.0, derived from public-domain LibriVox audiobooks), and noise clips derive from WHAM! (CC BY-NC 4.0). All data generation scripts and metadata are hosted on GitHub:

To reproduce Libri2Mix, users acquire LibriSpeech clean sets and WHAM! noise files, install the pyloudnorm toolkit, and execute the provided Python scripts. The output comprises directories per split featuring both 16 kHz and 8 kHz WAV files for mixtures, sources, and noise, accompanied by manifest files for scripting compatibility.

Libri2Mix is established as a robust and generalizable corpus for speech separation benchmarking and research (Cosentino et al., 2020).
