
Libri2Mix: Benchmark Dataset for Speech Separation

Updated 9 April 2026
  • Libri2Mix is a large-scale, open-source corpus featuring two-speaker mixtures, diverse speaker profiles, and realistic noise from varied environments.
  • It uses LUFS loudness normalization and augmentation with non-stationary WHAM! noise to generate both clean and noisy mixtures, making training conditions more realistic.
  • The dataset offers standard splits, cross-dataset evaluations (e.g., VCTK-2mix), and improved SI-SDR benchmarks, supporting robust speech separation research.

Libri2Mix is a large-scale, fully open-source corpus for single-channel two-speaker speech separation, constructed from LibriSpeech “clean” subsets and non-stationary ambient noise from WHAM!. Developed to address generalization issues identified in models trained solely on wsj0-2mix, Libri2Mix provides complete clean and noisy mixture sets, greater speaker and vocabulary diversity, richly annotated metadata, and additional cross-dataset and sparsely overlapping test sets. The resource supports robust benchmarking and cross-domain evaluation for modern speech separation architectures, with better generalization to naturalistic, conversation-like scenarios (Cosentino et al., 2020).

1. Dataset Composition and Structure

Libri2Mix is derived from the LibriSpeech “clean” subsets (train-clean-360, train-clean-100, dev-clean, test-clean), which together comprise approximately 470 hours of read English speech from 1,252 speakers, sampled at 16 kHz. Ambient noise is sourced from the WHAM! corpus, which covers background environments such as coffee shops, restaurants, and bars, and is distributed under the CC BY-NC 4.0 license. The corpus includes paired two-speaker mixtures in both clean (speech only) and noisy (speech plus WHAM! noise) variants.

Standard splits and their approximate scale are as follows:

Split      | # Mixtures | Duration (hours) | Speakers*
train-360  | 50,800     | ~212             | 921
train-100  | 13,900     | ~58              | 251
dev        | 3,000      | ~11              | 40
test       | 3,000      | ~11              | 40

*Unique speakers in the corresponding LibriSpeech subset (train-clean-360, train-clean-100, dev-clean, test-clean); the four counts sum to the 1,252 speakers cited above.
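
As a rough sanity check on the table, the mixture counts and approximate durations imply an average mixture length per split. The sketch below only redoes that arithmetic; the figures are the rounded values from the table, so the results are approximate, and the names `SPLITS` and `mean_mixture_seconds` are illustrative rather than part of any Libri2Mix tooling.

```python
# Split statistics copied from the table: (number of mixtures, approximate hours).
SPLITS = {
    "train-360": (50_800, 212),
    "train-100": (13_900, 58),
    "dev": (3_000, 11),
    "test": (3_000, 11),
}


def mean_mixture_seconds(n_mixtures: int, hours: float) -> float:
    """Average mixture length in seconds implied by a split's totals."""
    return hours * 3600.0 / n_mixtures


for name, (count, hours) in SPLITS.items():
    print(f"{name}: about {mean_mixture_seconds(count, hours):.1f} s per mixture")
```

With these rounded inputs, the training mixtures come out at roughly 15 s each and the dev/test mixtures at roughly 13 s.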

Complementary resources include VCTK-2mix (3,000 mixtures, 9 h, 108 speakers, with matched WHAM! noise) for cross-dataset evaluation and SparseLibri2Mix (3,000 mixtures, 6 h) for benchmarking under varying degrees of inter-speaker overlap.

2. Mixture Generation and Loudness Normalization

Mixtures are created using the formula:

$$x(t) = s_1(t) + \alpha \cdot s_2(t)$$

where $\alpha$ sets the relative level of the two sources. Classical signal-power scaling (as in wsj0-2mix) draws an SNR uniformly at random and sets $\alpha = 10^{-\mathrm{SNR}/20} \cdot \sqrt{\int s_1^2 / \int s_2^2}$. Libri2Mix instead aims for auditory realism through perceptual loudness normalization in LUFS (Loudness Units relative to Full Scale, ITU-R BS.1770-4). Each utterance is normalized to a random target $L_i \sim \mathrm{Uniform}(-33, -25)$ LUFS, and the gain

$$g_i = 10^{(L_i - L_{\mathrm{meas},i})/20}$$

is applied, where $L_{\mathrm{meas},i}$ is the measured loudness of utterance $i$ in LUFS. Final mixtures are constructed by summing the scaled sources. This strategy yields an empirical SNR distribution in “clean” mixtures of approximately $\mathcal{N}(0, 4.1^2)$ dB.
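
The two scaling strategies described above can be sketched as follows. This is a minimal illustration, not the official generation script: signals are plain Python lists, the function names are hypothetical, and the measured loudness is assumed to come from an external ITU-R BS.1770 meter (for instance one provided by a package such as pyloudnorm), which is not implemented here.

```python
import math
import random


def lufs_gain(target_lufs: float, measured_lufs: float) -> float:
    """Linear gain g = 10^((L_target - L_meas) / 20) that moves a signal to the target loudness."""
    return 10.0 ** ((target_lufs - measured_lufs) / 20.0)


def sample_speech_target(rng: random.Random) -> float:
    """Draw a speech loudness target from Uniform(-33, -25) LUFS, the range given in the text."""
    return rng.uniform(-33.0, -25.0)


def power_ratio_alpha(s1, s2, snr_db: float) -> float:
    """Classical power-based scaling: alpha = 10^(-SNR/20) * sqrt(sum(s1^2) / sum(s2^2))."""
    e1 = sum(v * v for v in s1)
    e2 = sum(v * v for v in s2)
    return 10.0 ** (-snr_db / 20.0) * math.sqrt(e1 / e2)


def mix_sources(sources, gains):
    """Sum gain-scaled sources sample by sample, truncating to the shortest source."""
    n = min(len(s) for s in sources)
    return [sum(g * s[i] for s, g in zip(sources, gains)) for i in range(n)]


# Usage with assumed (made-up) measured loudness values for two utterances.
rng = random.Random(0)
measured = [-27.0, -21.5]  # hypothetical meter readings in LUFS
gains = [lufs_gain(sample_speech_target(rng), m) for m in measured]
```

Because each source receives an independent loudness target, the level difference between the two speakers varies from mixture to mixture, which is what produces the spread in SNR reported above.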

For “noisy” mixtures, noise loudness $L_n$ is drawn from $\mathrm{Uniform}(-38, -30)$ LUFS, resulting in SNRs (speech vs. noise) of approximately $\mathcal{N}(-2, 3.6^2)$ dB. To prevent digital overflow, all outputs are peak-normalized and clipped to a maximum absolute amplitude of 0.9.
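
A minimal sketch of the noisy-mixture step follows, assuming the noise has already been assigned a linear gain from its LUFS target. The peak handling shown here rescales the whole mixture when its peak exceeds 0.9, which is one plausible reading of "peak-normalized and clipped"; the exact behavior of the official scripts is not specified in the text, and all function names are illustrative.

```python
import random

MAX_AMP = 0.9  # amplitude ceiling stated in the text


def sample_noise_target(rng: random.Random) -> float:
    """Draw a noise loudness target from Uniform(-38, -30) LUFS."""
    return rng.uniform(-38.0, -30.0)


def add_noise(speech_mix, noise, noise_gain: float):
    """Add gain-scaled noise to a speech mixture, truncating to the shorter signal."""
    n = min(len(speech_mix), len(noise))
    return [speech_mix[i] + noise_gain * noise[i] for i in range(n)]


def limit_peak(mixture, max_amp: float = MAX_AMP):
    """Rescale the mixture if its peak exceeds max_amp.

    Returns (samples, scale). Assumption: applying the same scale to the
    source and noise signals keeps them consistent with the stored mixture.
    """
    peak = max((abs(v) for v in mixture), default=0.0)
    if peak <= max_amp:
        return list(mixture), 1.0
    scale = max_amp / peak
    return [v * scale for v in mixture], scale
```

Returning the scale factor alongside the samples lets a caller apply the identical factor to the reference sources, so that the references still sum to the stored mixture.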

3. Noise Augmentation and Background Scenarios

Noisy Libri2Mix mixtures use WHAM! noise segments, drawn at random for each mixture from the WHAM! split that matches the target split (train, dev, or test). For the large train-360 set, the pool of distinct noise samples is enlarged by speed-perturbing WHAM! noises by factors of 0.8 and 1.2. WHAM! noises are dynamic and non-stationary, yielding more realistic and variable background environments than stationary alternatives. Each noise segment is gain-scaled to its target LUFS and added to the speech mixture to achieve the sampled loudness ratio.
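
Speed perturbation can be illustrated with a deliberately simple resampler. The sketch below uses linear interpolation on a list of samples, which is a stand-in chosen for readability; it is an assumption that this matches the intent of the augmentation, and production pipelines would normally use a proper band-limited resampler instead. The function name `speed_perturb` is hypothetical.

```python
def speed_perturb(samples, factor: float):
    """Play `samples` back `factor` times faster using linear interpolation.

    factor < 1 lengthens the signal (slower), factor > 1 shortens it (faster).
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    if not samples:
        return []
    last = len(samples) - 1
    n_out = int(len(samples) / factor)
    out = []
    for i in range(n_out):
        pos = i * factor               # fractional read position in the input
        lo = min(int(pos), last)       # sample just before the read position
        hi = min(lo + 1, last)         # sample just after, clamped at the end
        frac = pos - lo
        out.append((1.0 - frac) * samples[lo] + frac * samples[hi])
    return out


# One original noise clip yields three variants: the original, 0.8x, and 1.2x.
noise = [float(i) for i in range(10)]
variants = [noise, speed_perturb(noise, 0.8), speed_perturb(noise, 1.2)]
```

Under this scheme each source noise contributes the original plus two perturbed versions, which is how the augmentation increases the number of distinct noise samples available to the larger training set.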

4. Data Splits, Special Test Sets, and Resampling Modes

Both Libri2Mix and its three-speaker sibling Libri3Mix provide four standard splits: train-360, train-100, dev, and test, available in both clean and noisy variants. Each split is distributed at 16 kHz (native) and 8 kHz (resampled), and two duration modes are provided: “min” (truncation to the shortest source) and “max” (zero-padding to the longest).
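
The two duration modes can be sketched as a small alignment helper. This is an illustration of the truncate-versus-pad distinction only, using plain lists and a hypothetical function name `align_sources`; it does not reproduce the official generation code.

```python
def align_sources(sources, mode: str = "min"):
    """Bring all sources to a common length before summing.

    "min": truncate every source to the shortest one.
    "max": zero-pad every source to the longest one.
    """
    if mode == "min":
        n = min(len(s) for s in sources)
        return [list(s[:n]) for s in sources]
    if mode == "max":
        n = max(len(s) for s in sources)
        return [list(s) + [0.0] * (n - len(s)) for s in sources]
    raise ValueError(f"unknown mode: {mode!r}")


s1 = [0.1, 0.2, 0.3, 0.4]
s2 = [0.5, 0.6]
print(align_sources([s1, s2], "min"))  # both sources have 2 samples
print(align_sources([s1, s2], "max"))  # both sources have 4 samples
```

In "min" mode every mixture is fully overlapped, whereas "max" mode leaves a single-speaker tail once the shorter utterance ends.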

  • VCTK-2mix: Designed for cross-corpus generalization evaluation, this set consists of 3,000 mixtures produced by combining VCTK speech and WHAM! noise under the same mixing protocol.
  • SparseLibri2Mix: Contains six nominal overlap levels (0%, 20%, ..., 100%), built by concatenating sub-utterances (up to 15 s per speaker) aligned at the frame level using the Montreal Forced Aligner, to better reflect naturally alternating conversational speech overlaps.

5. Organization, Format, and Metadata

Libri2Mix employs a systematic directory structure reflecting mixture type and split.

Audio is stored as 16-bit PCM WAV files. Each example is named “{utt1ID}_{utt2ID}.wav” in each respective folder. Associated JSON manifests (e.g. train-360_clean.json) provide file paths for “mix”, “s1”, “s2”, and, for noisy splits, “noise”. All file samples are normalized so that the maximum absolute amplitude stays within the 0.9 ceiling applied during mixing.
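
The naming and manifest conventions described above can be exercised with a short sketch. The keys “mix”, “s1”, “s2”, and “noise” come from the text; everything else is an assumption made for illustration, including the manifest being a JSON list of objects, the example paths, the example utterance IDs, and the helper names.

```python
import json
from pathlib import PurePosixPath

REQUIRED_CLEAN = ("mix", "s1", "s2")
REQUIRED_NOISY = REQUIRED_CLEAN + ("noise",)


def source_ids(mixture_path: str):
    """Recover the two source utterance IDs from a "{utt1ID}_{utt2ID}.wav" name."""
    utt1, utt2 = PurePosixPath(mixture_path).stem.split("_")
    return utt1, utt2


def missing_keys(entry: dict, noisy: bool):
    """List the required manifest keys that an entry lacks."""
    required = REQUIRED_NOISY if noisy else REQUIRED_CLEAN
    return [key for key in required if key not in entry]


def load_manifest(text: str, noisy: bool):
    """Parse manifest text (assumed: a JSON list of objects) and validate each entry."""
    entries = json.loads(text)
    for entry in entries:
        missing = missing_keys(entry, noisy)
        if missing:
            raise ValueError(f"manifest entry lacks keys: {missing}")
    return entries


# Hypothetical manifest content; the paths and IDs are invented for this example.
example = json.dumps([{
    "mix": "mix/1272-128104-0000_2035-147961-0014.wav",
    "s1": "s1/1272-128104-0000_2035-147961-0014.wav",
    "s2": "s2/1272-128104-0000_2035-147961-0014.wav",
}])
entries = load_manifest(example, noisy=False)
print(source_ids(entries[0]["mix"]))
```

Splitting on the underscore works here because the example IDs use hyphens internally; a dataset with underscores inside utterance IDs would need a different parsing rule.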

LibriSpeech-derived material is under CC BY 4.0; WHAM! noise is under CC BY-NC 4.0. Libri2Mix is distributed under the most restrictive of the combined sources: CC BY 4.0 (clean), CC BY-NC 4.0 (noisy).

6. Evaluation, Baselines, and Generalization

Baseline evaluation with Conv-TasNet on Libri2Mix demonstrates substantial improvements in scale-invariant signal-to-distortion ratio (SI-SDR):

  • At 16 kHz, SI-SDR improvement reaches up to 16 dB (2 speakers, clean), 13.5 dB (2 speakers, noisy), ≈13 dB (3 speakers, clean), and ≈10.9 dB (3 speakers, noisy).
  • Models trained on LibriMix train-360 lose only ≈0.8 dB SI-SDR on the WHAM! test set, compared to a ≈4 dB loss for WHAM!-trained models evaluated on Libri2Mix.
  • On VCTK-2mix, LibriMix-trained models outperform WHAM!-trained models by ≈3–4 dB SI-SDR.
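
For reference, the metric behind these figures can be sketched as follows. This is a minimal single-source implementation of the commonly used SI-SDR definition (zero-mean signals, projection of the estimate onto the reference) and of SI-SDR improvement as the gain over the unprocessed mixture. It omits the permutation search needed when scoring several separated speakers, and the function names are illustrative.

```python
import math


def _zero_mean(x):
    """Remove the mean so that the metric ignores DC offsets."""
    mean = sum(x) / len(x)
    return [v - mean for v in x]


def si_sdr(estimate, reference) -> float:
    """Scale-invariant SDR in dB between an estimate and its reference."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    est, ref = _zero_mean(estimate), _zero_mean(reference)
    ref_energy = sum(r * r for r in ref)
    # Optimal scaling of the reference: projection of the estimate onto it.
    scale = sum(e * r for e, r in zip(est, ref)) / ref_energy
    target = [scale * r for r in ref]
    residual_energy = sum((e - t) ** 2 for e, t in zip(est, target))
    if residual_energy == 0.0:
        return math.inf
    target_energy = sum(t * t for t in target)
    return 10.0 * math.log10(target_energy / residual_energy)


def si_sdr_improvement(estimate, reference, mixture) -> float:
    """SI-SDR of the estimate minus SI-SDR of the unprocessed mixture."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)
```

Reporting the improvement rather than the raw score factors out how hard each mixture was to begin with, which makes results more comparable across clean and noisy conditions.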

The larger pool of training mixtures, speaker diversity, and increased vocabulary coverage (60k word types vs. 5k in WHAM!) are primary drivers of improved generalization.

7. Significance and Research Utility

Libri2Mix addresses major limitations of prior speech separation benchmarks—particularly the poor generalization of models trained on wsj0-2mix or WHAM! to new domains. Its open-source, large-scale design, enriched speaker and noise diversity, and multiple test sets (including cross-dataset and sparse overlap) enable rigorous, reproducible evaluation of modern deep learning-based source separation systems. The adoption of LUFS normalization and dynamic noise augmentation further advances the realism and variability of training conditions, making Libri2Mix a central resource for research targeting speech separation model robustness and cross-corpus generalization (Cosentino et al., 2020).
