Libri2Mix Benchmark

Updated 1 June 2026

Libri2Mix Benchmark is a standardized dataset created by mixing two distinct LibriSpeech utterances with WHAM! noise at randomized SNRs.
It supports both signal-level evaluation (e.g., SI-SNR, PESQ) and recognition-level evaluation (WER) for assessing diverse TSE and TS-ASR approaches.
State-of-the-art pipelines like SoloSpeech and SQ-Whisper have demonstrated significant performance gains on Libri2Mix, highlighting its impact on overlapped speech research.

The Libri2Mix benchmark is a widely used controlled corpus for evaluating target speech extraction (TSE) and target-speaker automatic speech recognition (TS-ASR) algorithms in conditions of speaker overlap and additive noise. By constructing synthetic two-speaker mixtures from the LibriSpeech corpus and WHAM! environmental noise at randomized signal-to-noise ratios (SNR), Libri2Mix provides a standardized and rigorous testbed supporting both signal-level and recognition-level evaluation. Multiple state-of-the-art pipelines—including discriminative, generative, and foundation model-based approaches—have reported results on Libri2Mix, rendering it central for benchmarking progress in overlapped speech processing.

1. Construction and Partitioning of the Libri2Mix Corpus

Libri2Mix is derived by summing two distinct utterances from LibriSpeech, spoken by different speakers, at a randomly chosen SNR within $[-6, +3]$ dB for the canonical “2-speaker” condition. Additive noise conditions are created by mixing WHAM! real-world environmental noise segments at identical SNRs (Wang et al., 25 May 2025). The corpus supplies both “clean” (speech-only) and “noisy” (speech-plus-noise) overlaps. Libri2Mix is distributed in well-defined splits:

Split	Mixtures	Duration (h)	Notes
train-100	~40,000	58	100-hour LibriSpeech
train-360	~147,000	212	360-hour LibriSpeech
dev	~8,000–11,000	11
test	~8,000–11,000	11

Each mixture is paired with an enrollment (“cue”) utterance from the target speaker that differs from the target's utterance in the mixture, enabling both target extraction and recognition evaluation scenarios (Wang et al., 25 May 2025, Guo et al., 2024). The reported SNR distributions are $N(\mu=-2\,\mathrm{dB}, \sigma=3.6\,\mathrm{dB})$ for noisy mixtures and $N(\mu=0\,\mathrm{dB}, \sigma=4.1\,\mathrm{dB})$ for clean mixtures (Guo et al., 2024).
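
The mixing step described above can be illustrated with a minimal sketch. It is not the official Libri2Mix generation script: the function names, the uniform draw over the SNR range, the truncation to the shortest signal, and the choice to scale the noise against the target speaker at the same SNR are all assumptions made for illustration, and the inputs are assumed to be non-silent.

```python
import numpy as np

def power(x):
    """Mean squared amplitude of a waveform."""
    return float(np.mean(np.square(x)))

def scale_to_snr(reference, other, snr_db):
    """Rescale `other` so that 10*log10(P_reference / P_other) equals snr_db."""
    wanted_power = power(reference) / (10.0 ** (snr_db / 10.0))
    return other * np.sqrt(wanted_power / power(other))

def mix_at_random_snr(target, interferer, noise=None,
                      snr_range=(-6.0, 3.0), rng=None):
    """Sum two utterances (and optional noise) at one randomly drawn SNR.

    Assumption: the SNR is drawn uniformly from `snr_range` and all
    signals are truncated to the shortest one.
    """
    if rng is None:
        rng = np.random.default_rng()
    signals = [target, interferer] + ([] if noise is None else [noise])
    n = min(len(s) for s in signals)
    target = np.asarray(target[:n], dtype=np.float64)
    interferer = np.asarray(interferer[:n], dtype=np.float64)

    snr_db = float(rng.uniform(*snr_range))
    mixture = target + scale_to_snr(target, interferer, snr_db)
    if noise is not None:
        noise = np.asarray(noise[:n], dtype=np.float64)
        # Assumption: noise is scaled against the target at the same SNR.
        mixture = mixture + scale_to_snr(target, noise, snr_db)
    return mixture, snr_db
```

Passing a seeded `np.random.default_rng(seed)` makes the drawn SNR reproducible, which matters when a fixed test split has to be regenerated.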

2. Evaluation Metrics

Multiple complementary metrics are standardized for benchmarking on Libri2Mix:

SI-SNR (Scale-Invariant Signal-to-Noise Ratio): For extracted waveform $\hat{y}$ and clean target $y$, $\text{SI-SNR}(\hat{y}, y) = 10\log_{10}\left(\frac{\|\alpha y\|^2}{\|\hat{y}-\alpha y\|^2}\right),\quad \alpha = \frac{\langle \hat{y}, y \rangle}{\|y\|^2}$. SI-SNR improvement (SI-SNRi) quantifies the gain over the unprocessed mixture.
PESQ (Perceptual Evaluation of Speech Quality): Standardized in ITU-T Recommendation P.862, ranges $[-0.5, 4.5]$ (higher is better).
ESTOI (Extended Short-Time Objective Intelligibility): Ranges $[0, 1]$ for predicted intelligibility.
DNSMOS: Non-intrusive MOS prediction using a neural model, range $\approx [1,5]$ .
Automatic Speech Recognition WER: Measured as $\mathrm{WER} = \frac{S+D+I}{N} \times 100\%$, where $S$, $D$, and $I$ are the numbers of substitutions, deletions, and insertions, and $N$ is the reference length in words.
Speaker Similarity (SIM): Cosine similarity between extracted and enrollment utterances in a verification embedding space (Wang et al., 25 May 2025).
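
The SI-SNR, WER, and SIM definitions above can be sketched in a few lines of NumPy. This is an illustrative implementation, not the scoring code used by the cited papers: the function names are hypothetical, the zero-mean normalization in `si_snr` is a common convention that the formula above does not state, `wer` does no text normalization, and `cosine_similarity` takes speaker embeddings that some verification model (not shown) has already produced.

```python
import numpy as np

def si_snr(estimate, target, eps=1e-8):
    """SI-SNR in dB, following the projection formula given above."""
    estimate = estimate - np.mean(estimate)  # assumption: zero-mean both signals
    target = target - np.mean(target)
    alpha = np.dot(estimate, target) / (np.dot(target, target) + eps)
    projection = alpha * target
    residual = estimate - projection
    ratio = (np.dot(projection, projection) + eps) / (np.dot(residual, residual) + eps)
    return 10.0 * np.log10(ratio)

def si_snr_improvement(estimate, mixture, target):
    """Gain of the estimate over the unprocessed mixture, in dB."""
    return si_snr(estimate, target) - si_snr(mixture, target)

def wer(reference, hypothesis):
    """Word error rate in percent: (S + D + I) / N via word-level edit distance."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        curr = [i]
        for j, h in enumerate(hyp, 1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return 100.0 * prev[-1] / len(ref)

def cosine_similarity(a, b, eps=1e-8):
    """Cosine similarity between two embedding vectors."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))
```

Because `si_snr` projects the estimate onto the target, rescaling the estimate by any nonzero constant leaves the score essentially unchanged, which is the property that distinguishes it from plain SNR.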

3. Baseline Systems and Comparative Performance

The Libri2Mix test set is a principal comparative ground for TSE and TS-ASR, spanning discriminative, foundation, and generative approaches. Recent key results are summarized below.

Model	Type	WER (%)	SI-SNR (dB) $\uparrow$	PESQ $\uparrow$	ESTOI $\uparrow$	DNSMOS $\uparrow$	SIM $\uparrow$
Pipeline (Separation $\rightarrow$ Whisper)	Discriminative	15.3	–	–	–	–	–
TS-HuBERT + CLN	Foundation	24.8	–	–	–	–	–
TSE-Whisper (LoRA, FiLM)	Foundation	25.6	–	–	–	–	–
SoloSpeech (cascaded generative)	Generative	0.15	11.12	1.89	0.78	3.76	0.96
SSL-MHFA (SOTA disc.)	Discriminative	0.17	10.60	1.76	0.74	3.22	0.94
USEF-TSE (top disc.)	Discriminative	0.17	10.17	1.82	0.72	3.48	0.94
DDTSE (generative)	Generative	–	7.60	1.60	0.71	3.74	–
PT-Whisper	Prompt Tuning	30.7	–	–	–	–	–
SQ-Whisper (+train-360+SP)	Foundation	14.6	–	–	–	–	–

SoloSpeech achieves 11.12 dB SI-SNR, 1.89 PESQ, 0.78 ESTOI, and a WER of 0.15 on the Libri2Mix test set, ahead of the strongest discriminative baselines listed (10.60 dB SI-SNR for SSL-MHFA, 10.17 dB for USEF-TSE) and of the prior generative DDTSE (7.60 dB) in both signal and recognition metrics (Wang et al., 25 May 2025). SQ-Whisper with extended training reaches 14.6% WER, surpassing all previous end-to-end TS-ASR models (Guo et al., 2024).

4. Methodological Innovations and Architectures

Notable architectures evaluated on Libri2Mix include:

Discriminative models: Separate the target signal using encoder–decoder or transformer-based neural separation, then apply an ASR backend.
TS-HuBERT + CLN: Foundation model relying on conditional layer normalization for target speaker adaptation (Guo et al., 2024).
TSE-Whisper: Foundation ASR with LoRA and FiLM-based adaptors.
SQ-Whisper: Incorporates the SQ-Former adaptor, which uses a set of learnable queries and performs stacked self-attention and cross-attention between enrollment and mixture representations. Prompt vectors are injected into both the encoder and decoder streams to strengthen reliance on the target speaker. Its training objective is the sum of a cross-entropy loss and a contrastive speaker loss, with a weighting coefficient scaling the contrastive term. Training uses LoRA for parameter-efficient tuning and data augmentation (speed perturbation, expanded mixtures) to reach state-of-the-art performance (Guo et al., 2024).
SoloSpeech: A cascaded generative pipeline mixing time–frequency VAE-based compression, a speaker-embedding-free extractor operating in the latent space, and a corrector based on diffusion. Latent fusion is used for speaker conditioning, outperforming both fixed and fine-tuned SSL-based speaker embeddings in ablation. Corrector training employs strategic signal masking to optimize intelligibility–quality balance (Wang et al., 25 May 2025).
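
Two of the conditioning mechanisms named above, FiLM and conditional layer normalization (CLN), can be sketched generically. This is not the implementation from TSE-Whisper or TS-HuBERT: the function names, the single linear maps `w_gamma` and `w_beta`, and the `1 + ...` parameterization of the CLN scale are assumptions chosen so the example stays small and self-contained.

```python
import numpy as np

def film(features, speaker_emb, w_gamma, w_beta):
    """Feature-wise linear modulation: scale and shift each channel.

    features: (T, D) frame features; speaker_emb: (E,) enrollment embedding;
    w_gamma, w_beta: (E, D) hypothetical projection matrices.
    """
    gamma = speaker_emb @ w_gamma   # (D,) per-channel scale
    beta = speaker_emb @ w_beta     # (D,) per-channel shift
    return gamma * features + beta

def conditional_layer_norm(features, speaker_emb, w_gamma, w_beta, eps=1e-5):
    """Layer norm whose scale and bias are predicted from the speaker embedding."""
    mean = features.mean(axis=-1, keepdims=True)
    var = features.var(axis=-1, keepdims=True)
    normed = (features - mean) / np.sqrt(var + eps)
    # Assumption: parameterize the scale as 1 + delta, so zero weights
    # reduce to ordinary layer normalization without affine parameters.
    gamma = 1.0 + speaker_emb @ w_gamma
    beta = speaker_emb @ w_beta
    return gamma * normed + beta
```

In both cases the enrollment embedding changes only per-channel scales and shifts, so the same backbone weights can be steered toward different target speakers.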

5. Ablation and Analysis

Comprehensive ablation studies on Libri2Mix reveal:

For SQ-Whisper, varying the number of learnable queries shows that test WER is lowest at an intermediate setting; fewer queries under-represent the target speaker, while more lead to overfitting or increased noise (Guo et al., 2024).
The speaker contrastive loss is essential; its removal degrades test WER by approximately 5% absolute, and t-SNE analysis confirms improved clustering per speaker when present.
Injecting prompts into both the encoder and the decoder gives the lowest WER (20.1%, versus 21.8% for encoder-only and 27.9% for decoder-only injection).
SoloSpeech ablations demonstrate each module’s contribution: switching from time-domain to T-F VAE improves SI-SNR by 0.7 dB, and latent-space fusion for speaker conditioning exceeds fixed/fine-tuned SSL embedding by over 1 dB SI-SNR. The SoloSpeech corrector boosts SI-SNR by 1.02 dB over compressor+extractor alone, and further reduces WER from 0.18 to 0.15 (Wang et al., 25 May 2025).
Mismatched enrollment (where the cue utterance is not from the target speaker) triggers a WER spike (approximately 72%), confirming reliance on proper speaker cues in SQ-Whisper (Guo et al., 2024).
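
To put the absolute figures above on a common footing, the short sketch below converts them to relative reductions. The helper name is hypothetical; the numbers are the ones quoted in this section.

```python
def relative_reduction(baseline, improved):
    """Relative reduction in percent when moving from `baseline` to `improved`."""
    return 100.0 * (baseline - improved) / baseline

# SQ-Whisper prompt-injection ablation (WER in %), as quoted above.
prompt_wer = {"encoder+decoder": 20.1, "encoder-only": 21.8, "decoder-only": 27.9}
best = prompt_wer["encoder+decoder"]
for name, value in prompt_wer.items():
    if name != "encoder+decoder":
        print(f"{name}: {relative_reduction(value, best):.1f}% relative WER reduction")

# SoloSpeech corrector ablation (WER as reported: 0.18 -> 0.15).
print(f"corrector: {relative_reduction(0.18, 0.15):.1f}% relative WER reduction")
```

Relative figures of this kind are unit-free, so they can be compared even when two papers report WER on different scales.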

6. Generalization and Real-World Performance

The robustness of Libri2Mix-trained models is assessed on simulated out-of-domain data (WSJ speech plus WHAM!, MUSAN, or DEMAND noise) and on real-world corpora (CHiME-5, RealSEP). SoloSpeech demonstrates consistent 1–2 dB SI-SNR gains and 10–20% WER reductions across all out-of-domain settings, with MOS superior to both discriminative and generative baselines on real recordings. Its mean opinion score (MOS) is 2.93 on CHiME-5, at least 0.67 above the alternative models (Wang et al., 25 May 2025). This suggests that training on Libri2Mix can yield models that hold up outside the benchmark's own test conditions, which supports its use for assessing the robustness of TSE and TS-ASR architectures.

7. Significance and Benchmark Impact

The Libri2Mix benchmark has proven crucial as a reproducible, rigorously partitioned testbed for overlapped speech separation, extraction, and recognition. Its partitioning, tightly controlled mixing protocols, and clear evaluation suite support precise comparison between fundamentally distinct methodologies, aiding the field in tracking performance improvements across architectural classes. As evidenced by the widespread adoption and continued setting of new state-of-the-art results—such as those by SoloSpeech and SQ-Whisper—Libri2Mix remains the canonical reference for academic studies on overlapped speech extraction and TS-ASR (Guo et al., 2024, Wang et al., 25 May 2025).

References (2)

Wang et al. (2025). SoloSpeech: Enhancing Intelligibility and Quality in Target Speech Extraction through a Cascaded Generative Pipeline.

Guo et al. (2024). SQ-Whisper: Speaker-Querying based Whisper Model for Target-Speaker ASR.
