LibriMix: An Open-Source Dataset for Generalizable Speech Separation (2005.11262v1)

Published 22 May 2020 in eess.AS

Abstract: In recent years, wsj0-2mix has become the reference dataset for single-channel speech separation. Most deep learning-based speech separation models today are benchmarked on it. However, recent studies have shown important performance drops when models trained on wsj0-2mix are evaluated on other, similar datasets. To address this generalization issue, we created LibriMix, an open-source alternative to wsj0-2mix, and to its noisy extension, WHAM!. Based on LibriSpeech, LibriMix consists of two- or three-speaker mixtures combined with ambient noise samples from WHAM!. Using Conv-TasNet, we achieve competitive performance on all LibriMix versions. In order to fairly evaluate across datasets, we introduce a third test set based on VCTK for speech and WHAM! for noise. Our experiments show that the generalization error is smaller for models trained with LibriMix than with WHAM!, in both clean and noisy conditions. Aiming towards evaluation in more realistic, conversation-like scenarios, we also release a sparsely overlapping version of LibriMix's test set.

Summary

The paper introduces LibriMix, a new dataset that improves the generalization of speech separation models by addressing limitations of existing benchmarks.
The study combines LibriSpeech samples with noise from WHAM! to create realistic, varied acoustic environments for robust evaluation.
Experiments with Conv-TasNet show that models trained on LibriMix have a smaller generalization error than models trained on WHAM!, in both clean and noisy conditions.

The paper presents LibriMix, an open-source dataset for training and evaluating speech separation models. It responds to limitations observed in the current reference dataset, wsj0-2mix, and in its noisy and reverberant extensions, WHAM! and WHAMR!. The work is particularly relevant for researchers concerned with how well speech separation algorithms generalize.

Background and Motivation

The research community has predominantly relied on the wsj0-2mix dataset to benchmark speech separation models. Although models achieve high separation quality on this dataset, their performance drops markedly when they are evaluated on other, similar datasets. This indicates a lack of generalization, the key challenge that LibriMix aims to address.

LibriMix Dataset Design

LibriMix is built from LibriSpeech: utterances from two or three speakers are mixed and combined with ambient noise samples from WHAM!, creating a more varied and extensive corpus. By incorporating a broader range of speakers and different recording conditions, LibriMix aims for more realistic experimental conditions. To allow fair evaluation across datasets, the authors also introduce a test set that uses VCTK for speech and WHAM! for noise, and they release a sparsely overlapping version of the LibriMix test set to approximate conversational speech. The design addresses several identified shortcomings of existing datasets, such as unnatural overlap ratios and limited speaker diversity.
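
As a rough illustration of the mixing idea described above, the following Python sketch sums speaker signals and a noise signal, with an optional start offset per speaker to make the overlap sparser. It is not the LibriMix generation script: the function names (`rms`, `scale_to_rms`, `mix`), the RMS-based level scaling, and the sample offsets are assumptions made for this example, and random values stand in for real audio.

```python
import math
import random


def rms(signal):
    """Root-mean-square level of a signal given as a list of floats."""
    return math.sqrt(sum(x * x for x in signal) / len(signal))


def scale_to_rms(signal, target_rms):
    """Rescale a signal so its RMS equals target_rms (silence is returned unchanged)."""
    current = rms(signal)
    if current == 0.0:
        return list(signal)
    gain = target_rms / current
    return [gain * x for x in signal]


def mix(sources, noise=None, offsets=None):
    """Sum sources (and optional noise) sample by sample.

    offsets[i] delays source i by that many samples (hypothetical parameter);
    larger offsets give a more sparsely overlapping mixture. The mixture is as
    long as the latest-ending source, and noise is added only where it exists.
    """
    offsets = offsets or [0] * len(sources)
    length = max(off + len(src) for src, off in zip(sources, offsets))
    mixture = [0.0] * length
    for src, off in zip(sources, offsets):
        for i, x in enumerate(src):
            mixture[off + i] += x
    if noise is not None:
        for i in range(min(length, len(noise))):
            mixture[i] += noise[i]
    return mixture


# Usage: random placeholders for two speakers and a noise clip.
rng = random.Random(0)
s1 = scale_to_rms([rng.uniform(-1, 1) for _ in range(8)], 0.10)
s2 = scale_to_rms([rng.uniform(-1, 1) for _ in range(8)], 0.05)
noise = scale_to_rms([rng.uniform(-1, 1) for _ in range(12)], 0.01)

full = mix([s1, s2], noise)               # both speakers start together
sparse = mix([s1, s2], noise, [0, 4])     # second speaker starts 4 samples later
print(len(full), len(sparse))
```

Delaying one source lengthens the mixture and shrinks the region where both speakers are active, which is the property a sparsely overlapping test set is meant to exercise.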

Experimental Setup and Results

The authors ran a series of experiments using Conv-TasNet, a state-of-the-art deep learning model for speech separation. Models trained on LibriMix generally exhibited smaller generalization errors than those trained on the wsj0-2mix and WHAM! datasets. This held in both clean and noisy conditions, which the authors attribute to the greater diversity of LibriMix.

Results are quantified using the scale-invariant signal-to-distortion ratio (SI-SDR), and models trained on LibriMix generalized better when tested on VCTK-2mix. The SI-SDR improvement was most notable for models trained on the train-360 subset of LibriMix, underlining the importance of larger, more diverse training sets.
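
For readers unfamiliar with the metric, SI-SDR projects the estimate onto the reference before measuring the residual, so a global gain applied to the estimate does not change the score. The sketch below follows the commonly used definition and is not taken from the paper's evaluation code. The mean removal, the function names, and the convention of measuring improvement relative to the unprocessed mixture are assumptions of this example.

```python
import math


def si_sdr(estimate, reference):
    """Scale-invariant SDR in dB between two equal-length signals (lists of floats)."""
    if len(estimate) != len(reference):
        raise ValueError("signals must have the same length")
    # Assumption: remove the mean so a constant offset does not affect the score.
    mean_e = sum(estimate) / len(estimate)
    mean_r = sum(reference) / len(reference)
    est = [x - mean_e for x in estimate]
    ref = [x - mean_r for x in reference]

    ref_energy = sum(r * r for r in ref)
    if ref_energy == 0.0:
        raise ValueError("reference must not be silent")

    # Project the estimate onto the reference: the optimally scaled target.
    alpha = sum(e * r for e, r in zip(est, ref)) / ref_energy
    target = [alpha * r for r in ref]
    residual = [e - t for e, t in zip(est, target)]

    target_energy = sum(t * t for t in target)
    residual_energy = sum(n * n for n in residual)
    if residual_energy == 0.0:
        return math.inf  # estimate is an exact scaled copy of the reference
    if target_energy == 0.0:
        return -math.inf  # estimate is orthogonal to the reference
    return 10.0 * math.log10(target_energy / residual_energy)


def si_sdr_improvement(estimate, mixture, reference):
    """SI-SDR of the estimate minus SI-SDR of the unprocessed mixture, in dB."""
    return si_sdr(estimate, reference) - si_sdr(mixture, reference)


# Usage with toy signals.
reference = [1.0, -1.0, 1.0, -1.0]
estimate = [1.0, -1.0, 1.0, 1.0]
print(si_sdr(estimate, reference))
print(si_sdr([3.0 * x for x in estimate], reference))  # same score: scale-invariant
```

Reporting the improvement over the mixture rather than the raw score makes results easier to compare across test sets whose mixtures differ in difficulty.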

Implications and Future Directions

LibriMix contributes a dataset on which speech separation models generalize better across conditions. It lets researchers train and evaluate models in more realistic acoustic environments, aligning more closely with practical applications. The release of this dataset could stimulate further advances in separation systems, potentially benefiting areas such as automatic transcription, hearing aids, and voice assistants.

Moving forward, the authors suggest the development of a training set with more sparsely overlapping speech mixtures and a broader diversity of noise samples. Such advancements would continue to improve the robustness and applicability of speech separation technologies.

In conclusion, LibriMix offers an open-source alternative to wsj0-2mix and WHAM! that addresses generalization challenges and encourages more comprehensive evaluation frameworks. Its use may lead to substantial progress in the applicability and performance of speech separation models across real-world conditions.