GASS: Generalizing Audio Source Separation with Large-scale Data (2310.00140v1)

Published 29 Sep 2023 in cs.SD, cs.AI, cs.LG, eess.AS, and eess.SP

Abstract: Universal source separation targets at separating the audio sources of an arbitrary mix, removing the constraint to operate on a specific domain like speech or music. Yet, the potential of universal source separation is limited because most existing works focus on mixes with predominantly sound events, and small training datasets also limit its potential for supervised learning. Here, we study a single general audio source separation (GASS) model trained to separate speech, music, and sound events in a supervised fashion with a large-scale dataset. We assess GASS models on a diverse set of tasks. Our strong in-distribution results show the feasibility of GASS models, and the competitive out-of-distribution performance in sound event and speech separation shows its generalization abilities. Yet, it is challenging for GASS models to generalize for separating out-of-distribution cinematic and music content. We also fine-tune GASS models on each dataset and consistently outperform the ones without pre-training. All fine-tuned models (except the music separation one) obtain state-of-the-art results in their respective benchmarks.


Summary

  • The paper presents a unified approach that leverages a dataset of 15,499 hours to separate audio sources across varied domains.
  • It employs waveform-based, STFT-based, and band-splitting architectures optimized using permutation invariant training and logarithmic-MSE loss.
  • The study shows robust in-distribution performance and improved out-of-distribution generalization, highlighting the value of large-scale data for fine-tuning in new scenarios.

Generalizing Audio Source Separation with Large-Scale Data (GASS)

Introduction to the Study

The field of audio source separation has witnessed significant advancements, yet the domain-specific constraints of most existing studies—focusing on speech or music—limit the applicability of these models to broader scenarios. In light of this limitation, the research presented by Jordi Pons et al. introduces an ambitious approach aiming for universal audio source separation. This endeavor is noteworthy for its attempt to create a unified model capable of disentangling various audio sources from a mix without prior domain-specific knowledge.

Dataset Creation and Model Training

Large-Scale Source Separation Dataset

A pivotal aspect of this paper is the creation of a significantly enlarged dataset intended for audio source separation training. The dataset encompasses 15,499 hours of recordings, spanning speech, music, and sound events—representing a substantial leap over existing datasets in terms of volume.

  • The dataset is categorized into speech, sound events (both foreground and background), and music (both foreground and background), with distinct control over the gains of each category to simulate realistic mixes of foreground and background sounds; a minimal mixing sketch follows this list.
  • A noteworthy approach in dataset compilation is the consideration of background sounds as a single, combined source, aiming for a balance between the separation of dominant sources and the practicality of not over-separating ambient sounds.
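
As an illustration of the mixing scheme just described, the sketch below draws random gains per category and collapses all background stems into a single combined target. The gain ranges, function name, and exact recipe are assumptions for illustration, not the authors' actual data pipeline.

```python
import numpy as np

def make_mixture(foreground, background, fg_gain_db=(-5.0, 5.0), bg_gain_db=(-20.0, -5.0), rng=None):
    """Sketch of foreground/background mixing with per-category gain control.

    foreground: list of 1-D arrays, one per dominant source.
    background: list of 1-D arrays, summed into a single combined source.
    Gain ranges (in dB) are illustrative placeholders.
    """
    rng = rng or np.random.default_rng()
    fg_gain = 10.0 ** (rng.uniform(*fg_gain_db) / 20.0)
    bg_gain = 10.0 ** (rng.uniform(*bg_gain_db) / 20.0)

    fg_scaled = [fg_gain * s for s in foreground]
    bg_combined = bg_gain * np.sum(background, axis=0)   # background kept as one combined source

    mixture = np.sum(fg_scaled, axis=0) + bg_combined
    targets = np.stack(fg_scaled + [bg_combined])        # training targets: the sources to recover
    return mixture, targets
```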

Models and Training

The paper explores the efficacy of three distinct models: TDANet-Wav, TDANet-STFT, and BSRNN. Each model is trained on the large-scale dataset with permutation invariant training (PIT) and a logarithmic-MSE loss as the primary optimization strategy; a minimal sketch of this objective follows the model descriptions below.

  • TDANet-Wav and TDANet-STFT utilize waveform-based and STFT-based approaches, respectively, adapting the encoder-decoder architecture for audio separation tasks.
  • BSRNN, originally proposed for music source separation, leverages band-splitting in the frequency domain, illustrating how a domain-tailored architecture fares within the broader quest for universal separation capabilities.
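
To make the training objective concrete, here is a minimal PyTorch sketch of permutation invariant training with a logarithmic-MSE loss. The exact formulation (scaling, epsilon, permutation search strategy) is an assumption based on the description above, not the authors' reference implementation.

```python
import itertools
import torch

def log_mse_loss(est, ref, eps=1e-8):
    """Logarithmic MSE (in dB) between estimate and reference waveforms of shape (batch, time)."""
    return 10.0 * torch.log10(torch.mean((est - ref) ** 2, dim=-1) + eps)

def pit_log_mse(estimates, references):
    """Permutation invariant training: score every source assignment, keep the best one.

    estimates, references: tensors of shape (batch, n_sources, time).
    Brute-force permutation search is viable for the small source counts typical here.
    """
    n_src = references.shape[1]
    per_perm_losses = []
    for perm in itertools.permutations(range(n_src)):
        # Mean log-MSE over sources under this particular estimate-to-reference assignment.
        loss = torch.stack(
            [log_mse_loss(estimates[:, i], references[:, j]) for i, j in enumerate(perm)],
            dim=1,
        ).mean(dim=1)
        per_perm_losses.append(loss)
    # Pick the lowest-loss permutation independently for each example in the batch.
    return torch.stack(per_perm_losses, dim=1).min(dim=1).values.mean()
```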

Evaluation and Findings

In-Distribution Separation Performance

The models demonstrate strong in-distribution performance across speech, sound event, and music separation tasks. Their capacity to generalize across different source types without specific prior knowledge underscores the potential of large-scale data for building versatile audio separation models; a sketch of a commonly used evaluation metric appears after the list below.

  • Each model shows particular strengths in different separation tasks, suggesting the importance of architectural choices in relation to the target audio sources.
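
This summary does not name the evaluation metric, but separation quality in this literature is commonly reported as scale-invariant SDR (SI-SDR) in dB. The sketch below is a generic SI-SDR implementation for single-channel waveforms, offered as an illustration rather than the paper's exact evaluation protocol.

```python
import torch

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for 1-D waveforms (higher is better)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to remove any scale mismatch.
    alpha = torch.dot(estimate, reference) / (torch.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * torch.log10(torch.dot(target, target) / (torch.dot(noise, noise) + eps))
```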

Out-of-Distribution Generalization

When evaluated on out-of-distribution datasets—for instance, FUSS for universal sound separation and Libri2Mix for speech separation—the pre-trained models show varying degrees of generalizability.

  • The fine-tuning process consistently improves performance across these external datasets, reiterating the models' adaptability to novel separation challenges not encountered during the initial training phase; a minimal fine-tuning sketch follows below.
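
In practice, a fine-tuning run of this kind amounts to resuming optimization of the pre-trained checkpoint on the target-domain dataset. The loop below is a generic sketch; the learning rate, step count, and loss-function hand-off are placeholders, not values from the paper.

```python
import torch

def fine_tune(model, loader, loss_fn, lr=1e-4, max_steps=10_000, device="cuda"):
    """Generic fine-tuning loop for a pre-trained separation model (illustrative values)."""
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    while step < max_steps:
        for mixture, targets in loader:
            mixture, targets = mixture.to(device), targets.to(device)
            estimates = model(mixture)          # expected shape: (batch, n_sources, time)
            loss = loss_fn(estimates, targets)  # e.g. the PIT log-MSE sketch above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= max_steps:
                break
    return model
```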

Concluding Remarks and Future Directions

This research marks a significant stride towards general audio source separation, leveraging unprecedented scale in data to train models capable of broad applicability across domain barriers. The successful application of these models to diverse separation tasks, together with their fine-tuning adaptability, sets a promising precedent for future explorations in the field.

The paper also uncovers challenges and limitations, particularly in the separation of out-of-distribution cinematic and music content, posing questions for future research focus. The fine-tuning successes, however, highlight a path forward wherein pre-trained universal models can be specialized through targeted adaptation, marrying the benefits of large-scale data training with the flexibility needed for specific audio separation tasks.

As the field progresses, further enriching the diversity of training data and exploring model architectures tailored to harness this diversity will be crucial. This work not only contributes a valuable dataset and insights into model training for universal audio source separation but also invites continued innovation towards more holistic and robust solutions in the complex landscape of audio processing.