GASS: Generalizing Audio Source Separation with Large-scale Data (2310.00140v1)

Published 29 Sep 2023 in cs.SD, cs.AI, cs.LG, eess.AS, and eess.SP

Abstract: Universal source separation targets at separating the audio sources of an arbitrary mix, removing the constraint to operate on a specific domain like speech or music. Yet, the potential of universal source separation is limited because most existing works focus on mixes with predominantly sound events, and small training datasets also limit its potential for supervised learning. Here, we study a single general audio source separation (GASS) model trained to separate speech, music, and sound events in a supervised fashion with a large-scale dataset. We assess GASS models on a diverse set of tasks. Our strong in-distribution results show the feasibility of GASS models, and the competitive out-of-distribution performance in sound event and speech separation shows its generalization abilities. Yet, it is challenging for GASS models to generalize for separating out-of-distribution cinematic and music content. We also fine-tune GASS models on each dataset and consistently outperform the ones without pre-training. All fine-tuned models (except the music separation one) obtain state-of-the-art results in their respective benchmarks.


Summary

  • The paper presents a unified approach that leverages a dataset of 15,499 hours to separate audio sources across varied domains.
  • It employs waveform-based, STFT-based, and band-splitting architectures optimized using permutation invariant training and logarithmic-MSE loss.
  • The study shows robust in-distribution performance and improved out-of-distribution generalization, highlighting the value of large-scale data for fine-tuning in new scenarios.

Generalizing Audio Source Separation with Large-Scale Data (GASS)

Introduction to the Study

The field of audio source separation has witnessed significant advancements, yet the domain-specific constraints of most existing studies—focusing on speech or music—limit the applicability of these models to broader scenarios. In light of this limitation, the research presented by Jordi Pons et al. introduces an ambitious approach aiming for universal audio source separation. This endeavor is noteworthy for its attempt to create a unified model capable of disentangling various audio sources from a mix without prior domain-specific knowledge.

Dataset Creation and Model Training

Large-Scale Source Separation Dataset

A pivotal aspect of this paper is the creation of a significantly enlarged dataset intended for audio source separation training. The dataset encompasses 15,499 hours of recordings, spanning speech, music, and sound events—representing a substantial leap over existing datasets in terms of volume.

  • The dataset is categorized into speech, sound events (both foreground and background), and music (both foreground and background), with distinct control over the gains of each category to simulate realistic mixes of foreground and background sounds; a minimal mixing sketch follows this list.
  • A noteworthy approach in dataset compilation is the consideration of background sounds as a single, combined source, aiming for a balance between the separation of dominant sources and the practicality of not over-separating ambient sounds.
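
As an illustration of the mixing scheme just described, the sketch below draws random gains per category and collapses all background stems into a single combined target. The gain ranges, function name, and exact recipe are assumptions for illustration, not the authors' actual data pipeline.

```python
import numpy as np

def make_mixture(foreground, background, fg_gain_db=(-5.0, 5.0), bg_gain_db=(-20.0, -5.0), rng=None):
    """Sketch of foreground/background mixing with per-category gain control.

    foreground: list of 1-D arrays, one per dominant source.
    background: list of 1-D arrays, summed into a single combined source.
    Gain ranges (in dB) are illustrative placeholders.
    """
    rng = rng or np.random.default_rng()
    fg_gain = 10.0 ** (rng.uniform(*fg_gain_db) / 20.0)
    bg_gain = 10.0 ** (rng.uniform(*bg_gain_db) / 20.0)

    fg_scaled = [fg_gain * s for s in foreground]
    bg_combined = bg_gain * np.sum(background, axis=0)   # background kept as one combined source

    mixture = np.sum(fg_scaled, axis=0) + bg_combined
    targets = np.stack(fg_scaled + [bg_combined])        # training targets: the sources to recover
    return mixture, targets
```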

Models and Training

The paper explores the efficacy of three distinct models: TDANet-Wav, TDANet-STFT, and BSRNN. Each model is trained on the large-scale dataset with permutation invariant training (PIT) and a logarithmic-MSE loss as the primary optimization strategy; a minimal sketch of this objective follows the model descriptions below.

  • TDANet-Wav and TDANet-STFT utilize waveform-based and STFT-based approaches, respectively, adapting the encoder-decoder architecture for audio separation tasks.
  • BSRNN, originally proposed for music source separation, leverages band-splitting in the frequency domain, illustrating how a domain-tailored architecture fares within the broader quest for universal separation capabilities.
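
To make the training objective concrete, here is a minimal PyTorch sketch of permutation invariant training with a logarithmic-MSE loss. The exact formulation (scaling, epsilon, permutation search strategy) is an assumption based on the description above, not the authors' reference implementation.

```python
import itertools
import torch

def log_mse_loss(est, ref, eps=1e-8):
    """Logarithmic MSE (in dB) between estimate and reference waveforms of shape (batch, time)."""
    return 10.0 * torch.log10(torch.mean((est - ref) ** 2, dim=-1) + eps)

def pit_log_mse(estimates, references):
    """Permutation invariant training: score every source assignment, keep the best one.

    estimates, references: tensors of shape (batch, n_sources, time).
    Brute-force permutation search is viable for the small source counts typical here.
    """
    n_src = references.shape[1]
    per_perm_losses = []
    for perm in itertools.permutations(range(n_src)):
        # Mean log-MSE over sources under this particular estimate-to-reference assignment.
        loss = torch.stack(
            [log_mse_loss(estimates[:, i], references[:, j]) for i, j in enumerate(perm)],
            dim=1,
        ).mean(dim=1)
        per_perm_losses.append(loss)
    # Pick the lowest-loss permutation independently for each example in the batch.
    return torch.stack(per_perm_losses, dim=1).min(dim=1).values.mean()
```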

Evaluation and Findings

In-Distribution Separation Performance

The models demonstrate strong in-distribution performance across speech, sound event, and music separation tasks. Their capacity to generalize across different source types without specific prior knowledge underscores the potential of large-scale data for building versatile audio separation models; a sketch of a commonly used evaluation metric appears after the list below.

  • Each model shows particular strengths in different separation tasks, suggesting the importance of architectural choices in relation to the target audio sources.
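
This summary does not name the evaluation metric, but separation quality in this literature is commonly reported as scale-invariant SDR (SI-SDR) in dB. The sketch below is a generic SI-SDR implementation for single-channel waveforms, offered as an illustration rather than the paper's exact evaluation protocol.

```python
import torch

def si_sdr(estimate, reference, eps=1e-8):
    """Scale-invariant SDR in dB for 1-D waveforms (higher is better)."""
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to remove any scale mismatch.
    alpha = torch.dot(estimate, reference) / (torch.dot(reference, reference) + eps)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * torch.log10(torch.dot(target, target) / (torch.dot(noise, noise) + eps))
```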

Out-of-Distribution Generalization

When evaluated on out-of-distribution datasets—for instance, FUSS for universal sound separation and Libri2Mix for speech separation—the pre-trained models show varying degrees of generalizability.

  • The fine-tuning process consistently improves performance across these external datasets, reiterating the models' adaptability to novel separation challenges not encountered during the initial training phase; a minimal fine-tuning sketch follows below.
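
In practice, a fine-tuning run of this kind amounts to resuming optimization of the pre-trained checkpoint on the target-domain dataset. The loop below is a generic sketch; the learning rate, step count, and loss-function hand-off are placeholders, not values from the paper.

```python
import torch

def fine_tune(model, loader, loss_fn, lr=1e-4, max_steps=10_000, device="cuda"):
    """Generic fine-tuning loop for a pre-trained separation model (illustrative values)."""
    model = model.to(device).train()
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    step = 0
    while step < max_steps:
        for mixture, targets in loader:
            mixture, targets = mixture.to(device), targets.to(device)
            estimates = model(mixture)          # expected shape: (batch, n_sources, time)
            loss = loss_fn(estimates, targets)  # e.g. the PIT log-MSE sketch above
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= max_steps:
                break
    return model
```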

Concluding Remarks and Future Directions

This research marks a significant stride towards general audio source separation, leveraging unprecedented scale in data to train models capable of broad applicability across domain barriers. The successful application of these models to diverse separation tasks, together with their fine-tuning adaptability, sets a promising precedent for future explorations in the field.

The paper also uncovers challenges and limitations, particularly in the separation of out-of-distribution cinematic and music content, posing questions for future research focus. The fine-tuning successes, however, highlight a path forward wherein pre-trained universal models can be specialized through targeted adaptation, marrying the benefits of large-scale data training with the flexibility needed for specific audio separation tasks.

As the field progresses, further enriching the diversity of training data and exploring model architectures tailored to harness this diversity will be crucial. This work not only contributes a valuable dataset and insights into model training for universal audio source separation but also invites continued innovation towards more holistic and robust solutions in the complex landscape of audio processing.