GASS: Generalizing Audio Source Separation with Large-scale Data (2310.00140v1)
Abstract: Universal source separation aims to separate the audio sources of an arbitrary mix, removing the constraint of operating on a specific domain such as speech or music. Yet its potential remains limited: most existing works focus on mixes dominated by sound events, and small training datasets constrain supervised learning. Here, we study a single general audio source separation (GASS) model trained to separate speech, music, and sound events in a supervised fashion with a large-scale dataset. We assess GASS models on a diverse set of tasks. Strong in-distribution results show the feasibility of GASS models, and competitive out-of-distribution performance in sound event and speech separation demonstrates their generalization abilities. However, GASS models struggle to generalize to out-of-distribution cinematic and music content. We also fine-tune GASS models on each dataset, consistently outperforming counterparts trained without pre-training. All fine-tuned models (except the music separation one) achieve state-of-the-art results on their respective benchmarks.
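Separation quality in this literature is typically reported as scale-invariant SDR, the metric proposed in the cited "SDR–Half-Baked or Well Done?" paper. As a hedged illustration (not the paper's own evaluation code), a minimal NumPy implementation looks like this:

```python
import numpy as np

def si_sdr(estimate: np.ndarray, reference: np.ndarray) -> float:
    """Scale-invariant signal-to-distortion ratio in dB.

    Follows the common formulation: project the estimate onto the
    reference to get a scaled target, then compare target vs. residual.
    """
    # Remove DC offsets so the projection is well defined.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Optimal scaling of the reference toward the estimate.
    alpha = np.dot(estimate, reference) / np.dot(reference, reference)
    target = alpha * reference
    noise = estimate - target
    return 10.0 * np.log10(np.sum(target**2) / np.sum(noise**2))
```

Because the metric is invariant to rescaling the estimate, `si_sdr(c * est, ref)` equals `si_sdr(est, ref)` for any nonzero constant `c`, which is why it is preferred over plain SDR when separators do not preserve absolute gain.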
- “An Efficient Encoder-Decoder Architecture with Top-down Attention for Speech Separation,” in ICLR, 2023.
- “Hybrid Transformers for Music Source Separation,” in ICASSP, 2023, pp. 1–5.
- Y. Luo and J. Yu, “Music Source Separation With Band-Split RNN,” IEEE/ACM TASLP, 2023.
- “Universal Sound Separation,” in WASPAA, 2019, pp. 175–179.
- “What’s All the Fuss About Free Universal Sound Separation Data?,” in ICASSP, 2021, pp. 186–190.
- “Compute and Memory Efficient Universal Sound Source Separation,” Journal of Signal Processing Systems, pp. 245–259, 2022.
- “Improving Universal Sound Separation Using Sound Classification,” in ICASSP, 2020, pp. 96–100.
- “Adversarial Permutation Invariant Training for Universal Sound Separation,” in ICASSP, 2023, pp. 1–5.
- “Universal Source Separation with Weakly Labelled Data,” arXiv:2305.07447, 2023.
- “Separate Anything You Describe,” arXiv:2308.05037, 2023.
- “Sparse, Efficient, and Semantic Mixture Invariant Training: Taming In-the-wild Unsupervised Sound Separation,” in WASPAA, 2021, pp. 51–55.
- “CLIPSep: Learning Text-queried Sound Separation with Noisy Unlabeled Videos,” ICLR, 2023.
- “Unsupervised Sound Separation Using Mixture Invariant Training,” NeurIPS, vol. 33, pp. 3846–3857, 2020.
- “An Empirical Study of Conv-TasNet,” in ICASSP, 2020, pp. 7264–7268.
- “LibriMix: An Open-Source Dataset for Generalizable Speech Separation,” arXiv:2005.11262, 2020.
- “HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units,” IEEE/ACM TASLP, vol. 29, pp. 3451–3460, 2021.
- “Looking to Listen at the Cocktail Party: A Speaker-Independent Audio-Visual Model for Speech Separation,” in SIGGRAPH, 2018.
- “CSTR VCTK Corpus: English Multi-speaker Corpus for CSTR Voice Cloning Toolkit,” 2017.
- G. J. Mysore, “Can We Automatically Transform Speech Recorded on Common Consumer Devices in Real-World Environments into Professional Production Quality Speech?—A Dataset, Insights, and Challenges,” IEEE Signal Processing Letters, vol. 22, no. 8, pp. 1006–1010, 2014.
- J. S. Garofolo et al., “TIMIT Acoustic Phonetic Continuous Speech Corpus,” Linguistic Data Consortium, 1993.
- “WHAM!: Extending Speech Separation to Noisy Environments,” in INTERSPEECH, 2019, pp. 1368–1372.
- “The Diverse Environments Multi-Channel Acoustic Noise Database (DEMAND): A Database of Multichannel Environmental Noise Recordings,” in Meetings on Acoustics, 2013.
- “Cutting Music Source Separation Some Slakh: A Dataset to Study the Impact of Training Data Quality and Quantity,” in WASPAA, 2019, pp. 45–49.
- O. Gillet and G. Richard, “ENST-Drums: An Extensive Audio-Visual Database for Drum Signals Processing,” in ISMIR, 2006.
- “VocalSet: A Singing Voice Dataset,” in ISMIR, 2018, pp. 468–474.
- “Automatic Identification of Emotional Cues in Chinese Opera Singing,” in ICMPC, 2014.
- “The Sound of Pixels,” in ECCV, 2018, pp. 570–586.
- “EGFxSet: Electric Guitar Tones Processed Through Real Effects of Distortion, Modulation, Delay and Reverb,” in ISMIR, 2022.
- “High Fidelity Speech Enhancement with Band-split RNN,” in INTERSPEECH, 2023, pp. 2483–2487.
- “Multitalker Speech Separation with Utterance-level Permutation Invariant Training of Deep Recurrent Neural Networks,” IEEE/ACM TASLP, vol. 25, no. 10, pp. 1901–1913, 2017.
- “FSD50K: An Open Dataset of Human-Labeled Sound Events,” IEEE/ACM TASLP, vol. 30, pp. 829–852, 2021.
- “Librispeech: An ASR Corpus Based on Public Domain Audio Books,” in ICASSP, 2015, pp. 5206–5210.
- “The Cocktail Fork Problem: Three-Stem Audio Separation for Real-world Soundtracks,” in ICASSP, 2022, pp. 526–530.
- “MUSDB18-HQ - An Uncompressed Version of MUSDB18,” 2019.
- “SDR–Half-Baked or Well Done?,” in ICASSP, 2019, pp. 626–630.
- “The 2018 Signal Separation Evaluation Campaign,” in LVA/ICA, 2018, pp. 293–305.