Universal Source Separation with Weakly Labelled Data (2305.07447v1)

Published 11 May 2023 in cs.SD and eess.AS

Abstract: Universal source separation (USS) is a fundamental research task for computational auditory scene analysis, which aims to separate mono recordings into individual source tracks. There are three potential challenges awaiting the solution to the audio source separation task. First, previous audio source separation systems mainly focus on separating one or a limited number of specific sources. There is a lack of research on building a unified system that can separate arbitrary sources via a single model. Second, most previous systems require clean source data to train a separator, while clean source data are scarce. Third, there is a lack of USS system that can automatically detect and separate active sound classes in a hierarchical level. To use large-scale weakly labeled/unlabeled audio data for audio source separation, we propose a universal audio source separation framework containing: 1) an audio tagging model trained on weakly labeled data as a query net; and 2) a conditional source separation model that takes query net outputs as conditions to separate arbitrary sound sources. We investigate various query nets, source separation models, and training strategies and propose a hierarchical USS strategy to automatically detect and separate sound classes from the AudioSet ontology. By solely leveraging the weakly labelled AudioSet, our USS system is successful in separating a wide variety of sound classes, including sound event separation, music source separation, and speech enhancement. The USS system achieves an average signal-to-distortion ratio improvement (SDRi) of 5.57 dB over 527 sound classes of AudioSet; 10.57 dB on the DCASE 2018 Task 2 dataset; 8.12 dB on the MUSDB18 dataset; an SDRi of 7.28 dB on the Slakh2100 dataset; and an SSNR of 9.00 dB on the voicebank-demand dataset. We release the source code at https://github.com/bytedance/uss

Authors (7)
  1. Qiuqiang Kong (86 papers)
  2. Ke Chen (241 papers)
  3. Haohe Liu (59 papers)
  4. Xingjian Du (25 papers)
  5. Taylor Berg-Kirkpatrick (106 papers)
  6. Shlomo Dubnov (40 papers)
  7. Mark D. Plumbley (114 papers)
Citations (14)

Summary

  • The paper presents a novel framework that leverages weakly labelled data to isolate arbitrary sound sources from complex audio recordings.
  • It employs a three-part system: sound event detection, audio tagging for condition vectors, and a conditional source separation module.
  • The approach shows practical feasibility across diverse datasets, highlighting its potential for scalable applications in speech enhancement and music separation.

Source Separation with Weakly Labelled Data: An Academic Overview

The paper presents a framework for universal source separation (USS) that relies on weakly labelled data within the context of computational auditory scene analysis (CASA). The work addresses the challenge of separating arbitrary sound classes from complex mono recordings via a single model. Traditional approaches typically target one specific sound type, such as speech or music, and depend heavily on strongly labelled clean source data, which are scarce and costly to obtain. The proposed methodology circumvents these limitations by leveraging weakly labelled data from the large-scale AudioSet database to develop a more versatile source separation system.

Framework and Methodology

The proposed framework integrates three primary systems (a minimal sketch of how they connect follows the list):

  1. Sound Event Detection (SED) System: This component is trained to identify audio segments containing potential sound events using weak labels.
  2. Audio Tagging System: It predicts soft labels to create a condition vector, providing probabilistic information on the presence of sound classes within audio segments.
  3. Conditional Source Separation System: Driven by the condition vector, this system isolates target sources from the mixed audio based on the detected sound classes.
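
To make the data flow concrete, here is a minimal PyTorch sketch of how a query net and a conditioned separator could be wired together. This is not the authors' exact architecture (the paper investigates several query nets, separators, and training strategies); the layer sizes, the FiLM-style modulation, and all shapes below are illustrative assumptions.

```python
# Sketch: an audio tagging "query net" produces a condition vector, and a
# separator modulated by that vector extracts the target source.
import torch
import torch.nn as nn

class QueryNet(nn.Module):
    """Audio tagging model: log-mel features -> soft class probabilities."""
    def __init__(self, n_mels=64, n_frames=100, n_classes=527):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Flatten(), nn.Linear(n_mels * n_frames, 512), nn.ReLU(),
            nn.Linear(512, n_classes), nn.Sigmoid(),  # soft labels in [0, 1]
        )

    def forward(self, mel):          # mel: (batch, 1, n_frames, n_mels)
        return self.backbone(mel)    # condition vector: (batch, n_classes)

class ConditionalSeparator(nn.Module):
    """Separator whose hidden units are modulated by the condition vector."""
    def __init__(self, n_classes=527, hidden=512):
        super().__init__()
        self.film = nn.Linear(n_classes, 2 * hidden)  # per-channel scale/shift
        self.encoder = nn.Conv1d(1, hidden, kernel_size=16, stride=8)
        self.decoder = nn.ConvTranspose1d(hidden, 1, kernel_size=16, stride=8)

    def forward(self, mixture, condition):  # mixture: (batch, 1, samples)
        h = torch.relu(self.encoder(mixture))
        gamma, beta = self.film(condition).chunk(2, dim=-1)
        h = gamma.unsqueeze(-1) * h + beta.unsqueeze(-1)  # FiLM modulation
        return self.decoder(h)               # estimated source waveform

# Usage: derive a condition from query features, then separate the mixture.
query_net, separator = QueryNet(), ConditionalSeparator()
mel = torch.randn(2, 1, 100, 64)      # dummy log-mel features
mixture = torch.randn(2, 1, 32000)    # dummy 2 s mono waveform @ 16 kHz
condition = query_net(mel)
estimate = separator(mixture, condition)  # (2, 1, 32000)
```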

This approach demonstrates the potential for building a general source separation system without requiring clean source data or dataset-specific training. By conditioning on weakly labelled inputs, the framework proves capable of handling tasks such as AudioSet separation, speech enhancement, and music source separation.

Results and Observations

The system is trained solely on the weakly labelled AudioSet and evaluated across a wide range of benchmarks. It achieves an average signal-to-distortion ratio improvement (SDRi) of 5.57 dB over the 527 sound classes of AudioSet, 10.57 dB on the DCASE 2018 Task 2 dataset, 8.12 dB on MUSDB18, and 7.28 dB on Slakh2100, along with a segmental SNR (SSNR) of 9.00 dB on Voicebank-Demand. These results indicate the practical feasibility of a single weakly labelled, universal source separation system across diverse audio types, spanning sound event separation, music source separation, and speech enhancement.
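
For reference, SDRi is the SDR of the separated estimate minus the SDR of the unprocessed mixture, each measured against the clean reference. Below is a minimal NumPy sketch of this standard definition (not code from the paper's repository):

```python
import numpy as np

def sdr(reference, estimate, eps=1e-10):
    """Signal-to-distortion ratio in dB."""
    num = np.sum(reference ** 2)
    den = np.sum((reference - estimate) ** 2)
    return 10 * np.log10(num / (den + eps) + eps)

def sdr_improvement(reference, mixture, estimate):
    """SDRi: how much the separator improves over doing nothing."""
    return sdr(reference, estimate) - sdr(reference, mixture)

# Toy example: a target plus interfering noise, and an imperfect estimate.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
mixture = target + 0.5 * rng.standard_normal(16000)
estimate = target + 0.1 * rng.standard_normal(16000)
print(f"SDRi: {sdr_improvement(target, mixture, estimate):.2f} dB")
```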

Implications and Future Directions

The implications of this research are broad and impactful, particularly in the realms of automated audio processing and machine listening systems. The reliance on weakly labelled data not only widens the applicability of source separation technologies but also reduces the dependency on resource-intensive data labelling processes. In the context of CASA, this framework introduces an adaptable method to segregate mixed audio streams in real-world scenarios, enhancing applications in speech enhancement and multimedia content analysis.

Looking forward, further exploration may focus on integrating more sophisticated neural architectures, such as transformers, to enhance model performance. Additionally, refining the audio tagging and SED components to improve accuracy in condition vector generation could result in more precise separation outputs. Exploring zero-shot audio separation using embeddings might also open avenues for more flexible and generalized systems capable of adapting to unseen audio classes.
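
One hypothetical shape such a zero-shot extension could take is query-by-example conditioning: embed a few clips of an unseen class and average them into a condition vector used in place of class probabilities. The encoder and dimensions below are invented for illustration and are not from the paper:

```python
import torch
import torch.nn as nn

# Hypothetical query encoder: log-mel features -> 128-dim embedding.
embed = nn.Sequential(nn.Flatten(), nn.Linear(64 * 100, 128))

def make_condition(query_mel):
    """Average embeddings of a few query examples into one condition vector."""
    e = embed(query_mel)                   # (n_examples, 128)
    e = e / e.norm(dim=-1, keepdim=True)   # unit-normalise each embedding
    return e.mean(dim=0, keepdim=True)     # (1, 128) condition vector

query_mel = torch.randn(3, 1, 100, 64)     # three clips of an unseen class
condition = make_condition(query_mel)      # feed to a separator conditioned
                                           # as in the earlier FiLM sketch
```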

Overall, this research presents a significant step toward achieving effective universal source separation, advancing both theoretical understanding and practical implementations in audio signal processing.
