
Separate Anything You Describe (2308.05037v3)

Published 9 Aug 2023 in eess.AS, cs.AI, cs.MM, and cs.SD

Abstract: Language-queried audio source separation (LASS) is a new paradigm for computational auditory scene analysis (CASA). LASS aims to separate a target sound from an audio mixture given a natural language query, which provides a natural and scalable interface for digital audio applications. Recent works on LASS, despite attaining promising separation performance on specific sources (e.g., musical instruments, limited classes of audio events), are unable to separate audio concepts in the open domain. In this work, we introduce AudioSep, a foundation model for open-domain audio source separation with natural language queries. We train AudioSep on large-scale multimodal datasets and extensively evaluate its capabilities on numerous tasks including audio event separation, musical instrument separation, and speech enhancement. AudioSep demonstrates strong separation performance and impressive zero-shot generalization ability using audio captions or text labels as queries, substantially outperforming previous audio-queried and language-queried sound separation models. For reproducibility of this work, we will release the source code, evaluation benchmark and pre-trained model at: https://github.com/Audio-AGI/AudioSep.

Authors (10)
  1. Xubo Liu (66 papers)
  2. Qiuqiang Kong (86 papers)
  3. Yan Zhao (120 papers)
  4. Haohe Liu (59 papers)
  5. Yi Yuan (54 papers)
  6. Yuzhuo Liu (4 papers)
  7. Rui Xia (53 papers)
  8. Yuxuan Wang (239 papers)
  9. Mark D. Plumbley (114 papers)
  10. Wenwu Wang (148 papers)
Citations (32)

Summary

Separate Anything You Describe: An Expert Overview

The paper "Separate Anything You Describe" introduces AudioSep, a novel framework for open-domain audio source separation using natural language queries. The approach, labeled as language-queried audio source separation (LASS), constitutes a significant stride in computational auditory scene analysis (CASA), addressing limitations in existing models by allowing text-based descriptions to dictate the separation process.

Technical Framework

AudioSep consists of two core components: QueryNet and SeparationNet. QueryNet is the text encoder of CLIP or CLAP, models pre-trained on large multimodal datasets, and maps the natural language query to an embedding. This embedding conditions SeparationNet, a frequency-domain ResUNet that isolates the described sound source from the mixture.
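
One common way to inject such query conditioning into a convolutional separator is feature-wise modulation (FiLM), where the text embedding predicts per-channel scales and shifts. The PyTorch sketch below is purely illustrative: the module name, dimensions, and the choice of FiLM are assumptions made for exposition, not the released AudioSep architecture.

```python
import torch
import torch.nn as nn

class FiLMConvBlock(nn.Module):
    """Conv block whose features are modulated by a query embedding (FiLM-style)."""
    def __init__(self, channels: int, query_dim: int):
        super().__init__()
        self.conv = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        self.norm = nn.BatchNorm2d(channels)
        # Project the text embedding to per-channel scale and shift parameters.
        self.to_scale_shift = nn.Linear(query_dim, 2 * channels)

    def forward(self, x: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
        h = self.norm(self.conv(x))
        scale, shift = self.to_scale_shift(query_emb).chunk(2, dim=-1)
        # Broadcast (batch, channels) over the time-frequency grid.
        h = h * (1 + scale[:, :, None, None]) + shift[:, :, None, None]
        return torch.relu(h)

# Example: 2 spectrogram feature maps conditioned on 512-d text embeddings.
block = FiLMConvBlock(channels=32, query_dim=512)
features = torch.randn(2, 32, 64, 101)   # (batch, channels, freq, time)
query_emb = torch.randn(2, 512)          # e.g. a CLAP/CLIP text-encoder output
out = block(features, query_emb)
print(out.shape)  # torch.Size([2, 32, 64, 101])
```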

The audio mixture is first transformed into a spectrogram, to which SeparationNet applies masks predicted under the guidance of the query embedding, isolating the target sound source. Training minimizes an L1 reconstruction loss, and thanks to pre-training on large-scale datasets the model also performs well in zero-shot settings.
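
The sketch below illustrates this mask-and-reconstruct loop under stated assumptions: `TinyMaskNet` is a toy stand-in for SeparationNet, sigmoid magnitude masking with the mixture phase is one plausible instantiation, and the L1 loss is applied to the reconstructed waveform for concreteness.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

N_FFT, HOP = 1024, 256

class TinyMaskNet(nn.Module):
    """Toy stand-in for SeparationNet: predicts a time-frequency mask from the
    magnitude spectrogram and the query embedding."""
    def __init__(self, n_freq: int, query_dim: int):
        super().__init__()
        self.query_proj = nn.Linear(query_dim, n_freq)
        self.conv = nn.Conv2d(1, 1, kernel_size=3, padding=1)

    def forward(self, mag: torch.Tensor, query_emb: torch.Tensor) -> torch.Tensor:
        # Bias each frequency bin by the projected query, then refine locally.
        h = mag + self.query_proj(query_emb).unsqueeze(-1)
        return self.conv(h.unsqueeze(1)).squeeze(1)

def separate(mixture, mask_net, query_emb):
    """Mask the mixture spectrogram conditioned on the query, invert to audio."""
    window = torch.hann_window(N_FFT)
    spec = torch.stft(mixture, N_FFT, HOP, window=window, return_complex=True)
    mask = torch.sigmoid(mask_net(spec.abs(), query_emb))   # bounded magnitude mask
    est_spec = spec * mask                                   # keep the mixture phase
    return torch.istft(est_spec, N_FFT, HOP, window=window, length=mixture.shape[-1])

# Toy forward/backward pass with random audio and embeddings.
mixture = torch.randn(2, 32000)   # 2 s of 16 kHz audio per example
target = torch.randn(2, 32000)
query_emb = torch.randn(2, 512)   # e.g. a CLAP text-encoder output
net = TinyMaskNet(n_freq=N_FFT // 2 + 1, query_dim=512)
estimate = separate(mixture, net, query_emb)
loss = F.l1_loss(estimate, target)   # L1 reconstruction loss
loss.backward()
```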

Empirical Evaluation

The evaluation uses a benchmark spanning diverse datasets: AudioSet, VGGSound, AudioCaps, Clotho, ESC-50, MUSIC, and Voicebank-DEMAND. Results show that AudioSep substantially outperforms previous models such as LASS-Net and CLIPSep, achieving higher SDR improvement across tasks and underlining its suitability for real-world use.
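
For context, SDR improvement (SDRi) is the standard metric behind such comparisons: the SDR of the separated estimate minus the SDR obtained by leaving the mixture untouched. A minimal NumPy sketch follows; the benchmark's own evaluation scripts may differ in detail.

```python
import numpy as np

def sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-8) -> float:
    """Signal-to-distortion ratio in dB."""
    error = reference - estimate
    return 10.0 * np.log10((np.sum(reference ** 2) + eps) / (np.sum(error ** 2) + eps))

def sdr_improvement(reference: np.ndarray, estimate: np.ndarray, mixture: np.ndarray) -> float:
    """SDRi: gain over simply using the mixture as the estimate."""
    return sdr(reference, estimate) - sdr(reference, mixture)

# Toy example: the estimate removes half of the interference in the mixture.
rng = np.random.default_rng(0)
target = rng.standard_normal(16000)
interference = rng.standard_normal(16000)
mixture = target + interference
estimate = target + 0.5 * interference
print(f"SDRi: {sdr_improvement(target, estimate, mixture):.2f} dB")  # roughly 6 dB
```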

In zero-shot scenarios, such as the ESC-50 and MUSIC datasets, AudioSep delivers strong separation quality, indicating that the model generalizes exceptionally well. This is particularly valuable for applications where predefined categories and extensive target-domain data are unavailable.

Comparative Analysis

AudioSep stands out against conventional audio-queried models such as USS-ResUNet30 and off-the-shelf separation systems, offering superior performance while requiring only a text description of the target source. Unlike its predecessors, AudioSep accommodates a wider range of sound sources and conditions thanks to its training on large-scale multimodal datasets.

Multimodal Supervision Insights

The paper also explores scaling up AudioSep with multimodal supervision, examining how the ratio of audio-text to visual-text training data affects performance. It concludes that additional visual data does not substantially improve the model, whereas reliable text supervision remains crucial, especially given the label noise inherent in large-scale video data.

Implications and Future Directions

The implications of AudioSep's capabilities are vast, offering enhanced accessibility and flexibility in audio content manipulation for diverse applications, from music production to media content retrieval. The paper suggests potential expansions into unsupervised learning techniques and multimodal query separations, promising to further broaden AudioSep's utility.

In conclusion, the paper presents a well-rounded, empirical exploration of language-driven auditory separation. AudioSep's enhanced adaptability and performance demonstrate a substantial advancement in AI-driven CASA, laying groundwork for future innovations in audio processing and multimodal integration.
