
FlowSep: Language-Queried Sound Separation with Rectified Flow Matching (2409.07614v3)

Published 11 Sep 2024 in cs.SD and eess.AS

Abstract: Language-queried audio source separation (LASS) focuses on separating sounds using textual descriptions of the desired sources. Current methods mainly use discriminative approaches, such as time-frequency masking, to separate target sounds and minimize interference from other sources. However, these models face challenges when separating overlapping soundtracks, which may lead to artifacts such as spectral holes or incomplete separation. Rectified flow matching (RFM), a generative model that establishes linear relations between the distribution of data and noise, offers superior theoretical properties and simplicity, but has not yet been explored in sound separation. In this work, we introduce FlowSep, a new generative model based on RFM for LASS tasks. FlowSep learns linear flow trajectories from noise to target source features within the variational autoencoder (VAE) latent space. During inference, the RFM-generated latent features are reconstructed into a mel-spectrogram via the pre-trained VAE decoder, followed by a pre-trained vocoder to synthesize the waveform. Trained on 1,680 hours of audio data, FlowSep outperforms the state-of-the-art models across multiple benchmarks, as evaluated with subjective and objective metrics. Additionally, our results show that FlowSep surpasses a diffusion-based LASS model in both separation quality and inference efficiency, highlighting its strong potential for audio source separation tasks. Code, pre-trained models and demos can be found at: https://audio-agi.github.io/FlowSep_demo/ .


Summary

  • The paper introduces FlowSep, which applies Rectified Flow Matching within a generative framework to enhance language-queried sound separation.
  • It combines a FLAN-T5 text encoder, a VAE latent space, and a BigVGAN vocoder to reduce artifacts and improve signal reconstruction.
  • Experiments on 1,680 hours of audio show that FlowSep achieves lower FAD scores and higher CLAPScores compared to state-of-the-art approaches.

FlowSep: Language-Queried Sound Separation with Rectified Flow Matching

Overview

This paper introduces FlowSep, a novel language-queried audio source separation system. FlowSep leverages Rectified Flow Matching (RFM) within a generative architecture to address limitations of the discriminative models traditionally used for Language-Queried Audio Source Separation (LASS). Its methodology, experimental results, and benchmark performance point to both theoretical and practical value.

Methodology

The proposed FlowSep model combines four key components:

  1. FLAN-T5 Encoder: A pre-trained FLAN-T5 encoder converts textual queries into embeddings. This choice follows evidence from related audio-language tasks that FLAN-T5 embeddings outperform alternatives such as CLAP text embeddings.
  2. VAE Encoder and Decoder: A variational autoencoder (VAE) compresses mel-spectrograms into a compact latent space and reconstructs mel-spectrograms from generated latents; waveform synthesis is handled by the vocoder rather than the VAE.
  3. RFM-Based Latent Feature Generator: The core of FlowSep's contribution is using RFM to learn straight flow trajectories from noise to target-source features within the VAE's latent space (a minimal training sketch follows this list). This approach reduces the artifacts typically associated with discriminative models, such as spectral holes and incomplete separation.
  4. BigVGAN Vocoder: A state-of-the-art GAN-based vocoder (BigVGAN) is employed to generate high-quality waveforms from the reconstructed mel-spectrograms.
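
To make the RFM objective concrete, here is a minimal training-step sketch. The `velocity_model` callable, tensor shapes, and conditioning interface are illustrative assumptions rather than the authors' code; in particular, any conditioning on the mixture audio is folded into the model call for brevity.

```python
import torch
import torch.nn.functional as F

def rfm_training_step(velocity_model, z1, text_emb):
    """One rectified-flow-matching step on a batch of VAE latents.

    velocity_model: assumed network predicting velocity from (z_t, t, text_emb)
    z1:             target-source latents from the VAE encoder
    text_emb:       FLAN-T5 embeddings of the language query
    """
    z0 = torch.randn_like(z1)                      # noise endpoint of the flow
    t = torch.rand(z1.shape[0], device=z1.device)  # uniform flow times in [0, 1]
    t_b = t.view(-1, *([1] * (z1.dim() - 1)))      # broadcast t over latent dims
    z_t = (1.0 - t_b) * z0 + t_b * z1              # point on the straight path
    v_target = z1 - z0                             # constant velocity of that path
    v_pred = velocity_model(z_t, t, text_emb)      # condition on time and query
    return F.mse_loss(v_pred, v_target)            # regress predicted velocity
```

Because the target velocity is constant along each straight path, the regression target is simple and the learned field can be integrated accurately with few steps.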

Performance Evaluation

FlowSep was trained on an extensive dataset of 1,680 hours of audio. Evaluation across five benchmarks demonstrates superior performance compared to state-of-the-art models, including AudioSep and diffusion-based methods.

Key performance metrics include:

  • Fréchet Audio Distance (FAD): FlowSep achieved lower FAD scores across all datasets, indicating enhanced perceptual quality (the metric is defined below).
  • CLAPScore and CLAPScore_A: Reflecting improved alignment with the textual query (CLAPScore) and with the ground-truth audio (CLAPScore_A), FlowSep's scores significantly surpass those of baseline models (see the sketch after this list).
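
Both CLAP-based metrics reduce to cosine similarities in a shared text-audio embedding space. Below is a minimal sketch; `embed_text` and `embed_audio` are hypothetical callables wrapping a pre-trained CLAP encoder, not any specific library's API.

```python
import torch.nn.functional as F

def clap_metrics(embed_text, embed_audio, separated, reference, query):
    """Sketch of CLAPScore (text-audio) and CLAPScore_A (audio-audio)."""
    e_text = embed_text(query)       # CLAP embedding of the language query
    e_sep = embed_audio(separated)   # CLAP embedding of the separated audio
    e_ref = embed_audio(reference)   # CLAP embedding of the ground truth
    clap_score = F.cosine_similarity(e_text, e_sep, dim=-1)   # query alignment
    clap_score_a = F.cosine_similarity(e_sep, e_ref, dim=-1)  # audio fidelity
    return clap_score, clap_score_a
```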

Subjective and objective evaluations further confirm FlowSep's capabilities, particularly in real-world and zero-shot scenarios.
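
For reference, FAD is the Fréchet distance between Gaussians fitted to audio-embedding statistics of the reference set (mean $\mu_r$, covariance $\Sigma_r$) and the generated set ($\mu_g$, $\Sigma_g$); lower is better:

$$\mathrm{FAD} = \lVert \mu_r - \mu_g \rVert_2^2 + \mathrm{Tr}\left(\Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2}\right)$$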

Theoretical and Practical Implications

FlowSep's use of RFM introduces a new paradigm in generative models for sound separation. Its linear flow matching approach offers both computational simplicity and theoretical elegance. Practically, this model demonstrates robustness in diverse and dynamic acoustic environments, making it suitable for applications such as multimedia content retrieval, automatic audio editing, and audio-augmented listening.
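
That computational simplicity shows up directly at inference: because the learned trajectories are near-straight, plain Euler integration of the velocity field suffices. A minimal sampler sketch follows; the step count is illustrative, and the paper's exact sampler settings may differ.

```python
import torch

@torch.no_grad()
def sample_latent(velocity_model, text_emb, latent_shape, num_steps=25):
    """Euler integration of the flow from noise (t=0) to a source latent (t=1)."""
    z = torch.randn(latent_shape)                    # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((latent_shape[0],), i * dt)   # current flow time
        z = z + dt * velocity_model(z, t, text_emb)  # one Euler step
    return z  # decode with the VAE decoder, then synthesize with BigVGAN
```

Fewer steps trade quality for speed, which is the lever behind the reported efficiency advantage over the diffusion-based baseline.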

Future Directions

The promising results of FlowSep open several avenues for future research:

  • Scalability and Efficiency: Further optimization of the RFM's efficiency could facilitate real-time applications.
  • Generalization to Other Modalities: Extending the RFM-based framework to incorporate visual and other sensory data could enhance multimodal learning systems.
  • Enhanced Text-Audio Alignment: Improving the text query processing, possibly through integration with contemporary LLMs, could refine the system's performance.

Conclusion

FlowSep marks a substantial advance in language-queried sound separation. Its application of RFM within a generative framework establishes a promising direction for future research, improving both separation quality and inference efficiency. The published results support its potential for real-world deployment and for broader multimodal learning systems.
