- The paper introduces FlowSep, which applies Rectified Flow Matching within a generative framework to enhance language-queried sound separation.
- It combines a FLAN-T5 text encoder, VAE latent compression, and a BigVGAN vocoder to reduce separation artifacts and improve reconstruction quality.
- Experiments on 1,680 hours of audio show that FlowSep achieves lower FAD scores and higher CLAPScores compared to state-of-the-art approaches.
FlowSep: Language-Queried Sound Separation with Rectified Flow Matching
Overview
This paper introduces FlowSep, a language-queried audio source separation system. FlowSep applies Rectified Flow Matching (RFM) within a generative architecture to address limitations of the discriminative models traditionally used for Language-Queried Audio Source Separation (LASS). The methodology, experimental results, and competitive performance metrics point to both theoretical and practical value.
Methodology
The proposed FlowSep model combines several key components:
- FLAN-T5 Encoder: A pre-trained FLAN-T5 encoder converts textual queries into embeddings. This choice follows findings in related audio-generation work that text-only encoders such as FLAN-T5 can outperform alternatives like CLAP for conditioning.
- VAE Encoder and Decoder: A variational autoencoder (VAE) compresses mel-spectrograms into a compact latent space and decodes generated latents back into mel-spectrograms, giving the flow model a lower-dimensional representation to operate in.
- RFM-Based Latent Feature Generator: The core of FlowSep's contribution is using RFM to model near-linear paths from Gaussian noise to the target source's features within the VAE's latent space, conditioned on the text embedding. This generative approach reduces artifacts typically associated with discriminative models, such as spectral holes and incomplete separation.
- BigVGAN Vocoder: A state-of-the-art GAN-based vocoder (BigVGAN) is employed to generate high-quality waveforms from the reconstructed mel-spectrograms.
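The RFM objective described above can be sketched numerically: sample a random time, linearly interpolate between a noise sample and the target latent, and regress the model's velocity prediction toward the constant displacement along that line. A minimal numpy sketch, where `velocity_model` is a zero-output placeholder and the 8-dimensional "latent" is illustrative rather than the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

def rfm_training_pair(x1, rng):
    """Build one rectified-flow training example for a target latent x1.

    The path is the straight line x_t = (1 - t) * x0 + t * x1 from
    Gaussian noise x0 to the data point x1; the regression target is
    the constant velocity x1 - x0 along that line.
    """
    x0 = rng.standard_normal(x1.shape)       # noise sample
    t = rng.uniform()                        # random time in [0, 1]
    x_t = (1.0 - t) * x0 + t * x1            # point on the linear path
    v_target = x1 - x0                       # ground-truth velocity
    return x_t, t, v_target

# Toy "latent" target and a placeholder model that always predicts zeros.
x1 = rng.standard_normal(8)
x_t, t, v_target = rfm_training_pair(x1, rng)

def velocity_model(x_t, t):
    return np.zeros_like(x_t)                # stand-in for the real network

# Standard flow-matching regression loss (mean squared error).
loss = np.mean((velocity_model(x_t, t) - v_target) ** 2)
```

In the full system, `velocity_model` would be a conditional network that also receives the FLAN-T5 text embedding; the loss structure is unchanged.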
Experimental Results
FlowSep was trained on an extensive dataset of 1,680 hours of audio. Evaluation across five benchmarks demonstrates superior performance compared to state-of-the-art models, including AudioSep and diffusion-based methods.
Key performance metrics include:
- Frechet Audio Distance (FAD): FlowSep achieved lower FAD scores across all datasets, indicating enhanced perceptual quality.
- CLAPScore and CLAPScoreA: These measure alignment of the separated audio with, respectively, the text query and the ground-truth audio; FlowSep surpasses baseline models on both.
Subjective and objective evaluations also emphasize FlowSep's capabilities, particularly in real-world and zero-shot scenarios.
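FAD, the headline metric above, is the Fréchet distance between Gaussians fitted to embedding statistics of reference and generated audio. A self-contained sketch of that underlying computation (the random feature arrays are stand-ins for real audio embeddings, and the matrix square root uses a symmetric eigendecomposition rather than a specific FAD toolkit):

```python
import numpy as np

def sqrtm_psd(m):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    w, v = np.linalg.eigh(m)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def frechet_distance(feats_a, feats_b):
    """Frechet distance between Gaussians fitted to two feature sets.

    d^2 = ||mu_a - mu_b||^2 + Tr(C_a) + Tr(C_b) - 2 * Tr(sqrt(C_b^1/2 C_a C_b^1/2))
    """
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    sb = sqrtm_psd(cov_b)
    trace_covmean = np.trace(sqrtm_psd(sb @ cov_a @ sb))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b)
                 - 2.0 * trace_covmean)

rng = np.random.default_rng(1)
ref = rng.standard_normal((200, 4))          # stand-in reference embeddings
gen = rng.standard_normal((200, 4)) + 0.5    # shifted "generated" embeddings
d_same = frechet_distance(ref, ref)          # ~0 for identical sets
d_diff = frechet_distance(ref, gen)          # grows with distribution gap
```

A lower FAD thus indicates that the separated audio's embedding distribution sits closer to that of the reference recordings.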
Theoretical and Practical Implications
FlowSep's use of RFM introduces a new paradigm in generative models for sound separation. Its linear flow matching approach offers both computational simplicity and theoretical elegance. Practically, this model demonstrates robustness in diverse and dynamic acoustic environments, making it suitable for applications such as multimedia content retrieval, automatic audio editing, and audio-augmented listening.
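The computational simplicity mentioned above comes from the near-straight trajectories: inference can integrate the learned ODE with a coarse fixed-step Euler solver. A minimal sketch using a hypothetical analytic velocity field whose true paths are exactly straight lines toward a fixed target (in FlowSep this field would be the trained conditional network):

```python
import numpy as np

def euler_sample(velocity_fn, x0, num_steps=4):
    """Integrate dx/dt = v(x, t) from t=0 to t=1 with fixed Euler steps."""
    x = x0.copy()
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)       # one explicit Euler step
    return x

target = np.array([1.0, -2.0, 0.5])          # illustrative "latent" target

def straight_line_velocity(x, t):
    # For a perfectly rectified flow toward `target`, the velocity is the
    # remaining displacement divided by the remaining time.
    return (target - x) / (1.0 - t)

x0 = np.zeros(3)                             # start from the origin as "noise"
x1 = euler_sample(straight_line_velocity, x0, num_steps=4)
```

Because the trajectories are straight, the Euler solver lands exactly on the target here even with very few steps; for a learned model the paths are only approximately straight, but the same few-step behavior is what makes rectified flows attractive for low-latency inference.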
Future Directions
The promising results of FlowSep open several avenues for future research:
- Scalability and Efficiency: Further optimization of the RFM's efficiency could facilitate real-time applications.
- Generalization to Other Modalities: Extending the RFM-based framework to incorporate visual and other sensory data could enhance multimodal learning systems.
- Enhanced Text-Audio Alignment: Improving the text query processing, possibly through integration with contemporary LLMs, could refine the system's performance.
Conclusion
FlowSep marks a substantial advance in language-queried sound separation. Its application of RFM within a generative framework establishes a promising direction for future research, emphasizing improved separation quality and efficiency. The published results support its potential for real-world deployment and contribute to the broader discussion of effective multimodal learning systems.