
Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond (2505.04621v1)

Published 7 May 2025 in cs.SD, cs.AI, cs.LG, cs.MM, and eess.AS

Abstract: We introduce Audio-SDS, a generalization of Score Distillation Sampling (SDS) to text-conditioned audio diffusion models. While SDS was initially designed for text-to-3D generation using image diffusion, its core idea of distilling a powerful generative prior into a separate parametric representation extends to the audio domain. Leveraging a single pretrained model, Audio-SDS enables a broad range of tasks without requiring specialized datasets. In particular, we demonstrate how Audio-SDS can guide physically informed impact sound simulations, calibrate FM-synthesis parameters, and perform prompt-specified source separation. Our findings illustrate the versatility of distillation-based methods across modalities and establish a robust foundation for future work using generative priors in audio tasks.

Summary

The paper "Score Distillation Sampling for Audio: Source Separation, Synthesis, and Beyond" extends Score Distillation Sampling (SDS) to the audio domain, focusing on text-conditioned audio diffusion models. The work aligns with the broader trend of leveraging diffusion models for complex generative tasks while addressing a gap in audio modeling relative to the progress seen in the image domain.

The core technique, SDS, was originally developed for text-to-3D generation using image diffusion models. Its essence lies in distilling a powerful generative prior into a separate, parametrically optimized representation. The authors adapt this approach to audio, enabling a range of tasks with a single pretrained model. In particular, Audio-SDS handles tasks that would typically require specialized datasets, such as guiding physically informed impact-sound simulations, calibrating FM-synthesis parameters, and performing prompt-specified source separation.
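
To make the "separate, parametrically optimized representation" concrete, the sketch below implements a minimal differentiable FM synthesizer of the kind such a distillation objective can optimize; the specific parameters (carrier frequency, modulator frequency, modulation index, amplitude) and their initial values are illustrative assumptions rather than the paper's exact synthesizer.

```python
import torch

class FMSynth(torch.nn.Module):
    """Minimal differentiable FM synthesizer: a carrier sinusoid whose phase is
    modulated by a second sinusoid. All parameters are learnable, so gradients
    from a distillation loss can flow back into them."""

    def __init__(self, sample_rate=44100, duration=1.0):
        super().__init__()
        self.register_buffer("t", torch.arange(int(sample_rate * duration)) / sample_rate)
        # Illustrative learnable parameters (log-scale keeps frequencies positive).
        self.log_carrier_freq = torch.nn.Parameter(torch.log(torch.tensor(440.0)))
        self.log_mod_freq = torch.nn.Parameter(torch.log(torch.tensor(110.0)))
        self.mod_index = torch.nn.Parameter(torch.tensor(2.0))
        self.amplitude = torch.nn.Parameter(torch.tensor(0.5))

    def forward(self):
        fc = self.log_carrier_freq.exp()
        fm = self.log_mod_freq.exp()
        phase = 2 * torch.pi * fc * self.t + self.mod_index * torch.sin(2 * torch.pi * fm * self.t)
        return self.amplitude * torch.sin(phase)  # mono waveform, shape (num_samples,)
```

Because the renderer is differentiable end to end, gradients of any audio-space objective propagate to this handful of synthesis parameters.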

Methodology and Contributions

The authors propose Audio-SDS, which generalizes SDS to text-conditioned audio diffusion models. The framework encodes the rendered audio signal, perturbs it with noise, and uses the pretrained diffusion model to predict that noise. The key step is optimizing the audio parameterization via backpropagation so that it aligns with the text-conditioned distribution learned by the diffusion model.
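
A minimal sketch of one such update step appears below. It assumes a latent, text-conditioned audio diffusion model exposing hypothetical encode(), predict_noise(), and alpha_bar() interfaces and uses the standard SDS gradient with classifier-free guidance; the paper's exact model interface, noise schedule, and weighting may differ.

```python
import torch

def audio_sds_step(renderer, diffusion, text_embedding, optimizer,
                   t_min=0.02, t_max=0.98, guidance_scale=7.5):
    """One Audio-SDS update: render audio from the current parameters, encode it,
    perturb the latent with noise at a random timestep, and nudge the parameters
    toward the frozen diffusion model's text-conditioned prediction.

    `renderer` and the `diffusion.*` calls are hypothetical interfaces standing in
    for a differentiable audio renderer and a pretrained latent audio diffusion model.
    """
    audio = renderer()                          # differentiable render of current parameters
    z = diffusion.encode(audio)                 # latent representation of the rendered audio

    t = torch.empty(1).uniform_(t_min, t_max)   # random diffusion timestep
    alpha_bar = diffusion.alpha_bar(t)          # cumulative noise schedule at that timestep
    eps = torch.randn_like(z)
    z_t = alpha_bar.sqrt() * z + (1 - alpha_bar).sqrt() * eps  # noised latent

    with torch.no_grad():                       # the pretrained prior stays frozen
        eps_cond = diffusion.predict_noise(z_t, t, text_embedding)
        eps_uncond = diffusion.predict_noise(z_t, t, None)
        eps_hat = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    # Standard SDS surrogate loss: (eps_hat - eps) is treated as a constant direction,
    # so backpropagation only flows through z into the renderer's parameters.
    grad = (eps_hat - eps).detach()
    loss = (grad * z).sum()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```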

The paper highlights three main contributions:

  1. Unified Framework for Diverse Audio Tasks: Audio-SDS employs a single pretrained model, obviating the need for task-specific datasets, and the same model handles varied tasks such as synthesis and editing.
  2. SDS Improvements for Audio Models: The paper introduces several modifications that improve stability and fidelity when SDS is applied to audio, including bypassing encoder instabilities, multistep denoising, and emphasizing multiscale spectrogram structure in the updates (see the sketch after this list).
  3. Applications in Concrete Audio Tasks: Demonstrations span FM-synthesis tuning, impact-sound parameterization, and prompt-specified source separation, illustrating the method's applicability without extensive retraining.
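
As an illustration of the multiscale spectrogram emphasis, the sketch below compares two waveforms at several STFT resolutions; the FFT sizes, the L1 magnitude distance, and how such terms are weighted inside the paper's SDS update are assumptions made for illustration.

```python
import torch

def multiscale_spectrogram_loss(pred, target, fft_sizes=(256, 1024, 4096)):
    """Compare two waveforms at several STFT resolutions so that updates are
    driven by time-frequency structure across scales rather than raw samples.
    FFT sizes and the L1 magnitude distance are illustrative choices."""
    loss = 0.0
    for n_fft in fft_sizes:
        hop = n_fft // 4
        window = torch.hann_window(n_fft, device=pred.device)
        spec_p = torch.stft(pred, n_fft, hop_length=hop, window=window,
                            return_complex=True).abs()
        spec_t = torch.stft(target, n_fft, hop_length=hop, window=window,
                            return_complex=True).abs()
        loss = loss + (spec_p - spec_t).abs().mean()
    return loss / len(fft_sizes)
```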

Numerical Results

Quantitatively, the method improves prompt alignment as measured by CLAP scores while maintaining signal-reconstruction quality, particularly in source-separation settings. The FM- and impact-sound synthesis pipelines likewise show clear gains in prompt concordance, indicating successful audio-to-prompt alignment.
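
The prompt-alignment metric can be read as a cosine similarity between joint audio-text embeddings. The sketch below illustrates this with hypothetical embed_audio and embed_text helpers standing in for a pretrained CLAP model's encoders; it is not the paper's evaluation code.

```python
import torch
import torch.nn.functional as F

def clap_style_alignment(audio_embedding: torch.Tensor,
                         text_embedding: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between an audio embedding and a text-prompt embedding
    from a joint audio-text model (CLAP-style); higher means better alignment."""
    return F.cosine_similarity(audio_embedding, text_embedding, dim=-1)

# Hypothetical usage: embed_audio / embed_text stand in for CLAP encoders, and
# separated_source is one output of the prompt-specified separation task.
# score = clap_style_alignment(embed_audio(separated_source), embed_text("drum hits"))
```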

Practical and Theoretical Implications

Practically, the Audio-SDS model extends the use of pretrained generative models beyond vision, addressing applications in sound design, interactive environments, and music production. Its capacity to undertake diverse tasks using a unified framework positions it as a versatile tool in real-world applications where specialized datasets might be lacking.

Theoretically, adapting SDS to audio paves the way for cross-modal generative modeling, suggesting the potential to integrate multiple modalities in a single model. The approach could be further refined to handle more complex audio tasks, potentially in interactive or real-time systems, mirroring developments in the visual and textual modalities.

Future Directions

Looking forward, integrating the method with more advanced audio diffusion models and exploring joint audio-video diffusion could further enhance its generative capabilities. The stability improvements and parameter-tuning strategies discussed here provide a foundation for broader exploration in multimodal contexts.

In conclusion, this paper contributes meaningfully to the ongoing evolution of generative models, highlighting the underexplored potential of audio diffusion. By extending SDS to audio, it introduces a framework with broad implications for both theoretical exploration and practical deployment in audio-centric applications.
