Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion (2402.14285v4)

Published 22 Feb 2024 in cs.SD, cs.LG, and eess.AS

Abstract: We study the problem of symbolic music generation (e.g., generating piano rolls), with a technical focus on non-differentiable rule guidance. Musical rules are often expressed symbolically over note characteristics, such as note density or chord progression; many of these rules are non-differentiable, which poses a challenge for guided diffusion. We propose Stochastic Control Guidance (SCG), a novel guidance method that requires only forward evaluation of the rule functions and works with pre-trained diffusion models in a plug-and-play way, achieving training-free guidance for non-differentiable rules for the first time. Additionally, we introduce a latent diffusion architecture for symbolic music generation with high time resolution, which can be composed with SCG in a plug-and-play fashion. Compared to strong standard baselines in symbolic music generation, this framework demonstrates marked advancements in music quality and rule-based controllability, outperforming current state-of-the-art generators in a variety of settings. For detailed demonstrations, code, and model checkpoints, please visit our project website: https://scg-rule-guided-music.github.io/.

Authors (9)
  1. Yujia Huang
  2. Adishree Ghatare
  3. Yuanzhe Liu
  4. Ziniu Hu
  5. Qinsheng Zhang
  6. Chandramouli S Sastry
  7. Siddharth Gururani
  8. Sageev Oore
  9. Yisong Yue
Citations (10)

Summary

  • The paper introduces Stochastic Control Guidance (SCG), an algorithm that efficiently integrates non-differentiable musical rules into pre-trained diffusion models.
  • The methodology employs path integral control theory to derive the optimal control analytically, without backpropagation through the rule functions.
  • A latent diffusion architecture provides high temporal resolution and outperforms existing methods in symbolic music generation.
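The path-integral view summarized above can be written schematically (the notation here is ours, not the paper's): with $\ell$ a rule loss and $\lambda$ a temperature, the guided sampler reweights the pre-trained reverse process by rule compliance,

```latex
p^{*}(x_{0} \mid x_{t}) \;\propto\; p_{\theta}(x_{0} \mid x_{t})\,\exp\!\bigl(-\ell(x_{0})/\lambda\bigr),
```

so the optimal control is realized by importance-weighting forward samples of the model rather than by differentiating $\ell$.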

Advancing Symbolic Music Generation with Stochastic Control Guided Diffusion Models

Introduction

Symbolic music generation has witnessed a significant surge in research interest, underscored by the rapid evolution in generative models. This paper presents a novel approach to symbolic music generation, targeting the challenge of integrating non-differentiable and complex musical rules into the generation process. The proposed Stochastic Control Guidance (SCG) method enables the seamless incorporation of such rules into pre-trained diffusion models without necessitating additional training. This advancement facilitates a plug-and-play mechanism, providing a flexible and intuitive means for composers to influence the music generation process directly through rule-based controls. The introduction of a latent diffusion architecture further enhances the model's capability to generate symbolic music with high temporal resolution, setting new benchmarks in music quality and rule-based controllability.

Related Works

The literature on symbolic music generation predominantly spans two methodologies: MIDI token-based and piano roll-based approaches, each with inherent limitations related to rule integration and controllability. Recent developments in diffusion models have shown promise in image, audio, and video generation, inspiring approaches for symbolic music generation. However, guiding these models with non-differentiable symbolic music rules remains a challenge, largely due to the non-differentiability of many musical rules and the black-box nature of APIs used to evaluate rule compliance.

Stochastic Control Guidance

The SCG algorithm, rooted in stochastic control theory, addresses the challenge of rule guidance in generative models. By viewing the problem through the lens of optimal control within a stochastic dynamical system, the SCG algorithm efficiently steers the generation process towards samples that adhere to specified music rules. The methodology employs path integral control theory to derive an analytical form of optimal control, which is then implemented in an efficient manner compatible with diffusion models. This approach does not require backpropagation through the rule functions, making it suitable for non-differentiable rules.
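A heavily simplified sketch of this sampling scheme follows. The toy denoiser, the note-density rule, and all names here are illustrative assumptions, not the paper's implementation: at each reverse step, several candidate next states are drawn, each is scored by a forward evaluation of the rule function, and the best-scoring candidate is kept.

```python
import numpy as np

def rule_loss(x, target_density=0.3):
    # Hypothetical non-differentiable rule: the note density of a
    # binarized piano roll should match a target value.
    density = (x > 0.5).mean()
    return abs(density - target_density)

def scg_step(x_t, denoise_fn, sigma, rule_fn, n_candidates=8, rng=None):
    """One guided reverse step: draw several candidate next states,
    score each with the (non-differentiable) rule function, and keep
    the best one. Only forward evaluations of rule_fn are needed."""
    rng = np.random.default_rng() if rng is None else rng
    mean = denoise_fn(x_t)  # model's denoised estimate of the next state
    candidates = [mean + sigma * rng.standard_normal(x_t.shape)
                  for _ in range(n_candidates)]
    losses = [rule_fn(c) for c in candidates]  # forward evaluations only
    return candidates[int(np.argmin(losses))]

# Toy demo: the "denoiser" just shrinks values toward 0.5; a real run
# would use a trained diffusion model's reverse transition instead.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
for sigma in np.linspace(1.0, 0.1, 10):
    x = scg_step(x, lambda z: 0.5 * z + 0.25, sigma, rule_loss, rng=rng)
# The final sample's rule_loss is typically much lower than unguided.
```

Because candidate selection only compares scalar losses, the same loop works unchanged when the rule is evaluated by a black-box API.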

Latent Diffusion Architecture

To complement the SCG method, a latent diffusion architecture is introduced, which excels in generating rich and dynamic musical pieces with fine temporal granularity. The architecture leverages the power of transformers within a latent space to model complex musical structures, achieving state-of-the-art performance across various music generation tasks.
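The two-stage pipeline can be sketched with stand-in shapes; the pooling "encoder" and nearest-neighbour "decoder" below are placeholders for the paper's learned VAE, and the diffusion transformer would operate on the compact latent `z`.

```python
import numpy as np

rng = np.random.default_rng(0)

# A piano roll at fine time resolution: 128 pitches x 1024 time frames.
piano_roll = (rng.random((128, 1024)) > 0.9).astype(np.float32)

def encode(x):
    # Stand-in encoder: 8x temporal pooling, then 8x pitch pooling,
    # yielding a 16 x 128 latent grid.
    pooled_t = x.reshape(128, 128, 8).mean(-1)   # (128, 128)
    return pooled_t.reshape(16, 8, 128).mean(1)  # (16, 128)

def decode(z):
    # Stand-in decoder: nearest-neighbour upsampling back to roll shape.
    return np.repeat(np.repeat(z, 8, axis=0), 8, axis=1)

z = encode(piano_roll)  # diffusion (e.g. a transformer) runs in this space
recon = decode(z)       # shapes: z is (16, 128), recon is (128, 1024)
```

Running diffusion on the small latent rather than the full roll is what makes high temporal resolution tractable.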

Experimental Results

Comparative analyses with existing symbolic music generation methods underscore the effectiveness of the proposed framework. The model demonstrates superior adherence to musical rules, including non-differentiable ones, outperforming current generative models. Additionally, the flexibility of the SCG method is showcased in tasks requiring composite rule guidance and music editing, further illustrating its potential as a tool for musical creativity.

Conclusion and Future Directions

The integration of stochastic control theory into symbolic music generation represents a significant step forward in the field. This research not only addresses existing challenges in rule-based guidance and controllability but also opens avenues for future work on improving computational efficiency and exploring novel applications within the field of creative AI. The SCG method, together with the latent diffusion architecture, holds the promise of revolutionizing how we approach the task of generating symbolic music, paving the way for more intuitive and expressive compositional tools.

Acknowledgements

The development of this innovative approach to symbolic music generation was supported by various funding sources, including AeroVironment, NSF #1918655, a Caltech CDSF Postdoctoral Fellowship, the Canadian Institute for Advanced Research (CIFAR), and NSERC. This collaborative effort highlights the cross-disciplinary nature of research in artificial intelligence and music, driving forward the boundaries of what's possible in the field of generative models.
