Symbolic Music Generation with Non-Differentiable Rule Guided Diffusion (2402.14285v4)

Published 22 Feb 2024 in cs.SD, cs.LG, and eess.AS

Abstract: We study the problem of symbolic music generation (e.g., generating piano rolls), with a technical focus on non-differentiable rule guidance. Musical rules are often expressed symbolically over note characteristics, such as note density or chord progression; many of these rules are non-differentiable, which poses a challenge for guided diffusion. We propose Stochastic Control Guidance (SCG), a novel guidance method that requires only forward evaluation of the rule functions and works with pre-trained diffusion models in a plug-and-play way, achieving training-free guidance for non-differentiable rules for the first time. Additionally, we introduce a latent diffusion architecture for symbolic music generation with high time resolution, which can be composed with SCG in a plug-and-play fashion. Compared to strong standard baselines in symbolic music generation, this framework demonstrates marked advancements in music quality and rule-based controllability, outperforming current state-of-the-art generators in a variety of settings. For detailed demonstrations, code, and model checkpoints, please visit our project website: https://scg-rule-guided-music.github.io/.

Authors (9)
  1. Yujia Huang
  2. Adishree Ghatare
  3. Yuanzhe Liu
  4. Ziniu Hu
  5. Qinsheng Zhang
  6. Chandramouli S Sastry
  7. Siddharth Gururani
  8. Sageev Oore
  9. Yisong Yue
Citations (10)

Summary

  • The paper introduces Stochastic Control Guidance (SCG), an algorithm that efficiently integrates non-differentiable musical rules into pre-trained diffusion models.
  • The methodology employs path integral control theory to derive the optimal control analytically, without backpropagation through the rule functions.
  • A latent diffusion architecture provides high temporal resolution and outperforms existing methods in symbolic music generation.
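The path-integral view summarized above can be written schematically (the notation here is ours, not the paper's): with $\ell$ a rule loss and $\lambda$ a temperature, the guided sampler reweights the pre-trained reverse process by rule compliance,

```latex
p^{*}(x_{0} \mid x_{t}) \;\propto\; p_{\theta}(x_{0} \mid x_{t})\,\exp\!\bigl(-\ell(x_{0})/\lambda\bigr),
```

so the optimal control is realized by importance-weighting forward samples of the model rather than by differentiating $\ell$.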

Advancing Symbolic Music Generation with Stochastic Control Guided Diffusion Models

Introduction

Symbolic music generation has witnessed a significant surge in research interest, underscored by the rapid evolution in generative models. This paper presents a novel approach to symbolic music generation, targeting the challenge of integrating non-differentiable and complex musical rules into the generation process. The proposed Stochastic Control Guidance (SCG) method enables the seamless incorporation of such rules into pre-trained diffusion models without necessitating additional training. This advancement facilitates a plug-and-play mechanism, providing a flexible and intuitive means for composers to influence the music generation process directly through rule-based controls. The introduction of a latent diffusion architecture further enhances the model's capability to generate symbolic music with high temporal resolution, setting new benchmarks in music quality and rule-based controllability.

Related Works

The literature on symbolic music generation predominantly spans two methodologies: MIDI token-based and piano roll-based approaches, each with inherent limitations related to rule integration and controllability. Recent developments in diffusion models have shown promise in image, audio, and video generation, inspiring approaches for symbolic music generation. However, guiding these models with non-differentiable symbolic music rules remains a challenge, largely due to the non-differentiability of many musical rules and the black-box nature of APIs used to evaluate rule compliance.

Stochastic Control Guidance

The SCG algorithm, rooted in stochastic control theory, addresses the challenge of rule guidance in generative models. By viewing the problem through the lens of optimal control within a stochastic dynamical system, the SCG algorithm efficiently steers the generation process towards samples that adhere to specified music rules. The methodology employs path integral control theory to derive an analytical form of optimal control, which is then implemented in an efficient manner compatible with diffusion models. This approach does not require backpropagation through the rule functions, making it suitable for non-differentiable rules.
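A heavily simplified sketch of this sampling scheme follows. The toy denoiser, the note-density rule, and all names here are illustrative assumptions, not the paper's implementation: at each reverse step, several candidate next states are drawn, each is scored by a forward evaluation of the rule function, and the best-scoring candidate is kept.

```python
import numpy as np

def rule_loss(x, target_density=0.3):
    # Hypothetical non-differentiable rule: the note density of a
    # binarized piano roll should match a target value.
    density = (x > 0.5).mean()
    return abs(density - target_density)

def scg_step(x_t, denoise_fn, sigma, rule_fn, n_candidates=8, rng=None):
    """One guided reverse step: draw several candidate next states,
    score each with the (non-differentiable) rule function, and keep
    the best one. Only forward evaluations of rule_fn are needed."""
    rng = np.random.default_rng() if rng is None else rng
    mean = denoise_fn(x_t)  # model's denoised estimate of the next state
    candidates = [mean + sigma * rng.standard_normal(x_t.shape)
                  for _ in range(n_candidates)]
    losses = [rule_fn(c) for c in candidates]  # forward evaluations only
    return candidates[int(np.argmin(losses))]

# Toy demo: the "denoiser" just shrinks values toward 0.5; a real run
# would use a trained diffusion model's reverse transition instead.
rng = np.random.default_rng(0)
x = rng.standard_normal((16, 16))
for sigma in np.linspace(1.0, 0.1, 10):
    x = scg_step(x, lambda z: 0.5 * z + 0.25, sigma, rule_loss, rng=rng)
# The final sample's rule_loss is typically much lower than unguided.
```

Because candidate selection only compares scalar losses, the same loop works unchanged when the rule is evaluated by a black-box API.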

Latent Diffusion Architecture

To complement the SCG method, a latent diffusion architecture is introduced, which excels in generating rich and dynamic musical pieces with fine temporal granularity. The architecture leverages the power of transformers within a latent space to model complex musical structures, achieving state-of-the-art performance across various music generation tasks.
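The two-stage pipeline can be sketched with stand-in shapes; the pooling "encoder" and nearest-neighbour "decoder" below are placeholders for the paper's learned VAE, and the diffusion transformer would operate on the compact latent `z`.

```python
import numpy as np

rng = np.random.default_rng(0)

# A piano roll at fine time resolution: 128 pitches x 1024 time frames.
piano_roll = (rng.random((128, 1024)) > 0.9).astype(np.float32)

def encode(x):
    # Stand-in encoder: 8x temporal pooling, then 8x pitch pooling,
    # yielding a 16 x 128 latent grid.
    pooled_t = x.reshape(128, 128, 8).mean(-1)   # (128, 128)
    return pooled_t.reshape(16, 8, 128).mean(1)  # (16, 128)

def decode(z):
    # Stand-in decoder: nearest-neighbour upsampling back to roll shape.
    return np.repeat(np.repeat(z, 8, axis=0), 8, axis=1)

z = encode(piano_roll)  # diffusion (e.g. a transformer) runs in this space
recon = decode(z)       # shapes: z is (16, 128), recon is (128, 1024)
```

Running diffusion on the small latent rather than the full roll is what makes high temporal resolution tractable.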

Experimental Results

Comparative analyses with existing symbolic music generation methods underscore the effectiveness of the proposed framework. The model demonstrates superior adherence to musical rules, including non-differentiable ones, outperforming current generative models. Additionally, the flexibility of the SCG method is showcased in tasks requiring composite rule guidance and music editing, further illustrating its potential as a tool for musical creativity.

Conclusion and Future Directions

The integration of stochastic control theory into symbolic music generation represents a significant step forward in the field. This research not only addresses existing challenges in rule-based guidance and controllability but also opens avenues for future work on improving computational efficiency and exploring novel applications within the field of creative AI. The SCG method, together with the latent diffusion architecture, holds the promise of revolutionizing how we approach the task of generating symbolic music, paving the way for more intuitive and expressive compositional tools.

Acknowledgements

The development of this innovative approach to symbolic music generation was supported by various funding sources, including AeroVironment, NSF #1918655, a Caltech CDSF Postdoctoral Fellowship, the Canadian Institute for Advanced Research (CIFAR), and NSERC. This collaborative effort highlights the cross-disciplinary nature of research in artificial intelligence and music, driving forward the boundaries of what's possible in the field of generative models.
