Symbolic Music Generation with Diffusion Models: An Expert Overview
The paper "Symbolic Music Generation with Diffusion Models" presents an innovative approach to generating high-quality symbolic music using diffusion probabilistic models. This research addresses the limitations of applying diffusion models, which are traditionally used for continuous data, to discrete symbolic music generation. The authors propose a method that leverages a pre-trained variational autoencoder (VAE) to bridge the gap between discrete and continuous domains, thus enabling the use of diffusion models for music generation.
Key Contributions and Findings
The research introduces a methodology for training Denoising Diffusion Probabilistic Models (DDPMs) on symbolic music data. The diffusion models are trained on continuous latent embeddings produced by a VAE that has been pre-trained on symbolic music sequences. This yields a non-autoregressive generation process that iteratively refines entire sequences in latent space: all positions are generated in parallel, and the number of refinement steps is fixed rather than growing with sequence length. The main contributions and findings of this work include:
- High-Quality Unconditional Sampling: The paper demonstrates the ability of DDPMs to unconditionally generate coherent and high-quality melodic sequences. These sequences consist of 1024 tokens, showcasing the model's capability to maintain consistency over long sequences when refining VAE latents.
- Comparison with Autoregressive Baselines: The DDPMs outperform autoregressive latent models such as TransformerMDN, in part because they avoid teacher forcing and the resulting exposure bias, both inherent challenges in autoregressive training.
- Post-Hoc Conditional Infilling: The model supports creative tasks like infilling, where a missing segment of a sequence is generated conditionally without retraining the model. This is achieved by iteratively refining latent embeddings while the surrounding context guides each sampling step (a sketch of one way to implement this appears after this list).
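To make the infilling idea concrete, the sketch below implements replacement-based conditioning, a common technique for post-hoc infilling with diffusion models: at every reverse step, the known context positions are overwritten with appropriately noised ground-truth latents, so the model's predictions for the missing region stay consistent with the context. The noise schedule, step count, and the `eps_model` stub are illustrative assumptions, not the authors' exact implementation.

```python
import numpy as np

# Stub for a trained noise-prediction network eps_model(z_t, t) -> predicted noise.
# A real model would be the trained DDPM; zeros keep the sketch self-contained.
def eps_model(z_t, t):
    return np.zeros_like(z_t)

T = 1000                                    # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)          # linear noise schedule (assumed)
alphas = 1.0 - betas
alpha_bars = np.cumprod(alphas)

def infill(z_known, mask, rng=np.random.default_rng(0)):
    """Sample a latent sequence, fixing positions where mask == 1 to z_known.

    z_known: (seq_len, latent_dim) ground-truth VAE latents for the context.
    mask:    (seq_len, 1) array, 1.0 where context is fixed, 0.0 where we infill.
    """
    z = rng.standard_normal(z_known.shape)   # start from pure Gaussian noise
    for t in reversed(range(T)):
        # Standard DDPM ancestral sampling step.
        eps_hat = eps_model(z, t)
        mean = (z - betas[t] / np.sqrt(1.0 - alpha_bars[t]) * eps_hat) / np.sqrt(alphas[t])
        z = mean + (np.sqrt(betas[t]) * rng.standard_normal(z.shape) if t > 0 else 0.0)
        # Replacement trick: re-noise the known context to the current noise
        # level and overwrite those positions, so context and infilled region
        # always share the same noise level.
        if t > 0:
            eps = rng.standard_normal(z.shape)
            z_ctx = np.sqrt(alpha_bars[t - 1]) * z_known + np.sqrt(1.0 - alpha_bars[t - 1]) * eps
            z = mask * z_ctx + (1.0 - mask) * z
    return mask * z_known + (1.0 - mask) * z  # return the exact context at the end
```

In use, `mask` could fix, say, the opening and closing segments of a melody while leaving the middle for the model to infill; decoding the resulting latents through the VAE yields the completed sequence.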
Technical Insights
DDPMs applied to this task operate by adding Gaussian noise to the latent embeddings through a forward diffusion process, then learning a reverse process that iteratively refines noisy embeddings back into coherent data. Training follows a fixed noise schedule and uses a simple squared-L2 objective inspired by denoising score matching: the network learns to predict the noise that was added at a randomly chosen step, as sketched below.
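The following sketch shows this training objective on a batch of VAE latents. The linear schedule, step count, and function names are assumptions for illustration; the closed-form expression for the noised latent follows directly from the Gaussian forward process.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000                                   # number of diffusion steps (assumed)
betas = np.linspace(1e-4, 0.02, T)         # linear noise schedule (assumed)
alpha_bars = np.cumprod(1.0 - betas)       # cumulative signal-retention factors

def ddpm_loss(eps_model, z0):
    """Simple squared-L2 DDPM objective on a batch of clean latents z0: (batch, dim).

    Uses the closed-form forward process
        z_t = sqrt(alpha_bar_t) * z0 + sqrt(1 - alpha_bar_t) * eps,  eps ~ N(0, I),
    and trains eps_model to recover eps from (z_t, t).
    """
    t = rng.integers(0, T, size=z0.shape[0])           # random step per example
    eps = rng.standard_normal(z0.shape)                # the noise to be predicted
    ab = alpha_bars[t][:, None]
    z_t = np.sqrt(ab) * z0 + np.sqrt(1.0 - ab) * eps
    return np.mean((eps - eps_model(z_t, t)) ** 2)

# Example with a stub model (a real denoiser would be a trained network):
loss = ddpm_loss(lambda z_t, t: np.zeros_like(z_t), rng.standard_normal((8, 16)))
```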
The VAE is a critical component: it encodes symbolic music into a continuous latent space, thereby enabling the application of diffusion models, which are inherently suited to continuous domains. The authors also apply latent space trimming to focus on the informative dimensions of the latent normal distribution, avoiding the noise amplification that can occur in high-dimensional latent spaces where many dimensions carry little information.
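A minimal sketch of one plausible trimming criterion appears below: rank dimensions by their average KL divergence to the standard-normal prior (a collapsed, uninformative dimension has near-zero KL) and keep the top K. The exact criterion and the value of K here are assumptions; the paper's procedure may differ in detail.

```python
import numpy as np

def trim_latents(mu, logvar, keep):
    """Select the `keep` most informative latent dimensions.

    mu, logvar: (num_examples, latent_dim) posterior parameters from the
    pre-trained VAE. A dimension whose posterior is ~N(0, 1) for every example
    carries no information; its average KL to the prior is near zero.
    """
    kl_per_dim = 0.5 * np.mean(mu**2 + np.exp(logvar) - logvar - 1.0, axis=0)
    keep_idx = np.argsort(kl_per_dim)[-keep:]          # highest-KL dimensions
    return np.sort(keep_idx)                           # indices to retain

# Example: posteriors for 1000 examples in a 512-dim latent space (toy data).
rng = np.random.default_rng(0)
mu = rng.standard_normal((1000, 512)) * np.linspace(0, 2, 512)  # later dims "active"
logvar = np.zeros((1000, 512))
dims = trim_latents(mu, logvar, keep=64)
```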
Evaluation and Results
The authors employ a framewise self-similarity metric to quantitatively compare generated music against human-composed sequences. This is supplemented by latent-space evaluations using the Fréchet distance and Maximum Mean Discrepancy (MMD), which check that the distribution of generated embeddings is statistically close to that of real data.
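As a concrete reference, the sketch below computes the Fréchet distance between Gaussians fit to two sets of embeddings, the standard formulation also used for FID-style metrics; the variable names and toy data are illustrative.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(x, y):
    """Fréchet distance between Gaussians fit to two embedding sets (n, dim).

    d^2 = ||mu_x - mu_y||^2 + Tr(C_x + C_y - 2 (C_x C_y)^(1/2))
    """
    mu_x, mu_y = x.mean(axis=0), y.mean(axis=0)
    cov_x = np.cov(x, rowvar=False)
    cov_y = np.cov(y, rowvar=False)
    covmean = sqrtm(cov_x @ cov_y)
    if np.iscomplexobj(covmean):     # sqrtm can return tiny imaginary parts
        covmean = covmean.real       # due to numerical error; discard them
    diff = mu_x - mu_y
    return float(diff @ diff + np.trace(cov_x + cov_y - 2.0 * covmean))

# Example: compare generated vs. real latent embeddings (toy data).
rng = np.random.default_rng(0)
real = rng.standard_normal((500, 16))
fake = rng.standard_normal((500, 16)) + 0.1
print(frechet_distance(real, fake))
```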
Results indicate that the diffusion model's outputs closely match the training data in pitch- and duration-based consistency and variance statistics, outperforming both the autoregressive baseline and latent-interpolation baselines. Furthermore, the model's iterative refinement proves effective at improving the quality and structure of the generated music across successive denoising steps.
Implications and Future Directions
This research has significant implications for the field of AI-based music generation. By demonstrating the effectiveness of diffusion models on a symbolic music task, it opens new avenues for applying such generative models beyond traditional continuous data domains. The successful integration with a VAE suggests potential applicability to other discrete data types, such as natural language and other symbolic domains.
Future research could explore extending this methodology to other creative and structured data domains, potentially integrating more advanced diffusion processes or optimizing the VAE for better latent space representation and sampling. Additionally, further exploration of post-hoc conditioning techniques could enhance the creative applications of such models.
In summary, this paper provides a robust framework for symbolic music generation using diffusion models, highlighting their potential in handling sequential and discrete data with applications extending well beyond music.