Multi-Source Music Generation with Latent Diffusion
Abstract: Most music generation models directly generate a single music mixture. To allow for more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) was proposed to model music as a mixture of multiple instrumental sources (e.g., piano, drums, bass, and guitar). Its goal is to use a single diffusion model to generate mutually coherent music sources, which are then mixed to form the music. Despite its capabilities, MSDM is unable to generate music with rich melodies and often generates empty sounds. Its waveform-domain diffusion approach also introduces significant Gaussian noise artifacts that compromise audio quality. In response, we introduce the Multi-Source Latent Diffusion Model (MSLDM), which employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a VAE on all music sources, we efficiently capture each source's unique characteristics in a "source latent." The source latents are concatenated, and our diffusion model learns this joint latent space. This approach significantly enhances both total and partial generation of music by leveraging the VAE's latent compression and noise robustness. The compressed source latents also enable more efficient generation. Subjective listening tests and Fréchet Audio Distance (FAD) scores confirm that our model outperforms MSDM, showcasing its practical and enhanced applicability in music generation systems. We also emphasize that modeling sources is more effective than directly modeling the music mixture. Code and models are available at https://github.com/XZWY/MSLDM. Demos are available at https://xzwy.github.io/MSLDMDemo/.
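The joint latent construction described in the abstract can be illustrated with a minimal sketch. Everything here is hypothetical: the shapes, the number of sources, and the `encode_source` stand-in are illustrative assumptions, not the trained VAE or diffusion model from the linked repository.

```python
import numpy as np

# Hypothetical setup: 4 sources (piano, drums, bass, guitar), each
# encoded by a VAE into a latent of shape (latent_channels, latent_frames).
NUM_SOURCES, LATENT_CHANNELS, LATENT_FRAMES = 4, 8, 256

def encode_source(waveform: np.ndarray, rng: np.random.Generator) -> np.ndarray:
    # Stand-in for a trained per-source VAE encoder: it only produces
    # a latent of the right shape for illustration purposes.
    return rng.standard_normal((LATENT_CHANNELS, LATENT_FRAMES))

rng = np.random.default_rng(0)
waveforms = [np.zeros(44100) for _ in range(NUM_SOURCES)]  # placeholder audio

# Encode each instrumental source separately into its "source latent."
source_latents = [encode_source(w, rng) for w in waveforms]

# Concatenate the source latents along the channel axis; the diffusion
# model is then trained on this joint latent space, so denoising one
# sample jointly generates all (mutually coherent) sources at once.
joint_latent = np.concatenate(source_latents, axis=0)
print(joint_latent.shape)  # (32, 256)
```

Generating in this compressed joint space, rather than on raw waveforms, is what gives the efficiency and noise-robustness benefits the abstract claims over MSDM; after sampling, each channel group is decoded by its VAE and the decoded sources are summed into the mixture.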