Multi-Source Music Generation with Latent Diffusion

Published 10 Sep 2024 in eess.AS, cs.LG, and cs.SD (arXiv:2409.06190v4)

Abstract: Most music generation models directly generate a single music mixture. To allow for more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) was proposed to model music as a mixture of multiple instrumental sources (e.g., piano, drums, bass, and guitar). Its goal is to use a single diffusion model to generate mutually coherent music sources that are then mixed to form the music. Despite its capabilities, MSDM is unable to generate music with rich melodies and often produces empty sounds. Its waveform-domain diffusion also introduces significant Gaussian noise artifacts that compromise audio quality. In response, we introduce the Multi-Source Latent Diffusion Model (MSLDM), which employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a VAE on all music sources, we efficiently capture each source's unique characteristics in a "source latent." The source latents are concatenated, and our diffusion model learns this joint latent space. This approach significantly enhances both total and partial generation of music by leveraging the VAE's latent compression and noise robustness; the compressed source latents also make generation more efficient. Subjective listening tests and Fréchet Audio Distance (FAD) scores confirm that our model outperforms MSDM, demonstrating its stronger practical applicability in music generation systems. We also show that modeling individual sources is more effective than directly modeling the music mixture. Code and models are available at https://github.com/XZWY/MSLDM. Demos are available at https://xzwy.github.io/MSLDMDemo/.
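The pipeline the abstract describes can be made concrete with a short sketch: a shared VAE encodes each stem into a "source latent," the K source latents are concatenated along the channel dimension, a diffusion model denoises that joint latent, and the sampled latents are decoded and summed into a mixture. The following PyTorch code is a minimal illustration under assumed toy shapes and modules; all names, dimensions, and the placeholder sampling loop are assumptions for illustration, not the authors' implementation (see the linked repository for that).

```python
# Minimal, illustrative sketch of the MSLDM pipeline (not the authors' code).
# Assumed: K sources, a shared VAE applied per source, and a toy joint denoiser.
import torch
import torch.nn as nn

K = 4             # instrumental sources: e.g., piano, drums, bass, guitar
LATENT_CH = 8     # assumed channels of one "source latent"
LATENT_LEN = 256  # assumed temporal length of the latent sequence
WAVE_LEN = 16384  # assumed waveform length per stem


class SourceVAE(nn.Module):
    """Stand-in for the shared waveform VAE, applied independently per source."""

    def __init__(self) -> None:
        super().__init__()
        stride = WAVE_LEN // LATENT_LEN  # 64x temporal compression
        self.enc = nn.Conv1d(1, LATENT_CH, kernel_size=stride, stride=stride)
        self.dec = nn.ConvTranspose1d(LATENT_CH, 1, kernel_size=stride, stride=stride)

    def encode(self, wav: torch.Tensor) -> torch.Tensor:   # (B, 1, T) -> (B, C, L)
        return self.enc(wav)

    def decode(self, z: torch.Tensor) -> torch.Tensor:     # (B, C, L) -> (B, 1, T)
        return self.dec(z)


class JointDenoiser(nn.Module):
    """Toy denoiser over the concatenated source latents (B, K*C, L)."""

    def __init__(self) -> None:
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(K * LATENT_CH, 64, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv1d(64, K * LATENT_CH, kernel_size=3, padding=1),
        )

    def forward(self, z_noisy: torch.Tensor) -> torch.Tensor:
        # A real model would also condition on the noise level
        # (e.g., EDM-style preconditioning); omitted for brevity.
        return self.net(z_noisy)


vae, denoiser = SourceVAE(), JointDenoiser()

# Training view: encode each stem, then concatenate into one joint latent.
stems = torch.randn(2, K, 1, WAVE_LEN)                             # dummy stems
z = torch.cat([vae.encode(stems[:, k]) for k in range(K)], dim=1)  # (B, K*C, L)
# The diffusion model would be trained to denoise z at random noise levels.

# Generation view: start from Gaussian noise in the joint latent space,
# run a (placeholder) iterative denoising loop, then split, decode, and mix.
z_t = torch.randn(1, K * LATENT_CH, LATENT_LEN)
for _ in range(10):                                # placeholder for a real sampler
    z_t = denoiser(z_t)
sources = z_t.chunk(K, dim=1)                      # K latents of shape (1, C, L)
mixture = sum(vae.decode(z_k) for z_k in sources)  # mix = sum of decoded stems
```

Under this reading, the partial generation the abstract mentions (e.g., producing a missing stem given the others) would operate in the same joint latent space by fixing the known source latents and denoising only the remaining channels; the actual training objective, sampler, and VAE architecture are specified in the linked repository.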
