
Multi-Source Music Generation with Latent Diffusion (2409.06190v4)

Published 10 Sep 2024 in eess.AS, cs.LG, and cs.SD

Abstract: Most music generation models directly generate a single music mixture. To allow for more flexible and controllable generation, the Multi-Source Diffusion Model (MSDM) was proposed to model music as a mixture of multiple instrumental sources (e.g., piano, drums, bass, and guitar). Its goal is to use a single diffusion model to generate mutually coherent music sources that are then mixed to form the music. Despite its capabilities, MSDM is unable to generate music with rich melodies and often generates empty sounds. Its waveform-domain diffusion approach also introduces significant Gaussian noise artifacts that compromise audio quality. In response, we introduce a Multi-Source Latent Diffusion Model (MSLDM) that employs Variational Autoencoders (VAEs) to encode each instrumental source into a distinct latent representation. By training a single VAE on all music sources, we efficiently capture each source's unique characteristics in a "source latent." The source latents are concatenated, and our diffusion model learns this joint latent space. This approach significantly enhances both total and partial generation of music by leveraging the VAE's latent compression and noise robustness. The compressed source latents also enable more efficient generation. Subjective listening tests and Fréchet Audio Distance (FAD) scores confirm that our model outperforms MSDM, demonstrating its practical applicability in music generation systems. We also emphasize that modeling sources is more effective than directly modeling the music mixture. Code and models are available at https://github.com/XZWY/MSLDM. Demos are available at https://xzwy.github.io/MSLDMDemo/.
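To make the joint-latent idea concrete, below is a minimal PyTorch sketch of the pipeline the abstract describes: a shared VAE encodes each stem into a "source latent," the latents are concatenated along the channel axis, and a single denoiser is trained over the joint latent space. All module names, layer sizes, and the denoising objective here are illustrative assumptions rather than the authors' actual architecture; the real implementation is at https://github.com/XZWY/MSLDM.

```python
# Illustrative sketch of MSLDM's joint-latent setup. Every module, size, and
# objective below is an assumption for exposition, not the paper's code.

import torch
import torch.nn as nn

NUM_SOURCES = 4   # piano, drums, bass, guitar
LATENT_CH = 16    # channels per source latent (assumed)

class SourceVAE(nn.Module):
    """Toy 1-D VAE shared across all sources (one VAE trained on all stems)."""
    def __init__(self, latent_ch=LATENT_CH):
        super().__init__()
        # Encoder downsamples the waveform and predicts mean / log-variance.
        self.enc = nn.Conv1d(1, 2 * latent_ch, kernel_size=8, stride=4, padding=2)
        self.dec = nn.ConvTranspose1d(latent_ch, 1, kernel_size=8, stride=4, padding=2)

    def encode(self, x):
        mu, logvar = self.enc(x).chunk(2, dim=1)
        return mu + torch.randn_like(mu) * (0.5 * logvar).exp()  # reparameterize

    def decode(self, z):
        return self.dec(z)

class JointDenoiser(nn.Module):
    """Toy denoiser operating on the concatenated source latents."""
    def __init__(self, ch=NUM_SOURCES * LATENT_CH):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv1d(ch + 1, 128, 3, padding=1), nn.SiLU(),
            nn.Conv1d(128, ch, 3, padding=1),
        )

    def forward(self, z_noisy, sigma):
        # Condition on the noise level by broadcasting it as an extra channel.
        sig = sigma.view(-1, 1, 1).expand(-1, 1, z_noisy.shape[-1])
        return self.net(torch.cat([z_noisy, sig], dim=1))

vae = SourceVAE()
denoiser = JointDenoiser()

# sources: (batch, num_sources, samples) of mono stems
sources = torch.randn(2, NUM_SOURCES, 4096)

# Encode every stem independently with the shared VAE, then concatenate the
# per-source latents along the channel axis to form the joint latent.
z = torch.cat([vae.encode(sources[:, i:i+1]) for i in range(NUM_SOURCES)], dim=1)

# One denoising-score-matching style training step on the joint latent.
sigma = torch.rand(z.shape[0])                       # random noise level
z_noisy = z + sigma.view(-1, 1, 1) * torch.randn_like(z)
loss = ((denoiser(z_noisy, sigma) - z) ** 2).mean()  # predict the clean latent
loss.backward()

# At sampling time, reverse diffusion over the joint latent yields all source
# latents at once (z reused here as a stand-in for a sampled latent); chunking
# along the channel axis and decoding each piece gives mutually coherent stems,
# which are summed to form the final mixture.
z_hat = z.detach().chunk(NUM_SOURCES, dim=1)
mixture = sum(vae.decode(zi) for zi in z_hat).squeeze(1)
```

Because all source latents are denoised jointly by one model, inter-instrument dependencies are captured during generation, which is what enables both total generation and partial (source-conditioned) generation from the same network.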

