Generalized Multi-Source Inference for Text Conditioned Music Diffusion Models (2403.11706v1)
Abstract: Multi-Source Diffusion Models (MSDM) allow for compositional musical generation tasks: generating a set of coherent sources, creating accompaniments, and performing source separation. Despite their versatility, they require estimating the joint distribution over the sources, necessitating pre-separated musical data, which is rarely available, and fixing the number and type of sources at training time. This paper generalizes MSDM to arbitrary time-domain diffusion models conditioned on text embeddings. These models do not require separated data as they are trained on mixtures, can parameterize an arbitrary number of sources, and allow for rich semantic control. We propose an inference procedure enabling the coherent generation of sources and accompaniments. Additionally, we adapt the Dirac separator of MSDM to perform source separation. We experiment with diffusion models trained on Slakh2100 and MTG-Jamendo, showcasing competitive generation and separation results in a relaxed data setting.
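To make the separation idea more concrete, here is a minimal sketch of the mixture constraint that a Dirac-style separator relies on, as we understand it from the MSDM formulation: one source is not sampled freely but is set to the mixture minus the sum of the others, so the estimates always add up to the observed mixture. This is an illustration, not the paper's implementation. The function name `constrained_source` and the toy signals are hypothetical, and in an actual sampler the free sources would be noisy estimates updated inside the diffusion loop rather than clean signals.

```python
from typing import List, Sequence


def constrained_source(
    mixture: Sequence[float],
    free_sources: Sequence[Sequence[float]],
) -> List[float]:
    """Return the source fixed by the constraint mixture = sum of all sources.

    Hypothetical helper: the constrained source is the mixture minus the
    sample-wise sum of the freely estimated sources.
    """
    for source in free_sources:
        if len(source) != len(mixture):
            raise ValueError("every source must have the same length as the mixture")
    return [
        y_t - sum(source[t] for source in free_sources)
        for t, y_t in enumerate(mixture)
    ]


# Toy time-domain signals (assumed values, chosen to be exact in floating point).
bass = [0.5, -0.25, 0.0, 0.25]
drums = [0.25, 0.25, -0.5, 0.0]
piano = [0.0, 0.5, 0.25, -0.25]

# The observed mixture is the sample-wise sum of the sources.
mixture = [b + d + p for b, d, p in zip(bass, drums, piano)]

# Treat bass and drums as the free estimates; the piano is then determined.
piano_hat = constrained_source(mixture, [bass, drums])

# By construction, the estimates sum back to the mixture.
resummed = [b + d + p for b, d, p in zip(bass, drums, piano_hat)]
```

In this toy case the free sources are exact, so the constrained source matches the true one. With imperfect estimates, the constrained source absorbs the residual error, which is why the sources chosen as free and the one chosen as constrained matter in practice.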