InstructME: An Instruction Guided Music Edit And Remix Framework with Latent Diffusion Models (2308.14360v3)
Abstract: Music editing primarily involves modifying individual instrument tracks or remixing a piece as a whole, offering a new interpretation of the original work through a series of operations. These processing methods hold great potential across many applications but demand substantial expertise. Prior approaches, although effective for image and general audio editing, falter when applied directly to music: because of music's distinctive data characteristics, such methods can inadvertently compromise its intrinsic harmony and coherence. In this paper, we develop InstructME, an Instruction-guided Music Editing and remixing framework based on latent diffusion models. Our framework fortifies the U-Net with multi-scale aggregation in order to maintain consistency before and after editing. In addition, we introduce a chord progression matrix as conditioning information and incorporate it in the semantic space to improve melodic harmony during editing. To accommodate extended musical pieces, InstructME employs a chunk transformer, enabling it to capture long-term temporal dependencies within music sequences. We evaluated InstructME on instrument editing, remixing, and multi-round editing. Both subjective and objective evaluations indicate that the proposed method significantly surpasses preceding systems in music quality, text relevance, and harmony. Demo samples are available at https://musicedit.github.io/
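
The abstract describes the architecture at a high level rather than giving an algorithm, so the following is only a minimal, hypothetical sketch of how instruction-guided latent-diffusion editing with a chord-progression condition might look in PyTorch. All module names, tensor shapes, and hyperparameters (`ConditionedDenoiser`, `latent_dim`, `n_chord_classes`, `guidance_scale`, the simple concatenation used in place of the paper's multi-scale-aggregation U-Net, and the omission of the chunk transformer) are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch only: shapes, names, and the fusion scheme below are assumptions,
# not the InstructME implementation described in the paper.
import torch
import torch.nn as nn


class ConditionedDenoiser(nn.Module):
    """Toy stand-in for the paper's U-Net: fuses a noisy music latent with an
    instruction embedding and a chord-progression matrix before predicting noise."""

    def __init__(self, latent_dim: int = 64, text_dim: int = 128, n_chord_classes: int = 24):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, latent_dim)          # instruction -> semantic space
        self.chord_proj = nn.Linear(n_chord_classes, latent_dim)  # chord matrix -> semantic space
        self.net = nn.Sequential(
            nn.Linear(latent_dim * 3, 256),
            nn.SiLU(),
            nn.Linear(256, latent_dim),
        )

    def forward(self, z_t, text_emb, chord_matrix):
        # z_t: (B, T, latent_dim), text_emb: (B, text_dim), chord_matrix: (B, T, n_chord_classes)
        text = self.text_proj(text_emb).unsqueeze(1).expand_as(z_t)
        fused = torch.cat([z_t, text, self.chord_proj(chord_matrix)], dim=-1)
        return self.net(fused)  # predicted noise, same shape as z_t


@torch.no_grad()
def guided_denoise_step(model, z_t, t, alphas_cumprod, text_emb, chord_matrix, guidance_scale=3.0):
    """One classifier-free-guidance denoising step (deterministic, DDIM-style update);
    the guidance scale and noise schedule are placeholders."""
    eps_cond = model(z_t, text_emb, chord_matrix)
    eps_uncond = model(z_t, torch.zeros_like(text_emb), chord_matrix)
    eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)

    a_t = alphas_cumprod[t]
    a_prev = alphas_cumprod[max(t - 1, 0)]
    z0_pred = (z_t - (1.0 - a_t).sqrt() * eps) / a_t.sqrt()       # estimate of the clean latent
    return a_prev.sqrt() * z0_pred + (1.0 - a_prev).sqrt() * eps  # latent at step t-1


if __name__ == "__main__":
    model = ConditionedDenoiser()
    z = torch.randn(2, 100, 64)        # noisy music latents (batch, frames, channels)
    text = torch.randn(2, 128)         # pretend embedding of an instruction like "add a piano track"
    chords = torch.randn(2, 100, 24)   # pretend frame-level chord-progression matrix
    alphas = torch.linspace(0.99, 0.01, 50)
    z_prev = guided_denoise_step(model, z, t=10, alphas_cumprod=alphas,
                                 text_emb=text, chord_matrix=chords)
    print(z_prev.shape)                # torch.Size([2, 100, 64])
```

In the actual system the chord condition, multi-scale aggregation, and chunk transformer would live inside the U-Net itself; this sketch only illustrates where the instruction and chord conditioning signals could enter a guided denoising loop.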
Authors: Bing Han, Junyu Dai, Weituo Hao, Xinyan He, Dong Guo, Jitong Chen, Yuxuan Wang, Yanmin Qian, Xuchen Song