Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls (2402.09508v3)
Abstract: Controllable music generation plays a vital role in human-AI music co-creation. While LLMs have shown promise in generating high-quality music, their focus on autoregressive generation limits their utility in music editing tasks. To address this gap, we propose a novel approach that combines a parameter-efficient heterogeneous adapter with a masking training scheme, enabling autoregressive LLMs to address music inpainting tasks. In addition, our method integrates frame-level content-based controls, which facilitate track-conditioned music refinement and score-conditioned music arrangement. We apply this method to fine-tune MusicGen, a leading autoregressive music generation model. Our experiments demonstrate promising results across multiple music editing tasks, offering more flexible controls for future AI-driven music editing tools. The source code and a demo page showcasing our work are available at https://kikyo-16.github.io/AIR.
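To make the idea of a masking scheme for inpainting concrete, the following is a minimal sketch of span masking over a sequence of discrete audio tokens. It is an illustration only and is not taken from the paper: the `MASK` sentinel, the function names `mask_span` and `random_mask`, and the flat single-codebook token list are all assumptions made for this example.

```python
import random
from typing import List, Tuple

MASK = -1  # hypothetical sentinel id marking masked frames (assumption)


def mask_span(tokens: List[int], start: int, length: int) -> Tuple[List[int], List[int]]:
    """Hide tokens[start:start+length] behind MASK.

    Returns (masked_sequence, target_span). A model trained on such pairs
    would see the surrounding context and learn to predict the hidden span.
    """
    if start < 0 or length <= 0 or start + length > len(tokens):
        raise ValueError("span must lie inside the sequence")
    end = start + length
    masked = tokens[:start] + [MASK] * length + tokens[end:]
    target = tokens[start:end]
    return masked, target


def random_mask(tokens: List[int], min_len: int, max_len: int,
                rng: random.Random) -> Tuple[List[int], List[int]]:
    """Pick a random span length and position, then mask that span."""
    length = rng.randint(min_len, min(max_len, len(tokens)))
    start = rng.randint(0, len(tokens) - length)
    return mask_span(tokens, start, length)


if __name__ == "__main__":
    frames = [10, 11, 12, 13, 14, 15]
    masked, target = mask_span(frames, start=2, length=3)
    print(masked)  # [10, 11, -1, -1, -1, 15]
    print(target)  # [12, 13, 14]
```

In this sketch the masked sequence keeps its original length, so frame-aligned conditions (for example a per-frame chord or melody label) would stay aligned with the audio tokens. How the actual system arranges context, masks, and targets is described in the paper itself and may differ.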
- Liwei Lin
- Gus Xia
- Yixiao Zhang
- Junyan Jiang