Arrange, Inpaint, and Refine: Steerable Long-term Music Audio Generation and Editing via Content-based Controls (2402.09508v3)

Published 14 Feb 2024 in cs.SD, cs.AI, and eess.AS

Abstract: Controllable music generation plays a vital role in human-AI music co-creation. While LLMs have shown promise in generating high-quality music, their focus on autoregressive generation limits their utility in music editing tasks. To address this gap, we propose a novel approach leveraging a parameter-efficient heterogeneous adapter combined with a masking training scheme. This approach enables autoregressive LLMs to seamlessly address music inpainting tasks. Additionally, our method integrates frame-level content-based controls, facilitating track-conditioned music refinement and score-conditioned music arrangement. We apply this method to fine-tune MusicGen, a leading autoregressive music generation model. Our experiments demonstrate promising results across multiple music editing tasks, offering more flexible controls for future AI-driven music editing tools. The source codes and a demo page showcasing our work are available at https://kikyo-16.github.io/AIR.
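
The abstract describes a masking training scheme that lets an autoregressive token model fill in a missing span of music while conditioning on frame-level content controls. The sketch below is a minimal illustration of one common way to cast span inpainting as an autoregressive task (a fill-in-the-middle-style rearrangement over toy acoustic tokens, with a 12-dimensional chroma-like control as the frame-level condition); it is an assumption-laden toy, not the paper's implementation, and all names (`TinyCausalLM`, `MASK_ID`, `build_infill_example`) are hypothetical. In the paper itself, the conditioning is injected into a pretrained MusicGen backbone through a parameter-efficient heterogeneous adapter rather than by training a model from scratch.

```python
# Toy sketch (assumed fill-in-the-middle-style scheme, not the paper's exact method):
# the span to regenerate is replaced by a MASK token in the context, and the original
# span tokens are appended after a separator, so a causal LM can predict the span
# while attending to both left and right context plus frame-level controls.
import torch
import torch.nn as nn
import torch.nn.functional as F

VOCAB = 2048                # toy acoustic-token vocabulary size (assumed)
MASK_ID, SEP_ID = 2046, 2047

class TinyCausalLM(nn.Module):
    """Toy decoder-only LM; stands in for a large pretrained model such as MusicGen."""
    def __init__(self, d_model=256, n_heads=4, n_layers=2):
        super().__init__()
        self.emb = nn.Embedding(VOCAB, d_model)
        self.ctrl = nn.Linear(12, d_model)   # frame-level control, e.g. chroma (assumed)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.body = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, VOCAB)

    def forward(self, tokens, controls):
        L = tokens.size(1)
        causal = torch.triu(torch.full((L, L), float("-inf")), diagonal=1)
        x = self.emb(tokens) + self.ctrl(controls)
        return self.head(self.body(x, mask=causal))

def build_infill_example(tokens, controls, start, end):
    """Build [prefix, MASK, suffix, SEP, span]; the appended span is the target."""
    prefix, span, suffix = tokens[:, :start], tokens[:, start:end], tokens[:, end:]
    mask_col = torch.full((tokens.size(0), 1), MASK_ID, dtype=torch.long)
    sep_col = torch.full((tokens.size(0), 1), SEP_ID, dtype=torch.long)
    inp = torch.cat([prefix, mask_col, suffix, sep_col, span], dim=1)
    # align the control frames with the rearranged token layout
    ctrl = torch.cat([controls[:, :start], controls[:, start:start + 1],
                      controls[:, end:], controls[:, end - 1:end],
                      controls[:, start:end]], dim=1)
    target_slice = slice(inp.size(1) - span.size(1), inp.size(1))
    return inp, ctrl, span, target_slice

model = TinyCausalLM()
tokens = torch.randint(0, MASK_ID, (2, 200))     # toy acoustic token sequences
controls = torch.rand(2, 200, 12)                # toy frame-level controls
inp, ctrl, span, tgt = build_infill_example(tokens, controls, 80, 120)
logits = model(inp, ctrl)
# next-token loss only on the appended span (logits at position i predict token i+1)
loss = F.cross_entropy(logits[:, tgt.start - 1:tgt.stop - 1].reshape(-1, VOCAB),
                       span.reshape(-1))
loss.backward()
```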

Authors (4)
  1. Liwei Lin (14 papers)
  2. Gus Xia (57 papers)
  3. Yixiao Zhang (44 papers)
  4. Junyan Jiang (12 papers)
Citations (8)