ByteComposer: a Human-like Melody Composition Method based on Language Model Agent (2402.17785v2)
Abstract: Large Language Models (LLMs) have shown encouraging progress in multimodal understanding and generation tasks. However, how to design a human-aligned and interpretable melody composition system is still under-explored. To solve this problem, we propose ByteComposer, an agent framework emulating a human's creative pipeline in four separate steps: "Conception Analysis - Draft Composition - Self-Evaluation and Modification - Aesthetic Selection". This framework seamlessly blends the interactive and knowledge-understanding features of LLMs with existing symbolic music generation models, thereby achieving a melody composition agent comparable to human creators. We conduct extensive experiments on GPT4 and several open-source LLMs, which substantiate our framework's effectiveness. Furthermore, professional music composers were engaged in multi-dimensional evaluations. The final results demonstrate that, across various facets of music composition, the ByteComposer agent attains the level of a novice melody composer.
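The four-step pipeline named in the abstract can be pictured with a minimal sketch. This is an illustration under stated assumptions, not the authors' implementation: every function name, the `Draft` type, the keyword-based analysis, the leap-based evaluation, and the pitch-variety selection rule are hypothetical stand-ins for the LLM calls and symbolic music generation models the abstract refers to.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Draft:
    notes: List[int]      # MIDI pitch numbers; a toy stand-in for a symbolic melody
    score: float = 0.0    # self-evaluation score (higher is better)


def conception_analysis(request: str) -> dict:
    """Step 1 (hypothetical): map a text request to musical attributes."""
    mode = "minor" if "sad" in request.lower() else "major"
    return {"mode": mode, "length": 8}


def draft_composition(plan: dict, seed: int) -> Draft:
    """Step 2 (hypothetical): stand-in for a symbolic music generation model."""
    minor = [0, 2, 3, 5, 7, 8, 10]
    major = [0, 2, 4, 5, 7, 9, 11]
    scale = minor if plan["mode"] == "minor" else major
    notes = [60 + scale[(seed + 3 * i) % len(scale)] for i in range(plan["length"])]
    return Draft(notes)


def self_evaluate(draft: Draft) -> float:
    """Step 3a (hypothetical): penalise leaps larger than a perfect fifth."""
    leaps = [abs(b - a) for a, b in zip(draft.notes, draft.notes[1:])]
    return float(-sum(leap for leap in leaps if leap > 7))


def modify(draft: Draft) -> Draft:
    """Step 3b (hypothetical): shift notes by octaves to remove large leaps."""
    notes = [draft.notes[0]]
    for n in draft.notes[1:]:
        while n - notes[-1] > 7:
            n -= 12
        while notes[-1] - n > 7:
            n += 12
        notes.append(n)
    return Draft(notes)


def aesthetic_selection(candidates: List[Draft]) -> Draft:
    """Step 4 (hypothetical): prefer the candidate with the most distinct pitches."""
    return max(candidates, key=lambda d: len(set(d.notes)))


def compose(request: str, n_candidates: int = 3) -> Draft:
    """Run the four steps in order over several candidate drafts."""
    plan = conception_analysis(request)
    candidates = []
    for seed in range(n_candidates):
        draft = draft_composition(plan, seed)
        if self_evaluate(draft) < 0:      # revise only when the evaluation flags a problem
            draft = modify(draft)
        draft.score = self_evaluate(draft)
        candidates.append(draft)
    return aesthetic_selection(candidates)


if __name__ == "__main__":
    print(compose("a sad evening tune").notes)
```

In a real agent each stand-in would presumably be replaced by an LLM prompt or a call to a trained generation model; the sketch only shows how the stages could hand results to one another.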