ChatMusician: Understanding and Generating Music Intrinsically with LLM (2402.16153v1)
Abstract: While LLMs demonstrate impressive capabilities in text generation, we find that this ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM with intrinsic musical abilities. It is built by continually pre-training and finetuning LLaMA2 on a text-compatible music representation, ABC notation, treating music as a second language. ChatMusician can understand and generate music with a pure text tokenizer, without any external multi-modal neural structures or tokenizers. Interestingly, endowing musical abilities does not harm language abilities; the model even achieves a slightly higher MMLU score. Our model can compose well-structured, full-length music conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 in the zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but significant territory remains to be conquered. We release our 4B-token music-language corpus MusicPile, the collected MusicTheoryBench, code, model, and demo on GitHub.
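Because ABC notation is plain ASCII text, a score can flow through an ordinary text tokenizer with no audio codec or multi-modal adapter. The sketch below is purely illustrative: the tune and the naive regex tokenizer are our own assumptions, not ChatMusician's actual tokenizer (which is LLaMA2's standard text tokenizer).

```python
import re

# A short tune in ABC notation -- ordinary ASCII text, no special encoding.
abc_tune = """X:1
T:Example Tune
M:4/4
K:C
C D E F | G A B c | c B A G | F E D C |"""

def naive_tokenize(text: str) -> list[str]:
    """Split ABC text into header fields, bar lines, and note symbols.

    Illustrative only: a real LLM tokenizer (e.g. BPE) would subword-split
    the same string without any music-specific rules.
    """
    return re.findall(r"[A-Za-z]:[^\n]*|\||[A-Ga-gz][,']*\d*/?\d*|\S", text)

tokens = naive_tokenize(abc_tune)
print(tokens[:6])  # header fields followed by the first note symbols
```

The point is that no step above needs to know the input is music: headers, bar lines, and notes are just character spans, which is why a text-only LLM can model them directly.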
Authors: Ruibin Yuan, Hanfeng Lin, Yi Wang, Zeyue Tian, Shangda Wu, Tianhao Shen, Ge Zhang, Yuhang Wu, Cong Liu, Ziya Zhou, Ziyang Ma, Liumeng Xue, Ziyu Wang, Qin Liu, Tianyu Zheng, Yizhi Li, Yinghao Ma, Yiming Liang, Xiaowei Chi, Ruibo Liu