ChatMusician: Understanding and Generating Music Intrinsically with LLM (2402.16153v1)

Published 25 Feb 2024 in cs.SD, cs.AI, cs.CL, cs.LG, cs.MM, and eess.AS

Abstract: While LLMs demonstrate impressive capabilities in text generation, we find that their ability has yet to be generalized to music, humanity's creative language. We introduce ChatMusician, an open-source LLM that integrates intrinsic musical abilities. It is based on continual pre-training and finetuning of LLaMA2 on ABC notation, a text-compatible music representation, with music treated as a second language. ChatMusician can understand and generate music with a pure text tokenizer, without any external multi-modal neural structures or tokenizers. Interestingly, endowing the model with musical abilities does not harm its language abilities; it even achieves a slightly higher MMLU score. Our model is capable of composing well-structured, full-length music conditioned on texts, chords, melodies, motifs, musical forms, etc., surpassing the GPT-4 baseline. On our meticulously curated college-level music understanding benchmark, MusicTheoryBench, ChatMusician surpasses LLaMA2 and GPT-3.5 in the zero-shot setting by a noticeable margin. Our work reveals that LLMs can be an excellent compressor for music, but there remains significant territory to be conquered. We release our 4B-token music-language corpus MusicPile, the collected MusicTheoryBench, and our code, model, and demo on GitHub.

Authors (35)
  1. Ruibin Yuan
  2. Hanfeng Lin
  3. Yi Wang
  4. Zeyue Tian
  5. Shangda Wu
  6. Tianhao Shen
  7. Ge Zhang
  8. Yuhang Wu
  9. Cong Liu
  10. Ziya Zhou
  11. Ziyang Ma
  12. Liumeng Xue
  13. Ziyu Wang
  14. Qin Liu
  15. Tianyu Zheng
  16. Yizhi Li
  17. Yinghao Ma
  18. Yiming Liang
  19. Xiaowei Chi
  20. Ruibo Liu
Citations (25)

Summary

  • The paper introduces ChatMusician, a novel LLM that treats music as a second language using ABC notation.
  • It is trained by continual pre-training and finetuning on the curated MusicPile corpus, and introduces MusicTheoryBench to evaluate college-level music understanding.
  • Empirical results demonstrate superior performance in both compositional diversity and structured musical understanding compared to established baselines.

Integrating Musical Creativity and Understanding into LLMs with ChatMusician

Overview

ChatMusician introduces an innovative approach to incorporating intrinsic musical abilities into LLMs, enabling them to understand and generate music using ABC notation, a text-compatible music representation. By treating music as a "second language", this open-source LLM can generate coherent and structured musical pieces conditioned on various musical elements. Notably, ChatMusician demonstrates enhanced performance in both music generation tasks and music understanding benchmarks, without compromising its language capabilities.
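
Because the model works with a pure text tokenizer, interacting with it looks like prompting any causal LLM. The sketch below illustrates this, assuming the released checkpoint loads through HuggingFace Transformers; the model id and prompt wording are illustrative assumptions, not details confirmed by this summary:

    # Minimal sketch: prompting ChatMusician for conditioned music generation.
    # The checkpoint id and prompt below are assumptions for illustration.
    from transformers import AutoModelForCausalLM, AutoTokenizer

    model_id = "m-a-p/ChatMusician"  # assumed HuggingFace id
    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(model_id)

    prompt = (
        "Compose a short folk tune in D major in ABC notation, "
        "following the chord progression D-G-A-D."
    )
    inputs = tokenizer(prompt, return_tensors="pt")
    output = model.generate(**inputs, max_new_tokens=512, do_sample=True, top_p=0.9)
    print(tokenizer.decode(output[0], skip_special_tokens=True))

Since generation is ordinary next-token decoding over text, the same interface serves both understanding (answering questions in prose) and generation (emitting ABC notation).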

Challenges in Music Generation and Understanding

Music, with its inherent structure and complexity, poses unique challenges for LLMs, particularly in capturing long-range dependencies and the intricate relationships between musical elements. The paper addresses these challenges by continually pre-training and finetuning the LLM on a specially curated music-language corpus, MusicPile, and by introducing the novel MusicTheoryBench for evaluating music understanding.

ABC Notation as a Solution

ABC notation offers several advantages as a musical representation, including a high compression rate and intrinsic encoding of musical repetition and structure. Because it is plain text, ChatMusician can process and generate music within an ordinary LLM, without requiring additional multi-modal structures or tokenizers.
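
For concreteness, here is a small illustrative tune in ABC notation, written for this summary rather than taken from the paper. Header fields carry metadata (X index, M meter, L unit note length, K key), the body is ordinary text, and repeat barlines |: ... :| encode structural repetition directly in the token stream:

    X:1
    T:Illustrative Tune
    M:6/8
    L:1/8
    K:D
    |: A | d2f a2f | g2e c2A | d2f a2f | e3 e2A |
       d2f a2f | g2e c2e | f2d e2c | d3 d2 :|

A few dozen characters thus represent a repeated eight-bar phrase, which is exactly the compression and structure-encoding property that makes the format attractive for LLM training.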

Empirical Evaluations

Empirical evidence demonstrates ChatMusician's superior ability to compose music across various styles and structures, outperforming established baselines such as GPT-4. Additionally, the model excels on MusicTheoryBench, surpassing LLaMA2 and GPT-3.5 in the zero-shot setting and showcasing an understanding of music beyond the conventional capabilities of current LLMs. These results are further supported by human evaluation studies and specific metrics designed to assess musicality and controllability within the generated compositions.
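
The summary does not spell out the benchmark's exact format, but a zero-shot evaluation of the kind reported can be sketched as follows; the multiple-choice schema and the model_answer callable are assumptions made for illustration:

    # Hypothetical zero-shot scorer for a multiple-choice benchmark such as
    # MusicTheoryBench. The 'question'/'choices'/'answer' schema is assumed.
    def zero_shot_accuracy(model_answer, examples):
        """Score over dicts with 'question', 'choices', 'answer' keys.

        model_answer: callable mapping a prompt string to a letter, e.g. "B".
        """
        examples = list(examples)
        correct = 0
        for ex in examples:
            # Render the question and lettered choices as a single prompt.
            prompt = ex["question"] + "\n" + "\n".join(
                f"{letter}. {choice}"
                for letter, choice in zip("ABCD", ex["choices"])
            )
            prediction = model_answer(prompt)
            correct += prediction.strip().upper().startswith(ex["answer"])
        return correct / len(examples)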

Contributions to AI and Music

ChatMusician represents a significant advancement in the fusion of artificial intelligence and music, highlighting the potential for LLMs to serve as tools for creative expression and musicological analysis. The release of the MusicPile corpus, MusicTheoryBench, and the ChatMusician model itself provides a valuable resource for the research community, fostering further exploration into the capabilities of LLMs in understanding and generating music.

Practical and Theoretical Implications

From a practical standpoint, ChatMusician offers a scalable solution for music generation tasks, potentially contributing to various applications in music composition, education, and entertainment. Theoretically, this work enhances our understanding of the parallels between language and music processing in LLMs, supporting the idea that music can be treated as a form of language within these models.

Future Directions

While ChatMusician marks a substantial step forward, its current iteration shows a bias toward generating Irish-style music and struggles with open-ended music generation tasks. Future work aims to diversify the model's capabilities, address hallucinations and memorization effects, and develop strategies for mitigating copyright concerns associated with generated music.

Ethical Considerations

The ethical implications of employing ChatMusician, particularly concerning copyright infringement and the potential for misleading users, are acknowledged. The development of detection algorithms for music plagiarism and further alignment strategies are highlighted as future measures to address these concerns.

Conclusion

ChatMusician illustrates the promising conjunction of AI and music through the lens of LLMs, offering a novel framework for music understanding and generation. The integration of intrinsic musical capabilities within LLMs, as demonstrated by ChatMusician, paves the way for exploring the creative and analytical potentials of AI in the field of music.
