
MuPT: A Generative Symbolic Music Pretrained Transformer (2404.06393v4)

Published 9 Apr 2024 in cs.SD, cs.AI, and eess.AS

Abstract: In this paper, we explore the application of LLMs to the pre-training of music. While the prevalent use of MIDI in music modeling is well-established, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths, thereby enhancing the model's performance in musical composition. To address the challenges associated with misaligned measures from different tracks during generation, we propose the development of a Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which aims to preserve coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) on model performance. The results indicate a promising direction for future research in music generation, offering extensive resources for community-led research through our open-source contributions.

References (48)
  1. Mae-ast: Masked autoencoding audio spectrogram transformer. Proc. Interspeech, 2022.
  2. wav2vec 2.0: A framework for self-supervised learning of speech representations. Advances in neural information processing systems, 33:12449–12460, 2020.
  3. Data2vec: A general framework for self-supervised learning in speech, vision and language. In International Conference on Machine Learning, pp.  1298–1312. PMLR, 2022.
  4. Generating folk-like music in abc-notation with masked language models. In Proceedings of the International Society for Music Information Retrieval Conference 2023 Late Breaking/Demo. ISMIR, 2023.
  5. Beats: Audio pre-training with acoustic tokenizers. In International Conference on Machine Learning, pp.  5178–5193. PMLR, 2023.
  6. Eat: Self-supervised pre-training with efficient audio transformer. arXiv preprint arXiv:2401.03497, 2024.
  7. Simple and controllable music generation. Advances in Neural Information Processing Systems, 36, 2024.
  8. Singsong: Generating musical accompaniments from singing. arXiv preprint arXiv:2301.12662, 2023.
  9. Multitrack music transformer. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pp.  1–5. IEEE, 2023.
  10. Scaling laws for reward model overoptimization, 2022.
  11. Scaling laws for neural machine translation, 2021.
  12. Scaling laws for autoregressive generative modeling, 2020.
  13. Scaling laws for transfer, 2021.
  14. Training compute-optimal large language models. arXiv preprint arXiv:2203.15556, 2022.
  15. Hubert: Self-supervised speech representation learning by masked prediction of hidden units. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 29:3451–3460, 2021.
  16. Music transformer. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJe4ShAcF7.
  17. Masked autoencoders that listen. Advances in Neural Information Processing Systems, 35:28708–28720, 2022.
  18. Pop music transformer: Beat-based modeling and generation of expressive pop piano compositions. In Proceedings of the 28th ACM international conference on multimedia, pp.  1180–1188, 2020.
  19. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
  20. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
  21. Mert: Acoustic music understanding model with large-scale self-supervised training. arXiv preprint arXiv:2306.00107, 2023.
  22. Melhubert: A simplified hubert on mel spectrograms. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.  1–8. IEEE, 2023.
  23. Visual instruction tuning, 2023.
  24. Musecoco: Generating symbolic music from text. arXiv preprint arXiv:2306.00110, 2023.
  25. On the effectiveness of speech self-supervised learning for music. arXiv preprint arXiv:2307.05161, 2023a.
  26. Mt4ssl: Boosting self-supervised speech representation learning by integrating multiple targets. Proc. Interspeech, 2023b.
  27. Scaling data-constrained language models. Advances in Neural Information Processing Systems, 36, 2024.
  28. OpenAI. Musenet. https://openai.com/blog/musenet/, 2021. Accessed: 2024-01-19.
  29. webmushra—a comprehensive framework for web-based listening tests. Journal of Open Research Software, 6(1):8, 2018.
  30. Noam Shazeer. Glu variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
  31. Byte pair encoding: A text compression scheme that accelerates pattern matching. 09 1999.
  32. Music transcription modelling and composition using deep learning. CoRR, abs/1604.08723, 2016. URL http://arxiv.org/abs/1604.08723.
  33. Roformer: Enhanced transformer with rotary position embedding, 2023.
  34. Anticipatory music transformer. arXiv preprint arXiv:2306.08620, 2023.
  35. Llama: Open and efficient foundation language models. arXiv preprint, 2023a.
  36. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv: 2307.09288, 2023b.
  37. Attention is all you need, 2023.
  38. Learning interpretable representation for controllable polyphonic music generation. arXiv preprint arXiv:2008.07122, 2020.
  39. Whole-song hierarchical generation of symbolic music using cascaded diffusion models. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=sn7CYWyavh.
  40. Exploring the efficacy of pre-trained checkpoints in text-to-music generation task. In The AAAI-23 Workshop on Creative AI Across Modalities, 2023. URL https://openreview.net/forum?id=QmWXskBhesn.
  41. Tunesformer: Forming irish tunes with control codes by bar patching. In Lorenzo Porcaro, Roser Batlle-Roca, and Emilia Gómez (eds.), Proceedings of the 2nd Workshop on Human-Centric Music Information Retrieval 2023 co-located with the 24th International Society for Music Information Retrieval Conference (ISMIR 2023), Milan, Italy, November 10, 2023, volume 3528 of CEUR Workshop Proceedings. CEUR-WS.org, 2023a. URL https://ceur-ws.org/Vol-3528/paper1.pdf.
  42. Clamp: Contrastive language-music pre-training for cross-modal symbolic music information retrieval. In Augusto Sarti, Fabio Antonacci, Mark Sandler, Paolo Bestagini, Simon Dixon, Beici Liang, Gaël Richard, and Johan Pauwels (eds.), Proceedings of the 24th International Society for Music Information Retrieval Conference, ISMIR 2023, Milan, Italy, November 5-9, 2023, pp.  157–165, 2023b. doi: 10.5281/ZENODO.10265247. URL https://doi.org/10.5281/zenodo.10265247.
  43. Fast-hubert: an efficient training framework for self-supervised speech representation learning. In 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), pp.  1–7. IEEE, 2023.
  44. Deep music analogy via latent representation disentanglement. arXiv preprint arXiv:1906.03626, 2019.
  45. YouTokenToMe. Youtokentome: Unsupervised text tokenization library, 2021. URL https://github.com/VKCOM/YouTokenToMe. Available online: https://github.com/VKCOM/YouTokenToMe (accessed on March 25, 2024).
  46. Chatmusician: Understanding and generating music intrinsically with llm. arXiv preprint arXiv:2402.16153, 2024.
  47. Root mean square layer normalization, 2019.
  48. Musicbert: A self-supervised learning of music representation. In Proceedings of the 29th ACM International Conference on Multimedia, pp.  3955–3963, 2021.
Authors (28)
  1. Xingwei Qu (30 papers)
  2. Yuelin Bai (13 papers)
  3. Yinghao Ma (24 papers)
  4. Ziya Zhou (9 papers)
  5. Ka Man Lo (5 papers)
  6. Jiaheng Liu (100 papers)
  7. Ruibin Yuan (43 papers)
  8. Lejun Min (3 papers)
  9. Xueling Liu (5 papers)
  10. Tianyu Zhang (111 papers)
  11. Xinrun Du (23 papers)
  12. Shuyue Guo (10 papers)
  13. Yiming Liang (22 papers)
  14. Yizhi Li (43 papers)
  15. Shangda Wu (18 papers)
  16. Junting Zhou (11 papers)
  17. Tianyu Zheng (28 papers)
  18. Ziyang Ma (73 papers)
  19. Fengze Han (4 papers)
  20. Wei Xue (150 papers)
Citations (7)

Summary

  • The paper introduces MuPT, a transformer model that leverages SMT-ABC notation for synchronized measure alignment across multiple tracks.
  • It employs a decoder-only architecture with extended token capacity (up to 8192 tokens) and a 50,000-token YouTokenToMe BPE vocabulary optimized for symbolic music.
  • Empirical results show superior performance with structurally coherent compositions, and the release of open-source checkpoints promotes further research.

MuPT: Pioneering Symbolic Music Generation with Pretrained Transformers

Introduction to MuPT

The reach of LLMs has extended beyond text to domains like music, where structured data representation and coherence across multiple tracks largely determine the quality of generated output. This paper introduces MuPT, a series of models specialized for symbolic music generation. Unlike conventional approaches that struggle with MIDI's complex structural representation, MuPT adopts ABC Notation together with a novel Synchronized Multi-Track ABC Notation (SMT-ABC Notation) that keeps measures aligned across tracks, markedly improving the structural integrity and quality of the generated music.

Challenges in Symbolic Music Generation

Traditional model architectures and data representations face substantial hurdles in generating coherent and structurally sound music. The predominant use of MIDI in symbolic music modeling often results in models failing to capture the essential structural symmetry that characterizes aesthetically pleasing compositions. This paper identifies and addresses these challenges by:

  • Proposing a transformer decoder-only architecture tailored for symbolic music generation tasks.
  • Introducing a synchronized approach to handle multiple music tracks, ensuring accurate measure alignment across various parts of a composition.
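The synchronization idea can be sketched in a few lines: rather than serializing each track's measures back to back, corresponding measures from all tracks are interleaved so that simultaneous bars sit adjacent in the token stream. The sketch below is an illustration only; the `&` separator and the bar-splitting logic are simplifications, not the paper's exact serialization format.

```python
def interleave_tracks(tracks, sep="&"):
    """Interleave per-track bar sequences so that measure i of every
    track appears together, keeping simultaneous bars adjacent.

    `tracks` is a list of ABC-style body strings whose measures are
    delimited by '|'. The separator `sep` is a stand-in for whatever
    track delimiter the real serialization uses.
    """
    # Split each track into its bars, dropping empty trailing fragments
    per_track = [[b for b in t.split("|") if b.strip()] for t in tracks]
    n_bars = max(len(bars) for bars in per_track)
    out = []
    for i in range(n_bars):
        # Collect bar i from every track (empty if a track is shorter)
        slot = [bars[i] if i < len(bars) else "" for bars in per_track]
        out.append(sep.join(slot))
    return "|".join(out) + "|"

melody = "C2 E2|G4|"
bass = "C,4|C,4|"
print(interleave_tracks([melody, bass]))
# -> "C2 E2&C,4|G4&C,4|"
```

With this layout, a decoder-only model always sees the other tracks' versions of the current measure within a short distance, which is what keeps multi-track generations from drifting out of alignment.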

MuPT Architecture and Innovations

MuPT embodies several technical innovations to optimize performance for music generation tasks:

  • Extended Token Capacity: Models handle sequences of up to 8192 tokens, covering roughly 90% of the symbolic music data in the training set.
  • SMT-ABC Notation: This novel notation system is specifically designed to address the misalignment of measures across different tracks, fostering improved learning outcomes and music quality.
  • Advanced Tokenizer Implementation: The tokenizer is built with the YouTokenToMe BPE framework and a 50,000-token vocabulary optimized for ABC notation, giving the model an efficient and effective encoding of symbolic music data.

Scaling Law Insights

The exploration of the Symbolic Music Scaling (SMS) Law offers a new lens on model performance in music generation:

  • Comprehensive Training Benefits: The SMS Law indicates that extended training with repeated data can still yield significant performance improvements.
  • Optimal Resource Allocation: Insights from the SMS Law guide the allocation of computational resources, ensuring models achieve the best possible outcomes within existing constraints.
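The summary does not reproduce the fitted SMS Law itself, but its flavor can be illustrated with a Chinchilla-style parametric loss extended by a diminishing-returns factor for repeated epochs. Every constant below is hypothetical, chosen only to show the qualitative behavior: additional passes over the same data keep lowering predicted loss, with shrinking gains per pass.

```python
def sms_style_loss(N, D, epochs,
                   E=1.7, A=400.0, B=900.0, alpha=0.34, beta=0.28, r=0.6):
    """Illustrative Chinchilla-style loss with a diminishing-returns
    factor for repeated epochs. All constants are hypothetical, not
    the paper's fitted SMS Law coefficients.

    N: model parameters, D: unique training tokens, epochs: passes over D.
    """
    # Effective data: each extra pass contributes a factor r less than
    # the previous one (geometric series 1 + r + r**2 + ...).
    D_eff = D * (1 - r ** epochs) / (1 - r)
    return E + A / N ** alpha + B / D_eff ** beta

one = sms_style_loss(1e9, 2e10, epochs=1)
two = sms_style_loss(1e9, 2e10, epochs=2)
four = sms_style_loss(1e9, 2e10, epochs=4)
print(one > two > four)  # True: each pass helps, by less each time
```

A law of this shape is what makes resource allocation tractable: given a compute budget, one can compare predicted losses across (parameters, data, epochs) combinations and pick the minimum instead of sweeping them empirically.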

Empirical Validation and Community Contributions

Empirical results demonstrate MuPT's superior performance over existing baselines: the models generate music that is both structurally coherent and aesthetically pleasing. The paper also commits to open-sourcing intermediate training checkpoints and foundational models to stimulate further research in symbolic music modeling.

Future Directions and Conclusion

MuPT marks a significant advance in symbolic music generation, addressing longstanding representation challenges and setting a new baseline for model performance in this domain. The insights from the SMS Law, together with the open-sourced foundation models available for community use, point to continued progress. As the community optimizes and extends MuPT's capabilities, symbolic music modeling stands to gain new levels of creativity and intricacy in automated composition.