MuPT: A Generative Symbolic Music Pretrained Transformer (arXiv:2404.06393v4)
Abstract: In this paper, we explore the application of Large Language Models (LLMs) to the pre-training of music. While MIDI is the prevalent representation in music modeling, our findings suggest that LLMs are inherently more compatible with ABC Notation, which aligns more closely with their design and strengths and thereby improves performance in musical composition. To address the challenge of misaligned measures across tracks during generation, we propose Synchronized Multi-Track ABC Notation (SMT-ABC Notation), which preserves coherence across multiple musical tracks. Our contributions include a series of models capable of handling up to 8192 tokens, covering 90% of the symbolic music data in our training set. Furthermore, we explore the implications of the Symbolic Music Scaling Law (SMS Law) for model performance. The results indicate a promising direction for future research in music generation, and our open-source contributions offer extensive resources for community-led research.
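The core idea behind SMT-ABC Notation is to interleave corresponding measures from different tracks so they stay adjacent in the token stream, rather than serializing each track in full. A minimal sketch of that bar-interleaving idea is shown below; the function name and the `[V:label]` tagging scheme are illustrative assumptions, not the paper's exact serialization format.

```python
# Hypothetical sketch of synchronizing bars across ABC voices.
# We split each voice body on bar lines ("|") and interleave
# measure-wise, so bar i of every track appears before bar i+1
# of any track. This mirrors the SMT-ABC goal of keeping
# corresponding measures aligned during generation.

def interleave_voices(voices):
    """voices: dict mapping a voice label to its ABC note body."""
    # Split each voice into bars, dropping empty trailing fragments.
    bars = {v: [b for b in body.split("|") if b.strip()]
            for v, body in voices.items()}
    n = max(len(b) for b in bars.values())
    out = []
    for i in range(n):
        for label, bs in bars.items():
            if i < len(bs):
                # Tag each bar with its voice so tracks stay distinguishable.
                out.append(f"[V:{label}]{bs[i].strip()}|")
    return "".join(out)

melody = "C D E F|G A B c|"
bass = "C,2 G,2|C,2 G,2|"
print(interleave_voices({"1": melody, "2": bass}))
# → [V:1]C D E F|[V:2]C,2 G,2|[V:1]G A B c|[V:2]C,2 G,2|
```

With this layout, a model generating left to right emits bar i of every track before moving to bar i+1, so track lengths cannot drift apart mid-piece.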
Authors: Xingwei Qu, Yuelin Bai, Yinghao Ma, Ziya Zhou, Ka Man Lo, Jiaheng Liu, Ruibin Yuan, Lejun Min, Xueling Liu, Tianyu Zhang, Xinrun Du, Shuyue Guo, Yiming Liang, Yizhi Li, Shangda Wu, Junting Zhou, Tianyu Zheng, Ziyang Ma, Fengze Han, Wei Xue