
MelodyT5: A Unified Score-to-Score Transformer for Symbolic Music Processing (2407.02277v2)

Published 2 Jul 2024 in cs.SD and eess.AS

Abstract: In the domain of symbolic music research, the progress of developing scalable systems has been notably hindered by the scarcity of available training data and the demand for models tailored to specific tasks. To address these issues, we propose MelodyT5, a novel unified framework that leverages an encoder-decoder architecture tailored for symbolic music processing in ABC notation. This framework challenges the conventional task-specific approach, considering various symbolic music tasks as score-to-score transformations. Consequently, it integrates seven melody-centric tasks, from generation to harmonization and segmentation, within a single model. Pre-trained on MelodyHub, a newly curated collection featuring over 261K unique melodies encoded in ABC notation and encompassing more than one million task instances, MelodyT5 demonstrates superior performance in symbolic music processing via multi-task transfer learning. Our findings highlight the efficacy of multi-task transfer learning in symbolic music processing, particularly for data-scarce tasks, challenging the prevailing task-specific paradigms and offering a comprehensive dataset and framework for future explorations in this domain.


Summary

  • The paper introduces MelodyT5, a Transformer-based encoder-decoder that employs multi-task pre-training to unify score-to-score symbolic music processing.
  • It utilizes an innovative bar patching technique with ABC notation to effectively capture both global and local musical patterns.
  • Experimental results show improved performance over task-specific models, enhancing harmonic compatibility and segmentation accuracy.

MelodyT5: A Unified Score-to-Score Transformer for Symbolic Music Processing

Abstract:

In symbolic music processing, Wu et al. propose MelodyT5, an encoder-decoder framework that aims to overcome the constraints of task-specific models. Using multi-task transfer learning on a newly curated dataset, MelodyHub, MelodyT5 handles seven distinct melody-centric tasks within a single model. This essay gives an overview of the paper, covering its methodology, findings, and implications for the future development of symbolic music models.

Introduction:

Symbolic music processing, which involves the manipulation of notated music rather than continuous audio signals, presents unique challenges and opportunities within the field of AI. Historically, AI models in this domain have been task-specific, focusing on individual applications without leveraging the potential synergies across different tasks. The scarcity of annotated symbolic music datasets further compounds these challenges, limiting the generalizability and performance of AI models in this area.

Wu et al.'s MelodyT5 addresses these challenges by conceptualizing symbolic music tasks as score-to-score transformations, akin to the text-to-text framework in NLP. This paradigm shift allows for a unified approach to symbolic music processing, making use of multi-task transfer learning to improve performance, particularly in data-scarce tasks.

Methodology:

The core of MelodyT5 is its Transformer-based encoder-decoder architecture, enhanced by the bar patching technique to handle longer sequences efficiently. ABC notation is utilized for its concise representation of musical elements, facilitating the application of NLP techniques.

Data Representation:

MelodyT5 employs ABC notation to encode musical scores as text. To keep sequences manageable, the bar patching technique splits a score at its bar lines and treats each bar as a single patch. This shortens the sequence the model has to process while keeping each bar intact as a musically meaningful unit.
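
As a rough illustration of bar patching, the sketch below splits the body of an ABC tune at bar lines and pads each bar to a fixed-length patch of character codes. The patch size of 16, the zero padding ID, the bar-line pattern, and the function names are assumptions made for this example, not details taken from the paper.

```python
import re

PATCH_SIZE = 16  # assumed patch length in characters (illustrative value)
PAD_ID = 0       # assumed padding ID for unused patch positions

# Common ABC bar-line tokens; multi-character forms come first so that
# "|]" is matched as one token rather than as "|" followed by "]".
BARLINE = re.compile(r"(\|\]|\|\||\[\||\|:|:\||::|\|)")


def split_into_bars(abc_body):
    """Split an ABC tune body into bars, keeping each closing bar line."""
    bars = []
    current = ""
    # re.split with a capturing group also returns the separators.
    for part in BARLINE.split(abc_body):
        current += part
        if BARLINE.fullmatch(part):
            bars.append(current)
            current = ""
    if current.strip():
        bars.append(current)  # trailing notes without a closing bar line
    return bars


def bar_to_patch(bar, patch_size=PATCH_SIZE):
    """Encode one bar as character codes, truncated or padded to patch_size."""
    codes = [ord(ch) for ch in bar[:patch_size]]
    return codes + [PAD_ID] * (patch_size - len(codes))


def patchify(abc_body):
    """Turn an ABC tune body into a list of fixed-length bar patches."""
    return [bar_to_patch(bar) for bar in split_into_bars(abc_body)]


patches = patchify("CDEF|GABc|]")
print(len(patches), "patches of length", len(patches[0]))
```

Here an 11-character tune body becomes two patches, so a patch-level model would see a sequence of length 2 rather than 11.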

Model Architecture:

The model architecture includes:

  • Linear Projection: Converts bar patches into dense embeddings, providing input for the patch-level encoder.
  • Patch-level Encoder: Generates contextualized representations using self-attention and feed-forward networks.
  • Patch-level Decoder: Uses these representations for autoregressive generation of the next bar patch.
  • Character-level Decoder: Produces detailed character sequences within bar patches, reconstructing the target musical score.

This hierarchical structure allows MelodyT5 to capture both global and local patterns in musical compositions.
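
To make the first stage of this hierarchy concrete, the sketch below turns one bar patch into a dense embedding with a linear projection. It assumes that each character in a patch is one-hot encoded and the result is flattened before projection, which this summary does not specify. The sizes, the random weights, and the function names are all illustrative.

```python
import random

PATCH_SIZE = 4    # characters per patch (tiny illustrative value)
VOCAB_SIZE = 128  # assumed ASCII character vocabulary
D_MODEL = 8       # embedding width (illustrative)


def one_hot_flat(patch):
    """Flatten a patch of character codes into one long one-hot vector."""
    vec = [0.0] * (PATCH_SIZE * VOCAB_SIZE)
    for pos, code in enumerate(patch):
        vec[pos * VOCAB_SIZE + code] = 1.0
    return vec


def make_projection(seed=0):
    """Random stand-in for learned weights: (PATCH_SIZE * VOCAB_SIZE) x D_MODEL."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 0.02) for _ in range(D_MODEL)]
            for _ in range(PATCH_SIZE * VOCAB_SIZE)]


def project(patch, weights):
    """Linear projection of the flattened one-hot patch to a dense embedding."""
    flat = one_hot_flat(patch)
    return [sum(flat[i] * weights[i][j] for i in range(len(flat)))
            for j in range(D_MODEL)]


weights = make_projection()
embedding = project([ord(ch) for ch in "CDE|"], weights)
print(len(embedding))  # one D_MODEL-sized vector per bar patch
```

Under these assumptions, a patch-level encoder would consume one such vector per bar, while a character-level decoder would expand each predicted patch back into individual characters.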

Pre-training Objective:

The pre-training objective relies on cross-entropy loss, focusing on next token prediction. By minimizing cross-entropy loss across tokens in the target sequence, MelodyT5 learns to perform diverse symbolic music tasks under a unified framework optimized for score-to-score transformations.
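
Written out, a next-token cross-entropy objective of this kind typically takes the form below. The notation is an assumption for illustration: x denotes the input score, y_1, ..., y_T the tokens of the target score, and θ the model parameters.

```latex
\mathcal{L}(\theta) = -\sum_{t=1}^{T} \log p_\theta\left(y_t \mid y_{<t},\, x\right)
```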

Dataset:

MelodyHub, the dataset used for pre-training MelodyT5, includes 261,900 unique melodies across over one million task instances spanning seven tasks: generation, harmonization, melodization, segmentation, transcription, cataloging, and variation. These were meticulously curated from publicly available sources, ensuring high quality and uniformity.
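
To show what a score-to-score task instance might look like, the sketch below builds a harmonization pair from a lead sheet in ABC notation, where chord symbols are written in double quotes before the notes they accompany. The dictionary layout and function name are hypothetical and are not MelodyHub's actual schema. Stripping every quoted string is a simplification, since ABC also uses quotes for text annotations.

```python
import re

# In ABC notation, chord symbols appear in double quotes, e.g. "C"CDEF.
QUOTED = re.compile(r'"[^"]*"')


def make_harmonization_pair(abc_with_chords):
    """Build one hypothetical task instance: melody in, melody with chords out."""
    melody_only = QUOTED.sub("", abc_with_chords)
    return {
        "task": "harmonization",    # hypothetical field names
        "input": melody_only,       # score without chord symbols
        "output": abc_with_chords,  # same score with chord symbols
    }


pair = make_harmonization_pair('"C"CDEF|"G"GABc|]')
print(pair["input"])
print(pair["output"])
```

Both sides of the pair are plain ABC text, which is what allows a single encoder-decoder to be trained on every task with the same input and output format.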

Experiments:

Settings:

Experiments were conducted on MelodyHub, with the data split into training and validation sets. The model was configured to process long sequences; the paper gives the exact hyperparameters and training protocol.

Ablation Studies:

Ablation studies showed that multi-task pre-training improves performance across tasks. It yielded lower bits-per-byte (BPB) scores and better task metrics than either task-specific pre-training or no pre-training, indicating better generalization and efficiency.
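
For readers unfamiliar with the metric, bits-per-byte can be derived from the summed cross-entropy of a sequence by converting from nats to bits and dividing by the sequence length in bytes. The sketch below assumes natural-log token probabilities and a byte count supplied by the caller. The function name is illustrative and this is not the paper's evaluation code.

```python
import math


def bits_per_byte(token_log_probs, num_bytes):
    """Convert summed negative log-likelihood (in nats) to bits per byte."""
    total_nats = -sum(token_log_probs)
    return total_nats / (math.log(2) * num_bytes)


# Four one-byte characters, each predicted with probability 0.5,
# cost one bit each, so the result is 1 bit per byte.
log_probs = [math.log(0.5)] * 4
print(bits_per_byte(log_probs, num_bytes=4))
```

Lower values mean the model assigns higher probability to the reference score, which is why a drop in BPB is read as an improvement.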

Comparative Evaluations:

Comparisons with task-specific models such as TunesFormer, STHarm, CMT, and Bi-LSTM-CRF showed that MelodyT5 outperforms these baselines on most tasks. Objective metrics indicated better controllability, harmonic compatibility, and segmentation accuracy. Subjective A/B tests further supported MelodyT5's advantages in generation and harmonization, although listeners preferred CMT for melodization, which points to an area for future improvement.

Implications and Future Directions:

MelodyT5 signifies a substantial advancement in symbolic music processing by leveraging multi-task transfer learning. Its ability to generalize across different tasks without task-specific modifications points towards the potential for developing comprehensive, versatile music models.

Conclusion:

In conclusion, MelodyT5 presents a robust and unified framework for symbolic music processing, overcoming the traditional limitations of task-specific models through multi-task transfer learning. The curated MelodyHub dataset provides a rich resource for future research, enabling advancements across various melody-centric tasks. Further work is needed to enhance the model's performance, particularly in complex compositions, aligning AI-generated music more closely with human creative processes.