Whole-Song Hierarchical Generation of Symbolic Music Using Cascaded Diffusion Models (2405.09901v1)
Abstract: Recent deep music generation studies have put much emphasis on long-term generation with structure. However, we have yet to see high-quality, well-structured whole-song generation. In this paper, we make the first attempt to model a full music piece through a realization of compositional hierarchy. With a focus on symbolic representations of pop songs, we define a hierarchical language in which each level of the hierarchy focuses on the semantics and context dependency at a certain musical scope. The high-level languages reveal whole-song form, phrase, and cadence, whereas the low-level languages focus on notes, chords, and their local patterns. A cascaded diffusion model is trained to model the hierarchical language, where each level is conditioned on its upper levels. Experiments and analysis show that our model is capable of generating full-piece music with recognizable global verse-chorus structure and cadences, and that the music quality is higher than that of the baselines. Additionally, we show that the proposed model is controllable in a flexible way. By sampling from the interpretable hierarchical languages or adjusting pre-trained external representations, users can control the music flow via various features such as phrase harmonic structures, rhythmic patterns, and accompaniment texture.
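The cascade described in the abstract, where each level is generated conditioned on the levels above it, can be illustrated with a toy sketch. This is not the paper's model: the denoiser here is a hypothetical stand-in that only pulls a noisy sequence toward the average of its upsampled upper levels, and the function names, level lengths (4, 16, 64), and step count are assumptions chosen for illustration.

```python
import random


def upsample(seq, length):
    """Stretch a coarse sequence to `length` by nearest-neighbour repetition."""
    return [seq[i * len(seq) // length] for i in range(length)]


def toy_denoise_step(x, conditions, rate=0.5):
    """Hypothetical stand-in for a learned denoiser (assumption, not the paper's).

    Nudges x toward the average of the upsampled upper-level conditions,
    or toward zero when there is nothing to condition on (the top level).
    """
    if not conditions:
        return [v * (1 - rate) for v in x]
    n = len(x)
    ups = [upsample(c, n) for c in conditions]
    target = [sum(u[i] for u in ups) / len(ups) for i in range(n)]
    return [v + rate * (t - v) for v, t in zip(x, target)]


def sample_level(length, conditions, steps, rng):
    """Start from Gaussian noise and apply the toy denoiser repeatedly."""
    x = [rng.gauss(0.0, 1.0) for _ in range(length)]
    for _ in range(steps):
        x = toy_denoise_step(x, conditions)
    return x


def sample_cascade(lengths, steps=20, seed=0):
    """Sample levels top-down; each level sees every level generated before it."""
    rng = random.Random(seed)
    levels = []
    for length in lengths:
        levels.append(sample_level(length, list(levels), steps, rng))
    return levels


# Three levels of increasing resolution, e.g. form -> phrase -> note scope.
levels = sample_cascade([4, 16, 64])
print([len(level) for level in levels])
```

The point of the sketch is the control flow in `sample_cascade`: coarse levels are fixed first and passed down as conditions, so a user could in principle replace any upper level with a hand-written one before sampling the levels below it.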