SongComposer: A Large Language Model for Lyric and Melody Generation in Song Composition (2402.17645v2)

Published 27 Feb 2024 in cs.SD, cs.AI, cs.CL, and eess.AS

Abstract: Creating lyrics and melodies for the vocal track in a symbolic format, known as song composition, demands expert musical knowledge of melody, an advanced understanding of lyrics, and precise alignment between them. Despite achievements in sub-tasks such as lyric generation, lyric-to-melody, and melody-to-lyric, a unified model for song composition has not yet been achieved. In this paper, we introduce SongComposer, a pioneering step towards a unified song composition model that can readily create symbolic lyrics and melodies following instructions. SongComposer is a music-specialized LLM that, for the first time, integrates the capability of simultaneously composing lyrics and melodies into LLMs by leveraging three key innovations: 1) a flexible tuple format for word-level alignment of lyrics and melodies, 2) an extended tokenizer vocabulary for song notes, with scalar initialization based on musical knowledge to capture rhythm, and 3) a multi-stage pipeline that captures musical structure, starting with motif-level melody patterns and progressing to phrase-level structure for improved coherence. Extensive experiments demonstrate that SongComposer outperforms advanced LLMs, including GPT-4, in tasks such as lyric-to-melody generation, melody-to-lyric generation, song continuation, and text-to-song creation. Moreover, we will release SongCompose, a large-scale dataset for training, containing paired lyrics and melodies in Chinese and English.
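
The abstract's first innovation is a tuple format that aligns each lyric word with its melody attributes. The exact fields are not specified in this abstract, so the sketch below is an assumption for illustration only: it pairs each word with hypothetical pitch, duration, and rest fields and serializes them into a flat string an LLM could consume. The names `SongToken` and `to_tuple_line` are invented here and are not from the paper.

```python
# Illustrative sketch only: field names and the tuple layout below are assumptions,
# not the format defined in the SongComposer paper.
from dataclasses import dataclass
from typing import List


@dataclass
class SongToken:
    """One lyric word aligned with its melody attributes (hypothetical fields)."""
    word: str            # lyric word (or syllable)
    pitches: List[str]   # note name(s) sung on this word, e.g. ["C4", "D4"]
    duration: float      # note duration in beats
    rest: float          # rest after the note, if any


def to_tuple_line(tok: SongToken) -> str:
    """Serialize one aligned word as a flat tuple string for an LLM prompt."""
    return f"<{tok.word}|{' '.join(tok.pitches)}|{tok.duration:.2f}|{tok.rest:.2f}>"


# Example: two aligned words from a song line.
line = [
    SongToken("shine", ["C4"], 0.50, 0.00),
    SongToken("bright", ["D4", "E4"], 0.75, 0.25),
]
print(" ".join(to_tuple_line(t) for t in line))
# <shine|C4|0.50|0.00> <bright|D4 E4|0.75|0.25>
```

A word-level format like this keeps lyrics and melody in a single token stream, which is what allows one LLM to generate both sides jointly rather than handling lyric-to-melody and melody-to-lyric as separate pipelines.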
