Accompanied Singing Voice Synthesis with Fully Text-controlled Melody (2407.02049v1)

Published 2 Jul 2024 in eess.AS, cs.CL, and cs.SD

Abstract: Text-to-song (TTSong) is a music generation task that synthesizes accompanied singing voices. Current TTSong methods, inherited from singing voice synthesis (SVS), require melody-related information that can sometimes be impractical, such as music scores or MIDI sequences. We present MelodyLM, the first TTSong model that generates high-quality song pieces with fully text-controlled melodies, achieving minimal user requirements and maximum control flexibility. MelodyLM explicitly models MIDI as the intermediate melody-related feature and sequentially generates vocal tracks in an LLM manner, conditioned on textual and vocal prompts. The accompaniment music is subsequently synthesized by a latent diffusion model with hybrid conditioning for temporal alignment. With minimal requirements, users only need to input lyrics and a reference voice to synthesize a song sample. For full control, just input textual prompts or even directly input MIDI. Experimental results indicate that MelodyLM achieves superior performance in terms of both objective and subjective metrics. Audio samples are available at https://melodylm666.github.io.


Summary

  • The paper introduces MelodyLM, a model that generates accompanied singing voices using fully text-controlled melody, achieving high key accuracy and reduced pitch discrepancies.
  • It employs a three-stage framework—text-to-MIDI, text-to-vocal with multi-scale transformers, and vocal-to-accompaniment via latent diffusion models—to align vocals and instruments effectively.
  • Evaluations demonstrate improved F0 frame error and prosody quality compared to baselines, making singing voice synthesis more accessible and versatile.

Overview of "Accompanied Singing Voice Synthesis with Fully Text-controlled Melody"

The paper introduces MelodyLM, a novel approach to Text-to-Song (TTSong) synthesis that emphasizes fully text-controlled melody generation for accompanied singing voice. Unlike existing methods, which rely heavily on musical scores or MIDI sequences, MelodyLM minimizes user requirements while maximizing control flexibility. This model represents a significant step towards making singing voice synthesis (SVS) more accessible and versatile.

MelodyLM employs a multi-stage process: it first models MIDI sequences as intermediate features, then sequentially generates vocal tracks conditioned on both textual and vocal prompts, and finally synthesizes accompaniment music using a latent diffusion model with hybrid conditioning for temporal alignment. At minimum, the user only needs to input lyrics and a reference voice to create a song, with optional inputs for finer control.

Methodology and Framework

MelodyLM operates through a three-stage framework (a schematic sketch in code follows the list):

  1. Text-to-MIDI: MIDI note sequences are generated conditioned on text prompts. This step introduces melody-related features without needing prior musical scores.
  2. Text-to-Vocal: Using a language-model framework built on a multi-scale transformer, the model generates vocal tracks that align with both the generated MIDI and the textual inputs.
  3. Vocal-to-Accompaniment: Accompaniment music is synthesized via a latent diffusion model that aligns the generated vocals with instrumental tracks.
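
To make the division of labor concrete, the sketch below lays out the three stages as plain Python functions. Every function name, signature, and data format here is an illustrative assumption rather than the authors' actual interface; the bodies are placeholders that only mirror the data flow described above (lyrics plus an optional text prompt in, vocals and accompaniment out, with MIDI as the intermediate).

```python
# Schematic sketch of the three-stage pipeline described above.
# Names, signatures, and data formats are assumptions for illustration only.
from typing import List, Optional, Tuple

Note = Tuple[int, float]  # (MIDI pitch, duration in beats) -- assumed format


def text_to_midi(lyrics: str, text_prompt: str) -> List[Note]:
    """Stage 1: predict a MIDI note sequence from lyrics and a text prompt.
    In MelodyLM this is a language-model-style generator; here we return a
    fixed placeholder melody."""
    return [(60, 0.5), (62, 0.5), (64, 1.0)]


def text_to_vocal(lyrics: str, midi: List[Note], reference_voice: bytes) -> bytes:
    """Stage 2: generate the vocal track conditioned on lyrics, the MIDI
    sequence, and a reference voice for timbre. Placeholder returns empty audio."""
    return b""


def vocal_to_accompaniment(vocals: bytes, text_prompt: str) -> bytes:
    """Stage 3: synthesize a temporally aligned accompaniment with a latent
    diffusion model conditioned on the vocals and the text prompt."""
    return b""


def generate_song(
    lyrics: str,
    reference_voice: bytes,
    text_prompt: str = "",
    midi_override: Optional[List[Note]] = None,
) -> Tuple[bytes, bytes]:
    """Minimal usage: lyrics + reference voice. A text prompt or a directly
    supplied MIDI sequence are optional levers for fuller control."""
    midi = midi_override if midi_override is not None else text_to_midi(lyrics, text_prompt)
    vocals = text_to_vocal(lyrics, midi, reference_voice)
    accompaniment = vocal_to_accompaniment(vocals, text_prompt)
    return vocals, accompaniment  # mixed downstream into the final song
```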

The paper leverages recent advances in language models (LMs) and introduces MIDI sequences as an intermediate feature to bridge the gap between textual descriptions and music generation. Additionally, MelodyLM draws on both pre-existing music and lyrics to shape its generative process.
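
As a rough illustration of what "MIDI as an intermediate feature" can look like for a language model, the snippet below serializes a short (pitch, duration) note sequence into discrete tokens. The tokenization scheme (PITCH_/DUR_ tokens, a pitch of 0 for rests, a quarter-beat duration grid) is a common convention assumed here for illustration; the paper's actual vocabulary and note encoding may differ.

```python
# Illustrative serialization of a MIDI note sequence into LM tokens.
# The exact scheme MelodyLM uses is not reproduced here.
notes = [
    (60, 0.5),   # C4, half a beat
    (62, 0.5),   # D4
    (64, 1.0),   # E4, one beat
    (0,  0.5),   # rest (pitch 0 used as a rest marker in this sketch)
]

def notes_to_tokens(notes, duration_step=0.25):
    """Flatten (MIDI pitch, duration-in-beats) pairs into discrete tokens."""
    tokens = []
    for pitch, dur in notes:
        tokens.append(f"PITCH_{pitch}")
        tokens.append(f"DUR_{int(round(dur / duration_step))}")
    return tokens

print(notes_to_tokens(notes))
# ['PITCH_60', 'DUR_2', 'PITCH_62', 'DUR_2', 'PITCH_64', 'DUR_4', 'PITCH_0', 'DUR_2']
```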

Evaluation and Results

The experiments demonstrate MelodyLM's superiority over existing baselines in both objective and subjective evaluations. For MIDI generation, MelodyLM achieved a high key accuracy rate and significantly reduced pitch and duration discrepancies compared to other models. The singing voice synthesis stage showed improvements in F0 frame error (FFE), and listeners rated the generated singing voices favorably in terms of prosody and quality, albeit with some challenges in singer similarity due to the variable quality of the training data.
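
For reference, F0 frame error (FFE) is conventionally computed as the fraction of frames with either a voicing decision error or a gross pitch error (more than roughly 20% deviation when both reference and estimate are voiced). The snippet below is a minimal NumPy implementation under that conventional definition; the paper's exact evaluation script and thresholds are assumptions here.

```python
import numpy as np

def f0_frame_error(f0_ref, f0_est, gross_threshold=0.2):
    """FFE: fraction of frames with a voicing decision error or a gross
    pitch error (>20% deviation while both frames are voiced).
    Unvoiced frames are encoded as 0 Hz. Conventional definition; the
    paper's evaluation details may differ."""
    f0_ref = np.asarray(f0_ref, dtype=float)
    f0_est = np.asarray(f0_est, dtype=float)
    voiced_ref = f0_ref > 0
    voiced_est = f0_est > 0

    voicing_error = voiced_ref != voiced_est
    both_voiced = voiced_ref & voiced_est
    gross_error = np.zeros_like(voicing_error)
    gross_error[both_voiced] = (
        np.abs(f0_est[both_voiced] - f0_ref[both_voiced]) / f0_ref[both_voiced]
        > gross_threshold
    )
    return float(np.mean(voicing_error | gross_error))

# Toy example: 1 voicing error + 1 gross pitch error out of 5 frames -> 0.4
ref = [220.0, 220.0, 0.0, 330.0, 440.0]
est = [221.0, 300.0, 110.0, 331.0, 441.0]
print(f0_frame_error(ref, est))  # 0.4
```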

When tested on generating both vocals and accompaniment, MelodyLM outperformed existing models such as Melodist, particularly in blending vocals with instruments, showing strong alignment with the intended melody and text prompts.

Implications and Future Directions

The implications of this research are manifold, both from practical and theoretical perspectives. Practically, MelodyLM's framework reduces the barriers for creating synthesized music, making it more accessible for users with limited musical training. Theoretically, it opens new pathways for exploring how LLMs can interface with other models to produce coherent, high-quality artistic outputs.

Future efforts could expand the model's training dataset to include a broader variety of music styles beyond Mandarin pop, facilitating more universal applications. Additionally, streamlining the multi-stage process into a more cohesive end-to-end framework might improve efficiency and reduce complexity.

In conclusion, MelodyLM showcases promising capabilities in synthesizing high-quality accompanied songs with flexible user inputs facilitated by contemporary language and diffusion models, staking out new territory in SVS research and application.
