Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment (2404.09313v3)
Abstract: A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently, and little attention has been paid to song synthesis. In this work, we propose a novel task, text-to-song synthesis, which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representations for controllable V2A synthesis. To alleviate data scarcity for our research, we build a Chinese song dataset mined from a music website. Evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found at https://text2songMelodist.github.io/Sample/.
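The tri-tower contrastive pretraining mentioned above can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not Melodist's actual objective: it pairs three embedding towers (text, vocal, accompaniment) with a symmetric InfoNCE-style loss, averaging the pairwise contrastive terms. All function names and the choice of temperature are hypothetical.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """InfoNCE contrastive loss between two batches of embeddings.

    a, b: (batch, dim) arrays; row i of `a` is the positive pair of row i of `b`.
    """
    # L2-normalize so dot products are cosine similarities.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Log-softmax over each row, with max-subtraction for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return -np.mean(np.diag(log_probs))

def tri_tower_loss(text_emb, vocal_emb, accomp_emb):
    """Average the pairwise InfoNCE terms over the three towers
    (a plausible sketch of a tri-tower objective, not the paper's exact loss)."""
    return (info_nce(text_emb, vocal_emb)
            + info_nce(text_emb, accomp_emb)
            + info_nce(vocal_emb, accomp_emb)) / 3.0
```

With perfectly aligned towers (identical embeddings per item), the diagonal similarities dominate and the loss is near zero; with unrelated embeddings, the loss approaches log(batch_size), which is what drives the towers into a shared space during pretraining.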