Text-to-Song: Towards Controllable Music Generation Incorporating Vocals and Accompaniment (2404.09313v3)
Abstract: A song is a combination of singing voice and accompaniment. However, existing works focus on singing voice synthesis and music generation independently, and little attention has been paid to song synthesis. In this work, we propose a novel task, text-to-song synthesis, which incorporates both vocal and accompaniment generation. We develop Melodist, a two-stage text-to-song method that consists of singing voice synthesis (SVS) and vocal-to-accompaniment (V2A) synthesis. Melodist leverages tri-tower contrastive pretraining to learn more effective text representations for controllable V2A synthesis. To alleviate data scarcity for our research, we build a Chinese song dataset mined from a music website. Evaluation results on our dataset demonstrate that Melodist can synthesize songs with comparable quality and style consistency. Audio samples can be found at https://text2songMelodist.github.io/Sample/.
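The tri-tower contrastive pretraining mentioned above can be illustrated with a minimal NumPy sketch. This is an assumption-laden illustration, not Melodist's actual objective: it pairs three embedding towers (text, vocal, accompaniment) with a symmetric InfoNCE-style loss, averaging the pairwise contrastive terms. All function names and the choice of temperature are hypothetical.

```python
import numpy as np

def info_nce(a, b, temperature=0.07):
    """InfoNCE contrastive loss between two batches of embeddings.

    a, b: (batch, dim) arrays; row i of `a` is the positive pair of row i of `b`.
    """
    # L2-normalize so dot products are cosine similarities.
    a = a / np.linalg.norm(a, axis=1, keepdims=True)
    b = b / np.linalg.norm(b, axis=1, keepdims=True)
    logits = a @ b.T / temperature
    # Log-softmax over each row, with max-subtraction for numerical stability.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs sit on the diagonal.
    return -np.mean(np.diag(log_probs))

def tri_tower_loss(text_emb, vocal_emb, accomp_emb):
    """Average the pairwise InfoNCE terms over the three towers
    (a plausible sketch of a tri-tower objective, not the paper's exact loss)."""
    return (info_nce(text_emb, vocal_emb)
            + info_nce(text_emb, accomp_emb)
            + info_nce(vocal_emb, accomp_emb)) / 3.0
```

With perfectly aligned towers (identical embeddings per item), the diagonal similarities dominate and the loss is near zero; with unrelated embeddings, the loss approaches log(batch_size), which is what drives the towers into a shared space during pretraining.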