Sing-On-Your-Beat: Simple Text-Controllable Accompaniment Generations
Abstract: Singing is one of the most cherished forms of human entertainment. However, creating a beautiful song requires an accompaniment that complements the vocals and matches the song's instrumentation and genre. Building on advances in deep learning, previous research has generated suitable accompaniments, but the results often fail to match the desired instrumentation and genre precisely. To address this, we propose a straightforward method that controls the accompaniment through text prompts, so that the generated music both complements the vocals and meets the song's instrumentation and genre requirements. In extensive experiments, our method successfully generates 10-second accompaniments from vocal input and text control.