Music Style Transfer with Time-Varying Inversion of Diffusion Models (2402.13763v1)
Abstract: With the development of diffusion models, text-guided image style transfer has demonstrated high-quality controllable synthesis results. However, the utilization of text for diverse music style transfer poses significant challenges, primarily due to the limited availability of matched audio-text datasets. Music, being an abstract and complex art form, exhibits variations and intricacies even within the same genre, thereby making accurate textual descriptions challenging. This paper presents a music style transfer approach that effectively captures musical attributes using minimal data. We introduce a novel time-varying textual inversion module to precisely capture mel-spectrogram features at different levels. During inference, we propose a bias-reduced stylization technique to obtain stable results. Experimental results demonstrate that our method can transfer the style of specific instruments, as well as incorporate natural sounds to compose melodies. Samples and source code are available at https://lsfhuihuiff.github.io/MusicTI/.