Dance-to-Music Generation with Encoder-based Textual Inversion (2401.17800v2)
Abstract: The seamless integration of music with dance movements is essential for communicating the artistic intent of a dance piece. This alignment also significantly improves the immersive quality of gaming experiences and animation productions. Although there has been remarkable progress in generating high-fidelity music from textual descriptions, current methods mainly modulate overall characteristics such as genre and emotional tone. They often overlook fine-grained control of temporal rhythm, which is indispensable for dance music because the musical beats must align with the dancers' movements. To address this gap, we propose an encoder-based textual inversion technique that augments text-to-music models with visual control, enabling personalized music generation. Specifically, we develop dual-path rhythm-genre inversion to integrate the rhythm and genre of a dance motion sequence into the textual space of a text-to-music model. Unlike traditional textual inversion methods, which directly optimize text embeddings to reconstruct a single target object, our approach uses separate rhythm and genre encoders to predict the text embeddings of two pseudo-words, adapting to varying rhythms and genres. We collect a new dataset called In-the-wild Dance Videos (InDV) and demonstrate that our approach outperforms state-of-the-art methods across multiple evaluation metrics. Furthermore, our method adapts to changes in tempo and integrates smoothly with the inherent text-guided generation capability of the pre-trained model. Our source code and demo videos are available at https://github.com/lsfhuihuiff/Dance-to-music_Siggraph_Asia_2024
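At its core, the described method trains two separate encoders that map a dance-motion sequence to the token embeddings of two pseudo-words, one carrying rhythm and one carrying genre, which are then injected into the prompt embedding of a frozen text-to-music model. The sketch below illustrates this dual-path structure in PyTorch; the encoder architectures, dimensions, and function names are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class RhythmGenreInversion(nn.Module):
    """Minimal sketch of dual-path rhythm-genre inversion.

    Two encoders map a dance-motion sequence to the text embeddings of
    two pseudo-words ("<rhythm>" and "<genre>"), which are spliced into
    the prompt embedding of a frozen text-to-music model. All module
    shapes and layer choices here are assumptions for illustration.
    """

    def __init__(self, motion_dim=512, embed_dim=768):
        super().__init__()
        # Rhythm path: temporal encoder over per-frame motion features,
        # so the pseudo-word can reflect beat timing.
        self.rhythm_encoder = nn.GRU(motion_dim, embed_dim, batch_first=True)
        # Genre path: global encoder over pooled motion features,
        # so the pseudo-word can reflect overall style.
        self.genre_encoder = nn.Sequential(
            nn.Linear(motion_dim, embed_dim),
            nn.ReLU(),
            nn.Linear(embed_dim, embed_dim),
        )

    def forward(self, motion):
        # motion: (batch, frames, motion_dim)
        _, h = self.rhythm_encoder(motion)         # h: (1, batch, embed_dim)
        rhythm_token = h.squeeze(0)                # embedding for "<rhythm>"
        genre_token = self.genre_encoder(motion.mean(dim=1))  # "<genre>"
        return rhythm_token, genre_token

def inject_pseudo_words(prompt_embeds, rhythm_token, genre_token,
                        rhythm_pos, genre_pos):
    """Replace the placeholder-token embeddings in a tokenized prompt
    such as "<genre> music with <rhythm>" with the encoder outputs,
    leaving the frozen text-to-music backbone itself untouched."""
    prompt_embeds = prompt_embeds.clone()
    prompt_embeds[:, rhythm_pos] = rhythm_token
    prompt_embeds[:, genre_pos] = genre_token
    return prompt_embeds
```

Because only the two pseudo-word embeddings vary per input while the generator stays frozen, this design preserves the pre-trained model's ordinary text conditioning, which is consistent with the abstract's claim that the method integrates with the model's text-guided generation capability.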
Authors: Sifei Li, Weiming Dong, Yuxin Zhang, Fan Tang, Chongyang Ma, Oliver Deussen, Tong-Yee Lee, Changsheng Xu