Mozart's Touch: A Lightweight Multi-modal Music Generation Framework Based on Pre-Trained Large Models
Abstract: In recent years, AI-Generated Content (AIGC) has advanced rapidly, facilitating the creation of music, images, and other artistic forms across a wide range of industries. However, current models for image- and video-to-music synthesis struggle to capture the nuanced emotions and atmosphere conveyed by visual content. To fill this gap, we propose Mozart's Touch, a multi-modal music generation framework capable of generating music aligned with cross-modal inputs such as images, videos, and text. The framework consists of three key components: a Multi-modal Captioning Module, an LLM Understanding & Bridging Module, and a Music Generation Module. Unlike traditional end-to-end methods, Mozart's Touch uses LLMs to accurately interpret visual elements without requiring the training or fine-tuning of music generation models, and it provides efficiency and transparency through clear, interpretable prompts. We also introduce the "LLM-Bridge" method to resolve the heterogeneous representation challenges between descriptive texts from different modalities. Through a series of objective and subjective evaluations, we demonstrate that Mozart's Touch outperforms current state-of-the-art models. Our code and examples are available at https://github.com/TiffanyBlews/MozartsTouch.
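The three-component design described in the abstract can be pictured as a chain of text-producing stages. The following is a minimal sketch under stated assumptions, not the authors' implementation: the names `MusicPipeline`, `PipelineResult`, `stub_captioner`, `stub_bridge`, and `stub_generator` are hypothetical, and the stubs stand in for the pre-trained captioning model, the LLM, and the music generation model that a real system would call. It only illustrates how the intermediate texts stay inspectable between stages.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class PipelineResult:
    caption: str        # text produced by the captioning stage
    music_prompt: str   # text produced by the bridging stage
    audio: List[float]  # placeholder waveform samples


@dataclass
class MusicPipeline:
    # Each stage is a plain callable, so pre-trained models can be
    # swapped in without training or fine-tuning the other stages.
    captioner: Callable[[str], str]          # media path -> caption
    bridge: Callable[[str], str]             # caption -> music description
    generator: Callable[[str], List[float]]  # music description -> audio

    def run(self, media_path: str) -> PipelineResult:
        caption = self.captioner(media_path)
        music_prompt = self.bridge(caption)
        audio = self.generator(music_prompt)
        # Return the intermediate texts too, so they can be inspected.
        return PipelineResult(caption, music_prompt, audio)


# Hypothetical stand-ins for the real models (assumptions, not real APIs).
def stub_captioner(media_path: str) -> str:
    return f"caption for {media_path}"


def stub_bridge(caption: str) -> str:
    return f"music description derived from: {caption}"


def stub_generator(music_prompt: str) -> List[float]:
    return [0.0] * 8  # eight samples of silence as a placeholder


pipeline = MusicPipeline(stub_captioner, stub_bridge, stub_generator)
result = pipeline.run("sunset.jpg")
print(result.caption)
print(result.music_prompt)
```

Because every stage exchanges plain text, the caption and the music description can be logged or edited before audio is generated, which is one way to read the transparency claim in the abstract.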