M$^{2}$UGen: Multi-modal Music Understanding and Generation with the Power of Large Language Models (2311.11255v5)
Abstract: Research leveraging large language models (LLMs) is experiencing a surge. Many works harness the powerful reasoning capabilities of these models to comprehend various modalities, such as text, speech, images, and videos. They also use LLMs to understand human intention and generate desired outputs like images, videos, and music. However, research that combines both understanding and generation using LLMs is still limited and in its nascent stage. To address this gap, we introduce a Multi-modal Music Understanding and Generation (M$^{2}$UGen) framework that integrates an LLM's abilities to comprehend and generate music for different modalities. The M$^{2}$UGen framework is purpose-built to unlock creative potential from diverse sources of inspiration, encompassing music, image, and video, through the use of pretrained MERT, ViT, and ViViT models, respectively. To enable music generation, we explore the use of AudioLDM 2 and MusicGen. Multi-modal understanding and music generation are bridged through the integration of the LLaMA 2 model. Furthermore, we use the MU-LLaMA model to generate extensive datasets that support text/image/video-to-music generation, facilitating the training of our M$^{2}$UGen framework. We conduct a thorough evaluation of our proposed framework. The experimental results demonstrate that our model achieves or surpasses the performance of the current state-of-the-art models.
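The component layout named in the abstract can be pictured as a simple routing table: each input modality maps to one pretrained encoder, LLaMA 2 sits in the middle, and one of two music decoders is attached when music should be generated. The following is only a minimal sketch of that routing, not the authors' implementation. The model names are taken from the abstract; the `PipelinePlan` class, the `plan_pipeline` function, and the default decoder choice are hypothetical names and choices introduced here for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence

# Model names below come from the abstract; the data structures are illustrative.
ENCODER_FOR_MODALITY: Dict[str, str] = {
    "music": "MERT",
    "image": "ViT",
    "video": "ViViT",
}
MUSIC_DECODERS = ("AudioLDM 2", "MusicGen")
BRIDGE_LLM = "LLaMA 2"


@dataclass
class PipelinePlan:
    """Hypothetical description of which components one request would touch."""
    encoders: List[str]
    llm: str
    decoder: Optional[str]  # None when only understanding (text output) is needed


def plan_pipeline(
    input_modalities: Sequence[str],
    generate_music: bool,
    decoder: str = "MusicGen",  # assumption: either decoder could be the default
) -> PipelinePlan:
    """Select one encoder per non-text input modality and an optional music decoder."""
    unknown = [m for m in input_modalities if m not in ENCODER_FOR_MODALITY]
    if unknown:
        raise ValueError(f"unsupported modalities: {unknown}")
    if generate_music and decoder not in MUSIC_DECODERS:
        raise ValueError(f"unsupported decoder: {decoder}")
    encoders = [ENCODER_FOR_MODALITY[m] for m in input_modalities]
    return PipelinePlan(
        encoders=encoders,
        llm=BRIDGE_LLM,
        decoder=decoder if generate_music else None,
    )


if __name__ == "__main__":
    # Video-to-music request: a ViViT encoder feeds the LLM, which conditions a decoder.
    print(plan_pipeline(["video"], generate_music=True, decoder="AudioLDM 2"))
    # Music question answering: a MERT encoder feeds the LLM, no decoder is needed.
    print(plan_pipeline(["music"], generate_music=False))
```

A text-only prompt would pass an empty modality list, so no encoder is selected and the LLM alone conditions the decoder. How the encoder features are actually projected into the LLM, and how the LLM output conditions the decoder, is not stated in the abstract and is deliberately left out of this sketch.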
Authors:
- Atin Sakkeer Hussain
- Shansong Liu
- Chenshuo Sun
- Ying Shan
- Qilong Wu