MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response (2309.08730v3)
Abstract: Large language models (LLMs) have shown immense potential in multimodal applications, yet the intersection of the textual and musical domains remains underexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query response. MusiLingo employs a single projection layer to align music representations from the pre-trained, frozen music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate competitive performance in generating music captions and composing music-related Q&A pairs, and our introduced dataset enables notable advances over previous ones.
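The core architectural idea in the abstract, a single trainable projection layer mapping frozen MERT audio features into a frozen LLM's embedding space, can be sketched as follows. This is a minimal illustrative sketch, not the authors' code: the dimensions are scaled down for readability (the real MERT and LLM embedding sizes, e.g. 1024 and 4096, are assumptions), and the projection is a plain matrix multiply in stdlib Python rather than a deep-learning framework.

```python
import random

# Assumed, illustrative dimensions (real models are far larger,
# e.g. MERT ~1024-d features, an LLM ~4096-d embeddings).
MERT_DIM = 8   # size of a frozen MERT frame-level feature
LLM_DIM = 16   # size of the frozen LLM's token embeddings


def make_projection(in_dim, out_dim, seed=0):
    # The only trainable component in the MusiLingo setup:
    # a single linear projection W of shape (out_dim, in_dim).
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)]
            for _ in range(out_dim)]


def project(frames, W):
    # frames: list of MERT feature vectors (each of length MERT_DIM)
    # returns one pseudo-token embedding (length LLM_DIM) per frame,
    # which would be prepended to the text prompt's token embeddings.
    return [[sum(w * x for w, x in zip(row, f)) for row in W]
            for f in frames]


W = make_projection(MERT_DIM, LLM_DIM)
audio_frames = [[0.5] * MERT_DIM for _ in range(4)]  # 4 dummy MERT frames
pseudo_tokens = project(audio_frames, W)
print(len(pseudo_tokens), len(pseudo_tokens[0]))  # 4 frames -> 4 pseudo-tokens of LLM_DIM
```

Because both MERT and the LLM stay frozen, only the weights of this projection are updated during caption pre-training and instruction fine-tuning, which keeps the trainable parameter count small.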
- Zihao Deng
- Yinghao Ma
- Yudong Liu
- Rongchen Guo
- Ge Zhang
- Wenhu Chen
- Wenhao Huang
- Emmanouil Benetos