MusiLingo: Bridging Music and Text with Pre-trained Language Models for Music Captioning and Query Response (2309.08730v3)

Published 15 Sep 2023 in eess.AS, cs.AI, cs.CL, cs.MM, and cs.SD

Abstract: LLMs have shown immense potential in multimodal applications, yet the convergence of textual and musical domains remains underexplored. To address this gap, we present MusiLingo, a novel system for music caption generation and music-related query responses. MusiLingo employs a single projection layer to align music representations from the frozen pre-trained music audio model MERT with a frozen LLM, bridging the gap between music audio and textual contexts. We train it on an extensive music caption dataset and fine-tune it with instructional data. Due to the scarcity of high-quality music Q&A datasets, we created the MusicInstruct (MI) dataset from captions in the MusicCaps dataset, tailored for open-ended music inquiries. Empirical evaluations demonstrate its competitive performance in generating music captions and composing music-related Q&A pairs. The dataset we introduce enables notable advancements over previous ones.
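
The alignment step described in the abstract, a single trainable projection between a frozen audio encoder and a frozen LLM, can be illustrated with a minimal pure-Python sketch. This is not the authors' implementation: the function names, the toy dimensions, the random initialisation, and the choice to place the projected music frames before the text embeddings are all assumptions made for illustration. In the real system the input frames would come from MERT and the output would feed the LLM's embedding space.

```python
import random


def make_projection(d_in, d_out, seed=0):
    """Create weights for one linear layer (hypothetical initialisation)."""
    rng = random.Random(seed)
    scale = 1.0 / d_in ** 0.5
    weight = [[rng.uniform(-scale, scale) for _ in range(d_in)]
              for _ in range(d_out)]
    bias = [0.0] * d_out
    return weight, bias


def project(frames, weight, bias):
    """Map each music frame (length d_in) to an LLM-sized vector (length d_out)."""
    return [
        [sum(w * x for w, x in zip(row, frame)) + b
         for row, b in zip(weight, bias)]
        for frame in frames
    ]


def build_llm_input(music_frames, text_embeddings, weight, bias):
    """Assumed layout: projected music frames followed by text token embeddings."""
    return project(music_frames, weight, bias) + text_embeddings


# Toy sizes chosen for readability; they are not the real model dimensions.
D_MUSIC, D_LLM = 8, 16
music = [[0.1 * i] * D_MUSIC for i in range(5)]   # stand-in for encoder output
text = [[0.0] * D_LLM for _ in range(3)]          # stand-in for token embeddings
weight, bias = make_projection(D_MUSIC, D_LLM)
sequence = build_llm_input(music, text, weight, bias)
print(len(sequence), len(sequence[0]))
```

Only `weight` and `bias` would be updated during training in a setup like this, while the encoder and the LLM stay frozen, which keeps the number of trainable parameters small.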

Authors
  1. Zihao Deng
  2. Yinghao Ma
  3. Yudong Liu
  4. Rongchen Guo
  5. Ge Zhang
  6. Wenhu Chen
  7. Wenhao Huang
  8. Emmanouil Benetos
