CMI-Bench: A Comprehensive Benchmark for Evaluating Music Instruction Following (2506.12285v1)

Published 14 Jun 2025 in eess.AS, cs.AI, cs.LG, and cs.SD

Abstract: Recent advances in audio-text LLMs have opened new possibilities for music understanding and generation. However, existing benchmarks are limited in scope, often relying on simplified tasks or multi-choice evaluations that fail to reflect the complexity of real-world music analysis. We reinterpret a broad range of traditional MIR annotations as instruction-following formats and introduce CMI-Bench, a comprehensive music instruction following benchmark designed to evaluate audio-text LLMs on a diverse set of music information retrieval (MIR) tasks. These include genre classification, emotion regression, emotion tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking: reflecting core challenges in MIR research. Unlike previous benchmarks, CMI-Bench adopts standardized evaluation metrics consistent with previous state-of-the-art MIR models, ensuring direct comparability with supervised approaches. We provide an evaluation toolkit supporting all open-source audio-textual LLMs, including LTU, Qwen-audio, SALMONN, MusiLingo, etc. Experiment results reveal significant performance gaps between LLMs and supervised models, along with their culture, chronological and gender bias, highlighting the potential and limitations of current models in addressing MIR tasks. CMI-Bench establishes a unified foundation for evaluating music instruction following, driving progress in music-aware LLMs.

Authors (5)
  1. Yinghao Ma (24 papers)
  2. Siyou Li (3 papers)
  3. Juntao Yu (13 papers)
  4. Emmanouil Benetos (89 papers)
  5. Akira Maezawa (4 papers)

Summary

Comprehensive Benchmark for Evaluating Music Instruction Following: CMI-Bench

The paper presents CMI-Bench, a benchmark designed to assess audio-text LLMs across a broad set of music information retrieval (MIR) tasks. Recent audio-text LLMs have shown promise for music understanding and generation, yet existing evaluation frameworks are inadequate: they frequently simplify tasks or rely on multiple-choice formats that fail to reflect the complexity of real-world music analysis. CMI-Bench addresses these limitations by framing a diverse range of MIR tasks as instruction-following problems and by adopting the standard evaluation metrics of the MIR literature, ensuring direct comparability with state-of-the-art supervised approaches.
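
To make the reframing concrete, the following is a minimal sketch of how a conventional MIR annotation (an audio clip plus a label) could be wrapped as an instruction-following example. The prompt wording and field names are illustrative assumptions, not the paper's actual schema.

```python
# Illustrative only: prompts and field names are assumptions,
# not CMI-Bench's actual data schema.
def to_instruction_example(audio_path: str, task: str, label: str) -> dict:
    """Wrap a (clip, annotation) pair as a prompt/response example for an audio-text LLM."""
    prompts = {
        "genre": "Listen to the clip and name its genre in a few words.",
        "key": "What is the musical key of this recording? Answer like 'C major'.",
        "beat": "List the beat positions of this clip in seconds, separated by commas.",
    }
    return {
        "audio": audio_path,
        "instruction": prompts[task],
        "response": label,  # the original MIR annotation becomes the target text
    }

example = to_instruction_example("clip_0001.wav", "genre", "bossa nova")
print(example["instruction"])
```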

Key Contributions and Design

The paper's primary contributions include recasting traditional MIR annotations as instruction-following tasks, thereby repurposing a rich variety of MIR datasets for evaluation. The benchmark spans genre classification, emotion regression and tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking. It is accompanied by a standardized evaluation toolkit that supports major open-source audio-text LLMs (including LTU, Qwen-Audio, SALMONN, and MusiLingo) and reports metrics consistent with prior MIR literature.
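
Because the benchmark reuses metrics from the supervised MIR literature, scores can be read directly against published systems. As an illustration, the snippet below computes two such standard metrics with the mir_eval library; whether CMI-Bench's toolkit calls these exact functions is an assumption, but the metrics themselves are conventional in MIR.

```python
# Standard MIR metrics via mir_eval; using these exact calls in CMI-Bench
# is an assumption, but the metrics are the field's conventional ones.
import numpy as np
import mir_eval

# Key detection: the weighted score gives partial credit for related keys
# (e.g. a perfect-fifth error scores 0.5 rather than 0).
key_score = mir_eval.key.weighted_score("C major", "G major")

# Beat tracking: F-measure with the conventional +/-70 ms tolerance window.
reference_beats = np.array([0.50, 1.00, 1.50, 2.00])
estimated_beats = np.array([0.52, 1.01, 1.48, 2.10])
beat_f = mir_eval.beat.f_measure(reference_beats, estimated_beats)

print(f"key weighted score: {key_score:.2f}, beat F-measure: {beat_f:.2f}")
```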

CMI-Bench diverges from previous benchmarks by adopting task-specific, open-ended evaluation: models produce free-form answers that are parsed and scored with each task's conventional metric, enabling rigorous comparison between LLMs and traditional supervised models. Additionally, the benchmark probes cultural, chronological, and gender biases in the evaluated models, offering insight into their limitations and into directions for improving music-aware LLMs.
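
Open-ended evaluation implies a post-processing step that maps free-text answers back onto each task's label space before metrics are computed. The sketch below shows one plausible rule for a classification task; the matching procedure and label set are assumptions, not the paper's actual implementation.

```python
# One plausible post-processing rule for open-ended answers;
# CMI-Bench's actual parsing may differ.
import re

GENRES = ["rock", "jazz", "classical", "hip hop", "electronic", "blues"]

def parse_genre(answer: str, label_set=GENRES) -> str | None:
    """Return the first known label mentioned in the model's free-text answer."""
    text = answer.lower()
    for label in label_set:
        if re.search(rf"\b{re.escape(label)}\b", text):
            return label
    return None  # unparseable answers are scored as incorrect

print(parse_genre("This sounds like a classic bebop jazz recording."))  # jazz
```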

Evaluation and Results

CMI-Bench employs a standardized set of tasks and metrics to assess model performance. The evaluation reveals significant performance gaps between LLMs and traditional supervised models, underscoring the current limitations of instruction-following systems on complex MIR tasks. The results indicate that, while LLMs show potential, their performance depends heavily on training-data overlap and prompt design. Sequential tasks with structured outputs, such as melody extraction and beat tracking, remain challenging for current audio-text LLMs, which have limited pretraining exposure to temporally dense annotations. Emotion regression likewise exposes substantial shortcomings in mapping continuous perceptual attributes of music through an instruction-following paradigm.
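
For sequential tasks such as beat tracking, a model's textual output must first be converted into the numeric form that standard MIR metrics expect, which is itself a source of error for instruction-following models. Below is a hedged sketch of such a conversion; the answer format is assumed, not taken from the paper.

```python
# Assumed answer format: the model lists beat times in seconds within a sentence.
import re
import numpy as np
import mir_eval

def parse_timestamps(answer: str) -> np.ndarray:
    """Extract all decimal numbers (interpreted as seconds) from a free-text answer."""
    values = [float(m) for m in re.findall(r"\d+(?:\.\d+)?", answer)]
    return np.array(sorted(values))

llm_answer = "The beats fall at roughly 0.52, 1.01, 1.48 and 2.10 seconds."
reference_beats = np.array([0.50, 1.00, 1.50, 2.00])
print(mir_eval.beat.f_measure(reference_beats, parse_timestamps(llm_answer)))
```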

Implications and Future Directions

CMI-Bench establishes a unified framework for assessing the music instruction-following capabilities of LLMs, paving the way for systematic progress in music-aware AI research. The results point to concrete needs: pretraining corpora that include timestamped audio annotations and culturally diverse music traditions, embedding-based evaluation strategies for multi-label tasks that balance computational efficiency and retrieval quality, and more equitable model performance across musical genres, instruments, and traditions.
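
As a speculative illustration of the embedding-based direction mentioned above, the sketch below scores predicted tags against reference tags by cosine similarity of sentence embeddings rather than exact string match. The model choice and scoring rule are assumptions, not choices made in the paper.

```python
# Speculative: the embedding model and scoring rule are assumptions.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")

def tag_similarity(predicted: list[str], reference: list[str]) -> float:
    """Mean best-match cosine similarity between predicted and reference tag sets."""
    pred_emb = model.encode(predicted, convert_to_tensor=True)
    ref_emb = model.encode(reference, convert_to_tensor=True)
    sims = util.cos_sim(pred_emb, ref_emb)        # shape: (n_predicted, n_reference)
    return float(sims.max(dim=1).values.mean())   # best reference match per predicted tag

print(tag_similarity(["guitar", "upbeat"], ["electric guitar", "energetic"]))
```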

In conclusion, CMI-Bench is a pivotal step toward bridging the gap between MIR and NLP paradigms within LLMs, advocating for tailored, inclusive approaches that embrace the intricacies of music-related tasks. It invites collaboration and innovation in creating LLMs that are robust, culturally sensitive, and capable of generalizing across the expansive domain of MIR.