Comprehensive Benchmark for Evaluating Music Instruction Following: CMI-Bench
The paper presents CMI-Bench, a benchmark designed to assess how well audio-text LLMs perform across a broad range of music information retrieval (MIR) tasks. Recent audio-text LLMs have shown promise for music understanding and generation, yet existing evaluation frameworks are often inadequate: they simplify tasks or rely on evaluation protocols too coarse for real-world music analysis. CMI-Bench addresses these limitations by reframing a diverse set of MIR tasks as instruction-following problems and scoring them with the standard evaluation metrics used in the MIR literature, keeping results directly comparable with state-of-the-art supervised approaches.
Key Contributions and Design
The paper's primary contributions include recasting traditional MIR annotations as instruction-following tasks, thereby making a rich variety of MIR datasets usable for training and evaluation. It also provides a standardized evaluation toolkit that supports the major open-source audio-text LLMs and reports metrics consistent with prior MIR literature across tasks including genre classification, emotion regression and tagging, instrument classification, pitch estimation, key detection, lyrics transcription, melody extraction, vocal technique recognition, instrument performance technique detection, music tagging, music captioning, and (down)beat tracking.
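To make the reformulation concrete, the sketch below shows one way a raw MIR annotation could be wrapped as an instruction-following example; the prompt wording and field names are illustrative assumptions, not CMI-Bench's actual data schema.

```python
# Hypothetical sketch: recasting a traditional MIR annotation as an
# instruction-following example. Prompt wording and field names are
# illustrative, not CMI-Bench's actual format.

PROMPTS = {
    "key_detection": "Listen to the audio and state its musical key (e.g., 'Eb major').",
    "genre_classification": "Listen to the audio and name its genre in one or two words.",
    "beat_tracking": "Listen to the audio and list the beat times in seconds.",
}

def to_instruction_example(audio_path: str, task: str, label: str) -> dict:
    """Wrap a raw (audio, label) MIR annotation as an instruction/answer pair."""
    return {
        "audio": audio_path,
        "instruction": PROMPTS[task],
        "answer": label,
    }

example = to_instruction_example("track_001.wav", "key_detection", "Eb major")
print(example["instruction"], "->", example["answer"])
```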
CMI-Bench diverges from previous benchmarks by adopting open-ended, task-specific evaluation strategies that allow rigorous comparison between LLMs and traditional supervised models. The benchmark also analyzes potential model biases, such as cultural and gender bias, offering insight into current limitations and future directions for music-aware LLMs.
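As a rough illustration of how open-ended answers can still be scored against traditional MIR metrics, the snippet below parses a free-text key-detection response into a canonical label and evaluates it with mir_eval's standard weighted key score; the regex heuristic and fallback are assumptions for illustration, not the benchmark's exact post-processing.

```python
import re
import mir_eval

def parse_key(response: str) -> str:
    """Extract a key label such as 'Eb major' from a free-text answer.
    A simple regex heuristic; real post-processing may be more involved."""
    match = re.search(r"\b([A-Ga-g][#b]?)\s*(major|minor)\b", response, re.IGNORECASE)
    if match is None:
        return "C major"  # naive fallback; a real pipeline might mark the item unparseable
    return f"{match.group(1).capitalize()} {match.group(2).lower()}"

reference = "Eb major"
llm_output = "The piece sounds like it is in E-flat major, i.e. Eb major."
estimate = parse_key(llm_output)

# mir_eval's weighted key score: 1.0 for an exact match, partial credit
# for fifth, relative, and parallel keys, 0.0 otherwise.
print(mir_eval.key.weighted_score(reference, estimate))  # -> 1.0
```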
Evaluation and Results
CMI-Bench employs a standardized set of tasks and metrics to assess model performance. The evaluation reveals significant performance gaps between audio-text LLMs and traditional supervised models, underscoring how far current LLMs remain from solving complex MIR tasks. While LLMs show potential, their performance depends heavily on overlap with training data and on prompt design. Sequential tasks with structured, timestamped outputs, such as melody extraction and beat tracking, remain especially challenging because current audio-text LLMs have limited pretraining exposure to temporally dense annotations. Emotion regression results likewise reveal substantial shortcomings in mapping music to continuous perceptual attributes under an instruction-following paradigm.
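For the sequence-level tasks, evaluation requires recovering timestamped predictions from free text before applying standard MIR metrics. The sketch below, an illustrative assumption rather than the benchmark's exact pipeline, extracts beat times from a model response and scores them with mir_eval's beat F-measure.

```python
import re
import numpy as np
import mir_eval

def parse_timestamps(response: str) -> np.ndarray:
    """Pull floating-point times (in seconds) out of a free-text answer."""
    return np.array([float(t) for t in re.findall(r"\d+\.\d+", response)])

reference_beats = np.array([0.50, 1.00, 1.50, 2.00, 2.50, 3.00])
llm_output = "The beats fall at 0.48, 1.02, 1.49, 2.01, 2.55 and 3.10 seconds."
estimated_beats = parse_timestamps(llm_output)

# Standard beat-tracking F-measure: an estimated beat counts as correct
# if it lies within +/-70 ms of a reference beat (mir_eval's default tolerance).
f = mir_eval.beat.f_measure(reference_beats, estimated_beats)
print(round(f, 3))
```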
Implications and Future Directions
CMI-Bench establishes a unified framework for assessing the music instruction-following capabilities of LLMs, enabling systematic progress in music-aware AI research. The results highlight the need for improved pretraining corpora that include timestamped audio annotations and culturally diverse music traditions. The benchmark also motivates further exploration of embedding-based evaluation strategies for multi-label tasks, which must balance computational efficiency against retrieval quality. Ensuring equitable model performance across diverse musical genres and instruments remains a critical area for development.
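One possible form such embedding-based evaluation could take is sketched below: predicted and reference tags are embedded with an off-the-shelf sentence encoder and compared by cosine similarity, giving partial credit to near-synonymous labels that exact string matching would score as wrong. The encoder choice and the max-similarity scoring rule are assumptions for illustration, not the benchmark's prescribed method.

```python
from sentence_transformers import SentenceTransformer, util

# Hypothetical sketch of embedding-based scoring for open-vocabulary tagging.
# The encoder ("all-MiniLM-L6-v2") and scoring rule are illustrative assumptions.
model = SentenceTransformer("all-MiniLM-L6-v2")

reference_tags = ["electric guitar", "upbeat", "rock"]
predicted_tags = ["guitar", "energetic", "rock music"]

ref_emb = model.encode(reference_tags, convert_to_tensor=True)
pred_emb = model.encode(predicted_tags, convert_to_tensor=True)

# For each reference tag, take its best match among the predictions, then
# average; exact string matching would give this example a score of zero.
similarity = util.cos_sim(ref_emb, pred_emb)      # (n_ref, n_pred) matrix
score = similarity.max(dim=1).values.mean().item()
print(round(score, 3))
```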
In conclusion, CMI-Bench takes a pivotal step toward bridging MIR and NLP paradigms in LLM research, advocating tailored, inclusive approaches that embrace the intricacies of music-related tasks. It invites collaboration and innovation in building LLMs that are robust, culturally sensitive, and capable of generalizing across the expansive domain of MIR.