CMI-Bench (Benchmark for Music Information Retrieval)
CMI-Bench is a comprehensive benchmark designed to evaluate the instruction-following capabilities of audio-text LLMs across a diverse suite of music information retrieval (MIR) tasks. It addresses critical gaps in prior MIR benchmarks by providing a broad, instruction-oriented task set and unified evaluation protocols, and by enabling transparent comparison between emerging music-aware LLMs and traditional state-of-the-art MIR systems.
1. Comprehensive Task Coverage and Instruction Reformulation
CMI-Bench integrates 14 core MIR tasks, spanning 20 datasets, to reflect the multifaceted challenges present in music analysis and understanding:
- Genre Classification: Identifying the genre of music recordings.
- Emotion Regression & Tagging: Predicting both continuous arousal/valence values and multi-label emotion tags.
- Instrument Classification & Tagging: Detecting present instruments, both as single-label and multi-label tasks.
- Pitch Estimation: Assigning MIDI-scale pitches to segments in music audio.
- Key Detection: Determining the musical key (e.g., C major, D minor).
- Lyrics Transcription: Performing singing audio-to-text recognition.
- Melody Extraction: Identifying main melodic lines, particularly vocals, from polyphonic mixes.
- Vocal Technique Recognition: Classifying singing techniques such as vibrato, chest voice, or head voice.
- Instrument Performance Technique Detection: Detecting advanced expressive techniques, e.g., Guzheng string playing methods.
- Music Tagging: Assigning multiple descriptive tags (mood, genre, instruments, etc.).
- Music Captioning: Producing textual descriptions of musical excerpts.
- Beat & Downbeat Tracking: Predicting rhythmic structure as time-stamped beat and downbeat positions.
- Other tasks: Multi-f0 estimation (polyphonic pitch tracking) and further specialized MIR tasks.
A distinguishing feature of CMI-Bench is the systematic reinterpretation of MIR annotations into explicit instruction-following formats. Each task is reformulated into natural-language prompts suitable for LLMs, mirroring those found in NLP benchmarks such as FLAN or Super-NaturalInstructions. For sequential or temporal prediction tasks (e.g., beat tracking), prompts specify the expected output format and include parsing examples, supporting both zero-shot and few-shot settings.
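As a concrete illustration, below is a minimal sketch of this kind of reformulation for beat tracking; the field names, prompt wording, and file name are illustrative assumptions, not the benchmark's actual templates.

```python
# Minimal sketch (illustrative only): recasting a beat-tracking annotation as an
# instruction-following example. Field names and prompt wording are assumptions.

def beats_to_instruction(beat_times, audio_id):
    """Turn a list of beat timestamps (in seconds) into a prompt/target pair."""
    prompt = (
        "Listen to the music clip and list every beat onset in seconds, as a "
        "comma-separated list of numbers rounded to two decimals, "
        "e.g. '0.52, 1.04, 1.56'. Output only the list."
    )
    target = ", ".join(f"{t:.2f}" for t in beat_times)
    return {"audio": audio_id, "instruction": prompt, "output": target}

example = beats_to_instruction([0.52, 1.04, 1.57, 2.09], "example_clip.wav")
# example["output"] == "0.52, 1.04, 1.57, 2.09"
```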
2. Standardized, MIR-Consistent Evaluation Metrics
To ensure reliable and fair comparison with MIR literature, CMI-Bench adopts established, task-appropriate metrics from the field. Key metric-task pairings include:
| Task | Metric(s) | Details |
|---|---|---|
| Multi-class Classification | Accuracy | Exact match; case- and punctuation-insensitive |
| Multi-label Tagging | ROC-AUC, PR-AUC | Using the BGE encoder for semantic retrieval |
| Key Detection | Gmean (mir_eval.key) | Allows musically plausible "near-misses" |
| Emotion Regression | | After z-score normalization, per-song basis |
| Beat/Downbeat Tracking | F-measure (20 ms tolerance) | mir_eval.beat; tolerant to small timing errors |
| Melody Extraction | Melody Accuracy (≤50 cents) | mir_eval.melody standard |
| Captioning | BLEU, METEOR, ROUGE, BERTScore | Standard NLP text-generation metrics |
| Lyrics Transcription | WER, CER | Word/character error rates |
| Instrument Technique Detection | Macro-F1, Micro-F1 | Frame-level, e.g., for Guzheng |
This approach preserves direct comparability with supervised MIR systems and avoids the bias introduced by simplified or nonstandard metrics, an issue in prior multiple-choice or short-answer LLM-oriented MIR benchmarks.
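The same mir_eval routines used to score supervised MIR systems can be called directly on parsed model outputs. The sketch below uses toy values and assumes the key metric is computed with mir_eval's weighted key score and that timestamps and pitch tracks have already been parsed into NumPy arrays.

```python
# Minimal sketch with toy values: scoring parsed model outputs with mir_eval.
import numpy as np
import mir_eval

# Key detection: the weighted score gives partial credit for related keys
# (e.g., relative major/minor), matching the "near-miss" behaviour noted above.
key_score = mir_eval.key.weighted_score("C major", "A minor")  # 0.3 (relative keys)

# Beat tracking: F-measure within a fixed tolerance window around reference beats.
ref_beats = np.array([0.52, 1.04, 1.57, 2.09])
est_beats = np.array([0.53, 1.05, 1.58, 2.11])
beat_f = mir_eval.beat.f_measure(ref_beats, est_beats)

# Melody extraction: frame-level pitch accuracy within 50 cents of the reference.
times = np.arange(0.0, 2.0, 0.01)
ref_freq = np.full_like(times, 220.0)   # reference melody (Hz)
est_freq = np.full_like(times, 221.0)   # estimated melody (Hz)
melody = mir_eval.melody.evaluate(times, ref_freq, times, est_freq)

print(key_score, beat_f, melody["Raw Pitch Accuracy"])
```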
3. Supported Model Families and Evaluation Ecosystem
CMI-Bench supports a wide array of open-source audio-textual LLMs, providing an inclusive testbed for the current landscape of music-aware models:
- Qwen2-Audio and Qwen-Audio (large-scale Chinese/English audio-text LLMs)
- SALMONN-Audio and MU-LLaMA
- MusiLingo (music captioning and tagging)
- LTU (Listen, Think, and Understand) & LTU-AS
- Audio-Flamingo, GAMA, GAMA-IT, Pengi
The CMI-Bench toolkit automates prompt generation, response normalization, parsing, and scoring. It processes both classic MIR annotation formats (e.g., CSV) and raw LLM string responses, mapping both onto evaluation with mir_eval and related tools. For multi-label tasks, it leverages BGE embedding-based similarity, so that free-form LLM outputs are matched by semantic intent rather than exact strings.
This integrated toolkit enables rigorous, repeatable evaluation and batch-processed model comparison across all core benchmark tasks.
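As one illustration of embedding-based multi-label scoring, the following is a minimal sketch rather than the official toolkit code; the BGE checkpoint name (BAAI/bge-base-en-v1.5), the toy tag vocabulary, and the use of raw cosine similarity as the ranking score are assumptions made for the example.

```python
# Minimal sketch (not the official toolkit): embedding-based multi-label scoring.
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics import average_precision_score, roc_auc_score

TAGS = ["rock", "jazz", "piano", "guitar", "happy", "sad"]
encoder = SentenceTransformer("BAAI/bge-base-en-v1.5")
tag_emb = encoder.encode(TAGS, normalize_embeddings=True)

def tag_scores(llm_response: str) -> np.ndarray:
    """Cosine similarity between a free-form response and every candidate tag."""
    resp_emb = encoder.encode([llm_response], normalize_embeddings=True)
    return (resp_emb @ tag_emb.T).ravel()

responses = [
    "An upbeat rock track driven by distorted electric guitar.",
    "A mellow solo piano jazz ballad with a melancholic feel.",
]
y_true = np.array([[1, 0, 0, 1, 1, 0],   # ground-truth tag vectors per clip
                   [0, 1, 1, 0, 0, 1]])
y_score = np.vstack([tag_scores(r) for r in responses])

print(roc_auc_score(y_true, y_score, average="macro"),
      average_precision_score(y_true, y_score, average="macro"))
```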
4. Empirical Performance Analysis and Insights
Experiments conducted with CMI-Bench reveal several impactful findings:
- Significant Performance Gaps: All tested LLMs substantially underperform SOTA MIR models on classification, regression, tagging, and sequence prediction tasks. For example, key detection Gmean for the best LLM (8%) is notably lower than traditional MIR baselines (74%).
- Captioning Exception: Some LLMs approach MIR baselines on captioning, suggesting their pretraining imparts strength in open-text generation, not structured music analysis.
- Task- and Domain-Specific Biases: LLMs exhibit higher scores on tasks or datasets that closely align with their pretraining corpora, indicating a lack of generalization beyond observed data.
- Structural Task Weakness: Sequential tasks (melody extraction, beat tracking, performance-technique detection) are especially challenging, often yielding format errors or poor alignment with the ground truth.
- Emotion Regression Limitation: LLMs fail to consistently predict continuous emotional values, performing worse than mean-value baselines.
- Format Sensitivity and Hallucination: Output correctness is highly sensitive to prompt and output formatting. Models frequently deviate from the requested input/output conventions, which lowers metric scores, and they are prone to hallucinated content (see the parsing sketch at the end of this section).
- Cultural, Chronological, and Gender Bias: Analysis shows that models perform better on mainstream Western genres and instruments, with pronounced deficiencies on world-music material (e.g., bossa nova, the Guzheng) and on non-binary vocal/gender classes. Some models also show mismatches between class ranking and calibration (e.g., ROC-AUC vs. PR-AUC discrepancies).
These insights highlight current LLMs' limitations as music generalists, especially outside their training distributions or in tasks that require sequential, structural, or temporally aligned output.
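To make the format-sensitivity point concrete, here is a minimal defensive-parsing sketch for timestamp-list outputs; the comma-separated-seconds convention mirrors the hypothetical prompt from Section 1, not a documented CMI-Bench rule.

```python
# Minimal defensive-parsing sketch: extract beat timestamps from a free-form
# answer, dropping surrounding prose, units, and out-of-range values.
import re

def parse_timestamps(response: str, clip_duration: float) -> list:
    """Pull plausible onset times (seconds) out of a free-form model response."""
    values = [float(m) for m in re.findall(r"\d+(?:\.\d+)?", response)]
    return sorted(v for v in values if 0.0 <= v <= clip_duration)

parse_timestamps("Sure! The beats are at 0.52s, 1.04s and 1.57 seconds.", 30.0)
# -> [0.52, 1.04, 1.57]
```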
5. Methodological Advances and Benchmarking Principles
CMI-Bench adopts several methodological innovations over prior MIR benchmarks:
- Instruction-Following Standardization: All annotations are recast into prominent NLP instruction formats and paired with standardized evaluation scripts, supporting instruction tuning of LLMs and reproducible evaluation.
- Cross-Task, Cross-Dataset Unity: The unified pipeline promotes joint benchmarking for research in both MIR (audio-focused) and LLM/NLP (instruction-following) communities.
- Direct SOTA Comparability: By enforcing MIR-standard metrics and protocols, the benchmark avoids the optimistic bias introduced by lenient answer matching or by overfitting to instruction-targeted test cases.
- Bias Quantification: Cultural, temporal, and vocal/gender bias is directly measured and reported on subsets of the benchmark.
- Toolkit and Open Sourcing: Accompanied by code and dataset release, the benchmark aims for transparency and extensibility.
6. Future Directions and Community Impact
Key avenues for advancing CMI-Bench and music-aware LLM research include:
- Pretraining on Diverse and Timestamped Corpora: Expanding LLM training data to cover global genres/instruments, temporal labels, and more complex MIR tasks.
- Unified Output Schema and Robust Parsing: Developing models and prompts that generalize robustly across output formats, especially for sequence- and structure-heavy task categories.
- Bias Mitigation and Coverage Expansion: Curating datasets and training strategies to explicitly target underrepresented culture/gender categories.
- Sequential Modeling Enhancements: Introducing or adapting architectures with stronger temporal inductive bias for beat, melody, and performance technique tasks.
- Extension Beyond Open Models: Enabling evaluation of proprietary LLMs (e.g., GPT-4o, Gemini) where feasible, and expanding the task set (e.g., source separation, chord tracking).
CMI-Bench establishes a unifying reference point for reproducible, interpretable, and comprehensive evaluation of music instruction-following LLMs, fostering progress toward unbiased, generalist, and practically useful music-aware LLMs.