
MuseBench: Multimodal Benchmark Suite

Updated 24 January 2026
  • MuseBench is a family of benchmarking resources that evaluates qualitative text analysis and structured musical understanding across text, image, and audio modalities.
  • It features task-specific evaluations such as codebook generation for text and MCQ-based assessments for music theory and sheet music interpretation.
  • The resource leverages diverse datasets and rigorous metrics, including inter-rater reliability and Levenshtein distance, to provide detailed error analysis and performance comparison.

MuseBench is a family of benchmarking resources and evaluation suites for probing qualitative reasoning and multimodal understanding in both textual and musical domains. The term “MuseBench” encompasses distinct benchmarks introduced for two primary domains: (1) qualitative text analysis, exemplified in human-AI blended coding tasks (Matveyenko et al., 14 Oct 2025), and (2) structured musical understanding, spanning text, score images, and audio (Zhao et al., 17 Jan 2026). Additionally, it should not be conflated with the separate “MUSE Benchmark” for audio LLMs (Carone et al., 21 Oct 2025), though the naming reflects a broader trend toward multi-domain benchmarking in research.

1. Motivation and Design Principles

MuseBench addresses recognized gaps in the evaluation of computational models for both qualitative textual analysis and interactive music understanding. In qualitative research, scaling thematic coding across heterogeneous corpora typically suffers from coder fatigue, interpretive drift, and reproducibility deficits; existing computational methods lack the nuance required for domain fidelity (Matveyenko et al., 14 Oct 2025). For music understanding, state-of-the-art multimodal LLMs display proficiency in general text–vision tasks but fail to ground reasoning in symbolic score notations and performance-level audio (Zhao et al., 17 Jan 2026).

The principal design goals of MuseBench benchmarks include:

  • Unified, task-oriented evaluation spanning multiple modalities relevant to the domain (text, image, audio).
  • Focus on high-level, interactive, and agent-style reasoning, eschewing isolated prediction tasks.
  • Balanced task coverage and difficulty, ensuring representative sampling across domains and modalities.
  • Robust error analysis to guide system improvements and account for human–AI disagreement.

2. Task Taxonomy and Modalities

MuseBench for qualitative research consists of two primary task types:

  • Codebook generation: Steerable, hierarchical code construction akin to topic modeling.
  • Code application: Annotation with binary or multi-code labeling over text excerpts (Matveyenko et al., 14 Oct 2025).

The music-centric MuseBench encompasses three top-level dimensions:

  • Music Theory Understanding (Text): MCQ-based analysis of abstract theoretical concepts (intervals, keys, forms).
  • Sheet Music Understanding (Image): MCQ information extraction and symbolic interpretation from piano score snippets.
  • Performance Audio Analysis (Audio): True/False evaluation and consistency checking of performance excerpts (Zhao et al., 17 Jan 2026).

Each task dimension contains multiple sub-tasks, yielding 28 task types and 2,052 expert-verified QA pairs for the music benchmark.

3. Datasets, Sources, and Preprocessing

Textual Qualitative MuseBench

A diverse set of eleven publicly available, human-coded datasets forms the backbone of MuseBench for qualitative research. Attributes collected per dataset include domain, document count, total words, unique codes, and a dataset quality index (1–10 scale). Representative datasets:

| Name | Domain | Docs | Words | Codes | Quality |
|---|---|---|---|---|---|
| Reuters-21578 | News | 18,918 | 2.45M | 6 | 10/10 |
| Medical Abstracts | Medical | 11,550 | 2.07M | 5 | 7/10 |
| arXiv Categories | Academic Abstracts | 163,168 | 24.1M | 9 | 9/10 |
| Reddit Posts | Social Media | 200,000 | 5.0M | 5 | 10/10 |
| Hate-Speech Tweets | Social Media | 24,783 | 350K | 6 | 6/10 |

Dataset selection ensures coverage of news, medical, legal, online discussion, and social media, with annotations evaluated for consistency and definition completeness (Matveyenko et al., 14 Oct 2025).

Music Multimodal MuseBench

MuseBench for music blends public-domain piano repertoire (image and audio), paired symbolic representations (ABC notation, MusicXML), and curated, expert-aligned annotations:

  • Sheet-music images sourced from MuseScore (~300), IMSLP (~200), and Mutopia/Gutenberg (~100).
  • Performance audio: 513 CC-licensed or public-domain recordings, denoised and standardized to 16 kHz.
  • Symbolic extraction: Both ABC and MusicXML representations linked at bar level for granular annotation.
  • Annotation quality: Technical difficulty and expressive features hand-validated, inter-annotator reliability κ=0.87 (Zhao et al., 17 Jan 2026).

The final benchmark aggregates 591 sheet images, 513 audio excerpts, and 2,052 QA pairs.

4. Evaluation Metrics and Protocols

For textual MuseBench, inter-rater reliability between Muse (the blended human–AI system) and human coders on well-specified codes is reported as Cohen’s κ = 0.71, indicating substantial agreement (Matveyenko et al., 14 Oct 2025).
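As a point of reference, Cohen's κ for two raters over the same set of items can be computed from the observed and chance-expected agreement. The sketch below is illustrative only; the label values are invented and not drawn from the benchmark:

```python
from collections import Counter

def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters labeling the same items."""
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    # Observed agreement: fraction of items where the raters agree.
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Expected agreement under independent per-rater label marginals.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    labels = set(counts_a) | set(counts_b)
    p_e = sum((counts_a[l] / n) * (counts_b[l] / n) for l in labels)
    return (p_o - p_e) / (1 - p_e)
```

Values of κ between 0.61 and 0.80 are conventionally read as "substantial agreement," which is the band the reported κ = 0.71 falls into.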

Music MuseBench employs:

  • QA Accuracy: Proportion of correct answers across all multiple-choice (MCQ, chance 25%) and true/false (T/F, chance 50%) questions.
  • Closed-set Image-to-ABC (OMR): Levenshtein distance between generated and ground-truth note sequences.

$$
D[i][j] = \min\bigl\{\, D[i-1][j] + 1,\; D[i][j-1] + 1,\; D[i-1][j-1] + \text{cost} \,\bigr\}
$$

where $\text{cost} = 0$ if $R[i-1] = A[j-1]$ and $1$ otherwise, with $R$ the ground-truth note sequence and $A$ the generated sequence.
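The recurrence is the standard edit-distance dynamic program. A minimal sketch over note-token sequences (the token format is illustrative, not the benchmark's exact ABC tokenization):

```python
def levenshtein(ref, hyp):
    """Edit distance between a reference and a generated note sequence,
    filling the DP table D row by row per the recurrence above."""
    m, n = len(ref), len(hyp)
    D = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        D[i][0] = i  # i deletions to reach an empty hypothesis
    for j in range(n + 1):
        D[0][j] = j  # j insertions from an empty reference
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            D[i][j] = min(D[i - 1][j] + 1,          # deletion
                          D[i][j - 1] + 1,          # insertion
                          D[i - 1][j - 1] + cost)   # match / substitution
    return D[m][n]
```

For example, `levenshtein("C4 E4 G4".split(), "C4 F4 G4".split())` yields 1 (one substituted note).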

  • Open-set Score Understanding: Semantic similarity (LSA cosine), ROUGE-1, ROUGE-L, and METEOR metrics.
  • Audio Alignment: Onset and frame F1, transposition-invariant matching, and custom JSON metrics (e.g., eva_note, eva_speed).
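Onset F1 can be sketched as precision/recall over one-to-one matches between estimated and reference onset times within a tolerance window. The 50 ms window below is a common convention in onset-evaluation tooling (e.g. mir_eval), not a value stated in the source:

```python
def onset_f1(ref_onsets, est_onsets, tol=0.05):
    """F1 over greedy one-to-one onset matches within +/- tol seconds."""
    if not ref_onsets or not est_onsets:
        return 0.0
    ref = sorted(ref_onsets)
    used = [False] * len(ref)
    matched = 0
    for t in sorted(est_onsets):
        # Match to the nearest unused reference onset within tolerance.
        best, best_d = None, tol
        for i, r in enumerate(ref):
            d = abs(r - t)
            if not used[i] and d <= best_d:
                best, best_d = i, d
        if best is not None:
            used[best] = True
            matched += 1
    if matched == 0:
        return 0.0
    precision = matched / len(est_onsets)
    recall = matched / len(ref_onsets)
    return 2 * precision * recall / (precision + recall)
```

Frame-level F1 follows the same precision/recall construction over active time frames rather than onset events.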

All evaluations are performed zero-shot, without fine-tuning, with standardized expert system prompts for each modality.

5. Comparative Performance and Results

Quantitative benchmarks reveal distinct modality-specific strengths and weaknesses:

| Modality | Model | Accuracy (%) |
|---|---|---|
| Text | GPT-4.1 | 86.7 |
| Text | Gemini2.5-Pro | 83.9 |
| Text | Random | 25.0 |
| Audio | MuseAgent (w/ GPT-4.1) | 79.1 |
| Audio | GPT-4o | 55.9 |
| Audio | Gemini2.5-Pro | 53.1 |
| Image | MuseAgent (w/ GPT-4.1) | 74.1 |
| Image | NotaGPT-7B | 68.1 |
| Image | GPT-4.1 | 66.1 |
| Image | Random | 25.0 |

MuseAgent achieves substantial improvements in image and audio modalities compared to raw multimodal LLMs, approaching general LLM upper bounds on textual tasks (Zhao et al., 17 Jan 2026).

Closed-set OMR task distances:

| Model | Levenshtein Distance |
|---|---|
| LLaVA-13B | 147.47 |
| NotaGPT-7B | 59.47 |
| M-OMR (ours) | 18.39 |

Open-set score understanding (aggregate semantic metrics):

| Model | LSA | ROUGE-1 | ROUGE-L | METEOR | Avg |
|---|---|---|---|---|---|
| GPT-4o | 15.92 | 18.27 | 11.35 | 20.26 | 16.45 |
| Gemini-pro-vision | 15.88 | 22.21 | 15.09 | 20.31 | 18.37 |
| MuseAgent (w/ M-OMR) | 15.75 | 24.92 | 15.76 | 20.17 | 19.15 |

A plausible implication is that structured grounding—via symbolic representation alignment and bar-level annotation—offers performance gains unattainable by pure end-to-end multimodal LLMs.

6. Critical Analysis and Implications

MuseBench, in both textual and music modalities, surfaces fundamental limits of current AI and LLM systems. In qualitative research, the AI-powered Muse system demonstrates substantial inter-rater reliability and capacity for bias correction when compared with humans, but challenges persist for ambiguous or underspecified codes (Matveyenko et al., 14 Oct 2025).

For music, the performance gap between generalist multimodal LLMs and agentic, perceptually grounded systems (e.g., MuseAgent) is pronounced outside text domain tasks. The structured approach—combining Optical Music Recognition (M-OMR), Automatic Music Transcription (AMT), and retrieval-augmented reasoning—enables fine-grained, interactive responses crucial for robust musical understanding (Zhao et al., 17 Jan 2026). This suggests that future multimodal models in structured domains should prioritize domain-specific tools and interactive workflows over purely classification-based strategies.

From the music MuseBench, an important finding is that general LLMs perform near chance on audio tasks and lag well behind on image tasks unless augmented with tailored recognition and alignment modules, whereas MuseAgent attains near-human accuracy. OMR and AMT evaluations highlight the need for symbolically aware architectures for score and performance audio, rather than reliance on visual or acoustic texture cues alone.

7. Future Directions

MuseBench establishes a precedent for unified multimodal benchmarking in structured domains. Key anticipated directions include:

  • Extension to additional modalities (e.g., video for performance analysis, spoken narratives for qualitative research).
  • Integration of more sophisticated annotation schemes, including hierarchical and relational coding frameworks.
  • Development of agentic, retrieval-augmented systems equipped with structured priors (music theory, thematic hierarchies) (Zhao et al., 17 Jan 2026).
  • Advancement of metrics and interpretability tools tailored to the error modes revealed in MuseBench evaluations.
  • Benchmarking longitudinal task performance to assess learning and adaptation beyond zero-shot settings.

A plausible implication is that the adoption of MuseBench-like benchmarks in other fields—such as computational humanities, clinical text mining, and science-of-music—will accelerate progress toward robust, human-aligned multimodal AI systems.
