
WildScore: Music-in-the-Wild Dataset

Updated 3 December 2025
  • WildScore is a large-scale dataset created from Reddit’s r/musictheory posts, featuring 807 image-based score excerpts paired with advanced music theory questions.
  • A custom YOLOv11 detector filters posts for valid notation, and GPT-4.1-mini transforms the accompanying discussion threads into exam-style multiple-choice items.
  • Benchmarking across seven state-of-the-art MLLMs shows that adding the score image improves accuracy for the strongest models, while nuanced symbolic music interpretation remains challenging.

The Music-in-the-Wild Dataset (“WildScore”) constitutes the first large-scale, real-world benchmark for evaluating multimodal LLMs (MLLMs) in symbolic music reasoning using raw score images and authentic user-generated queries. Sourced from Reddit’s r/musictheory community (2012–2022), it contains 807 multiple-choice items pairing scanned or screenshot-based music notation excerpts with complex musicological questions, covering diverse composers, styles, and historical periods. The benchmark systematically assesses both perceptual and inferential model competences in real-world music analysis, leveraging a hierarchical taxonomy of music-theory concepts and curated multicategory evaluation protocols (Mundada et al., 5 Sep 2025).

1. Dataset Coverage and Representational Scope

Music-in-the-Wild comprises 807 MCQ instances, each binding a genuine score image with a music-theoretical question and four answer choices. Its primary data sources are scanned notation images embedded in r/musictheory threads, encompassing repertoire from Baroque (e.g., Bach) and Classical (e.g., Mozart) to twentieth-century and modern‐rock contexts (e.g., Metallica). The exact distributions of composers or periods are not specified, but the collection is explicitly diverse, spanning both canonical art-music and popular genres.

Data sources include rasterized scores from Reddit posts (scanned excerpts, screenshots, or PDF-derived images). No machine-readable symbolic formats (MusicXML or similar) are used; the benchmark is intended to test models' direct symbolic recognition and interpretation from images.

2. Data Collection and Annotation Workflow

Data gathering involved crawling all 2012–2022 r/musictheory posts containing both an embedded score image and at least three top-level comments, yielding an initial pool of approximately 4,000 candidates. A custom YOLOv11-based detector (fine-tuned on 215 hand-annotated examples) filtered valid notation content, resulting in 807 items after discarding overly verbose posts (>200 words) and posts lacking discursive engagement (<3 comments).
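
The filtering stage can be illustrated with a short sketch. The field names, the detector weights path, and the use of the `ultralytics` package as the YOLOv11 backend are assumptions for illustration; the paper does not release this code.

```python
# Illustrative post-filtering sketch (assumed field names and detector backend;
# not the authors' released pipeline).
from dataclasses import dataclass, field
from ultralytics import YOLO  # assumed backend for the fine-tuned YOLOv11 detector

detector = YOLO("notation_yolo11.pt")  # hypothetical path to the fine-tuned weights

@dataclass
class RedditPost:
    body: str
    top_level_comments: list[str] = field(default_factory=list)
    image_path: str = ""

def keep_post(post: RedditPost, max_words: int = 200, min_comments: int = 3) -> bool:
    """Apply the WildScore filtering criteria: short post body, at least three
    top-level comments, and at least one detected notation region in the image."""
    if len(post.body.split()) > max_words:            # discard overly verbose posts
        return False
    if len(post.top_level_comments) < min_comments:   # require discursive engagement
        return False
    result = detector(post.image_path, verbose=False)[0]
    return len(result.boxes) > 0                      # valid notation content detected
```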

Annotation proceeded through a multiphase pipeline:

  • Question transformation: GPT-4.1-mini reformulated each Reddit thread into an exam-style MCQ.
  • Ground truth extraction: Preference votes (upvotes minus downvotes) determined the “human-preferred” answer for each item; ties were adjudicated by GPT-4.1-mini (“language-model preference”). 549 items follow human preference, 258 follow model preference; a minimal selection sketch follows this list.
  • Distractor synthesis: GPT-4.1-mini generated plausible distractors—subtly differing from the correct answer.
  • Quality control: Three expert annotators reviewed each item for musical correctness, clarity, and appropriateness.
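
The vote-based ground-truth selection referenced above can be sketched as follows; the comment fields and the `adjudicate` callback (standing in for the GPT-4.1-mini tie-breaker) are assumptions rather than the released code.

```python
# Minimal sketch of ground-truth selection by net vote score with an LLM tie-breaker.
from typing import Callable

Comment = dict  # assumed shape: {"text": str, "upvotes": int, "downvotes": int}

def select_answer(comments: list[Comment],
                  adjudicate: Callable[[list[Comment]], Comment]) -> tuple[Comment, str]:
    """Return the preferred comment and its label source."""
    def score(c: Comment) -> int:
        return c["upvotes"] - c["downvotes"]

    top = max(score(c) for c in comments)
    tied = [c for c in comments if score(c) == top]
    if len(tied) == 1:
        return tied[0], "human-preference"        # 549 WildScore items
    return adjudicate(tied), "model-preference"   # 258 items resolved by GPT-4.1-mini
```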

3. Taxonomy and Ontological Structure

WildScore interrogates five high-level musicological domains, each divided into fine-grained subcategories:

| High-Level Category | Fine-Grained Subcategories |
| --- | --- |
| Harmony & Tonality | Chord Progressions, Modulation Patterns, Modal Mixture |
| Rhythm & Meter | Metric Structure, Rhythmic Patterns |
| Texture | Homophonic, Polyphonic, Orchestral Texture |
| Expression & Performance | Dynamics & Articulation, Technique & Interpretation |
| Form | Phrase Structure, Contrapuntal Forms |

Definitions are given informally: harmony concerns chord progression and simultaneity, tonality orders pitches hierarchically, rhythm organizes notes/silences temporally, texture refers to inter-layer interaction, expression comprises dynamics/articulation/phrasing/tempo and their performance realization, and form captures macrostructure.
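
For downstream analysis, the taxonomy is convenient to carry around as a small mapping. The structure below simply mirrors the table above and is an implementation convenience, not an artifact shipped with the dataset.

```python
# Two-level WildScore taxonomy as a plain mapping (mirrors the table above).
TAXONOMY: dict[str, list[str]] = {
    "Harmony & Tonality": ["Chord Progressions", "Modulation Patterns", "Modal Mixture"],
    "Rhythm & Meter": ["Metric Structure", "Rhythmic Patterns"],
    "Texture": ["Homophonic", "Polyphonic", "Orchestral Texture"],
    "Expression & Performance": ["Dynamics & Articulation", "Technique & Interpretation"],
    "Form": ["Phrase Structure", "Contrapuntal Forms"],
}
```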

4. Question Format and Dataset Organization

All benchmark items are multiple-choice (A–D) with three tailored distractors. Distractors are carefully constructed—e.g., “passing tone” vs. “neighbor tone,” “two dotted quarters” vs. “three quarters”—to probe subtle reasoning and perceptual distinctions in score analysis. No train/validation/test splits are standardized; all 807 items serve for direct benchmarking.

Exemplars:

  • Harmony & Tonality: A Mozart excerpt questions the functional role of an A♯ over a G major chord, with answer choices distinguishing passing tone, suspension, modulation, and dominance.
  • Rhythm & Meter: A Metallica transcription presents complex tuplets; options dissect counting strategies reflecting differentiated metric interpretation.
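
Items such as the exemplars above can be represented with a simple record; the field names below are assumptions for illustration, since the paper does not publish a formal schema.

```python
# Illustrative per-item schema (assumed field names, not an official format).
from dataclasses import dataclass

@dataclass
class WildScoreItem:
    image_path: str          # rasterized score excerpt from the Reddit post
    question: str            # exam-style question reformulated by GPT-4.1-mini
    options: dict[str, str]  # {"A": ..., "B": ..., "C": ..., "D": ...}
    answer: str              # gold label, "A"-"D"
    category: str            # high-level category, e.g. "Harmony & Tonality"
    subcategory: str         # fine-grained label, e.g. "Modal Mixture"
    label_source: str        # "human-preference" or "model-preference"
```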

5. Benchmarking Protocol and Model Evaluation

Model competence is assessed using top-1 accuracy:

$$\mathrm{Accuracy} = \frac{\text{\# correctly answered items}}{\text{Total \# items}} \times 100\%$$
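
In code, the metric and a per-category breakdown reduce to a few lines; this is an illustrative sketch operating on parallel lists of predicted and gold answer labels, not the authors' evaluation script.

```python
# Top-1 accuracy (percentage) and a per-category breakdown over aligned label lists.
from collections import defaultdict

def top1_accuracy(predicted: list[str], gold: list[str]) -> float:
    correct = sum(p == g for p, g in zip(predicted, gold))
    return 100.0 * correct / len(gold)

def per_category_accuracy(predicted: list[str], gold: list[str],
                          categories: list[str]) -> dict[str, float]:
    buckets: dict[str, list[tuple[str, str]]] = defaultdict(list)
    for p, g, c in zip(predicted, gold, categories):
        buckets[c].append((p, g))
    return {
        c: 100.0 * sum(p == g for p, g in pairs) / len(pairs)
        for c, pairs in buckets.items()
    }
```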

Seven state-of-the-art MLLMs were benchmarked:

  • GPT-4.1-mini (undisclosed size)
  • Qwen-2.5-VL (8.3B)
  • Phi-3-Vision (4.15B)
  • Gemma-3 (4.3B)
  • MiniCPM (3.43B)
  • InternVL (9.14B)
  • LLaVA (7.06B)

GPT-4.1-mini attained 68.31% overall accuracy on image+text, with category scores ranging from 63.20% (Rhythm & Meter) to 72.12% (Expression & Performance). Visual context provided a consistent accuracy gain for top models (GPT-4.1-mini: 68.31% with images vs. 65.76% text-only). Lesser models showed inconsistent or even negative gains from image modality.
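
The image+text vs. text-only comparison reported above can be reproduced with a loop of the following shape; `query_model` is a hypothetical wrapper around whichever MLLM API is under test and is expected to return a single answer letter.

```python
# Hedged sketch of the modality ablation; item dicts reuse the assumed schema fields.
from typing import Callable, Optional

def run_ablation(items: list[dict],
                 query_model: Callable[[str, Optional[str]], str]) -> dict[str, float]:
    """Score the same items with and without the score image attached."""
    results: dict[str, float] = {}
    for condition, use_image in (("image+text", True), ("text-only", False)):
        correct = 0
        for item in items:
            prompt = item["question"] + "\n" + "\n".join(
                f"{label}. {text}" for label, text in sorted(item["options"].items()))
            image = item["image_path"] if use_image else None
            if query_model(prompt, image) == item["answer"]:
                correct += 1
        results[condition] = 100.0 * correct / len(items)
    return results
```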

6. Performance Analysis and Failure Modes

Model scores exhibit pronounced category and subcategory variation. GPT-4.1-mini excels in “Dynamics & Articulation” (87.18%) and “Modal Mixture” (79.25%), but underperforms on “Orchestral Texture” (33.33%) and “Contrapuntal Forms” (40%), suggesting systemic disparities in symbolic parsing versus abstract reasoning.

Observed failures fall into:

  • Perception errors: misinterpretation of clefs, accidentals, beams, and rests.
  • Reasoning errors: incorrect application of harmonic or formal rules post-symbol recognition.
  • Subcategory extremes: robust modal mixture and dynamic identification; limited cross-instrumental texture and advanced contrapuntal forms.

On a perception-only probe of 50 factual queries, accuracy was 52% for GPT-4.1-mini, 38% for InternVL, and 26% for LLaVA.

7. Open Challenges and Pathways Forward

Recommendations for enhancing MLLM performance include pretraining on notation-dense corpora (e.g., OMR datasets) for improved symbol extraction, developing structure-aware encoders to track staves/measures/cross-measure dependencies, introducing formalized train/val/test splits for robust fine-tuning, and expanding beyond the MCQ paradigm toward generative symbolic reasoning.

A plausible implication is that future advances will require combining improved perceptual modeling with theory-centric reasoning architectures. WildScore establishes a foundation for systematic benchmarking, elucidating both existing model strengths and the persistent bottlenecks in symbolic music understanding (Mundada et al., 5 Sep 2025).
