MidiCaps: Captioned MIDI Dataset
- MidiCaps is a large-scale dataset pairing over 168,000 MIDI files with detailed, multi-attribute textual captions that describe key musical features.
- Its methodology blends robust MIDI parsing, audio synthesis for feature extraction, and LLM in-context learning to produce rich, multi-sentence music descriptions.
- The dataset supports applications in music information retrieval, cross-modal text-to-MIDI generation, and benchmarking of AI-driven music and language models.
MidiCaps is a large-scale, captioned symbolic music dataset designed to bridge the gap between text and MIDI representations for research in music information retrieval, music–language modeling, and multimodal generation. It provides comprehensive, multi-attribute textual descriptions for over 168,000 MIDI files, facilitating cross-modal tasks and advanced MIR benchmarks and enabling the development of text-to-MIDI generative models guided by LLMs. The dataset’s construction leverages automated feature extraction, state-of-the-art synthesis and music tagging models, and sophisticated natural language generation, positioning it as a pivotal resource for advancing multimodal music AI research (Melechovsky et al., 4 Jun 2024).
1. Dataset Composition
MidiCaps consists of 168,407 MIDI files derived from an augmented and filtered subset of the Lakh MIDI Dataset. Each MIDI file is paired with a rich set of extracted musical features:
- Symbolic features: Key, time signature, tempo (converted from MIDI set_tempo events to beats per minute), file duration, and instrument ranking based on cumulative note duration (see the sketch after this list).
- Audio-derived features: Genre, mood/theme, and chord progression are obtained from synthesized audio (using FluidSynth via Midi2Audio) and subsequent tagging with models from Essentia and Chordino.
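A minimal sketch of how the symbolic features could be computed with mido is shown below. The function name, the first-tempo-only simplification, and the omission of drum-channel handling are our own assumptions, not the dataset's released pipeline.

```python
# Hypothetical sketch (not the MidiCaps release code): derive tempo in BPM,
# file duration, and an instrument ranking by cumulative note duration.
from collections import defaultdict
import mido

def extract_symbolic_features(path):
    mid = mido.MidiFile(path)

    # Tempo: convert the first set_tempo event (microseconds per beat) to BPM;
    # 120 BPM is the MIDI default when no tempo event is present.
    bpm = 120.0
    for msg in mido.merge_tracks(mid.tracks):
        if msg.type == "set_tempo":
            bpm = mido.tempo2bpm(msg.tempo)
            break

    # Instrument ranking: accumulate sounding time per General MIDI program.
    durations = defaultdict(float)   # program number -> total note duration (s)
    program = {}                     # channel -> current program number
    note_start = {}                  # (channel, note) -> onset time (s)
    now = 0.0
    for msg in mid:                  # iterating a MidiFile yields .time in seconds
        now += msg.time
        if msg.type == "program_change":
            program[msg.channel] = msg.program
        elif msg.type == "note_on" and msg.velocity > 0:
            note_start[(msg.channel, msg.note)] = now
        elif msg.type in ("note_off", "note_on"):  # note_on at velocity 0 == note_off
            start = note_start.pop((msg.channel, msg.note), None)
            if start is not None:
                durations[program.get(msg.channel, 0)] += now - start

    ranked = sorted(durations, key=durations.get, reverse=True)
    return {"bpm": round(bpm), "duration_s": mid.length, "programs_by_duration": ranked}
```

In a full pipeline, the ranked program numbers would then be mapped to instrument names and grouped into families, as described above.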
The dataset exhibits diversity across genres, styles, and instrumentations, as shown in distributions for primary/secondary genres and instrument types. This breadth supports the training and benchmarking of models across wide musical traditions, including electronic, pop, rock, and classical domains.
2. Caption Generation Methodology
Textual captions in MidiCaps are generated via in-context learning with Claude 3 Opus, a large language model (LLM). The process involves:
- Curated Feature–Caption Examples: A set of 17 expert-crafted feature–caption pairs is presented for in-context learning. These pairs are based solely on extracted musical features and exemplify desired descriptive styles.
- Automated Captioning: For each MIDI file, the extracted feature vector (tempo, key, time signature, dominant instruments, genre, mood, chord progression) is fed to the LLM, which produces a multi-sentence, natural-language caption. Typical captions highlight musical attributes such as “a 4/4 pop track in C major at 120 BPM featuring piano and synth, with an upbeat mood and common I–IV–V chord progression,” ensuring coverage of tempo, harmony, instrumentation, genre, and mood.
Captions typically span three to seven sentences and emulate the clarity and coverage found in expert-written music descriptions; the sketch below illustrates this captioning step.
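The captioning step can be pictured as a few-shot prompt over the extracted feature set. The sketch below assumes the Anthropic Python SDK; the prompt wording, the example pair, and the helper name are placeholders, not the authors' actual prompt or code.

```python
# Illustrative in-context captioning call (assumed SDK usage, placeholder prompt).
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Curated feature -> caption pairs; MidiCaps uses 17 expert-crafted examples.
FEW_SHOT = [
    {"features": {"genre": "pop", "key": "C major", "time_signature": "4/4",
                  "tempo_bpm": 120, "instruments": ["piano", "synth"],
                  "mood": "upbeat", "chords": "I-IV-V"},
     "caption": "An upbeat pop track in C major at 120 BPM, driven by piano and synth ..."},
    # ... remaining curated examples ...
]

def caption_from_features(features: dict) -> str:
    examples = "\n\n".join(
        f"Features: {json.dumps(ex['features'])}\nCaption: {ex['caption']}"
        for ex in FEW_SHOT
    )
    prompt = (
        "Describe a piece of music in several natural sentences, using only the "
        "extracted features below and following the style of the examples.\n\n"
        f"{examples}\n\nFeatures: {json.dumps(features)}\nCaption:"
    )
    response = client.messages.create(
        model="claude-3-opus-20240229",
        max_tokens=400,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text.strip()
```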
3. Feature Extraction and Technical Pipeline
The feature extraction pipeline combines robust music informatics tools:
- Symbolic Features: Mido and Music21 are used to parse MIDI data, compute key, time signature, and duration, and map program change events to instrument names. Instruments are ranked by their total note duration and similar instrument variants are grouped.
- Genre, Mood, Chord Extraction: MIDI files are rendered to audio via FluidSynth; Essentia’s models tag genre and mood, and Chordino extracts chord progressions. A chord-pattern summary algorithm then selects the most frequent recurring subpattern among the detected chords, preferring five-chord patterns and applying analogous selection logic to four- and three-chord patterns (see the sketch after this list).
- Filtering: MIDI files with problematic properties (e.g., never-ending notes, outlier durations) are removed using Mido, ensuring dataset consistency.
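A minimal sketch of two pieces of this pipeline follows: rendering MIDI to audio with the midi2audio wrapper around FluidSynth, and a plausible chord-pattern summary over a Chordino-style chord sequence. The repetition threshold and fallback are illustrative assumptions rather than the authors' exact selection rules, and the Essentia tagging step is omitted.

```python
# Hypothetical pipeline fragment: render MIDI to audio for tagging, then
# summarize a chord sequence. Paths and thresholds are illustrative.
from collections import Counter
from midi2audio import FluidSynth

def render_midi(midi_path: str, wav_path: str, soundfont: str = "FluidR3_GM.sf2"):
    """Synthesize a MIDI file to WAV so Essentia and Chordino can analyze it."""
    FluidSynth(soundfont).midi_to_audio(midi_path, wav_path)

def summarize_chords(chords, lengths=(5, 4, 3)):
    """Most frequent recurring chord subpattern, preferring longer patterns."""
    for n in lengths:
        ngrams = Counter(tuple(chords[i:i + n]) for i in range(len(chords) - n + 1))
        if ngrams:
            pattern, count = ngrams.most_common(1)[0]
            if count > 1:  # accept only a subpattern that actually repeats
                return list(pattern)
    return chords[: lengths[-1]]  # fallback: the opening chords of the piece

# Example: a repeated I-V-vi-IV loop (as Chordino might label it)
print(summarize_chords(["C", "G", "Am", "F", "C", "G", "Am", "F"]))
# -> ['C', 'G', 'Am', 'F'] (no five-chord pattern repeats, so the four-chord loop wins)
```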
4. Evaluation and Quality Assessment
The quality of the AI-generated captions was evaluated via a controlled PsyToolkit listening study:
- Setup: Participants evaluated 15 randomly sampled MIDI files paired with LLM-generated captions and 5 paired with captions written by a music expert with absolute pitch. Each caption was rated across seven dimensions: overall match, human-likeness, genre, mood, key, chord accuracy, and tempo (scale: 1–7).
- Findings: For general listeners, AI-generated captions marginally outperformed or matched expert captions in overall matching (5.63 vs. 5.46) and human-likeness. Music experts showed more critical ratings, but AI captions remained competitive on key and tempo accuracy. These results validate both the fidelity and realism of the in-context learning approach and the multi-feature extraction pipeline.
5. Applications and Use Cases
MidiCaps provides foundational infrastructure for multiple research and application domains:
- Music Information Retrieval (MIR): Enables semantic search, genre classification, chord progression discovery, similarity analysis, and mapping between textual descriptions and symbolic music data.
- Cross-Modal Generation: Supports development of text-to-MIDI and MIDI-to-text translation models, addressing the lack of large-scale paired captioned MIDI datasets for LLM-guided symbolic music generation.
- Music Understanding: Facilitates question answering, automated content analysis, and multimodal music recommendation systems.
- Benchmarking: Provides a new standard for evaluating MIR and cross-modal modeling performance, applicable to models such as CLaMP 2 (Wu et al., 17 Oct 2024), Text2MIDI (Bhandari et al., 21 Dec 2024), and MIDI-GPT (Pasquier et al., 28 Jan 2025).
6. Impact and Future Directions
MidiCaps is positioned as a catalyst for research at the intersection of music informatics and natural language processing:
- Stimulates the development of LLM-driven symbolic music generation systems and multimodal models with semantic control over generated content.
- Enables studies of the mapping between natural language and music, potentially informing future advances in AI-assisted composition, automated music annotation, and cross-modal retrieval.
- By providing open, large-scale, paired data, MidiCaps underpins reproducibility, model comparison, and methodological advancements in MIR and multimodal machine learning.
7. Limitations and Considerations
- Caption Objectivity: Automated caption generation relies on extracted features and LLM in-context learning. While most musical and technical aspects are captured accurately, subtle nuances in mood, style, or instrumentation (especially for non-standard combinations) may occasionally be misrepresented.
- Genre Boundaries: Genre tags are extracted via audio synthesis and Essentia models, which may not perfectly disambiguate stylistically hybrid pieces.
- Directness of Mapping: A plausible implication is that models trained on MidiCaps may benefit from domain-specific fine-tuning to improve handling of rare musical forms or instructions not easily captured by feature extraction alone.
Conclusion
MidiCaps introduces a rigorously curated, large-scale paired dataset of captioned MIDI files equipped with rich musical feature extraction and automated natural language descriptions. Its construction methodology, statistical diversity, quality assessment, and clear applications establish a new foundation for text-guided symbolic music modeling, semantic music retrieval, and multimodal research at the interface of AI, music, and language (Melechovsky et al., 4 Jun 2024).