Lakh MIDI Dataset: A Symbolic Music Resource

Updated 27 September 2025
  • The Lakh MIDI Dataset is a large-scale public collection of over 170,000 MIDI files covering diverse genres and rich musical complexity.
  • Comprehensive preprocessing and deduplication techniques, including neural and heuristic methods, ensure high data quality for varied music research tasks.
  • The dataset underpins innovations in music machine learning, facilitating tasks like sequence modeling, structure segmentation, and cross-modal retrieval with companion subsets.

The Lakh MIDI Dataset (LMD) is one of the largest publicly available collections of symbolic music in the MIDI format, characterized by its substantial scale, wide genre coverage, and centrality in machine learning and music information retrieval (MIR) research. Drawn from thousands of user-contributed web sources, LMD furnishes researchers with a resource for large-scale modeling of Western popular, classical, and jazz idioms, underpinning diverse downstream tasks in music analysis, performance modeling, audio rendering, structure segmentation, and cross-modal retrieval.

1. Dataset Composition and Structure

The Lakh MIDI Dataset consists of over 170,000 individual MIDI files, representing an estimated 9,000+ hours of multi-instrumental, multi-genre music. LMD inherits substantial heterogeneity in style and quality, owing to its web-scraped origins and the predominance of community-contributed arrangements. The dataset is stored as raw MIDI files, retaining all original meta-event information (such as track markers, program change events, and tempo maps), and is not limited to a fixed number of tracks or instrumentations. Typical MIDI files range widely in structural length and complexity, with prevailing use of General MIDI mapping for instruments.

To facilitate downstream research, several curated or derived subsets and companion datasets have emerged:

| Subset/Variant | Description/Criteria | Size |
| --- | --- | --- |
| LMD-full | The complete collection as published | ~170,000+ files |
| LMD-clean | A filtered subset of higher-quality files | ~129,000+ files |
| LPD (Lakh Pianoroll Dataset) | Preprocessed into binary pianoroll format | Similar coverage |
| Slakh2100 | Audio renderings with isolated instrument stems | 2,100 files |
| MidiCaps | Text-captioned LMD subset | 168,407 files |

Files in LMD are typically loaded into symbolic modeling pipelines using MIDI processing libraries (e.g., pretty_midi, Mido, Music21), allowing conversion into event-based, piano roll, or structured tuple representations as required for specific modeling tasks.
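A minimal loading sketch using pretty_midi, one of the libraries named above, illustrates both the event-based and piano-roll views; the file path is hypothetical, and the guarded parse reflects the fact that a fraction of web-scraped files are malformed:

```python
import pretty_midi

# Hypothetical path; LMD ships as nested directories of .mid files.
path = "lmd_full/0/example.mid"

# A fraction of web-scraped files fail to parse, so loading is guarded.
try:
    pm = pretty_midi.PrettyMIDI(path)
except Exception as exc:  # pretty_midi raises several exception types
    raise SystemExit(f"unparseable MIDI file: {exc}")

# Event-style view: one (start, end, pitch, velocity) tuple per note.
events = [
    (note.start, note.end, note.pitch, note.velocity)
    for inst in pm.instruments if not inst.is_drum
    for note in inst.notes
]

# Piano-roll view: a (128, T) matrix sampled at fs frames per second.
roll = pm.get_piano_roll(fs=16)

print(f"{len(events)} notes, piano roll shape {roll.shape}")
```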

2. Preprocessing, Quality Control, and De-duplication

Owing to LMD’s diverse origins, substantial preprocessing is often required to ensure data quality and analytical consistency. Key challenges include duplicate arrangements, inconsistent track metadata, and varying file-encoding conventions.

Several strategies are documented for dataset refinement:

  • Channel and Track Filtering: Non-musical channels (e.g., fewer than two note events, dedicated metadata tracks) are removed. Drum/percussion tracks, which are often unpitched and inconsistently mapped, may be excluded or processed using specialized mappings (Walder, 2016). A minimal filtering and quantization sketch follows this list.
  • Duplicate and Overlap Removal: Methods such as “Octuple encoding” hashing, beat-position entropy, and chroma-DTW are employed to identify exact and near-duplicate files (Choi et al., 20 Sep 2025). Neural methods, including symbolic retrieval via MusicBERT, CLaMP, and contrastive BERTs (CAugBERT), offer superior recall and precision, especially when trained with domain-specific data augmentations. High-threshold similarity clustering can remove up to 21.4% of files as duplicates.
  • Resampling and Timing Normalization: To harmonize temporal resolution across the dataset, start/end times are commonly re-sampled to a standard tick value or quantized grid (e.g., 2400 ticks per quarter note in classical-scope datasets).
  • Heuristic Quality Scoring: For subsets requiring higher reliability, probabilistic note-distribution models are used to assign quality scores and filter out anomalous or low-confidence files (Walder, 2016).
  • MIDI Instrument Mapping: Cleaning pipelines may remap instrument assignments for research targeting specific ensembles (e.g., monophonic melodic voice mapping for NES pre-training (Donahue et al., 2019)).
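
A minimal sketch of the track-filtering and timing-normalization steps above, assuming pretty_midi; the thresholds (two notes, an eighth-note grid) are illustrative defaults rather than values prescribed by the cited papers:

```python
import pretty_midi

def clean_tracks(pm: pretty_midi.PrettyMIDI,
                 min_notes: int = 2,
                 keep_drums: bool = False) -> pretty_midi.PrettyMIDI:
    """Drop near-empty tracks and, optionally, drum tracks (first bullet)."""
    pm.instruments = [
        inst for inst in pm.instruments
        if len(inst.notes) >= min_notes and (keep_drums or not inst.is_drum)
    ]
    return pm

def quantize(pm: pretty_midi.PrettyMIDI,
             grid: float = 0.125) -> pretty_midi.PrettyMIDI:
    """Snap note boundaries to a beat-fraction grid (third bullet).

    A simplified, seconds-domain stand-in for tick-level resampling such
    as 2400 ticks per quarter note; assumes a single estimated tempo.
    """
    step = (60.0 / pm.estimate_tempo()) * grid
    for inst in pm.instruments:
        for note in inst.notes:
            note.start = round(note.start / step) * step
            note.end = max(note.start + step, round(note.end / step) * step)
    return pm

pm = clean_tracks(pretty_midi.PrettyMIDI("example.mid"))  # hypothetical file
pm = quantize(pm)
```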

Data splits (train/validation/test) for modeling tasks are typically defined to minimize leakage, often using hierarchical clustering over feature- or histogram-based signatures to separate duplicate or near-duplicate content.
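
A sketch of such a duplication-aware split, assuming numpy/scipy and using a duration-weighted pitch-class histogram as the signature; the signature choice and the 0.05 distance threshold are illustrative assumptions, not the exact features used in the cited work:

```python
import numpy as np
import pretty_midi
from scipy.cluster.hierarchy import fcluster, linkage

def signature(path: str) -> np.ndarray:
    """Duration-weighted pitch-class histogram as a cheap file signature."""
    pm = pretty_midi.PrettyMIDI(path)
    hist = np.zeros(12)
    for inst in pm.instruments:
        if inst.is_drum:
            continue
        for note in inst.notes:
            hist[note.pitch % 12] += note.end - note.start
    total = hist.sum()
    return hist / total if total > 0 else hist

def duplication_aware_split(paths, threshold=0.05, val_frac=0.1, seed=0):
    """Cluster near-identical signatures, then assign whole clusters to a
    split so near-duplicates never straddle train/validation."""
    sigs = np.stack([signature(p) for p in paths])
    labels = fcluster(linkage(sigs, method="average"),
                      t=threshold, criterion="distance")
    clusters = np.unique(labels)
    rng = np.random.default_rng(seed)
    rng.shuffle(clusters)
    val_clusters = set(clusters[: max(1, int(len(clusters) * val_frac))])
    train = [p for p, c in zip(paths, labels) if c not in val_clusters]
    val = [p for p, c in zip(paths, labels) if c in val_clusters]
    return train, val
```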

3. Annotation, Derived Datasets, and Cross-Modal Enhancements

A defining property of LMD is its accessibility for rapid generation of enriched, task-specific datasets:

  • Textual Captions (MidiCaps): 168,407 LMD files are paired with natural language captions, generated using an in-context learning setup with an LLM (Claude 3 Opus). Captioning leverages features extracted from MIDI (tempo, chord progressions via Chordino, time signature, key via Music21, prominent instruments), automating semantic annotation for cross-modal retrieval and text-to-MIDI tasks (Melechovsky et al., 4 Jun 2024); a feature-extraction sketch follows this list.
  • Structural Segmentation: The Segmented Lakh MIDI Subset (SLMS) annotates over 6,100 files with human-validated section boundaries and structure labels, leveraging embedded marker events and manual validation (Eldeeb et al., 20 Sep 2025).
  • Audio Renderings: The Slakh2100 dataset renders 2,100 selected LMD files into multi-instrument “stems” using professional sample libraries, enabling studies of source separation and audio-to-symbolic learning (Manilow et al., 2019). Selection criteria impose minimum instrument presence and note counts; each part is synthesized with randomized instrument patches for timbral diversity.
  • Bootleg Score Representations: Cross-modal linkage to sheet music resources (e.g., IMSLP) is achieved via projection of MIDI events onto binary “bootleg scores,” enabling scalable retrieval between symbolic and staff image data (Tsai, 2020).
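
As a rough illustration of the kind of features such captioning pipelines consume, the following sketch uses pretty_midi and Music21 (both named above) to extract tempo, time signature, key, and prominent instruments. Chordino-based chord extraction is omitted because it is an audio-domain Vamp plugin; this is an assumption-laden approximation, not the MidiCaps pipeline itself:

```python
import pretty_midi
from music21 import converter

def caption_features(path: str) -> dict:
    """Extract tempo, time signature, key, and prominent instruments."""
    pm = pretty_midi.PrettyMIDI(path)

    # Tempo: first tempo-change event if present, else an onset-based estimate.
    _, tempi = pm.get_tempo_changes()
    tempo = float(tempi[0]) if len(tempi) else pm.estimate_tempo()

    # Time signature from MIDI meta events (frequently missing in the wild).
    ts = pm.time_signature_changes
    time_sig = f"{ts[0].numerator}/{ts[0].denominator}" if ts else "unknown"

    # Key via music21's analytical key finder, as referenced above.
    key = str(converter.parse(path).analyze("key"))

    # Prominent instruments, ranked by total sounding duration.
    inst_time: dict[str, float] = {}
    for inst in pm.instruments:
        if inst.is_drum:
            continue
        name = pretty_midi.program_to_instrument_name(inst.program)
        inst_time[name] = inst_time.get(name, 0.0) + sum(
            note.end - note.start for note in inst.notes)
    instruments = sorted(inst_time, key=inst_time.get, reverse=True)[:3]

    return {"tempo_bpm": tempo, "time_signature": time_sig,
            "key": key, "instruments": instruments}
```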

4. Applications in Music Machine Learning and MIR

LMD is foundational to a broad range of MIR and machine learning tasks, including but not limited to:

  • Sequence Modeling and Generation: RNN, Transformer, and VAE-based models leverage LMD’s scale and diversity for polyphonic sequence generation, style modeling, and long-range structure learning. Pre-training on LMD has been shown to yield up to a 10% reduction in perplexity in transfer-learning scenarios (e.g., LakhNES (Donahue et al., 2019)); a minimal event-tokenization sketch follows this list.
  • Multi-track and Polyphonic Modeling: Hierarchical VAEs (such as MIDI-Sandwich2) use multi-track LMD derivatives (e.g., LPD) for multi-instrument, multi-timbral sequence generation, enhancing inter-track coherence and offering refined output binarization (Liang et al., 2019).
  • Real-Time Performance and Probabilistic Modeling: Autoregressive probabilistic models, such as Notochord, are trained on LMD to deliver low-latency (<10 ms) polyphonic and multi-instrumental event generation, allowing for interactive applications including steerable generation, auto-harmonization, and real-time improvisation (Shepardson et al., 18 Mar 2024).
  • Structure and Segmentation: CNN architectures utilizing overtone-encoded symbolic piano rolls reliably detect sectional boundaries in LMD-derived datasets, outperforming supervised and unsupervised audio-based methods by over 0.2 in F1 score (Eldeeb et al., 20 Sep 2025).
  • Source Separation: Slakh2100 enables the creation of high-quality audio mixtures with ground-truth instrument stems, offering an order of magnitude greater duration than standard datasets (e.g., MUSDB18) and facilitating deep learning research requiring large-scale synthetic ground truth (Manilow et al., 2019).
  • Cross-Modal and Multimodal Research: The availability of paired MIDI and textual or image modalities supports research in cross-modal retrieval, controllable music generation from text prompts, and score-to-performance alignment (Tsai, 2020, Melechovsky et al., 4 Jun 2024).
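
To make the sequence-modeling entry concrete, here is a minimal MIDI-like event tokenizer (NOTE_ON / NOTE_OFF / TIME_SHIFT) in the general spirit of such encodings; the 10 ms step and 100-step shift cap are illustrative assumptions, not the LakhNES vocabulary:

```python
import pretty_midi

def to_events(path: str, time_step: float = 0.01, max_shift: int = 100):
    """Serialize a MIDI file into NOTE_ON / NOTE_OFF / TIME_SHIFT tokens."""
    pm = pretty_midi.PrettyMIDI(path)
    changes = []  # (time, event kind, pitch), merged across instruments
    for inst in pm.instruments:
        if inst.is_drum:
            continue
        for note in inst.notes:
            changes.append((note.start, "NOTE_ON", note.pitch))
            changes.append((note.end, "NOTE_OFF", note.pitch))
    changes.sort()

    tokens, clock = [], 0.0
    for t, kind, pitch in changes:
        gap = int(round((t - clock) / time_step))
        while gap > 0:  # emit bounded time shifts to keep the vocabulary finite
            shift = min(gap, max_shift)
            tokens.append(f"TIME_SHIFT_{shift}")
            gap -= shift
        clock = t
        tokens.append(f"{kind}_{pitch}")
    return tokens
```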

5. Data Limitations, Challenges, and Ongoing Refinement

Despite its centrality, LMD presents distinctive challenges:

  • Duplication and Redundancy: The presence of multiple arrangements and user edits of the same piece, coupled with inconsistent metadata handling, introduces both hard and soft duplicates. This can cause unreliable evaluations and overestimation of model generalizability when cross-split leakage occurs. Advanced de-duplication using symbolic embedding models and contrastive learning has become standard to address this (Choi et al., 20 Sep 2025).
  • Quality Variance: Due to its “in the wild” web-crawled composition, LMD contains files with varying degrees of musical accuracy, timing fidelity, and completeness. Annotated subsets (e.g., LMD-clean) and probabilistic scoring pipelines are frequently used to mitigate this.
  • Instrumental and Genre Imbalance: While LMD covers a broad range of genres, style distribution is skewed toward popular and electronic idioms. Annotated summaries (as in MidiCaps) indicate strong representation for certain genres and keys; some domains (e.g., detailed classical repertoire) may be better served by more focused datasets (Walder, 2016).
  • Metadata Inconsistencies: Instrument mappings, track labels, and key signature reporting vary widely in reliability, requiring extensive harmonization steps for modeling consistency.

6. Impact and Future Directions

The Lakh MIDI Dataset represents a standard benchmark for symbolic music modeling, driving advances in representation learning, generative modeling, and multimodal research. Its open accessibility and extensibility have catalyzed derivative annotated datasets, audio renderings, and cross-modal linkages—extending symbolic MIR research to address tasks spanning structure analysis, style transfer, and text-driven music generation.

Ongoing efforts to curate, annotate, and de-duplicate LMD (with methods such as contrastive BERT-based retrieval and clustering) substantially improve evaluation validity and enable improved generalization and fairness in MIR experimentation (Choi et al., 20 Sep 2025). Enhanced annotation protocols—such as large-scale human-in-the-loop captioning and structural segmentation—broaden the scope of possible research at the intersection of music, language, and audio.

Further work is likely to include additional multimodal linkages (audio ↔ symbolic ↔ score/image ↔ text), more granular annotation (including expressive timing and articulation), and framework development for robust, duplication-aware benchmarking. As symbolic music research matures, the Lakh MIDI Dataset and its progeny serve as both testbeds and reference corpora for the field, driving both methodological rigor and creative exploration.
