Lakh MIDI Dataset Overview
- Lakh MIDI Dataset is a comprehensive symbolic music corpus containing over 170,000 diverse MIDI files spanning genres like pop, rock, and classical.
- It supports numerous research tasks including generative modeling, music information retrieval, structural analysis, and cross-modal linkage.
- Extensive preprocessing, de-duplication, and specialized derivatives such as Slakh and LPD enhance its applicability in deep learning and multimodal studies.
The Lakh MIDI Dataset (LMD) is one of the largest and most influential symbolic music corpora publicly available for machine learning, generative modeling, and music informatics research. Comprising over 170,000 MIDI files with a wide stylistic range, LMD has become a foundational resource for methods spanning data-driven creation, music information retrieval (MIR), structure analysis, expressive performance detection, cross-modal linkage, de-duplication, and multimodal captioning. Its scale, diversity, and open accessibility have precipitated new methodologies and benchmarking standards across symbolic music and audio-related domains.
1. Dataset Structure, Coverage, and Diversity
LMD contains over 170,000 MIDI files collected via web scraping and aggregation from numerous internet repositories (Manilow et al., 2019). Each file encodes symbolic musical information as a series of note events (pitch, onset/duration, velocity), program changes, metadata, text markers, and instrument assignments; a minimal parsing sketch follows the list below. Distribution is broad, spanning genres such as pop, rock, electronic, classical, jazz, and soundtracks, with instrumentation ranging from solo piano to large multi-instrumental arrangements. Notable characteristics include:
- Typical file composition: several instrument tracks, including “band” configurations such as piano, bass, guitar, drums, and optionally strings, winds, or electronic sounds.
- Rich stylistic and temporal diversity, with variable quantization (ticks per quarter note).
- Predominance of quantized, programmatic content, but with significant representation of performed (expressive) human recordings and transcriptions (Lee et al., 24 Feb 2025).
- Metadata often includes time signature, key, and expressive markers; however, annotation granularity varies widely by source file.
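The sketch below illustrates how such note-event content is typically parsed, assuming the widely used pretty_midi library; the file path is a placeholder (LMD stores files under MD5-hash filenames), not a specific LMD entry.

```python
# Minimal parsing sketch for one LMD file (path is a placeholder).
import pretty_midi

pm = pretty_midi.PrettyMIDI("lmd_full/0/0a0ce238fb8c672549f77f3b692ebf32.mid")
print(f"resolution: {pm.resolution} ticks/quarter, length: {pm.get_end_time():.1f} s")

for inst in pm.instruments:
    name = "Drums" if inst.is_drum else pretty_midi.program_to_instrument_name(inst.program)
    # Each note carries the symbolic attributes listed above.
    events = [(n.pitch, n.start, n.end - n.start, n.velocity) for n in inst.notes]
    print(f"{name}: {len(events)} note events")
```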
The dataset is often used both as a raw symbolic corpus and as the basis for derived datasets tailored to specific research objectives (e.g., Slakh2100 for source separation, Lakh Pianoroll Dataset (LPD) for multi-track generative modeling, Segmented Lakh MIDI Subset (SLMS) for structural analysis, and filtered versions for de-duplication).
2. Preprocessing, Quality Control, and De-duplication
LMD's utility for robust MIR and generative modeling hinges on careful preprocessing to address noise, duplicates, and inconsistencies inherent in scraped datasets.
- Filtering pipelines typically exclude files with insufficient musical content, invalid or missing metadata, or excessive sustained notes.
- Cleaning steps remove percussion tracks, consolidate sustain pedal signals (for piano content), deduplicate overlapping note events, and harmonize timing resolutions (for example, resampling all files to 2400 ticks per quarter note in the CMA dataset) (Walder, 2016).
- De-duplication is critical because multiple arrangements, re-edits, and metadata-only modifications of the same song appear frequently. Recent work employs rule-based hashing (Octuple and MD5), beat entropy analysis, chroma-DTW alignment, and symbolic music retrieval models (MusicBERT, CLaMP); advanced contrastive learning (CAugBERT) leverages token-level data augmentation to distinguish subtle note-level variations (Choi et al., 20 Sep 2025). Conservative filtering can exclude ~21.4% of files as redundant, mitigating data leakage in train/test splits and improving representation diversity (a minimal hashing sketch follows this list).
- Hierarchical clustering strategies (using signature vectors from note numbers and durations) are used to generate splits with minimal overlap among versions or transcriptions for experimental validity (Walder, 2016).
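A minimal sketch of the rule-based hashing step, assuming pretty_midi for parsing and the 2400-ticks-per-quarter resampling target mentioned above; the cited pipelines combine such hashing with embedding-based retrieval, which is not shown here.

```python
# Minimal content-hashing sketch for rule-based de-duplication.
import hashlib
import pretty_midi

def content_hash(path, ticks_per_quarter=2400):
    """Hash pitch/onset/duration content so metadata-only edits collide."""
    pm = pretty_midi.PrettyMIDI(path)
    events = []
    for inst in pm.instruments:
        if inst.is_drum:  # drop percussion, mirroring the cleaning steps above
            continue
        for n in inst.notes:
            onset = round(pm.time_to_tick(n.start) / pm.resolution * ticks_per_quarter)
            dur = round((pm.time_to_tick(n.end) - pm.time_to_tick(n.start))
                        / pm.resolution * ticks_per_quarter)
            events.append((n.pitch, onset, dur))
    events.sort()
    return hashlib.md5(str(events).encode("utf-8")).hexdigest()

# Files whose hashes collide are treated as duplicates and collapsed to one copy.
```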
3. Derivative Datasets and Specialized Applications
Several targeted datasets and benchmarks are derived from LMD for particular research domains:
Slakh Dataset (Manilow et al., 2019): Synthesized mixtures and instrument stems obtained by rendering LMD MIDI with professional-grade virtual instruments; 2100 songs with 145 hours of CD-quality audio and labeled tracks. Enables large-scale supervised training for source separation models, addressing the scarcity of instrument stems in commercial audio.
Lakh Pianoroll Dataset (LPD): Binary multi-track piano roll representation with five fixed tracks (bass, drums, guitar, strings, piano) for multi-track symbolic music generation (Liang et al., 2019). Forms the basis for hierarchical models such as MIDI-Sandwich2.
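A minimal sketch of an LPD-style binary pianoroll tensor, assuming pretty_midi and a rough General MIDI program-range mapping onto the five tracks; LPD itself uses a beat-based time grid and its own merging rules, so this is only illustrative.

```python
# LPD-style binary (track x time x pitch) tensor; mapping and grid are simplifications.
import numpy as np
import pretty_midi

TRACKS = ["drums", "piano", "guitar", "bass", "strings"]

def track_index(inst):
    if inst.is_drum:
        return 0
    if inst.program < 8:          # pianos
        return 1
    if 24 <= inst.program < 32:   # guitars
        return 2
    if 32 <= inst.program < 40:   # basses
        return 3
    return 4                      # everything else folded into "strings"

def to_pianoroll(path, fs=24):    # fs = time steps per second (LPD uses steps per beat)
    pm = pretty_midi.PrettyMIDI(path)
    n_steps = int(np.ceil(pm.get_end_time() * fs)) + 1
    roll = np.zeros((len(TRACKS), n_steps, 128), dtype=bool)
    for inst in pm.instruments:
        t = track_index(inst)
        for note in inst.notes:
            roll[t, int(note.start * fs):int(note.end * fs) + 1, note.pitch] = True
    return roll
```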
Segmented Lakh MIDI Subset (SLMS) (Eldeeb et al., 20 Sep 2025): 6134 human-annotated MIDI files with section boundary markers for supervised music structure analysis. Derived via marker extraction, barwise quantization, and deduplication.
MidiCaps (Melechovsky et al., 4 Jun 2024): 168,407 MIDI files paired with automatically generated text captions describing features such as tempo, time signature, key, instruments, genre, and mood. Captions are produced via in-context learning with an LLM conditioned on extracted features.
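The sketch below shows the kind of feature extraction such captioning can be conditioned on, assuming pretty_midi; MidiCaps' actual pipeline, feature set, and prompting are more elaborate.

```python
# Minimal feature-extraction sketch feeding an LLM captioning prompt.
import pretty_midi

def extract_features(path):
    pm = pretty_midi.PrettyMIDI(path)
    ts = pm.time_signature_changes
    ks = pm.key_signature_changes
    return {
        "tempo_bpm": round(pm.estimate_tempo(), 1),
        "time_signature": f"{ts[0].numerator}/{ts[0].denominator}" if ts else None,
        "key": pretty_midi.key_number_to_key_name(ks[0].key_number) if ks else None,
        "instruments": sorted({
            "Drums" if i.is_drum else pretty_midi.program_to_instrument_name(i.program)
            for i in pm.instruments
        }),
        "duration_s": round(pm.get_end_time(), 1),
    }

# The resulting dictionary is serialized into an in-context prompt for the LLM.
```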
These datasets enable new research in cross-modal generation, structure-aware music modeling, expressive performance study, and multimodal retrieval.
4. Modeling Approaches Leveraging LMD
LMD’s size and diversity enable high-capacity deep learning architectures and complex generative strategies.
- Pre-training and Transfer Learning: LMD is routinely utilized to imbue models with broad musical understanding (harmonic, melodic, rhythmic, and instrumental knowledge). For example, LakhNES pre-trains Transformer-XL on LMD mapped to NES ensemble format before fine-tuning on domain-specific NES-MDB, yielding improved perplexity and user-perceived human-likeness in output (Donahue et al., 2019). Data augmentation (transposition, tempo scaling, instrument dropping/shuffling) addresses overfitting and domain adaptation challenges; an augmentation sketch follows this list.
- Probabilistic Sequence Modeling: Notochord exemplifies state-of-the-art probabilistic modeling on LMD, employing an autoregressive GRU backbone with event-level factorization (instrument, pitch, inter-event time, velocity). Key architecture choices, including continuous-time modeling via discretized mixture-of-logistics distributions and sub-event conditioning for steerable output, are enabled by LMD's rich data diversity (Shepardson et al., 18 Mar 2024).
- Hierarchical Multi-modal Fusion: The MIDI-Sandwich2 VAE model, validated on LPD, constructs independent track-wise VAEs before fusing latents with a multi-modal VAE. This hierarchical design, empowered by LMD’s multi-track structure, enables harmonious multi-track generation and expressive restoration (Liang et al., 2019).
- Expressive Performance Detection: GigaMIDI (a superset that includes LMD) introduces velocity/onset-based heuristics (DNVR, DNODR) and hierarchical metric alignment (NOMML) to identify expressive human performances within symbolic datasets, achieving up to 100% accuracy on non-expressive vs. expressive-performance (NE/EP) discrimination with the NOMML metric (Lee et al., 24 Feb 2025).
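A minimal augmentation sketch in the spirit of the transposition / tempo-scaling / instrument-dropping scheme mentioned above, assuming pretty_midi; parameter ranges are illustrative, not the LakhNES recipe.

```python
# Minimal symbolic-data augmentation sketch for pre-training.
import random
import pretty_midi

def augment(pm: pretty_midi.PrettyMIDI, max_transpose=6,
            tempo_range=(0.9, 1.1), drop_prob=0.1):
    semitones = random.randint(-max_transpose, max_transpose)
    stretch = random.uniform(*tempo_range)
    kept = []
    for inst in pm.instruments:
        if not inst.is_drum and random.random() < drop_prob:
            continue                                            # instrument dropping
        for n in inst.notes:
            if not inst.is_drum:
                n.pitch = min(127, max(0, n.pitch + semitones))  # transposition
            n.start *= stretch                                   # tempo scaling
            n.end *= stretch
        kept.append(inst)
    pm.instruments = kept
    return pm
```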
5. Cross-modal and Music Structure Modeling
LMD has proven foundational for research bridging symbolic and other modalities (audio, text, sheet music):
- Cross-modal Retrieval: Efficient bootleg score hashing enables linkage between MIDI files in LMD and sheet music images in IMSLP. The bootleg representation projects events to binary matrices capturing notehead vertical positions. Hashing and reverse indexing facilitate scalable database search (a simplified fingerprinting sketch follows this list), with a mean reciprocal rank (MRR) of 0.84 and a retrieval time of 25.4 seconds per query (Tsai, 2020).
- Music Structure Analysis: Section boundary detection in symbolic music leverages the LMD-derived SLMS. CNN models with overtone-encoded piano roll tensors (inspired by audio spectrograms) outperform audio-based techniques in F₁ score by ~0.22–0.31 (Eldeeb et al., 20 Sep 2025). This enables data-driven study of musical form, segmentation, and higher-level structure in symbolic corpora.
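A simplified fingerprinting sketch of the hashing idea behind the cross-modal retrieval work above: note events are projected onto a small binary grid and fixed-width column groups are hashed for reverse-index lookup. The actual bootleg score maps noteheads to staff positions; raw MIDI pitch bins are used here purely for illustration.

```python
# Simplified bootleg-style fingerprinting for reverse-index retrieval.
import hashlib
import numpy as np

def binary_grid(notes, n_cols=64, n_rows=128):
    """notes: list of (pitch, onset) pairs with onsets normalized to [0, 1)."""
    grid = np.zeros((n_rows, n_cols), dtype=np.uint8)
    for pitch, onset in notes:
        grid[pitch, min(n_cols - 1, int(onset * n_cols))] = 1
    return grid

def column_fingerprints(grid, width=4):
    """Hash each group of `width` adjacent columns into a short fingerprint."""
    n_cols = grid.shape[1]
    return [hashlib.md5(grid[:, i:i + width].tobytes()).hexdigest()[:16]
            for i in range(0, n_cols - width + 1, width)]

# Fingerprints from MIDI and from sheet-music images go into a shared reverse
# index; matching fingerprint sequences link the two modalities.
```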
6. Limitations, Data Quality, and Future Research
- Duplication and Quality Control: Data leakage from duplicate files or overlapping arrangements can skew model evaluation and lead to overfit generative outputs. Comprehensive de-duplication using hybrid embedding-based filtering is essential for reliable machine learning outcomes (Choi et al., 20 Sep 2025).
- Performance and Expressivity: While LMD's breadth is substantial, a sizable fraction of files are quantized or programmatically generated, limiting modeling of human-like expressive nuance. Approaches such as velocity- and onset-based heuristics (DNVR, DNODR) and metric alignment (NOMML) help filter and annotate expressive subsets (a velocity-diversity sketch follows this list), but challenges remain in capturing subtle agogic and dynamic variation (Lee et al., 24 Feb 2025).
- Annotation Granularity: Metadata coverage (keys, markers, expressive controls) is inconsistent, limiting certain MIR tasks and downstream applications. Text captioning, as with MidiCaps, improves content interpretability but depends on the accuracy of the underlying feature extraction and language modeling (Melechovsky et al., 4 Jun 2024).
- Extension to Audio and Multi-modal Domains: LMD-derived datasets (Slakh, MidiCaps) support source separation, cross-modal generation, and retrieval; a plausible implication is that ongoing synthesis and annotation efforts will further bridge symbolic and acoustic modalities.
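A minimal velocity-diversity heuristic in the spirit of DNVR, assuming pretty_midi; the exact GigaMIDI definitions and thresholds differ, so this is only a rough filter.

```python
# Rough expressivity filter: quantized/programmatic files tend to reuse few velocities.
import pretty_midi

def velocity_diversity(path):
    pm = pretty_midi.PrettyMIDI(path)
    velocities = [n.velocity for inst in pm.instruments
                  if not inst.is_drum for n in inst.notes]
    if not velocities:
        return 0.0
    return len(set(velocities)) / len(velocities)

# Values near 0 suggest quantized or programmatic content; higher values suggest
# recorded human performance worth keeping in an "expressive" subset.
```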
7. Impact and Ongoing Directions
The Lakh MIDI Dataset stands as a cornerstone of symbolic music machine learning, supporting a wide constellation of downstream tasks—generative modeling, expressive detection, structure analysis, cross-modal translation, and MIR benchmarking. Continuous efforts to improve annotation, de-duplication, expressive labeling, and cross-modal linkage are expanding its utility, while derived resources (Slakh, LPD, SLMS, MidiCaps) continue to set standards for reproducibility, scalability, and multimodal research in computer music.