Segmented Lakh MIDI Subset (SLMS)
- Segmented Lakh MIDI Subset (SLMS) is a curated symbolic music dataset with over 6,000 multi-track MIDI files annotated at the bar level.
- It employs a rigorous curation process including deduplication, marker-based filtering, and manual DAW validation to ensure high-quality section boundary annotations.
- Using an overtone-aware 3-channel piano-roll encoding, SLMS facilitates deep learning applications in Music Structure Analysis and robust model benchmarking.
The Segmented Lakh MIDI Subset (SLMS) is a large-scale, human-annotated symbolic music dataset specifically constructed for section boundary detection in multi-track MIDI files. Derived from the Lakh MIDI Dataset (LMD), SLMS comprises 6,134 MIDI files with bar-wise annotations of musical section boundaries and is organized to enable machine learning benchmarks for Music Structure Analysis (MSA) in the symbolic domain. Its design leverages symbolic music’s explicit pitch, timing, and instrumentation representations, introducing both a detailed annotation schema and a novel data encoding that facilitates deep learning applications (Eldeeb et al., 20 Sep 2025).
1. Dataset Composition and Subset Structure
SLMS contains 6,134 curated multi-track MIDI files, split into two non-overlapping subsets:
- Tubb files: 3,907 files (∼225.1 hours) emphasizing 19th-century popular song arrangements, typically scored for solo or small-ensemble instrumentation.
- Non-Tubb files: 2,227 files (∼143.5 hours) covering a diverse range of genres, including rock, jazz, symphonic, solo classical piano, metal, and karaoke.
Selection and curation from LMD applied several layers:
- Deduplication: Employed Malandro’s "CA" deduplication methodology to eliminate near-duplicate files.
- Marker-based Filtering: Required at least 3 MIDI marker meta-events per file; ensured the ratio of measures to markers is between 6 and 24; enforced at least one marker between the first and last note onsets.
- Manual Curation: Visual inspection in a Digital Audio Workstation (DAW) verified alignment of markers to perceived section boundaries, supplemented by auditioning where ambiguities arose. Files with incorrectly placed or extraneous markers were removed.
Acceptance rates reflect the rigor of the filtering: 3,907 of 4,466 Tubb candidates were retained (12.5% rejection), versus 2,227 of 3,336 non-Tubb candidates (33.2% rejection).
All standard MIDI program numbers are represented, with drums allocated to a dedicated channel, supporting a broad spectrum of MIDI-instrumented music.
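The marker-based filtering rules above can be expressed as a simple predicate. The sketch below is illustrative only; the function name and arguments are hypothetical, and the code that extracts marker positions and measure counts from LMD files is not shown:

```python
def passes_marker_filter(marker_beats, n_measures, first_onset_beat, last_onset_beat):
    """Apply the three marker-based filtering rules used to curate SLMS."""
    # Rule 1: at least 3 MIDI marker meta-events per file.
    if len(marker_beats) < 3:
        return False
    # Rule 2: the ratio of measures to markers must lie in [6, 24].
    ratio = n_measures / len(marker_beats)
    if not 6 <= ratio <= 24:
        return False
    # Rule 3: at least one marker strictly between first and last note onsets.
    return any(first_onset_beat < b < last_onset_beat for b in marker_beats)

# Example: 4 markers over 48 measures (ratio 12), several inside the note span.
print(passes_marker_filter([0.0, 32.0, 64.0, 96.0], 48, 4.0, 180.0))  # True
```

Files passing this predicate then proceed to the manual DAW-based curation stage.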
2. Annotation Scheme and Curation Workflow
Annotations in SLMS are strictly derived from the original MIDI authors' "marker" meta-events. Markers are deemed valid section boundaries if they coincide with bar lines, either directly or after quantization to the nearest bar line, as determined by the MIDI time-signature track. Original text labels (e.g., "Verse," "Chorus," "Bridge") are retained unmodified.
The multi-stage curation protocol is as follows:
- Post-filtering, candidate files are listed by marker count.
- Each candidate undergoes visual examination within a DAW for marker-to-section correspondence.
- When necessary, auditory checks are performed via synthesized playback.
- Files displaying missing, erroneous, or non-section markers are excluded; remaining markers are quantized to the nearest bar line.
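The final quantization step of the protocol can be sketched as follows. This minimal illustration assumes a constant 4/4 meter; real files must instead consult the per-file MIDI time-signature track:

```python
def quantize_to_bar(marker_beat, beats_per_bar=4.0):
    """Snap a marker position (in quarter-note beats) to the nearest bar line.

    Assumes a constant meter for illustration; SLMS derives bar lines
    from each file's MIDI time-signature track.
    """
    bar_index = round(marker_beat / beats_per_bar)
    return bar_index * beats_per_bar

print(quantize_to_bar(33.5))  # 32.0 (nearest bar line at bar 8)
print(quantize_to_bar(14.7))  # 16.0 (nearest bar line at bar 4)
```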
Unlike inter-annotator-validated corpora such as SALAMI (which reports ~0.74 F₁ at 0.5s tolerance), agreement metrics are not reported, as only the original author’s annotations are used and no second-pass curation is performed.
Each file’s annotations are delivered in a JSON structure referencing: file_id (MD5 hash), subset, split, and a list of markers with beat (quarter-note units), time (seconds), and label.
3. Data Representation and Encoding for Machine Learning
The SLMS introduces a 3-channel overtone-aware piano-roll encoding for symbolic music processing. For CNN-based section boundary detection, each MIDI file is converted into fixed-size "patches" centered on bar boundaries. Each patch is a three-channel 128 × 512 array, where:
- Height: 128 (for MIDI pitches 0–127)
- Width: 512 time steps (representing 128 quarter notes = 32 bars at 4 ticks per beat)
Channels:
- Primary piano roll (channel 1): Aggregates all non-drum melodic and harmonic note events.
- Drum channel (channel 2): Encodes drum note-on events with a fixed 1/16-note duration.
- Harmonic overtone channel (channel 3): Synthesizes overtones for each non-drum instrument. The k-th overtone of a note lies 12·log₂ k semitones above its fundamental, so for each note at MIDI pitch p, energy is added at the pitches nearest p + 12·log₂ k for the modeled overtones, clipped to the 0–127 pitch range.
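The pitch placement of overtones can be illustrated with the standard harmonic-series relation: the k-th overtone of a note lies 12·log₂ k semitones above it. The sketch below is an assumption-laden illustration; the number of overtones modeled and any amplitude weighting are not the paper's exact parameters:

```python
import math

def overtone_pitches(pitch, n_overtones=4):
    """MIDI pitches nearest the harmonics of a note, clipped to 0-127.

    The k-th harmonic of a fundamental at MIDI pitch p sits 12*log2(k)
    semitones above p; we round to the nearest MIDI pitch.
    The number of overtones is an illustrative choice.
    """
    pitches = []
    for k in range(2, n_overtones + 2):
        p = pitch + round(12 * math.log2(k))
        if 0 <= p <= 127:
            pitches.append(p)
    return pitches

# Middle C (60): octave, twelfth, double octave, ~major third above that.
print(overtone_pitches(60))  # [72, 79, 84, 88]
```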
This 3-channel encoding improves the network’s capacity to model polyphonic texture, rhythmic density, and overtone spectra, serving as input to a MobileNetV3-based CNN.
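Extracting a fixed-size patch centered on a bar boundary can be sketched with NumPy. The array layout (channels × pitch × time) and function name are assumptions; zero-padding handles bars near the start or end of a piece:

```python
import numpy as np

def extract_patch(roll, center_step, width=512):
    """Cut a (channels, 128, width) patch from a (channels, 128, T) piano roll,
    centered on time step `center_step`, zero-padded at the edges."""
    channels, height, total_steps = roll.shape
    patch = np.zeros((channels, height, width), dtype=roll.dtype)
    lo = center_step - width // 2
    src_lo, src_hi = max(lo, 0), min(lo + width, total_steps)
    patch[:, :, src_lo - lo:src_hi - lo] = roll[:, :, src_lo:src_hi]
    return patch

roll = np.ones((3, 128, 1000), dtype=np.float32)
print(extract_patch(roll, 100).shape)  # (3, 128, 512)
```

A patch centered near the beginning of the piece, as above, is left-padded with zeros so every bar boundary yields an input of identical shape.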
4. Organization, Accessibility, and Licensing
The public release at https://github.com/m-malandro/SLMS comprises:
- MIDI files: /SLMS/midi/
- Annotations: /SLMS/annotations/
- Train/Val/Test splits: /SLMS/splits/
Annotation and MIDI filenames are consistent, derived from the file’s LMD MD5 hash.
JSON annotation schema:
| Field | Type | Example Value |
|---|---|---|
| file_id | str | "ca05cc474fd2010484c1201bf57b3cfd" |
| subset | str | "Tubb" or "non-Tubb" |
| split | str | "train", "val", or "test" |
| markers | list | [{"beat": 32.0, "time": 14.37, "label": "Verse"}, ...] |
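Reading one annotation record of this shape might look like the following; the values are taken from the example column above, and the single-marker list is truncated for brevity:

```python
import json

record = json.loads("""
{
  "file_id": "ca05cc474fd2010484c1201bf57b3cfd",
  "subset": "Tubb",
  "split": "train",
  "markers": [{"beat": 32.0, "time": 14.37, "label": "Verse"}]
}
""")

# Section-boundary positions in quarter-note beats and in seconds.
boundary_beats = [m["beat"] for m in record["markers"]]
boundary_times = [m["time"] for m in record["markers"]]
print(record["split"], boundary_beats, boundary_times)  # train [32.0] [14.37]
```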
Pretrained boundary-detection models and code are available at https://github.com/omareldeeb/midi-msa.
Licensing follows the Lakh MIDI Dataset terms for MIDI files (see Raffel 2016), while annotations and code are MIT-licensed, subject to updates per the repository documentation.
5. Benchmarking and Model Performance
SLMS enables bar-wise binary classification for section boundary detection: given a 3-channel piano-roll patch centered on a bar boundary, a model predicts whether a true section boundary is present.
Primary quantitative results (Tubb + non-Tubb, aggregated):
- Symbolic-MIDI baseline (MobileNetV3): F₁ = 0.7675, Precision 0.7704, Recall 0.7647
- 3-model ensemble: F₁ = 0.7838
- Ablations:
- No ImageNet pretraining: F₁ = 0.7572
- No overtone channel: F₁ = 0.7593
- No overtone + no drum separation: F₁ = 0.7661
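The reported baseline numbers are internally consistent with the usual definition F₁ = 2PR/(P + R):

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Baseline: precision 0.7704, recall 0.7647 -> F1 = 0.7675 as reported.
print(round(f1_score(0.7704, 0.7647), 4))  # 0.7675
```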
Comparison to analogous audio approaches on the same files, rendered to audio with FluidSynth and the Arachno soundfont:
| Method | F₁ (per-measure) | F₁ (±0.5s tol.) |
|---|---|---|
| Symbolic (CNN, 3ch) | 0.7675 | — |
| Supervised Audio (CNN, melspec) | 0.5135 | 0.5523 |
| Unsupervised CBM | 0.4583 | 0.4488 |
By subset:
| Subset | Symbolic F₁ | Audio F₁ | CBM F₁ |
|---|---|---|---|
| Non-Tubb | 0.6981 | 0.4435 | 0.5436 |
| Tubb | 0.8413 | 0.5678 | 0.3718 |
SLMS combined with overtone-aware 3-channel encoding and ImageNet-pretrained MobileNetV3 substantially outperforms both a directly analogous supervised audio-based system and a strong unsupervised block-matching (CBM) segmentation baseline, by margins of +0.25 F₁ (supervised audio) and +0.31 F₁ (CBM) in strict per-measure boundary detection.
6. Research Applications and Domain Significance
SLMS is the largest publicly available, human-segmented, multi-track MIDI corpus, offering per-bar section boundaries and optional section-function labels. Its scope—comprising extensive stylistic diversity, overtone-aware piano-roll encoding, and standardized train/validation/test splits—addresses central limitations of both prior symbolic and audio corpora in MSA.
Practical applications include:
- Training and benchmarking boundary detection models: SLMS supports direct evaluation of symbolic- and audio-domain models in comparative settings.
- MSA and structure-aware symbolic music generation: The granularity and genre range of SLMS facilitate learning structure-sensitive generative models.
- Analysis of model ablations and transfer learning: Rigorous ablation studies and efficacy of ImageNet pretraining highlight SLMS’s utility for understanding cross-domain model transfer and the importance of encoding choices for symbolic music tasks.
A plausible implication is that overtone-aware multi-channel symbolic representations can serve as a superior substrate for deep learning in symbolic music structure tasks, compared to conventional piano-roll or audio-based representations.
7. Limitations and Prospective Developments
SLMS's annotation schema relies exclusively on original author-supplied section markers. No inter-annotator adjudication is performed, in contrast with dual-pass or consensus-labeled corpora. This design choice precludes reporting internal agreement metrics and may introduce annotation idiosyncrasies.
While SLMS’s bar-level granularity and breadth render it uniquely valuable, further research may target multi-granular, consensus-based segmentations, and explore domain adaptation between symbolic and audio-based MSA models using the provided paired data. The public codebase and extensible data structure facilitate continued benchmarking and refinement of both annotation and learning protocols (Eldeeb et al., 20 Sep 2025).