
Segmented Lakh MIDI Subset (SLMS)

Updated 10 March 2026
  • Segmented Lakh MIDI Subset (SLMS) is a curated symbolic music dataset with over 6,000 multi-track MIDI files annotated at the bar level.
  • It employs a rigorous curation process including deduplication, marker-based filtering, and manual DAW validation to ensure high-quality section boundary annotations.
  • Using an overtone-aware 3-channel piano-roll encoding, SLMS facilitates deep learning applications in Music Structure Analysis and robust model benchmarking.

The Segmented Lakh MIDI Subset (SLMS) is a large-scale, human-annotated symbolic music dataset specifically constructed for section boundary detection in multi-track MIDI files. Derived from the Lakh MIDI Dataset (LMD), SLMS comprises 6,134 MIDI files with bar-wise annotations of musical section boundaries and is organized to enable machine learning benchmarks for Music Structure Analysis (MSA) in the symbolic domain. Its design leverages symbolic music’s explicit pitch, timing, and instrumentation representations, introducing both a detailed annotation schema and a novel data encoding that facilitates deep learning applications (Eldeeb et al., 20 Sep 2025).

1. Dataset Composition and Subset Structure

SLMS contains 6,134 curated multi-track MIDI files, split into two non-overlapping subsets:

  • Tubb files: 3,907 files (∼225.1 hours) emphasizing 19th-century popular song arrangements, typically single or small-ensemble.
  • Non-Tubb files: 2,227 files (∼143.5 hours) covering a diverse range of genres, including rock, jazz, symphonic, solo classical piano, metal, and karaoke.

Selection and curation from LMD applied several layers:

  • Deduplication: Employed Malandro’s "CA" deduplication methodology to eliminate near-duplicate files.
  • Marker-based Filtering: Required at least 3 MIDI marker meta-events per file; ensured the ratio of measures to markers is between 6 and 24; enforced at least one marker between the first and last note onsets.
  • Manual Curation: Visual inspection in a Digital Audio Workstation (DAW) verified alignment of markers to perceived section boundaries, supplemented by auditioning where ambiguities arose. Files with incorrectly placed or extraneous markers were removed.
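The marker-based filtering rules above can be sketched as a single predicate. This is a minimal illustration, assuming the ratio bounds are inclusive and that marker positions and note onsets are given in beats; it is not the dataset's actual curation code.

```python
def passes_marker_filter(num_markers, num_measures, marker_beats,
                         first_onset_beat, last_onset_beat):
    """Sketch of the SLMS marker-based filtering rules:
      1. at least 3 marker meta-events,
      2. measures-to-markers ratio between 6 and 24 (assumed inclusive),
      3. at least one marker strictly between the first and last note onsets.
    """
    if num_markers < 3:
        return False
    ratio = num_measures / num_markers
    if not (6 <= ratio <= 24):
        return False
    return any(first_onset_beat < b < last_onset_beat for b in marker_beats)
```

For example, a file with 4 markers over 64 measures (ratio 16) and markers inside the note span passes, while a 200-measure file with the same 4 markers (ratio 50) is rejected.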

Acceptance rates reflect the rigor of filtering: 3,907 of 4,466 Tubb candidates were retained (a 12.5% rejection rate), compared to 2,227 of 3,336 non-Tubb candidates (33.2% rejection).

All standard MIDI program numbers are represented, with drums allocated to a dedicated channel, supporting a broad spectrum of MIDI-instrumented music.

2. Annotation Scheme and Curation Workflow

Annotations in SLMS are strictly derived from the original MIDI author's "marker" meta-events. Markers are deemed valid section boundaries if they coincide with bar-line boundaries, either directly or after quantization, as specified by the MIDI time-signature track. Original text labels (e.g., "Verse," "Chorus," "Bridge") are retained unmodified.

The multi-stage curation protocol is as follows:

  1. Post-filtering, candidate files are listed by marker count.
  2. Each candidate undergoes visual examination within a DAW for marker-to-section correspondence.
  3. When necessary, auditory checks are performed via synthesized playback.
  4. Files displaying missing, erroneous, or non-section markers are excluded; remaining markers are quantized to the nearest bar line.
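The final quantization step can be sketched as follows. This assumes a fixed meter for simplicity; the dataset itself uses the MIDI time-signature track, which may change mid-file.

```python
def quantize_to_bar(marker_beat, beats_per_bar=4.0):
    """Snap a marker's beat position to the nearest bar line (sketch)."""
    bar_index = round(marker_beat / beats_per_bar)
    return bar_index * beats_per_bar
```

For instance, a marker at beat 33.5 in 4/4 snaps to beat 32.0 (the start of bar 9), while one at beat 15.9 snaps forward to beat 16.0.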

Unlike inter-annotator-validated corpora such as SALAMI (which reports ~0.74 F₁ agreement at 0.5 s tolerance), SLMS reports no agreement metrics: only the original author's annotations are used, and no second annotation pass is performed.

Each file’s annotations are delivered in a JSON structure referencing: file_id (MD5 hash), subset, split, and a list of markers with beat (quarter-note units), time (seconds), and label.

3. Data Representation and Encoding for Machine Learning

The SLMS introduces a 3-channel overtone-aware piano-roll encoding for symbolic music processing. For CNN-based section boundary detection, each MIDI file is converted into fixed-size “patches” centered on bar boundaries. Each patch is an array $X \in \mathbb{R}^{3 \times 128 \times 512}$, where:

  • Height: 128 (for MIDI pitches 0–127)
  • Width: 512 time steps (representing 128 quarter notes = 32 bars at 4 ticks per beat)

Channels:

  1. Primary piano roll ($R_1$): aggregates all non-drum melodic/harmonic events.

$$R_1(t, p) = \sum_n v_n \cdot \mathbb{1}\{p_n = p \wedge t_n \leq t < t_n + d_n\}$$

  2. Drum channel ($R_2$): encodes drum note-on events, each held for a 1/16-note duration.

$$R_2(t, p) = \sum_{n \in \text{Drums}} v_n \cdot \mathbb{1}\{p_n = p \wedge t_n \leq t < t_n + \delta\}$$

with $\delta$ equal to one 1/16 note.

  3. Harmonic overtone channel ($R_3$): synthesizes overtones for each non-drum instrument with overtone series $O_p = \{(k_j, \alpha_j)\}_{j=1}^{K}$. For each note $n$:

$$R_3(t, p) = \sum_n \sum_j \alpha_j \, v_n \max\!\left(0,\, 1 - \frac{t - t_n}{d_n}\right) \mathbb{1}\{p = p_{nj} \wedge t_n \leq t < t_n + d_n\}$$

where $p_{nj} = \operatorname{round}(p_n + 12 \log_2 k_j)$.

This 3-channel encoding improves the network’s capacity to model polyphonic texture, rhythmic density, and overtone spectra, serving as input to a MobileNetV3-based CNN.
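As a rough illustration, the three channels can be sketched in NumPy as below. The overtone series, tick resolution, and note tuple format here are simplified assumptions for the sketch, not the dataset's actual implementation.

```python
import numpy as np

TICKS_PER_BEAT = 4          # 4 ticks per quarter note, as in the paper
HEIGHT, WIDTH = 128, 512    # full MIDI pitch range, 32 bars of 4/4

# Hypothetical overtone series: (frequency ratio k_j, weight alpha_j).
OVERTONES = [(1, 1.0), (2, 0.5), (3, 0.33), (4, 0.25)]

def encode_patch(notes, drum_notes):
    """Build a 3-channel piano-roll patch (sketch of the SLMS encoding).

    `notes` / `drum_notes` are lists of (pitch, onset_tick, duration_ticks,
    velocity). Channel 0: melodic/harmonic events held for their duration;
    channel 1: drum onsets held for one 1/16 note; channel 2: synthesized
    overtones with a linear decay over the note duration.
    """
    X = np.zeros((3, HEIGHT, WIDTH), dtype=np.float32)
    sixteenth = max(1, TICKS_PER_BEAT // 4)   # 1 tick at 4 ticks per beat

    for pitch, onset, dur, vel in notes:
        end = min(onset + dur, WIDTH)
        X[0, pitch, onset:end] += vel
        # Overtone channel: pitch-shifted copies with decaying weights.
        for k, alpha in OVERTONES:
            p = int(round(pitch + 12 * np.log2(k)))
            if 0 <= p < HEIGHT:
                t = np.arange(onset, end)
                decay = np.maximum(0.0, 1.0 - (t - onset) / dur)
                X[2, p, onset:end] += alpha * vel * decay

    for pitch, onset, dur, vel in drum_notes:
        end = min(onset + sixteenth, WIDTH)
        X[1, pitch, onset:end] += vel
    return X
```

For example, a middle-C note places energy at pitch 60 in channel 0 and at pitches 60, 72 (octave), 79, and 84 in the overtone channel.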

4. Organization, Accessibility, and Licensing

The public release at https://github.com/m-malandro/SLMS comprises:

  • MIDI files: /SLMS/midi/
  • Annotations: /SLMS/annotations/
  • Train/Val/Test splits: /SLMS/splits/

Annotation and MIDI filenames are consistent, derived from the file’s LMD MD5 hash.

JSON annotation schema:

| Field | Type | Example Value |
|---|---|---|
| file_id | str | "ca05cc474fd2010484c1201bf57b3cfd" |
| subset | str | "Tubb" or "non-Tubb" |
| split | str | "train", "val", or "test" |
| markers | list | [{"beat": 32.0, "time": 14.37, "label": "Verse"}, ...] |
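A minimal sketch of reading one annotation record in this schema follows; apart from the file_id and the "Verse" marker shown above, the marker values are illustrative.

```python
import json

# Hypothetical annotation record following the SLMS JSON schema.
record = json.loads("""{
  "file_id": "ca05cc474fd2010484c1201bf57b3cfd",
  "subset": "Tubb",
  "split": "train",
  "markers": [
    {"beat": 0.0,  "time": 0.0,   "label": "Intro"},
    {"beat": 32.0, "time": 14.37, "label": "Verse"},
    {"beat": 96.0, "time": 43.11, "label": "Chorus"}
  ]
}""")

# Section boundaries as bar indices, assuming 4 beats per bar.
boundary_bars = [int(m["beat"] // 4) for m in record["markers"]]
labels = [m["label"] for m in record["markers"]]
```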

Pretrained boundary-detection models and code are available at https://github.com/omareldeeb/midi-msa.

Licensing follows the Lakh MIDI Dataset terms for MIDI files (see Raffel 2016), while annotations and code are MIT-licensed, subject to updates per the repository documentation.

5. Benchmarking and Model Performance

SLMS enables bar-wise binary classification for section boundary detection: given a 3-channel piano-roll patch centered on a bar boundary, a model predicts whether a true section boundary is present.
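A patch-extraction sketch for this bar-wise setup is shown below; the zero-padding behavior at the edges of the piece is an assumption consistent with the fixed patch width described in Section 3.

```python
import numpy as np

def extract_patch(roll, center_tick, width=512):
    """Slice a fixed-width window centered on a bar boundary (sketch).

    `roll` is a (channels, 128, T) piano roll; windows that overrun
    either end of the piece are zero-padded so every bar boundary
    yields a full-size patch for the classifier.
    """
    c, h, t = roll.shape
    patch = np.zeros((c, h, width), dtype=roll.dtype)
    start = center_tick - width // 2
    lo, hi = max(start, 0), min(start + width, t)
    patch[:, :, lo - start:hi - start] = roll[:, :, lo:hi]
    return patch
```

Each patch is then labeled 1 if the center bar coincides with an annotated section boundary and 0 otherwise.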

Primary quantitative results (Tubb + non-Tubb, aggregated):

  • Symbolic-MIDI baseline (MobileNetV3): F₁ = 0.7675, Precision 0.7704, Recall 0.7647
  • 3-model ensemble: F₁ = 0.7838
  • Ablations:
    • No ImageNet pretraining: F₁ = 0.7572
    • No overtone channel: F₁ = 0.7593
    • No overtone + no drum separation: F₁ = 0.7661
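As a quick consistency check, the baseline F₁ follows from the reported precision and recall via the standard harmonic mean, F₁ = 2PR/(P + R):

```python
precision, recall = 0.7704, 0.7647
f1 = 2 * precision * recall / (precision + recall)
# round(f1, 4) matches the reported baseline F1 of 0.7675
```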

Comparison to analogous audio approaches on the same files, rendered to audio with FluidSynth and the Arachno soundfont:

| Method | F₁ (per-measure) | F₁ (±0.5 s tol.) |
|---|---|---|
| Symbolic (CNN, 3-ch) | 0.7675 | — |
| Supervised audio (CNN, mel-spec) | 0.5135 | 0.5523 |
| Unsupervised CBM | 0.4583 | 0.4488 |

By subset:

| Subset | Symbolic F₁ | Audio F₁ | CBM F₁ |
|---|---|---|---|
| Non-Tubb | 0.6981 | 0.4435 | 0.5436 |
| Tubb | 0.8413 | 0.5678 | 0.3718 |

SLMS combined with overtone-aware 3-channel encoding and ImageNet-pretrained MobileNetV3 substantially outperforms both a directly analogous supervised audio-based system and a strong unsupervised block-matching (CBM) segmentation baseline—by margins of +0.22 F₁ (supervised audio) and +0.31 F₁ (CBM) in strict per-measure boundary detection.

6. Research Applications and Domain Significance

SLMS is the largest publicly available, human-segmented, multi-track MIDI corpus, offering per-bar section boundaries and optional section-function labels. Its scope—comprising extensive stylistic diversity, overtone-aware piano-roll encoding, and standardized train/validation/test splits—addresses central limitations of both prior symbolic and audio corpora in MSA.

Practical applications include:

  • Training and benchmarking boundary detection models: SLMS supports direct evaluation of symbolic- and audio-domain models in comparative settings.
  • MSA and structure-aware symbolic music generation: The granularity and genre range of SLMS facilitate learning structure-sensitive generative models.
  • Analysis of model ablations and transfer learning: Rigorous ablation studies and efficacy of ImageNet pretraining highlight SLMS’s utility for understanding cross-domain model transfer and the importance of encoding choices for symbolic music tasks.

A plausible implication is that overtone-aware multi-channel symbolic representations can serve as a superior substrate for deep learning in symbolic music structure tasks, compared to conventional piano-roll or audio-based representations.

7. Limitations and Prospective Developments

SLMS’s annotation schema relies exclusively on original author-supplied section markers. No inter-annotator adjudication is factored, contrasting with dual-pass or consensus-labeled corpora. This design choice precludes reporting internal agreement metrics and potentially introduces annotation idiosyncrasies.

While SLMS’s bar-level granularity and breadth render it uniquely valuable, further research may target multi-granular, consensus-based segmentations, and explore domain adaptation between symbolic and audio-based MSA models using the provided paired data. The public codebase and extensible data structure facilitate continued benchmarking and refinement of both annotation and learning protocols (Eldeeb et al., 20 Sep 2025).
