Clean MIDI Subset: Methods & Applications

Updated 27 September 2025
  • Clean MIDI Subset is a rigorously curated collection of MIDI files meeting strict criteria for metadata accuracy, expressiveness, and deduplication.
  • It integrates multiple filtering approaches including audio-based detection, multi-stage pruning, and deduplication to ensure high data fidelity.
  • The dataset enhances research in music information retrieval, generative modeling, and computational musicology by providing robust, error-pruned symbolic music data.

A Clean MIDI Subset refers to a rigorously curated collection of MIDI files, selected or modified to optimize data fidelity, metadata accuracy, expressiveness, compatibility, and utility for downstream symbolic music research. The concept is domain-specific, shaped by the goals of dataset construction and the needs of music information retrieval (MIR), generative modeling, or evaluation. Across symbolic music research, several benchmark datasets and methods embody different facets of this “clean” criterion, ranging from deduplication to prevent data leakage, to expressiveness filtering, error-pruned transcription, velocity enrichment, and controlled multi-track composition.

1. Principles and Definitions

A Clean MIDI Subset is characterized by criteria including:

  • Absence of faulty data (e.g., corrupt or unreasonably short/long files)
  • Precise metadata annotations (composer, title, genre, instrumentation)
  • Expressive musical content (with human-like velocity, microtiming, and articulation)
  • Minimal duplication (i.e., no multiple arrangements or copies of the same piece)
  • Compatibility with various symbolic music tasks (MIR, generation, transcription, style analysis)

Depending on the dataset, “clean” may specifically indicate files that are free of corruption, reliably attributed to a composer or work, deduplicated across arrangements, expressively performed rather than mechanically quantized, or pruned of transcription errors.

2. Curated Subset Construction Approaches

Several approaches and pipelines exemplify construction of a Clean MIDI Subset:

  • Audio-based Filtering and Transcription:

In the GiantMIDI-Piano dataset, recordings of solo classical piano works are first detected with a CNN classifier (F1 ≈ 88.14% for solo piano detection). A surname filter on recording titles then increases composer metadata accuracy (e.g., from 37% to 82% for Chopin). The curated subset comprises only those works where both the automatic detection and the composer-surname criteria are satisfied, yielding 7,236 reliably attributed pieces (Kong et al., 2020).
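The surname criterion amounts to a simple string-matching pass over recording titles. A minimal sketch follows; the record fields and composer table are hypothetical placeholders, not the dataset's actual schema:

```python
def surname_filter(records, surnames):
    """Keep only records whose title mentions the composer's surname.

    records: list of dicts with 'title' and 'composer' keys (assumed fields).
    surnames: dict mapping full composer name -> surname.
    """
    kept = []
    for rec in records:
        surname = surnames.get(rec["composer"], "").lower()
        if surname and surname in rec["title"].lower():
            kept.append(rec)
    return kept

# Example: a correctly titled recording passes, an ambiguous one is dropped.
records = [
    {"title": "Chopin: Nocturne Op. 9 No. 2", "composer": "Frederic Chopin"},
    {"title": "Relaxing Piano Compilation", "composer": "Frederic Chopin"},
]
print(surname_filter(records, {"Frederic Chopin": "Chopin"}))
```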

  • Multi-Stage Pruning and Metadata Extraction:

The Aria-MIDI dataset employs an LLM to score YouTube recordings on a 1–5 scale, a CNN-based audio classifier (5-second windows, λ-threshold) with pseudo-labeling via source separation, and segment-based pruning. Files with segment overlap ≥ 94% and no compositional duplication (quantified by matching metadata triples) form a clean subset of 800,973 files (Bradshaw et al., 21 Apr 2025).
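Compositional de-duplication by metadata matching can be sketched as grouping files on a normalized metadata triple and keeping one representative per group. The triple fields below (composer, title, opus) are illustrative assumptions rather than the dataset's exact keys:

```python
from collections import defaultdict

def dedupe_by_metadata_triple(files):
    """files: list of dicts with 'path', 'composer', 'title', 'opus' (assumed fields)."""
    groups = defaultdict(list)
    for f in files:
        key = (
            f.get("composer", "").strip().lower(),
            f.get("title", "").strip().lower(),
            f.get("opus", "").strip().lower(),
        )
        groups[key].append(f)
    # One representative per compositional triple; the rest are treated as duplicates.
    return [group[0] for group in groups.values()]
```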

  • Deduplication and Benchmark Validation:

The LMD-clean subset (17,184 files), with explicit duplicate labeling (10,355 duplicates marked in metadata), serves as a benchmark for de-duplication frameworks. By organizing files by artist and song, clustering alternative versions, and testing various detection algorithms (MIDI-encoding hashes, Beat Position Entropy, Chroma-DTW, MusicBERT/CLaMP embeddings, and contrastively trained BERT augmentations), this benchmark enables systematic assessment of duplicate detection under different precision thresholds and clustering settings (Choi et al., 20 Sep 2025).
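A minimal sketch of embedding-based duplicate detection, assuming per-file embeddings (e.g., from MusicBERT or CLaMP) have already been computed; pairs above a cosine-similarity threshold are flagged as candidate duplicates:

```python
import numpy as np

def find_duplicate_pairs(embeddings, threshold=0.99):
    """embeddings: (n_files, dim) array of per-file embeddings (assumed precomputed)."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = normed @ normed.T  # pairwise cosine similarities
    n = sims.shape[0]
    return [(i, j) for i in range(n) for j in range(i + 1, n) if sims[i, j] >= threshold]
```

A strict threshold such as 0.99 keeps false-positive removals rare, at the cost of missing looser arrangements of the same piece.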

  • Feature-Based Filtering:

MidiCaps begins by cleaning faulty MIDI files (removing endless notes and enforcing duration bounds of 3–900 seconds). Instrument groups, genre, and chord progressions are then extracted, and captions generated via in-context LLM prompting yield a semantically rich dataset for downstream retrieval and cross-modal research (Melechovsky et al., 4 Jun 2024).
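The faulty-file pass can be approximated with a short parsing-and-bounds check. The sketch below uses pretty_midi and treats any note longer than 60 seconds as "endless"; that cap is an assumption chosen here for illustration, not a value from the paper:

```python
import pretty_midi

def is_clean(path, min_dur=3.0, max_dur=900.0, max_note_len=60.0):
    """Return True if the file parses, lies within duration bounds, and has no endless notes."""
    try:
        pm = pretty_midi.PrettyMIDI(path)
    except Exception:
        return False  # corrupt or unparseable file
    if not (min_dur <= pm.get_end_time() <= max_dur):
        return False
    for inst in pm.instruments:
        for note in inst.notes:
            if note.end - note.start > max_note_len:
                return False  # implausibly long ("endless") note
    return True
```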

3. Expressiveness and Musical Quality Filtering

Expressive performance filtering is implemented in GigaMIDI via three heuristics:

  • Distinctive Note Velocity Ratio (DNVR):

$DNVR = (c_{velocity} / 127) \times 100$; higher values denote greater dynamic variation.

  • Distinctive Note Onset Deviation Ratio (DNODR):

$DNODR = (c_{onset} / TPQN) \times 100$; quantifies microtiming.

  • Note Onset Median Metric Level (NOMML):

Assigns hierarchical metric levels to each note onset; the median level acts as a separator between mechanical and expressive tracks (optimal threshold: level 12, at the 63.85th percentile). NOMML achieves 100% classification accuracy for separating expressive from non-expressive tracks.

The resulting expressive MIDI subset contains ~1.65M tracks (31% of GigaMIDI) across all General MIDI instruments, facilitating research in generative modeling, computational musicology, and style analysis (Lee et al., 24 Feb 2025).
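As a concrete illustration of the velocity heuristic, the sketch below computes DNVR under the assumption that $c_{velocity}$ counts the distinct velocity values in a track; a mechanically flat track (e.g., all velocities at 64) scores near the minimum, while a humanly performed track with varied dynamics scores much higher:

```python
def dnvr(velocities):
    """velocities: iterable of MIDI note velocities (1-127) for a single track."""
    distinct = len(set(velocities))      # assumed reading of c_velocity
    return (distinct / 127) * 100

print(round(dnvr([64] * 200), 2))        # ~0.79  (flat, mechanical track)
print(round(dnvr(range(30, 100)), 2))    # ~55.12 (wide dynamic range)
```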

4. Error Analysis, Metadata Reliability, and Quality Metrics

Quality evaluation of Clean MIDI Subsets is typically measured in terms of:

  • Detection Metrics:

F1 score for solo piano detection (~88.14%), precision (~89.66%), recall (~86.67%) (Kong et al., 2020).

  • Metadata Accuracy:

Manual checks (e.g., 97% accuracy with surname constraint) support reliable association of MIDI files to composers or works.

  • Transcription Error Rate:

Alignment-based error rate $ER = (S + D + I) / N$; the GiantMIDI-Piano median ER is 0.154, compared to 0.061 on MAESTRO. The relative error $r = ER_{GiantMIDI} - ER_{MAESTRO} \approx 0.094$ quantifies transcription deviations (a small numeric illustration follows this list).

  • Subjective and Objective Matching:

MidiCaps employs listener studies across dimensions such as key, chord, tempo, genre, and “overall match,” showing that AI-generated captions are comparable to, or slightly better than, human annotations for key and tempo recognition (Melechovsky et al., 4 Jun 2024).

  • Duplication Detection Benchmarks:

LMD-clean guides precision-recall tuning; clustering with strict similarity thresholds (e.g., cosine similarity ≥ 0.99) keeps false-positive duplicate removals to a minimum (Choi et al., 20 Sep 2025).
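As referenced above, the error-rate formula admits a short numeric check; the substitution, deletion, and insertion counts below are invented to reproduce the reported medians and are not actual figures from the papers:

```python
def error_rate(substitutions, deletions, insertions, n_reference):
    """ER = (S + D + I) / N, with N the number of reference notes."""
    return (substitutions + deletions + insertions) / n_reference

print(round(error_rate(80, 40, 34, 1000), 3))  # 0.154 (GiantMIDI-Piano median)
print(round(error_rate(30, 20, 11, 1000), 3))  # 0.061 (MAESTRO median)
```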

5. Velocity Enhancement and Expressive Post-Processing

Many studio-generated MIDI files exhibit flat velocity (typically 64) and therefore lack expressive dynamics. U-Net-based image colorization methods, such as those presented in (He et al., 11 Aug 2025), model MIDI piano rolls as multi-channel images (onset, frame, velocity), employ window attention to handle sparsity, and use a custom combined loss:

$\mathcal{L}_{Comb} = (1 - \alpha)\,\mathcal{L}_{BCE} + \alpha\,(1 - \mathrm{CosSim})$

(masked to note onsets; $\alpha = 0.2$).
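A hedged PyTorch sketch of this combined loss, assuming predicted and target velocity maps are normalized to [0, 1] and that the loss is evaluated only at onset positions; the tensor shapes are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def combined_loss(pred, target, onset_mask, alpha=0.2):
    """pred, target: (batch, H, W) velocity maps in [0, 1];
    onset_mask: (batch, H, W) binary mask of note onsets (assumed layout)."""
    mask = onset_mask.bool()
    pred_m, target_m = pred[mask], target[mask]
    bce = F.binary_cross_entropy(pred_m, target_m)
    cos = F.cosine_similarity(pred_m.unsqueeze(0), target_m.unsqueeze(0), dim=1)
    return (1 - alpha) * bce + alpha * (1 - cos.mean())
```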

Quantitative results include MAE, MSE, recall (within a 10% error tolerance), and the standard deviation of velocity (a proxy for expressiveness). U-Net approaches outperform ConvAE and Seq2Seq baselines and rank higher on listening-based MOS, yielding a transformed “clean” MIDI subset that more closely reflects human performance characteristics.

6. User-Driven Cleaning, Editing, and Generation Workflows

Interactive systems, such as Calliope (Tchemeube et al., 18 Apr 2025), enable users to upload, visualize, and edit MIDI tracks (delete, reorder, fill-in). Global and per-track generative parameters (e.g., temperature, polyphony, “percentage preserved,” batch output) support both manual and automated creation of clean subsets. Export options and DAW streaming facilitate professional integration.

A plausible implication is that hybrid workflows (combining algorithmic heuristics and collaborative editing) can yield subsets optimized for both machine learning and creative deployment.

7. Applications and Future Directions

Clean MIDI Subsets underpin a breadth of applications:

  • Benchmarking expressive composition modeling (multi-instrumental, solo, or multi-track)
  • Reliable evaluation and training for deep generative systems (pretrained models, transformer-based MMM, diffusion architectures)
  • MIR tasks: tagging, style classification, retrieval, cross-modal alignment (text-to-MIDI, MIDI-to-text)
  • Deduplication for data leakage prevention in evaluation splits
  • Musicological analysis (historical, genre, stylistic trends)
  • Dataset bias and provenance investigation
  • Future enhancements: addition of richer metadata, improved source-separation, expressive articulation detection, and multimodal augmentation

Contemporary research points toward more granular, hybrid approaches that integrate heuristic-driven filtering, LLM-based annotation, and data augmentation for robust expressiveness assessment and deduplication. The convergence of automatic transcription, semantic annotation, and controllable generation continues to raise the standard for what constitutes a “clean” subset within symbolic music datasets.
