Similar Music Pair Dataset
- Similar Music Pair Dataset is a curated collection of paired music segments annotated for fine-grained similarity in melody, rhythm, and harmony.
- It supports various MIR tasks such as plagiarism detection, cover song identification, and generative modeling by providing precise segment alignments.
- It employs both manual and algorithmic annotation protocols to ensure reproducible benchmarks across diverse genres and audio formats.
A Similar Music Pair dataset consists of curated pairs of music segments or pieces that are annotated as musically similar according to specific, reproducible criteria. Such datasets constitute a foundational resource for music information retrieval (MIR) research, especially for tasks such as plagiarism detection, cover song identification, machine learning of music similarity, and generative modeling. Distinctions in dataset construction arise from variations in segment granularity (full track, excerpt, or motif), representation (audio, symbolic, hybrid), and annotation (expert, algorithmic, simulated transformations). Leading collections target both real-world and algorithmically simulated cases, provide precise annotation protocols, and are typically released in open formats to facilitate benchmarking and methodological innovation.
1. Definitions and Core Principles
A Similar Music Pair is composed of two music items—typically either entire pieces or precisely annotated segments—designated as "original" and "comparison." The similarity relation is established based on musical content, including melody, harmony, rhythm, or stylistic attributes, according to explicit annotation rules.
Primary relation tags include:
- Plagiarism Case: Pairs where similarity has prompted legal or public claims of copyright infringement.
- Remake: Official remixes, re-recordings, or covers by affiliated artists.
- Variation: In composition-centric datasets, a performed or generated reinterpretation of a source phrase.
The explicit objective is to support fine-grained similarity analysis at the level of musical content and structure. Similar Music Pair datasets are indispensable for evaluating feature engineering, machine learning, and statistical modeling approaches in MIR, and for facilitating reproducible research by providing standardized tasks and evaluation sets.
2. Major Datasets: Properties and Comparisons
Recent research initiatives have produced several notable Similar Music Pair datasets, differentiated by their data sources, annotation rigor, and intended applications.
| Dataset | Pairs/Segments | Annotation Type | Format(s) | Domain |
|---|---|---|---|---|
| SMP (Go & Kim ICASSP 2026, MIPPIA) (Go et al., 29 Jan 2026) | 72 pairs, avg. ~149 segments/piece | Manual, expert, segment-level w/ binary labels | Audio (WAV), segment CSV, features | Pop, rock, jazz, folk (real) |
| SMP (MIPPIA 2025) (Go, 10 Sep 2025) | 70 pairs, 1+ segments/pair | Manual, expert, start/end times | Audio (WAV/MP3), metadata CSV/JSON | Broad, real-world |
| JAZZVAR (Row et al., 2023) | 502 pairs | Manual, performer-matched, symbolic | MIDI | Jazz (variation) |
| MPD-Set (Liu et al., 2021) | 1,000 pairs (simulated) + 29 real | Programmatic + real case | MIDI | Multigenre simulated + real |
| MelodySim (Lu et al., 27 May 2025) | >1M segment pairs (95k+ test) | Algorithmic (augmented), user study | Audio (WAV), MIDI, JSON | Multigenre, melody-centric |
These datasets span real-world case studies, symbolic simulation, and large-scale generation via augmentation. The SMP datasets (Go et al., 29 Jan 2026, Go, 10 Sep 2025) emphasize authentic, expert-annotated segment boundaries and leverage MIR pipelines for segment transcription and feature extraction. MPD-Set (Liu et al., 2021) provides controlled, transformation-based positive pairs for benchmarking algorithmic robustness. JAZZVAR (Row et al., 2023) targets musical variation within jazz performance, facilitating studies of stylistic reinterpretation rather than copyright. MelodySim (Lu et al., 27 May 2025) focuses on melodic preservation and is validated both algorithmically and via user studies.
3. Annotation Protocols and Segment Alignment
Annotation in Similar Music Pair datasets is distinguished by its focus on fine-grained segment alignment rather than coarse, whole-piece comparison. In the SMP datasets (Go et al., 29 Jan 2026, Go, 10 Sep 2025), musically trained annotators listen to both tracks, identify intervals where the topline melody, chord progression, or rhythmic motifs are “substantially identical or highly analogous,” and mark segment boundaries (typically to 0.1 s precision) aligned with beat/downbeat estimates.
In the latest SMP (Go & Kim ICASSP 2026), audio is processed into fixed-length 4-bar segments using beat tracking derived from demixed stems. Annotators provide two lists of aligned segment onset times per pair: one for the original track and one for the comparative track, each entry corresponding to an aligned annotated interval.
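The beat-aligned segmentation described above can be sketched as follows. This is an illustrative reconstruction, not the released SMP pipeline: the function names, the 4/4 assumption, and the snapping logic are all assumptions.

```python
# Hypothetical sketch: group downbeat estimates into fixed 4-bar windows,
# then map an annotated onset time to the window that contains it.
# Names and logic are illustrative, not from the SMP release.

def four_bar_segments(downbeats):
    """Group consecutive downbeats into 4-bar windows as (start, end) times."""
    return [(downbeats[i], downbeats[i + 4])
            for i in range(0, len(downbeats) - 4, 4)]

def snap_onset(onset, segments):
    """Return the index of the segment whose span contains the onset, else None."""
    for idx, (start, end) in enumerate(segments):
        if start <= onset < end:
            return idx
    return None

# Toy example: a downbeat every 2 s (4/4 at 120 BPM), 16 bars total.
downbeats = [i * 2.0 for i in range(17)]
segments = four_bar_segments(downbeats)   # four 4-bar windows
idx = snap_onset(9.3, segments)           # onset inside the second window
```

Snapping annotated onsets to these windows is what lets segment-level labels be compared across tracks with different tempi.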
Inter-annotator agreement is systematically evaluated (Cohen’s κ up to 0.82 for boundary selection (Go et al., 29 Jan 2026)), ensuring reliability.
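The κ statistic quoted above can be computed as in the following sketch, which treats each candidate boundary as a binary decision per annotator; the label sequences are illustrative.

```python
# Cohen's kappa for two annotators' binary boundary decisions, as used to
# quantify inter-annotator agreement. The toy labels below are illustrative.

def cohens_kappa(a, b):
    """Cohen's kappa for two equal-length binary label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n        # observed agreement
    p_a1 = sum(a) / n                                   # P(annotator A marks 1)
    p_b1 = sum(b) / n                                   # P(annotator B marks 1)
    p_e = p_a1 * p_b1 + (1 - p_a1) * (1 - p_b1)         # chance agreement
    return (p_o - p_e) / (1 - p_e)

# Whether each of 8 candidate beats is a segment boundary, per annotator.
ann_a = [1, 0, 0, 1, 1, 0, 0, 0]
ann_b = [1, 0, 0, 1, 0, 0, 0, 1]
kappa = cohens_kappa(ann_a, ann_b)
```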
Algorithmic or programmatic annotation is used in datasets like MPD-Set and MelodySim, where relations are constructed by applying controlled transformations to symbolic representations and tracking the augmentation lineage. In MelodySim, positive pairs consist of 10-second time-aligned segments across different augmentation versions of the same musical work, while negative pairs are cross-piece. Labels are validated through user studies, confirming that augmentations preserve melody and disrupt other musical attributes (Lu et al., 27 May 2025).
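The lineage-based labeling scheme just described can be sketched as below. The catalog layout and identifiers are hypothetical, not the released MelodySim schema; the point is only that time-aligned segments across versions of one piece become positives, and cross-piece segments become negatives.

```python
# Illustrative sketch of augmentation-lineage labeling: segments at the same
# time offset in different augmented versions of one piece are positive pairs;
# segments from different pieces are negatives. Data layout is hypothetical.

from itertools import combinations

def build_pairs(catalog):
    """catalog: {piece_id: {version_id: [segment ids in time order]}}."""
    positives, negatives = [], []
    for piece, versions in catalog.items():
        for v1, v2 in combinations(versions, 2):
            for s1, s2 in zip(versions[v1], versions[v2]):
                positives.append((s1, s2, 1))   # time-aligned, same piece
    for p1, p2 in combinations(list(catalog), 2):
        seg1 = next(iter(catalog[p1].values()))
        seg2 = next(iter(catalog[p2].values()))
        negatives.append((seg1[0], seg2[0], 0))  # cross-piece
    return positives, negatives

catalog = {
    "trackA": {"v0": ["A-v0-s0", "A-v0-s1"], "v1": ["A-v1-s0", "A-v1-s1"]},
    "trackB": {"v0": ["B-v0-s0", "B-v0-s1"]},
}
pos, neg = build_pairs(catalog)
```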
4. Data Representation, Formats, and Access
Dataset representations are optimized for both MIR development and replicability. All major datasets provide segment-aligned metadata in structured CSV or JSON files; audio is supplied in canonical formats (typically, stereo 44.1 kHz WAV), while symbolic data (MIDI or MusicXML) is available in simulation-oriented corpora.
File hierarchies are designed for efficient mapping between audio/MIDI segments and metadata:
- SMP: `/audio/original/`, `/audio/comparative/`, `annotations/smp_pairs.csv` (Go et al., 29 Jan 2026, Go, 10 Sep 2025)
- MelodySim: `Track_{ID}/version_{0–3}/segment_{00–NN}.wav` (Lu et al., 27 May 2025)
Annotations include:
- Pair and segment indices
- Titles or file identifiers
- Aligned segment times (original and comparative)
- Relation label (e.g. Plagiarism, Remake, Variation)
- Further feature matrices (chromagrams, pianoroll, chord sequences) in some SMP versions
SMP (Go & Kim ICASSP 2026) additionally provides feature arrays in pianoroll_npz, chroma_npy, and spectrogram_npy formats for downstream analysis (Go et al., 29 Jan 2026).
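A minimal sketch of consuming such segment-aligned metadata is shown below. The column names (`pair_id`, `orig_start`, `comp_start`, `label`) are assumptions for illustration; the released CSVs may use different headers.

```python
# Hypothetical sketch of parsing a segment-pair annotation CSV of the kind
# described above. Column names are illustrative, not the released schema.

import csv
import io

SAMPLE = """pair_id,title_original,title_comparative,orig_start,comp_start,label
0,Song X,Song Y,12.4,33.0,Plagiarism
0,Song X,Song Y,45.1,66.2,Plagiarism
1,Tune A,Tune B,8.0,8.5,Remake
"""

def load_pairs(text):
    """Group annotated segment alignments by pair id."""
    pairs = {}
    for row in csv.DictReader(io.StringIO(text)):
        pairs.setdefault(row["pair_id"], []).append(
            (float(row["orig_start"]), float(row["comp_start"]), row["label"]))
    return pairs

pairs = load_pairs(SAMPLE)
# Feature arrays (e.g. chroma_npy, pianoroll_npz) would typically be loaded
# per segment with numpy.load and indexed by the same pair/segment ids.
```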
5. Quantitative Similarity Metrics and Benchmarking
Similarity between segments is evaluated using a range of explicit music-domain and machine learning metrics. Notable metrics and formulations include:
- Musicological similarity:
  - Pattern similarity via chromagram intersection
  - Rhythmic correlation as a Jaccard index on quantized onset grids
  - Chord similarity $S_{\text{chord}} = w_r S_{\text{RN}} + w_q S_{\text{Q}}$, where $S_{\text{RN}}$ is the Roman-numeral harmony similarity, $S_{\text{Q}}$ the chord-quality similarity, and $w_r, w_q$ are weighting hyperparameters (Go, 10 Sep 2025)
- Comprehensive similarity aggregation: a weighted sum of the component similarities above, with the weights fixed as hyperparameters (Go, 10 Sep 2025)
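The musicological metrics above can be sketched in plain Python. The exact formulations in the cited papers are not reproduced here: the intersection-over-union reading of "chromagram intersection" and the default weights are assumptions.

```python
# Assumed formulations of the musicological metrics: chroma similarity as
# intersection-over-union of 12-bin chroma vectors, rhythmic similarity as
# Jaccard on quantized onset grids, chord similarity as a weighted sum.

def chroma_similarity(c1, c2):
    """Intersection-over-union of two nonnegative chroma vectors."""
    inter = sum(min(a, b) for a, b in zip(c1, c2))
    union = sum(max(a, b) for a, b in zip(c1, c2))
    return inter / union if union else 0.0

def rhythm_jaccard(onsets1, onsets2):
    """Jaccard index over sets of quantized onset positions."""
    s1, s2 = set(onsets1), set(onsets2)
    return len(s1 & s2) / len(s1 | s2) if s1 | s2 else 0.0

def chord_similarity(s_roman, s_quality, w_r=0.7, w_q=0.3):
    """Weighted combination of Roman-numeral and chord-quality similarity."""
    return w_r * s_roman + w_q * s_quality

sim = chroma_similarity([1.0, 0.0, 0.5], [0.5, 0.0, 0.5])
rj = rhythm_jaccard([0, 4, 8, 12], [0, 4, 8, 14])
```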
- Embedding-based similarity (deep learning): cosine similarity between learned segment embeddings in a Siamese or triplet-loss architecture, $\cos(\mathbf{e}_1, \mathbf{e}_2) = \dfrac{\mathbf{e}_1 \cdot \mathbf{e}_2}{\lVert \mathbf{e}_1 \rVert \, \lVert \mathbf{e}_2 \rVert}$
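For concreteness, cosine similarity between two embedding vectors reduces to the following; the vectors here are toy values standing in for learned embeddings.

```python
# Cosine similarity between two segment embeddings, as used in Siamese or
# triplet-loss retrieval setups; plain-Python version for illustration.

import math

def cosine_similarity(e1, e2):
    """Dot product of e1 and e2 divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(e1, e2))
    n1 = math.sqrt(sum(a * a for a in e1))
    n2 = math.sqrt(sum(b * b for b in e2))
    return dot / (n1 * n2)

sim_same = cosine_similarity([1.0, 2.0, 3.0], [2.0, 4.0, 6.0])  # parallel
sim_orth = cosine_similarity([1.0, 0.0], [0.0, 1.0])            # orthogonal
```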
- Graph-based matching (MPD-Set): bipartite maximum matching between overlapping segment pairs, with musical edit distances transformed to similarity scores and summed to yield a global plagiarism degree for the pair (Liu et al., 2021)
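The global-score idea can be illustrated with a toy maximum-weight assignment. Brute force over permutations is only viable for tiny matrices and stands in for a proper assignment algorithm (e.g., Hungarian); the similarity matrix is invented.

```python
# Toy sketch of scoring a pair globally via a maximum-weight bipartite
# matching between candidate segment pairs, in the spirit of the MPD-Set
# formulation. Brute force over permutations; real systems would use a
# polynomial-time assignment algorithm.

from itertools import permutations

def max_matching_score(sim):
    """sim[i][j]: similarity of original segment i and comparative segment j.
    Returns the best total score over one-to-one assignments (square matrix)."""
    n = len(sim)
    return max(sum(sim[i][p[i]] for i in range(n))
               for p in permutations(range(n)))

# Segment-level similarities derived from (inverted) musical edit distances.
sim = [[0.9, 0.1, 0.2],
       [0.2, 0.8, 0.3],
       [0.1, 0.4, 0.7]]
score = max_matching_score(sim)   # best assignment is the diagonal here
```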
- Evaluation metrics: mean average precision (mAP), mean rank of the first correct match (MR1), segment-level recall-at-K (e.g., Rec.1s@K), Average Ranking Index (ARI), and classification accuracy
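As one concrete example of these retrieval metrics, recall-at-K can be computed as below; the queries and rankings are invented for illustration.

```python
# Sketch of recall-at-K: the fraction of queries whose correct match
# appears among the top-K retrieved candidates. Data is illustrative.

def recall_at_k(rankings, k):
    """rankings: {query: (ranked candidate list, correct candidate)}."""
    hits = sum(1 for ranked, correct in rankings.values()
               if correct in ranked[:k])
    return hits / len(rankings)

rankings = {
    "q1": (["a", "b", "c"], "a"),   # correct match ranked first
    "q2": (["b", "c", "a"], "a"),   # correct match ranked third
    "q3": (["c", "a", "b"], "a"),   # correct match ranked second
}
r1 = recall_at_k(rankings, 1)
r2 = recall_at_k(rankings, 2)
```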
- Consensus and alignment: for symbolic datasets, melodic alignment via dynamic programming (e.g., Needleman–Wunsch in JAZZVAR (Row et al., 2023)), with scoring by average deviation
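A minimal Needleman–Wunsch alignment over pitch sequences, of the kind used for melodic alignment in symbolic datasets, looks as follows; the match/mismatch/gap scores are assumed values, not those of any cited system.

```python
# Minimal Needleman-Wunsch global alignment over MIDI pitch sequences.
# Scoring values (match=1, mismatch=-1, gap=-1) are illustrative assumptions.

def needleman_wunsch(seq1, seq2, match=1, mismatch=-1, gap=-1):
    """Return the optimal global alignment score of two sequences."""
    n, m = len(seq1), len(seq2)
    dp = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        dp[i][0] = i * gap
    for j in range(1, m + 1):
        dp[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if seq1[i - 1] == seq2[j - 1] else mismatch
            dp[i][j] = max(dp[i - 1][j - 1] + sub,  # match / substitution
                           dp[i - 1][j] + gap,       # gap in seq2
                           dp[i][j - 1] + gap)       # gap in seq1
    return dp[n][m]

# Two melodic phrases (MIDI pitches) differing by one inserted passing note.
score = needleman_wunsch([60, 62, 64, 65], [60, 62, 63, 64, 65])
```

Four matches and one gap under these scores give an alignment score of 3; a backtrace through `dp` would recover the note-level alignment itself.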
6. Construction, Limitations, and Best Practices
The construction of a Similar Music Pair dataset is typically guided by principles of authenticity (real-world, expert or case-based selection), diversity of genre and transformation, and segment-level annotation.
Limitations observed across current datasets include:
- Restricted number of real-world expert-annotated pairs in SMP (70–72 pairs (Go, 10 Sep 2025, Go et al., 29 Jan 2026)), limiting scalability for deep learning.
- Categorical (binary) relational tags without continuous similarity gradation.
- Incomplete genre representation; non-Western pop, jazz, and experimental styles are less frequent.
- Coarse timestamp granularity; absence of note-level alignment in some cases.
- Lack of built-in train/validation/test splits; users must employ cross-validation or custom partitioning (Go, 10 Sep 2025, Go et al., 29 Jan 2026).
- Instability in chaining MIR tools during segment extraction; recommended practice is to fix external tool versions for reproducibility.
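Given the absence of official splits noted above, one common workaround is a deterministic pair-level partition, keeping both tracks of a pair in the same fold so no pair leaks across the train/test boundary. The sketch below is illustrative, not a prescribed SMP protocol.

```python
# Pair-level k-fold partition: shuffle pair ids with a fixed seed and deal
# them into k folds, so both tracks of each pair land in the same fold.

import random

def pair_level_folds(pair_ids, k=5, seed=42):
    """Deterministically shuffle pair ids and deal them into k folds."""
    ids = sorted(pair_ids)
    random.Random(seed).shuffle(ids)
    return [ids[i::k] for i in range(k)]

# e.g. 70 annotated pairs, as in the SMP (MIPPIA 2025) release.
folds = pair_level_folds(range(70), k=5)
```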
Augmented or simulated datasets (e.g., MPD-Set, MelodySim) offer scale and controlled variation but require careful validation to ensure musical plausibility, as performed for MelodySim via multi-aspect human ratings (Lu et al., 27 May 2025).
Best practices for dataset usage and extension include integrating Similar Music Pair datasets with complementary corpora (e.g., Covers80 for cover songs), and supplementing algorithmic detection with human-in-the-loop verification for adjudicating ambiguous cases.
7. Applications and Research Impact
Similar Music Pair datasets support a broad spectrum of MIR and musical AI research, including:
- Segment-level music plagiarism detection and copyright risk assessment (Go, 10 Sep 2025, Go et al., 29 Jan 2026)
- Melodic and harmonic similarity learning
- Generative modeling of musical style transfer and variation (e.g., music overpainting, JAZZVAR (Row et al., 2023))
- Fine-grained music retrieval, cover/arrangement detection, and query-by-example
- Benchmark evaluation for deep metric learning, embedding models, and MIR pipelines
Such datasets are critical for robust, comparable assessment of machine-based similarity, enabling reproducible research and establishing objective criteria for musicological and legal evaluation in music retrieval and copyright adjudication (Liu et al., 2021, Lu et al., 27 May 2025).