HookTheory Lead Sheet Dataset

Updated 29 January 2026
  • The HookTheory Lead Sheet Dataset (HLSD) is a curated collection of 18,843 annotated music sections for melody harmonization and structure analysis research.
  • It includes symbolic lead-sheet data (melody, chord symbols, key, and mode) and precise timestamps, with padded audio excerpts supporting improved boundary detection.
  • Integrated into the MuSFA framework, the dataset improves section-labeling and boundary-detection metrics across multi-corpus training regimes.

The HookTheory Lead Sheet Dataset (HLSD) is a collection of 18,843 annotated music sections originally aggregated by the HookTheory community to facilitate research in automatic melody harmonization. HLSD contains both symbolic lead-sheet data—encompassing melody, chord, key, and mode information—and temporal alignments to commercial audio recordings. Each entry reflects a musically coherent segment, such as a verse or chorus, and is supplied with onset and offset timestamps. The dataset’s structure allows it to be exploited for supervised learning tasks in music structure analysis (MSA), as exemplified by its recent application in the Music Structural Function Analysis (MuSFA) framework, where it informs section labeling and boundary detection directly from audio. HLSD is distributed as open annotation metadata via a public repository, with audio obtained separately due to copyright restrictions (Wang et al., 2022).

1. Dataset Structure and Musical Content

HLSD comprises 18,843 distinct music "sections," each representing a self-contained excerpt (commonly a verse, chorus, or intro) contributed by HookTheory users. The dataset’s primary use case was originally for studies in automatic melody harmonization (Yeh et al., JNMR 2021). Each record provides:

  • Symbolic lead-sheet data (melody, chord symbols, key, mode)
  • Section label (free-form, e.g., "verse-and-pre-chorus")
  • Start and end timestamps referencing a commercial audio recording

The HLSD metadata repository (https://github.com/wayne391/lead-sheet-dataset) includes XML/PDF lead sheets and a CSV index with annotations. Only the relevant excerpt within each song is annotated; full-song segmentations are not present. No MIDI or notation reprocessing is performed by downstream frameworks such as MuSFA—audio clips are extracted using the provided timestamps.
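Since downstream frameworks extract audio clips directly from the provided timestamps, the core operation is a simple time-to-sample slice. The sketch below illustrates this under stated assumptions: the field names `start_time`/`end_time` follow the CSV index described above, and the waveform is a plain list of samples (a real pipeline would load decoded audio with an audio library).

```python
# Sketch: extracting an annotated HLSD excerpt from a full song using
# its start/end timestamps. The waveform is a toy list of samples;
# field names are assumptions based on the CSV index described above.

def extract_section(waveform, sr, start_time, end_time):
    """Return the samples covering [start_time, end_time] seconds."""
    start = int(round(start_time * sr))
    end = int(round(end_time * sr))
    # Clamp to the song boundaries so malformed timestamps cannot
    # index outside the recording.
    start = max(0, start)
    end = min(len(waveform), end)
    return waveform[start:end]

sr = 100                          # toy sampling rate (samples/second)
song = list(range(60 * sr))       # a 60-second "song"
chorus = extract_section(song, sr, 12.5, 20.0)
print(len(chorus) / sr)           # → 7.5
```

The clamping step matters because only the annotated excerpt is guaranteed to lie within the recording; everything outside it is unannotated song audio.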

To conform with the fixed-length chunking protocols of deep models, each section is padded with a randomly sampled 8–12 seconds of audio on both sides (without exceeding song boundaries), extending each excerpt by 16–24 seconds in total so it can fill the model's fixed-length inputs. Padding regions are unlabeled; they serve exclusively for boundary detection and are excluded from the section-function loss computation.
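The padding protocol can be sketched as follows. The exact sampling scheme (independent uniform draws per side) is an assumption; the source specifies only the 8–12 s range and the clipping to song boundaries.

```python
# Sketch of the HLSD padding protocol: extend each section by a random
# 8-12 s of audio on both sides, clipped to the song's extent.
# Independent uniform draws per side are an assumption.

import random

def pad_section(start, end, song_dur, lo=8.0, hi=12.0, rng=random):
    left = rng.uniform(lo, hi)
    right = rng.uniform(lo, hi)
    pad_start = max(0.0, start - left)
    pad_end = min(song_dur, end + right)
    # Only [start, end] keeps its section label; the padding is
    # unlabeled and contributes solely to boundary detection.
    return pad_start, pad_end

rng = random.Random(0)
s, e = pad_section(40.0, 55.0, song_dur=180.0, rng=rng)
print(s < 40.0 < 55.0 < e)  # → True
```

Randomizing the pad length per side keeps the labeled section from always sitting at a fixed offset inside the model input, which is what diversifies boundary positions during training.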

2. Label Mapping and Vocabulary Collapsing

HLSD’s original section labels are free-form and diverse, necessitating mapping to a target vocabulary for MSA tasks. In MuSFA, a unified seven-class taxonomy is utilized:

  • intro
  • verse
  • chorus
  • bridge
  • instrumental ("inst")
  • outro
  • other

Label mapping is performed as follows:

| HLSD label examples | Mapped MuSFA class |
| --- | --- |
| chorus, chorus-lead-out, theme, verse-and-chorus | chorus |
| verse, development, verse-and-pre-chorus, pre-chorus | verse |
| instrumental, lead-in-alt, lead-in, loop, solo | inst |
| bridge, variation | bridge |
| intro, intro-and-chorus, intro-and-verse | intro |
| outro, pre-outro | outro |

Spans outside mapped sections (arising from padding) are considered "unknown" for function-loss purposes. Any HLSD annotation not matching the mapping is either discarded or assigned "other," with such cases omitted from MuSFA training.

3. Dataset Integration and Sampling in Multi-Corpus Training

MuSFA combines HLSD with multiple fully annotated song datasets: Harmonix (912 songs), SALAMI-pop (274), RWC-Pop (100), and Isophonics (277), totaling approximately 1,563 complete songs. Each HLSD excerpt is treated as an independent, partially labeled song for model training. Sampling is implemented through uniform mini-batch mixing of HLSD excerpts and full-song excerpts from the other corpora, with no explicit up- or down-sampling. The random padding of HLSD sections diversifies the temporal context available for boundary detection, reducing overfitting to predictable boundary positions.
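The uniform mini-batch mixing described above can be sketched as pooling both kinds of training items and drawing batches without any reweighting, so each corpus contributes in proportion to its size. The function below is an illustrative assumption, not MuSFA's actual data loader.

```python
# Minimal sketch of uniform mini-batch mixing: HLSD excerpts and
# full-song excerpts share one shuffled pool, with no explicit up- or
# down-sampling of either source.

import random

def mixed_batches(hlsd_items, full_song_items, batch_size, seed=0):
    pool = [(x, "hlsd") for x in hlsd_items] + \
           [(x, "full") for x in full_song_items]
    rng = random.Random(seed)
    rng.shuffle(pool)
    for i in range(0, len(pool) - batch_size + 1, batch_size):
        yield pool[i:i + batch_size]

batches = list(mixed_batches(range(100), range(50), batch_size=10))
print(len(batches))  # → 15
```

With 100 HLSD items and 50 full-song items, HLSD naturally dominates each batch roughly 2:1, which is the intended effect of mixing without reweighting.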

4. Training Protocols, Loss Functions, and Evaluation Metrics

MuSFA employs a SpecTNT time–frequency Transformer backbone to process fixed-length audio segments (24 or 36 seconds), with two parallel prediction heads:

  1. Boundary activation curve $\hat{b}(t) \in [0,1]$, trained using binary cross-entropy against reference boundaries $b(t)$.
  2. Section-function activation $\hat{f}(t, c)$ over $C = 7$ classes, trained via multi-class cross-entropy on one-hot section labels $y(t, c)$.

The loss functions are as follows:

  • Boundary: $L_\text{boundary} = -\sum_t \big[ b(t) \log \hat{b}(t) + (1 - b(t)) \log(1 - \hat{b}(t)) \big]$
  • Section-function: $L_\text{function} = -\sum_{t \in \text{Labeled}} \sum_c y(t,c) \log \hat{f}(t,c)$, where the labeled index set excludes padded regions
  • Total: $L = L_\text{boundary} + \lambda L_\text{function}$, with $\lambda = 1$
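The two loss terms can be written out directly with NumPy, as below. The boolean mask implements the exclusion of padded regions from the section-function term; the small epsilon for numerical stability is an implementation assumption.

```python
# Sketch of the two losses: binary cross-entropy over all frames for
# boundaries, masked multi-class cross-entropy for section functions.

import numpy as np

def boundary_loss(b_hat, b, eps=1e-9):
    # Binary cross-entropy summed over every frame t.
    return -np.sum(b * np.log(b_hat + eps)
                   + (1 - b) * np.log(1 - b_hat + eps))

def function_loss(f_hat, y, labeled_mask, eps=1e-9):
    # Per-frame multi-class cross-entropy, summed only over frames
    # marked as labeled (padded regions are excluded).
    ce = -np.sum(y * np.log(f_hat + eps), axis=1)
    return np.sum(ce[labeled_mask])

def total_loss(b_hat, b, f_hat, y, labeled_mask, lam=1.0):
    return boundary_loss(b_hat, b) + lam * function_loss(f_hat, y, labeled_mask)

# Toy example: 4 frames, 7 classes, last two frames are padding.
T, C = 4, 7
y = np.eye(C)[[1, 2, 2, 0]]                   # one-hot labels per frame
f_hat = np.full((T, C), 1 / C)                # uniform predictions
mask = np.array([True, True, False, False])   # padding excluded
L_f = function_loss(f_hat, y, mask)
```

With uniform predictions, each labeled frame contributes $\log 7$, so `L_f` equals $2 \log 7$ here; an all-false mask yields exactly zero, matching the definition that padded spans carry no function loss.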

Evaluation utilizes:

  • HR.5F: F₁ score of boundaries detected within a ±0.5 s tolerance
  • ACC: frame-wise accuracy of function-class predictions, $\text{ACC} = \frac{1}{T} \sum_t \delta[\hat{c}(t) = c^*(t)]$
  • CHR.5F: F₁ of chorus boundary detection (±0.5 s)
  • CF1: F₁ over frame pairs for chorus/non-chorus, per Bellman & Paulus (2021)
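The boundary metric can be illustrated with a simplified tolerance-matched F₁, below. This is a greedy one-to-one matching sketch for intuition only; standard evaluations use the established implementation in `mir_eval`.

```python
# Simplified illustration of HR.5F: boundary F1 with a ±0.5 s
# tolerance, using greedy one-to-one matching between estimated and
# reference boundary times. Not the canonical implementation.

def boundary_f1(est, ref, tol=0.5):
    est, ref = sorted(est), sorted(ref)
    used = [False] * len(ref)
    matched = 0
    for e in est:
        for j, r in enumerate(ref):
            if not used[j] and abs(e - r) <= tol:
                used[j] = True
                matched += 1
                break
    if not est or not ref:
        return 0.0
    precision = matched / len(est)
    recall = matched / len(ref)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Third boundary misses by 0.6 s, outside the ±0.5 s window.
print(round(boundary_f1([10.2, 30.6, 55.0], [10.0, 30.3, 54.4]), 3))  # → 0.667
```

The tight ±0.5 s window is what makes HR.5F sensitive to the exact boundary placement that HLSD's padded excerpts help the model learn.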

5. Quantitative Impact and Cross-Dataset Gains

Incorporating HLSD into MuSFA training yields measurable improvements across all major structure analysis metrics. On four-fold cross-validation of Harmonix:

| Metric | Baseline | +HLSD | Absolute gain |
| --- | --- | --- | --- |
| HR.5F | 0.570 | 0.595 | +0.025 (4.4%) |
| ACC | 0.701 | 0.714 | +0.013 (1.3%) |
| CHR.5F | 0.501 | 0.512 | +0.011 (1.1%) |
| CF1 | 0.815 | 0.820 | +0.005 (0.6%) |

Cross-corpus averages (SALAMI-pop, RWC-Pop, Isophonics) report absolute boundary detection gains of approximately 3% and section labeling improvements near 1%. This suggests HLSD’s large volume of partially labeled examples augments generalization, especially for boundary detection, without requiring exhaustive full-song annotation (Wang et al., 2022).

6. Data Access, Reproducibility, and Licensing Constraints

HLSD’s annotation metadata is publicly accessible at https://github.com/wayne391/lead-sheet-dataset, containing 18,843 lines in the form: song_id, section_label, start_time, end_time, and audio reference or YouTube link. To reproduce the HLSD splits used in MuSFA, the same 8–12 s random padding and minimum excerpt duration (30 s) protocols must be applied to the audio sourced from originals. Only annotation metadata is distributed; researchers must procure audio via the cited recordings, in adherence to copyright law and HookTheory’s TheoryTab Terms of Service. For research, the metadata is open under a CC-BY style license, with audio rights remaining with the respective rights-holders.

7. Significance and Reusability in Music Structure Analysis

The HLSD exemplifies a pragmatic approach to curating large-scale, partially annotated musical datasets that are directly applicable to contemporary deep supervised frameworks. By enabling granular section-level labeling anchored to commercial audio and standardized mapping protocols, HLSD facilitates data augmentation, label unification, and improved cross-corpus generalization in music structure analysis. A plausible implication is that similarly constructed partial-annotation datasets, when appropriately mapped and padded, can offer substantial utility for semi-supervised or weakly supervised learning approaches in Music Information Retrieval (MIR). Future expansions in annotation coverage or the integration of full-song boundary information may additionally enhance the dataset’s value as a community standard (Wang et al., 2022).
