Verse-Bench: Semantic Music Structure Analysis
- Verse-Bench is a standardized benchmark framework designed to evaluate semantic music structure analysis by identifying functional song segments such as verse and chorus.
- It consolidates diverse public corpora using a unified 7-class taxonomy and a substring-matching algorithm achieving approximately 99.3% annotation coverage.
- SpecTNT, a hierarchical Transformer model, jointly processes spectral and temporal features, outperforming previous baselines in boundary and chorus detection.
Verse-Bench is a standardized benchmark framework designed for evaluating algorithmic comprehension and analysis of semantic musical structure, with particular focus on the identification of functionally meaningful song segments such as verse and chorus directly from audio. It formalizes a joint prediction task and provides unified data, taxonomy, and evaluation protocols for reproducible, cross-dataset assessment of semantic music structure analysis systems (Wang et al., 2022).
1. Data Sources and Taxonomy
Verse-Bench consolidates annotations from four principal public corpora: Harmonix Set (912 songs), SALAMI-pop (274 songs, 445 annotations), RWC-Pop (100 songs), and Isophonics (174 Beatles songs). These cover Western popular and Beatles music with various annotation conventions—beat/downbeat, segment boundaries, and functional segment labels. To enable joint, semantic modeling, Verse-Bench adopts a unified 7-class taxonomy: intro, verse, chorus, bridge, instrumental, outro, and silence. A substring-matching algorithm (Algorithm 1) harmonizes disparate raw segment labels into these canonical classes, with a coverage rate of ∼99.3% over all corpus annotations. Instrument descriptors default to "instrumental," and rare/compound segment types (e.g., pre-chorus, fade-out) may be mapped heuristically.
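The substring-matching harmonization can be sketched as follows. This is a minimal illustration, not the paper's Algorithm 1: the keyword table and the tie-breaking order are assumptions, with only the 7-class taxonomy and the "instrumental" default taken from the description above.

```python
# Hedged sketch of substring-based label harmonization into the 7-class taxonomy.
# The KEYWORDS table below is hypothetical; the actual Algorithm 1 may differ.
CANONICAL = ["intro", "verse", "chorus", "bridge", "instrumental", "outro", "silence"]

KEYWORDS = {
    "intro": "intro",
    "verse": "verse",
    "chorus": "chorus",       # also catches compound labels like "pre-chorus"
    "bridge": "bridge",
    "outro": "outro",
    "fade": "outro",          # e.g. "fade-out" mapped heuristically to outro
    "silence": "silence",
    "solo": "instrumental",
    "inst": "instrumental",
}

def map_label(raw: str, default: str = "instrumental") -> str:
    """Map a raw corpus segment label to a canonical class via substring matching.

    Unmatched labels (instrument descriptors such as "guitar") fall back to
    the "instrumental" default, as described in the taxonomy above.
    """
    raw = raw.lower()
    for key, canonical in KEYWORDS.items():
        if key in raw:
            return canonical
    return default
```

Note that compound labels resolve by the first matching keyword, so "pre-chorus" maps to chorus; this is one plausible source of the heuristic mislabeling the Limitations section mentions.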
2. Preprocessing and Annotation Targets
Segmentation targets and boundaries are processed at 0.192 s frame intervals (about 5.2 Hz), ensuring fine temporal granularity. For each of the seven classes, binary activation curves are created from annotation spans, with onset/offset transitions smoothed by convolution with a 2 s Hann window (1 s rise/fall). A separate "boundaryness" curve marks ±0.3 s intervals around annotated segment boundaries, following the protocol of Ullrich et al. (2014). Audio front-end features are harmonic spectrograms built on an STFT with 1024-sample windows and a 512-sample hop at 16 kHz sampling [Won et al., 2020].
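The target-curve construction above can be sketched in NumPy. This is an illustrative reimplementation under stated assumptions (exact rounding and windowing conventions may differ from the benchmark's preprocessing code); the frame interval, Hann smoothing, and ±0.3 s boundary tolerance come from the description above.

```python
import numpy as np

FRAME_SEC = 0.192  # frame interval (~5.2 Hz), per the benchmark's preprocessing

def activation_curve(spans, duration, hann_sec=2.0):
    """Binary class-activation target from annotation spans, Hann-smoothed.

    spans: list of (start, end) pairs in seconds for one class.
    Convolving with a normalized 2 s Hann window gives ~1 s rise/fall
    ramps at each transition, matching the protocol described above.
    """
    n = int(np.ceil(duration / FRAME_SEC))
    curve = np.zeros(n)
    for start, end in spans:
        curve[int(round(start / FRAME_SEC)):int(round(end / FRAME_SEC))] = 1.0
    win = np.hanning(max(3, int(round(hann_sec / FRAME_SEC))))
    win /= win.sum()  # normalize so plateaus stay at 1.0
    return np.convolve(curve, win, mode="same")

def boundary_curve(boundaries, duration, tol=0.3):
    """'Boundaryness' target: 1.0 within ±tol seconds of each annotated boundary."""
    n = int(np.ceil(duration / FRAME_SEC))
    curve = np.zeros(n)
    for b in boundaries:
        i = max(0, int(round((b - tol) / FRAME_SEC)))
        j = min(n, int(round((b + tol) / FRAME_SEC)) + 1)
        curve[i:j] = 1.0
    return curve
```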
3. Model Architecture: SpecTNT
Central to Verse-Bench is the SpecTNT (Spectral-Temporal Transformer), a multi-point, hierarchical Transformer that jointly models boundary and functional segment class predictions. Input audio is chunked into 24 s segments, transformed into harmonic feature tensors, and encoded by a 2D ResNet to produce intermediate spectral representations. SpecTNT’s architecture consists of five blocks alternating between spectral encoder (per time-step frequency self-attention with Frequency Class Tokens) and temporal encoder (inter-frame self-attention to capture song-level repetition/novelty). Output heads produce both boundary activation and 7-class function scores per frame. Sinusoidal positional embeddings and layer normalization are standardized throughout all attention and feed-forward sub-layers.
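The alternating spectral/temporal attention pattern can be sketched in PyTorch. This is a structural sketch only: mean pooling over frequency stands in for the paper's Frequency Class Token mechanism, the ResNet front-end and positional embeddings are omitted, and all dimensions (`d_model=64`, `nhead=4`) are assumptions rather than the published configuration.

```python
import torch
import torch.nn as nn

class SpecTemporalBlock(nn.Module):
    """One SpecTNT-style block (sketch): self-attention over frequency bins
    within each frame, then self-attention over frames. Frequency mean
    pooling is a simplification of the Frequency Class Token exchange."""
    def __init__(self, d_model=64, nhead=4):
        super().__init__()
        self.spectral = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.temporal = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)

    def forward(self, x):                       # x: (batch, time, freq, d_model)
        b, t, f, d = x.shape
        x = self.spectral(x.reshape(b * t, f, d)).reshape(b, t, f, d)
        summary = self.temporal(x.mean(dim=2))  # (batch, time, d_model)
        return x + summary.unsqueeze(2)         # broadcast back over frequency

class SpecTNTSketch(nn.Module):
    """Five alternating blocks plus per-frame boundary and 7-class heads."""
    def __init__(self, d_model=64, nhead=4, n_blocks=5, n_classes=7):
        super().__init__()
        self.blocks = nn.ModuleList(
            [SpecTemporalBlock(d_model, nhead) for _ in range(n_blocks)])
        self.boundary_head = nn.Linear(d_model, 1)
        self.class_head = nn.Linear(d_model, n_classes)

    def forward(self, x):                       # x: (batch, time, freq, d_model)
        for blk in self.blocks:
            x = blk(x)
        h = x.mean(dim=2)                       # pool frequency per frame
        boundary = torch.sigmoid(self.boundary_head(h)).squeeze(-1)
        classes = torch.sigmoid(self.class_head(h))
        return boundary, classes
```

The two output heads correspond to the boundary-activation and 7-class function scores produced per frame.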
4. Loss Functions and Training Protocols
Multi-task training uses a composite objective: a boundary BCE term (L_bnd), a functional-class BCE term (L_cls), and a Connectionist Temporal Localization (CTL) loss (L_ctl) that enforces sequential and temporal consistency, accounting for soft label alignments [Wang & Metze, 2019]. The total loss is a weighted sum, L = w_bnd·L_bnd + w_cls·L_cls + w_ctl·L_ctl, with weights tuned on validation data. Training is conducted on 24 s chunks (hop size 3 s, batch size 128), with data augmentation (Gaussian noise, gain, filtering via torchaudio_augmentations) and the Adam optimizer with tuned learning rate and weight decay. The regime includes early stopping and learning-rate reduction on validation plateaus, implemented in PyTorch 1.8.
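The composite objective can be sketched as follows. The CTL term below is a simplified placeholder (a frame-difference consistency penalty), not the actual soft-alignment loss of Wang & Metze (2019); the weight names and defaults are assumptions.

```python
import torch
import torch.nn.functional as F

def composite_loss(bnd_pred, bnd_tgt, cls_pred, cls_tgt,
                   w_bnd=1.0, w_cls=1.0, w_ctl=1.0):
    """Weighted sum of boundary BCE, class BCE, and a temporal-consistency term.

    Predictions are assumed to be probabilities in [0, 1], shaped
    (batch, time) for boundaries and (batch, time, 7) for classes.
    The CTL stand-in penalizes mismatched frame-to-frame class transitions;
    the published loss uses soft alignments and is not reproduced here.
    """
    l_bnd = F.binary_cross_entropy(bnd_pred, bnd_tgt)
    l_cls = F.binary_cross_entropy(cls_pred, cls_tgt)
    l_ctl = F.mse_loss(cls_pred[:, 1:] - cls_pred[:, :-1],
                       cls_tgt[:, 1:] - cls_tgt[:, :-1])
    return w_bnd * l_bnd + w_cls * l_cls + w_ctl * l_ctl
```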
5. Evaluation Suite and Results
Evaluation encompasses boundary hit-rate F1 (HR.5F, ±0.5 s tolerance), frame-wise accuracy (ACC) for all seven segment classes, pairwise frame F1 (PWF) and normalized entropy scores (Sf) for segmentation, and chorus-specific metrics (CHR.5F, CF1). Cross-dataset evaluation is conducted by training on three corpora and testing on the fourth (10% validation set). Comparative baselines include Scluster (spectral clustering), DSF+Scluster (deep structure features + clustering), CNN-Chorus, GS3, and top MIREX submissions. SpecTNT+CTL outperforms prior state-of-the-art: for example, on RWC-Pop, HR.5F=0.623, ACC=0.675, CF1=0.847. Improvements over existing methods are consistent—average ACC increases by ∼3–4 percentage points, HR.5F by ∼5–10, CF1 by ∼3–7. The system generalizes robustly across datasets, including stylistically divergent material (Beatles vs. contemporary pop).
| Dataset | HR.5F | ACC | PWF | Sf | CHR.5F | CF1 |
|---|---|---|---|---|---|---|
| SALAMI-pop | 0.490 | 0.544 | 0.651 | 0.632 | 0.357 | 0.811 |
| RWC-Pop | 0.623 | 0.675 | 0.749 | 0.728 | 0.465 | 0.847 |
| Isophonics | 0.590 | 0.550 | 0.635 | 0.614 | 0.401 | 0.733 |
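The boundary hit-rate metric can be illustrated with a minimal self-contained stand-in. Standard implementations (e.g. `mir_eval.segment.detection`) should be preferred in practice; this sketch uses greedy one-to-one matching within the ±0.5 s tolerance window and is an assumption about the matching strategy, not the benchmark's exact code.

```python
def hit_rate_f1(ref_bounds, est_bounds, window=0.5):
    """Boundary hit-rate F1 (HR.5F sketch): an estimated boundary counts as a
    hit if it matches an unused reference boundary within +/- window seconds."""
    ref = sorted(ref_bounds)
    est = sorted(est_bounds)
    used = [False] * len(ref)
    matched = 0
    for e in est:
        for i, r in enumerate(ref):
            if not used[i] and abs(e - r) <= window:
                used[i] = True
                matched += 1
                break
    if not ref or not est or matched == 0:
        return 0.0
    precision = matched / len(est)
    recall = matched / len(ref)
    return 2 * precision * recall / (precision + recall)
```

For example, with reference boundaries at 10, 20, and 30 s and estimates at 10.3, 20.6, and 30.1 s, only the 20.6 s estimate falls outside the ±0.5 s window, giving precision = recall = 2/3.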
6. Limitations and Extensions
Current taxonomy mapping is heuristic and may mislabel compound or rare functional segments (pre-chorus, fade-outs). Fixed 24 s context windows may be inadequate for extreme song lengths. The "silence" class can conflate actual silence and non-musical interludes. SpecTNT does not model hierarchical segment relationships (e.g., verse → pre-chorus → chorus). A plausible implication is that further refinement may involve hierarchical annotation protocols and more granular functional taxonomies.
Future extensions for Verse-Bench include incorporation of additional corpora (e.g., SALAMI classical), expansion to finer-grained and hierarchical labels, and new evaluation protocols. Standardization proposals include fixed train/val/test splits, conversion code, a common evaluation suite, baseline implementations, leaderboard, and open-source scripts for data preprocessing, training, and inference. This suggests a pathway for reproducible, transparent benchmarking in semantic music structure analysis, enabling rigorous comparative studies.
7. Significance and Prospects
Verse-Bench establishes a common ground truth and reproducible baseline for function-aware music structure segmentation, supporting systematic evaluation across diverse datasets and annotation conventions. SpecTNT’s hierarchical attention provides efficient joint modeling of spectral and temporal dependencies, improving upon both chorus and boundary detection. The benchmark facilitates robust cross-dataset generalization and makes explicit the semantic interpretation of musical segments, moving beyond abstract boundary or clustering approaches. By formalizing data, taxonomy, and protocols, Verse-Bench advances research toward deeper semantic annotation of audio—particularly verse detection and functional segment identification—enabling comparable metrics and progress in music informatics (Wang et al., 2022).