ICASSP 2026: SongEval Benchmark
- The paper introduces a novel, multi-dimensional evaluation benchmark that standardizes human judgments for assessing song aesthetics across diverse genres.
- The top-ranked system on the benchmark employs multi-source, multi-scale feature extraction and hierarchical data augmentation, achieving significant improvements in correlation and top-tier F1 (TTC) metrics over baseline models.
- The benchmark provides a comprehensive 140-hour dataset with expert annotations on five aesthetic dimensions, enabling both scoring and ranking tasks for rigorous model validation.
The ICASSP 2026 SongEval Benchmark is a large-scale, multi-dimensional evaluation suite for song aesthetics assessment, developed as an extension of the SongEval dataset and adopted as a standard for evaluating machine learning models on the perceptual quality of generated music. It enables rigorous comparison of automatic models in replicating human judgments across key facets of musical appeal, providing established protocols, rich annotation, and a standardized evaluation pipeline (Yao et al., 16 May 2025, Liu et al., 24 Nov 2025).
1. Dataset Structure and Annotations
The core of the ICASSP 2026 SongEval Benchmark is a curated dataset of 2,399 full-length song tracks spanning approximately 140 hours of audio content. Songs are distributed across nine mainstream genres with recordings in both English and Mandarin Chinese. Each track is treated as a discrete datum and annotated by multiple expert listeners under controlled listening conditions (Yao et al., 16 May 2025, Liu et al., 24 Nov 2025).
Five orthogonal aesthetic dimensions are rated for each song:
- Overall coherence
- Memorability
- Naturalness of vocal breathing and phrasing
- Clarity of song structure
- Overall musicality
Ratings are assigned on a five-point Likert scale (1 = very poor, 5 = excellent). Each song is rated by four annotators, and the dimension-wise ground-truth score is the mean of the annotators' ratings. Segment-level representations are extracted dynamically via pre-trained front-ends such as MuQ, with segment counts per song scaled to song length (e.g., 100 segments for a 30-second excerpt). Validation and test splits are drawn without explicit genre or demographic stratification, relying on the breadth of the corpus to keep them diverse and representative.
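As a concrete illustration of the annotation aggregation and length-proportional segmentation, the sketch below averages four annotators' Likert ratings per dimension and scales the segment count with duration; the segments-per-second rate is inferred from the 100-segments-per-30-second example and is an assumption, not a documented constant.

```python
# Sketch: per-dimension ground truth as the mean of four annotator ratings,
# and a length-proportional segment count (rate inferred from 100 segments / 30 s).
import numpy as np

DIMENSIONS = ["coherence", "memorability", "naturalness", "structure", "musicality"]

def ground_truth(ratings):
    """ratings: (4, 5) array of Likert scores (4 annotators x 5 dimensions)."""
    return dict(zip(DIMENSIONS, np.mean(ratings, axis=0)))

def num_segments(duration_s, segments_per_30s=100):
    """Scale the number of extracted segments with song length (assumed linear rule)."""
    return max(1, round(duration_s * segments_per_30s / 30))

# Example: a 30 s excerpt yields 100 segments; a 3-minute song yields 600.
print(num_segments(30), num_segments(180))
print(ground_truth(np.array([[4, 3, 4, 5, 4],
                             [5, 4, 4, 4, 4],
                             [4, 4, 3, 5, 5],
                             [4, 3, 4, 4, 4]])))
```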
Table 1: SongEval Genre Distribution
| Genre | Languages | Samples | Male / Female (%) |
|---|---|---|---|
| Pop | EN, ZH | 459 | 54 / 46 |
| Rock | EN, ZH | 324 | 63 / 37 |
| Electronic | EN, ZH | 249 | 53 / 47 |
| Blues | EN, ZH | 195 | 71 / 29 |
| World Music | EN, ZH | 228 | 57 / 43 |
| Hip-hop/Rap | EN, ZH | 145 | 70 / 30 |
| Country | EN, ZH | 155 | 58 / 42 |
| Jazz | EN, ZH | 133 | 54 / 46 |
| Classical | EN, ZH | 105 | 41 / 59 |
2. Task Definitions and Evaluation Metrics
SongEval supports two primary tasks:
- Scoring: Predict the absolute aesthetic score of each track on each of the five dimensions.
- Ranking: Identify the highest-quality ("top-tier") songs, typically the top-ranked subset per dimension.
Official evaluation metrics—applied for both development and leaderboard ranking—include:
- Linear Correlation Coefficient (LCC, Pearson's $r$):

  $$r = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}}$$

- Spearman's Rank Correlation Coefficient (SRCC), i.e., Pearson's $r$ computed over rank-transformed scores, or equivalently (in the absence of ties):

  $$\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2-1)}$$

  where $d_i$ is the difference between the ranks of song $i$ under the predicted and ground-truth orderings.

- Kendall's Rank Correlation Coefficient (KRCC):

  $$\tau = \frac{n_c - n_d}{\binom{n}{2}}$$

  where $n_c$ and $n_d$ are the counts of concordant and discordant pairs.

- Top-Tier F1 (TTC): songs are binarized as "top" or "non-top" by score thresholds, and the F1 score is computed on the top-tier predictions:

  $$F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
Secondary metrics include MSE and system-level correlation coefficients, allowing comparison to both human judgment and baseline perceptual proxies such as AudioBox scores or vocal range (Yao et al., 16 May 2025).
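For illustration, the per-dimension metrics can be reproduced with standard statistical routines, as in the sketch below. The top-tier binarization rule here (a quantile threshold on scores) is an assumption for demonstration; the official threshold is defined by the organizers' evaluation scripts.

```python
# Minimal sketch of the SongEval metrics for one aesthetic dimension.
# The top-tier threshold (quantile-based) is an illustrative assumption.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.metrics import f1_score

def songeval_metrics(y_true, y_pred, top_quantile=0.8):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    lcc, _ = pearsonr(y_true, y_pred)
    srcc, _ = spearmanr(y_true, y_pred)
    krcc, _ = kendalltau(y_true, y_pred)
    # Binarize: songs above the chosen quantile are labeled "top-tier".
    true_top = y_true >= np.quantile(y_true, top_quantile)
    pred_top = y_pred >= np.quantile(y_pred, top_quantile)
    ttf1 = f1_score(true_top, pred_top)
    return {"LCC": lcc, "SRCC": srcc, "KRCC": krcc, "TT-F1": ttf1}
```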
3. Model Architectures and Training Protocols
State-of-the-art models benchmarked on SongEval employ multi-source, multi-scale feature extraction to account for the hierarchical, polyphonic structure of music. Two principal extractor backbones are used:
- MuQ: Segment-level, self-supervised audio encoder
- MusicFM: Track-global feature extractor
These are integrated via Multi-Query Multi-Head Attention Statistical Pooling (MQMHASTP), aggregating temporally aligned embeddings into a fixed-length, dimension-aware vector with 8 attention heads and 512 output dimensions (Liu et al., 24 Nov 2025).
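A minimal sketch of attention statistical pooling in this multi-head style is given below; the module structure, shapes, and the 8-head/512-dimensional configuration are illustrative and do not reproduce the authors' exact MQMHASTP implementation.

```python
# Sketch of multi-head attentive statistical pooling (MQMHASTP-style aggregation).
# Module names and shapes are illustrative, not the authors' exact code.
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    def __init__(self, feat_dim, num_heads=8, out_dim=512):
        super().__init__()
        # One attention score per head and per frame.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.Tanh(), nn.Linear(feat_dim, num_heads)
        )
        # Weighted mean + std per head -> fixed-length, dimension-aware vector.
        self.proj = nn.Linear(2 * feat_dim * num_heads, out_dim)

    def forward(self, x):                          # x: (batch, frames, feat_dim)
        w = torch.softmax(self.score(x), dim=1)    # attention over time: (B, T, H)
        w = w.unsqueeze(-1)                        # (B, T, H, 1)
        xh = x.unsqueeze(2)                        # (B, T, 1, D)
        mean = (w * xh).sum(dim=1)                 # per-head weighted mean: (B, H, D)
        var = (w * (xh - mean.unsqueeze(1)) ** 2).sum(dim=1)
        std = (var + 1e-6).sqrt()                  # per-head weighted std: (B, H, D)
        stats = torch.cat([mean, std], dim=-1)     # (B, H, 2D)
        return self.proj(stats.flatten(1))         # (B, out_dim)
```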
Hierarchical Data Augmentation
The SongEval leaderboard’s top submissions leverage a dual-level data augmentation pipeline:
- Audio-level augmentations (applied on-the-fly to each training example): resampling to 24 kHz, gain/dynamics randomization, high-SNR Gaussian noise, pitch shifts, time-stretching and shifting, high-/low-pass filtering, EQ, and polarity inversion.
- Feature-level C-Mixup: Semantically consistent mixup in embedding space, where partner samples are chosen via a Gaussian kernel over feature distances, and mixup coefficients are drawn from a Beta distribution.
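The feature-level C-Mixup step can be sketched as follows; the kernel bandwidth, the Beta parameter, and mini-batch-level pairing are assumptions for illustration, and the document's description of a Gaussian kernel over feature distances is followed for partner selection.

```python
# Minimal C-Mixup sketch over a mini-batch of embeddings and per-dimension labels.
# Bandwidth, Beta(alpha, alpha), and batch-level pairing are illustrative assumptions.
import numpy as np

def c_mixup(embeds, labels, alpha=0.2, bandwidth=1.0, rng=np.random.default_rng()):
    """embeds: (N, D) segment/track embeddings; labels: (N, K) aesthetic scores."""
    n = embeds.shape[0]
    mixed_x, mixed_y = [], []
    for i in range(n):
        # Gaussian kernel over feature distances favors semantically close partners.
        d2 = np.sum((embeds - embeds[i]) ** 2, axis=1)
        d2 = d2 / (d2.mean() + 1e-12)          # rescale so the kernel is well-behaved
        probs = np.exp(-d2 / (2 * bandwidth ** 2))
        probs[i] = 0.0                          # never mix a sample with itself
        probs = probs / probs.sum()
        j = rng.choice(n, p=probs)
        lam = rng.beta(alpha, alpha)            # mixup coefficient from a Beta prior
        mixed_x.append(lam * embeds[i] + (1 - lam) * embeds[j])
        mixed_y.append(lam * labels[i] + (1 - lam) * labels[j])
    return np.stack(mixed_x), np.stack(mixed_y)
```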
Hybrid Loss Objectives
Model optimization combines regression (Smooth L1 loss per dimension) and ranking (ListMLE) objectives:
- Smooth L1 loss penalizes per-sample scoring errors.
- ListMLE enforces correct ordinal placement among songs for robust ranking in each dimension.
The balance between the two loss terms is tuned per track (e.g., 0.15 for Track 1, 0.05 for Track 2). Training uses the Adam optimizer with weight decay, batch size 8, and 60 epochs, with early stopping on validation SRCC and final model selection by best TTC score (Liu et al., 24 Nov 2025).
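One possible formulation of the hybrid objective, combining per-dimension Smooth L1 regression with a ListMLE ranking term, is sketched below. The assumption that the tuned balance coefficient (e.g., 0.15) weights the ranking term is illustrative.

```python
# Sketch of a hybrid Smooth-L1 + ListMLE objective (weighting scheme is an assumption).
import torch
import torch.nn.functional as F

def listmle_loss(pred, target):
    """ListMLE: negative log-likelihood of the ground-truth ordering for one dimension.
    pred, target: (batch,) predicted / ground-truth scores."""
    order = torch.argsort(target, descending=True)     # ground-truth ranking
    s = pred[order]
    # logsumexp over the remaining items at each rank, computed from the end.
    lse = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    return (lse - s).mean()

def hybrid_loss(preds, targets, rank_weight=0.15):
    """preds, targets: (batch, 5) scores for the five aesthetic dimensions."""
    reg = F.smooth_l1_loss(preds, targets)             # per-sample scoring error
    rank = torch.stack([listmle_loss(preds[:, d], targets[:, d])
                        for d in range(preds.shape[1])]).mean()
    return reg + rank_weight * rank
```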
4. Experimental Results and Benchmark Advancements
Experimental comparisons demonstrate that advanced multi-scale extraction and augmentation strategies, coupled with hybrid training, yield significant improvements over regression-only baselines.
Table 2: Ablation Results on SongEval (all metrics ×100)
| Model Variant | LCC | SRCC | KRCC | TTC |
|---|---|---|---|---|
| Baseline Regression (4 heads) | 90.83 | 89.67 | 73.56 | 82.57 |
| + MMFE Only | 90.99 | 89.67 | 72.91 | 83.65 |
| + MMFE + HAA | 90.99 | 89.66 | 73.30 | 84.64 |
| Full (MMFE + HAA + Hybrid Loss) | 91.25 | 90.14 | 73.98 | 85.65 |
Key insights:
- Multi-source, multi-scale feature extraction (MMFE) substantially improves representation capacity, translating into higher correlation and TTC.
- Hierarchical audio augmentation (HAA) confers additional robustness, notably increasing TTC.
- The hybrid regression-ranking objective further boosts top-song detection and ranking stability.
Ablation studies confirm that each component delivers a measurable, cumulative benefit, establishing a new state of the art on the SongEval benchmark.
5. Implementation, Reproducibility, and Standardization
Audio is uniformly resampled to 24 kHz with all augmentations applied on-the-fly using audiomentations. Embedding-level C-Mixup ensures semantic consistency in data mixing, with partner selection based on feature similarity. All models utilize off-the-shelf MuQ and MusicFM checkpoints with MQMHASTP pooling. Evaluation adheres to scripts released by the ICASSP SongEval organizers, guaranteeing compatibility and comparability across submissions.
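A plausible on-the-fly waveform augmentation chain with the audiomentations library, covering the transforms listed in Section 3, might look like the sketch below; the probabilities and default parameter ranges are assumptions rather than the submission's exact settings.

```python
# Illustrative on-the-fly waveform augmentation chain with audiomentations.
# Probabilities and parameter ranges are assumptions, not the submission's settings.
import numpy as np
from audiomentations import (
    Compose, Gain, AddGaussianNoise, PitchShift, TimeStretch, Shift,
    HighPassFilter, LowPassFilter, PolarityInversion,
)

augment = Compose([
    Gain(p=0.5),                 # gain/dynamics randomization
    AddGaussianNoise(p=0.3),     # low-amplitude (high-SNR) noise
    PitchShift(p=0.3),           # small pitch shifts
    TimeStretch(p=0.3),          # mild time-stretching
    Shift(p=0.3),                # temporal shifting
    HighPassFilter(p=0.2),
    LowPassFilter(p=0.2),
    PolarityInversion(p=0.1),
])

# waveform: mono float32 array, already resampled to 24 kHz
waveform = np.random.randn(24000 * 30).astype(np.float32)  # placeholder 30 s clip
augmented = augment(samples=waveform, sample_rate=24000)
```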
Test sets are randomized and withheld from participants, and strict adherence to the splitting protocol enforces fair leaderboard conditions.
6. Limitations and Prospects
Dimension independence is not fully realized—inter-dependence is observed between, for example, "overall coherence" and "structure clarity" or "musicality" and "memorability." While written guidelines and representative demo tracks are employed to reduce crosstalk, perfect orthogonality among perceptual categories remains theoretically unattainable. Future directions identified include:
- Development of architectures targeting subtler, subjective cues.
- Expansion of SongEval to additional languages and under-represented music cultures.
- Active learning and preference-based annotation to further refine dimensionality (Yao et al., 16 May 2025, Liu et al., 24 Nov 2025).
A plausible implication is that continued evolution of the benchmark will both enable and require richer, more robust modeling paradigms.
7. Context within Automatic Music Evaluation
The ICASSP 2026 SongEval Benchmark represents the canonical reference for human-aligned song aesthetic evaluation, addressing the inadequacy of objective metrics (e.g., embedding or production-based proxies) for characterizing musical appeal. Historical comparison reveals a consistent performance gap between these and dedicated, multi-dimensional subjective models, with SongEval-trained systems surpassing AudioBox and vocal range baselines by a significant correlation margin. Adoption of SongEval by the research community is expected to standardize evaluation of generative and enhancement models for music, driving progress in perceptual modeling and synthesis quality.