ICASSP 2026: SongEval Benchmark
- The paper introduces a novel, multi-dimensional evaluation benchmark that standardizes human judgments for assessing song aesthetics across diverse genres.
- The top-ranked system on the benchmark employs multi-source, multi-scale feature extraction and hierarchical data augmentation, achieving significant improvements in correlation and top-tier F1 (TTC) metrics over baseline models.
- The benchmark provides a comprehensive 140-hour dataset with expert annotations on five aesthetic dimensions, enabling both scoring and ranking tasks for rigorous model validation.
The ICASSP 2026 SongEval Benchmark is a large-scale, multi-dimensional evaluation suite for song aesthetics assessment, developed as an extension of the SongEval dataset and adopted as a standard for evaluating machine learning models on the perceptual quality of generated music. It enables rigorous comparison of automatic models in replicating human judgments across key facets of musical appeal, providing established protocols, rich annotation, and a standardized evaluation pipeline (Yao et al., 16 May 2025, Liu et al., 24 Nov 2025).
1. Dataset Structure and Annotations
The core of the ICASSP 2026 SongEval Benchmark is a curated dataset of 2,399 full-length song tracks spanning approximately 140 hours of audio content. Songs are distributed across nine mainstream genres with recordings in both English and Mandarin Chinese. Each track is treated as a discrete datum and annotated by multiple expert listeners under controlled listening conditions (Yao et al., 16 May 2025, Liu et al., 24 Nov 2025).
Five orthogonal aesthetic dimensions are rated for each song:
- Overall coherence
- Memorability
- Naturalness of vocal breathing and phrasing
- Clarity of song structure
- Overall musicality
Ratings are assigned on a five-point Likert scale (1 = very poor, 5 = excellent). Each song is rated by four annotators, and the dimension-wise ground-truth score is the mean of the annotators' ratings. Segment-level representations are extracted dynamically via pre-trained front-ends such as MuQ, with segment counts per song scaled to song length (e.g., 100 segments for a 30-second excerpt). Validation and test splits are drawn without explicit genre or demographic stratification, relying on the breadth of the corpus to keep them diverse and representative.
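As a concrete illustration of the annotation aggregation and length-proportional segmentation, the sketch below averages four annotators' Likert ratings per dimension and scales the segment count with duration; the segments-per-second rate is inferred from the 100-segments-per-30-second example and is an assumption, not a documented constant.

```python
# Sketch: per-dimension ground truth as the mean of four annotator ratings,
# and a length-proportional segment count (rate inferred from 100 segments / 30 s).
import numpy as np

DIMENSIONS = ["coherence", "memorability", "naturalness", "structure", "musicality"]

def ground_truth(ratings):
    """ratings: (4, 5) array of Likert scores (4 annotators x 5 dimensions)."""
    return dict(zip(DIMENSIONS, np.mean(ratings, axis=0)))

def num_segments(duration_s, segments_per_30s=100):
    """Scale the number of extracted segments with song length (assumed linear rule)."""
    return max(1, round(duration_s * segments_per_30s / 30))

# Example: a 30 s excerpt yields 100 segments; a 3-minute song yields 600.
print(num_segments(30), num_segments(180))
print(ground_truth(np.array([[4, 3, 4, 5, 4],
                             [5, 4, 4, 4, 4],
                             [4, 4, 3, 5, 5],
                             [4, 3, 4, 4, 4]])))
```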
Table 1: SongEval Genre Distribution
| Genre | Languages | Samples | Male / Female (%) |
|---|---|---|---|
| Pop | EN, ZH | 459 | 54 / 46 |
| Rock | EN, ZH | 324 | 63 / 37 |
| Electronic | EN, ZH | 249 | 53 / 47 |
| Blues | EN, ZH | 195 | 71 / 29 |
| World Music | EN, ZH | 228 | 57 / 43 |
| Hip-hop/Rap | EN, ZH | 145 | 70 / 30 |
| Country | EN, ZH | 155 | 58 / 42 |
| Jazz | EN, ZH | 133 | 54 / 46 |
| Classical | EN, ZH | 105 | 41 / 59 |
2. Task Definitions and Evaluation Metrics
SongEval supports two primary tasks:
- Scoring: Predict the absolute aesthetic score of each track on each of the five dimensions.
- Ranking: Identify the highest-quality ("top-tier") songs, typically the top-ranked subset per dimension.
Official evaluation metrics—applied for both development and leaderboard ranking—include:
- Linear Correlation Coefficient (LCC, Pearson's $r$):

  $$r = \frac{\sum_i (x_i-\bar{x})(y_i-\bar{y})}{\sqrt{\sum_i (x_i-\bar{x})^2}\,\sqrt{\sum_i (y_i-\bar{y})^2}}$$

- Spearman's Rank Correlation Coefficient (SRCC), i.e., Pearson's $r$ computed over rank-transformed scores, or equivalently (in the absence of ties):

  $$\rho = 1 - \frac{6\sum_i d_i^2}{n(n^2-1)}$$

  where $d_i$ is the difference between the ranks of song $i$ under the predicted and ground-truth orderings.

- Kendall's Rank Correlation Coefficient (KRCC):

  $$\tau = \frac{n_c - n_d}{\binom{n}{2}}$$

  where $n_c$ and $n_d$ are the counts of concordant and discordant pairs.

- Top-Tier F1 (TTC): songs are binarized as "top" or "non-top" by score thresholds, and the F1 score is computed on the top-tier predictions:

  $$F_1 = \frac{2\,\mathrm{Precision}\cdot\mathrm{Recall}}{\mathrm{Precision}+\mathrm{Recall}}$$
Secondary metrics include MSE and system-level correlation coefficients, allowing comparison to both human judgment and baseline perceptual proxies such as AudioBox scores or vocal range (Yao et al., 16 May 2025).
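For illustration, the per-dimension metrics can be reproduced with standard statistical routines, as in the sketch below. The top-tier binarization rule here (a quantile threshold on scores) is an assumption for demonstration; the official threshold is defined by the organizers' evaluation scripts.

```python
# Minimal sketch of the SongEval metrics for one aesthetic dimension.
# The top-tier threshold (quantile-based) is an illustrative assumption.
import numpy as np
from scipy.stats import pearsonr, spearmanr, kendalltau
from sklearn.metrics import f1_score

def songeval_metrics(y_true, y_pred, top_quantile=0.8):
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    lcc, _ = pearsonr(y_true, y_pred)
    srcc, _ = spearmanr(y_true, y_pred)
    krcc, _ = kendalltau(y_true, y_pred)
    # Binarize: songs above the chosen quantile are labeled "top-tier".
    true_top = y_true >= np.quantile(y_true, top_quantile)
    pred_top = y_pred >= np.quantile(y_pred, top_quantile)
    ttf1 = f1_score(true_top, pred_top)
    return {"LCC": lcc, "SRCC": srcc, "KRCC": krcc, "TT-F1": ttf1}
```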
3. Model Architectures and Training Protocols
State-of-the-art models benchmarked on SongEval employ multi-source, multi-scale feature extraction to account for the hierarchical, polyphonic structure of music. Two principal extractor backbones are used:
- MuQ: Segment-level, self-supervised audio encoder
- MusicFM: Track-global feature extractor
These are integrated via Multi-Query Multi-Head Attention Statistical Pooling (MQMHASTP), aggregating temporally aligned embeddings into a fixed-length, dimension-aware vector with 8 attention heads and 512 output dimensions (Liu et al., 24 Nov 2025).
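A minimal sketch of attention statistical pooling in this multi-head style is given below; the module structure, shapes, and the 8-head/512-dimensional configuration are illustrative and do not reproduce the authors' exact MQMHASTP implementation.

```python
# Sketch of multi-head attentive statistical pooling (MQMHASTP-style aggregation).
# Module names and shapes are illustrative, not the authors' exact code.
import torch
import torch.nn as nn

class AttentiveStatsPooling(nn.Module):
    def __init__(self, feat_dim, num_heads=8, out_dim=512):
        super().__init__()
        # One attention score per head and per frame.
        self.score = nn.Sequential(
            nn.Linear(feat_dim, feat_dim), nn.Tanh(), nn.Linear(feat_dim, num_heads)
        )
        # Weighted mean + std per head -> fixed-length, dimension-aware vector.
        self.proj = nn.Linear(2 * feat_dim * num_heads, out_dim)

    def forward(self, x):                          # x: (batch, frames, feat_dim)
        w = torch.softmax(self.score(x), dim=1)    # attention over time: (B, T, H)
        w = w.unsqueeze(-1)                        # (B, T, H, 1)
        xh = x.unsqueeze(2)                        # (B, T, 1, D)
        mean = (w * xh).sum(dim=1)                 # per-head weighted mean: (B, H, D)
        var = (w * (xh - mean.unsqueeze(1)) ** 2).sum(dim=1)
        std = (var + 1e-6).sqrt()                  # per-head weighted std: (B, H, D)
        stats = torch.cat([mean, std], dim=-1)     # (B, H, 2D)
        return self.proj(stats.flatten(1))         # (B, out_dim)
```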
Hierarchical Data Augmentation
The SongEval leaderboard’s top submissions leverage a dual-level data augmentation pipeline:
- Audio-level augmentations (applied on-the-fly to each training example): resampling to 24 kHz, gain/dynamics randomization, high-SNR Gaussian noise, pitch shifts, time-stretching and shifting, high-/low-pass filtering, EQ, and polarity inversion.
- Feature-level C-Mixup: Semantically consistent mixup in embedding space, where partner samples are chosen via a Gaussian kernel over feature distances, and mixup coefficients are drawn from a Beta distribution.
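The feature-level C-Mixup step can be sketched as follows; the kernel bandwidth, the Beta parameter, and mini-batch-level pairing are assumptions for illustration, and the document's description of a Gaussian kernel over feature distances is followed for partner selection.

```python
# Minimal C-Mixup sketch over a mini-batch of embeddings and per-dimension labels.
# Bandwidth, Beta(alpha, alpha), and batch-level pairing are illustrative assumptions.
import numpy as np

def c_mixup(embeds, labels, alpha=0.2, bandwidth=1.0, rng=np.random.default_rng()):
    """embeds: (N, D) segment/track embeddings; labels: (N, K) aesthetic scores."""
    n = embeds.shape[0]
    mixed_x, mixed_y = [], []
    for i in range(n):
        # Gaussian kernel over feature distances favors semantically close partners.
        d2 = np.sum((embeds - embeds[i]) ** 2, axis=1)
        d2 = d2 / (d2.mean() + 1e-12)          # rescale so the kernel is well-behaved
        probs = np.exp(-d2 / (2 * bandwidth ** 2))
        probs[i] = 0.0                          # never mix a sample with itself
        probs = probs / probs.sum()
        j = rng.choice(n, p=probs)
        lam = rng.beta(alpha, alpha)            # mixup coefficient from a Beta prior
        mixed_x.append(lam * embeds[i] + (1 - lam) * embeds[j])
        mixed_y.append(lam * labels[i] + (1 - lam) * labels[j])
    return np.stack(mixed_x), np.stack(mixed_y)
```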
Hybrid Loss Objectives
Model optimization combines regression (Smooth L1 loss per dimension) and ranking (ListMLE) objectives:
- Smooth L1 loss penalizes per-sample scoring errors.
- ListMLE enforces correct ordinal placement among songs for robust ranking in each dimension.
The balance between the two loss terms is tuned per track (e.g., 0.15 for Track 1, 0.05 for Track 2). Training uses the Adam optimizer with weight decay, batch size 8, and 60 epochs, with early stopping on validation SRCC and final model selection by best TTC score (Liu et al., 24 Nov 2025).
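One possible formulation of the hybrid objective, combining per-dimension Smooth L1 regression with a ListMLE ranking term, is sketched below. The assumption that the tuned balance coefficient (e.g., 0.15) weights the ranking term is illustrative.

```python
# Sketch of a hybrid Smooth-L1 + ListMLE objective (weighting scheme is an assumption).
import torch
import torch.nn.functional as F

def listmle_loss(pred, target):
    """ListMLE: negative log-likelihood of the ground-truth ordering for one dimension.
    pred, target: (batch,) predicted / ground-truth scores."""
    order = torch.argsort(target, descending=True)     # ground-truth ranking
    s = pred[order]
    # logsumexp over the remaining items at each rank, computed from the end.
    lse = torch.logcumsumexp(s.flip(0), dim=0).flip(0)
    return (lse - s).mean()

def hybrid_loss(preds, targets, rank_weight=0.15):
    """preds, targets: (batch, 5) scores for the five aesthetic dimensions."""
    reg = F.smooth_l1_loss(preds, targets)             # per-sample scoring error
    rank = torch.stack([listmle_loss(preds[:, d], targets[:, d])
                        for d in range(preds.shape[1])]).mean()
    return reg + rank_weight * rank
```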
4. Experimental Results and Benchmark Advancements
Experimental comparisons demonstrate that advanced multi-scale extraction and augmentation strategies, coupled with hybrid training, yield significant improvements over regression-only baselines.
Table 2: Ablation Results on SongEval (all metrics ×100)
| Model Variant | LCC | SRCC | KRCC | TTC |
|---|---|---|---|---|
| Baseline Regression (4 heads) | 90.83 | 89.67 | 73.56 | 82.57 |
| + MMFE Only | 90.99 | 89.67 | 72.91 | 83.65 |
| + MMFE + HAA | 90.99 | 89.66 | 73.30 | 84.64 |
| Full (MMFE + HAA + Hybrid Loss) | 91.25 | 90.14 | 73.98 | 85.65 |
Key insights:
- Multi-source, multi-scale feature extraction (MMFE) substantially improves representation capacity, translating into higher correlation and TTC.
- Hierarchical audio augmentation (HAA) confers additional robustness, notably increasing TTC.
- The hybrid regression-ranking objective further boosts top-song detection and ranking stability.
Ablation studies confirm that each component delivers a measurable, cumulative benefit, establishing a new state of the art on the SongEval benchmark.
5. Implementation, Reproducibility, and Standardization
Audio is uniformly resampled to 24 kHz with all augmentations applied on-the-fly using audiomentations. Embedding-level C-Mixup ensures semantic consistency in data mixing, with partner selection based on feature similarity. All models utilize off-the-shelf MuQ and MusicFM checkpoints with MQMHASTP pooling. Evaluation adheres to scripts released by the ICASSP SongEval organizers, guaranteeing compatibility and comparability across submissions.
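A plausible on-the-fly waveform augmentation chain with the audiomentations library, covering the transforms listed in Section 3, might look like the sketch below; the probabilities and default parameter ranges are assumptions rather than the submission's exact settings.

```python
# Illustrative on-the-fly waveform augmentation chain with audiomentations.
# Probabilities and parameter ranges are assumptions, not the submission's settings.
import numpy as np
from audiomentations import (
    Compose, Gain, AddGaussianNoise, PitchShift, TimeStretch, Shift,
    HighPassFilter, LowPassFilter, PolarityInversion,
)

augment = Compose([
    Gain(p=0.5),                 # gain/dynamics randomization
    AddGaussianNoise(p=0.3),     # low-amplitude (high-SNR) noise
    PitchShift(p=0.3),           # small pitch shifts
    TimeStretch(p=0.3),          # mild time-stretching
    Shift(p=0.3),                # temporal shifting
    HighPassFilter(p=0.2),
    LowPassFilter(p=0.2),
    PolarityInversion(p=0.1),
])

# waveform: mono float32 array, already resampled to 24 kHz
waveform = np.random.randn(24000 * 30).astype(np.float32)  # placeholder 30 s clip
augmented = augment(samples=waveform, sample_rate=24000)
```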
Test sets are randomized and withheld from participants, and strict adherence to the splitting protocol enforces fair leaderboard conditions.
6. Limitations and Prospects
Dimension independence is not fully realized—inter-dependence is observed between, for example, "overall coherence" and "structure clarity" or "musicality" and "memorability." While written guidelines and representative demo tracks are employed to reduce crosstalk, perfect orthogonality among perceptual categories remains theoretically unattainable. Future directions identified include:
- Development of architectures targeting subtler, subjective cues.
- Expansion of SongEval to additional languages and under-represented music cultures.
- Active learning and preference-based annotation to further refine dimensionality (Yao et al., 16 May 2025, Liu et al., 24 Nov 2025).
A plausible implication is that continued evolution of the benchmark will both enable and require richer, more robust modeling paradigms.
7. Context within Automatic Music Evaluation
The ICASSP 2026 SongEval Benchmark represents the canonical reference for human-aligned song aesthetic evaluation, addressing the inadequacy of objective metrics (e.g., embedding or production-based proxies) for characterizing musical appeal. Historical comparison reveals a consistent performance gap between these and dedicated, multi-dimensional subjective models, with SongEval-trained systems surpassing AudioBox and vocal range baselines by a significant correlation margin. Adoption of SongEval by the research community is expected to standardize evaluation of generative and enhancement models for music, driving progress in perceptual modeling and synthesis quality.