
SongEval Dataset: Multi-Dimensional Aesthetics

Updated 25 January 2026
  • The paper pairs SongEval with hierarchical uncertainty modeling and multi-stem attention to capture nuanced human judgments in music.
  • SongEval is a dataset of AI-generated songs annotated with multi-granularity probabilistic ratings, enabling precise benchmarking of automated musical evaluation.
  • The dataset supports uncertainty-aware modeling and practical applications in generative music evaluation and preference modeling.

The SongEval dataset is an AI-generated collection designed for multi-dimensional evaluation of song aesthetics, motivated by the rapid rise of generative music AI and the need for scalable, automated assessment of musical content. Unlike prior audio- or speech-centric datasets, SongEval specifically targets song-level aesthetics, enabling precise benchmarking and development of models that capture nuanced human judgments in music. Its conception and utilization are detailed in "Song Aesthetics Evaluation with Multi-Stem Attention and Hierarchical Uncertainty Modeling" (Lv et al., 18 Jan 2026).

1. Motivation and Design Principles

The impetus for SongEval arises from limitations inherent in legacy approaches to music assessment, which traditionally emphasize speech or singing quality and employ single-score regression techniques that fail to encapsulate the multi-dimensional and uncertain nature of human song perception. Human raters, as shown through careful empirical observation, rarely assign a precise score instantly; rather, they operate via coarse-to-fine reasoning, first identifying a plausible interval (e.g., "60–70 out of 100") and then refining their judgment within that span. The SongEval dataset is constructed to catalyze the development of frameworks that accommodate such gradual, multi-granularity reasoning and uncertainty modeling.

2. Structural Composition and Annotation Schema

SongEval comprises full-length, AI-generated songs, annotated for five core dimensions of aesthetics. Each song's annotation involves segmenting the audio and applying human raters’ coarse-to-fine labeling protocol, further operationalized by probability distributions over discretized rating intervals at multiple granularities.

  • Segmentation: Songs are segmented (e.g., into 10 s intervals) to facilitate temporal localization of aesthetic judgments.
  • Multi-dimensional Ratings: Each dimension receives multi-granularity probabilistic annotations, reflecting possible human annotator uncertainty.
  • Interval Encoding: Annotations are encoded as bin intervals spanning the continuous aesthetic score range (linearly scaled to $[0, 1]$), supporting hierarchical interval aggregation.

This protocol enables the dataset to serve as ground truth for not only scalar regression targets but also multi-level probabilistic distributions over intervals, facilitating advanced uncertainty-aware modeling.
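The interval-based annotation encoding can be sketched in a few lines of Python. This is a minimal illustration, not the dataset's documented format: the bin counts (5/10/20) and the overlap-weighting helper are assumptions chosen to show how one coarse rater interval becomes probability distributions at several granularities.

```python
def interval_to_distribution(lo, hi, k):
    """Spread a rater's interval [lo, hi] (within [0, 1]) over k equal-width
    bins, weighting each bin by its overlap with the interval (illustrative
    encoding; the dataset's exact scheme may differ)."""
    width = 1.0 / k
    masses = []
    for i in range(k):
        b_lo, b_hi = i * width, (i + 1) * width
        overlap = max(0.0, min(hi, b_hi) - max(lo, b_lo))
        masses.append(overlap)
    total = sum(masses)
    return [m / total for m in masses]

# A rater first commits to the coarse span 0.6-0.7 ("60-70 out of 100");
# the same judgment is encoded at three assumed granularities.
annotation = {k: interval_to_distribution(0.6, 0.7, k) for k in (5, 10, 20)}
# k=5:  essentially all mass in the 0.6-0.8 bin (coarse commitment)
# k=10: mass concentrated in the 0.6-0.7 bin
# k=20: mass split across the 0.60-0.65 and 0.65-0.70 bins
```

Coarser granularities commit to wide spans while finer ones localize the score, mirroring the coarse-to-fine rater protocol described above.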

3. Integration with Multi-Stem Attention and HiGIA Framework

SongEval is explicitly paired with the Multi-Stem Attention Fusion (MSAF) and Hierarchical Granularity-Aware Interval Aggregation (HiGIA) modules (Lv et al., 18 Jan 2026), serving both as a development benchmark and as empirical validation for these architectural innovations.

  • MSAF Backbone: Input song waveforms undergo source separation; mixture-vocal and mixture-accompaniment pairs are processed via MuQ encoders and CBAM fusion, followed by stacked transformers, culminating in a shared representation $h$.
  • HiGIA Uncertainty Modeling: From $h$, parallel classifier heads predict score distributions at three levels of granularity; bins with sufficient probability mass form interval candidates, which are aggregated into a consensus interval $[L, U]$. Regression within $[L, U]$ produces final scores, closely mimicking human annotation logic.

4. Probabilistic Formulation and Annotation Handling

For each annotation granularity $l$ (with $L = 3$ in practice), a classifier head $C^{(l)}$ outputs logits $\mathbf{z}^{(l)}$ over $K_l$ bins. These are transformed to probabilities $p_k^{(l)}$ via softmax and mapped to intervals $r(c_k^{(l)})$ within $[0, 1]$.

  • Candidate Bin Selection: Bins where $p_k^{(l)} > 1/K_l$ are aggregated per granularity.
  • Interval Aggregation: A global set $S$ of candidate bins is partitioned into overlapping ($O$) and isolated ($I$) subsets, informing interval determination via averaging and weighted extrema depending on classifier agreement.
  • Final Score: A two-layer MLP predicts an interpolation coefficient $\alpha$, yielding $\hat{y} = (1 - \alpha)L + \alpha U$.
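The selection and interpolation steps above can be sketched as follows. This is a simplified stand-in, not the paper's implementation: the overlap/isolated weighting for the consensus interval is replaced by plain endpoint averaging, and the logits, bin counts, and the fixed α value are invented for illustration (the full model predicts α with an MLP).

```python
import math

def softmax(z):
    m = max(z)
    exps = [math.exp(x - m) for x in z]
    s = sum(exps)
    return [e / s for e in exps]

def candidate_intervals(logits, k):
    """Keep bins whose probability exceeds the uniform level 1/k and
    return their [lo, hi] spans within [0, 1]."""
    probs = softmax(logits)
    width = 1.0 / k
    return [(i * width, (i + 1) * width)
            for i, p in enumerate(probs) if p > 1.0 / k]

# Invented logits for three granularities (K = 5, 10, 20), all agreeing
# that the score lies around 0.6-0.7.
heads = {
    5:  [0.0, 0.1, 0.2, 2.0, 0.0],
    10: [0.0] * 6 + [3.0] + [0.0] * 3,
    20: [0.0] * 12 + [2.5, 2.5] + [0.0] * 6,
}

cands = [iv for k, z in heads.items() for iv in candidate_intervals(z, k)]

# Simplified consensus: average candidate endpoints to get [L, U]
# (the paper's agreement-dependent weighting is elided here).
L = sum(lo for lo, _ in cands) / len(cands)
U = sum(hi for _, hi in cands) / len(cands)

alpha = 0.4  # stand-in for the MLP-predicted interpolation coefficient
score = (1 - alpha) * L + alpha * U  # regression within [L, U]
```

The final score always lands inside the consensus interval, which is what makes the prediction interpretable in terms of the raters' coarse-to-fine logic.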

Training optimizes the sum of cross-entropy losses for each classifier head and a mean-squared-error loss for the regressor, balanced by a hyperparameter $\lambda$.
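A minimal sketch of this combined objective, with invented target distributions, predictions, and λ value (in the real system these come from the annotation distributions and model outputs):

```python
import math

def cross_entropy(target, probs, eps=1e-12):
    # CE between a target distribution over bins and predicted probabilities.
    return -sum(t * math.log(p + eps) for t, p in zip(target, probs))

def total_loss(ce_terms, y_hat, y_true, lam=1.0):
    """Sum of per-granularity classification losses plus a regression
    term weighted by lam (the balancing hyperparameter)."""
    return sum(ce_terms) + lam * (y_hat - y_true) ** 2

# Two illustrative classifier heads (coarse and fine) and a scalar score.
ce_terms = [
    cross_entropy([0, 0, 0, 1, 0], [0.05, 0.05, 0.1, 0.7, 0.1]),
    cross_entropy([0] * 6 + [1] + [0] * 3, [0.03] * 6 + [0.73] + [0.03] * 3),
]
loss = total_loss(ce_terms, y_hat=0.65, y_true=0.68, lam=0.5)
```

Each classifier head contributes its own cross-entropy term, so sharpening any single granularity reduces the loss independently of the others.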

5. Empirical Performance and Ablation Studies

Ablation experiments on SongEval and human-created internal datasets demonstrate the impact of hierarchical interval-based modeling. Removal of the HiGIA module leads to marked degradation in evaluation metrics for multi-dimensional song aesthetics prediction:

Dataset    MSE (w/o HiGIA)  MSE (Full)  LCC (w/o HiGIA)  LCC (Full)  SRCC (w/o HiGIA)  SRCC (Full)  KTAU (w/o HiGIA)  KTAU (Full)
SongEval   0.293            0.266       0.889            0.892       0.886             0.890        0.715             0.721
Internal   30.3             23.7        0.854            0.877       0.857             0.878        0.674             0.705

These results substantiate the efficacy of SongEval as a high-precision resource for uncertainty-aware song aesthetics modeling.

6. Comparative Context and Theoretical Implications

While HiGIA was originally developed for hierarchical pooling in video action recognition (Mazari et al., 2020), its adaptation to SongEval underscores the generality of coarse-to-fine, multi-granularity aggregation strategies for uncertainty modeling in human annotation tasks. SongEval’s annotation and evaluation protocol aligns with contemporary probabilistic frameworks, extending their utility to the music domain.

A plausible implication is that SongEval can serve as a blueprint for analogous datasets in related generative content domains where scalar evaluation is insufficient and human perception is inherently ambiguous.

7. Applications and Prospective Influence

SongEval is positioned to advance research in automated music evaluation, generative AI model benchmarking, preference modeling, and explainable machine learning for creative domains. Its interval-distribution annotation schema enables:

  • Benchmarking architectural innovations in song aesthetics modeling.
  • Training models to express and calibrate uncertainty over interval predictions.
  • Facilitating cross-domain adaptations of hierarchical interval aggregation schemes.

This suggests that SongEval may become foundational for future work on principled, uncertainty-aware generative art evaluation in music and beyond.
