HEAREval Benchmark: Unified Audio Evaluation
- HEAREval Benchmark is a unified framework that systematically evaluates general-purpose audio representations across six canonical tasks using standardized datasets and z-score normalization.
- It ranks state-of-the-art models based on metrics like accuracy and mAP, highlighting the benefits of self-supervised pretraining and diverse multi-domain audio corpora.
- Beyond task scores, HEAREval is used to align model activations with human brain activity through regression and representational similarity analysis, providing insight into the emergence of brain-like representations.
The HEAREval benchmark is a unified evaluation framework for general-purpose audio representations, designed to systematically assess the capacity of neural audio models to support downstream tasks spanning music, speech, and environmental sound analysis. Developed to facilitate both reproducible benchmarking and neuroscientific comparisons, HEAREval provides a concise yet rigorous suite of six canonical audio classification and detection tasks, each accompanied by standardized datasets and robust evaluation metrics. As a focused descendant of the broader HEAR (Holistic Evaluation of Audio Representations) suite, HEAREval serves as a testbed for probing the “holisticity” of audio embeddings and for relating model performance to human brain activity (Pepino et al., 20 Nov 2025, Turian et al., 2022).
1. Benchmark Structure: Tasks, Datasets, and Metrics
HEAREval operationalizes six tasks, each targeting a distinct functional domain in audio processing:
| Task | Dataset | Task Type | Metric |
|---|---|---|---|
| Music Note Classification (NS) | NSynth | Single-label, 88 classes | Accuracy |
| Music Genre Classification (GC) | GTZAN | Single-label, 10 genres | Accuracy |
| Speech Commands Recognition (SC) | Speech Commands | Single-label, 12 classes | Accuracy |
| Speech Emotion Recognition (ER) | CREMA-D | Single-label, 6 emotions | Accuracy |
| Acoustic Event Detection (FSD) | FSD-50K | Multi-label, 200 classes | mAP |
| Environmental Sound Classification (ESC) | ESC-50 | Single-label, 50 classes | Accuracy |
For each task, a lightweight multilayer perceptron (MLP) downstream model is trained on frozen embeddings with minimal hyperparameter tuning. Task performance is scored using either top-1 accuracy or mean average precision (mAP), as appropriate. To ensure cross-task comparability, each task score is converted to a z-score across models, and the overall HEAREval score for model $m$ is defined as the mean of its six z-scores: $\text{HEAREval}_m = \tfrac{1}{6}\sum_{t=1}^{6} z_{m,t}$, where $z_{m,t} = (s_{m,t} - \mu_t)/\sigma_t$ and $\mu_t$, $\sigma_t$ are the across-model mean and standard deviation of scores on task $t$.
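The aggregation itself is straightforward; the following is a minimal sketch with hypothetical per-task scores (illustrative values only, not results from the paper):

```python
import numpy as np

# Per-task scores for each model: accuracy for the single-label tasks, mAP for FSD-50K.
# Rows are models, columns are the six HEAREval tasks (hypothetical values).
task_names = ["NS", "GC", "SC", "ER", "FSD", "ESC"]
scores = np.array([
    [0.92, 0.86, 0.97, 0.74, 0.55, 0.91],   # e.g. a strong self-supervised model
    [0.80, 0.78, 0.95, 0.62, 0.45, 0.82],   # e.g. a supervised baseline
    [0.55, 0.60, 0.90, 0.48, 0.30, 0.60],   # e.g. a domain-specialized model
])

# z-score each task column across models, then average across tasks per model.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
hear_eval_score = z.mean(axis=1)

for i, s in enumerate(hear_eval_score):
    print(f"model {i}: global z-score = {s:+.2f}")
```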
2. Model Rankings and Performance Profile
Evaluation of 36 state-of-the-art audio models on HEAREval reveals a clear stratification based on architecture and training regime. The highest-performing models are large, self-supervised systems pretrained on diverse multi-domain audio corpora with generative objectives. A representative ranking is as follows:
| Model | Global z-score |
|---|---|
| EnCodecMAE (Mixture, Large) | +1.35 |
| BEATs (iteration 3) | +1.12 |
| Dasheng 1.2B | +0.94 |
| AST (Audio Spectrogram Transformer) | +0.55 |
| CochResNet50 (multi-task) | +0.31 |
| VGGish | −0.02 |
| DeepSpeech 2 (ASR) | −0.78 |
| MetricGAN+ (speech enhancement) | −1.12 |
| Spectro-temporal baseline | −1.45 |
Self-supervised, mixture-trained models outperform high-capacity supervised classifiers and domain-specialized architectures, indicating the critical role of both learning objective and data diversity in achieving general-purpose audio representations. (Pepino et al., 20 Nov 2025)
3. Analytical Methodology for Model–Brain Alignment
HEAREval enables direct comparison of model-internal representations against human auditory cortex activity using multiple regression-based and representational similarity approaches:
- Voxel-wise Ridge Regression: For each model, subject, and voxel, ridge regression is performed to predict fMRI responses from model activations, with goodness-of-fit measured by $R^2$. The best layer and regularization strength are selected via cross-validation; results are averaged over voxels and subjects to yield a single $R^2$ per model.
- Component-wise Regression: Using auditory cortex components derived from independent fMRI decompositions (e.g., speech, music, pitch), the same procedure yields an $R^2$ per functional component.
- Representational Similarity Analysis (RSA): Dissimilarity matrices of model and fMRI responses are constructed, and their (vectorized) similarity is quantified by the maximal layer-wise Spearman correlation $\rho$ (a simplified sketch of both analyses follows this list).
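The sketch below illustrates both analyses in simplified form on synthetic data; the layer names, cross-validation scheme, and correlation-distance RDMs are illustrative assumptions, not the authors' exact pipeline.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_predict

rng = np.random.default_rng(0)

# Synthetic stand-ins: per-layer model activations (stimuli x features) and
# voxel responses (stimuli x voxels) to the same set of sound stimuli.
n_stimuli, n_voxels = 165, 500
layer_acts = {f"layer_{i}": rng.normal(size=(n_stimuli, 128)) for i in range(4)}
fmri = rng.normal(size=(n_stimuli, n_voxels))

def voxelwise_r2(acts, responses, alphas=(1.0, 10.0, 100.0)):
    """Cross-validated ridge regression from activations to every voxel;
    returns the mean R^2 over voxels for the best regularization strength."""
    best = -np.inf
    for alpha in alphas:
        pred = cross_val_predict(Ridge(alpha=alpha), acts, responses, cv=5)
        ss_res = ((responses - pred) ** 2).sum(axis=0)
        ss_tot = ((responses - responses.mean(axis=0)) ** 2).sum(axis=0)
        best = max(best, float((1.0 - ss_res / ss_tot).mean()))
    return best

def rsa_rho(acts, responses):
    """Spearman correlation between the vectorized dissimilarity matrices of
    model activations and fMRI responses (representational similarity)."""
    rdm_model = pdist(acts, metric="correlation")
    rdm_brain = pdist(responses, metric="correlation")
    rho, _ = spearmanr(rdm_model, rdm_brain)
    return rho

# Layer selection: keep the best-scoring layer for each analysis.
best_r2 = max(voxelwise_r2(a, fmri) for a in layer_acts.values())
best_rho = max(rsa_rho(a, fmri) for a in layer_acts.values())
print(f"best-layer voxel-wise R^2: {best_r2:.3f} | best-layer RSA rho: {best_rho:.3f}")
```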
A central result is the strong correlation between HEAREval performance and brain predictivity: across models, Pearson correlations between the global HEAREval z-score and both regression $R^2$ and RSA $\rho$ are high and statistically significant in the principal fMRI dataset, and remain substantial even after excluding outliers. Thus, models with superior HEAREval scores also more accurately predict and mirror auditory cortex representations. (Pepino et al., 20 Nov 2025)
4. Emergence of Brain-like Representations During Pretraining
HEAREval supports longitudinal analyses of representational dynamics. For EnCodecMAE, model checkpoints sampled during self-supervised pretraining reveal that:
- Brain similarity, as measured by layer-wise RSA against fMRI, increases sharply within the first 50–100k pretraining steps, preceding any task-specific fine-tuning.
- Higher network layers develop stronger brain-like similarity than lower layers.
- Hierarchical differentiation is observable early: higher layers decouple from primary auditory cortex but align with higher-order auditory regions, recapitulating biological auditory organization.
This emergence is well described by a simple logarithmic growth law in the number of pretraining steps (brain similarity $\approx a + b\log(\text{steps})$), indicating that predictive self-supervised objectives foster progressively more brain-like features in the absence of explicit neuroscientific supervision. (Pepino et al., 20 Nov 2025)
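A minimal sketch of fitting such a law to hypothetical checkpoint measurements (the actual checkpoint values and goodness of fit are reported in the paper, not reproduced here):

```python
import numpy as np

# Hypothetical (pretraining step, RSA similarity) pairs sampled along training.
steps = np.array([5e3, 1e4, 5e4, 1e5, 2e5, 5e5])
rsa   = np.array([0.05, 0.09, 0.17, 0.20, 0.23, 0.26])

# Fit similarity = a + b * log(steps) by least squares on log-transformed steps.
b, a = np.polyfit(np.log(steps), rsa, deg=1)
pred = a + b * np.log(steps)
r2 = 1 - ((rsa - pred) ** 2).sum() / ((rsa - rsa.mean()) ** 2).sum()
print(f"fit: similarity ≈ {a:.3f} + {b:.3f}·log(steps), R² = {r2:.3f}")
```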
5. Interpretive Synthesis and Scientific Implications
HEAREval benchmarks reveal several key principles:
- Common Task Pressure and Representational Convergence: Models optimized for a suite of human-relevant audio tasks (as in HEAREval) naturally converge on internal codes that are predictive of, and structurally aligned with, human auditory cortex activity. This supports the Platonic Representation Hypothesis—that shared task constraints drive both artificial and biological representations towards similar solutions.
- Importance of Data Diversity and Self-Supervision: Models pretrained on broad, multi-domain audio distributions and optimized via masked audio modeling or predictive objectives consistently outperform task- or domain-specific models in both downstream transfer and brain-alignment metrics.
- Practical Proxy for Model Selection: High correlation between HEAREval performance and neuroscience benchmarks (e.g., RSA on small fMRI sets) suggests that lightweight neural predictivity scores can serve as effective meta-criteria for selecting or tuning general-purpose audio models.
A plausible implication is that augmenting HEAREval with new tasks—such as speech-in-noise, multi-source transcription, and cross-modal grounding—will further refine the mapping between artificial and biological auditory systems, and that future models may directly incorporate brain-based regularization. (Pepino et al., 20 Nov 2025)
6. Relation to the Broader HEAR Benchmark
The HEAREval suite is a focused subset of the broader HEAR (Holistic Evaluation of Audio Representations) challenge, which assesses models on a wider array of 19 tasks (across 16 datasets) and supports both scene-level and timestamped embeddings (Turian et al., 2022). HEAR models are evaluated via a unified API (with standardized model wrappers), and tasks encompass classification, multilabel tagging, event transcription, and language ID. Both HEAR and HEAREval prioritize open benchmarks, reproducibility, and extensibility, providing a foundation for longitudinal research on general-purpose audio representations.
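For reference, the HEAR common API expects each submission to expose `load_model`, `get_scene_embeddings`, and `get_timestamp_embeddings`; the sketch below shows the general shape of such a wrapper, with a toy framing-and-projection "model" standing in for a real pretrained network. It is illustrative only, and details such as attribute names and return conventions should be checked against the official HEAR specification.

```python
import torch

class ToyEmbedder(torch.nn.Module):
    """Stand-in 'model': frames the waveform and projects each frame linearly.
    A real HEAR entry would load pretrained weights here."""
    sample_rate = 16000
    scene_embedding_size = 128
    timestamp_embedding_size = 128

    def __init__(self):
        super().__init__()
        self.frame = 400          # 25 ms frames at 16 kHz
        self.hop = 160            # 10 ms hop
        self.proj = torch.nn.Linear(self.frame, self.timestamp_embedding_size)

    def forward(self, audio):     # audio: (n_sounds, n_samples), float waveform
        frames = audio.unfold(1, self.frame, self.hop)   # (n_sounds, n_frames, frame)
        return self.proj(frames)                         # (n_sounds, n_frames, emb)

def load_model(model_file_path: str = "") -> torch.nn.Module:
    return ToyEmbedder()

def get_timestamp_embeddings(audio: torch.Tensor, model: torch.nn.Module):
    emb = model(audio)
    n_frames = emb.shape[1]
    # Timestamps (in milliseconds) at the centre of each frame.
    centers = (torch.arange(n_frames) * model.hop + model.frame / 2) / model.sample_rate * 1000.0
    timestamps = centers.expand(audio.shape[0], -1)
    return emb, timestamps

def get_scene_embeddings(audio: torch.Tensor, model: torch.nn.Module) -> torch.Tensor:
    emb, _ = get_timestamp_embeddings(audio, model)
    return emb.mean(dim=1)        # pool over time for a clip-level embedding
```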
HEAREval, by distilling HEAR to six canonical tasks, achieves a balance between holism and computational feasibility, facilitating both large-scale model comparison and neuroscientific investigation. The persistent challenge remains: can a single audio architecture ultimately match the holistic performance of the human auditory system across all real-world sound domains? (Turian et al., 2022)