Holistic Evaluation of Audio Representations (HEAR)

Updated 9 April 2026
  • HEAR is a holistic benchmarking framework that evaluates audio representations across speech, music, and environmental tasks.
  • It employs frozen embeddings with shallow classifiers to compare model performance using metrics like mAP, accuracy, and F1.
  • The framework emphasizes reproducibility and has driven advances in universal audio embedding and neuro-inspired evaluation.

Holistic Evaluation of Audio Representations (HEAR) is a benchmarking framework and challenge for systematically comparing the generalizability of audio embeddings across diverse downstream tasks in speech, music, and environmental sound domains. Developed to play a unifying role analogous to that of GLUE in NLP and ImageNet in vision, HEAR provides an open, reproducible protocol for assessing whether a single “general-purpose” representation can transfer robustly to a broad array of audio classification, detection, and tagging scenarios without fine-tuning (Turian et al., 2022). Its design, evaluation philosophy, and impact have influenced the trajectory of universal audio representation research and motivated the development of subsequent, more comprehensive frameworks.

1. Motivation and Benchmark Philosophy

HEAR was motivated by the need to unify evaluation in a field fragmented by domain-specific features and embedding techniques. While deep encoders such as BERT enabled transfer across NLP tasks and vision transformers demonstrated wide applicability, audio research still relied on MFCCs, mel-spectrograms, or architectures specialized for speech, music, or sound-event applications. HEAR posed the open question: can a single feature extractor support transfer learning across the diversity of real-world audio, matching in breadth the capabilities of the human auditory system? (Turian et al., 2022)

The initiative was launched as a NeurIPS 2021 competition. It emphasized complete transparency: pip-installable code, open datasets, standardized evaluation, and a simple API for embedding extraction. This combination was designed to facilitate reproducibility, model comparison, and longitudinal studies.

2. Task Suite and Evaluation Protocol

HEAR’s benchmark suite comprises nineteen tasks from sixteen datasets, ensuring broad coverage:

  • Speech and paralinguistics: keyword spotting (Speech Commands v2), speech emotion (CREMA-D), language ID (VoxLingua107), speaker counting (LibriCount), vocal imitation.
  • Environmental sound: ESC-50, FSD50K, gunshot triangulation, beehive health, DCASE 2016 office-event detection.
  • Music: GTZAN genre and music/speech discrimination, NSynth instrument/pitch classification, MAESTRO piano transcription, Beijing Opera percussion, Mridangam recognition.

Models are evaluated in a transfer-learning regime: embeddings are frozen and only a shallow classifier (typically a one- or two-layer MLP) is trained per task (Turian et al., 2022). Submissions must expose two APIs: compute_scene_embedding (produces a fixed-size vector per clip) and compute_timestamp_embedding (produces a time-ordered sequence of embeddings for timestamp tasks). Scene-level and timestamp-level metrics are chosen to match task structure.
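
As a rough illustration, the sketch below shows what such a two-function interface might look like for a PyTorch encoder. The toy encoder, hop size, and function signatures are illustrative assumptions following the scene/timestamp roles described above, not the exact API or internals of any submitted model.

```python
# Illustrative HEAR-style embedding interface (assumed signatures, toy encoder).
import torch
import torch.nn as nn


class ToyEncoder(nn.Module):
    """Placeholder encoder: one strided convolution stands in for a real frontend."""

    def __init__(self, embedding_dim: int = 512, hop_samples: int = 800):
        super().__init__()
        self.hop_samples = hop_samples
        self.conv = nn.Conv1d(1, embedding_dim, kernel_size=1600,
                              stride=hop_samples, padding=800)

    def forward(self, audio: torch.Tensor) -> torch.Tensor:
        # audio: (batch, samples) mono waveform at a fixed sample rate
        frames = self.conv(audio.unsqueeze(1))       # (batch, dim, n_frames)
        return frames.transpose(1, 2)                # (batch, n_frames, dim)


def compute_timestamp_embedding(audio, model, sample_rate=16000):
    """Time-ordered embeddings plus the time (in ms) of each frame."""
    with torch.no_grad():
        emb = model(audio)                           # (batch, n_frames, dim)
    hop_ms = 1000.0 * model.hop_samples / sample_rate
    timestamps = torch.arange(emb.shape[1]) * hop_ms
    return emb, timestamps.expand(audio.shape[0], -1)


def compute_scene_embedding(audio, model, sample_rate=16000):
    """One fixed-size vector per clip, here via mean pooling over time."""
    emb, _ = compute_timestamp_embedding(audio, model, sample_rate)
    return emb.mean(dim=1)                           # (batch, dim)
```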

Evaluation metrics include:

  • Accuracy (multiclass)
  • Mean average precision (mAP, for multi-label tagging)
  • Event-based F1 (detection with 200 ms onset tolerance; see the onset-matching sketch after this list)
  • k-Nearest Neighbors (k-NN, as a probe for embedding structure)
  • Embedding similarity (cosine, Euclidean)
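
To make the onset-tolerant matching concrete, the following sketch scores one label's predicted onsets against reference onsets with a 200 ms tolerance using a simple greedy assignment. Official event-based metrics are computed with established toolkits (e.g., sed_eval-style scorers), so treat this as an approximation of the idea rather than the official scorer.

```python
# Toy onset-based F1 with a 200 ms tolerance (greedy one-to-one matching).
def onset_f1(ref_onsets, est_onsets, tol=0.2):
    """ref_onsets, est_onsets: onset times in seconds for one event label."""
    matched_ref = set()
    tp = 0
    for est in est_onsets:
        for i, ref in enumerate(ref_onsets):
            if i not in matched_ref and abs(est - ref) <= tol:
                matched_ref.add(i)   # each reference onset can be matched once
                tp += 1
                break
    fp = len(est_onsets) - tp        # predicted onsets with no reference match
    fn = len(ref_onsets) - tp        # reference onsets left unmatched
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0


print(onset_f1([0.50, 1.20, 2.75], [0.55, 1.45, 2.70]))  # 2 of 3 onsets match -> 0.667
```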

All evaluation is performed on held-out test splits with fixed hyperparameter grids and a single seed, preventing tuning on test data.
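
The sketch below illustrates this frozen-embedding protocol end to end: a small MLP probe is trained on stand-in scene embeddings and a multi-label task is scored with mAP. The dimensions, probe architecture, and synthetic data are placeholder assumptions, not the official hyperparameter grid.

```python
# Frozen-embedding transfer sketch: only the shallow probe is trained.
# Random tensors stand in for precomputed embeddings and multi-label targets.
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
n_train, n_test, dim, n_labels = 2000, 500, 512, 10

X_train = torch.tensor(rng.normal(size=(n_train, dim)), dtype=torch.float32)
Y_train = torch.tensor(rng.integers(0, 2, size=(n_train, n_labels)), dtype=torch.float32)
X_test = torch.tensor(rng.normal(size=(n_test, dim)), dtype=torch.float32)
Y_test = rng.integers(0, 2, size=(n_test, n_labels))

# Shallow probe: a single hidden layer on top of the frozen embedding.
probe = nn.Sequential(nn.Linear(dim, 256), nn.ReLU(), nn.Linear(256, n_labels))
opt = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss_fn = nn.BCEWithLogitsLoss()

for _ in range(20):
    opt.zero_grad()
    loss = loss_fn(probe(X_train), Y_train)
    loss.backward()
    opt.step()

# Multi-label tagging tasks are scored with macro-averaged mAP.
with torch.no_grad():
    scores = torch.sigmoid(probe(X_test)).numpy()
print("mAP on held-out split:", average_precision_score(Y_test, scores, average="macro"))
```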

3. Model Submissions and Representation Approaches

HEAR’s 2021 challenge included 29 models from 13 external teams. Notable model classes included:

  • CNNs (PANNs, YAMNet, EfficientNet-B2), typically trained on AudioSet for broad environmental semantics.
  • Transformers (PaSST, Vision Transformer-inspired spectrogram models) leveraging large-scale supervised pretraining and architectural adaptations (e.g., patch embeddings, Patchout regularization).
  • Self-supervised speech models (wav2vec 2.0, HuBERT), optimized for phonetic and lexical contrasts.
  • Music-centric models (CREPE), producing pitch-sensitive representations.
  • Self-supervised and multimodal variants (BYOL-A, OpenL3 via audio-visual correspondence, Wav2CLIP, Barlow Twins, SimSiam).

Most approaches produced a sequence of timestamp embeddings and then pooled them (e.g., by averaging) to obtain scene-level representations. Downstream heads were deliberately minimal in order to measure the intrinsic transferability of the encoder (Turian et al., 2022).

4. Key Empirical Findings and Analyses

No single representation universally dominated all tasks. Instead, HEAR revealed systematic trade-offs and clustering:

  • Pitch-sensitive nets (CREPE): Top performance on NSynth Pitch and MAESTRO note onset tasks, but inadequate for semantic tagging and speech.
  • AudioSet-trained models (PaSST, YAMNet, PANNs): High accuracy in semantic tagging, environmental detection, and certain music tasks. For example, PaSST achieved mAP 0.641 on FSD50K, exceeding PANNs and YAMNet (Koutini et al., 2022, Shah et al., 2023).
  • Self-supervised speech models: State-of-the-art on keyword spotting and other speech tasks (e.g., ~94% for BYOL-A and HuBERT+wav2vec2+CREPE for Speech Commands full), but weak on pitch-based music and less effective for environmental sound (Wu et al., 2022).
  • Task/extractor clustering: t-SNE and correlation analyses showed that extractors formed clear groups, semantic (AudioSet-trained CNNs/transformers), pitch (CREPE), and speech (wav2vec 2.0, HuBERT), and that these groups corresponded to clusters of tasks.

Dispersion analysis (Shah et al., 2023) showed that representations with more uniformly distributed activation across embedding dimensions (higher Gini index for covariance eigenspectra, lower top-2 PC dominance) correlated strongly with better downstream performance. The introduction of Batch Embedding Covariance Regularization (BECR), encouraging isotropic embedding dispersion, led to measurable gains for PaSST across Beijing Opera, NSynth Pitch, CREMA-D, and FSD50K, with no extra inference cost.
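
A rough sketch of such dispersion diagnostics, together with a simple isotropy-encouraging covariance penalty in the spirit of BECR, is given below; the exact statistics and regularizer in Shah et al. (2023) may differ, so this is an assumption-laden stand-in rather than the published formulation.

```python
# Dispersion diagnostics on a batch of embeddings, plus a BECR-inspired penalty.
import torch


def covariance_eigenspectrum(embeddings: torch.Tensor) -> torch.Tensor:
    """Eigenvalues of the batch covariance matrix, sorted in descending order."""
    centered = embeddings - embeddings.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (embeddings.shape[0] - 1)
    return torch.linalg.eigvalsh(cov).flip(0)


def top2_pc_dominance(eigvals: torch.Tensor) -> torch.Tensor:
    """Fraction of total variance captured by the top two principal components."""
    return eigvals[:2].sum() / eigvals.sum()


def gini(eigvals: torch.Tensor) -> torch.Tensor:
    """Gini coefficient of the (non-negative) eigenspectrum."""
    v = torch.sort(eigvals.clamp(min=0))[0]
    n = v.shape[0]
    idx = torch.arange(1, n + 1, dtype=v.dtype)
    return 2 * (idx * v).sum() / (n * v.sum()) - (n + 1) / n


def dispersion_penalty(embeddings: torch.Tensor) -> torch.Tensor:
    """Penalize off-diagonal covariance and uneven per-dimension variance."""
    centered = embeddings - embeddings.mean(dim=0, keepdim=True)
    cov = centered.T @ centered / (embeddings.shape[0] - 1)
    off_diag = cov - torch.diag(torch.diagonal(cov))
    return off_diag.pow(2).mean() + torch.diagonal(cov).var()


batch = torch.randn(256, 512)            # stand-in embedding batch
eigvals = covariance_eigenspectrum(batch)
print(top2_pc_dominance(eigvals), gini(eigvals), dispersion_penalty(batch))
```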

5. Advances, Extensions, and Methodological Insights

HEAR directly stimulated several lines of subsequent research and framework development:

  • Front-end exploration: Direct comparison of Constant-Q Transform (CQT) and STFT+Mel as spectrogram front ends showed that, for PaSST, STFT+Mel outperformed CQT under existing model inductive biases, due to time-frequency tiling and architectural alignment (Shah et al., 2023).
  • Resource efficiency: MLP-only encoders operating on MFCCs won several tasks (Speech Commands, Mridangam Tonic), with only ~0.4M parameters and sub-millisecond per-clip inference time, outperforming larger transformer and CNN rivals for those domains (Morshed et al., 2022).
  • Ensemble embeddings: Simple feature concatenation/aggregation across diverse backbones (speech SSL, pitch, environmental nets) achieved holistic coverage, with fused systems ranking top-2 on 12 of 18 evaluated HEAR tasks; a minimal concatenation sketch follows this list. Self-supervised speech encoders performed well on many non-speech tasks except for fine-grained music pitch/onset detection, which required explicit pitch models (CREPE) (Wu et al., 2022).
  • Robustness to channel effects: HEAR provided a platform to study representation stability under simulated microphone and acoustic perturbations. Fréchet Audio Distance (FAD) best predicted performance drop, but was most informative when used alongside topology-aware shifts. OpenL3 proved more robust than YAMNet under high/low-pass filtering, gain, and reverberation (Srivastava et al., 2022).
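
The ensemble idea above reduces to concatenating clip-level embeddings from complementary frozen backbones before the shallow probe. In the sketch below, the three backbone functions and their dimensionalities are placeholder assumptions standing in for speech, pitch, and environmental encoders.

```python
# Late fusion by embedding concatenation (backbones are random stand-ins).
import numpy as np

rng = np.random.default_rng(0)
n_clips = 100

def speech_ssl_embed(n):    # stands in for a wav2vec 2.0 / HuBERT-style encoder
    return rng.normal(size=(n, 768))

def pitch_embed(n):         # stands in for a CREPE-style pitch encoder
    return rng.normal(size=(n, 256))

def audioset_cnn_embed(n):  # stands in for a PANNs/YAMNet-style tagger
    return rng.normal(size=(n, 2048))

def standardize(x):
    # Per-backbone standardization keeps one encoder's scale from dominating.
    return (x - x.mean(axis=0)) / (x.std(axis=0) + 1e-8)

fused = np.concatenate(
    [standardize(speech_ssl_embed(n_clips)),
     standardize(pitch_embed(n_clips)),
     standardize(audioset_cnn_embed(n_clips))],
    axis=1,
)
print(fused.shape)  # (100, 3072); the fused vector feeds the usual shallow probe
```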

6. Impact, Broader Context, and Limitations

HEAR’s standardization of holistic audio benchmarking catalyzed the development of more comprehensive suites (e.g., X-ARES), which extended the evaluation regime to new domains, more tasks (22 in X-ARES), and additional modalities, and which added unparameterized k-NN evaluation for probing representation geometry (Zhang et al., 22 May 2025). These frameworks confirmed the HEAR finding that general, domain-balanced models offer the best average coverage, but domain-specific encoders are still required for leading performance in specialized settings.
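
An unparameterized k-NN evaluation of this kind can be sketched as follows: no weights are trained, so the score reflects only the local geometry of the frozen embedding space. The embedding dimensionality, neighbor count, and distance metric below are illustrative assumptions.

```python
# k-NN probe on frozen embeddings: nothing is trained beyond storing neighbors.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
X_train, y_train = rng.normal(size=(1000, 512)), rng.integers(0, 10, 1000)
X_test, y_test = rng.normal(size=(200, 512)), rng.integers(0, 10, 200)

knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(X_train, y_train)
print("k-NN probe accuracy:", knn.score(X_test, y_test))
```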

Subsequent work also linked HEAR task performance to neurophysiological “brain-likeness”: models with higher average downstream accuracy/RSA on HEAR tasks exhibited stronger alignment with human auditory cortex activity (Pepino et al., 20 Nov 2025). This suggests that holistic representation quality, as measured by HEAR-style suites, may also serve as a proxy for human-like audio abstraction.
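
A toy sketch of the representational-similarity comparison alluded to above is shown below: the model's representational dissimilarity matrix (RDM) is rank-correlated with a neural RDM. The synthetic stimuli, distance metric, and correlation choice are assumptions, not the exact pipeline of Pepino et al.

```python
# Toy representational similarity analysis (RSA) between model and neural RDMs.
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

rng = np.random.default_rng(0)
n_stimuli = 60

model_embeddings = rng.normal(size=(n_stimuli, 512))  # one vector per stimulus
neural_responses = rng.normal(size=(n_stimuli, 200))  # e.g., voxels or electrodes

# Condensed RDMs: pairwise correlation distances between stimulus responses.
model_rdm = pdist(model_embeddings, metric="correlation")
neural_rdm = pdist(neural_responses, metric="correlation")

# RSA score: rank correlation between the two dissimilarity structures.
rho, _ = spearmanr(model_rdm, neural_rdm)
print(f"model-brain RSA (Spearman rho): {rho:.3f}")
```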

Despite its breadth, HEAR is not without limitations. The benchmark does not fully address dense-structured output tasks (e.g., ASR, source separation), sequence-to-sequence learning, or multimodal settings. Several tasks—such as fine-grained non-Western instruments and bioacoustic events—remain generally unsolved, and the question of whether a truly universal, human-level audio representation is achievable remains open (Turian et al., 2022).

7. Future Directions

Key future challenges for holistic audio evaluation and representation learning include:

  • Architectural adaptation: Matching front-end transforms to model inductive biases and developing architectures adaptable to varied time-frequency tilings.
  • Representation regularization: Broadening and refining regularization schemes like BECR to other model families and to self-supervised/unsupervised pretraining (Shah et al., 2023).
  • Toward universality: Expanding holistic suites to cover more real-world phenomena, including multi-modal and dense-structured prediction.
  • Interpreting and unifying: Understanding optimal depth/level choices for feature extraction, and developing more principled ensemble and fusion methods (Wu et al., 2022).
  • Neuro-inspired evaluation: Integrating neural predictivity metrics into the evaluation loop, potentially using them as auxiliary pretraining objectives (Pepino et al., 20 Nov 2025).

Overall, HEAR has established rigorous expectations and best practices for the holistic evaluation of audio representations, providing a durable foundation for ongoing research into universal, robust, and human-like auditory embeddings.
