Auditory-Based Objective Perceptual Metrics

Updated 4 July 2025
  • Auditory-based objective perceptual metrics are computational measures that predict human sound perception by modeling auditory processing and cognitive responses.
  • They leverage psychoacoustic modeling, machine learning, and signal processing techniques to simulate subjective listening tests and optimize audio systems.
  • These metrics are applied in speech enhancement, audio coding, generative synthesis, and spatial rendering to improve real-world audio quality and system performance.

Auditory-based objective perceptual metrics are computational measures designed to model and predict human perceptions of sound quality, similarity, or salience. These metrics serve as scalable, repeatable surrogates for human listening tests, enabling audio system developers to assess or optimize signal fidelity, intelligibility, spatial realism, and perceptual equivalence in domains ranging from speech enhancement and coding to generative synthesis and advanced spatial audio rendering. Their development and application reflect a multidisciplinary integration of auditory neuroscience, psychophysics, signal processing, and machine learning.

1. Principles and Foundations

Auditory-based objective perceptual metrics draw upon the premise that computational models can approximate subjective judgments of audio quality by mimicking aspects of the human auditory system and cognitive decision-making. Foundational approaches include:

  • Psychoacoustic Modeling: Early frameworks such as PEAQ (ITU-R BS.1387) utilize models of masking, loudness, and disturbance in the peripheral auditory pathway to derive features (“model output values,” MOVs), which are then mapped to subjective scores (2212.01467).
  • Perceptual Distance in Manifold Spaces: More recent work treats perception as a nonlinear transformation from physical signal space to a perceptual manifold, formalized as a Riemannian metric derived from the Jacobian of the transformation (2011.00088). This captures the interdependence and nonlinearities introduced by cochlear, neural, and cognitive processing.
  • Cognitive and Salience Modeling: Increasing evidence underscores the importance of cognitive phenomena (informational masking, perceptual streaming, semantic salience). Models such as the Cognitive Salience Model (CSM) adaptively weight distortion metrics according to data-driven cognitive effect metrics, using salience-based optimization to better parallel subjective ratings (2212.04572, 2411.18222, 2307.06656).

The design of perceptual metrics thus moves beyond mere energy difference computations, instead leveraging comprehensive, often hierarchical, representations of signal transformations within the human auditory system.
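The manifold view above admits a compact numerical illustration: given a transform f from signal space to perceptual space, the induced local metric is G(x) = J(x)ᵀJ(x), where J is the Jacobian of f, and a first-order perceptual distance follows as √(dᵀGd). A minimal numpy sketch, with a toy compressive nonlinearity standing in for cochlear/neural processing (all names here are hypothetical, not from the cited works):

```python
import numpy as np

def perceptual_transform(x):
    # Toy stand-in for auditory processing: a compressive nonlinearity.
    return np.sign(x) * np.abs(x) ** 0.3

def jacobian(f, x, eps=1e-6):
    # Numerical Jacobian of f at x via forward finite differences.
    fx = f(x)
    J = np.zeros((fx.size, x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (f(x + dx) - fx) / eps
    return J

def perceptual_distance(x, y, f=perceptual_transform):
    # First-order Riemannian distance: ds^2 = d^T G d with G = J^T J,
    # evaluated at the midpoint between the two signals.
    mid = 0.5 * (x + y)
    J = jacobian(f, mid)
    G = J.T @ J
    d = y - x
    return float(np.sqrt(d @ G @ d))

x = np.array([0.5, 0.2, 0.1])
y = x + 0.01
print(perceptual_distance(x, y))
```

Because G varies with x, the same physical perturbation yields different perceptual distances in different signal regions, which is precisely the nonlinearity the manifold formulation is meant to capture.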

2. Methodologies and Computational Frameworks

The implementation of auditory-based perceptual metrics encompasses diverse methodologies:

  • Signal-Derived Features: Most metrics process audio through auditory-inspired filterbanks (e.g., gammatone, ERB), measuring disturbance or similarity in domains such as loudness, modulation, or temporal envelopes (2004.09584, 2212.01467). These features are normalized and aggregated across time/frequency, then mapped to a perceptual scale.
  • Machine Learning–Based Mappings: Statistical or neural models (e.g., MARS, SVR, deep CNNs) are trained to map perceptual features to subjective ratings using large listening-test datasets (2001.04460, 2411.18222). Salient examples include differentiable deep metrics trained end-to-end against human JND annotations (2001.04460) or crowdsourced MUSHRA/MOS scores (2110.01763, 2504.20447).
  • Cognitive Interaction and Adaptive Weighting: Recent models employ adaptive nonlinear weighting—detection probability weights, often sigmoidal—learned by maximizing the correlation between cognitive effect metrics (e.g., informational masking, speech/music probability) and the salience of specific distortion features (2212.04572, 2411.18222).
  • Triplet and Contrastive Learning: To better model human discrimination over both subtle (JND-level) and gross (large-scale) differences, frameworks such as CDPAM use contrastive or triplet loss pretraining, achieving strong generalization to out-of-domain perturbations (2102.05109).
  • Spectrogram/Image-Based Metrics: Motivated by analogies between visual and auditory neural processing, image perceptual metrics (MS-SSIM, NLPD) are adapted to audio by applying them to spectrogram representations, achieving state-of-the-art correlation with perceived music quality (2305.11582, 2409.17069).
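The signal-derived feature pipeline in the first bullet can be sketched end to end: frame the audio, pool spectral energy into ERB-spaced bands, log-compress, and compare reference and degraded renditions. A simplified numpy sketch, assuming Gaussian bands as a stand-in for true gammatone filters; the `disturbance` aggregation is illustrative, not any standard's:

```python
import numpy as np

def erb_space(fmin, fmax, n):
    # Center frequencies equally spaced on the ERB-rate scale
    # (Glasberg & Moore: ERB-rate = 21.4 * log10(1 + 0.00437 f)).
    erb = lambda f: 21.4 * np.log10(1.0 + 0.00437 * f)
    inv = lambda e: (10.0 ** (e / 21.4) - 1.0) / 0.00437
    return inv(np.linspace(erb(fmin), erb(fmax), n))

def band_loudness(signal, sr=16000, n_bands=8, frame=512):
    # Frame the signal, take magnitude spectra, pool energy into
    # Gaussian bands centred on ERB-spaced frequencies, log-compress.
    n_frames = len(signal) // frame
    frames = signal[: n_frames * frame].reshape(n_frames, frame)
    spec = np.abs(np.fft.rfft(frames * np.hanning(frame), axis=1))
    freqs = np.fft.rfftfreq(frame, 1.0 / sr)
    centers = erb_space(50.0, sr / 2.0 - 500.0, n_bands)
    bands = np.exp(-0.5 * ((freqs[None, :] - centers[:, None])
                           / (0.2 * centers[:, None])) ** 2)
    energy = spec ** 2 @ bands.T            # (frames, bands)
    return np.log10(1e-8 + energy)

def disturbance(ref, deg, sr=16000):
    # Mean absolute log-band difference over time and frequency --
    # the raw feature a regression stage would map to a quality score.
    return float(np.mean(np.abs(band_loudness(ref, sr) - band_loudness(deg, sr))))

rng = np.random.default_rng(0)
clean = rng.standard_normal(16000)
noisy = clean + 0.3 * rng.standard_normal(16000)
print(disturbance(clean, noisy))
```

In deployed metrics this raw feature would be normalized, weighted, and passed through the learned mapping stage described in the second bullet.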

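The triplet objective mentioned above reduces to a margin hinge on embedding distances: given an anchor, a perceptually closer "positive," and a farther "negative," the loss penalizes margin violations. A sketch of the loss alone (embeddings would come from a trained network; this is not CDPAM's exact formulation):

```python
import numpy as np

def triplet_loss(anchor, positive, negative, margin=1.0):
    # Hinge on the difference of squared embedding distances:
    # loss = max(0, d(a, p)^2 - d(a, n)^2 + margin)
    d_pos = np.sum((anchor - positive) ** 2, axis=-1)
    d_neg = np.sum((anchor - negative) ** 2, axis=-1)
    return np.maximum(0.0, d_pos - d_neg + margin)

a = np.zeros(4)
p = a + 0.1   # small (JND-level) perturbation
n = a + 1.0   # gross perturbation
print(triplet_loss(a, p, n), triplet_loss(a, n, p))
```

Training with triplets sampled across both subtle and gross perturbations is what lets such metrics order differences at very different scales.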
3. Major Models and Domains of Application

A diverse ecosystem of metrics targets different audio tasks and domains, each driven by distinct modeling priorities:

| Domain | Representative Metrics | Key Modeling Aspects |
|---|---|---|
| Speech Quality | PESQ, POLQA, DNSMOS P.835, APG-MOS | Masking, speech-specific features, multidimensional ratings (speech/noise/overall); integration with cochlear/semantic modeling (2110.01763, 2504.20447) |
| Generic Audio | ViSQOL v3, PEMO-Q, ViSQOLAudio, PEAQ | Full-reference, gammatone spectrograms, loudness, structural similarity, cognitive extensions (2004.09584, 2212.01467, 2212.04572) |
| Generative/Neural | CDPAM, differentiable DNN metrics, SCOREQ | JND alignment, contrastive/pretext learning, codec awareness, triplet learning for large/small differences (2102.05109, 2506.00950) |
| Spatial/3D Audio | SAQAM, AEP, PBC, DRMSP | Multitask LQ/SQ, DOA estimation, perceptual externalization/coloration, MMDS-based latent alignment (2206.12297, 2507.02815) |

Metrics vary in reference requirements (full-reference, non-intrusive), coverage (single-task, multi-dimensional), and computational strategy (statistical model vs. deep net).

4. Experimental Calibration and Validation

Empirical validation underpins all perceptual metrics, ensuring alignment with human subjective data:

  • Listening Test Calibration: MUSHRA and MOS datasets, including crowdsourced and laboratory-validated scores, remain standard. Metrics are tuned to maximize Pearson or Spearman correlations with these scores (2110.11438, 2506.00950). Codec-aware metrics such as SCOREQ specifically address observed underperformance of classical models on neural codecs.
  • Cognitive Salience Estimation: Key advances emerge from calculating the correlation between cognitive effect measures and subjective salience of distortions, ensuring model structure and weighting are data-driven (2212.04572, 2411.18222). Detection probability weighting and interaction cost functions formalize this procedure.
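The correlation-maximizing calibration described above comes down to computing Pearson and Spearman agreement between a metric's outputs and listener scores. A small numpy sketch (the score values are illustrative, not from any dataset):

```python
import numpy as np

def pearson(x, y):
    # Linear correlation between metric outputs and subjective scores.
    return float(np.corrcoef(x, y)[0, 1])

def spearman(x, y):
    # Rank correlation: Pearson on the rank-transformed values
    # (double argsort gives ranks when there are no ties).
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson(rank(x), rank(y))

metric = np.array([1.2, 2.0, 2.9, 3.5, 4.8])   # objective scores
mos    = np.array([1.0, 2.1, 3.0, 3.4, 4.9])   # mean opinion scores
print(round(pearson(metric, mos), 3), round(spearman(metric, mos), 3))
```

Spearman is often preferred during tuning because a monotonic but nonlinear metric-to-MOS mapping still earns a perfect rank correlation.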

For spatial audio, alignment of latent HRTF distances with perceptual predictions via metric multidimensional scaling (MMDS) has been demonstrated to yield stronger correlation with subjective similarity, coloration, and externalization (2507.02815).
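The MDS step can be illustrated with classical (Torgerson) scaling, which embeds a distance matrix by double-centering its squared entries and taking the leading eigenvectors. A numpy sketch on synthetic distances (not the HRTF data of 2507.02815):

```python
import numpy as np

def classical_mds(D, k=2):
    # Classical scaling: B = -0.5 * J D^2 J with centering matrix J,
    # then embed with the top-k eigenvectors scaled by sqrt(eigenvalue).
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n
    B = -0.5 * J @ (D ** 2) @ J
    w, V = np.linalg.eigh(B)
    idx = np.argsort(w)[::-1][:k]
    return V[:, idx] * np.sqrt(np.maximum(w[idx], 0.0))

# Synthetic "perceptual" distances from 2-D points; the embedding should
# recover the pairwise geometry up to rotation/reflection.
pts = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
D = np.linalg.norm(pts[:, None] - pts[None, :], axis=-1)
X = classical_mds(D, k=2)
D_hat = np.linalg.norm(X[:, None] - X[None, :], axis=-1)
```

Aligning latent HRTF distances to a space recovered this way from subjective dissimilarities is what yields the improved correlations reported for similarity, coloration, and externalization.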

5. Practical Implications, Limitations, and Deployment Strategies

Auditory-based objective perceptual metrics are central to the efficient development, tuning, and evaluation of modern audio systems:

  • Real-Time and Resource-Constrained Contexts: Perceptual metrics such as P-Reverb optimize computational expense in interactive sound propagation by localizing the need for expensive reverberation computations to perceptually relevant regions (1902.06880).
  • Codec Development and Benchmarking: Metrics calibrated across both DSP and neural/generative codecs enable more accurate and fair benchmarking, especially as neural codecs introduce artifacts not anticipated by classic models (2110.11438, 2506.00950).
  • End-to-End Optimization: Differentiable metrics enable their use as loss functions for training enhancement, synthesis, and rendering models, directly improving perceptual quality (for instance, in denoising or vocoding tasks) (2001.04460, 2102.05109, 2206.12297).
  • Domain Independence and Universality: Strategies built on robust perceptual models and trained on diverse datasets (e.g., the 2f-model/SI-SA2f) achieve high cross-domain correlation, crucial for standardized, scalable assessment (2110.11438).
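A representative differentiable objective of the kind used for end-to-end optimization is the multi-resolution STFT loss. The numpy sketch below shows the forward computation only; in practice it would be written in an autodiff framework such as PyTorch or JAX so that gradients reach the model, and the resolution choices here are illustrative:

```python
import numpy as np

def stft_mag(x, n_fft, hop):
    # Hann-windowed magnitude STFT via strided framing.
    n_frames = 1 + (len(x) - n_fft) // hop
    idx = np.arange(n_fft)[None, :] + hop * np.arange(n_frames)[:, None]
    return np.abs(np.fft.rfft(x[idx] * np.hanning(n_fft), axis=1))

def multires_stft_loss(pred, target,
                       resolutions=((512, 128), (1024, 256), (2048, 512))):
    # Sum of spectral-convergence and log-magnitude L1 terms over several
    # window/hop configurations, as used for vocoder/denoiser training.
    loss = 0.0
    for n_fft, hop in resolutions:
        P, T = stft_mag(pred, n_fft, hop), stft_mag(target, n_fft, hop)
        sc = np.linalg.norm(T - P) / (np.linalg.norm(T) + 1e-8)
        log_l1 = np.mean(np.abs(np.log(T + 1e-8) - np.log(P + 1e-8)))
        loss += sc + log_l1
    return float(loss)

rng = np.random.default_rng(1)
x = rng.standard_normal(8000)
print(multires_stft_loss(x, 0.9 * x))
```

Using several resolutions trades off time and frequency localization, so artifacts visible at one analysis scale but not another still contribute to the gradient.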

Limitations and challenges include sensitivity to reference quality (in full-reference settings), the need for continually updated training data for generalization, and the risk of misalignment with listener preference in domains with insufficiently captured cognitive or semantic features. Many recent advances specifically address the challenge of handling large or "out-of-distribution" differences, multi-language/multicondition settings, and music or spatial attributes.

6. Future Directions and Open Problems

Active research directions in auditory-based objective perceptual metrics include:

  • Expanded Cognitive Modeling: There is a growing emphasis on integrating central/cognitive phenomena—informational masking, perceptual streaming, auditory scene analysis—as explicit, separately weighted dimensions in metric design, using data-driven interaction analysis (2212.04572, 2411.18222, 2307.06656).
  • Biologically Informed Deep Networks: State-of-the-art systems increasingly fuse physiological modeling (e.g., cochlear-derived representations in APG-MOS), semantic analysis, and deep attention architectures to improve interpretability and perceptual alignment (2504.20447).
  • Universal and Domain-Adaptive Metrics: Enhanced generalization through broad training/validation datasets, explicit modeling of codec/domain effects, and flexible, extensible model architectures is enabling deployment in emerging domains such as generative synthesis and spatial audio (2110.11438, 2507.02815).
  • Crowdsourcing and Subjective Protocol Scalability: Adapted subjective test procedures, combined with modern codec-aware metrics, support scalable, repeatable evaluation workflows across platforms with non-expert listeners, preserving alignment with expert judgments (2506.00950).
  • Integration with Image/Multimodal Models: Cross-modal transfer (from image to audio) via perceptual metrics (NLPD, MS-SSIM) on spectrograms, supported by underlying neural mechanisms, further expands accessible tools and inspires new model hybrids (2305.11582, 2409.17069).
  • Standardization: The field remains actively engaged in defining open datasets, benchmarks, and interpretability standards to facilitate reliable, comparable, and explainable evaluation across research and industry applications (2209.00130).

7. Summary Table: Representative Metric Characteristics

| Metric/Approach | Model Basis | Targeted Domain/Impact | Reference |
|---|---|---|---|
| PEAQ | Masking/loudness MOVs + regression mapping | Classical audio codec evaluation | (2212.01467) |
| ViSQOL v3 | Gammatone spectrogram + NSIM + SVR | Speech, music, production audio | (2004.09584) |
| 2f-model, SI-SA2f | PEAQ-derived features + cross-domain regression | Universal (audio coding, source separation) | (2110.11438) |
| DNSMOS P.835 | Deep CNN + P.835 labels (SIG/BAK/OVRL) | Speech enhancement, non-intrusive | (2110.01763) |
| DPAM, CDPAM | Deep metric learning (JND, contrastive/triplet, self-supervised) | Differentiable, robust, broad-domain | (2001.04460, 2102.05109) |
| SAQAM | Deep CNN + multitask SQ/LQ | Spatial audio, binaural, AR/VR | (2206.12297) |
| CSM, CSM+ | Adaptive weighting of distortion metrics by cognitive effect metrics (salience) | Audio coding, blind source separation | (2212.04572, 2411.18222) |
| APG-MOS | Biologically grounded cochlear + semantic modeling | Synthetic speech (MOS ranking) | (2504.20447) |

Auditory-based objective perceptual metrics continue to evolve as essential tools for aligning computational audio quality assessment with complex realities of human perception. The field's trajectory is toward increasingly nuanced, biological, and cognitively informed models, leveraging large empirical datasets and cross-domain insights, and enabling both efficient automated assessment and end-to-end optimization across emerging audio technologies.
