Objective Audio Quality Metrics
- Objective audio quality metrics are computational algorithms that estimate perceptual audio quality by emulating human auditory processes and modeling masking and modulation effects.
- They encompass full-reference, non-intrusive, and hybrid approaches, each designed for applications such as codec evaluation, speech enhancement, and source separation.
- Recent advancements integrate deep learning with psychoacoustic and cognitive salience models to improve accuracy, interpretability, and cross-domain robustness.
Objective audio quality metrics are computational algorithms designed to estimate the perceptual quality of audio signals, serving as proxies for subjective listening tests such as MUSHRA or MOS. These metrics are critical across domains (speech telephony, audio coding, enhancement, source separation, generative models, and streaming) because they enable efficient, standardized, and reproducible quantitative evaluation of system performance. Modern objective metrics range from psychoacoustically grounded models with interpretable features to deep neural non-intrusive predictors and hybrid data-driven cognitive salience architectures.
1. Taxonomy of Objective Audio Quality Metrics
Objective audio quality metrics can be structured along two primary axes: reference dependence and theoretical grounding.
- Full-Reference (Intrusive): Require both a clean reference and a test signal. Examples include PEAQ (Delgado et al., 2022), PESQ (Monjur et al., 9 Dec 2025), POLQA, ViSQOL (Chinen et al., 2020), 2f-model (Torcoli et al., 2021), and PEMO-Q.
- Non-Intrusive (Single-Ended/No-Reference): Only the processed signal is needed. These are typically DNN-based (e.g., DNSMOS (Reddy et al., 2021), NISQA, UTMOS, SCOREQ_NR (Mack et al., 29 Sep 2025)).
- Hybrid/Contextual: Combine psychoacoustic front-ends with learned mapping or cognitive modules (e.g., PEAQ-CSM (Delgado et al., 27 Nov 2024, Delgado et al., 2022), SMAQ).
Additionally, metrics are distinguished by their emphasis on different perceptual or signal-theoretic aspects: distortion energy, audibility/masking, modulation, source separation artifacts, or MOS prediction.
| Type | Reference Needed | Typical Application |
|---|---|---|
| Intrusive | Clean + Test | Codec eval, enhancement |
| Non-Intrusive | Test only | Real-world/streaming monitoring |
| Hybrid/Cognitive | Usually both | Challenging/unseen artifacts |
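The reference-dependence distinction above shows up directly in the calling convention of a metric. The following is a minimal sketch of the two interfaces; the stand-in computations (a log-spectral distance and spectral flatness) are purely illustrative placeholders, not any standardized model:

```python
import numpy as np

def intrusive_metric(reference: np.ndarray, test: np.ndarray) -> float:
    """Full-reference sketch: a score computed by comparing reference and
    test. A log-spectral distance stands in for a real model like PEAQ."""
    ref_spec = np.abs(np.fft.rfft(reference)) + 1e-12
    test_spec = np.abs(np.fft.rfft(test)) + 1e-12
    return float(np.sqrt(np.mean((20.0 * np.log10(ref_spec / test_spec)) ** 2)))

def non_intrusive_metric(test: np.ndarray) -> float:
    """No-reference sketch: a score from the test signal alone. A real
    system (DNSMOS, NISQA) would run a trained DNN here; spectral flatness
    is used only as a crude illustrative statistic."""
    spec = np.abs(np.fft.rfft(test)) + 1e-12
    return float(np.exp(np.mean(np.log(spec))) / np.mean(spec))
```

Hybrid metrics combine both: a full-reference psychoacoustic front-end whose features are then reweighted by a learned or cognitive module.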
2. Core Psychoacoustic and Statistical Principles
Most leading intrusive metrics follow a common processing chain comprising:
1. Psychoacoustic front-end: Simulate human auditory processing via filterbanks (critical-band, gammatone, ERB), auditory masking, and temporal integration. For instance, PEAQ constructs Model Output Variables (MOVs) by mapping the reference and test through Bark-scaled filterbanks and extracting band energies, masking thresholds, and modulation characteristics (Abanto-Leon et al., 2019, Delgado et al., 2022).
2. Disturbance measures: Feature domain subtractions or masks quantify types of degradation:
- Average Linear Distortion (loudness differences in overlapping TF regions)
- Additive Noise Loudness (loudness increase in test signal)
- Missing Component Loudness (masked/removed energy from reference)
- Noise-to-Mask Ratio (NMR) and modulation-difference measures reflect detectability thresholds.
3. Cognitive weighting and adaptation: Models such as PEAQ-CSM introduce Cognitive Effect Metrics (CEMs)—informational masking, perceptual streaming, speech/music priors—that modulate the role of each disturbance metric via detection-probability weights or basis functions (Delgado et al., 27 Nov 2024, Delgado et al., 2022).
4. Mapping to scalar quality index (ODG, MOS, BAQ): Regression via small neural networks (PEAQ, PEMO-Q), multivariate regression splines, or simply a weighted sum.
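The front-end and disturbance stages above can be sketched in a toy noise-to-mask ratio computation. This is not the ITU-R BS.1387 algorithm: the Bark mapping is a simple Traunmüller-style formula, and the masking threshold is just a fixed 6 dB offset below the reference band energy (a loud assumption; real models derive thresholds from spreading functions):

```python
import numpy as np

def bark_band_edges(sr, n_fft, n_bands=24):
    """Approximate Bark-scale band edges as rFFT bin indices
    (Traunmueller-style mapping; real PEAQ uses calibrated filterbanks)."""
    freqs = np.fft.rfftfreq(n_fft, 1.0 / sr)
    bark = 26.81 * freqs / (1960.0 + freqs) - 0.53
    targets = np.linspace(bark[1], bark[-1], n_bands + 1)
    return np.searchsorted(bark, targets)

def nmr_sketch(reference, test, sr=16000, n_fft=512, offset_db=6.0):
    """Toy noise-to-mask ratio: band-wise error energy compared against a
    masking threshold placed offset_db below the reference band energy."""
    hop = n_fft // 2
    win = np.hanning(n_fft)
    edges = bark_band_edges(sr, n_fft)
    n_frames = (len(reference) - n_fft) // hop + 1
    nmrs = []
    for i in range(n_frames):
        seg = slice(i * hop, i * hop + n_fft)
        ref_p = np.abs(np.fft.rfft(reference[seg] * win)) ** 2
        err_p = np.abs(np.fft.rfft((test[seg] - reference[seg]) * win)) ** 2
        for b in range(len(edges) - 1):
            band = slice(edges[b], edges[b + 1])
            mask = ref_p[band].sum() * 10.0 ** (-offset_db / 10.0) + 1e-12
            nmrs.append(10.0 * np.log10(err_p[band].sum() / mask + 1e-12))
    return float(np.mean(nmrs))
```

A full metric would compute many such disturbance features (MOVs) and map them through a regression stage to ODG or MOS, as summarized in the table below.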
| Model | Perceptual Front-End | Cognitive Model | Mapping to Score |
|---|---|---|---|
| PEAQ | Bark filterbank | Simple asymmetry | Small NN from MOVs to ODG |
| PEMO-Q | Gammatone, modulation | None | Regression to ODG |
| ViSQOLAudio | Gammatone/ERB/SSIM | None | SVR regression to MOS |
| PEAQ-CSM | Bark+modulation | DPW salience model | Linear regression (salience fns) |
3. Deep Learning-Based Metrics and Non-Intrusive Evaluation
Neural MOS predictors have emerged as state-of-the-art for non-intrusive assessment:
- CNN-based MOS regressors (e.g., DNSMOS (Reddy et al., 2021), OMOQSE (Roberts et al., 2020)): Process spectrogram (log-Mel, MFCC, or raw wave) features, outputting MOS or quality dimension scores (e.g., Sig/BAK/OVRL in DNSMOS P.835).
- Attention-based or transformer architectures (NISQA, UTMOS, SpeechQualityLLM (Monjur et al., 9 Dec 2025)): Allow multi-dimensional scoring (noisiness, coloration, discontinuity, loudness) and, in the case of SpeechQualityLLM, generative language explanations and user prompt modulation.
- Contrastive and task-specific DNNs (SCOREQ, NORESQA (Mack et al., 29 Sep 2025)): Trained on relative MOS ranking or embedding similarity.
Neural metrics excel at rapid screening and scale, but have saturation and generalization limits: they often flatten discrimination at high quality and are sensitive to mismatched artifact domains (Mack et al., 29 Sep 2025, Lanzendörfer et al., 24 Nov 2025).
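The common shape of such a non-intrusive predictor (spectral front-end, temporal pooling, regression head mapped to the 1–5 MOS range) can be sketched as follows. The network here is deliberately tiny and untrained, with random weights standing in for a model fitted to subjective ratings; the architecture and layer sizes are illustrative assumptions, not those of DNSMOS or NISQA:

```python
import numpy as np

rng = np.random.default_rng(0)

def log_spectrogram(x, n_fft=320, hop=160):
    """Frame the waveform and take log-magnitude spectra (real systems
    typically use log-Mel features instead)."""
    n_frames = 1 + (len(x) - n_fft) // hop
    frames = np.stack([x[i * hop:i * hop + n_fft] * np.hanning(n_fft)
                       for i in range(n_frames)])
    return np.log(np.abs(np.fft.rfft(frames, axis=1)) + 1e-8)

class TinyMOSNet:
    """Untrained stand-in for a DNN MOS regressor: mean-pooled spectral
    features -> one hidden layer -> sigmoid squashed to the 1-5 MOS range."""
    def __init__(self, n_in=161, n_hidden=32):
        self.w1 = rng.normal(0.0, 0.1, (n_in, n_hidden))
        self.w2 = rng.normal(0.0, 0.1, n_hidden)

    def predict_mos(self, x):
        feat = log_spectrogram(x).mean(axis=0)   # utterance-level pooling
        h = np.tanh(feat @ self.w1)
        return float(1.0 + 4.0 / (1.0 + np.exp(-(h @ self.w2))))  # map to [1, 5]
```

Multi-dimensional predictors (e.g., Sig/BAK/OVRL in DNSMOS P.835) simply attach several such heads to a shared encoder.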
4. Benchmarking, Correlation with Human Ratings, and Ongoing Limitations
Systematic benchmarking efforts with open datasets (ODAQ (Dick et al., 1 Apr 2025), USAC VT, SASSEC, NISQA) have clarified performance trends:
- Correlations: Metrics such as PEAQ-CSM, NMR, and the 2f-model have demonstrated strong Pearson correlations with subjective MUSHRA scores across diverse degradations (Dick et al., 1 Apr 2025, Torcoli et al., 2021, Delgado et al., 27 Nov 2024). State-of-the-art non-intrusive neural metrics such as SCOREQ and UTMOS achieve correlations of up to $0.87$ on standard corpora (Mack et al., 29 Sep 2025, Lanzendörfer et al., 24 Nov 2025).
- Metric failure modes: SI-SDR, SNR, SDR, and L2/MAE energy-based measures routinely underperform on perceptual datasets (|r| < 0.4), lacking robustness to masking and artifact audibility (Vinay et al., 2022, Dick et al., 1 Apr 2025, Lanzendörfer et al., 24 Nov 2025).
- Tasks and domains: Intrusive metrics (PEAQ, PESQ, POLQA, ViSQOL, 2f-model) remain top performers in controlled codec and separation contexts. DNN-based no-reference models excel for monitoring, but saturate in fine discrimination, especially above scores of ≈ 80 on a 100-point (MUSHRA-type) scale (Mack et al., 29 Sep 2025).
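The failure mode of energy-based measures is easy to see from the definition of SI-SDR itself (Le Roux et al. formulation): it projects the estimate onto the reference and compares target versus residual energy, with no notion of masking or audibility. A direct implementation:

```python
import numpy as np

def si_sdr(reference, estimate, eps=1e-12):
    """Scale-invariant SDR: project the estimate onto the reference, then
    compare projected-target energy against residual energy (in dB)."""
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference          # scaled reference best matching estimate
    residual = estimate - target        # everything SI-SDR counts as "error"
    return float(10.0 * np.log10(
        (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)))
```

Because only residual energy matters, two distortions of equal energy receive near-identical scores regardless of whether one is perceptually masked and the other clearly audible, which is why such measures correlate poorly with listening tests.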
| Metric | Speech Only | Music/Mixed | Source Sep. | General Audio |
|---|---|---|---|---|
| PESQ/POLQA | r ≈ 0.88–0.90 | up to 0.74 | <0.75 | N/A |
| PEAQ-CSM/2f-model | ≥0.86 | ≥0.87 | ≥0.86 | 0.87+ |
| SMAQ, ViSQOLAudio | ≈0.77 | 0.77–0.83 | Lower | High |
| DNSMOS/NISQA/UTMOS | ≈0.83 | Saturate | Drop as artifacts diverge | N/A |
| SI-SDR, SNR | ≈0.44 | Unreliable | Poor | Not suitable |
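The correlation figures in such benchmarks are typically Pearson coefficients between objective scores and subjective ratings, often accompanied by a rank (Spearman) correlation since the mapping from an objective scale (e.g., ODG) to MUSHRA points is nonlinear. A minimal implementation of both statistics (assuming no tied scores for the rank version):

```python
import numpy as np

def pearson_r(x, y):
    """Pearson correlation: normalized covariance of two score vectors."""
    x = np.asarray(x, float) - np.mean(x)
    y = np.asarray(y, float) - np.mean(y)
    return float(np.sum(x * y) / np.sqrt(np.sum(x ** 2) * np.sum(y ** 2)))

def spearman_rho(x, y):
    """Spearman rank correlation: Pearson on ranks, insensitive to any
    monotonic (e.g., ODG-to-MUSHRA) mapping. Ties are not averaged here."""
    rank = lambda v: np.argsort(np.argsort(v)).astype(float)
    return pearson_r(rank(x), rank(y))
```

In benchmark practice a monotonic (often third-order polynomial) fit is applied before Pearson correlation to compensate for scale warping; the rank statistic sidesteps that step entirely.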
5. Specialized Metrics for Emerging Domains
Generative audio/texture synthesis:
- FAD, Inception Score, KID: Rely on neural embeddings or classifier diversity (Vinay et al., 2022). They track distributional similarity, but lack perceptual validity in music and complex sound textures.
- Deep-feature Gram and cochlear-model metrics: Quantify parameter sensitivity by matching feature covariances—Gram-based metrics excel in discriminating temporal or spectral manipulations in audio textures. No single metric is universal; choice depends on parameter variation axis (Gupta et al., 2022).
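FAD reduces to the Fréchet distance between two Gaussians fitted to embedding sets: $\|\mu_a-\mu_b\|^2 + \mathrm{Tr}(\Sigma_a+\Sigma_b-2(\Sigma_a\Sigma_b)^{1/2})$. A self-contained sketch follows; real FAD uses embeddings from a pretrained audio model (e.g., VGGish), whereas the arrays here are arbitrary feature rows:

```python
import numpy as np

def frechet_audio_distance(emb_a, emb_b):
    """Frechet distance between Gaussians fitted to two embedding sets
    (rows = clips): ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2})."""
    mu_a, mu_b = emb_a.mean(0), emb_b.mean(0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr((cov_a cov_b)^{1/2}) via the symmetric form sqrt(A) B sqrt(A),
    # which has the same eigenvalues and stays numerically PSD.
    vals_a, vecs_a = np.linalg.eigh(cov_a)
    sqrt_a = vecs_a @ np.diag(np.sqrt(np.clip(vals_a, 0.0, None))) @ vecs_a.T
    inner = sqrt_a @ cov_b @ sqrt_a
    tr_sqrt = np.sum(np.sqrt(np.clip(np.linalg.eigvalsh(inner), 0.0, None)))
    return float(np.sum((mu_a - mu_b) ** 2)
                 + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_sqrt)
```

The Gaussian assumption is exactly why FAD tracks distributional similarity while missing perceptually salient but distributionally minor defects.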
Spatial and stereo processing:
- Mono-intrusive models fail in mid/side and hard-panned stereo conditions due to naive channel averaging. Binaural feature extensions (ITD, IACC, ILD) help only marginally and can even degrade prediction. Next-generation models are called for that jointly capture timbral salience, spatial salience, and presentation context (Delgado et al., 11 Dec 2025).
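The binaural features in question are simple to state in broadband form; real models compute them per critical band and per frame, so the following is only a sketch of the definitions:

```python
import numpy as np

def ild_db(left, right, eps=1e-12):
    """Interaural level difference in dB (broadband; binaural models
    compute this per critical band)."""
    return float(10.0 * np.log10((np.sum(left ** 2) + eps)
                                 / (np.sum(right ** 2) + eps)))

def itd_samples(left, right):
    """Interaural time difference as the cross-correlation peak lag.
    With numpy's correlate convention, the lag is negative when the
    right channel is delayed relative to the left."""
    corr = np.correlate(left, right, mode="full")
    return int(np.argmax(corr) - (len(right) - 1))
```

A hard-panned source produces an extreme ILD and an undefined ITD in the silent channel, which is precisely the regime where naive channel-averaged mono metrics break down.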
Audio enhancement and echo suppression:
- DSML and RESL separate speech-distortion and echo suppression tradeoffs in double-talk, outperforming standard SDR, and correlating tightly with DNSMOS-based MOS (Ivry et al., 2021).
Time-scale modification (TSM):
- Advanced intrusive and non-intrusive ML-based models employing MFCC, phase, and transient features (BGRU and CNN variants) offer MOS prediction approaching DE measures (Roberts et al., 2020).
6. Data-Driven Cognitive Models and Next-Generation Metric Recommendations
Data-driven cognitive salience models, notably PEAQ-CSM and PEAQ-CSM+, augment classical perceptual pipelines with low-parameter, signal-dependent weighting of distortion features modulated by perceptual and informational masking, streaming, and speech/music priors (Delgado et al., 27 Nov 2024, Delgado et al., 2022). Compared with generic black-box learning, these hybrid architectures offer:
- Robust cross-domain generalization (R > 0.80 on unseen codecs and artifacts)
- Interpretability (feature contributions aligned to psychophysics)
- Extensibility (modular addition of new features or cognitive metrics)
- Superior adaptation to new artifact types (parametric codecs, BSS, dialogue enhancement).
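The core mechanism of such cognitive salience weighting can be sketched as a detection-probability-modulated linear combination of distortion features. All numbers and the psychometric function below are hypothetical illustrations, not the fitted PEAQ-CSM parameters:

```python
import numpy as np

def detection_probability(feature, threshold):
    """Toy psychometric function: probability that a distortion of the given
    magnitude is detected; used as a salience weight. Real CEMs are fit to
    listening-test data and also encode streaming and speech/music priors."""
    return 1.0 - np.exp(-np.maximum(feature, 0.0) / threshold)

def salience_weighted_score(features, salience, base_weights):
    """Salience-modulated linear combination of distortion features:
    signal-dependent weights scale each feature before the linear map."""
    return float((features * salience) @ base_weights)
```

Under informational masking, the salience of a given feature drops toward zero, so the same raw distortion contributes less to the predicted degradation, which is how these models adapt to artifact types that fool fixed-weight regressions.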
Practical guidance from recent benchmarks:
- For high-fidelity neural codec evaluation and sensitive monitoring, reference-based psychoacoustic metrics (SCOREQ, PESQ, WARP-Q) remain necessary to avoid MOS flattening (Mack et al., 29 Sep 2025, Lanzendörfer et al., 24 Nov 2025).
- For rapid, large-scale screening (low-MOS to mid-MOS), use neural non-intrusive metrics with periodic intrusive validation.
- Metric selection must be artifact- and domain-aware; hybrid approaches yield the best results in mixed or emerging application scenarios (Dick et al., 1 Apr 2025, Delgado et al., 11 Dec 2025).
7. Open Challenges and Future Directions
Reliable, universally applicable objective audio quality metrics must address:
- Broader training sets capturing real-world distortions, source separations, and generative artifacts.
- Explicit modeling of contextual and listener effects (e.g., using context vectors or listener profiling as in SpeechQualityLLM (Monjur et al., 9 Dec 2025)).
- Audio-visual joint metrics and multi-modal semantic quality assessment for streaming, conferencing, and multimedia applications.
- Continual alignment with open subjective datasets (ODAQ, NISQA, SMOS) to recalibrate mapping and prevent domain drift.
- Further integration of spatial (binaural) and temporal (modulation) features, and explainable mapping functions for enhanced interpretability.
These priorities are expected to accelerate convergence toward robust, generalizable, and interpretable objective audio quality metrics able to underwrite audio processing research and deployment across both established and emerging signal processing paradigms.