PESQ: Perceptual Evaluation of Speech Quality

Updated 16 October 2025
  • PESQ is a reference-based objective measure that estimates speech quality by comparing a degraded signal with its clean reference using auditory modeling.
  • The algorithm applies filtering, loudness mapping, and disturbance processing to produce a raw score, which is then mapped to MOS-LQO values for comparison across telephony and codec scenarios.
  • Recent advances integrate PESQ with deep learning via differentiable surrogate models and multi-metric frameworks, though its intrusive nature can limit real-world applications.

Perceptual Evaluation of Speech Quality (PESQ) is a reference-based objective measure designed to estimate the perceived audio quality of speech signals. Introduced with ITU-T Recommendation P.862 in 2001, PESQ operates by algorithmically modeling auditory perception to provide scores that closely predict subjective human Mean Opinion Score (MOS) ratings across diverse telephony and codec scenarios. Despite being superseded by POLQA, PESQ remains widely used in academia and industry as a benchmark for traditional and deep-learning-based speech enhancement, codec evaluation, and real-world quality monitoring.

1. Core Principles and Algorithmic Structure

PESQ quantifies speech quality by comparing a degraded signal with its time-aligned clean reference. The algorithm models the auditory transform through a pipeline consisting of filtering, loudness mapping (including Bark frequency mapping and Zwicker’s law-based loudness computation), time-frequency equalization, and disturbance processing. Key disturbance features include symmetric and asymmetric disturbances that reflect masking and perceptual bias due to signal degradations such as packet loss, noise, or codec artifacts.
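As a rough illustration of what Bark frequency mapping does, the sketch below converts frequencies in Hz to the Bark scale. It uses the widely cited Zwicker-Terhardt closed-form approximation, which is an assumption made for illustration: P.862 specifies its own warped band layout, and the function name `hz_to_bark` is hypothetical.

```python
import math


def hz_to_bark(frequency_hz: float) -> float:
    """Approximate Bark value for a frequency in Hz.

    Zwicker-Terhardt approximation, used here only for illustration;
    it is not the band table defined in ITU-T P.862.
    """
    return (13.0 * math.atan(0.00076 * frequency_hz)
            + 3.5 * math.atan((frequency_hz / 7500.0) ** 2))


# Equal steps in Hz cover fewer and fewer Bark units as frequency rises,
# which is why perceptual models pool high-frequency bins into wider bands.
for f in (250.0, 500.0, 1000.0, 2000.0, 4000.0):
    print(f"{f:7.0f} Hz -> {hz_to_bark(f):5.2f} Bark")
```

The mapping is compressive: the low-frequency region, where hearing resolves pitch finely, occupies a disproportionate share of the Bark axis.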

PESQ outputs a raw score generally in the range [–0.5, 4.5]. In practical deployments, the logistic mappings defined in P.862.1 (narrowband) and P.862.2 (wideband) convert this raw score to MOS-LQO (Mean Opinion Score, Listening Quality Objective), which can be compared directly with subjective MOS ratings.
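The raw-to-MOS-LQO conversion can be sketched as a single logistic function. The coefficients below are the ones published in P.862.1 and P.862.2 as best recalled here; the function name and the `wideband` flag are hypothetical, and the values should be checked against the Recommendations before being relied on.

```python
import math


def pesq_raw_to_mos_lqo(raw_score: float, wideband: bool = False) -> float:
    """Map a raw PESQ score to MOS-LQO with a logistic function."""
    if wideband:
        slope, offset = -1.3669, 3.8224  # P.862.2 (wideband) coefficients
    else:
        slope, offset = -1.4945, 4.6607  # P.862.1 (narrowband) coefficients
    return 0.999 + (4.999 - 0.999) / (1.0 + math.exp(slope * raw_score + offset))


# The raw range [-0.5, 4.5] is squeezed into a slightly narrower MOS-LQO range.
for raw in (-0.5, 2.0, 4.5):
    print(raw, round(pesq_raw_to_mos_lqo(raw), 3))
```

Because the mapping is monotonic, it preserves the ranking of systems, but differences between scores are not preserved, so raw and mapped scores should not be mixed in one comparison.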

2. Versions, Implementations, and Standardization

Over its two-decade evolution, several PESQ versions and implementation variants have emerged:

PESQ Version          | Input Bandwidth       | Output Score Type   | Standard/Correction
----------------------|-----------------------|---------------------|----------------------------
P.862 (Original)      | Narrowband (8/16 kHz) | Raw PESQ            | Initial ITU-T, 2001
P.862.1               | Narrowband            | MOS-LQO (mapped)    | ITU-T Extension
P.862.2               | Wideband (16 kHz)     | MOS-LQO             | ITU-T Extension, 2005
P.862.2 Corrigendum 2 | Wideband (16 kHz)     | MOS-LQO (corrected) | Filter fix, systematic bias

The Corrigendum 2 update corrects filter coefficients, notably reducing systematic under-prediction (average score difference ~0.8 on ITU listening test data, with RMSE up to 0.52 and maximum differences up to 1.30) (Torcoli et al., 26 May 2025).

Multiple open-source implementations exist, including the ITU reference ANSI-C code and more recent Python/PyTorch toolkits (e.g., audiolabs/PESQ, which implements Corrigendum 2). Not all of them implement the latest corrections, so reporting the precise PESQ version and repository is essential for replicability and comparability across studies.

3. Practical Usage and Limitations

PESQ’s requirement for a clean reference (“intrusive” measurement) restricts its real-time and real-world applicability, especially where the reference signal is unavailable (such as network monitoring, streaming, or in-the-wild dataset evaluation). Traditional PESQ can also be computationally intensive, and it is defined for single-channel input, so multi-channel signals need an explicit handling strategy. The standard ITU reference implementation interleaves all channels into a single mono sequence during calculation. Alternatives include mono-downmixing or independent per-channel scoring followed by averaging, and each approach correlates differently with subjective evaluations (Torcoli et al., 26 May 2025).
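The three multi-channel strategies can be sketched as follows. This is a minimal illustration under stated assumptions: signals are plain lists of samples, `score_fn` is a hypothetical callable standing in for whichever PESQ implementation is used, and all function names are invented for this example.

```python
from typing import Callable, List, Sequence

Signal = List[float]
ScoreFn = Callable[[Signal, Signal], float]  # (reference, degraded) -> score


def interleave(channels: Sequence[Signal]) -> Signal:
    """Flatten channels frame by frame: L0, R0, L1, R1, ..."""
    return [sample for frame in zip(*channels) for sample in frame]


def downmix(channels: Sequence[Signal]) -> Signal:
    """Average equally long channels sample by sample into one mono signal."""
    return [sum(frame) / len(channels) for frame in zip(*channels)]


def score_interleaved(ref: Sequence[Signal], deg: Sequence[Signal],
                      score_fn: ScoreFn) -> float:
    """Score the interleaved sequences as if they were mono."""
    return score_fn(interleave(ref), interleave(deg))


def score_downmixed(ref: Sequence[Signal], deg: Sequence[Signal],
                    score_fn: ScoreFn) -> float:
    """Downmix both signals to mono, then score once."""
    return score_fn(downmix(ref), downmix(deg))


def score_per_channel(ref: Sequence[Signal], deg: Sequence[Signal],
                      score_fn: ScoreFn) -> float:
    """Score each channel pair independently and average the results."""
    scores = [score_fn(r, d) for r, d in zip(ref, deg)]
    return sum(scores) / len(scores)
```

The strategies are not interchangeable: degradations of opposite sign in two channels can cancel in a downmix yet still lower every per-channel score, which is one reason to state the chosen strategy alongside reported results.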

Although PESQ correlates well with MOS in many scenarios, there are situations—particularly when encountering novel degradations, enhancement artifacts, or deep learning “fooling” effects—where the PESQ metric diverges from human judgment. As a result, over-reliance on PESQ (for example, using it as the sole loss function or success criterion in model development) may fail to ensure actual subjective quality improvements.

4. Non-Intrusive PESQ Estimation and Surrogate Models

To overcome the requirement for a clean reference and enable real-time quality monitoring, several approaches substitute conventional PESQ with estimators or learned surrogates.

Neural network-based predictors (e.g., Quality-Net, PESQNet, PESQ-DNN) are trained using paired degraded and reference speech alongside ground-truth PESQ labels (Fu et al., 2018, Xu et al., 2021, Xu et al., 2023). Architectures include BLSTM-based models, convolutional encoders with self-attention aggregators, and fully convolutional recurrent networks. These models are capable of producing non-intrusive (“single-sided”) PESQ estimates, with Mean Absolute Error (MAE) as low as 0.11 and linear correlation coefficients up to 0.92 between predicted and reference PESQ (Xu et al., 2023).
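The two figures of merit quoted above, MAE and linear (Pearson) correlation between predicted and reference PESQ, can be computed as in this minimal sketch. The function names and the sample scores are hypothetical and are not taken from the cited papers.

```python
import math
from typing import Sequence


def mean_absolute_error(predicted: Sequence[float],
                        target: Sequence[float]) -> float:
    """Average absolute gap between predicted and reference PESQ scores."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)


def pearson_correlation(predicted: Sequence[float],
                        target: Sequence[float]) -> float:
    """Linear correlation coefficient between two equally long score lists."""
    n = len(target)
    mean_p = sum(predicted) / n
    mean_t = sum(target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(predicted, target))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_t = sum((t - mean_t) ** 2 for t in target)
    return cov / math.sqrt(var_p * var_t)


# Made-up scores for illustration only.
reference_pesq = [1.8, 2.4, 3.1, 3.6]
predicted_pesq = [1.9, 2.3, 3.3, 3.5]
print(mean_absolute_error(predicted_pesq, reference_pesq))
print(pearson_correlation(predicted_pesq, reference_pesq))
```

The two numbers answer different questions: a predictor with a constant offset can have perfect correlation and a large MAE, so both are usually reported together.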

A notable insight is that the performance gap between intrusive and non-intrusive surrogate models is often small, especially when using robust architectures and large, diverse training sets (Xu et al., 2022).

5. Integrating PESQ into Deep Learning: Differentiable Surrogates and Loss Construction

Because the original PESQ algorithm is non-differentiable and cannot supply gradients, direct loss integration in neural network training is precluded. Several strategies address this:

  1. Learning a differentiable surrogate: Supervised models (e.g., WaveNet-based or CNN surrogates) are trained to approximate PESQ, enabling their use as loss functions in end-to-end enhancement networks (Elbaz et al., 2017, Fu et al., 2019). These surrogates allow gradients to be propagated back to the enhancement models.
  2. Reinforcement learning or adversarial stabilization: Alternating actor-critic-like training (where an enhancement network is optimized using “critics” approximating PESQ) stabilizes learning and prevents adversarial “fooling” of the surrogate (Kawanaka et al., 2020).
  3. Multi-task or pseudo-label learning: PESQ is used as a pseudo or auxiliary label in semi-supervised multi-task frameworks to regularize training (e.g., in MTQ-Net with other metrics like STOI and SDI (Zezario et al., 2023)), guiding the encoder toward perceptually salient features.

When used as a loss, surrogate-based or pseudo-labeled PESQ reduces the mismatch between the optimization objective and perceptual quality. Reported improvements over MSE-trained models are typically 0.1–0.2 PESQ points, along with closer alignment to MOS.
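The idea behind the first strategy, passing gradients through a frozen differentiable surrogate, can be shown with a deliberately tiny toy. Everything here is an assumption made for illustration: the surrogate is a linear function of the samples, the "enhancement model" is a single gain parameter, and the gradient is derived by hand. Real systems use neural surrogates and automatic differentiation, and none of the names below come from the cited papers.

```python
from typing import Sequence

MAX_SCORE = 4.5  # upper end of the raw PESQ range, used as the target


def surrogate_score(signal: Sequence[float], weights: Sequence[float],
                    bias: float) -> float:
    """Frozen, differentiable stand-in for PESQ (toy linear model)."""
    return bias + sum(w * s for w, s in zip(weights, signal))


def quality_loss(gain: float, noisy: Sequence[float],
                 weights: Sequence[float], bias: float) -> float:
    """Squared gap between the surrogate score of gain * noisy and MAX_SCORE."""
    enhanced = [gain * x for x in noisy]
    return (MAX_SCORE - surrogate_score(enhanced, weights, bias)) ** 2


def quality_loss_grad(gain: float, noisy: Sequence[float],
                      weights: Sequence[float], bias: float) -> float:
    """Analytic derivative of quality_loss with respect to the gain."""
    enhanced = [gain * x for x in noisy]
    gap = MAX_SCORE - surrogate_score(enhanced, weights, bias)
    d_score_d_gain = sum(w * x for w, x in zip(weights, noisy))
    return -2.0 * gap * d_score_d_gain


def train_gain(noisy: Sequence[float], weights: Sequence[float], bias: float,
               gain: float = 1.0, lr: float = 0.1, steps: int = 100) -> float:
    """Gradient descent on the gain while the surrogate stays fixed."""
    for _ in range(steps):
        gain -= lr * quality_loss_grad(gain, noisy, weights, bias)
    return gain
```

The toy also shows the risk noted earlier: the gain is tuned to whatever the surrogate rewards, so a surrogate that tracks PESQ poorly outside its training data can be "fooled" into high scores without any real quality gain.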

6. Advances in Unified and Robust Speech Quality Assessment

Recent advances focus on unified, multi-metric, and robust frameworks that overcome traditional PESQ limitations:

  • ARECHO provides a chain-based autoregressive prediction system for simultaneously estimating PESQ and other metrics (e.g., STOI, MOS), using quantization and dynamic classifier chaining to handle heterogeneous metric types and interdependencies. Confidence-oriented decoding improves reliability, and experimental results show improved regression accuracy for PESQ and cross-metric interpretability (Shi et al., 30 May 2025).
  • Comparative benchmarks on real-world noisy environments show that, in practice, perceived enhancement quality (as measured by PESQ) can diverge from simple noise suppression improvements (Khondkar et al., 17 Jun 2025). For example, CMGAN achieves PESQ scores up to 4.04 (SpEAR dataset) by prioritizing perceptual naturalness over maximum SNR gain, whereas U-Net’s aggressive denoising achieves substantially lower PESQ. These trade-offs demonstrate PESQ’s utility in highlighting perceptual artifacts and indicate that maximizing PESQ may not always align with raw signal distortion metrics.

7. Implications for Research and Standardization

The continued usage of PESQ—despite its official withdrawal—derives from its robust performance, extensive evaluation base, and wide integration in speech processing pipelines. Current research underscores the need for:

  • Precise documentation of the PESQ version and implementation in all publications (including whether Corrigendum 2 is used and how multi-channel signals are handled) (Torcoli et al., 26 May 2025), as variations can yield non-negligible score differences.
  • Careful interpretation of PESQ results, especially in machine-learning-driven system development, due to potential “fooling” scenarios and the metric’s incomplete coverage of all perceptual factors.
  • Ongoing development of non-intrusive, data-driven, and multi-metric quality assessment frameworks that combine or supersede PESQ for emerging real-world and deep learning scenarios.

PESQ remains foundational in objective speech quality measurement, but best practices now integrate well-documented usage, non-intrusive estimation, surrogate-based losses, and careful multi-metric evaluation to ensure perceptual relevance and scientific rigor.
