
Scoreq: Neural Speech Quality Metric

Updated 6 October 2025
  • Scoreq is a neural-based objective speech quality metric designed to predict perceptual ratings by closely aligning with human Mean Opinion Scores.
  • It leverages deep learning on time-frequency representations; the intrusive scoreq_ref variant achieves a high Pearson correlation with human ratings (~0.87), versus roughly 0.73 for traditional metrics such as PESQ.
  • While the intrusive scoreq_ref maintains linear performance across quality ranges, the non-intrusive scoreq_nr may saturate at high MOS, limiting its discrimination in top-tier evaluations.

Scoreq is a neural objective speech quality assessment metric designed to predict perceptual quality ratings, with particular relevance to the evaluation of modern neural audio codecs. It exists in both intrusive (“scoreq_ref”, requiring a reference signal) and non-intrusive (“scoreq_nr”, operating reference-free) forms. Scoreq’s utility arises from its high correlation with human subjective listening scores across a range of codec conditions, outperforming traditional metrics such as PESQ and warpq. Correlation analysis shows that scoreq_ref closely approximates human Mean Opinion Scores (MOS) on clean speech processed by neural codecs.
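For orientation, the sketch below shows how the two variants differ in calling convention. It assumes the open-source scoreq Python package; the constructor and argument names are assumptions based on that release, not details confirmed by the evaluation paper.

```python
# Minimal sketch: scoring one decoded file with both SCOREQ variants.
# NOTE: package, constructor, and argument names are assumptions, not
# details taken from the evaluation paper (Mack et al., 29 Sep 2025).
import scoreq

# Intrusive variant: needs the clean reference and the codec output.
ref_model = scoreq.Scoreq(data_domain="natural", mode="ref")
score_ref = ref_model.predict(test_path="decoded.wav", ref_path="clean.wav")

# Non-intrusive variant: operates on the codec output alone.
nr_model = scoreq.Scoreq(data_domain="natural", mode="nr")
score_nr = nr_model.predict(test_path="decoded.wav", ref_path=None)

print(f"scoreq_ref={score_ref:.3f}  scoreq_nr={score_nr:.3f}")
```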

1. Foundations and Metric Definition

Scoreq is classified as an objective speech-quality metric for neural audio signals. Like other intrusive metrics, scoreq_ref requires access to both the original and processed speech signals. The metric’s architecture, while not exhaustively detailed in the evaluation paper (Mack et al., 29 Sep 2025), aligns with recent trends in deep learning-based perceptual modeling, leveraging neural encoders trained to regress to perceptual quality ratings based on time-frequency representations of the signals. The metric output is a scalar quality score for each audio file.

The effectiveness of scoreq is quantified primarily through correlation with human ratings. Such ratings are typically collected on a continuous scale (e.g., the MUSHRA scale), so the goal of scoreq is to produce a score that maps monotonically to subjective perception.

2. Experimental Evaluation Protocol

The assessment of scoreq was conducted on a dataset comprising 17 distinct neural codec processing conditions, each evaluated with 100 separate audio files. Each file thus yielded both a subjective quality score (obtained through formal listening tests, e.g., the MUSHRA-1S protocol) and an objective quality estimate from scoreq and other metrics.
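As a rough sketch of how this pairing is assembled (synthetic placeholder values stand in for the real listening-test results and metric outputs):

```python
# Sketch: assembling 17 conditions x 100 files = 1700 paired
# (subjective, objective) scores. Real values would come from MUSHRA-style
# listening tests and from running scoreq on each decoded file; synthetic
# placeholders are used here so the snippet runs stand-alone.
import numpy as np

rng = np.random.default_rng(0)
N_CONDITIONS, N_FILES = 17, 100

# Placeholder arrays shaped (condition, file), flattened for pooled analysis.
subjective = rng.uniform(1.0, 5.0, size=(N_CONDITIONS, N_FILES))
objective = subjective + rng.normal(0.0, 0.5, size=subjective.shape)

mos, est = subjective.ravel(), objective.ravel()
assert mos.size == 1700
```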

The reliability and perceptual validity of scoreq were measured using three statistical correlation measures:

  • Pearson correlation coefficient ($\rho_{x,y}$):

$$\rho_{x,y} = \frac{\sum_{i=1}^{N} (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_{i=1}^{N}(x_i - \bar{x})^2}\,\sqrt{\sum_{i=1}^{N}(y_i - \bar{y})^2}}$$

where $x_i$ is the scoreq estimate, $y_i$ the corresponding subjective score, and $N$ is the total number of samples (here, $N = 1700$).

  • Spearman and Kendall correlations: measure monotonicity and rank-order agreement, serving as secondary validation of the metric’s robustness; a sketch computing all three measures follows this list.
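Given pooled arrays of estimates and subjective scores, all three measures are available in SciPy. A minimal, self-contained sketch with synthetic stand-in data:

```python
# Computing the three correlation measures used in the evaluation with SciPy.
# Synthetic placeholder data stands in for the real (MOS, scoreq) pairs.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mos = rng.uniform(1.0, 5.0, size=1700)        # subjective scores
est = mos + rng.normal(0.0, 0.5, size=1700)   # correlated objective estimates

pearson_r, _ = stats.pearsonr(est, mos)    # linear agreement
spearman_r, _ = stats.spearmanr(est, mos)  # monotonic (rank) agreement
kendall_t, _ = stats.kendalltau(est, mos)  # pairwise rank concordance

print(f"Pearson={pearson_r:.3f}  Spearman={spearman_r:.3f}  "
      f"Kendall={kendall_t:.3f}")
```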

Processing and correlation calculations were automated using the VERSA evaluation toolkit.

3. Comparative Performance and Results

The summary of findings demonstrates that:

  • Scoreq_ref achieved the highest Pearson correlation (~0.87) among 45 evaluated metrics, including both traditional (e.g., PESQ, warpq) and neural alternatives (e.g., utmos).
  • Classical metrics like PESQ and warpq reached only ~0.73 in Pearson correlation, indicating a notable gap in alignment with human scores compared to scoreq_ref.
  • Spearman and Kendall correlations for scoreq_ref were similarly strong, suggesting that its ordering of processed files by perceptual quality closely matches that produced by human judges.

A tabular summary is provided below:

| Metric | Type | Pearson Correlation (Estimate) | Notable Trend |
|--------|------|--------------------------------|---------------|
| scoreq_ref | Intrusive | ~0.87 | Closest match to human ratings |
| PESQ | Intrusive | ~0.73 | Underperforms scoreq_ref |
| utmos | Non-intrusive | [not specified] | High, close to scoreq |
| scoreq_nr | Non-intrusive | [not specified] | Tends to saturate at high MOS |

The high correlation of scoreq_ref demonstrates its efficacy as a surrogate for human evaluators in the task of codec quality assessment.

4. Reliability Across Quality Ranges

A notable property of scoreq is the behavior exhibited at various subjective quality levels:

  • Scoreq_ref remains linear and discriminative across low to high quality conditions, maintaining the ability to separate good from bad codec outputs accurately.
  • Non-intrusive variants (scoreq_nr) display saturation at the high end of the MOS range: as subjective scores approach the upper bound (excellent quality), the metric’s responses tend to plateau, reducing their discriminative utility among high-performing codecs.
  • This effect was further analyzed through confidence intervals based on file count. Scoreq_ref produced narrow confidence intervals for high-quality codecs, indicating stable estimation; scoreq_nr’s intervals widened or plateaued, reflecting its lower differentiation in this region.

This suggests that for high-end codec evaluation, the intrusive version of scoreq is more suitable due to its linearity and finer granularity in mapping to listener perception.

5. Evaluation Methodology and Context

The evaluation involved the following methodology:

  • Objective and subjective scores computed for all files in all codec conditions.
  • Application of standard correlation formulas (see above) for Pearson, Spearman, and Kendall measures across the entire result set.
  • Comparison of scoreq’s rank and value against all other metrics using the same test protocol, ensuring fair and direct benchmarking.

The analysis further stratified the data by quality ranges (low, medium, high) and assessed the trends and confidence intervals, an approach necessary due to the risk of nonlinearities or saturation at boundary conditions in non-intrusive metrics.
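A sketch of such a stratified analysis is shown below; the band edges, the bootstrap scheme, and the resample count are illustrative assumptions rather than values reported in the paper.

```python
# Sketch: stratify (MOS, estimate) pairs into quality bands and bootstrap a
# confidence interval for Pearson's r within each band. Band edges and the
# number of resamples are illustrative assumptions.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
mos = rng.uniform(1.0, 5.0, size=1700)       # placeholder subjective scores
est = mos + rng.normal(0.0, 0.5, size=1700)  # placeholder objective estimates

bands = {"low": (1.0, 2.5), "medium": (2.5, 4.0), "high": (4.0, 5.0)}
for name, (lo, hi) in bands.items():
    mask = (mos >= lo) & (mos <= hi)
    x, y = est[mask], mos[mask]
    r, _ = stats.pearsonr(x, y)
    # Nonparametric bootstrap over files within the band.
    boot = [stats.pearsonr(x[idx], y[idx])[0]
            for idx in (rng.integers(0, x.size, x.size) for _ in range(1000))]
    ci_lo, ci_hi = np.percentile(boot, [2.5, 97.5])
    print(f"{name:6s} r={r:.3f}  95% CI=[{ci_lo:.3f}, {ci_hi:.3f}]  n={x.size}")
```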

6. Comparative Analysis Against Other Metrics

Within the paper, scoreq’s variants (“scoreq_ref” and “scoreq_nr”) and other metrics (such as PESQ, warpq, utmos) were compared directly. The findings established that:

  • Scoreq_ref consistently yielded the highest or among the highest correlations for intrusive metrics.
  • Scoreq_nr, while still competitive, was subject to the aforementioned saturation phenomenon, limiting its practical use for differentiating among state-of-the-art codecs at very high subjective quality.
  • Traditional metrics, while offering some discriminative power, were outperformed, especially in the context of neural codec evaluation, where artifacts differ from those produced by legacy systems.

7. Practical Impact and Limitations

Scoreq_ref offers state-of-the-art correspondence to subjective listening tests for the evaluation of neural audio codecs under clean speech conditions, making it well-suited for codec benchmarking and algorithm development.

However, its dependence on a reference signal (intrusive evaluation) limits real-world in-service monitoring compared to non-intrusive approaches, which, while more generalizable, show reduced granularity at high quality. A plausible implication is that future work may focus on improving the non-intrusive variants’ ability to discriminate in the upper quality regime or hybridizing approaches for broader applicability.


In summary, scoreq—particularly in its intrusive form—provides a highly reliable, interpretable, and perceptually aligned solution for objective speech quality evaluation of neural audio codecs, as quantified by multiple statistical correlation measures and validated through large-scale listening test comparisons (Mack et al., 29 Sep 2025).
