Absolute Category Rating (ACR) Overview
- ACR is a standard single-stimulus, reference-free quality-assessment method where observers rate stimuli on a fixed categorical scale to compute a Mean Opinion Score (MOS).
- It is widely applied in speech and video quality assessments using controlled lab setups and crowdsourced frameworks, following ITU-T guidelines such as P.800 and P.808.
- Recent research models ACR data with multinomial and latent-variable techniques, and ACR remains the target behavior for objective quality metrics such as PESQ and, indirectly, PESQ-DNN.
Absolute Category Rating (ACR) is a standard single-stimulus, reference-free subjective quality-assessment method in which observers rate the perceived quality of a stimulus on a fixed categorical scale, commonly a 5-point scale from bad to excellent, and the ratings are aggregated into a Mean Opinion Score (MOS). In contemporary research, ACR is used across speech and video quality assessment, in both laboratory settings and crowdsourcing frameworks, and it also functions as the target behavior for instrumental metrics such as PESQ that are intended to approximate human subjective judgments (Naderi et al., 2020, Xu et al., 2022, Xu et al., 2023).
1. Core definition and semantics
In the ACR protocol, each stimulus is judged independently rather than comparatively. In speech-quality studies, this is described as a listening-opinion procedure in which a participant listens to a speech stimulus and rates its perceived quality on a fixed categorical scale; in video-quality studies, the same single-stimulus logic applies to clips rated one at a time (Naderi et al., 2020, Naderi et al., 24 Sep 2025). Several cited implementations use the conventional 5-point scale:
- $1 =$ bad
- $2 =$ poor
- $3 =$ fair
- $4 =$ good
- $5 =$ excellent (Naderi et al., 2021, Xu et al., 2023, Naderi et al., 24 Sep 2025).
A defining property of ACR is that it is absolute rather than comparative. In the speech-enhancement setting, ACR listening tests are explicitly characterized as reference-free: listeners hear only the degraded or enhanced speech and rate its overall quality, without comparing it to a hidden clean target (Xu et al., 2022). The multilingual P.808 study likewise describes ACR as a “direct quality estimate without reference,” with ratings later aggregated into MOS (Sach et al., 15 Jul 2025).
MOS is the arithmetic mean of valid ACR ratings for a stimulus or condition. In speech-quality reporting, MOS is typically accompanied by the number of votes, standard deviation, and a confidence interval (Naderi et al., 2020, Naderi et al., 2021). Recent distributional work emphasizes that MOS is only one parameter of the response distribution: for a 5-level ACR scale $\{1, \dots, 5\}$, the rating distribution for a stimulus can be represented by a multinomial probability mass function $p = (p_1, \dots, p_5)$, with
$$\sum_{k=1}^{5} p_k = 1, \qquad \mu = \sum_{k=1}^{5} k\, p_k,$$
where $\mu$ is the MOS (Saupe et al., 2024).
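The reporting quantities named above can be sketched in a few lines of Python. This is only an illustration: the function name `summarize_acr`, the example votes, and the normal-approximation confidence interval are assumptions made here, not a procedure taken from the cited papers.

```python
import math
import statistics
from collections import Counter

def summarize_acr(ratings, z=1.96):
    """Summarize valid ACR votes (integers 1..5) for one stimulus or condition."""
    n = len(ratings)
    mos = statistics.fmean(ratings)
    sd = statistics.stdev(ratings) if n > 1 else 0.0
    # Assumption: normal-approximation 95% interval; a t-quantile is
    # often preferred when the number of votes is small.
    half_width = z * sd / math.sqrt(n)
    counts = Counter(ratings)
    pmf = [counts.get(k, 0) / n for k in range(1, 6)]  # empirical p_1..p_5
    return {
        "n": n,
        "mos": mos,
        "sd": sd,
        "ci95": (mos - half_width, mos + half_width),
        "pmf": pmf,
    }

# Hypothetical votes for a single stimulus.
votes = [4, 5, 3, 4, 4, 2, 5, 4]
summary = summarize_acr(votes)

# The MOS equals the mean of the empirical distribution: sum_k k * p_k.
mean_from_pmf = sum(k * p for k, p in zip(range(1, 6), summary["pmf"]))
```

Here `mean_from_pmf` and `summary["mos"]` agree, which is the sense in which the MOS is a single parameter of the full rating distribution.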
2. Standardized procedures and operational workflows
In laboratory speech testing, ACR is associated with ITU-T Rec. P.800 and controlled conditions. One cited comparison study describes laboratory ACR as being conducted in a controlled, quiet environment with short speech samples, and reports condition-level MOS computed from the votes collected per condition in one experiment (Naderi et al., 2021). In coded-speech evaluation, ACR listening tests are described as the standard design-phase method for quality assessment under controlled laboratory conditions following ITU-T P.800 (Xu et al., 2023).
For crowdsourced speech testing, ITU-T Rec. P.808 operationalizes ACR as a structured remote workflow. An open-source Amazon Mechanical Turk implementation integrates qualification directly into the main task rather than separating qualification and rating into two stages; the generated ACR assignment can include a hearing test, environment suitability test, two-eared headphone/headset check, trapping questions, and gold-standard questions (Naderi et al., 2020). The integrated qualification is reported to speed up the process, and a periodic environment certificate is reported to reduce participant working time (Naderi et al., 2020).
For crowdsourced video testing, an open-source extension of ITU-T Rec. P.910 implements ACR, ACR with hidden reference (ACR-HR), Degradation Category Rating (DCR), and Comparison Category Rating (CCR). The described framework includes rater, environment, hardware, and network qualifications, as well as gold and trapping questions, and it is reported to be accurate and highly reproducible compared to existing ITU-T Rec. P.910 lab studies (Naderi et al., 2022).
A multilingual P.808 adaptation further decomposes crowdsourced ACR into qualification, setup check, training, and rating. That work states that P.808 recommends at least eight ratings per stimulus, that unreliable ratings may be discarded, and that rejected samples must be resubmitted to the pool to ensure enough valid ratings (Sach et al., 15 Jul 2025). The same study localizes both text and audio instructions, replaces qualification and training materials with target-language versions, and uses synthetic speech for spoken instructions in the target language (Sach et al., 15 Jul 2025).
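The resubmission rule described above amounts to simple bookkeeping. The sketch below is a minimal illustration, assuming a hypothetical submission record with `stimulus_id` and `accepted` fields; it is not the data model of any cited toolkit.

```python
from collections import defaultdict

MIN_VALID_RATINGS = 8  # minimum per stimulus, as stated in the text above

def ratings_still_needed(submissions, stimuli, min_valid=MIN_VALID_RATINGS):
    """Return {stimulus_id: missing_count} for stimuli short of valid ratings."""
    valid_counts = defaultdict(int)
    for sub in submissions:
        if sub["accepted"]:  # discarded (unreliable) ratings are not counted
            valid_counts[sub["stimulus_id"]] += 1
    return {
        s: min_valid - valid_counts[s]
        for s in stimuli
        if valid_counts[s] < min_valid
    }

# Hypothetical pool: clip "a" has 8 accepted ratings, clip "b" has 6 accepted
# and 2 rejected, so "b" must go back to the pool for 2 more ratings.
submissions = (
    [{"stimulus_id": "a", "accepted": True}] * 8
    + [{"stimulus_id": "b", "accepted": True}] * 6
    + [{"stimulus_id": "b", "accepted": False}] * 2
)
to_resubmit = ratings_still_needed(submissions, ["a", "b"])
```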
3. Statistical representation of ACR data
Although ACR results are often reported only as MOS, current modeling work treats the full categorical distribution as the primary object. In a 5-level ACR setting, the empirical ratings for a stimulus define a discrete distribution over categories $1$ through $5$, with $4$ degrees of freedom because the probabilities sum to $1$ (Saupe et al., 2024). The cited work argues that different rating distributions can share the same MOS while differing in spread and disagreement, so modeling only the mean can discard relevant information (Saupe et al., 2024).
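As a small worked illustration (the two distributions are hypothetical, chosen only to make the point), write $p_k$ for the probability of category $k$ and compare a consensus distribution with a polarized one:

$$p^{(a)} = (0, 0, 1, 0, 0), \qquad p^{(b)} = \left(\tfrac{1}{2}, 0, 0, 0, \tfrac{1}{2}\right).$$

Both have the same mean, $\sum_k k\, p^{(a)}_k = \sum_k k\, p^{(b)}_k = 3$, yet their variances $\sum_k (k - 3)^2 p_k$ are $0$ and $4$ respectively, so an identical MOS can correspond either to full agreement or to a complete split between raters.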
One line of work parameterizes ACR distributions by mean and variance. The 2024 study on “maximum entropy and quantized metric models” examines families of multinomial distributions fitted by maximum likelihood and compares quantized metric models based on latent continuous variables, a novel discrete maximum entropy model, the generalized score distribution (GSD), and the empirical frequency distribution (Saupe et al., 2024). The quantized metric formulation assumes a latent continuous perceived-quality variable that is mapped into the five ACR categories by thresholds; the paper studies normal, logistic, logit-logistic, and beta latent distributions (Saupe et al., 2024).
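The quantized metric idea can be sketched for the normal case as follows. The threshold values (midpoints between adjacent categories) and the function name `quantized_normal_pmf` are assumptions made for this illustration; the cited paper's exact parameterization and fitting procedure may differ.

```python
from statistics import NormalDist

def quantized_normal_pmf(mu, sigma, thresholds=(1.5, 2.5, 3.5, 4.5)):
    """Category probabilities for a normal latent variable cut by thresholds.

    Assumption: fixed thresholds halfway between adjacent ACR categories.
    """
    latent = NormalDist(mu, sigma)
    # CDF at the cut points, padded with 0 and 1 for the two open-ended bins.
    cdf = [0.0] + [latent.cdf(t) for t in thresholds] + [1.0]
    return [cdf[k + 1] - cdf[k] for k in range(5)]

# Hypothetical latent parameters: centered on "fair" with unit spread.
pmf = quantized_normal_pmf(mu=3.0, sigma=1.0)
model_mean = sum(k * p for k, p in zip(range(1, 6), pmf))
```

With two parameters (`mu`, `sigma`) the model yields all five category probabilities, which is what makes it a two-parameter family of multinomial distributions in the sense discussed above.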
The same study reports that, on the KonIQ-10k and VQEG HDTV datasets, fitted two-parameter models predict unseen ratings better than the empirical distribution, and that continuous latent models provide fine-grained estimates of quality quantiles that are relevant to service providers (Saupe et al., 2024). This suggests that, within ACR analysis, the full multinomial structure is increasingly treated as informative in its own right rather than as a mere precursor to MOS.
A separate methodological line links ACR to pairwise-comparison data by treating both as manifestations of a common latent quality structure. In that framework, ACR scores provide an initial pairwise preference matrix, which then initializes active pair-comparison refinement under a generalized Thurstone Case III model; the paper’s position is that ACR is simple and efficient, but pair comparison is more discriminative (Ling et al., 2020).
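One plausible way to derive such an initial matrix is sketched below. It assumes independent normal latent scores with per-stimulus variances, so that the probability of preferring stimulus $i$ over $j$ is $\Phi\big((\mu_i - \mu_j)/\sqrt{\sigma_i^2 + \sigma_j^2}\big)$. This construction, the function name, and the example ratings are assumptions for illustration and are not claimed to be the cited paper's procedure.

```python
import math
from statistics import NormalDist, fmean, variance

def initial_preference_matrix(ratings_per_stimulus):
    """Estimate P[i][j] = Pr(stimulus i preferred over j) from ACR votes.

    Assumption: independent normal latent scores with unequal variances.
    Each stimulus needs at least two votes for a sample variance.
    """
    means = [fmean(r) for r in ratings_per_stimulus]
    variances = [variance(r) for r in ratings_per_stimulus]
    n = len(means)
    std_normal = NormalDist()
    prefs = [[0.5] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i == j:
                continue
            spread = math.sqrt(variances[i] + variances[j])
            diff = means[i] - means[j]
            if spread == 0.0:
                # Degenerate case: no observed disagreement on either stimulus.
                prefs[i][j] = 0.5 if diff == 0 else (1.0 if diff > 0 else 0.0)
            else:
                prefs[i][j] = std_normal.cdf(diff / spread)
    return prefs

# Hypothetical ACR votes for three stimuli.
acr_votes = [[4, 5, 4, 5, 4], [2, 3, 3, 2, 3], [3, 3, 4, 3, 3]]
P = initial_preference_matrix(acr_votes)
```

Entries near 0.5 mark pairs that ACR cannot separate well, which are natural candidates for targeted pair comparison.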
4. ACR as target behavior for objective quality models
ACR has a central role in the design and interpretation of objective quality metrics because many instrumental measures are constructed to predict subjective ACR outcomes. In coded speech, subjective ACR listening tests are described as the standard benchmark, but they are also characterized as impractical for online monitoring because they are time-consuming, costly, laboratory-bound, and unsuitable for real-time network operation (Xu et al., 2023).
PESQ is a canonical example of this relationship. The cited speech-enhancement study emphasizes a basic asymmetry: the ITU-T PESQ algorithm requires a clean reference signal, yet it was standardized to predict outcomes of reference-free ACR listening tests (Xu et al., 2022). This motivates learned surrogates such as PESQNet and PESQ-DNN that attempt to capture ACR-related perceptual structure without relying on a clean reference at inference time (Xu et al., 2022, Xu et al., 2023).
In coded-speech quality measurement, a non-intrusive PESQ-DNN predicts PESQ from the received coded speech alone and is evaluated by mean absolute error (MAE) and linear correlation coefficient (LCC). The paper reports its best total MAE and LCC for the condition without frame loss (Xu et al., 2023). The relevance to ACR is indirect but explicit: ACR/MOS is the subjective ground truth, PESQ is designed to approximate ACR, and PESQ-DNN approximates PESQ without a reference (Xu et al., 2023).
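For reference, the two evaluation measures named above can be computed as in the following sketch. The score lists are made-up values used only to exercise the functions; they are not results from the cited paper.

```python
import math
from statistics import fmean

def mean_absolute_error(predicted, target):
    """Average absolute difference between predicted and target scores."""
    return sum(abs(p - t) for p, t in zip(predicted, target)) / len(target)

def linear_correlation(x, y):
    """Pearson linear correlation coefficient (LCC) between two score lists."""
    mx, my = fmean(x), fmean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical reference PESQ scores and non-intrusive predictions.
pesq_target = [1.5, 2.5, 3.5, 4.0]
pesq_predicted = [1.7, 2.4, 3.2, 4.1]

mae = mean_absolute_error(pesq_predicted, pesq_target)
lcc = linear_correlation(pesq_predicted, pesq_target)
```

MAE captures absolute accuracy on the PESQ scale, while LCC captures how well the predictions track the ordering and spacing of the targets, so the two are usually reported together.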
The same issue is investigated more directly in speech enhancement by training a fully convolutional recurrent neural network (FCRN) deep noise suppression model using PESQNet as a mediator loss. Intrusive PESQNet variants have access to clean reference information, while a non-intrusive PESQNet does not. The reported result is that the middle-fusion intrusive PESQNet improves PESQ over both the Interspeech 2021 DNS Challenge baseline and the same DNS model trained with MSE, but its advantage over the non-intrusive PESQNet on the test set is only marginal (Xu et al., 2022). The paper concludes that a clean speech reference is not necessary for a PESQ-based perceptual loss during DNS training, aligning the learned objective more closely with the reference-free nature of ACR itself (Xu et al., 2022).
5. Relation to ACR-HR, CCR, DCR, and pair comparison
ACR sits within a larger family of subjective quality-assessment protocols. In speech-quality methodology, ACR is the baseline “listening-only” test in which stimuli are presented one at a time and rated absolutely, whereas CCR presents both reference and processed stimuli and asks listeners to rate the second relative to the first on a $-3$ to $+3$ scale (Naderi et al., 2021). Under the ITU framing cited there, ACR is generally preferred when the goal is a general, unbiased judgment of overall quality, while CCR and DCR are better suited to smaller or subtler differences (Naderi et al., 2021).
The practical trade-off is cost versus discrimination. One methodological study describes ACR as easy to run, fast, and low-cost, but also prone to observer bias, inconsistency, and differing interpretations of the quality scale; by contrast, pair comparison is described as more discriminative and less affected by scale-use bias, at the expense of a much larger comparison budget, with full pair comparison over $n$ stimuli requiring
$$\binom{n}{2} = \frac{n(n-1)}{2}$$
pairs (Ling et al., 2020). That paper’s core contribution is therefore not to replace ACR, but to use ACR results as initialization information for more targeted pair comparison (Ling et al., 2020).
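For a sense of scale, with $n$ denoting the number of stimuli, the full pair-comparison budget grows quadratically:

$$\binom{10}{2} = 45, \qquad \binom{100}{2} = 4950,$$

whereas the number of distinct presentations in ACR grows only linearly in $n$. These stimulus counts are illustrative and are not taken from the cited paper.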
In crowdsourced speech testing, CCR has been evaluated against both laboratory ACR and crowdsourced ACR. The reported Pearson and Spearman correlations show strong overall agreement, both for CMOS from crowdsourced CCR versus laboratory MOS and for CMOS versus crowdsourced MOS, but the paper also identifies impairment-dependent divergences, especially for discontinuity and coloration (Naderi et al., 2021). ACR in the laboratory and ACR in crowdsourcing were themselves highly consistent in that study on both correlation measures (Naderi et al., 2021).
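In crowdsourced video testing, ACR-HR introduces a hidden reference into an ACR-style protocol and computes a per-subject differential score between each processed video sequence (PVS) and its hidden reference (REF); in the ITU-T Rec. P.910 formulation,
$$DV(\mathrm{PVS}) = V(\mathrm{PVS}) - V(\mathrm{REF}) + 5,$$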
which is then aggregated as DMOS (Naderi et al., 24 Sep 2025). A side-by-side comparison across six studies reports that ACR-HR and ACR correlate strongly at the condition level, while CCR is more sensitive and can capture improvements beyond the reference (Naderi et al., 24 Sep 2025). The same study states that ACR-HR is approximately twice as fast and cost-effective as CCR and has lower normalized variability, but that the choice of method shifts saturation points and bitrate-ladder recommendations (Naderi et al., 24 Sep 2025). In that sense, ACR remains the default absolute protocol, but not an interchangeable substitute for comparative methods when decision sensitivity is the primary objective.
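A minimal sketch of the differential-score aggregation follows. It assumes the ITU-T P.910 form of the score, $DV = V(\mathrm{PVS}) - V(\mathrm{REF}) + 5$, applied per subject; the dictionary layout, function names, and votes are hypothetical and not taken from the cited framework.

```python
from statistics import fmean

def differential_scores(pvs_votes, ref_votes):
    """Per-subject DV = V(PVS) - V(REF) + 5 for subjects who rated both clips.

    Assumption: votes are dicts mapping a subject id to a 1..5 ACR rating.
    """
    return {
        subject: pvs_votes[subject] - ref_votes[subject] + 5
        for subject in pvs_votes
        if subject in ref_votes
    }

def dmos(pvs_votes, ref_votes):
    """Differential MOS: mean of the per-subject differential scores."""
    return fmean(differential_scores(pvs_votes, ref_votes).values())

# Hypothetical votes from three subjects for one processed clip and its
# hidden reference.
pvs = {"s1": 3, "s2": 4, "s3": 2}
ref = {"s1": 5, "s2": 4, "s3": 4}

dv = differential_scores(pvs, ref)
clip_dmos = dmos(pvs, ref)
```

Because the score is anchored at 5 when a subject rates the processed clip and the reference equally, a value above 5 arises whenever the processed clip is rated higher than its reference; how such values are handled is a protocol decision that this sketch leaves open.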
6. Reliability, multilingual deployment, and current limitations
A substantial body of recent work addresses the reliability of crowdsourced ACR. In the open-source P.808 implementation, crowdsourced MOS values were validated against a laboratory P.800 study using the English subset from ITU Supplement 23. After repeated subsampling to match the laboratory vote count, the study reports averaged agreement statistics between crowdsourced and laboratory MOS, with a further improvement after first-order mapping (Naderi et al., 2020). A reproducibility study on the INTERSPEECH Deep Noise Suppression Challenge test set likewise reports averaged agreement statistics, together with the share of submissions approved under the default filtering criteria (Naderi et al., 2020).
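First-order mapping is a linear rescaling of one set of MOS values onto another before computing the error. The sketch below uses an ordinary least-squares fit; the function names and the condition-level MOS values are hypothetical, and the cited study's exact fitting procedure is not specified here.

```python
import math
from statistics import fmean

def fit_first_order(x, y):
    """Least-squares slope and intercept mapping x onto y."""
    mx, my = fmean(x), fmean(y)
    slope = sum((a - mx) * (b - my) for a, b in zip(x, y)) / sum(
        (a - mx) ** 2 for a in x
    )
    return slope, my - slope * mx

def rmse(predicted, target):
    """Root-mean-square error between two equally long score lists."""
    return math.sqrt(fmean((p - t) ** 2 for p, t in zip(predicted, target)))

# Hypothetical condition-level MOS from crowdsourcing and from the laboratory.
crowd_mos = [1.8, 2.6, 3.1, 3.9, 4.4]
lab_mos = [1.5, 2.4, 3.0, 4.0, 4.6]

slope, intercept = fit_first_order(crowd_mos, lab_mos)
mapped = [slope * m + intercept for m in crowd_mos]

rmse_before = rmse(crowd_mos, lab_mos)
rmse_after = rmse(mapped, lab_mos)
```

A least-squares fit can never give a larger RMSE than the unmapped scores on the same data, since the identity mapping is itself a candidate line; the mapping removes systematic offset and scale differences between the two test environments.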
Those results depend on screening and post-processing. The same implementation reports that submissions passing gold-stimulus checks, environment tests, and all criteria combined were more consistent, while the headset filter did not significantly improve reliability; the authors note that headset detection succeeded in only a fraction of sessions and therefore was not conclusive as an enforcement mechanism (Naderi et al., 2020). The video crowdsourcing framework based on P.910 likewise emphasizes rater, environment, hardware, and network qualification, together with gold and trapping questions, as conditions for reliable ACR data collection (Naderi et al., 2022).
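A screening step of this kind might look like the sketch below. The record layout, the function name `passes_screening`, and the one-point tolerance on gold-standard questions are all assumptions made for illustration; the actual acceptance criteria are defined by the cited toolkits and recommendations.

```python
def passes_screening(submission, gold_tolerance=1):
    """Decide whether a crowdsourced submission is kept for MOS computation.

    Assumptions: each trapping or gold entry is a (given, expected) pair;
    trapping questions must match exactly, gold questions within a tolerance.
    """
    trapping_ok = all(
        given == expected for given, expected in submission["trapping"]
    )
    gold_ok = all(
        abs(given - expected) <= gold_tolerance
        for given, expected in submission["gold"]
    )
    return trapping_ok and gold_ok and submission["environment_test_passed"]

# Hypothetical submissions from two workers.
careful = {
    "trapping": [(2, 2)],
    "gold": [(5, 5), (2, 1)],
    "environment_test_passed": True,
}
inattentive = {
    "trapping": [(4, 2)],
    "gold": [(3, 5)],
    "environment_test_passed": True,
}

kept = [s for s in (careful, inattentive) if passes_screening(s)]
```

Only the submissions in `kept` would contribute votes to the MOS; rejected ones reduce the number of valid ratings and may trigger further collection.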
Multilingual deployment introduces an additional reliability problem. In the URGENT 2025 challenge study, the localized P.808 ACR setup collected eight ratings per utterance, but the acceptance rates under reliability checks were markedly lower for German, Chinese, and Japanese than for English (Sach et al., 15 Jul 2025). The paper reports that rates improved somewhat after resubmission, but that data collection remained significantly more challenging for non-English conditions (Sach et al., 15 Jul 2025).
Recent work also identifies a substantive limitation of ACR as a gold standard in generative speech enhancement. The URGENT study argues that, for generative methods, subjective ACR MOS and reference-free objective metrics such as DNSMOS and NISQA may fail to penalize hallucinations or phone-level corruption; the most explicit example is model #13, whose MOS, DNSMOS, and NISQA values contrasted with its results on other metrics (Sach et al., 15 Jul 2025). The paper therefore recommends accompanying P.808 ACR MOS with an intelligibility metric such as ESTOI and especially a phone-fidelity metric such as LPS (Sach et al., 15 Jul 2025). This does not negate the status of ACR as a gold-standard subjective measure; rather, it delineates the conditions under which ACR alone may be insufficient for modern generative systems.