Quadratic Weighted Kappa Overview
- Quadratic Weighted Kappa is a metric that measures agreement for ordinal data by penalizing larger discrepancies with quadratic weights.
- It computes agreement by comparing observed and expected rating matrices while adjusting for chance and emphasizing higher penalties for greater differences.
- QWK is vital in automated essay scoring and similar applications, offering interpretable, scale-sensitive insights into human-machine evaluation.
Quadratic Weighted Kappa (QWK) is a metric for quantifying agreement between two ratings over a set of categorical, ordinal labels. It is widely utilized in human–machine evaluation pipelines where the target variable is an integer-valued grade or assessment, such as automated essay scoring, image annotation, or subjective ratings in natural language understanding. QWK extends Cohen’s kappa by incorporating a quadratic penalty for larger disagreements between raters, thus emphasizing the ordinal structure of the labels and providing interpretable, scale-sensitive agreement scores.
1. Mathematical Definition and Computation
Given two raters (or a rater and an automated system) assigning labels in $K$ ordered categories on $N$ samples, QWK is constructed as follows:
- Observed Matrix ($O$): $O_{ij}$ is the count of items with true label $i$ and predicted label $j$.
- Marginals: $r_i = \sum_j O_{ij}$ (row sum for true label $i$), $c_j = \sum_i O_{ij}$ (column sum for predicted label $j$).
- Expected Matrix ($E$): $E_{ij} = r_i c_j / N$, representing the count expected under statistical independence.
- Quadratic Weights ($W$): $W_{ij} = (i - j)^2 / (K - 1)^2$, imposing stronger penalties for greater rating disagreements.
QWK is then defined as:
$$\kappa = 1 - \frac{\sum_{i,j} W_{ij} O_{ij}}{\sum_{i,j} W_{ij} E_{ij}}$$
This formulation represents one minus the ratio of observed weighted disagreement to the expected weighted disagreement under label independence. When agreement is perfect, the numerator is zero and $\kappa = 1$. When observed agreement is no better than chance, $\kappa = 0$ (Jiao et al., 31 Oct 2025, Mittal et al., 2021, Uto, 21 Apr 2026).
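The definition above translates directly into a few lines of code. The following is a minimal sketch in plain Python, assuming labels are integers in `0..K-1`; the function name `quadratic_weighted_kappa` and its parameters are illustrative and are not taken from the cited works.

```python
def quadratic_weighted_kappa(true_labels, pred_labels, num_classes):
    """QWK for integer labels in 0..num_classes-1 (illustrative helper)."""
    if len(true_labels) != len(pred_labels) or len(true_labels) == 0:
        raise ValueError("need two non-empty label sequences of equal length")
    if num_classes < 2:
        raise ValueError("need at least two ordered categories")
    K, N = num_classes, len(true_labels)

    # Observed matrix O[i][j]: items with true label i and predicted label j.
    O = [[0] * K for _ in range(K)]
    for t, p in zip(true_labels, pred_labels):
        O[t][p] += 1

    # Marginals: row sums r_i and column sums c_j.
    r = [sum(row) for row in O]
    c = [sum(O[i][j] for i in range(K)) for j in range(K)]

    observed = expected = 0.0
    for i in range(K):
        for j in range(K):
            w = (i - j) ** 2 / (K - 1) ** 2   # quadratic weight W_ij
            observed += w * O[i][j]           # weighted observed disagreement
            expected += w * r[i] * c[j] / N   # weighted expected disagreement
    if expected == 0:
        raise ValueError("QWK is undefined when both raters use one and the same label")
    return 1.0 - observed / expected


# Hypothetical ratings on a 3-point scale.
human = [0, 1, 2, 2, 1, 0, 2, 1]
model = [0, 1, 2, 1, 1, 0, 2, 2]
print(quadratic_weighted_kappa(human, model, num_classes=3))
```

The explicit error for a zero expected term is a design choice of this sketch; other implementations may return a sentinel value instead.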
2. Interpretation and Properties
QWK is tailored to ordinal data and explicitly penalizes disagreements according to their magnitude: because the penalty grows with the square of the distance, an error of one category is penalized much less than an error of two or more categories. QWK never exceeds 1, and it is bounded in $[-1, 1]$ for labels restricted to a finite range, as in grading tasks (Singla et al., 2021).
A QWK near 1 indicates almost perfect agreement adjusted for chance, while a QWK near 0 implies agreement no better than chance. Negative values arise if observed disagreement exceeds that expected by chance—indicating systematic disagreement.
QWK is sensitive to marginal distributions. If rater marginals differ significantly, the expected matrix $E$ shifts accordingly, changing the chance baseline against which agreement is measured.
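The dependence on marginals is easiest to see in an equivalent moment form of QWK: twice the covariance of the two ratings, divided by the sum of their variances plus the squared difference of their means. This is an algebraic rearrangement of the definition, not a formula stated in the cited works, and the helper name `qwk_from_moments` is illustrative. A minimal sketch using population moments:

```python
def qwk_from_moments(a, b):
    """QWK written with means, variances, and covariance (population moments)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    var_a = sum((x - mean_a) ** 2 for x in a) / n
    var_b = sum((y - mean_b) ** 2 for y in b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b)) / n
    # A gap between the two means inflates the denominator and lowers QWK.
    return 2 * cov / (var_a + var_b + (mean_a - mean_b) ** 2)


base = [0, 1, 2, 3]
print(qwk_from_moments(base, base))          # identical ratings
print(qwk_from_moments(base, [1, 2, 3, 4]))  # same ordering, shifted marginal
print(qwk_from_moments(base, [3, 2, 1, 0]))  # reversed ordering
```

Identical ratings give 1, a rater who is consistently one category higher gives about 0.71 despite perfect rank agreement, and a fully reversed ordering gives -1.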
3. Step-by-Step Example
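Suppose $K = 3$ (labels 0,1,2), with $N$ items tabulated in a $3 \times 3$ observed matrix $O$.
- Compute marginals: $r_i = \sum_j O_{ij}$, $c_j = \sum_i O_{ij}$.
- Weights: $W_{ij} = 0$ if $i = j$, $W_{ij} = 1/4$ if $|i - j| = 1$, $W_{ij} = 1$ if $|i - j| = 2$ (since $(K - 1)^2 = 4$).
- Compute weighted observed sum: $\sum_{i,j} W_{ij} O_{ij}$.
- Compute weighted expected sum as $\sum_{i,j} W_{ij} \, r_i c_j / N$.
- Then QWK: $\kappa = 1 - \sum_{i,j} W_{ij} O_{ij} \,/\, \sum_{i,j} W_{ij} \, r_i c_j / N$ (Uto, 21 Apr 2026).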
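The steps above can be traced numerically. The sketch below uses a hypothetical $3 \times 3$ matrix with 10 items, chosen only for illustration; these are not the numbers of the cited example.

```python
# Hypothetical observed matrix (rows: true label, columns: predicted label).
O = [
    [3, 1, 0],
    [1, 2, 1],
    [0, 1, 1],
]
K = len(O)
N = sum(sum(row) for row in O)                          # 10 items

r = [sum(row) for row in O]                             # row marginals
c = [sum(O[i][j] for i in range(K)) for j in range(K)]  # column marginals

# W_ij = (i - j)^2 / (K - 1)^2, i.e. 0, 1/4, or 1 when K = 3.
W = [[(i - j) ** 2 / (K - 1) ** 2 for j in range(K)] for i in range(K)]

weighted_observed = sum(W[i][j] * O[i][j] for i in range(K) for j in range(K))
weighted_expected = sum(W[i][j] * r[i] * c[j] / N for i in range(K) for j in range(K))
kappa = 1 - weighted_observed / weighted_expected

print(r, c)                         # [4, 4, 2] [4, 4, 2]
print(weighted_observed)            # 1.0
print(round(weighted_expected, 4))  # 2.8
print(round(kappa, 4))              # 0.6429
```

Here the four off-by-one items contribute $4 \times 1/4 = 1$ to the observed sum, while independence would predict a weighted disagreement of 2.8, giving $\kappa = 1 - 1/2.8 \approx 0.643$.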
4. Practical Use in Human–Machine Evaluation
QWK is the principal metric in automated essay scoring (AES) and similar tasks, being the official objective for high-profile challenges such as those on Kaggle (Jiao et al., 31 Oct 2025). In AES, for each essay, a model predicts an integer grade; massive state-of-the-art models (e.g., BERT, RoBERTa, DeBERTa) are compared using QWK against one or more human ratings. In rationalized AES evaluation, QWK is computed for ensemble models or models leveraging LLM-generated rationales, both in standalone and stacking regimes. Reported QWKs for top models on Prompt 6 of ASAP data range from approximately 0.80 to 0.87, with ensemble strategies incrementally improving agreement (Jiao et al., 31 Oct 2025).
In psychometrics, QWK is also employed to validate label proxies. For example, laughter-normalized “humour quotient” scores were validated against three human annotators, with mean QWK of 0.6, indicating that the automatic signal is as consistent as an additional human rater (Mittal et al., 2021).
5. QWK in Active and Human-in-the-Loop Evaluation
Human-labeling cost motivates hybrid scoring pipelines that intelligently allocate human rater effort. Sampling approaches prioritize items whose manual correction will most increase QWK.
- Reward Sampling: For each machine-predicted label, its empirical agreement distribution with human labels is used to compute the expected reward of relabeling a record. Records are sampled in proportion to this expected reward, resulting in substantial QWK gains even with modest human annotation budgets (e.g., improvements of 25.6% with 30% re-graded samples) (Singla et al., 2021).
- Uncertainty/Importance Sampling: Selection based on cross-entropy or uncertainty in prediction.
- Estimation With Guarantees: For unbiased QWK estimation, importance sampling at the test-taker level, leveraging per-taker uncertainty weights, yields empirical confidence intervals whose 95% width narrows as more test-takers are sampled for human scoring.
These methods have been empirically validated across BERT and LSTM model baselines, consistently improving QWK under realistic annotation constraints (Singla et al., 2021).
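The reward-sampling idea can be illustrated with a rough sketch. It rests on an assumption that is not taken from Singla et al.: the expected reward of relabeling a record is approximated by the mean squared gap between human and machine labels, estimated per machine-predicted label on a small calibration set. The function names, the `floor` parameter, and all data are hypothetical.

```python
import random
from collections import defaultdict


def expected_reward_by_label(calib_pred, calib_human):
    """Mean squared human-machine gap for each machine-predicted label."""
    gaps = defaultdict(list)
    for p, h in zip(calib_pred, calib_human):
        gaps[p].append((h - p) ** 2)
    return {p: sum(g) / len(g) for p, g in gaps.items()}


def sampling_probabilities(pool_pred, reward, floor=1e-6):
    """Probability of routing each pooled record to a human rater."""
    # The small floor keeps every record selectable (an assumption of this sketch).
    raw = [reward.get(p, 0.0) + floor for p in pool_pred]
    total = sum(raw)
    return [x / total for x in raw]


# Hypothetical calibration data: machine labels with known human labels.
calib_pred = [0, 0, 1, 1, 1, 2, 2, 2]
calib_human = [0, 0, 1, 2, 0, 2, 2, 0]
reward = expected_reward_by_label(calib_pred, calib_human)

# Unlabeled pool, described only by its machine-predicted labels.
pool_pred = [0, 1, 2, 2]
probs = sampling_probabilities(pool_pred, reward)

rng = random.Random(0)
picked = rng.choices(range(len(pool_pred)), weights=probs, k=2)  # draws with replacement
print(reward, probs, picked)
```

Records whose predicted label has historically disagreed most with human raters receive the largest sampling probability. A production pipeline would more likely draw without replacement up to the annotation budget.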
6. Statistical Estimation, Significance, and Ceiling Effects
A critical methodological question concerns the achievable QWK ceiling in the presence of noisy human labels. Classical Test Theory (CTT) provides a principled approach:
- Theoretical Ceiling: This is the QWK that an ideal model (predicting latent true scores) can reach against noisy observed labels. It is a function of the reliability of the mean score of all raters.
- Human-like Ceiling: This is the QWK attainable by a model whose error variance matches that of a single human. It is a function of the reliability of a single rater.
- Computation: From two-rater data, compute mean squares between and within items, estimate the mean-score and single-rater reliabilities, then apply the ceiling formulas (Uto, 21 Apr 2026).
- Interpretation: Human–human QWK is strictly less than both ceiling values. Using the empirical human–human QWK as the “best possible” can underestimate what a model could achieve if it matched the reliability of human raters in the data.
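The reliability-estimation step can be sketched as follows, assuming the standard one-way ANOVA intraclass correlation estimators; the exact estimator used by Uto may differ. The sketch stops at the two reliabilities and does not implement the ceiling formulas. The function name and the ratings are hypothetical.

```python
def rater_reliabilities(ratings):
    """Return (single-rater reliability, mean-score reliability).

    ratings: one tuple per item holding that item's score from each rater.
    """
    n = len(ratings)        # number of items
    k = len(ratings[0])     # number of raters per item
    item_means = [sum(item) / k for item in ratings]
    grand_mean = sum(item_means) / n

    # One-way ANOVA mean squares between items and within items.
    ms_between = k * sum((m - grand_mean) ** 2 for m in item_means) / (n - 1)
    ms_within = sum(
        (x - m) ** 2 for item, m in zip(ratings, item_means) for x in item
    ) / (n * (k - 1))

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    mean_score = (ms_between - ms_within) / ms_between
    return single, mean_score


# Hypothetical two-rater scores for five essays.
scores = [(2, 3), (4, 4), (1, 2), (5, 4), (3, 3)]
single, mean_score = rater_reliabilities(scores)
print(round(single, 3), round(mean_score, 3))
```

The two estimates are linked by the Spearman-Brown relation, so the mean-score reliability is always at least as large as the single-rater reliability when both are positive.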
7. Application-Specific Findings and Limitations
- In AES, QWK is robust to class imbalance, and model improvements are reliably reflected in QWK increases, as confirmed by ensemble and stacking experiments (Jiao et al., 31 Oct 2025).
- In the context of multimodal evaluation, such as computational humor, QWK provides a nuanced measure of model–rater agreement, reflecting not just binary hits but proximity in ordinal space (Mittal et al., 2021).
- QWK does not in itself quantify statistical significance in differences; bootstrapping or related resampling schemes are necessary for confidence intervals and hypothesis testing (Jiao et al., 31 Oct 2025).
- Optimal QWK is constrained by rater reliability; as a result, further modeling gains may be unattainable unless label noise is reduced (Uto, 21 Apr 2026).
A plausible implication is that as models approach the QWK ceiling imposed by human reliability, further improvements should focus on label quality and procedural re-annotation rather than solely on improving model architecture. The explicit ceiling calculations provide actionable targets for both model selection and dataset curation in ordinal labeling tasks.
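Because QWK alone carries no measure of uncertainty, a resampling scheme is needed for interval estimates. The following is a minimal sketch of a percentile bootstrap over items, under the assumption that items are independent; the function names, defaults, and data are illustrative and not drawn from the cited works.

```python
import random


def qwk(a, b, num_classes):
    """QWK from paired integer labels; None if the expected term is zero."""
    n = len(a)
    # The (K - 1)^2 normalizer cancels in the ratio, so it is omitted.
    observed = sum((x - y) ** 2 for x, y in zip(a, b)) / n
    count_a = [a.count(i) for i in range(num_classes)]
    count_b = [b.count(j) for j in range(num_classes)]
    expected = sum(
        count_a[i] * count_b[j] * (i - j) ** 2
        for i in range(num_classes)
        for j in range(num_classes)
    ) / n ** 2
    return None if expected == 0 else 1 - observed / expected


def bootstrap_qwk_ci(a, b, num_classes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for QWK."""
    rng = random.Random(seed)
    n = len(a)
    stats = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]  # resample items with replacement
        value = qwk([a[i] for i in idx], [b[i] for i in idx], num_classes)
        if value is not None:                       # skip degenerate resamples
            stats.append(value)
    if not stats:
        raise ValueError("every bootstrap resample was degenerate")
    stats.sort()
    lo = stats[int((alpha / 2) * (len(stats) - 1))]
    hi = stats[int((1 - alpha / 2) * (len(stats) - 1))]
    return lo, hi


# Hypothetical ratings on a 3-point scale.
human = [0, 1, 2, 2, 1, 0, 2, 1, 0, 1, 2, 2, 1, 0, 2, 1]
model = [0, 1, 2, 1, 1, 0, 2, 2, 0, 1, 2, 2, 0, 0, 2, 1]
print(qwk(human, model, 3), bootstrap_qwk_ci(human, model, 3))
```

To compare two models, the same resampled indices can be applied to both sets of predictions and the interval taken over the difference in QWK.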