
Quadratic Weighted Kappa Overview

Updated 2 May 2026
  • Quadratic Weighted Kappa is a metric that measures agreement for ordinal data by penalizing larger discrepancies with quadratic weights.
  • It computes agreement by comparing observed and expected rating matrices while adjusting for chance and emphasizing higher penalties for greater differences.
  • QWK is vital in automated essay scoring and similar applications, offering interpretable, scale-sensitive insights into human-machine evaluation.

Quadratic Weighted Kappa (QWK) is a metric for quantifying agreement between two sets of ratings over ordinal categorical labels. It is widely used in human–machine evaluation pipelines where the target variable is an integer-valued grade or assessment, such as automated essay scoring, image annotation, or subjective ratings in natural language understanding. QWK extends Cohen’s kappa by applying a quadratic penalty to larger disagreements between raters, thus respecting the ordinal structure of the labels and providing interpretable, scale-sensitive agreement scores.

1. Mathematical Definition and Computation

Given two raters (or a rater and an automated system) assigning labels in $K$ ordered categories on $N$ samples, QWK is constructed as follows:

  • Observed Matrix ($O$): $O_{ij}$ is the count of items with true label $i$ and predicted label $j$.
  • Marginals: $R_i = \sum_j O_{ij}$ (row sum for true label $i$), $C_j = \sum_i O_{ij}$ (column sum for predicted label $j$).
  • Expected Matrix ($E$): $E_{ij} = R_i C_j / N$, representing the count expected under statistical independence.
  • Quadratic Weights ($W$): $W_{ij} = (i - j)^2 / (K - 1)^2$, imposing stronger penalties for greater rating disagreements.

QWK is then defined as:

$$\kappa = 1 - \frac{\sum_{i,j} W_{ij} O_{ij}}{\sum_{i,j} W_{ij} E_{ij}}$$

This formulation represents one minus the ratio of observed weighted disagreement to the expected weighted disagreement under label independence. When agreement is perfect, the numerator is zero and $\kappa = 1$. When observed agreement is no better than chance, $\kappa \approx 0$ (Jiao et al., 31 Oct 2025, Mittal et al., 2021, Uto, 21 Apr 2026).
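The definition above can be sketched in a few lines of NumPy. This is a minimal illustration, assuming labels are integers in 0..K-1; the function name `quadratic_weighted_kappa` and its arguments are chosen here for illustration and do not come from the cited papers.

```python
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """QWK for integer labels in {0, ..., num_classes - 1} (illustrative sketch)."""
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    n = y_true.size

    # Observed matrix O: O[i, j] counts items with true label i and predicted label j.
    observed = np.zeros((num_classes, num_classes), dtype=float)
    np.add.at(observed, (y_true, y_pred), 1.0)

    # Expected matrix E under independence: E[i, j] = R_i * C_j / N.
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / n

    # Quadratic weights W[i, j] = (i - j)^2 / (K - 1)^2.
    idx = np.arange(num_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_classes - 1) ** 2

    denom = (weights * expected).sum()
    if denom == 0:
        # Happens when both raters use one and the same single category.
        raise ValueError("QWK is undefined: expected weighted disagreement is zero")
    return 1.0 - (weights * observed).sum() / denom
```

For example, `quadratic_weighted_kappa([0, 1, 2, 2], [0, 1, 2, 2], 3)` gives 1.0 (perfect agreement), while fully reversed ratings such as `[0, 1, 2]` against `[2, 1, 0]` give -1.0.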

2. Interpretation and Properties

QWK is tailored to ordinal data and explicitly penalizes disagreements according to their magnitude: because the weight grows with $(i - j)^2$, an error of one category is penalized much less than an error of several categories. QWK never exceeds 1, and for labels restricted to a finite range, as in grading tasks, it lies in $[-1, 1]$ (Singla et al., 2021).

A QWK near 1 indicates almost perfect agreement adjusted for chance, while a QWK near 0 implies agreement no better than chance. Negative values arise if observed disagreement exceeds that expected by chance, indicating systematic disagreement.

QWK is sensitive to marginal distributions. If rater marginals differ significantly, the expected matrix $E$ shifts accordingly, modulating the baseline for chance.
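One way to see both the bounded range and the sensitivity to marginals is to rewrite QWK in terms of moments. The symbols below are introduced here for illustration: $X$ and $Y$ denote the two raters' scores on a randomly drawn item, with means $\mu_X, \mu_Y$, population variances $\sigma_X^2, \sigma_Y^2$, and covariance $\sigma_{XY}$, all taken over the $N$ items. The $(K-1)^2$ normalization cancels in the ratio.

```latex
\begin{align*}
\frac{1}{N}\sum_{i,j} (i-j)^2\, O_{ij}
  &= \sigma_X^2 + \sigma_Y^2 - 2\sigma_{XY} + (\mu_X - \mu_Y)^2, \\
\frac{1}{N}\sum_{i,j} (i-j)^2\, \frac{R_i C_j}{N}
  &= \sigma_X^2 + \sigma_Y^2 + (\mu_X - \mu_Y)^2, \\
\kappa
  &= \frac{2\,\sigma_{XY}}{\sigma_X^2 + \sigma_Y^2 + (\mu_X - \mu_Y)^2}.
\end{align*}
```

Since $|2\sigma_{XY}| \le \sigma_X^2 + \sigma_Y^2$, this form gives $-1 \le \kappa \le 1$, and it shows that a systematic shift between the raters' means, or a mismatch in their spreads, lowers QWK even when the two sets of scores are strongly correlated.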

3. Step-by-Step Example

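Suppose $K = 3$ (labels 0, 1, 2) and $N$ items have been rated, giving a $3 \times 3$ observed matrix $O$.

  • Compute marginals: $R_i = \sum_j O_{ij}$, $C_j = \sum_i O_{ij}$.
  • Weights: $W_{ij} = 0$ if $i = j$, $W_{ij} = 1/4$ if $|i - j| = 1$, $W_{ij} = 1$ if $|i - j| = 2$ (since $(K - 1)^2 = 4$).
  • Compute weighted observed sum: $\sum_{i,j} W_{ij} O_{ij}$.
  • Compute weighted expected sum as $\sum_{i,j} W_{ij} R_i C_j / N$.
  • Then QWK: $\kappa = 1 - \sum_{i,j} W_{ij} O_{ij} \big/ \sum_{i,j} W_{ij} E_{ij}$ (Uto, 21 Apr 2026).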
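The same steps can be traced numerically. The observed matrix below is a hypothetical one invented for this illustration (it is not taken from the cited source); the variable names follow the notation of the definition.

```python
import numpy as np

# Hypothetical observed matrix: rows = true label, columns = predicted label.
O = np.array([[3, 1, 0],
              [1, 2, 1],
              [0, 1, 1]], dtype=float)
K = O.shape[0]
N = O.sum()                         # 10 items in total

# Step 1: marginals.
R = O.sum(axis=1)                   # row sums    -> [4, 4, 2]
C = O.sum(axis=0)                   # column sums -> [4, 4, 2]

# Step 2: quadratic weights (0 on the diagonal, 1/4 one step off, 1 two steps off).
idx = np.arange(K)
W = (idx[:, None] - idx[None, :]) ** 2 / (K - 1) ** 2

# Step 3: weighted observed disagreement.
weighted_observed = (W * O).sum()   # 4 off-by-one items * 0.25 = 1.0

# Step 4: weighted expected disagreement under independence.
E = np.outer(R, C) / N
weighted_expected = (W * E).sum()   # 0.25 * 4.8 + 1.0 * 1.6 = 2.8

# Step 5: QWK.
kappa = 1.0 - weighted_observed / weighted_expected
print(round(kappa, 4))              # 0.6429
```

Here all four disagreements are off by a single category, so the observed weighted disagreement is small relative to the chance baseline and QWK comes out at about 0.64.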

4. Practical Use in Human–Machine Evaluation

QWK is the principal metric in automated essay scoring (AES) and similar tasks, and it is the official evaluation metric for high-profile challenges such as those on Kaggle (Jiao et al., 31 Oct 2025). In AES, a model predicts an integer grade for each essay; large pretrained models (e.g., BERT, RoBERTa, DeBERTa) are compared using QWK against one or more human ratings. In rationalized AES evaluation, QWK is computed for ensemble models or models leveraging LLM-generated rationales, both in standalone and stacking regimes. Reported QWKs for top models on Prompt 6 of the ASAP data range from approximately 0.80 to 0.87, with ensemble strategies incrementally improving agreement (Jiao et al., 31 Oct 2025).

In psychometrics, QWK is also employed to validate label proxies. For example, laughter-normalized “humour quotient” scores were validated against three human annotators, with mean QWK of 0.6, indicating that the automatic signal is as consistent as an additional human rater (Mittal et al., 2021).

5. QWK in Active and Human-in-the-Loop Evaluation

Human-labeling cost motivates hybrid scoring pipelines that intelligently allocate human rater effort. Sampling approaches prioritize items whose manual correction will most increase QWK.

  • Reward Sampling: For each machine-predicted label, its empirical agreement distribution with human labels is used to compute the expected reward of relabeling a record. Records are sampled in proportion to this expected reward, resulting in substantial QWK gains even with modest human annotation budgets (e.g., improvements of 25.6% with 30% re-graded samples) (Singla et al., 2021).
  • Uncertainty/Importance Sampling: Selection based on cross-entropy or uncertainty in prediction.
  • Estimation With Guarantees: For unbiased QWK estimation, importance sampling at the test-taker level, leveraging per-taker uncertainty weights, yields empirical 95% confidence intervals.

These methods have been empirically validated across BERT and LSTM model baselines, consistently improving QWK under realistic annotation constraints (Singla et al., 2021).
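To make the reward-sampling idea concrete, here is a rough sketch under strong assumptions. It is not the procedure of Singla et al.: the reward is assumed here to be the mean squared gap between human and machine labels for each predicted label, estimated on a small pilot set that already has human labels, and the names `expected_relabel_reward` and `sample_for_relabeling` are hypothetical.

```python
import numpy as np

def expected_relabel_reward(machine, human, num_classes):
    """Assumed reward per predicted label: mean squared human-machine gap on a pilot set."""
    machine = np.asarray(machine, dtype=int)
    human = np.asarray(human, dtype=int)
    reward = np.zeros(num_classes)
    for m in range(num_classes):
        mask = machine == m
        if mask.any():
            reward[m] = np.mean((human[mask] - m) ** 2)
    return reward

def sample_for_relabeling(pred, reward, budget, rng):
    """Draw `budget` distinct record indices with probability proportional to reward."""
    pred = np.asarray(pred, dtype=int)
    # Small constant keeps every record eligible, so sampling without
    # replacement cannot run out of non-zero-probability candidates.
    scores = reward[pred] + 1e-9
    probs = scores / scores.sum()
    return rng.choice(pred.size, size=budget, replace=False, p=probs)

# Usage: estimate rewards on a pilot set, then pick records to send to human raters.
pilot_reward = expected_relabel_reward([0, 0, 1, 1, 2, 2], [0, 0, 1, 2, 0, 2], 3)
chosen = sample_for_relabeling([0, 1, 2, 2, 0], pilot_reward, budget=2,
                               rng=np.random.default_rng(0))
```

In this toy pilot set, records predicted as label 2 carry the largest expected reward, so they are the most likely to be routed to a human.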

6. Statistical Estimation, Significance, and Ceiling Effects

A critical methodological question concerns the achievable QWK ceiling in the presence of noisy human labels. Classical Test Theory (CTT) provides a principled approach:

  • Theoretical Ceiling: This is the QWK that an ideal model (predicting latent true scores) can reach against noisy observed labels. It is expressed in terms of the reliability of the mean score of all raters.

  • Human-like Ceiling: This is the QWK attainable by a model whose error variance matches that of a single human. It is expressed in terms of the reliability of a single rater.

  • Computation: From two-rater data, compute mean squares between and within items, estimate the mean-score and single-rater reliabilities, then apply the ceiling formulas of Uto (21 Apr 2026).
  • Interpretation: Human–human QWK is strictly less than both ceiling values. Using the empirical human–human QWK as the “best possible” can underestimate what a model could achieve if it matched the reliability of human raters in the data.
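The reliability-estimation step can be sketched as follows. This assumes the standard one-way random-effects intraclass correlations, ICC(1,1) for a single rater and ICC(1,k) for the mean of k raters; whether the source uses exactly these estimators is an assumption, and the sketch stops at the reliabilities rather than reproducing the ceiling formulas. The function name `one_way_reliability` is hypothetical.

```python
import numpy as np

def one_way_reliability(ratings):
    """Estimate (single-rater, mean-score) reliability from an (n_items, k_raters) array.

    Uses one-way ANOVA mean squares: ICC(1,1) and ICC(1,k). Illustrative sketch only.
    """
    ratings = np.asarray(ratings, dtype=float)
    n, k = ratings.shape
    item_means = ratings.mean(axis=1)
    grand_mean = ratings.mean()

    # Mean square between items and mean square within items.
    ms_between = k * np.sum((item_means - grand_mean) ** 2) / (n - 1)
    ms_within = np.sum((ratings - item_means[:, None]) ** 2) / (n * (k - 1))

    single = (ms_between - ms_within) / (ms_between + (k - 1) * ms_within)
    mean_score = (ms_between - ms_within) / ms_between
    return single, mean_score

# Usage with two raters scoring four items.
single, mean_score = one_way_reliability([[1, 2], [2, 3], [3, 4], [4, 5]])
```

The two estimates are linked by the Spearman-Brown relation, so with two raters the mean-score reliability equals 2·single / (1 + single).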

7. Application-Specific Findings and Limitations

  • In AES, QWK is robust to class imbalance, and model improvements are reliably reflected in QWK increases, as confirmed by ensemble and stacking experiments (Jiao et al., 31 Oct 2025).
  • In the context of multimodal evaluation, such as computational humor, QWK provides a nuanced measure of model–rater agreement, reflecting not just binary hits but proximity in ordinal space (Mittal et al., 2021).
  • QWK does not in itself quantify statistical significance in differences; bootstrapping or related resampling schemes are necessary for confidence intervals and hypothesis testing (Jiao et al., 31 Oct 2025).
  • Optimal QWK is constrained by rater reliability; as a result, further modeling gains may be unattainable unless label noise is reduced (Uto, 21 Apr 2026).
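Since QWK alone carries no uncertainty estimate, a percentile bootstrap over items is one common way to attach a confidence interval. The sketch below is illustrative: the function names, the 2000-resample default, and the choice of the percentile method are assumptions made here, not details from the cited work.

```python
import numpy as np

def qwk(y_true, y_pred, num_classes):
    """QWK from integer labels; returns NaN when the chance baseline is zero."""
    observed = np.zeros((num_classes, num_classes), dtype=float)
    np.add.at(observed, (y_true, y_pred), 1.0)
    expected = np.outer(observed.sum(axis=1), observed.sum(axis=0)) / y_true.size
    idx = np.arange(num_classes)
    weights = (idx[:, None] - idx[None, :]) ** 2 / (num_classes - 1) ** 2
    denom = (weights * expected).sum()
    if denom == 0:
        return float("nan")
    return 1.0 - (weights * observed).sum() / denom

def bootstrap_qwk_ci(y_true, y_pred, num_classes, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for QWK, resampling items with replacement."""
    rng = np.random.default_rng(seed)
    y_true = np.asarray(y_true, dtype=int)
    y_pred = np.asarray(y_pred, dtype=int)
    n = y_true.size
    stats = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)   # one bootstrap resample of item indices
        stats[b] = qwk(y_true[idx], y_pred[idx], num_classes)
    # Degenerate resamples (NaN) are ignored when taking percentiles.
    lo, hi = np.nanpercentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return float(lo), float(hi)
```

To compare two models, the same resampled indices can be applied to both sets of predictions and the interval taken over the difference in QWK, which accounts for the fact that both models are scored on the same items.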

A plausible implication is that as models approach the QWK ceiling imposed by human reliability, further improvements should focus on label quality and procedural re-annotation rather than solely on improving model architecture. The explicit ceiling calculations provide actionable targets for both model selection and dataset curation in ordinal labeling tasks.
