
Quadratic Weighted Kappa Overview

Updated 13 January 2026
  • Quadratic Weighted Kappa is an evaluation metric for ordinal ratings that measures the agreement between predictions and true labels, adjusting for random chance.
  • It applies a quadratic weighting scheme to penalize larger discrepancies more severely, reflecting the ordinal structure of the data.
  • A differentiable surrogate formulation enables direct optimization within machine learning models, supporting tasks such as essay scoring and medical image assessment.

Quadratic Weighted Kappa (QWK) is an evaluation metric designed for ordinal classification systems where predicted and ground truth labels are discrete and ordered, such as integer ratings assigned by human experts. QWK quantifies the level of agreement between two raters, adjusting for chance and penalizing larger discrepancies more severely via a quadratic weighting scheme. This metric has achieved prominence as the principal metric in several high-profile machine learning competitions involving ordinal prediction tasks—such as the Kaggle ASAP essay scoring and diabetic retinopathy detection challenges—due to its sensitivity to both the ordinal structure and the distributional characteristics of the rating data (Vaughn et al., 2015).

1. Formal Definition

Let $k$ be the number of rating categories, labeled $1, \ldots, k$, and $N$ be the number of samples, each with a true rating $r_n \in \{1,\ldots,k\}$ and a predicted rating $\hat{r}_n \in \{1,\ldots,k\}$. The conventional definition of QWK starts with the $k \times k$ observed confusion matrix $O$ and the expected count matrix $E$ (under chance agreement):

  • $O_{ij} = |\{n : r_n = i \wedge \hat{r}_n = j\}|$
  • Marginal counts: $p_i = \sum_{j=1}^k O_{ij}$ (row sums), $q_j = \sum_{i=1}^k O_{ij}$ (column sums)
  • $E_{ij} = (p_i \cdot q_j)/N$

Quadratic weights emphasize the ordinal structure:

$$w_{ij} = \frac{(i-j)^2}{(k-1)^2}$$

QWK is then defined as:

$$\kappa_Q = 1 - \frac{\sum_{i=1}^k \sum_{j=1}^k w_{ij} O_{ij}}{\sum_{i=1}^k \sum_{j=1}^k w_{ij} E_{ij}}$$

This formulation generalizes Cohen’s kappa by replacing its 0/1 disagreement indicator with a penalty that increases with the magnitude of the rating disagreement.
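
The NumPy sketch below computes QWK directly from this confusion-matrix definition; the function name and the small example are illustrative assumptions, not taken from any particular library.

```python
# A minimal NumPy sketch of Eq. (1): QWK from observed and expected confusion matrices.
import numpy as np

def quadratic_weighted_kappa(y_true, y_pred, k):
    """QWK for integer ratings in {1, ..., k}."""
    y_true = np.asarray(y_true) - 1           # shift to 0-based category indices
    y_pred = np.asarray(y_pred) - 1
    N = len(y_true)

    # Observed confusion matrix O_ij
    O = np.zeros((k, k))
    np.add.at(O, (y_true, y_pred), 1)

    # Expected counts E_ij = p_i * q_j / N from the row/column marginals
    p = O.sum(axis=1)                          # true-label marginal counts
    q = O.sum(axis=0)                          # predicted-label marginal counts
    E = np.outer(p, q) / N

    # Quadratic weights w_ij = (i - j)^2 / (k - 1)^2
    idx = np.arange(k)
    W = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2

    return 1.0 - (W * O).sum() / (W * E).sum()

# Example: five samples rated on a 1..4 scale
print(quadratic_weighted_kappa([1, 2, 3, 4, 2], [1, 2, 4, 4, 3], k=4))
```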

2. Simplified Equivalent Formulation

A more analytically tractable form of QWK can be derived by aggregating per-sample loss terms:

Let $e_{n,i} = \mathbf{1}[r_n = i]$ and $p_{n,j} = \mathbf{1}[\hat{r}_n = j]$; then:

  • $O_{ij} = \sum_{n=1}^N e_{n,i}\, p_{n,j}$
  • $E_{ij} = \frac{(\sum_n e_{n,i})(\sum_n p_{n,j})}{N}$

Plugging these into QWK produces:

$$\kappa_Q = 1 - \frac{\sum_{n=1}^{N} w_{r_n,\hat{r}_n}}{\frac{1}{N}\sum_{i=1}^k \sum_{j=1}^k w_{ij}\, p_i\, q_j}$$

The numerator becomes a sum of per-sample quadratic losses, and the denominator depends only on the marginal rating counts. This closed form simplifies optimization and clarifies the dependency structure: individual predicted labels enter only through the numerator, while the denominator stays fixed as long as the marginal distributions are held constant (Vaughn et al., 2015).
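
As a quick consistency check, the sketch below (illustrative code on random ratings, not from the cited paper) verifies numerically that this per-sample form matches the confusion-matrix form of Eq. (1).

```python
# Check that Eq. (2), the per-sample form, equals Eq. (1), the confusion-matrix form.
import numpy as np

rng = np.random.default_rng(0)
k, N = 5, 200
y_true = rng.integers(1, k + 1, size=N)
y_pred = rng.integers(1, k + 1, size=N)

idx = np.arange(k)
W = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2

# Eq. (1): confusion-matrix form
O = np.zeros((k, k))
np.add.at(O, (y_true - 1, y_pred - 1), 1)
p, q = O.sum(axis=1), O.sum(axis=0)
E = np.outer(p, q) / N
kappa_conf = 1.0 - (W * O).sum() / (W * E).sum()

# Eq. (2): per-sample numerator over a marginal-only denominator
numerator = W[y_true - 1, y_pred - 1].sum()
denominator = (W * np.outer(p, q)).sum() / N
kappa_per_sample = 1.0 - numerator / denominator

assert np.isclose(kappa_conf, kappa_per_sample)
```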

3. Mathematical Properties

QWK possesses several mathematically desirable characteristics:

  • Range and Symmetry:
    • $\kappa_Q \in [-1, 1]$
    • $\kappa_Q = 1$ when all predictions match the ground truth perfectly.
    • $\kappa_Q = 0$ when agreement equals the chance expectation.
    • $\kappa_Q < 0$ indicates systematic disagreement worse than random.
    • Swapping true and predicted ratings leaves $\kappa_Q$ invariant.
  • Error Penalization:
    • Larger rating discrepancies incur quadratic penalties: an error of $|i-j| = 2$ yields loss $(2/(k-1))^2$, four times the penalty of a one-category error, making QWK more sensitive to large errors than linear or absolute metrics (see the sketch after this list).
    • Small mis-rankings are less severely penalized.
  • Interpretation of Weights:
    • Weights correspond to squared error on an ordinal scale, ensuring that misclassifications at category extremes contribute maximally to the penalty.
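
As a concrete illustration of this weighting, the short sketch below (an assumed example with $k = 4$, not taken from the source) prints the quadratic weight matrix.

```python
# Quadratic weight matrix for k = 4; entries grow with squared category distance.
import numpy as np

k = 4
idx = np.arange(k)
W = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2

print(np.round(W, 3))
# Off-by-one errors cost (1/3)^2 ≈ 0.111, off-by-two errors cost (2/3)^2 ≈ 0.444,
# and confusing the two extreme categories incurs the maximum weight of 1.0.
```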

4. Comparison to Alternative Metrics

Mean Squared Error (MSE):

  • $\mathrm{MSE} = \frac{1}{N} \sum_n (\hat{r}_n - r_n)^2$
  • Shares quadratic penalization of label errors with QWK, but lacks correction for chance agreement and is unbounded above.
  • Does not account for the marginal label distribution, so class imbalance can inflate apparent performance.

Pearson Correlation (ρ\rho):

  • $\rho = \mathrm{Cov}(r, \hat{r}) / [\sigma(r)\,\sigma(\hat{r})]$
  • Evaluates linear association between predicted and true ratings.
  • Can yield high scores despite systematic offsets and does not incorporate a chance-agreement baseline.

Advantages of QWK:

  • Incorporates both the ordinal structure (via $w_{ij}$) and a chance-agreement baseline.
  • Bounded score in $[-1, 1]$ enables meaningful interpretation across tasks.
  • Robust to rating imbalance, making it particularly suitable for real-world data distributions (see the comparison sketch below).
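
To make the contrast concrete, the sketch below scores a deliberately offset prediction with MSE, Pearson correlation, and QWK. It uses scikit-learn's cohen_kappa_score with quadratic weights; the offset example itself is an assumed illustration, not data from the cited work.

```python
import numpy as np
from scipy.stats import pearsonr
from sklearn.metrics import cohen_kappa_score, mean_squared_error

# Ratings on a 1..5 scale; predictions are perfectly correlated with the
# truth but systematically shifted up by one category.
y_true = np.array([1, 1, 2, 2, 3, 3, 4, 4])
y_pred = y_true + 1

mse = mean_squared_error(y_true, y_pred)           # every error contributes 1
rho, _ = pearsonr(y_true, y_pred)                  # exactly 1.0 despite the offset
qwk = cohen_kappa_score(y_true, y_pred, weights="quadratic")

print(f"MSE = {mse:.3f}, Pearson = {rho:.3f}, QWK = {qwk:.3f}")
# QWK penalizes the systematic offset, while Pearson correlation does not.
```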

5. Direct Analytic Optimization

To maximize QWK directly within learning algorithms, differentiable surrogates can be constructed via soft label assignments. For each sample, model logits $z_n \in \mathbb{R}^k$ are mapped to soft rating probabilities:

$$s_{n,j} = \mathrm{softmax}_j(z_n)$$

Define the expected per-sample quadratic loss:

$$L_n = \sum_{i=1}^k \sum_{j=1}^k e_{n,i}\, s_{n,j}\, w_{ij} = \sum_{j=1}^k w_{r_n, j}\, s_{n,j}$$

The surrogate objective is then:

$$Q_{\mathrm{soft}} = 1 - \frac{\sum_{n=1}^N L_n}{D}$$

where $D = \frac{1}{N}\sum_{i,j} w_{ij}\, p_i\, q_j$ is treated as constant during optimization.

The gradient with respect to the logits $z_{n,\ell}$ is:

$$\frac{\partial L_n}{\partial z_{n,\ell}} = w_{r_n,\ell}\, s_{n,\ell} - s_{n,\ell} \sum_{j} w_{r_n,j}\, s_{n,j}$$

This enables integration with SGD-style optimizers. After optimization, predicted labels are discretized as $\hat{r}_n = \arg\max_j s_{n,j}$ (Vaughn et al., 2015).
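
A minimal NumPy sketch of this surrogate and its analytic gradient follows; all function and variable names are illustrative assumptions, not code from Vaughn et al. (2015), and a finite-difference check confirms the gradient expression above.

```python
# Soft per-sample QWK loss L_n and its analytic gradient with respect to the logits.
import numpy as np

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)       # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def soft_qwk_loss_and_grad(z, y_true, W):
    """z: (N, k) logits, y_true: (N,) 0-based ratings, W: (k, k) quadratic weights."""
    s = softmax(z)                               # s_{n,j}
    Wy = W[y_true]                               # w_{r_n, j}, shape (N, k)
    L = (Wy * s).sum(axis=1)                     # per-sample soft loss L_n
    # dL_n/dz_{n,l} = w_{r_n,l} s_{n,l} - s_{n,l} * sum_j w_{r_n,j} s_{n,j}
    grad = s * (Wy - L[:, None])
    return L, grad

k = 4
idx = np.arange(k)
W = (idx[:, None] - idx[None, :]) ** 2 / (k - 1) ** 2

rng = np.random.default_rng(0)
z = rng.normal(size=(3, k))
y = np.array([0, 2, 3])

L, grad = soft_qwk_loss_and_grad(z, y, W)
# In training, one would minimize L.sum() (equivalently maximize Q_soft),
# with the denominator D held fixed.

# Finite-difference check of the analytic gradient for one logit entry
eps = 1e-6
z_eps = z.copy()
z_eps[0, 1] += eps
L_eps, _ = soft_qwk_loss_and_grad(z_eps, y, W)
assert np.isclose((L_eps[0] - L[0]) / eps, grad[0, 1], atol=1e-5)
```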

6. Contextual Significance and Practical Roles

QWK has become the standard metric in domains requiring ordinal prediction aligned with human expert judgment, notably in essay grading and medical image assessment contests. Its adoption stems from its capacity to mitigate the prevailing deficiencies of MSE and correlation in the context of class imbalance and ordinal error structure.

A plausible implication is that QWK’s quadratic penalization and chance correction mitigate overestimation of classifier performance in skewed or multi-rater labeling regimes. QWK's direct optimizability via differentiable surrogates has fostered greater methodological rigor in model development for ordinal classification.

7. Summary of Foundational Equations

| Formulation | Description | Dependent terms |
| --- | --- | --- |
| $\kappa_Q$ (confusion-matrix form) | Conventional quadratic weighted kappa (Eq. 1) | $O_{ij}$, $E_{ij}$, $w_{ij}$ |
| $\kappa_Q$ (per-sample form) | Alternate per-sample loss sum (Eq. 2) | $w_{r_n,\hat{r}_n}$, $p_i$, $q_j$ |
| $L_n$ | Per-sample soft loss for optimization | $w_{r_n,j}$, $s_{n,j}$ |

Equations (1) and (2) are algebraically identical, but Eq (2) exposes the dependency on predicted outputs for optimization and is preferred in direct analytic treatments of model training for ordinal regression (Vaughn et al., 2015). QWK thus serves as both a robust evaluation tool and a trainable objective, facilitating principled model design and assessment in ordinal prediction tasks.
