
Cohen’s Kappa: Agreement Beyond Chance

Updated 15 June 2026
  • Cohen’s kappa is a chance-corrected measure that quantifies agreement on categorical ratings by comparing observed agreement with the agreement expected from independent marginal distributions.
  • The metric is computed from a confusion matrix and normalized so that 1 indicates perfect agreement, 0 chance-level agreement, and negative values systematic disagreement; it extends directly from binary to multi-class tasks.
  • Limitations such as the prevalence paradox have motivated alternative statistics like Gwet’s AC1 and Bayesian methods, and call for cautious interpretation alongside complementary reliability indices.

Cohen’s kappa (κ) is a canonical chance-corrected index of pairwise agreement for categorical data. It quantifies the extent to which observed agreement between two raters (or a rater and a classifier) exceeds that expected under independent labeling drawn from their marginal distributions. κ is widely used in medicine, psychology, machine learning, and annotation-based research to evaluate inter-rater reliability or classifier validity, and it has generated a large critical literature on its formal structure, interpretability, and practical limitations. The following sections cover its definition, mathematical properties, connections to information theory, computational issues, alternatives, and empirical benchmarking frameworks.

1. Mathematical Definition and Computation

Suppose two raters independently classify $n$ items into $K$ mutually exclusive categories. Let $n_{ij}$ be the number of items that Rater 1 assigns to category $i$ and Rater 2 to category $j$ ($1 \leq i,j \leq K$). The observed agreement is

$$P_o = \frac{1}{n} \sum_{i=1}^K n_{ii},$$

and the chance agreement under independent draws from the marginal category distributions is

$$P_e = \sum_{i=1}^K p_i q_i, \quad \text{where } p_i = \frac{\sum_j n_{ij}}{n}, \quad q_i = \frac{\sum_j n_{ji}}{n}.$$

Cohen’s kappa is then

$$\kappa = \frac{P_o - P_e}{1 - P_e}.$$
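The definition can be sketched in a few lines of Python. This is a minimal illustration, assuming the counts are supplied as a K×K table with Rater 1 on the rows and Rater 2 on the columns; the function name `cohen_kappa` and the example table are hypothetical.

```python
import numpy as np

def cohen_kappa(confusion):
    """Cohen's kappa from a K x K table of counts n_ij (rows: rater 1, columns: rater 2)."""
    n_ij = np.asarray(confusion, dtype=float)
    n = n_ij.sum()
    p_o = np.trace(n_ij) / n      # observed agreement P_o
    p = n_ij.sum(axis=1) / n      # rater 1 marginals p_i
    q = n_ij.sum(axis=0) / n      # rater 2 marginals q_i
    p_e = float(p @ q)            # chance agreement P_e
    if np.isclose(p_e, 1.0):
        return float("nan")       # undefined when both raters use one and the same category
    return (p_o - p_e) / (1.0 - p_e)

# Illustrative table: P_o = 0.7, P_e = 0.5, so kappa = 0.4
table = [[20, 5], [10, 15]]
print(round(cohen_kappa(table), 3))  # prints 0.4
```

Returning NaN for the degenerate case is a design choice of this sketch; the formula itself is simply undefined when $P_e = 1$.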

For binary labels, this specializes to:

$$\kappa = \frac{p_o - p_e}{1 - p_e}, \quad p_o = \frac{TP + TN}{N}, \quad p_e = \pi \hat{\pi} + (1-\pi)(1-\hat{\pi}),$$

where $\pi$ and $\hat{\pi}$ are the marginal positive rates for the two raters or systems (Rao et al., 25 May 2026, Tian et al., 2024).

Interpretation: $\kappa = 1$ denotes perfect agreement, $\kappa = 0$ agreement no better than chance, and $\kappa < 0$ systematic disagreement beyond chance.

The denominator $1 - P_e$ ensures normalization between systematic mismatch (potentially negative lower bound) and perfect concordance ($\kappa = 1$). For multi-class tasks this generalizes directly by forming the confusion or agreement matrix and computing $P_o$ and $P_e$ from its diagonal and marginals (Casagrande et al., 21 Apr 2025, Wong et al., 2021).

2. Statistical Properties, Bounds, and Relation to Other Measures

Cohen’s original work established both explicit upper and lower bounds for $\kappa$ given fixed marginals (Sahu et al., 2024, Safak, 2020). The maximum possible agreement for fixed marginals $(p_i, q_i)$ is $P_o^{\max} = \sum_{i=1}^K \min(p_i, q_i)$, yielding an upper bound

$$\kappa_{\max} = \frac{\sum_{i=1}^K \min(p_i, q_i) - P_e}{1 - P_e}.$$

However, the minimum feasible agreement $P_o^{\min}$, once an open question, admits the closed form $P_o^{\min} = \max(0,\, p_1 + q_1 - 1)$ after appropriate permutation of categories (so $p_1 + q_1$ is maximal) (Safak, 2020). The tight lower bound is

$$\kappa_{\min} = \frac{P_o^{\min} - P_e}{1 - P_e},$$

which reaches $-1$ only for balanced binary marginals and moves toward zero as the marginals become more imbalanced.
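To make the dependence of the bounds on the marginals concrete, the following sketch computes both bounds for given marginal vectors. It assumes the maximum agreement is $\sum_i \min(p_i, q_i)$ and takes the minimum agreement to be $\max(0, \max_i (p_i + q_i) - 1)$; the helper name `kappa_bounds` is hypothetical.

```python
import numpy as np

def kappa_bounds(p, q):
    """Lower and upper bounds on kappa for fixed marginal distributions p and q."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    p_e = float(p @ q)                              # chance agreement
    po_max = float(np.minimum(p, q).sum())          # largest feasible diagonal mass
    po_min = max(0.0, float((p + q).max()) - 1.0)   # smallest feasible diagonal mass (assumed closed form)
    return (po_min - p_e) / (1.0 - p_e), (po_max - p_e) / (1.0 - p_e)

balanced = kappa_bounds([0.5, 0.5], [0.5, 0.5])   # the full range from -1 to 1
skewed = kappa_bounds([0.9, 0.1], [0.8, 0.2])     # a much narrower range
print(balanced, skewed)
```

With the skewed marginals in this example the feasible range shrinks to roughly $[-0.15, 0.62]$, so an observed $\kappa$ is best read against these bounds rather than against $[-1, 1]$.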

For binary variables, there is a linear relationship to the Pearson correlation $r$:

$$\kappa = r \cdot \frac{2\sqrt{\pi(1-\pi)\,\hat{\pi}(1-\hat{\pi})}}{\pi(1-\hat{\pi}) + \hat{\pi}(1-\pi)}.$$

This constant depends only on marginals. $\kappa$ and the Matthews Correlation Coefficient/phi coefficient have identical numerators but different normalizations, leading to $|\kappa| \leq |\mathrm{MCC}|$ with equality if marginals match (Sahu et al., 2024, Rao et al., 25 May 2026).
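The relationship can be checked numerically on a single 2×2 table. The sketch below assumes the scaling constant is $2\sqrt{\pi(1-\pi)\hat{\pi}(1-\hat{\pi})} / (\pi(1-\hat{\pi}) + \hat{\pi}(1-\pi))$, which follows from writing both statistics over the same numerator; the function name and the counts are illustrative only.

```python
import math

def kappa_and_mcc(tp, fp, fn, tn):
    """Return (kappa, MCC, marginal-only scale factor) for a 2x2 table with non-degenerate marginals."""
    n = tp + fp + fn + tn
    pi = (tp + fn) / n        # positive rate of the first rater
    pi_hat = (tp + fp) / n    # positive rate of the second rater
    p_o = (tp + tn) / n
    p_e = pi * pi_hat + (1 - pi) * (1 - pi_hat)
    kappa = (p_o - p_e) / (1 - p_e)
    mcc = (tp * tn - fp * fn) / math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    scale = (2 * math.sqrt(pi * (1 - pi) * pi_hat * (1 - pi_hat))
             / (pi * (1 - pi_hat) + pi_hat * (1 - pi)))
    return kappa, mcc, scale

kappa, mcc, scale = kappa_and_mcc(tp=40, fp=10, fn=5, tn=45)
print(kappa, mcc * scale)  # the two values agree up to rounding
```

Because the scale factor is at most 1 by the arithmetic-geometric mean inequality, $|\kappa|$ never exceeds $|\mathrm{MCC}|$ in this setting.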

In the presence of abstentions, $\kappa$'s value depends on the preprocessing choice (exclusion, recode-as-negative, or three-class extension), each answering a distinct inferential question (Rao et al., 25 May 2026).
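A small sketch shows how much the three preprocessing choices can move the statistic on the same data. The label strings, the `strategy` names, and the toy ratings are all hypothetical and chosen only for illustration.

```python
from collections import Counter

def kappa_from_labels(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a) / (n * n)
    return (p_o - p_e) / (1 - p_e)

def kappa_with_abstentions(a, b, strategy, abstain="abstain", negative="neg"):
    if strategy == "exclude":            # drop items where either rater abstained
        pairs = [(x, y) for x, y in zip(a, b) if abstain not in (x, y)]
    elif strategy == "recode_negative":  # treat an abstention as a negative label
        pairs = [(negative if x == abstain else x, negative if y == abstain else y)
                 for x, y in zip(a, b)]
    elif strategy == "three_class":      # keep abstention as its own category
        pairs = list(zip(a, b))
    else:
        raise ValueError(f"unknown strategy: {strategy}")
    xs, ys = zip(*pairs)
    return kappa_from_labels(xs, ys)

rater_1 = ["pos", "pos", "neg", "neg", "abstain", "pos", "neg", "neg"]
rater_2 = ["pos", "neg", "neg", "neg", "neg", "pos", "abstain", "neg"]
for strategy in ("exclude", "recode_negative", "three_class"):
    print(strategy, round(kappa_with_abstentions(rater_1, rater_2, strategy), 3))
```

On this toy data the three strategies give clearly different values, which is why the chosen handling should be stated whenever $\kappa$ is reported.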

3. Theoretical Interpretation and Information-Theoretic Connections

$\kappa$ formalizes the idea of “agreement beyond chance” under a permutation (hypergeometric) null model, guaranteeing $\mathbb{E}[\kappa] = 0$ for random labeling with fixed marginals (Lippitt et al., 15 Jan 2026). This property persists only when the null holds marginals fixed across resamplings; in multinomial or other nulls where marginals are random, $\mathbb{E}[\kappa]$ departs from zero and interpretability collapses (Lippitt et al., 15 Jan 2026).
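The fixed-marginal null can be simulated by shuffling one rater's labels, which keeps both marginals unchanged. The following Monte Carlo sketch, with illustrative label counts and a fixed seed, checks that the average $\kappa$ under such shuffles stays close to zero.

```python
import numpy as np

def kappa_labels(a, b, k):
    """Cohen's kappa for two integer label arrays with categories 0..k-1."""
    table = np.zeros((k, k))
    np.add.at(table, (a, b), 1)                   # accumulate the confusion matrix
    n = table.sum()
    p_e = (table.sum(axis=1) @ table.sum(axis=0)) / n**2
    return (np.trace(table) / n - p_e) / (1 - p_e)

rng = np.random.default_rng(0)
a = np.repeat([0, 1, 2], [60, 30, 10])            # rater 1 labels with fixed marginals
b = np.repeat([0, 1, 2], [50, 30, 20])            # rater 2 labels with fixed marginals

# Permuting b breaks any association but leaves both marginals untouched.
null = np.array([kappa_labels(a, rng.permutation(b), 3) for _ in range(2000)])
print(null.mean(), null.std())
```

Since $P_e$ is constant under permutation and the expected diagonal mass equals $P_e$, the simulated mean should differ from zero only by Monte Carlo noise.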

Recent developments have anchored $\kappa$ to information-theoretic quantities. Notably, there is a smooth, monotonic relationship between $\kappa$ and the Resistor Average Distance (RAD) between class-conditional densities in classification (Crow et al., 2024). The RAD $R(p, q)$ between densities $p$ and $q$ is the resistor average of the two Kullback–Leibler divergences:

$$\frac{1}{R(p, q)} = \frac{1}{D_{\mathrm{KL}}(p \,\|\, q)} + \frac{1}{D_{\mathrm{KL}}(q \,\|\, p)}.$$

Thus, the maximum achievable $\kappa$ is a direct function of the intrinsic information separation between classes. Empirical studies demonstrate that observed $\kappa$ closely matches the RAD-based upper bound on both synthetic and real datasets, confirming this theoretical link (Crow et al., 2024).
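As a rough illustration of the distance itself (not of the bound on $\kappa$, whose exact form is not reproduced here), the resistor average can be computed for two discrete distributions. The sketch assumes both distributions share the same support with strictly positive probabilities; the function names are hypothetical.

```python
import numpy as np

def kl_divergence(p, q):
    """Kullback-Leibler divergence D(p || q) in nats for discrete distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    mask = p > 0                                   # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def resistor_average(p, q):
    """Resistor average of the two KL divergences: 1/R = 1/D(p||q) + 1/D(q||p)."""
    d_pq, d_qp = kl_divergence(p, q), kl_divergence(q, p)
    if d_pq == 0.0 or d_qp == 0.0:
        return 0.0                                 # identical distributions
    return 1.0 / (1.0 / d_pq + 1.0 / d_qp)

p, q = [0.8, 0.2], [0.3, 0.7]
print(kl_divergence(p, q), kl_divergence(q, p), resistor_average(p, q))
```

Like parallel resistors, the result is symmetric in its arguments and smaller than either of the two divergences.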

Information Agreement (IA), an alternative based on mutual information, avoids the “chance” baseline and normalizes by the minimum entropy of the marginal distributions; IA always lies in $[0, 1]$ and is robust to the prevalence/bias artifacts that affect $\kappa$ (Casagrande et al., 2020).
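A minimal sketch of such an index, assuming it is simply the mutual information of the joint rating distribution divided by the smaller of the two marginal entropies, is given below. The published definition may differ in detail (for example in how zero entries are handled), and the function name is hypothetical.

```python
import numpy as np

def information_agreement(confusion):
    """Mutual information of the two ratings, normalized by the smaller marginal entropy."""
    joint = np.asarray(confusion, dtype=float)
    joint = joint / joint.sum()

    def entropy(v):
        v = v[v > 0]                               # treat 0 * log 0 as 0
        return float(-(v * np.log2(v)).sum())

    h_rows, h_cols = entropy(joint.sum(axis=1)), entropy(joint.sum(axis=0))
    mutual_info = h_rows + h_cols - entropy(joint.ravel())
    h_min = min(h_rows, h_cols)
    return mutual_info / h_min if h_min > 0 else float("nan")

print(information_agreement([[30, 0], [0, 70]]))    # perfectly dependent ratings
print(information_agreement([[25, 25], [25, 25]]))  # independent ratings
```

Because mutual information never exceeds either marginal entropy, the ratio stays in $[0, 1]$ whenever it is defined.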

4. Limitations, Prevalence Paradox, and Critique

$\kappa$’s behavior is unintuitive under prevalence or marginal imbalance, known as the “prevalence paradox”: for highly imbalanced marginals, $\kappa$ can be low even with near-perfect agreement on the dominant class (Wong et al., 2021, Silveira et al., 2022). Moreover, $\kappa$'s lower bound is not always $-1$; extremely skewed marginals may force $\kappa$ only slightly negative or near zero even for maximal disagreement (Safak, 2020).

Empirical analysis over 2×2 tables confirms:

  • Severe underestimation of agreement under extreme prevalence
  • Collapse or instability of $\kappa$ when row or column marginals are near zero
  • High type I/II error rates under moderate observed agreement
  • Very similar “handicapped” behavior to Pearson’s $r$ and other classical measures

Consequently, Holley & Guilford’s $G$ and Gwet’s $AC_1$ have been recommended as more reliable alternatives in dichotomous settings, with $AC_1$ correcting the prevalence effect and correlating near-perfectly with $G$ (Silveira et al., 2022).
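The prevalence paradox and the behavior of the two alternatives can be seen on one skewed 2×2 table. The sketch uses the standard binary forms $G = 2P_o - 1$ and Gwet's $AC_1$ with chance term $2\bar{\pi}(1-\bar{\pi})$, where $\bar{\pi}$ is the mean positive rate of the two raters; the counts are invented for illustration.

```python
def agreement_indices(a, b, c, d):
    """a, d: both raters positive / both negative; b, c: the two kinds of disagreement."""
    n = a + b + c + d
    p_o = (a + d) / n
    pi, pi_hat = (a + b) / n, (a + c) / n
    p_e_kappa = pi * pi_hat + (1 - pi) * (1 - pi_hat)   # Cohen's chance term
    mean_pi = (pi + pi_hat) / 2
    p_e_ac1 = 2 * mean_pi * (1 - mean_pi)               # Gwet's chance term (binary case)
    return {
        "p_o": p_o,
        "kappa": (p_o - p_e_kappa) / (1 - p_e_kappa),
        "G": 2 * p_o - 1,
        "AC1": (p_o - p_e_ac1) / (1 - p_e_ac1),
    }

# 92% raw agreement, but 94% of each rater's labels are positive.
skewed = agreement_indices(a=90, b=4, c=4, d=2)
print({name: round(value, 3) for name, value in skewed.items()})
```

Here $\kappa$ comes out near 0.29 while $G$ and $AC_1$ stay at about 0.84 and 0.91, which is the discrepancy the paradox refers to.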

5. Interpretation, Significance Indices, and Reporting Guidelines

Raw $\kappa$ values are widely classified into “slight,” “fair,” “moderate,” “substantial,” and “almost perfect” following ad hoc thresholds (e.g., Landis–Koch), but these scales are arbitrary and not functionally linked to sampling variability, number of categories, or sample size (Casagrande et al., 21 Apr 2025).

Principled, probabilistic significance can be assigned by computing the likelihood (over all admissible confusion matrices) that a random labeler obtains a $\kappa$ at least as large as observed; this is termed the significance index. As samples grow, this approach converges to a distributional index based on a location in the probability simplex. This framework enables data-dependent, threshold-free interpretability for $\kappa$ (Casagrande et al., 21 Apr 2025).
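The flavor of such an index can be conveyed with a brute-force sketch for the binary case. It enumerates every 2×2 table with the same total and reports the fraction whose $\kappa$ is at least the observed one. Weighting all tables equally is an assumption of this sketch, not necessarily the weighting used in the cited work, and the function names are hypothetical.

```python
from itertools import product

def kappa_2x2(a, b, c, d):
    n = a + b + c + d
    p_o = (a + d) / n
    pi, pi_hat = (a + b) / n, (a + c) / n
    p_e = pi * pi_hat + (1 - pi) * (1 - pi_hat)
    return None if p_e == 1.0 else (p_o - p_e) / (1 - p_e)   # None marks an undefined kappa

def exceedance_fraction(a, b, c, d):
    """Fraction of all 2x2 tables with the same total whose kappa is >= the observed kappa.

    Assumes the observed kappa is defined; intended for small totals only.
    """
    n = a + b + c + d
    observed = kappa_2x2(a, b, c, d)
    hits = total = 0
    for x, y, z in product(range(n + 1), repeat=3):
        w = n - x - y - z
        if w < 0:
            continue
        k = kappa_2x2(x, y, z, w)
        if k is None:
            continue                     # skip tables where kappa is undefined
        total += 1
        hits += k >= observed - 1e-12    # small tolerance for floating-point ties
    return hits / total

print(exceedance_fraction(10, 0, 0, 10))   # perfect agreement: very few tables match it
print(exceedance_fraction(8, 3, 4, 5))     # moderate agreement: many more tables do
```

A small fraction indicates that the observed value is hard to reach by arbitrary labeling, without appealing to fixed verbal thresholds.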

Reporting best practice stipulates:

  • Explicit statement of judgment scale and abstention handling
  • Publication of full confusion matrices and marginal rates
  • Reporting $\kappa$ alongside $G$ or $AC_1$ in dichotomous settings
  • Cautious interpretation or replacement when prevalence is extreme or marginal bias is present
  • Consideration of alternative or disattenuated indices when multi-rater or multi-class settings apply (Rao et al., 25 May 2026, Lippitt et al., 15 Jan 2026)

6. Extensions and Empirical Frameworks

Bayesian and Multilevel Extensions

For repeated measures, longitudinal, or multilevel data (e.g., multiple raters and time points, or hierarchical structures), generalized linear mixed models extend $\kappa$ estimation by incorporating subject, rater, and batch effects. Marginal or conditional $\kappa$ estimates then summarize agreement while propagating appropriate posterior uncertainty (Hawila et al., 2024). Bayesian models, e.g., BIN, BPN, and BFN frameworks, produce less biased $\kappa$ estimates with valid credible intervals, especially under small samples or deep nesting.
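The multilevel models described above are beyond a short example, but the idea of a posterior over $\kappa$ can be illustrated with a much simpler stand-in: a Dirichlet posterior over the cell probabilities of a single confusion matrix. This is not the GLMM approach of the cited work; the uniform prior, the function name, and the data are assumptions made for illustration.

```python
import numpy as np

def kappa_posterior(confusion, draws=4000, prior=1.0, seed=0):
    """Posterior draws of kappa under a Dirichlet(prior) model on the K*K cell probabilities."""
    counts = np.asarray(confusion, dtype=float)
    k = counts.shape[0]
    rng = np.random.default_rng(seed)
    cells = rng.dirichlet(counts.ravel() + prior, size=draws).reshape(draws, k, k)
    p_o = np.trace(cells, axis1=1, axis2=2)            # diagonal mass per draw
    p_e = np.einsum("di,di->d", cells.sum(axis=2), cells.sum(axis=1))  # row . column marginals
    return (p_o - p_e) / (1 - p_e)

samples = kappa_posterior([[20, 5], [10, 15]])
lower, upper = np.percentile(samples, [2.5, 97.5])
print(round(float(samples.mean()), 3), round(float(lower), 3), round(float(upper), 3))
```

The percentile interval gives a rough 95% credible interval for $\kappa$ under this simplified model, with no rater, subject, or batch effects.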

Cross-Replication Reliability (xRR) Framework

Wong et al. propose benchmarking $\kappa$ through cross-replication reliability, introducing cross-kappa ($\kappa_x$) to quantify agreement across replications (e.g., different annotator pools or protocols) (Wong et al., 2021). Given two annotation runs $X$ and $Y$ on the same items, $\kappa_x$ generalizes $\kappa$ to compare any two groups, with normalization correcting for low within-pool reliability:

$$\kappa_x(X, Y) = 1 - \frac{d_o(X, Y)}{d_e(X, Y)},$$

where $d_o$ and $d_e$ are observed and expected cross-replication disagreements.
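A simplified sketch of a cross-replication statistic of the form "one minus observed over expected disagreement" follows. It assumes nominal labels, complete rating arrays (every item rated by every annotator in each run), observed disagreement averaged over same-item rating pairs, and expected disagreement averaged over all item pairs. These choices and the name `cross_kappa` are assumptions of the sketch and may not match the published estimator in detail.

```python
import numpy as np

def cross_kappa(x, y):
    """x: (items, R) labels from run X; y: (items, S) labels from run Y, same item order."""
    x, y = np.asarray(x), np.asarray(y)
    # diff[i, r, j, s] is True when rating r of item i in X differs from rating s of item j in Y.
    diff = x[:, :, None, None] != y[None, None, :, :]
    d_e = diff.mean()                      # expected disagreement over all item pairs
    idx = np.arange(x.shape[0])
    d_o = diff[idx, :, idx, :].mean()      # observed disagreement on matching items only
    return 1.0 - d_o / d_e

# With one annotator per run, this reduces to Cohen's kappa (0.4 for the table [[20, 5], [10, 15]]).
x = np.array([0] * 25 + [1] * 25).reshape(-1, 1)
y = np.array([0] * 20 + [1] * 5 + [0] * 10 + [1] * 15).reshape(-1, 1)
print(cross_kappa(x, y))
```

The reduction to Cohen's $\kappa$ in the single-annotator case is a useful sanity check on any implementation of this kind.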

Case studies demonstrate that $\kappa_x$ reveals scenario-specific phenomena, such as population- or protocol-specific bias or reproducibility failures, that are unobservable with classical $\kappa$ alone.

7. Alternatives, Generalizations, and Practical Recommendations

Many limitations of $\kappa$ have motivated both conceptual and computational alternatives:

  • Information Agreement (IA) and its zero-entry extension directly measure mutual information normalized by marginal entropy, avoiding arbitrary chance baselines and yielding $[0, 1]$-scaled, always nonnegative agreement values (Casagrande et al., 2020).
  • Gwet’s $AC_1$, Holley & Guilford’s $G$, Yule’s $Q$, and Bennett’s $S$ are empirically less biased and more robust across class-imbalance regimes (Silveira et al., 2022, Tian et al., 2024).
  • Prevalence- and bias-adjusted $\kappa$ (PABAK) and generalized chance-corrected indices remedy specific defects, but do not address all issues.
  • In simulated studies with correlated rater decision processes, $\kappa$ is consistently biased downward by about 0.08 units relative to probabilistic-certainty benchmarks, whereas alternative indices are closer to the true corrected agreement (Tian et al., 2024).

Practical guidance includes:

  • Avoiding $\kappa$ when outcome prevalence is extreme or marginal bias is strong.
  • Supplementing $\kappa$ with prevalence and bias indices; when necessary, apply alternative indices and contextualize any negative or near-zero $\kappa$ against the true lower bound determined by marginals.
  • Using $\kappa_x$ or Bayesian models for benchmarking and quantifying reliability in annotation-intensive, crowdsourced, or multilevel designs.
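For the second recommendation, the usual prevalence index, bias index, and PABAK for a 2×2 table can be reported next to $\kappa$. The sketch uses the standard binary definitions $|a - d|/n$, $|b - c|/n$, and $2P_o - 1$; the function name and counts are illustrative.

```python
def prevalence_bias_report(a, b, c, d):
    """a, d: agreements on positive / negative; b, c: the two kinds of disagreement."""
    n = a + b + c + d
    p_o = (a + d) / n
    return {
        "prevalence_index": abs(a - d) / n,   # imbalance between the two agreement cells
        "bias_index": abs(b - c) / n,         # asymmetry between the two disagreement cells
        "pabak": 2 * p_o - 1,                 # prevalence- and bias-adjusted kappa
    }

print(prevalence_bias_report(a=90, b=4, c=4, d=2))
```

A prevalence index close to 1, as in this example, signals that a low $\kappa$ may reflect skewed marginals rather than poor agreement.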

Summary Table: Cohen’s Kappa—Core Quantities

| Quantity | Symbol | Formula / Description |
|---|---|---|
| Observed agreement | $P_o$ | $\frac{1}{n} \sum_{i=1}^K n_{ii}$ |
| Marginal rates | $p_i$, $q_i$ | Row/column sums per category ($p_i = \sum_j n_{ij}/n$, $q_i = \sum_j n_{ji}/n$) |
| Chance agreement | $P_e$ | $\sum_{i=1}^K p_i q_i$ |
| Kappa | $\kappa$ | $\frac{P_o - P_e}{1 - P_e}$ |
| Maximum kappa | $\kappa_{\max}$ | $\frac{\sum_i \min(p_i, q_i) - P_e}{1 - P_e}$ |
| Minimum kappa | $\kappa_{\min}$ | $\frac{\max(0,\, \max_i (p_i + q_i) - 1) - P_e}{1 - P_e}$ |
| Info distance bound | RAD | Resistor average distance between class-conditional densities |

Cohen’s $\kappa$ remains a central measure of chance-corrected agreement, yet its interpretation demands careful attention to prevalence, marginal distributions, and the explicit modeling of chance. Alternatives and recent generalizations offer more principled or informative reliability metrics in many empirical settings (Casagrande et al., 2020, Wong et al., 2021, Silveira et al., 2022, Crow et al., 2024, Sahu et al., 2024, Casagrande et al., 21 Apr 2025, Lippitt et al., 15 Jan 2026, Rao et al., 25 May 2026, Tian et al., 2024, Hawila et al., 2024).
