Cohen’s Kappa: Agreement Beyond Chance
- Cohen’s kappa is a chance-corrected measure that quantifies agreement for categorical ratings by comparing observed agreement to the agreement expected from independent marginal distributions.
- The metric is computed using a confusion matrix and is normalized to lie between perfect agreement (1) and values below zero for systematic disagreement, with clear extensions to binary and multi-class tasks.
- Limitations including the prevalence paradox have led to alternative statistics like Gwet’s AC1 and Bayesian methods, emphasizing cautious interpretation and complementary reliability indices.
Cohen’s kappa (κ) is a canonical chance-corrected index of pairwise agreement for categorical data. It quantifies the extent to which observed agreement between two raters (or a rater and a classifier) exceeds that expected under independent labeling drawn from their marginal distributions. κ is widely used in medicine, psychology, machine learning, and annotation-based research to evaluate inter-rater reliability or classifier validity, and has yielded a large critical literature on its formal structure, interpretability, and practical limitations. The following sections provide a detailed exposition of its definition, mathematical properties, connections to information theory, computational issues, alternatives, and empirical benchmarking frameworks.
1. Mathematical Definition and Computation
Suppose two raters independently classify $N$ items into $k$ mutually exclusive categories. Let $n_{ij}$ be the number of items that Rater 1 assigns to category $i$ and Rater 2 to category $j$ ($i, j = 1, \dots, k$), with marginal rates $p_{i+} = \frac{1}{N}\sum_j n_{ij}$ and $p_{+i} = \frac{1}{N}\sum_j n_{ji}$. The observed agreement is
$$p_o = \frac{1}{N}\sum_{i=1}^{k} n_{ii},$$
and the chance agreement under independent draws from the marginal category distributions is
$$p_e = \sum_{i=1}^{k} p_{i+}\, p_{+i}.$$
Cohen’s kappa is then
$$\kappa = \frac{p_o - p_e}{1 - p_e}.$$
For binary labels, this specializes to:
$$p_e = p_1 p_2 + (1 - p_1)(1 - p_2),$$
where $p_1$ and $p_2$ are the marginal positive rates for the two raters or systems (Rao et al., 25 May 2026, Tian et al., 2024).
Interpretation: $\kappa = 1$ denotes perfect agreement, $\kappa = 0$ agreement no better than chance, and $\kappa < 0$ systematic disagreement beyond chance.
The denominator $1 - p_e$ ensures normalization between systematic mismatch (with a potentially negative lower bound) and perfect concordance ($\kappa = 1$). For multi-class tasks this generalizes directly by forming the confusion or agreement matrix and substituting each $n_{ij}$ as appropriate (Casagrande et al., 21 Apr 2025, Wong et al., 2021).
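As a minimal sketch of this computation, the function below derives kappa from a square matrix of counts: observed agreement is the diagonal share, chance agreement is the sum of products of row and column marginal rates. The name `cohen_kappa` and the example counts are illustrative assumptions, not taken from the cited works.

```python
from typing import Sequence


def cohen_kappa(counts: Sequence[Sequence[float]]) -> float:
    """Cohen's kappa from a square agreement matrix.

    counts[i][j] is the number of items Rater 1 put in category i
    and Rater 2 put in category j.
    """
    k = len(counts)
    total = sum(sum(row) for row in counts)
    if total == 0:
        raise ValueError("agreement matrix is empty")
    # Marginal rates for Rater 1 (rows) and Rater 2 (columns).
    row_rates = [sum(counts[i]) / total for i in range(k)]
    col_rates = [sum(counts[i][j] for i in range(k)) / total for j in range(k)]
    # Observed agreement: share of items on the diagonal.
    p_o = sum(counts[i][i] for i in range(k)) / total
    # Chance agreement under independent draws from the marginals.
    p_e = sum(r * c for r, c in zip(row_rates, col_rates))
    if p_e == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (p_o - p_e) / (1.0 - p_e)


# Made-up 2x2 table: p_o = 0.7 and p_e = 0.5.
example = [[20, 5], [10, 15]]
print(round(cohen_kappa(example), 3))  # 0.4
```

The same function applies unchanged to multi-class tables, since only the diagonal and the marginals enter the formula.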
2. Statistical Properties, Bounds, and Relation to Other Measures
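Cohen’s original work established both explicit upper and lower bounds for $\kappa$ given fixed marginals (Sahu et al., 2024, Safak, 2020). The maximum possible agreement for fixed marginals $(p_{i+}, p_{+i})$ is $p_o^{\max} = \sum_i \min(p_{i+}, p_{+i})$, yielding an upper bound
$$\kappa_{\max} = \frac{\sum_i \min(p_{i+}, p_{+i}) - p_e}{1 - p_e}.$$
However, the minimum feasible agreement $p_o^{\min}$, once an open question, admits the closed form $p_o^{\min} = \max(0,\; p_{1+} + p_{+1} - 1)$ after appropriate permutation of categories (so $p_{1+} + p_{+1}$ is maximal) (Safak, 2020). The tight lower bound is
$$\kappa_{\min} = \frac{p_o^{\min} - p_e}{1 - p_e},$$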
which can be substantially negative, especially under marginal imbalance.
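The bounds can be sketched numerically. The code assumes that the largest feasible agreement is the sum over categories of the smaller of the two marginal rates, and that the smallest feasible agreement is the larger of zero and the maximum over categories of (row rate + column rate − 1). The helper name `kappa_bounds` and the marginals are hypothetical.

```python
from typing import Sequence, Tuple


def kappa_bounds(
    row_rates: Sequence[float], col_rates: Sequence[float]
) -> Tuple[float, float]:
    """Return (kappa_min, kappa_max) for fixed marginal rates."""
    pairs = list(zip(row_rates, col_rates))
    # Chance agreement implied by the marginals.
    p_e = sum(r * c for r, c in pairs)
    # Largest feasible diagonal mass.
    p_o_max = sum(min(r, c) for r, c in pairs)
    # Smallest feasible diagonal mass (assumed closed form, see lead-in).
    p_o_min = max(0.0, max(r + c - 1.0 for r, c in pairs))
    return (p_o_min - p_e) / (1.0 - p_e), (p_o_max - p_e) / (1.0 - p_e)


# Hypothetical marginals: balanced, skewed but matching, and mismatched.
for marginals in (
    ([0.5, 0.5], [0.5, 0.5]),
    ([0.9, 0.1], [0.9, 0.1]),
    ([0.9, 0.1], [0.6, 0.4]),
):
    lower, upper = kappa_bounds(*marginals)
    print(marginals, round(lower, 3), round(upper, 3))
```

Comparing an observed kappa with this interval, rather than with the nominal range from −1 to 1, shows how much room the marginals actually leave.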
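For binary variables, there is a linear relationship to the Pearson correlation $r$ (the phi coefficient):
$$\kappa = c\, r, \qquad c = \frac{2\sqrt{p_1(1 - p_1)\, p_2(1 - p_2)}}{p_1(1 - p_2) + p_2(1 - p_1)}.$$
This constant depends only on marginals. $\kappa$ and the Matthews Correlation Coefficient/phi coefficient share the same covariance numerator (up to a constant factor) but use different normalizations, leading to $|\kappa| \le |\mathrm{MCC}|$ with equality if marginals match (Sahu et al., 2024, Rao et al., 25 May 2026).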
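The binary relationship can be checked numerically. The sketch below computes kappa, the phi coefficient, and the marginal-only scale factor 2·sqrt(p1(1−p1)p2(1−p2)) / (p1(1−p2) + p2(1−p1)) from a 2×2 table, where p1 and p2 are the two raters' positive rates. The cell naming (`a`, `b`, `c`, `d`) and the counts are assumptions for illustration, and every marginal is assumed to be nonzero.

```python
import math
from typing import Tuple


def kappa_phi_scale(a: int, b: int, c: int, d: int) -> Tuple[float, float, float]:
    """Return (kappa, phi, scale) for a 2x2 table.

    a: both positive, b: Rater 1 positive only,
    c: Rater 2 positive only, d: both negative.
    """
    n = a + b + c + d
    p1 = (a + b) / n  # positive rate of Rater 1
    p2 = (a + c) / n  # positive rate of Rater 2
    p_o = (a + d) / n
    p_e = p1 * p2 + (1 - p1) * (1 - p2)
    kappa = (p_o - p_e) / (1 - p_e)
    # Phi coefficient (equals MCC for a 2x2 table).
    phi = (a * d - b * c) / math.sqrt((a + b) * (c + d) * (a + c) * (b + d))
    # Factor linking the two; it depends on the marginals only.
    scale = 2 * math.sqrt(p1 * (1 - p1) * p2 * (1 - p2)) / (
        p1 * (1 - p2) + p2 * (1 - p1)
    )
    return kappa, phi, scale


kappa, phi, scale = kappa_phi_scale(40, 10, 5, 45)  # made-up counts
print(round(kappa, 4), round(phi, 4), round(scale, 4))
```

Because the scale factor never exceeds 1 (by the arithmetic-geometric mean inequality), kappa cannot exceed phi in absolute value, and the two coincide when p1 equals p2.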
In the presence of abstentions, the value of $\kappa$ depends on the preprocessing choice (exclusion, recode-as-negative, or three-class extension), each of which answers a distinct inferential question (Rao et al., 25 May 2026).
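To make the three preprocessing choices concrete, the sketch below applies each to the same pair of label sequences before computing kappa. Encoding abstentions as `None`, the helper names, and the toy labels are all assumptions made for illustration.

```python
from collections import Counter
from typing import Hashable, List, Optional, Sequence, Tuple

Label = Optional[int]  # 1 = positive, 0 = negative, None = abstain (assumed encoding)


def kappa_from_labels(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa from two aligned label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)


def drop_abstentions(a: Sequence[Label], b: Sequence[Label]) -> Tuple[List[int], List[int]]:
    """Exclusion: keep only items on which both raters committed."""
    kept = [(x, y) for x, y in zip(a, b) if x is not None and y is not None]
    return [x for x, _ in kept], [y for _, y in kept]


def recode_as_negative(labels: Sequence[Label]) -> List[int]:
    """Recode-as-negative: treat an abstention as a negative label."""
    return [0 if x is None else x for x in labels]


def as_third_class(labels: Sequence[Label]) -> List[Hashable]:
    """Three-class extension: abstention becomes its own category."""
    return ["abstain" if x is None else x for x in labels]


rater_1 = [1, 1, 0, 0, None, 1, 0, None, 1, 0]
rater_2 = [1, 0, 0, 0, None, 1, None, 0, 1, 0]

kappa_excluded = kappa_from_labels(*drop_abstentions(rater_1, rater_2))
kappa_recoded = kappa_from_labels(recode_as_negative(rater_1), recode_as_negative(rater_2))
kappa_three_class = kappa_from_labels(as_third_class(rater_1), as_third_class(rater_2))
print(round(kappa_excluded, 3), round(kappa_recoded, 3), round(kappa_three_class, 3))
```

The three values differ on the same raw data, which is why the handling rule needs to be reported alongside the statistic.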
3. Theoretical Interpretation and Information-Theoretic Connections
$\kappa$ formalizes the idea of “agreement beyond chance” under a permutation (hypergeometric) null model, guaranteeing $E[\kappa] = 0$ for random labeling with fixed marginals (Lippitt et al., 15 Jan 2026). This property persists only when the null holds marginals fixed across resamplings; in multinomial or other nulls where marginals are random, the expected value of $\kappa$ departs from zero and interpretability collapses (Lippitt et al., 15 Jan 2026).
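A small simulation can illustrate the fixed-marginal null. The sketch below shuffles one rater's labels many times, which keeps both marginals fixed, and averages the resulting kappa values. The sample sizes, the seed, and the function names are arbitrary assumptions.

```python
import random
from collections import Counter
from typing import Hashable, Sequence


def kappa_from_labels(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Cohen's kappa from two aligned label sequences."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    count_a, count_b = Counter(a), Counter(b)
    p_e = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (p_o - p_e) / (1.0 - p_e)


def mean_kappa_under_permutation(
    a: Sequence[Hashable], b: Sequence[Hashable], n_perm: int = 2000, seed: int = 0
) -> float:
    """Average kappa over random permutations of b (marginals stay fixed)."""
    rng = random.Random(seed)
    shuffled = list(b)
    total = 0.0
    for _ in range(n_perm):
        rng.shuffle(shuffled)  # permuting labels changes pairings, not counts
        total += kappa_from_labels(a, shuffled)
    return total / n_perm


# Hypothetical raters with different positive rates (30% and 20%).
rater_1 = [1] * 30 + [0] * 70
rater_2 = [1] * 20 + [0] * 80
print(round(mean_kappa_under_permutation(rater_1, rater_2), 3))
```

Under this scheme chance agreement is the same in every permutation, so the average kappa should sit near zero up to Monte Carlo noise.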
Recent developments have anchored $\kappa$ to information-theoretic quantities. Notably, there is a smooth, monotonic relationship between $\kappa$ and the Resistor Average Distance (RAD) between class-conditional densities in classification (Crow et al., 2024):
6
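where $\mathrm{RAD}(p, q)$ is the resistor average of the two Kullback–Leibler divergences between the class-conditional densities $p$ and $q$:
$$\frac{1}{\mathrm{RAD}(p, q)} = \frac{1}{D_{\mathrm{KL}}(p \,\|\, q)} + \frac{1}{D_{\mathrm{KL}}(q \,\|\, p)}.$$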
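The resistor average itself is straightforward to compute for discrete distributions. The sketch assumes the standard definition, in which the reciprocal of the RAD is the sum of the reciprocals of the two directed KL divergences (like resistors in parallel). Discrete probability vectors stand in for the class-conditional densities, and the mapping from RAD to a kappa bound is deliberately not reproduced here.

```python
import math
from typing import Sequence


def kl_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """D_KL(p || q) in nats; assumes q_i > 0 wherever p_i > 0."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def resistor_average_distance(p: Sequence[float], q: Sequence[float]) -> float:
    """Resistor average of D_KL(p || q) and D_KL(q || p)."""
    d_pq = kl_divergence(p, q)
    d_qp = kl_divergence(q, p)
    if d_pq == 0.0 or d_qp == 0.0:
        return 0.0  # identical distributions carry no separation
    return 1.0 / (1.0 / d_pq + 1.0 / d_qp)


# Hypothetical class-conditional distributions over two outcomes.
p = [0.8, 0.2]
q = [0.3, 0.7]
print(round(resistor_average_distance(p, q), 4))
```

Unlike either directed KL divergence, the result is symmetric in its arguments and never exceeds the smaller of the two.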
Thus, the maximum achievable $\kappa$ is a direct function of the intrinsic information separation between classes. Empirical studies demonstrate that observed $\kappa$ closely matches the RAD-based upper bound on both synthetic and real datasets, confirming this theoretical link (Crow et al., 2024).
Information Agreement (IA), an alternative based on mutual information, avoids the “chance” baseline and normalizes by the minimum entropy of marginal distributions; IA always lies in $[0, 1]$ and is robust to prevalence/bias artifacts that affect $\kappa$ (Casagrande et al., 2020).
4. Limitations, Prevalence Paradox, and Critique
The behavior of $\kappa$ is unintuitive under prevalence or marginal imbalance, a phenomenon known as the “prevalence paradox”: for highly imbalanced marginals, $\kappa$ can be low even with near-perfect agreement on the dominant class (Wong et al., 2021, Silveira et al., 2022). Moreover, the lower bound of $\kappa$ is not always $-1$; extremely skewed marginals may force $\kappa$ to be only slightly negative or near zero even for maximal disagreement (Safak, 2020).
Empirical analysis of 2×2 tables confirms:
- Severe underestimation of agreement under extreme prevalence
- Collapse or instability of $\kappa$ when row or column marginals are near zero
- High type I/II error rates under moderate observed agreement
- Very similar “handicapped” behavior to Pearson’s $r$ and other classical measures
Consequently, Holley & Guilford’s $G$ and Gwet’s $AC_1$ have been recommended as more reliable alternatives in dichotomous settings, with $AC_1$ correcting the prevalence effect and correlating near-perfectly with $G$ (Silveira et al., 2022).
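The prevalence effect is easy to reproduce on a single table. The sketch compares kappa with Holley & Guilford's G (taken here as 2·p_o − 1) and Gwet's AC1 (taken here with chance term 2π(1−π), where π is the mean of the two raters' positive rates). These are the usual 2×2 definitions and are stated as assumptions, as are the cell naming and the counts.

```python
from typing import Dict


def binary_agreement_indices(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """kappa, G, and AC1 for a 2x2 table.

    a: both positive, b: Rater 1 positive only,
    c: Rater 2 positive only, d: both negative.
    """
    n = a + b + c + d
    p_o = (a + d) / n
    p1 = (a + b) / n
    p2 = (a + c) / n
    # Cohen's chance term uses the product of the raters' marginals.
    p_e = p1 * p2 + (1 - p1) * (1 - p2)
    kappa = (p_o - p_e) / (1 - p_e)
    # Holley & Guilford's G fixes chance agreement at 1/2.
    g_index = 2 * p_o - 1
    # Gwet's AC1 uses the average positive rate in its chance term.
    mean_rate = (p1 + p2) / 2
    p_e_gamma = 2 * mean_rate * (1 - mean_rate)
    ac1 = (p_o - p_e_gamma) / (1 - p_e_gamma)
    return {"kappa": kappa, "G": g_index, "AC1": ac1}


# Made-up table with 92% raw agreement and a dominant positive class.
indices = binary_agreement_indices(90, 4, 4, 2)
print({name: round(value, 3) for name, value in indices.items()})
```

On this table kappa comes out far below the raw agreement of 0.92, while G and AC1 stay close to it.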
5. Interpretation, Significance Indices, and Reporting Guidelines
Raw $\kappa$ values are widely classified as “slight,” “fair,” “moderate,” “substantial,” or “almost perfect” following ad hoc thresholds (e.g., Landis–Koch), but these scales are arbitrary and not functionally linked to sampling variability, number of categories, or sample size (Casagrande et al., 21 Apr 2025).
Principled, probabilistic significance can be assigned by computing the likelihood (over all possible confusion matrices of the given size) that a random labeler obtains a $\kappa$ at least as large as the one observed; this is termed the significance index. As samples grow, this approach converges to a distributional index based on the location of the observed agreement matrix in the probability simplex. This framework enables data-dependent, threshold-free interpretability for $\kappa$ (Casagrande et al., 21 Apr 2025).
Reporting best practice stipulates:
- Explicit statement of judgment scale and abstention handling
- Publication of full confusion matrices and marginal rates
- Reporting $\kappa$ alongside $G$ or $AC_1$ in dichotomous settings
- Cautious interpretation or replacement when prevalence is extreme or marginal bias is present
- Consideration of alternative or disattenuated indices when multi-rater or multi-class settings apply (Rao et al., 25 May 2026, Lippitt et al., 15 Jan 2026)
6. Extensions and Empirical Frameworks
Bayesian and Multilevel Extensions
For repeated measures, longitudinal, or multilevel data (e.g., multiple raters and time points, or hierarchical structures), generalized linear mixed models extend $\kappa$ estimation by incorporating subject, rater, and batch effects. Marginal or conditional $\kappa$ estimates then summarize agreement while propagating appropriate posterior uncertainty (Hawila et al., 2024). Bayesian models, e.g., BIN, BPN, and BFN frameworks, produce less biased $\kappa$ estimates with valid credible intervals, especially under small samples or deep nesting.
Cross-Replication Reliability (xRR) Framework
Wong et al. propose benchmarking $\kappa$ through cross-replication reliability, introducing cross-kappa ($\kappa_x$) to quantify agreement across replications (e.g., different annotator pools or protocols) (Wong et al., 2021). Given two annotation runs $X$ and $Y$ on the same items, $\kappa_x$ generalizes $\kappa$ to compare any two groups, with normalization correcting for low within-pool reliability:
$$\kappa_x(X, Y) = 1 - \frac{d_o(X, Y)}{d_e(X, Y)},$$
where $d_o(X, Y)$ and $d_e(X, Y)$ are observed and expected cross-replication disagreements.
Case studies demonstrate that $\kappa_x$ reveals scenario-specific phenomena, such as population- or protocol-specific bias or reproducibility failures, that are unobservable with classical $\kappa$ alone.
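A rough sketch of a cross-replication statistic of the form 1 − (observed disagreement) / (expected disagreement) follows. It makes two assumptions that the text does not confirm: observed disagreement is the per-item share of differing cross-run label pairs, averaged over items, and expected disagreement is the share of differing pairs once labels are pooled across all items. The function name and the data are hypothetical, and the reliability normalization is not included.

```python
from collections import Counter
from typing import Hashable, Sequence


def cross_kappa(
    run_x: Sequence[Sequence[Hashable]], run_y: Sequence[Sequence[Hashable]]
) -> float:
    """1 - d_o / d_e for nominal labels (illustrative definition).

    run_x[i] and run_y[i] hold the labels that each replication
    collected for item i.
    """
    if len(run_x) != len(run_y):
        raise ValueError("both runs must cover the same items")
    # Observed: disagreement between runs on the same item.
    per_item = []
    for labels_x, labels_y in zip(run_x, run_y):
        differing = sum(x != y for x in labels_x for y in labels_y)
        per_item.append(differing / (len(labels_x) * len(labels_y)))
    d_o = sum(per_item) / len(per_item)
    # Expected: disagreement between runs regardless of item.
    pool_x = Counter(label for labels in run_x for label in labels)
    pool_y = Counter(label for labels in run_y for label in labels)
    n_x, n_y = sum(pool_x.values()), sum(pool_y.values())
    matching = sum(pool_x[c] * pool_y[c] for c in pool_x)
    d_e = 1.0 - matching / (n_x * n_y)
    if d_e == 0.0:
        raise ValueError("undefined when every pooled label is identical")
    return 1.0 - d_o / d_e


# Two hypothetical replications, two annotators per item.
run_a = [[1, 1], [0, 0], [1, 0], [0, 0]]
run_b = [[1, 1], [0, 1], [1, 1], [0, 0]]
print(round(cross_kappa(run_a, run_b), 3))
```

Because the two runs need not share annotators or have the same number of labels per item, the comparison is made pairwise across runs rather than within a single pool.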
7. Alternatives, Generalizations, and Practical Recommendations
Many limitations of $\kappa$ have motivated both conceptual and computational alternatives:
- Information Agreement (IA) and its zero-entry extension directly measure mutual information normalized by marginal entropy, avoiding arbitrary chance baselines and yielding $[0, 1]$-scaled, always nonnegative agreement values (Casagrande et al., 2020).
- Gwet’s $AC_1$, Holley & Guilford’s $G$, Yule’s coefficient, and Bennett’s $S$ are empirically less biased and more robust across class-imbalance regimes (Silveira et al., 2022, Tian et al., 2024).
- Prevalence- and bias-adjusted kappa (PABAK) and generalized chance-corrected indices remedy specific defects, but do not address all issues.
- In simulated studies with correlated rater decision processes, $\kappa$ is consistently biased downward by about 0.08 units relative to probabilistic-certainty benchmarks, whereas some of the alternative indices are closer to the true corrected agreement (Tian et al., 2024).
Practical guidance includes:
- Avoiding $\kappa$ when outcome prevalence is extreme or marginal bias is strong.
- Supplementing $\kappa$ with prevalence and bias indices; when necessary, applying alternative indices and contextualizing any negative or near-zero $\kappa$ against the true lower bound determined by the marginals.
- Using $\kappa_x$ or Bayesian models for benchmarking and quantifying reliability in annotation-intensive, crowdsourced, or multilevel designs.
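The supplementary indices mentioned in this guidance can be sketched for a 2×2 table. The definitions used are the common ones and are an assumption here, since the text does not spell them out: prevalence index (a − d)/N, bias index (b − c)/N, and PABAK as 2·p_o − 1. The function name and the counts are illustrative.

```python
from typing import Dict


def prevalence_bias_summary(a: int, b: int, c: int, d: int) -> Dict[str, float]:
    """Prevalence index, bias index, and PABAK for a 2x2 table.

    a: both positive, b: Rater 1 positive only,
    c: Rater 2 positive only, d: both negative.
    """
    n = a + b + c + d
    p_o = (a + d) / n
    return {
        # Imbalance between the two agreement cells.
        "prevalence_index": (a - d) / n,
        # Imbalance between the two disagreement cells.
        "bias_index": (b - c) / n,
        # Kappa recomputed as if prevalence and bias were both zero.
        "pabak": 2 * p_o - 1,
    }


# Made-up tables: one dominated by positives, one with rater bias.
print(prevalence_bias_summary(90, 4, 4, 2))
print(prevalence_bias_summary(40, 15, 5, 40))
```

A prevalence index far from zero signals that a low kappa may reflect skewed marginals rather than poor agreement, and a bias index far from zero signals that the raters differ in how often they assign the positive label.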
Summary Table: Cohen’s Kappa—Core Quantities
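| Quantity | Symbol | Formula / Description |
|---|---|---|
| Observed agreement | $p_o$ | $\frac{1}{N}\sum_i n_{ii}$ |
| Marginal rates | $p_{i+},\ p_{+i}$ | Row/column sums per category ($p_{i+} = n_{i+}/N$, $p_{+i} = n_{+i}/N$) |
| Chance agreement | $p_e$ | $\sum_i p_{i+}\, p_{+i}$ |
| Kappa | $\kappa$ | $\frac{p_o - p_e}{1 - p_e}$ |
| Maximum kappa | $\kappa_{\max}$ | $\frac{\sum_i \min(p_{i+},\, p_{+i}) - p_e}{1 - p_e}$ |
| Minimum kappa | $\kappa_{\min}$ | $\frac{\max\left(0,\ \max_i (p_{i+} + p_{+i} - 1)\right) - p_e}{1 - p_e}$ |
| Info distance bound | $\mathrm{RAD}$ | Resistor average distance between class-conditional densities |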
Cohen’s $\kappa$ remains a central measure of chance-corrected agreement, yet its interpretation demands careful attention to prevalence, marginal distributions, and the explicit modeling of chance. Alternatives and recent generalizations offer more principled or informative reliability metrics in many empirical settings (Casagrande et al., 2020, Wong et al., 2021, Silveira et al., 2022, Crow et al., 2024, Sahu et al., 2024, Casagrande et al., 21 Apr 2025, Lippitt et al., 15 Jan 2026, Rao et al., 25 May 2026, Tian et al., 2024, Hawila et al., 2024).