Fleiss’ Kappa (κ): A Reliability Metric

Updated 13 March 2026
  • Fleiss’ kappa is a chance-corrected coefficient that measures inter-rater agreement when multiple raters classify subjects into nominal categories.
  • It computes agreement by comparing the mean observed agreement with the expected chance agreement derived from marginal probabilities.
  • Widely applied in systematic reviews, medical diagnostics, and educational assessments, it supports extensions for multi-label and hierarchical analyses.

Fleiss’ kappa (κ) is a chance-corrected coefficient quantifying inter-rater agreement for categorical items when each of n raters classifies each of N subjects independently into one of k nominal categories. Unlike pairwise measures such as Cohen’s κ, Fleiss’ κ generalizes agreement assessment to cases with more than two raters. Extensions further allow the analysis of scenarios in which raters can select multiple, possibly hierarchical, categories per subject. Its mathematical rigor and extensibility make κ a standard reliability metric in systematic reviews, medical diagnostics, educational testing, and other multi-rater domains (Arenas, 2018; Moons et al., 2023).

1. Mathematical Foundations and Formal Definition

Let N denote the number of subjects, n the number of raters, and k the number of categories. Denote by n_{ij} the number of raters who assigned subject i to category j, for 1 ≤ i ≤ N and 1 ≤ j ≤ k. Fleiss’ κ is given by:

\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}

where:

  • \bar{P}: mean observed agreement across subjects,
  • \bar{P}_e: mean agreement expected by chance.

Observed agreement P_i for subject i:

P_i = \frac{1}{n(n-1)} \sum_{j=1}^{k} n_{ij}(n_{ij} - 1)

Mean observed agreement:

\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i

Marginal probability of category j:

p_j = \frac{1}{Nn} \sum_{i=1}^{N} n_{ij}

Chance agreement:

\bar{P}_e = \sum_{j=1}^{k} p_j^2

Inserting these quantities into the formula for κ yields the classical Fleiss’ κ (Arenas, 2018; Moons et al., 2023).
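
To make the computation concrete, the following minimal Python sketch implements the classical formulas above for a complete design in which every subject is rated by the same number of raters. The count matrix in the example is hypothetical and is not drawn from either cited paper.

```python
import numpy as np

def fleiss_kappa(counts):
    """Classical Fleiss' kappa from an N x k count matrix.

    counts[i, j] = number of raters who assigned subject i to category j.
    Assumes a complete design: every row sums to the same number of raters n.
    """
    counts = np.asarray(counts, dtype=float)
    N, k = counts.shape
    n = counts[0].sum()  # raters per subject

    # Per-subject observed agreement P_i and its mean P-bar
    P_i = (counts * (counts - 1)).sum(axis=1) / (n * (n - 1))
    P_bar = P_i.mean()

    # Marginal category probabilities p_j and chance agreement P-bar_e
    p_j = counts.sum(axis=0) / (N * n)
    P_e = (p_j ** 2).sum()

    return (P_bar - P_e) / (1 - P_e)

# Hypothetical data: N = 3 subjects, n = 4 raters, k = 3 categories
ratings = np.array([
    [3, 1, 0],
    [1, 2, 1],
    [2, 0, 2],
])
print(round(fleiss_kappa(ratings), 3))
```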

2. Extensions: Generalised and Hierarchical Fleiss’ κ

Classical Fleiss’ κ requires each rater to assign each subject to exactly one category. In research contexts where raters may select multiple (non-exclusive) categories per subject, or where categories have a hierarchy or weights, a generalized κ is used (Moons et al., 2023). Key definitions:

  • x_{ijc} ∈ {0, 1}: indicator that rater j assigns subject i to category c.
  • x_{ic} = Σ_j x_{ijc}: total “yes” votes for subject i and category c.
  • w_c > 0: context-dependent weight per category (e.g., score magnitude).
  • s_{ic}: number of raters eligible to select category c on subject i (for hierarchical designs).

Agreement per category per subject:

P_{ic} = \frac{x_{ic}(x_{ic} - 1) + (s_{ic} - x_{ic})(s_{ic} - x_{ic} - 1)}{s_{ic}(s_{ic} - 1)}

Aggregated observed agreement for category cc:

Po_c = \frac{\sum_{i=1}^{N} P_{ic}}{\sum_{i=1}^{N} 1}

Expected agreement per category:

Pe_c = 2\left(\frac{\sum_i x_{ic}}{\sum_i s_{ic}}\right)^2 - 2\left(\frac{\sum_i x_{ic}}{\sum_i s_{ic}}\right) + 1

Generalized κ\kappa for the hierarchical, weighted case:

\kappa = \frac{\sum_{c=1}^{k} w_c \, \phi_c \, (Po_c - Pe_c)}{\sum_{c=1}^{k} w_c \, \phi_c \, (1 - Pe_c)}

with \phi_c = \frac{\sum_i s_{ic}}{NI}, a scaling factor ensuring that main categories dominate deep subcategories when eligible rater counts differ.

When categories are exclusive and all raters assess all subjects, the generalization reduces to the classical Fleiss’ κ (Moons et al., 2023).
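
The generalized coefficient can be sketched directly from the per-category formulas above. The snippet below is an illustration rather than the authors’ reference implementation: it assumes every s_{ic} ≥ 2 and interprets the I in φ_c as the total number of raters, which is an assumption rather than something fixed by the text.

```python
import numpy as np

def generalised_kappa(x, s, w, n_raters):
    """Generalised (multi-label, weighted, hierarchical) Fleiss' kappa.

    x[i, c]  : number of "yes" votes for category c on subject i
    s[i, c]  : number of raters eligible to select category c on subject i
               (assumed >= 2 everywhere in this sketch)
    w[c]     : positive weight of category c
    n_raters : total number of raters (assumed to be the I in the phi_c factor)
    """
    x = np.asarray(x, dtype=float)
    s = np.asarray(s, dtype=float)
    w = np.asarray(w, dtype=float)
    N = x.shape[0]

    # Per-subject, per-category agreement P_ic over "yes" and "no" votes
    P_ic = (x * (x - 1) + (s - x) * (s - x - 1)) / (s * (s - 1))

    Po_c = P_ic.mean(axis=0)                 # observed agreement per category
    q_c = x.sum(axis=0) / s.sum(axis=0)      # overall "yes" proportion per category
    Pe_c = 2 * q_c**2 - 2 * q_c + 1          # expected agreement per category
    phi_c = s.sum(axis=0) / (N * n_raters)   # eligibility-based scaling factor

    numerator = (w * phi_c * (Po_c - Pe_c)).sum()
    denominator = (w * phi_c * (1 - Pe_c)).sum()
    return numerator / denominator
```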

3. Computational Procedure and Software Implementations

The canonical computational pipeline is implemented in the Inter-Rater Python package (Arenas, 2018), which accepts:

  • An N × n data matrix (subjects × raters) of labels,
  • Explicit category definitions,
  • Missing or out-of-list ratings, which are treated as abstentions.

Main computational steps:

  1. Build the N × k count matrix {n_{ij}}.
  2. Compute per-subject P_i, aggregate to \bar{P}, and compute p_j and \bar{P}_e, yielding κ.
  3. Estimate the confidence interval of κ using the variance approximation from Fleiss et al. (1979).
  4. For rater-specific diagnostics: permute all rater pairs, compute pairwise Cohen’s κ values, and average them to obtain per-rater reliability scores (“permuted-κ”); see the sketch below.

The overall cost is O(N n^2), enabling practical analyses for hundreds or thousands of subjects and tens of raters.
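
Step 4 can be illustrated with a short sketch. This is not the Inter-Rater package’s own code; it assumes complete label vectors for every rater and uses the standard unweighted Cohen’s κ for each pair.

```python
from itertools import combinations
import numpy as np

def cohen_kappa(a, b, categories):
    """Unweighted Cohen's kappa between two raters' label vectors."""
    a, b = np.asarray(a), np.asarray(b)
    p_o = np.mean(a == b)
    p_e = sum(np.mean(a == c) * np.mean(b == c) for c in categories)
    return (p_o - p_e) / (1 - p_e)

def permuted_kappa(labels, categories):
    """Per-rater reliability: average of each rater's pairwise Cohen's kappas.

    labels[i, j] = category assigned by rater j to subject i (N x n array).
    Returns a length-n array of per-rater scores.
    """
    labels = np.asarray(labels)
    n = labels.shape[1]
    totals = np.zeros(n)
    for j1, j2 in combinations(range(n), 2):
        kap = cohen_kappa(labels[:, j1], labels[:, j2], categories)
        totals[j1] += kap
        totals[j2] += kap
    return totals / (n - 1)  # each rater appears in n - 1 pairs
```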

4. Interpretation and Benchmarks

Interpretation typically follows the scale of Landis and Koch (1977):

κ range       Interpretation
< 0.00        Poor agreement
0.00–0.20     Slight agreement
0.21–0.40     Fair agreement
0.41–0.60     Moderate agreement
0.61–0.80     Substantial agreement
0.81–1.00     Almost perfect agreement

These labels are descriptive conventions and may be context-dependent. For instance, a numerical example with N = 3, n = 4, and k = 3 resulting in κ ≈ 0.23 would be interpreted as “fair” agreement (Arenas, 2018).
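
Where reporting needs to be automated, the benchmark table can be encoded as a simple lookup. The helper below is a convenience sketch of the Landis and Koch convention, not part of any cited software.

```python
def landis_koch_label(kappa):
    """Map a kappa value to the Landis and Koch (1977) descriptive label."""
    if kappa < 0.0:
        return "poor"
    for upper, label in [(0.20, "slight"), (0.40, "fair"), (0.60, "moderate"),
                         (0.80, "substantial"), (1.00, "almost perfect")]:
        if kappa <= upper:
            return label
    return "almost perfect"

print(landis_koch_label(0.23))  # "fair", matching the worked example above
```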

5. Practical Applications and Example Scenarios

Fleiss’ κ and its generalizations are applied in domains requiring robust quantification of multi-annotator reliability:

  • Systematic reviews involving multiple reviewers classifying abstracts,
  • Medical diagnostics (e.g., inter-pathologist/radiologist reliability),
  • Educational assessment with multiple scorers,
  • Psychiatric diagnosis data where raters select multiple disorders per case,
  • Item-level grading with possible partial credit and hierarchical feedback structures.

Explicit worked examples include checkbox-based mathematics grading with weights and hierarchy, and DSM-III psychiatric diagnoses with variable rater counts per case (Moons et al., 2023). The general approach handles missing data by varying the denominator per subject or category and supports rater-specific reliability diagnostics (Arenas, 2018; Moons et al., 2023).
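
As one concrete reading of the “varying denominator” idea for incomplete designs, the sketch below lets the number of raters differ per subject. It is a simplified illustration under that assumption, not the exact estimator used by either cited paper.

```python
import numpy as np

def fleiss_kappa_incomplete(counts):
    """Fleiss'-style kappa when the number of raters varies per subject.

    counts[i, j] = raters assigning subject i to category j; row sums n_i may
    differ because abstentions or missing ratings are simply left out.
    """
    counts = np.asarray(counts, dtype=float)
    n_i = counts.sum(axis=1)            # raters who actually rated subject i

    keep = n_i >= 2                     # agreement needs at least two raters
    counts, n_i = counts[keep], n_i[keep]

    P_i = (counts * (counts - 1)).sum(axis=1) / (n_i * (n_i - 1))
    P_bar = P_i.mean()

    p_j = counts.sum(axis=0) / n_i.sum()   # marginals over all cast ratings
    P_e = (p_j ** 2).sum()
    return (P_bar - P_e) / (1 - P_e)
```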

6. Algorithmic Innovations and Limitations

Algorithmic enhancements present in current software include:

  • Handling of incomplete designs (missing data, abstentions, or per-category eligibility),
  • Retention of per-rater information via exhaustive pairwise permutations,
  • Publication-ready visualizations (group agreement, per-pair, and per-user),
  • Accommodation of hierarchical categories and context-driven weighting.

A commonly identified limitation is that the classical form of Fleiss’ κ assumes exclusivity of categories and requires complete ratings. Generalizations address this by enabling (a) multi-label assignment, (b) weighted and hierarchical categories, and (c) variable rater eligibility per subject-category pair (Moons et al., 2023).

References

  • "Inter-Rater: Software for analysis of inter-rater reliability by permutating pairs of multiple users" (Arenas, 2018)
  • "Measuring agreement among several raters classifying subjects into one-or-more (hierarchical) nominal categories. A generalisation of Fleiss' kappa" (Moons et al., 2023)
