Fleiss’ Kappa (κ): A Reliability Metric
- Fleiss’ kappa is a chance-corrected coefficient that measures inter-rater agreement when multiple raters classify subjects into nominal categories.
- It computes agreement by comparing the mean observed agreement with the expected chance agreement derived from marginal probabilities.
- Widely applied in systematic reviews, medical diagnostics, and educational assessments, it supports extensions for multi-label and hierarchical analyses.
Fleiss’ kappa ($\kappa$) is a chance-corrected coefficient quantifying inter-rater agreement for categorical items when each of $n$ raters classifies each of $N$ subjects independently into one of $k$ nominal categories. Unlike pairwise measures such as Cohen’s $\kappa$, Fleiss’ $\kappa$ generalizes agreement assessment to cases with more than two raters. Extensions further allow the analysis of scenarios in which raters can select multiple, possibly hierarchical, categories per subject. Its mathematical rigor and extensibility make $\kappa$ a standard reliability metric in systematic reviews, medical diagnostics, educational testing, and other multi-rater domains (Arenas, 2018; Moons et al., 2023).
1. Mathematical Foundations and Formal Definition
Let $N$ denote the number of subjects, $n$ the number of raters, and $k$ the number of categories. Denote by $n_{ij}$ the number of raters who assigned subject $i$ to category $j$, for $i = 1, \dots, N$ and $j = 1, \dots, k$. Fleiss’ $\kappa$ is given by:

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$

where:
- $\bar{P}$: mean observed agreement across subjects,
- $\bar{P}_e$: mean agreement expected by chance.

Observed agreement for subject $i$:

$$P_i = \frac{1}{n(n-1)} \sum_{j=1}^{k} n_{ij}(n_{ij} - 1)$$

Mean observed agreement:

$$\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$$

Marginal probability of category $j$:

$$p_j = \frac{1}{Nn} \sum_{i=1}^{N} n_{ij}$$

Chance agreement:

$$\bar{P}_e = \sum_{j=1}^{k} p_j^2$$

Inserting these into the formula yields the classical Fleiss’ $\kappa$ (Arenas, 2018; Moons et al., 2023).
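To make the formulas concrete, here is a minimal NumPy sketch of the classical computation. The function name `fleiss_kappa` and its layout are illustrative, not the Inter-Rater package’s API; the example count matrix is the widely reproduced 10-subject, 14-rater, 5-category illustration from Fleiss (1971), for which $\kappa \approx 0.21$.

```python
import numpy as np

def fleiss_kappa(counts: np.ndarray) -> float:
    """Classical Fleiss' kappa from an (N subjects x k categories) count
    matrix, where counts[i, j] = n_ij, the number of raters assigning
    subject i to category j. Assumes every row sums to the same n."""
    counts = np.asarray(counts, dtype=float)
    N, k = counts.shape
    n = counts[0].sum()                        # raters per subject

    # P_i = sum_j n_ij (n_ij - 1) / (n (n - 1)): observed agreement per subject
    P_i = (counts * (counts - 1)).sum(axis=1) / (n * (n - 1))
    P_bar = P_i.mean()                         # mean observed agreement

    p_j = counts.sum(axis=0) / (N * n)         # marginal category probabilities
    P_e = (p_j ** 2).sum()                     # chance agreement

    return (P_bar - P_e) / (1 - P_e)

# Worked example from Fleiss (1971): 10 subjects, 14 raters, 5 categories.
example = np.array([
    [0, 0, 0, 0, 14], [0, 2, 6, 4, 2], [0, 0, 3, 5, 6], [0, 3, 9, 2, 0],
    [2, 2, 8, 1, 1], [7, 7, 0, 0, 0], [3, 2, 6, 3, 0], [2, 5, 3, 2, 2],
    [6, 5, 2, 1, 0], [0, 2, 2, 3, 7],
])
print(round(fleiss_kappa(example), 3))         # 0.21
```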
2. Extensions: Generalised and Hierarchical Fleiss’ κ
Classical Fleiss’ $\kappa$ requires each rater to assign each subject to exactly one category. In research contexts where raters may select multiple (non-exclusive) categories per subject, or where categories carry a hierarchy or weights, a generalized $\kappa$ is used (Moons et al., 2023). Key definitions:
- $y_{ijr}$: indicator that rater $r$ assigns subject $i$ to category $j$.
- $s_{ij} = \sum_{r} y_{ijr}$: total “yes” votes for subject $i$ and category $j$.
- $w_j$: context-dependent weights per category (e.g., score magnitude).
- $m_{ij}$: number of raters eligible to select category $j$ on subject $i$ (for hierarchical designs).
Agreement per category per subject, treating category $j$ as a binary select/not-select decision among the $m_{ij}$ eligible raters:

$$P_{ij} = \frac{s_{ij}(s_{ij} - 1) + (m_{ij} - s_{ij})(m_{ij} - s_{ij} - 1)}{m_{ij}(m_{ij} - 1)}$$

Aggregated observed agreement for category $j$:

$$\bar{P}_j = \frac{1}{N_j} \sum_{i:\, m_{ij} \ge 2} P_{ij}, \qquad N_j = \#\{i : m_{ij} \ge 2\}$$

Expected agreement per category, with $p_j = \sum_i s_{ij} / \sum_i m_{ij}$ the marginal selection rate:

$$\bar{P}_{e,j} = p_j^2 + (1 - p_j)^2$$

Generalized $\kappa$ for the hierarchical, weighted case:

$$\kappa = \frac{\sum_{j} w_j c_j \left( \bar{P}_j - \bar{P}_{e,j} \right)}{\sum_{j} w_j c_j \left( 1 - \bar{P}_{e,j} \right)}$$

with $c_j$ a scaling factor ensuring that main categories dominate deep subcategories when eligible rater counts differ.
When categories are exclusive and all raters assess all subjects, the generalization reduces to the classical Fleiss’ $\kappa$ (Moons et al., 2023).
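The sketch below implements one natural reading of these definitions: each category is treated as a binary select/not-select vote among its eligible raters, and per-category agreements are combined with the weights $w_j$. The names are mine, the scaling factor $c_j$ is omitted (implicitly set to 1), and the exact aggregation in Moons et al. (2023) may differ in detail.

```python
import numpy as np

def generalized_kappa(selected, eligible, weights=None):
    """Multi-label, per-category kappa sketch.

    selected[i, j] -- s_ij: eligible raters who selected category j on subject i
    eligible[i, j] -- m_ij: raters eligible to rate category j on subject i
    weights[j]     -- w_j: optional category weights (uniform by default)
    """
    s = np.asarray(selected, dtype=float)
    m = np.asarray(eligible, dtype=float)
    N, k = s.shape
    w = np.ones(k) if weights is None else np.asarray(weights, dtype=float)

    obs = np.zeros(k)  # observed agreement per category
    exp = np.zeros(k)  # chance agreement per category
    for j in range(k):
        ok = m[:, j] >= 2                  # agreeing pairs need >= 2 raters
        sj, mj = s[ok, j], m[ok, j]
        # pairwise agreement on the binary "selected" decision
        P_ij = (sj * (sj - 1) + (mj - sj) * (mj - sj - 1)) / (mj * (mj - 1))
        obs[j] = P_ij.mean()
        p = sj.sum() / mj.sum()            # marginal selection rate p_j
        exp[j] = p ** 2 + (1 - p) ** 2

    return float((w @ (obs - exp)) / (w @ (1 - exp)))
```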
3. Computational Procedure and Software Implementations
The canonical computational pipeline is implemented in the Inter-Rater Python package (Arenas, 2018), which accepts:
- An $N \times n$ data matrix (subjects × raters) of labels,
- Explicit category definitions,
- Missing or out-of-list ratings, which are treated as abstentions.
Main computational steps:
- Build the count matrix $\{n_{ij}\}$.
- Compute the per-subject $P_i$, aggregate to $\bar{P}$, and compute $p_j$ and $\bar{P}_e$, yielding $\kappa$.
- Estimate $\kappa$’s confidence interval using the variance approximation from Fleiss et al. (1979).
- For rater-specific diagnostics: permute all rater pairs, compute pairwise Cohen’s $\kappa$s, and average to obtain per-rater reliability scores (“permuted-κ”; see the sketch below).
Building the count matrix and computing $\kappa$ costs $O(N(n + k))$, and the exhaustive pairwise diagnostics add $O(n^2 N)$, enabling practical analyses for hundreds or thousands of subjects and tens of raters.
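As an illustration of the permuted-κ diagnostic, the sketch below averages each rater’s pairwise Cohen’s $\kappa$ against every other rater, using scikit-learn’s `cohen_kappa_score`; the function name and return format are mine, not the Inter-Rater package’s actual interface.

```python
from collections import defaultdict
from itertools import combinations

import numpy as np
from sklearn.metrics import cohen_kappa_score

def per_rater_kappa(labels):
    """Average pairwise Cohen's kappa per rater ("permuted-kappa").

    labels -- (N subjects x n raters) array of categorical labels
    """
    labels = np.asarray(labels)
    n_raters = labels.shape[1]
    scores = defaultdict(list)
    for a, b in combinations(range(n_raters), 2):   # every rater pair once
        k = cohen_kappa_score(labels[:, a], labels[:, b])
        scores[a].append(k)
        scores[b].append(k)
    # a rater who agrees poorly with all others stands out in this summary
    return {r: float(np.mean(scores[r])) for r in range(n_raters)}
```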
4. Interpretation and Benchmarks
Interpretation typically follows the scale of Landis and Koch (1977):
| $\kappa$ range | Interpretation |
|---|---|
| < 0.00 | Poor agreement |
| 0.00–0.20 | Slight agreement |
| 0.21–0.40 | Fair agreement |
| 0.41–0.60 | Moderate agreement |
| 0.61–0.80 | Substantial agreement |
| 0.81–1.00 | Almost perfect agreement |
These labels are descriptive conventions and remain context-dependent. For instance, the numerical example in Arenas (2018) yields a $\kappa$ in the 0.21–0.40 band and is accordingly interpreted as “fair” agreement.
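Where the banding is applied programmatically, a small helper suffices; the function below is a hypothetical convenience, not part of either cited package.

```python
def landis_koch_label(kappa: float) -> str:
    """Map a kappa value to its conventional Landis-Koch descriptor."""
    if kappa < 0.00:
        return "poor"
    if kappa <= 0.20:
        return "slight"
    if kappa <= 0.40:
        return "fair"
    if kappa <= 0.60:
        return "moderate"
    if kappa <= 0.80:
        return "substantial"
    return "almost perfect"

print(landis_koch_label(0.21))  # fair
```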
5. Practical Applications and Example Scenarios
Fleiss’ $\kappa$ and its generalizations are applied in domains requiring robust quantification of multi-annotator reliability:
- Systematic reviews involving multiple reviewers classifying abstracts,
- Medical diagnostics (e.g., inter-pathologist/radiologist reliability),
- Educational assessment with multiple scorers,
- Psychiatric diagnosis data where raters select multiple disorders per case,
- Item-level grading with possible partial credit and hierarchical feedback structures.
Explicit worked examples include checkbox-based mathematics grading with weights and hierarchy, and DSM-III psychiatric diagnoses with variable rater counts per case (Moons et al., 2023). The general approach handles missing data by varying the denominator per subject or category and supports rater-specific reliability diagnostics (Arenas, 2018; Moons et al., 2023).
6. Algorithmic Innovations and Limitations
Algorithmic enhancements present in current software include:
- Handling of incomplete designs (missing data, abstentions, or per-category eligibility),
- Retention of per-rater information via exhaustive pairwise permutations,
- Publication-ready visualizations (group agreement, per-pair, and per-user),
- Accommodation of hierarchical categories and context-driven weighting.
A commonly identified limitation is that the classical form of Fleiss’ $\kappa$ assumes exclusivity of categories and requires complete ratings. Generalizations address this by enabling (a) multi-label assignment, (b) weighted and hierarchical categories, and (c) variable rater eligibility per subject-category pair (Moons et al., 2023).
References
- "Inter-Rater: Software for analysis of inter-rater reliability by permutating pairs of multiple users" (Arenas, 2018)
- "Measuring agreement among several raters classifying subjects into one-or-more (hierarchical) nominal categories. A generalisation of Fleiss' kappa" (Moons et al., 2023)