Fleiss’ Kappa: Multi-Rater Agreement
- Fleiss’ kappa is a statistical measure that quantifies inter-rater agreement for categorical data while correcting for chance.
- It extends Cohen’s kappa to multiple raters by calculating observed and expected agreements through a defined computational framework.
- Widely applied in fields like medicine, education, and information retrieval, it highlights issues such as category imbalance and the prevalence paradox.
Fleiss’ kappa is a statistical measure designed to quantify the degree of agreement among multiple independent raters assigning categorical labels to a set of subjects, correcting for agreement that could be expected by chance. Introduced by J.L. Fleiss in 1971, it generalizes Cohen’s kappa—which addresses only two raters—to accommodate any fixed number of raters (J ≥ 2) and any number of unordered nominal categories. The coefficient has become a standard in diverse fields wherever robust quantification of inter-rater reliability is required, including medicine, information retrieval, and education (Schaer, 2012, Arenas, 2018, Moons et al., 2023).
1. Formal Definition and Derivation
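Let $N$ denote the number of subjects, $J$ the (fixed) number of raters per subject, and $K$ the number of mutually exclusive nominal categories. For subject $i$, $n_{ik}$ denotes the number of raters who assigned subject $i$ to category $k$, with $\sum_{k=1}^{K} n_{ik} = J$ for each $i$. The classical Fleiss' kappa proceeds via the following sequence (Moons et al., 2023, Arenas, 2018, Schaer, 2012):
- Observed agreement for subject $i$:
$$P_i = \frac{1}{J(J-1)} \sum_{k=1}^{K} n_{ik}\,(n_{ik} - 1)$$
- Mean observed agreement:
$$\bar{P} = \frac{1}{N} \sum_{i=1}^{N} P_i$$
- Category marginals: $p_k = \frac{1}{NJ} \sum_{i=1}^{N} n_{ik}$.
- Expected agreement by chance:
$$\bar{P}_e = \sum_{k=1}^{K} p_k^2$$
- Fleiss’ kappa:
$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$
$\kappa = 1$ corresponds to perfect agreement, $\kappa = 0$ indicates agreement no better than chance, and $\kappa < 0$ denotes systematic disagreement relative to random assignment.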
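As a small worked illustration (hypothetical toy data, not taken from the cited studies), consider $N = 4$ subjects, $J = 3$ raters, and $K = 2$ categories, where the per-subject counts of raters choosing category 1 and category 2 are $(3, 0)$, $(2, 1)$, $(0, 3)$, and $(1, 2)$. Each subject's agreement is the number of agreeing rater pairs divided by $J(J-1) = 6$, and each category receives 6 of the 12 ratings:

$$
\begin{aligned}
P_1 &= P_3 = \frac{3 \cdot 2 + 0}{3 \cdot 2} = 1, \qquad P_2 = P_4 = \frac{2 \cdot 1 + 1 \cdot 0}{3 \cdot 2} = \frac{1}{3}, \\
\bar{P} &= \frac{1}{4}\left(1 + \tfrac{1}{3} + 1 + \tfrac{1}{3}\right) = \frac{2}{3}, \\
p_1 &= p_2 = \frac{6}{12} = \frac{1}{2}, \qquad \bar{P}_e = \left(\tfrac{1}{2}\right)^2 + \left(\tfrac{1}{2}\right)^2 = \frac{1}{2}, \\
\kappa &= \frac{2/3 - 1/2}{1 - 1/2} = \frac{1}{3}.
\end{aligned}
$$

Two unanimous subjects and two 2-to-1 splits therefore give agreement that is above chance, but only modestly so.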
2. Interpretation, Benchmarks, and Limitations
Although $\kappa$ is a continuous metric spanning $[-1, 1]$, several interpretive scales are widely employed. The Landis & Koch (1977) benchmark is commonly cited:
| $\kappa$ range | Interpretation |
|---|---|
| < 0.00 | poor |
| 0.00–0.20 | slight |
| 0.21–0.40 | fair |
| 0.41–0.60 | moderate |
| 0.61–0.80 | substantial |
| 0.81–1.00 | almost perfect |
Alternative standards are applied in field-specific contexts; for instance, Greve & Wentura (1997) suggest that $\kappa < 0.4$ should "not be taken too seriously" in information retrieval, while $\kappa \geq 0.75$ is "good to excellent" (Schaer, 2012).
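When reporting results, the Landis & Koch bands can be applied mechanically. The helper below is a minimal sketch; the function name `landis_koch_label` is hypothetical, and treating each band's upper bound as inclusive is an assumption, since the published bands leave small gaps (for example between 0.20 and 0.21).

```python
def landis_koch_label(kappa):
    """Map a kappa value to its Landis & Koch (1977) verbal label."""
    if kappa < 0.0:
        return "poor"
    # Assumption: upper bounds are inclusive, so a value such as 0.205
    # (between two published bands) falls into the higher band.
    for upper, label in (
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
    ):
        if kappa <= upper:
            return label
    return "almost perfect"


print(landis_koch_label(0.33))  # fair
```

Such labels are a reporting convenience only; they do not replace the field-specific judgment discussed above.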
Limitations:
- Assumes each subject is rated by the same number of raters and that each rater assigns exactly one category per subject.
- Sensitive to the marginal frequencies of categories: in highly imbalanced settings, the chance-agreement term $\bar{P}_e$ increases and $\kappa$ can be low even when percent agreement is high (the "prevalence paradox").
- All raters are treated as interchangeable; individual-level rater effects are ignored.
- Unable to handle missing data or variable numbers of raters per subject without further adaptation (Schaer, 2012, Moons et al., 2023).
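The prevalence effect in the list above can be seen on two hypothetical count tables that share the same observed agreement but differ in category balance. This is an illustrative sketch; the function name `agreement_summary` and both tables are made up for the example.

```python
def agreement_summary(counts):
    """Return (mean observed agreement, chance agreement, kappa) for a count table.

    counts[i][k] is the number of raters who put subject i in category k.
    """
    n_subjects, n_raters = len(counts), sum(counts[0])
    p_bar = sum(
        sum(c * (c - 1) for c in row) / (n_raters * (n_raters - 1)) for row in counts
    ) / n_subjects
    p_e = sum((sum(col) / (n_subjects * n_raters)) ** 2 for col in zip(*counts))
    return p_bar, p_e, (p_bar - p_e) / (1 - p_e)


# Hypothetical data: 20 subjects, 3 raters, 2 categories.
# Both tables contain 18 unanimous subjects and 2 subjects with a 2-to-1 split.
imbalanced = [[3, 0]] * 18 + [[2, 1]] * 2                 # almost everything in category 0
balanced = [[3, 0]] * 10 + [[0, 3]] * 8 + [[2, 1]] * 2    # both categories in use

for name, table in (("imbalanced", imbalanced), ("balanced", balanced)):
    p_bar, p_e, kappa = agreement_summary(table)
    print(f"{name}: observed={p_bar:.3f} chance={p_e:.3f} kappa={kappa:.3f}")
```

Both tables have a mean observed agreement of about 0.93, yet the imbalanced table yields a kappa of about -0.03 while the balanced one yields about 0.86, because chance agreement is about 0.94 in the first case and about 0.51 in the second.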
3. Computational Recipe and Algorithmic Implementation
The stepwise calculation is as follows (Schaer, 2012, Arenas, 2018):
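- For each subject $i$ and category $k$, tally $n_{ik}$, the number of raters who assigned subject $i$ to category $k$.
- Compute the per-subject agreement $P_i$ for all $i$ as above.
- Average the $P_i$ to get the mean observed agreement $\bar{P}$.
- Derive category marginals $p_k$ across all ratings.
- Compute the chance agreement $\bar{P}_e$.
- Combine to obtain $\kappa$.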
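The steps above can be sketched in plain Python as follows. This is a minimal illustration rather than the implementation used by any of the cited tools; the function name `fleiss_kappa`, the input layout (one row of category counts per subject), and the toy data are assumptions made for the example.

```python
def fleiss_kappa(counts):
    """Fleiss' kappa from a subjects-by-categories table of rating counts.

    counts[i][k] is the number of raters who put subject i in category k.
    """
    if not counts:
        raise ValueError("need at least one subject")
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    if n_raters < 2:
        raise ValueError("need at least two raters per subject")
    if any(sum(row) != n_raters for row in counts):
        raise ValueError("every subject must have the same number of ratings")

    # Per-subject agreement: agreeing rater pairs over all rater pairs.
    pair_norm = n_raters * (n_raters - 1)
    p_i = [sum(c * (c - 1) for c in row) / pair_norm for row in counts]

    # Mean observed agreement.
    p_bar = sum(p_i) / n_subjects

    # Category marginals: share of all ratings that fall in each category.
    n_ratings = n_subjects * n_raters
    p_k = [sum(col) / n_ratings for col in zip(*counts)]

    # Agreement expected by chance.
    p_e = sum(p * p for p in p_k)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")

    # Chance-corrected agreement.
    return (p_bar - p_e) / (1.0 - p_e)


toy_counts = [[3, 0], [2, 1], [0, 3], [1, 2]]  # hypothetical: 4 subjects, 3 raters
print(round(fleiss_kappa(toy_counts), 3))  # 0.333
```

The explicit checks reflect the assumptions listed under the limitations: a fixed number of raters per subject and exactly one category per rating.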
Algorithmic implementations in research and software such as Inter-Rater use efficient vectorized matrix operations on the $N \times K$ count matrix, handle large datasets (typical complexity $O(NK)$), and produce confidence intervals using large-sample variance estimates (Arenas, 2018). These implementations can support auxiliary functions, such as permuted pairwise Cohen's kappa for per-rater analysis and visualization.
4. Extensions: Multiple, Weighted, and Hierarchical Categories
The classical Fleiss’ kappa is restrictive in contexts where raters may assign multiple (possibly overlapping) categories, or where categories are structured hierarchically or weighted by importance. A generalized kappa has been developed to address these needs (Moons et al., 2023).
- Multi-label scenarios: Raters can select any subset of the $K$ categories for each subject.
- Category weighting: Each category $k$ may be assigned a weight $w_k$ (e.g., reflecting pedagogical or diagnostic significance).
- Hierarchical dependencies: Categories may be gated by parent categories (e.g., subtypes only accessible through main types).
The generalization computes, for each category, a binary "selected vs. not-selected" kappa, aggregates these with the category weights, and modifies the denominators to account for hierarchy and availability, including a scaling adjustment for rarely available categories.
Under the assumptions of exclusivity, flat weighting, and no hierarchy, the generalized kappa reduces algebraically to the classical $\kappa$ (Moons et al., 2023).
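To make the per-category idea concrete, the sketch below computes a binary "selected vs. not-selected" Fleiss' kappa for each category and combines them as a weighted mean. It is a simplified illustration only and is not the estimator of Moons et al. (2023), which additionally adjusts denominators for hierarchy and availability; the function names, the weighted-mean aggregation, and the data are assumptions.

```python
def binary_fleiss_kappa(selected, n_raters):
    """Fleiss' kappa for one category treated as selected vs. not selected.

    selected[i] is how many of the n_raters chose the category for subject i.
    Returns None when the category is always or never chosen (kappa undefined).
    """
    n_subjects = len(selected)
    agreeing_pairs = sum(
        s * (s - 1) + (n_raters - s) * (n_raters - s - 1) for s in selected
    )
    p_bar = agreeing_pairs / (n_subjects * n_raters * (n_raters - 1))
    p_sel = sum(selected) / (n_subjects * n_raters)
    p_e = p_sel ** 2 + (1.0 - p_sel) ** 2
    if p_e == 1.0:
        return None
    return (p_bar - p_e) / (1.0 - p_e)


def weighted_category_kappa(selections, weights, n_raters):
    """Weighted mean of per-category binary kappas (illustrative aggregate)."""
    num = den = 0.0
    for selected, weight in zip(selections, weights):
        kappa = binary_fleiss_kappa(selected, n_raters)
        if kappa is None:
            continue  # skip categories with no variation in selection
        num += weight * kappa
        den += weight
    if den == 0.0:
        raise ValueError("no category has a defined kappa")
    return num / den


# Hypothetical multi-label data: 4 subjects, 3 raters, 2 categories.
# selections[k][i] = number of raters who selected category k for subject i.
selections = [[3, 0, 3, 0], [3, 2, 0, 1]]
weights = [2.0, 1.0]  # hypothetical importance weights
print(round(weighted_category_kappa(selections, weights, n_raters=3), 3))  # 0.778
```

Here the first category is rated with perfect agreement (kappa of 1) and the second with a kappa of 1/3, so the 2:1 weighting gives 7/9.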
5. Statistical Properties, Confidence Intervals, and Practical Examples
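- $\kappa$ and its generalizations are bounded above by $1$ (perfect agreement) and nominally below by $-1$ (complete systematic disagreement); for the classical coefficient with $J$ raters the attainable minimum is $-1/(J-1)$, so $-1$ can be reached only with two raters. Negative values are rare in practice.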
- Confidence intervals can be computed using large-sample approximations (Arenas, 2018).
- Interpretation must always account for category prevalence, task complexity, number of categories, and the possibility of missing data.
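As an alternative to the large-sample approximation mentioned above, an interval can also be approximated by resampling subjects. The sketch below uses a percentile bootstrap, which is a different technique from the analytic variance estimate and is shown only for illustration; the function names, defaults, and data are assumptions.

```python
import random


def fleiss_kappa(counts):
    """Fleiss' kappa for a subjects-by-categories table of rating counts."""
    n_subjects, n_raters = len(counts), sum(counts[0])
    p_bar = sum(
        sum(c * (c - 1) for c in row) for row in counts
    ) / (n_subjects * n_raters * (n_raters - 1))
    p_e = sum((sum(col) / (n_subjects * n_raters)) ** 2 for col in zip(*counts))
    if p_e == 1.0:
        raise ValueError("kappa is undefined when only one category is used")
    return (p_bar - p_e) / (1.0 - p_e)


def bootstrap_kappa_ci(counts, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for kappa, resampling subjects with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_subjects = len(counts)
    estimates = []
    for _ in range(n_boot):
        resample = [counts[rng.randrange(n_subjects)] for _ in range(n_subjects)]
        try:
            estimates.append(fleiss_kappa(resample))
        except ValueError:
            continue  # degenerate resample: only one category present
    if not estimates:
        raise ValueError("no usable bootstrap resamples")
    estimates.sort()
    lower = estimates[int((1.0 - level) / 2.0 * len(estimates))]
    upper = estimates[int((1.0 + level) / 2.0 * len(estimates)) - 1]
    return lower, upper


# Hypothetical data: 12 subjects, 4 raters, 3 categories.
ratings = [
    [4, 0, 0], [0, 4, 0], [3, 1, 0], [0, 0, 4], [2, 2, 0], [0, 3, 1],
    [4, 0, 0], [1, 0, 3], [0, 4, 0], [0, 1, 3], [3, 0, 1], [0, 0, 4],
]
print("kappa:", round(fleiss_kappa(ratings), 3))
print("95% interval:", bootstrap_kappa_ci(ratings))
```

With few subjects the interval is typically wide, which is one reason to report it alongside the point estimate.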
Empirical illustration: In an information retrieval study with LIS students performing binary relevance assessments, the average Fleiss' $\kappa$ fell in the fair-agreement range, while the average Krippendorff's $\alpha$ was very low. Filtering assessments on $\kappa$ or $\alpha$ thresholds improved the reliability of downstream system evaluation metrics (Schaer, 2012).
Hierarchical/multi-label case example: In a clinical task where psychiatrists rate multiple DSM-III categories per subject, the generalized kappa yielded a low overall inter-rater reliability, indicating limited but nonzero agreement beyond chance (Moons et al., 2023).
6. Comparative Metrics and Current Applications
Comparison with Krippendorff’s $\alpha$: Both Fleiss’ kappa and Krippendorff's $\alpha$ are chance-corrected agreement measures. Krippendorff's $\alpha$ is more general, handling missing data, variable numbers of raters per subject, and non-nominal scales. In practice, $\alpha$ tends to be more conservative (often lower) than $\kappa$ and is advised for complex coding or whenever data are missing (Schaer, 2012).
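The way Krippendorff's $\alpha$ accommodates missing ratings can be sketched for the nominal case: each subject contributes its pairable ratings to a coincidence count, and subjects with fewer than two ratings are dropped. The function name and data below are hypothetical, and only the nominal-scale variant is shown.

```python
from collections import Counter


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels.

    units is a list of per-subject label lists; None marks a missing rating,
    so subjects may have different numbers of ratings.
    """
    matched = 0.0       # diagonal of the coincidence matrix (agreeing pairs)
    totals = Counter()  # pairable ratings per category
    for labels in units:
        present = [x for x in labels if x is not None]
        m = len(present)
        if m < 2:
            continue  # a single rating cannot be paired with anything
        for category, count in Counter(present).items():
            matched += count * (count - 1) / (m - 1)
            totals[category] += count
    n = sum(totals.values())
    denom = n * n - sum(v * v for v in totals.values())
    if denom == 0:
        raise ValueError("alpha is undefined: fewer than two categories observed")
    # alpha = 1 - observed disagreement / expected disagreement
    return 1.0 - (n - 1) * (n - matched) / denom


# Hypothetical data: the last subject has only one rating and is ignored.
units = [["a", "a"], ["a", "b"], ["b", "b"], ["b", None]]
print(round(krippendorff_alpha_nominal(units), 3))  # 0.444
```

No fixed number of raters per subject is required here, in contrast to the classical Fleiss' kappa.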
Software tools: Inter-Rater, an open-source Python program, computes Fleiss’ $\kappa$ and pairwise Cohen’s $\kappa$, and produces detailed rater-level diagnostics. The permutation-based approach quickly identifies raters who diverge from group consensus and supports visualization and workflows for systematic reviews, medical diagnosis, and educational assessment (Arenas, 2018).
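The idea of flagging divergent raters through pairwise Cohen's kappa can be sketched as follows. This is not Inter-Rater's code or API; the function names, the use of a simple mean over all rater pairs, and the data are assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(a, b):
    """Cohen's kappa for two raters' label lists over the same subjects."""
    n = len(a)
    p_o = sum(x == y for x, y in zip(a, b)) / n
    freq_a, freq_b = Counter(a), Counter(b)
    p_e = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_e == 1.0:
        raise ValueError("kappa is undefined when both raters use the same single category")
    return (p_o - p_e) / (1.0 - p_e)


def mean_pairwise_kappa(ratings):
    """Average each rater's Cohen's kappa against every other rater.

    ratings maps a rater name to that rater's labels; needs at least two raters.
    """
    totals = {rater: 0.0 for rater in ratings}
    for r1, r2 in combinations(ratings, 2):
        k = cohen_kappa(ratings[r1], ratings[r2])
        totals[r1] += k
        totals[r2] += k
    return {rater: total / (len(ratings) - 1) for rater, total in totals.items()}


# Hypothetical labels from three raters on six subjects.
ratings = {
    "A": [0, 0, 1, 1, 0, 1],
    "B": [0, 0, 1, 1, 0, 1],
    "C": [1, 0, 0, 1, 1, 0],
}
scores = mean_pairwise_kappa(ratings)
print(min(scores, key=scores.get))  # C
```

Rater C has the lowest mean pairwise kappa and would be the one to examine first in this toy example.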
Application domains: Fleiss’ kappa—and its extensions—see broad usage in:
- Medical diagnosis reliability studies.
- Evaluation of information retrieval system relevance judgments.
- Educational settings for rubric-based grading calibration.
- Any systematic review that pools judgments from multiple expert assessors (Arenas, 2018, Moons et al., 2023, Schaer, 2012).
7. Interpretation Caveats and Recommendations
- $\kappa$ coefficients must be interpreted in context: high agreement on a single dominant category inflates chance agreement, potentially deflating $\kappa$ even when raw agreement is high, a manifestation of the "prevalence paradox" (Moons et al., 2023).
- Arbitrary $\kappa$ thresholds for “acceptable” reliability lack universal justification; reporting confidence intervals and prevalence statistics is advised.
- Filtering low-agreement assessors or task instances can materially impact downstream analyses, confirming the necessity of transparent inter-rater agreement reporting and, where appropriate, data curation for reliability (Schaer, 2012).
A plausible implication is that, while Fleiss’ kappa remains the default for nominal, fully crossed multi-rater designs, researchers confronting multi-label, hierarchical, or incomplete datasets should consider the generalized kappa or Krippendorff’s $\alpha$ for a more nuanced assessment of inter-rater reliability (Moons et al., 2023, Schaer, 2012).