
Fleiss’ Kappa: Multi-Rater Agreement

Updated 30 May 2026
  • Fleiss’ kappa is a statistical measure that quantifies inter-rater agreement for categorical data while correcting for chance.
  • It extends Cohen’s kappa to multiple raters by calculating observed and expected agreements through a defined computational framework.
  • Widely applied in fields like medicine, education, and information retrieval, it highlights issues such as category imbalance and the prevalence paradox.

Fleiss’ kappa is a statistical measure designed to quantify the degree of agreement among multiple independent raters assigning categorical labels to a set of subjects, correcting for agreement that could be expected by chance. Introduced by J.L. Fleiss in 1971, it generalizes Cohen’s kappa—which addresses only two raters—to accommodate any fixed number of raters (J ≥ 2) and any number of unordered nominal categories. The coefficient has become a standard in diverse fields wherever robust quantification of inter-rater reliability is required, including medicine, information retrieval, and education (Schaer, 2012, Arenas, 2018, Moons et al., 2023).

1. Formal Definition and Derivation

Let $I$ denote the number of subjects, $J$ the (fixed) number of raters per subject, and $C$ the number of mutually exclusive nominal categories. For subject $i$, $x_{ic}$ denotes the number of raters who assigned subject $i$ to category $c$, with $\sum_{c=1}^{C} x_{ic} = J$ for each $i$. The classical Fleiss’ kappa is computed in the following sequence (Moons et al., 2023, Arenas, 2018, Schaer, 2012):

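  • Observed agreement for subject $i$:

$$P_i = \frac{1}{J(J-1)} \sum_{c=1}^{C} x_{ic}\,(x_{ic} - 1)$$

  • Mean observed agreement:

$$\bar{P} = \frac{1}{I} \sum_{i=1}^{I} P_i$$

  • Category marginals: $p_c = \frac{1}{IJ} \sum_{i=1}^{I} x_{ic}$.
  • Expected agreement by chance:

$$\bar{P}_e = \sum_{c=1}^{C} p_c^2$$

  • Fleiss’ kappa:

$$\kappa = \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}$$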

$\kappa = 1$ corresponds to perfect agreement, $\kappa = 0$ indicates agreement no better than chance, and $\kappa < 0$ denotes systematic disagreement relative to random assignment.
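As a small worked illustration of the definition, the following sketch uses hypothetical data chosen only to keep the arithmetic easy: three subjects, three raters, and two categories, with per-subject counts $(3,0)$, $(2,1)$, and $(0,3)$. Here $x_{ic}$ is the number of raters assigning subject $i$ to category $c$, $P_i$ is the per-subject agreement, $p_c$ is the share of all ratings in category $c$, and $\bar{P}_e$ is the chance agreement.

```latex
\begin{align*}
P_i &= \frac{\sum_{c} x_{ic}(x_{ic}-1)}{J(J-1)}:
  \quad P_1 = \frac{3\cdot 2 + 0}{3\cdot 2} = 1,
  \quad P_2 = \frac{2\cdot 1 + 1\cdot 0}{3\cdot 2} = \frac{1}{3},
  \quad P_3 = \frac{0 + 3\cdot 2}{3\cdot 2} = 1 \\
\bar{P} &= \frac{1}{3}\left(1 + \frac{1}{3} + 1\right) = \frac{7}{9} \\
p_1 &= \frac{3 + 2 + 0}{3\cdot 3} = \frac{5}{9},
  \qquad p_2 = \frac{0 + 1 + 3}{3\cdot 3} = \frac{4}{9} \\
\bar{P}_e &= p_1^2 + p_2^2 = \frac{25}{81} + \frac{16}{81} = \frac{41}{81} \\
\kappa &= \frac{\bar{P} - \bar{P}_e}{1 - \bar{P}_e}
  = \frac{63/81 - 41/81}{40/81} = \frac{22}{40} = 0.55
\end{align*}
```

A single split vote among nine ratings is enough to pull $\kappa$ from $1$ down to $0.55$ in this toy case, which shows how strongly small samples react to individual disagreements.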

2. Interpretation, Benchmarks, and Limitations

Although $\kappa$ is a continuous metric spanning $[-1, 1]$, several interpretive scales are widely employed. The Landis & Koch (1977) benchmark is commonly cited:

$\kappa$ range     Interpretation
$< 0$              poor
$0.00$–$0.20$      slight
$0.21$–$0.40$      fair
$0.41$–$0.60$      moderate
$0.61$–$0.80$      substantial
$0.81$–$1.00$      almost perfect
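For reporting, the benchmark can be applied mechanically. The helper below is a minimal sketch; the function name `landis_koch_label` is hypothetical, and the cut points are the standard Landis & Koch bands (below 0 poor, up to 0.20 slight, up to 0.40 fair, up to 0.60 moderate, up to 0.80 substantial, above that almost perfect).

```python
def landis_koch_label(kappa: float) -> str:
    """Map a kappa value to its Landis & Koch (1977) verbal label."""
    if not -1.0 <= kappa <= 1.0:
        raise ValueError("kappa must lie in [-1, 1]")
    if kappa < 0.0:
        return "poor"
    # Upper bound of each band, checked in ascending order.
    bands = [
        (0.20, "slight"),
        (0.40, "fair"),
        (0.60, "moderate"),
        (0.80, "substantial"),
        (1.00, "almost perfect"),
    ]
    for upper, label in bands:
        if kappa <= upper:
            return label
    return "almost perfect"


if __name__ == "__main__":
    for value in (-0.1, 0.15, 0.55, 0.9):
        print(value, landis_koch_label(value))
```

The labels are conventions rather than statistical thresholds, so the numeric value is best reported alongside any verbal label.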

Alternative standards are applied in field-specific contexts; for instance, Greve & Wentura (1997) suggest that $\kappa < 0.4$ should "not be taken too seriously" in information retrieval, while $\kappa \geq 0.75$ is "good to excellent" (Schaer, 2012).

Limitations:

  • Assumes each subject is rated by the same number of raters and that each rater assigns exactly one category per subject.
  • Sensitive to the marginal frequencies of categories: in highly imbalanced settings, the chance agreement $\bar{P}_e$ increases and $\kappa$ can be low even when percent agreement is high (the "prevalence paradox").
  • All raters are treated as interchangeable; individual-level rater effects are ignored.
  • Unable to handle missing data or variable numbers of raters per subject without further adaptation (Schaer, 2012, Moons et al., 2023).
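The prevalence paradox in the list above can be made concrete with two hypothetical datasets that have identical raw agreement but different category balance. This is a sketch under stated assumptions: the helper `fleiss_kappa_parts` and both datasets are invented for illustration, with 100 subjects, 4 raters, and 2 categories.

```python
def fleiss_kappa_parts(counts):
    """Return (kappa, mean observed agreement, chance agreement).

    counts[i][c] is the number of raters who put subject i in category c;
    every row must sum to the same number of raters J.
    """
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    n_categories = len(counts[0])
    pair_count = n_raters * (n_raters - 1)

    # Mean of the per-subject agreement values.
    p_bar = sum(
        sum(x * (x - 1) for x in row) / pair_count for row in counts
    ) / n_subjects

    # Share of all ratings falling in each category.
    total = n_subjects * n_raters
    marginals = [sum(row[c] for row in counts) / total for c in range(n_categories)]
    p_e = sum(p * p for p in marginals)
    return (p_bar - p_e) / (1.0 - p_e), p_bar, p_e


# Imbalanced: almost every rating lands in category 0.
imbalanced = [[4, 0]] * 94 + [[3, 1]] * 6
# Balanced: same number of split votes, but both categories are common.
balanced = [[4, 0]] * 50 + [[0, 4]] * 44 + [[3, 1]] * 6

if __name__ == "__main__":
    for name, data in (("imbalanced", imbalanced), ("balanced", balanced)):
        kappa, p_bar, p_e = fleiss_kappa_parts(data)
        print(f"{name}: P_bar={p_bar:.3f} P_e={p_e:.3f} kappa={kappa:.3f}")
```

Both datasets have a mean observed agreement of 0.97, yet the imbalanced one yields a kappa near zero (slightly negative) while the balanced one yields a kappa of roughly 0.94, because chance agreement is about 0.97 in the first case and about 0.50 in the second.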

3. Computational Recipe and Algorithmic Implementation

The stepwise calculation is as follows (Schaer, 2012, Arenas, 2018):

  1. For each subject $i$ and category $c$, tally the count $x_{ic}$.
  2. Compute the per-subject agreement $P_i$ for all $i$ as above.
  3. Average the $P_i$ to get the mean observed agreement $\bar{P}$.
  4. Derive category marginals $p_c$ across all ratings.
  5. Compute the chance agreement $\bar{P}_e$.
  6. Combine to obtain $\kappa$.
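The six steps above can be sketched directly in code. This is a minimal, unoptimized illustration that starts from raw labels; the function name `fleiss_kappa_from_labels` and the input layout (one list of labels per subject) are assumptions made for the example, not an interface taken from the cited tools.

```python
from collections import Counter


def fleiss_kappa_from_labels(ratings):
    """Fleiss' kappa from raw labels; ratings[i] holds the J labels for subject i."""
    if not ratings:
        raise ValueError("at least one subject is required")
    n_subjects = len(ratings)
    n_raters = len(ratings[0])
    if n_raters < 2 or any(len(row) != n_raters for row in ratings):
        raise ValueError("every subject needs the same number (>= 2) of ratings")

    # Step 1: tally x_ic, the number of raters choosing category c for subject i.
    tallies = [Counter(row) for row in ratings]
    categories = {label for row in ratings for label in row}

    # Step 2: per-subject agreement P_i (share of agreeing rater pairs).
    pair_count = n_raters * (n_raters - 1)
    per_subject = [
        sum(x * (x - 1) for x in tally.values()) / pair_count for tally in tallies
    ]

    # Step 3: mean observed agreement.
    p_bar = sum(per_subject) / n_subjects

    # Step 4: category marginals p_c across all I * J ratings.
    total = n_subjects * n_raters
    marginals = {c: sum(tally[c] for tally in tallies) / total for c in categories}

    # Step 5: chance agreement.
    p_e = sum(p * p for p in marginals.values())

    # Step 6: combine; kappa is undefined if every rating is in one category.
    if p_e == 1.0:
        raise ValueError("kappa is undefined when all ratings share one category")
    return (p_bar - p_e) / (1.0 - p_e)


if __name__ == "__main__":
    example = [["a", "a", "a"], ["a", "a", "b"], ["b", "b", "b"]]
    print(round(fleiss_kappa_from_labels(example), 4))
```

The sketch raises an error rather than returning a number when the rater counts differ between subjects, which mirrors the fixed-$J$ assumption of the classical coefficient.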

Algorithmic implementations in research and software such as Inter-Rater use vectorized matrix operations on the $I \times C$ count matrix, scale to large datasets (typical complexity $O(IC)$), and produce confidence intervals from large-sample variance estimates (Arenas, 2018). These implementations can also provide auxiliary functions, such as permuted pairwise Cohen’s kappa for per-rater analysis and visualization.
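The per-rater analysis mentioned above can be approximated by averaging Cohen's kappa over all rater pairs that include a given rater. The sketch below is an illustration of that idea only; it is not the Inter-Rater program's code, and the names `cohen_kappa` and `mean_pairwise_kappa` as well as the dictionary input format are assumptions.

```python
from collections import Counter
from itertools import combinations


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same subjects in the same order."""
    n = len(labels_a)
    if n == 0 or n != len(labels_b):
        raise ValueError("both raters must label the same non-empty subject list")
    p_obs = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the two raters' own marginal distributions.
    p_exp = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if p_exp == 1.0:
        return float("nan")  # both raters used a single identical label
    return (p_obs - p_exp) / (1.0 - p_exp)


def mean_pairwise_kappa(ratings_by_rater):
    """Average each rater's Cohen's kappa against every other rater."""
    if len(ratings_by_rater) < 2:
        raise ValueError("at least two raters are required")
    collected = {rater: [] for rater in ratings_by_rater}
    for rater_a, rater_b in combinations(ratings_by_rater, 2):
        kappa = cohen_kappa(ratings_by_rater[rater_a], ratings_by_rater[rater_b])
        collected[rater_a].append(kappa)
        collected[rater_b].append(kappa)
    return {rater: sum(values) / len(values) for rater, values in collected.items()}


if __name__ == "__main__":
    data = {
        "r0": ["a", "a", "b", "b", "a", "b"],
        "r1": ["a", "a", "b", "b", "a", "b"],
        "r2": ["b", "a", "a", "b", "b", "a"],
    }
    scores = mean_pairwise_kappa(data)
    print(min(scores, key=scores.get), scores)
```

A rater whose mean pairwise kappa sits well below the others is a candidate for closer review, although a low score can also reflect a genuinely ambiguous subset of subjects.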

4. Extensions: Multiple, Weighted, and Hierarchical Categories

The classical Fleiss’ kappa is restrictive in contexts where raters may assign multiple (possibly overlapping) categories, or where categories are structured hierarchically or weighted by importance. A generalized kappa has been developed to address these needs (Moons et al., 2023).

  • Multi-label scenarios: Raters can select any subset of the $C$ categories for each subject.
  • Category weighting: Each category $c$ may be assigned a weight $w_c$ (e.g., reflecting pedagogical or diagnostic significance).
  • Hierarchical dependencies: Categories may be gated by parent categories (e.g., subtypes only accessible through main types).

The generalization computes a binary "selected vs. not-selected" kappa for each category, aggregates these with the category weights, and modifies the denominators to account for hierarchy and availability, including a scaling adjustment for rarely available categories.

Under the assumptions of exclusivity, flat weighting, and no hierarchy, the generalized kappa reduces algebraically to the classical $\kappa$ (Moons et al., 2023).
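The per-category idea can be illustrated with a simplified pooling scheme. The sketch below is not the formula of Moons et al. (2023): it omits hierarchy and availability entirely, and the function name `weighted_category_kappa`, the pooling rule, and the weights are assumptions for illustration. It computes a binary selected vs. not-selected agreement per category and pools the weighted numerators and denominators. With exclusive labels and flat weights this pooled value coincides with the classical Fleiss' kappa, since the classical coefficient is a $p_c(1-p_c)$-weighted mean of the per-category kappas.

```python
def weighted_category_kappa(selections, n_raters, weights=None):
    """Pooled binary kappa over categories (illustrative, no hierarchy handling).

    selections[i][c] is the number of raters who selected category c for
    subject i. Rows need not sum to n_raters, so multi-label data is allowed.
    Returns (pooled kappa, list of per-category kappas).
    """
    n_subjects = len(selections)
    n_categories = len(selections[0])
    if weights is None:
        weights = [1.0] * n_categories
    pair_count = n_raters * (n_raters - 1)

    numerator = 0.0
    denominator = 0.0
    per_category = []
    for c in range(n_categories):
        column = [row[c] for row in selections]
        p = sum(column) / (n_subjects * n_raters)
        # Agreeing pairs: both raters selected c, or both left it unselected.
        p_obs = sum(
            x * (x - 1) + (n_raters - x) * (n_raters - x - 1) for x in column
        ) / (n_subjects * pair_count)
        p_exp = p * p + (1.0 - p) * (1.0 - p)
        if p_exp < 1.0:
            per_category.append((p_obs - p_exp) / (1.0 - p_exp))
        else:
            per_category.append(float("nan"))  # category never or always selected
        numerator += weights[c] * (p_obs - p_exp)
        denominator += weights[c] * (1.0 - p_exp)

    if denominator == 0.0:
        raise ValueError("pooled kappa is undefined for these data")
    return numerator / denominator, per_category


if __name__ == "__main__":
    # Hypothetical multi-label data: 3 raters, 3 categories, weight 2 on the first.
    multi_label = [[3, 2, 0], [1, 3, 1], [0, 0, 3]]
    print(weighted_category_kappa(multi_label, n_raters=3, weights=[2.0, 1.0, 1.0]))
```

Pooling numerators and denominators, rather than averaging the per-category kappas directly, keeps rare categories from dominating the result through unstable individual estimates.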

5. Statistical Properties, Confidence Intervals, and Practical Examples

  • $\kappa$ and its generalizations range from $-1$ (complete systematic disagreement) to $1$ (perfect agreement). Negative values are rare in practice.
  • Confidence intervals can be computed using large-sample approximations (Arenas, 2018).
  • Interpretation must always account for category prevalence, task complexity, number of categories, and the possibility of missing data.
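As an alternative to the large-sample variance formula, which is not reproduced here, a percentile bootstrap over subjects gives an approximate interval. The sketch below swaps in that resampling technique deliberately; the function names, the default of 2000 resamples, and the fixed seed are assumptions chosen for reproducibility.

```python
import random


def fleiss_kappa(counts):
    """Fleiss' kappa from a count matrix; returns NaN when chance agreement is 1."""
    n_subjects = len(counts)
    n_raters = sum(counts[0])
    pair_count = n_raters * (n_raters - 1)
    p_bar = sum(
        sum(x * (x - 1) for x in row) / pair_count for row in counts
    ) / n_subjects
    total = n_subjects * n_raters
    marginals = [sum(col) / total for col in zip(*counts)]
    p_e = sum(p * p for p in marginals)
    if p_e == 1.0:
        return float("nan")
    return (p_bar - p_e) / (1.0 - p_e)


def bootstrap_kappa_interval(counts, n_boot=2000, level=0.95, seed=0):
    """Percentile bootstrap interval for kappa, resampling subjects with replacement."""
    rng = random.Random(seed)
    n_subjects = len(counts)
    estimates = []
    for _ in range(n_boot):
        resample = [counts[rng.randrange(n_subjects)] for _ in range(n_subjects)]
        kappa = fleiss_kappa(resample)
        if kappa == kappa:  # drop NaN from degenerate resamples
            estimates.append(kappa)
    if not estimates:
        raise ValueError("no valid bootstrap resamples")
    estimates.sort()
    lower_index = int((1.0 - level) / 2.0 * len(estimates))
    upper_index = min(len(estimates) - 1, int((1.0 + level) / 2.0 * len(estimates)))
    return estimates[lower_index], estimates[upper_index]


if __name__ == "__main__":
    # Hypothetical study: 30 subjects, 4 raters, 2 categories.
    study = [[4, 0]] * 10 + [[0, 4]] * 10 + [[3, 1]] * 5 + [[2, 2]] * 5
    print(fleiss_kappa(study), bootstrap_kappa_interval(study))
```

Resampling whole subjects keeps each subject's ratings together, which respects the dependence among ratings of the same subject; with few subjects the interval will be wide and should be read cautiously.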

Empirical illustration: In an information retrieval study in which LIS students performed binary relevance assessments, the average Fleiss’ $\kappa$ fell in the fair range, while the average Krippendorff’s $\alpha$ was very low. Filtering assessments by $\kappa$ or $\alpha$ thresholds improved the reliability of downstream system evaluation metrics (Schaer, 2012).

Hierarchical/multi-label case example: In a clinical task where psychiatrists rate multiple DSM-III categories per subject, the generalized kappa indicated low but nonzero overall agreement beyond chance (Moons et al., 2023).

6. Comparative Metrics and Current Applications

Comparison with Krippendorff’s $\alpha$: Both Fleiss’ kappa and Krippendorff’s $\alpha$ are chance-corrected agreement measures. Krippendorff’s $\alpha$ is more general, handling missing data, variable numbers of raters per subject, and non-nominal scales. In practice, $\alpha$ tends to be more conservative (often lower) than $\kappa$ and is advised for complex coding or whenever missing data are present (Schaer, 2012).
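To show how missing ratings are accommodated, the sketch below computes Krippendorff's alpha for nominal data from the coincidence matrix, using the standard form of one minus the ratio of observed to expected disagreement. The function name `krippendorff_alpha_nominal` and the use of `None` to mark a missing rating are assumptions for the example.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal labels; None marks a missing rating.

    units[i] holds the labels given to subject i, one slot per rater.
    """
    coincidences = Counter()
    for unit in units:
        values = [v for v in unit if v is not None]
        m = len(values)
        if m < 2:
            continue  # a single rating cannot be paired, so it is dropped
        # Each ordered pair of ratings within a unit contributes 1 / (m - 1).
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    totals = Counter()
    for (a, _b), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())  # number of pairable ratings
    if n <= 1:
        raise ValueError("not enough pairable ratings")

    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0.0:
        return float("nan")  # only one category was ever used
    return 1.0 - observed / expected


if __name__ == "__main__":
    with_gaps = [["a", "a", None], ["a", "b", "a"], ["b", "b", "b"], [None, "b", None]]
    print(krippendorff_alpha_nominal(with_gaps))
```

Subjects with fewer than two ratings are skipped rather than causing an error, which is the practical difference from the fixed-rater requirement of Fleiss' kappa.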

Software tools: Inter-Rater, an open-source Python program, computes Fleiss’ $\kappa$ and pairwise Cohen’s $\kappa$, and produces detailed rater-level diagnostics. The permutation-based approach quickly identifies raters who diverge from group consensus and supports robust visualization and workflow for systematic reviews, medical diagnosis, and educational assessment (Arenas, 2018).

Application domains: Fleiss’ kappa—and its extensions—see broad usage in:

  • Medical diagnosis reliability studies.
  • Evaluation of information retrieval system relevance judgments.
  • Educational settings for rubric-based grading calibration.
  • Any systematic reviewer pooling judgments from multiple expert assessors (Arenas, 2018, Moons et al., 2023, Schaer, 2012).

7. Interpretation Caveats and Recommendations

  • $\kappa$ coefficients must be interpreted in context: high agreement on a single dominant category inflates chance agreement, potentially deflating $\kappa$ even when raw agreement is high, a manifestation of the "prevalence paradox" (Moons et al., 2023).
  • Arbitrary $\kappa$ thresholds for “acceptable” reliability lack universal justification; reporting confidence intervals and prevalence statistics is advised.
  • Filtering low-agreement assessors or task instances can materially impact downstream analyses, confirming the necessity of transparent inter-rater agreement reporting and, where appropriate, data curation for reliability (Schaer, 2012).

A plausible implication is that, while Fleiss’ kappa remains the default for nominal, fully crossed multi-rater designs, researchers confronting multi-label, hierarchical, or incomplete datasets should consider the generalized kappa or Krippendorff’s $\alpha$ for a more nuanced assessment of inter-rater reliability (Moons et al., 2023, Schaer, 2012).
