Gwet’s AC1: Robust Agreement Metric
- Gwet’s AC1 is a chance-corrected agreement coefficient designed to overcome kappa’s limitations in imbalanced categorical data.
- It computes raw agreement alongside a pooled chance agreement based on marginal proportions for both binary and multicategory scenarios.
- Practical implementations using R (e.g., the irrCAC package) and extensive simulations validate AC1's stable performance under prevalence extremes.
Gwet’s AC1 is a chance-corrected agreement coefficient engineered to address key limitations of classical kappa-like statistics in inter-rater reliability analysis, particularly the "prevalence paradox" that renders Cohen’s κ and Scott’s π unstable in imbalanced, checklist-oriented, or low-prevalence categorical data. AC1 yields robust, interpretable, and prevalence-insensitive estimates of agreement for categorical ratings by integrating raw agreement with a chance agreement model based on pooled marginal proportions. It is widely applicable in the assessment of binary and multicategory inter-rater data, particularly where traditional kappa coefficients are known to provide misleading interpretations.
1. Mathematical Definition and Computation
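Let $N$ items be independently classified by two raters into $K$ mutually exclusive categories (e.g., “Yes,” “No,” “Not Applicable”). For category $k$, $p_{kk}$ is the observed proportion of items both raters assign to $k$, and $\pi_k = (p_{k+} + p_{+k})/2$ is the marginal proportion of all assignments to $k$, pooled over the two raters' marginals $p_{k+}$ and $p_{+k}$.
The observed agreement is
$$p_a = \sum_{k=1}^{K} p_{kk}$$
Gwet’s model for chance agreement is
$$p_e = \frac{1}{K-1} \sum_{k=1}^{K} \pi_k (1 - \pi_k)$$
The agreement coefficient is then
$$\mathrm{AC1} = \frac{p_a - p_e}{1 - p_e}$$
In the binary case (2×2 table with cell counts $a$, $b$, $c$, $d$ for categories 1 and 2, where $a$ and $d$ are the agreement cells):
- Let $N = a + b + c + d$
- $p_a = (a + d)/N$
- Average prevalence of category 1: $\pi = \dfrac{(a + b) + (a + c)}{2N}$
Closed form for the binary case:
$$\mathrm{AC1} = \frac{p_a - 2\pi(1 - \pi)}{1 - 2\pi(1 - \pi)}$$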
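The binary case can be sketched in a few lines of Python. This is a minimal illustration, not the R code used in the cited studies: the function name `gwet_ac1_2x2` and the example counts are hypothetical, and the cells are assumed to be laid out so that `a` and `d` count agreements (both raters chose category 1, or both chose category 2) while `b` and `c` count disagreements.

```python
def gwet_ac1_2x2(a: int, b: int, c: int, d: int) -> float:
    """Gwet's AC1 for a 2x2 table (a, d = agreement cells; b, c = disagreement cells)."""
    n = a + b + c + d
    if n == 0:
        raise ValueError("the table contains no items")
    p_a = (a + d) / n                    # observed agreement
    pi = ((a + b) + (a + c)) / (2 * n)   # average prevalence of category 1 over both raters
    p_e = 2 * pi * (1 - pi)              # Gwet's chance agreement for K = 2
    return (p_a - p_e) / (1 - p_e)


# Hypothetical counts, for illustration only.
ac1 = gwet_ac1_2x2(a=40, b=5, c=5, d=50)
print(f"AC1 = {ac1:.4f}")
```

With these counts, $p_a = 0.90$, $\pi = 0.45$, and $p_e = 0.495$, so AC1 is $0.405 / 0.505 \approx 0.80$.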
Stepwise calculation:
- Compute marginal proportions $\pi_k$ for each category.
- Calculate $p_a$ as the proportion of matched assignments.
- Compute $p_e$ as above.
- Apply the AC1 formula.
- For inferential analysis, compute confidence intervals or p-values using analytic, bootstrap, or permutation implementations (e.g., R: irrCAC package) (Bilgin et al., 12 Mar 2026, Silveira et al., 2022).
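The point-estimate steps above can be sketched for any number of categories as follows. This is an illustrative Python version, not the irrCAC implementation: the function name `gwet_ac1`, its `categories` argument, and the example ratings are assumptions made here. It uses the pooled form of chance agreement, in which the sum of $\pi_k(1-\pi_k)$ over categories is divided by $K-1$.

```python
from collections import Counter


def gwet_ac1(ratings_a, ratings_b, categories=None):
    """Gwet's AC1 for two raters' labels over the same items (illustrative sketch)."""
    if len(ratings_a) != len(ratings_b) or len(ratings_a) == 0:
        raise ValueError("both raters must rate the same, non-empty set of items")
    n = len(ratings_a)
    # Assumption: if the full category set is not given, use the labels actually observed.
    # Pass `categories` explicitly when a valid category was never used by either rater.
    if categories is None:
        categories = set(ratings_a) | set(ratings_b)
    k = len(categories)
    if k < 2:
        raise ValueError("at least two categories are required")

    counts_a, counts_b = Counter(ratings_a), Counter(ratings_b)
    # Step 1: pooled marginal proportion per category.
    pi = {c: (counts_a[c] + counts_b[c]) / (2 * n) for c in categories}
    # Step 2: observed agreement = share of items with matching labels.
    p_a = sum(x == y for x, y in zip(ratings_a, ratings_b)) / n
    # Step 3: chance agreement.
    p_e = sum(p * (1 - p) for p in pi.values()) / (k - 1)
    # Step 4: the coefficient.
    return (p_a - p_e) / (1 - p_e)


# Hypothetical checklist ratings, for illustration only.
rater_1 = ["Y", "Y", "N", "NA", "Y", "N", "Y", "Y", "NA", "N"]
rater_2 = ["Y", "Y", "N", "NA", "Y", "Y", "Y", "N", "NA", "N"]
ac1_multi = gwet_ac1(rater_1, rater_2, categories=["Y", "N", "NA"])
print(f"AC1 = {ac1_multi:.4f}")
```

Here the raters match on 8 of 10 items and the pooled marginals are 0.5, 0.3, and 0.2, giving $p_e = 0.62 / 2 = 0.31$ and AC1 $= 0.49 / 0.69 \approx 0.71$.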
2. Theoretical Rationale and Prevalence Robustness
Gwet’s AC1 was developed to address the tendency of kappa statistics to underestimate agreement in scenarios where one class predominates—an artifact known as the prevalence paradox. Kappa’s chance-agreement estimate, based on marginal product probabilities, inflates as imbalance increases, driving κ toward zero even under high raw agreement.
Gwet’s alternative, based on symmetrized marginal averaging, maintains stability by ensuring that the chance-agreement term $p_e$ does not approach the observed agreement except in cases of perfect or null prevalence. This prevents pathological reduction of the coefficient when categories are imbalanced. As a result, AC1 remains reliable for reliability studies involving binary, skewed, or checklist-style categorical data (Bilgin et al., 12 Mar 2026, Silveira et al., 2022).
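The contrast can be made concrete with a small Python sketch that computes Cohen's κ and AC1 from the same 2×2 table. The helper name `agreement_2x2` and both example tables are hypothetical. The κ chance term used here is the standard product of the two raters' marginals, summed over categories.

```python
def agreement_2x2(a: int, b: int, c: int, d: int) -> dict:
    """Raw agreement, Cohen's kappa, and Gwet's AC1 for a 2x2 table (a, d = agreement cells)."""
    n = a + b + c + d
    p_a = (a + d) / n
    row1 = (a + b) / n   # rater A's proportion in category 1
    col1 = (a + c) / n   # rater B's proportion in category 1

    # Cohen's kappa: chance agreement from the product of marginals.
    # (This is undefined, with a zero denominator, if both raters use a single category throughout.)
    pe_kappa = row1 * col1 + (1 - row1) * (1 - col1)
    kappa = (p_a - pe_kappa) / (1 - pe_kappa)

    # Gwet's AC1: chance agreement from the averaged prevalence.
    pi = (row1 + col1) / 2
    pe_ac1 = 2 * pi * (1 - pi)
    ac1 = (p_a - pe_ac1) / (1 - pe_ac1)
    return {"p_a": p_a, "kappa": kappa, "ac1": ac1}


# Two hypothetical tables with the same 90% raw agreement.
balanced = agreement_2x2(45, 5, 5, 45)   # both categories equally common
skewed = agreement_2x2(90, 5, 5, 0)      # category 1 dominates
print(balanced)
print(skewed)
```

In the balanced table both coefficients equal 0.80. In the skewed table κ's chance term rises to 0.905, pushing κ slightly below zero, while AC1's chance term falls to 0.095 and AC1 stays near 0.89.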
3. Practical Computation: Worked Examples and Software
For multicategory agreement (e.g., STROBE checklist scored “Y,” “N,” “NA”), Bilgin et al. (12 Mar 2026) compute AC1 across domains and items as follows:
- For each rater pair and checklist item, determine the count of matched vs. unmatched assignments.
- Compute pooled marginal proportions across both raters.
- Apply stepwise calculation as above.
- Point estimates and 95% confidence intervals are obtained with the R package irrCAC.
For 2×2 contingency tables, Silveira & Siqueira provide an R function that computes AC1 (and optionally p-values). This allows for direct numeric computation or batch processing alongside inferential statistics for hypothesis testing (Silveira et al., 2022).
4. Empirical Benchmarks and Performance
AC1 has been empirically validated in large-scale and domain-specific studies:
- In an evaluation of STROBE checklist inter-rater reliability for 17 manuscripts in rheumatology, overall agreement across all rater pairs was 85.0%, and the AC1 estimate (95% CI: 0.801–0.851) was interpreted as “almost perfect” by Landis & Koch standards (Bilgin et al., 12 Mar 2026).
- Domain-specific AC1 was higher for structural items (Presentation & Context) than for Methodological Rigor.
- The 1,028,789-table simulation in (Silveira et al., 2022) found that AC1 tracked Holley & Guilford’s $G$ coefficient (raw agreement minus disagreement) almost perfectly (Spearman ρ = 0.9933), outperforming κ (60.87) and showing nearly error-free inferential stability even in high-agreement and imbalanced cases.
- In contrast to other estimators (κ, π, Q, Y, r, McNemar’s χ²) which misbehaved in extreme or unbalanced tables, AC1 provided results consistent with expectations under all tested conditions (Silveira et al., 2022).
5. Interpretation and Recommended Usage
Standard interpretation thresholds, as adopted from Landis & Koch and applied in (Bilgin et al., 12 Mar 2026), are:
| AC1 Value | Interpretation |
|---|---|
| < 0.00 | Poor agreement |
| 0.00–0.20 | Slight agreement |
| 0.21–0.40 | Fair agreement |
| 0.41–0.60 | Moderate agreement |
| 0.61–0.80 | Substantial agreement |
| 0.81–1.00 | Almost perfect |
AC1 assumes values in $[-1, 1]$, where $0$ represents chance-level agreement, positive values signal concordance, and negative values indicate systematic disagreement.
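For batch reporting, the table can be encoded as a small lookup. The sketch below is one possible reading: the function name `landis_koch_label` is hypothetical, and because the table leaves gaps between bands (e.g., between 0.20 and 0.21), each upper bound is assumed to be inclusive.

```python
def landis_koch_label(ac1: float) -> str:
    """Map an AC1 value to the Landis & Koch band listed in the table above."""
    if not -1.0 <= ac1 <= 1.0:
        raise ValueError("AC1 must lie in [-1, 1]")
    if ac1 < 0.0:
        return "Poor agreement"
    # Assumption: upper bounds are inclusive, so 0.20 is "Slight" and 0.205 is "Fair".
    bands = [
        (0.20, "Slight agreement"),
        (0.40, "Fair agreement"),
        (0.60, "Moderate agreement"),
        (0.80, "Substantial agreement"),
        (1.00, "Almost perfect"),
    ]
    for upper, label in bands:
        if ac1 <= upper:
            return label
    return "Almost perfect"  # not reached for valid input; kept as a safe default


print(landis_koch_label(0.72))   # Substantial agreement
print(landis_koch_label(-0.10))  # Poor agreement
```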
Best practices:
- AC1 is preferred over κ in low-prevalence, high-prevalence, or highly skewed categorical data.
- For small sample sizes, analytic p-values may be unreliable; bootstrap or permutation confidence intervals and exact tests are recommended (Silveira et al., 2022).
- Presentation of the raw contingency table alongside AC1 is strongly advised to contextualize results.
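As an illustration of the bootstrap recommendation, the following Python sketch resamples items with replacement and takes simple percentile limits. It is not the procedure implemented in irrCAC or in the cited papers: the function names, the number of resamples, the seed, and the example data are all assumptions chosen for the example.

```python
import random


def ac1_from_pairs(pairs, categories):
    """Gwet's AC1 from a list of (rater_a_label, rater_b_label) pairs."""
    n, k = len(pairs), len(categories)
    p_a = sum(x == y for x, y in pairs) / n
    p_e = 0.0
    for c in categories:
        pi_c = sum((x == c) + (y == c) for x, y in pairs) / (2 * n)
        p_e += pi_c * (1 - pi_c)
    p_e /= k - 1
    return (p_a - p_e) / (1 - p_e)


def bootstrap_ci(pairs, categories, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for AC1, resampling items with replacement."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n = len(pairs)
    stats = sorted(
        ac1_from_pairs([pairs[rng.randrange(n)] for _ in range(n)], categories)
        for _ in range(n_boot)
    )
    lower = stats[int((alpha / 2) * n_boot)]
    upper = stats[int((1 - alpha / 2) * n_boot) - 1]
    return lower, upper


# Hypothetical small, skewed sample: 18 agreements on "Y" and 2 disagreements.
pairs = [("Y", "Y")] * 18 + [("Y", "N"), ("N", "Y")]
categories = ["Y", "N"]
point = ac1_from_pairs(pairs, categories)
lower, upper = bootstrap_ci(pairs, categories)
print(f"AC1 = {point:.3f}, 95% bootstrap CI = ({lower:.3f}, {upper:.3f})")
```

Passing the full category list to every resample keeps $K$ fixed even when a resample happens to contain only one label, so the chance term is always computed on the same scale.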
6. Comparative Advantages and Limitations
Gwet’s AC1 combines the prevalence-invariant stability of Holley & Guilford’s $G$ with rigorous chance-correction and inferential justification. It does not shrink artifactually toward zero under extreme class imbalance, nor does it present the interpretative ambiguities characteristic of Cohen’s κ or Scott’s π in such settings (Silveira et al., 2022).
However, edge cases with one rater using only one category across all items can produce division-by-zero and require special handling. For the multicategory extension, the simplicity and computational transparency of AC1 remain, with direct generalization to $K$-category tasks. A plausible implication is that AC1’s behavior remains well-conditioned for complex, high-dimensional rating tasks but should be validated for edge cases in categorical prevalence (Bilgin et al., 12 Mar 2026, Silveira et al., 2022).
7. Domain Applications and Research Impact
Gwet’s AC1 is frequently used in health sciences, clinical research, and machine learning evaluation where checklist, rating, or labeling reliability is critical and category distributions are imbalanced. It has been instrumental in benchmarking human-expert versus automated or LLM-derived annotation, as demonstrated in recent STROBE checklist evaluations (Bilgin et al., 12 Mar 2026). AC1’s ability to quantify both model-model and model-human agreement across domains, and to pinpoint systematic disagreement via negative values, makes it especially valuable for model auditing and targeted reliability improvement.
In large-scale simulation studies, AC1 consistently aligns with raw-agreement indices and with intuitive expectations, supporting its use as a metric of choice in reliability research, particularly when transparency and inferential stability under prevalence extremes are required (Silveira et al., 2022).