Interrater agreement statistics under the two-rater dichotomous-response case with correlated decisions (2402.08069v1)
Abstract: Measurement of interrater agreement (IRA) is critical in many disciplines. To correct for potential confounding by chance agreement in IRA, Cohen's kappa and many other methods have been proposed. However, owing to the varied strategies and assumptions across these methods, there is little practical guidance on which method should be preferred, even for the common two-rater dichotomous rating. To fill this gap in the literature, we systematically review nine IRA methods and propose a generalized framework that simulates the correlated decision processes of the two raters, allowing the reviewed methods to be compared under comprehensive practical scenarios. Based on the new framework, an estimand of the "true" chance-corrected IRA is defined by accounting for "probabilistic certainty" and serves as the comparison benchmark. We carry out extensive simulations to evaluate the performance of the reviewed IRA measures, and an agglomerative hierarchical clustering analysis is conducted to assess the inter-relationships among the included methods and the benchmark metric. Recommendations for selecting appropriate IRA statistics under different practical conditions are provided, and the need for further advancements in IRA estimation methodology is emphasized.
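To make the chance-correction idea concrete, the sketch below computes Cohen's kappa, κ = (p_o − p_e)/(1 − p_e), on a 2×2 agreement table drawn from a bivariate Bernoulli model of two correlated raters. This is a minimal illustration under assumed parameter names (p1, p2, rho) and a simple marginal-plus-correlation parameterization; it is not the paper's actual simulation framework or benchmark estimand.

```python
import numpy as np

def cohens_kappa(table):
    """Cohen's kappa for a 2x2 agreement table.

    table[i][j] = number of subjects rated i by rater A and j by rater B
    (0 = negative, 1 = positive).
    """
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n            # observed agreement
    row = table.sum(axis=1) / n          # rater A marginal proportions
    col = table.sum(axis=0) / n          # rater B marginal proportions
    p_e = np.dot(row, col)               # agreement expected by chance
    return (p_o - p_e) / (1.0 - p_e)

def simulate_correlated_raters(n, p1, p2, rho, rng):
    """Draw n paired dichotomous ratings from a bivariate Bernoulli
    distribution with marginal positive rates p1, p2 and correlation rho.
    (Illustrative assumption: p1, p2, rho must yield nonnegative cell
    probabilities; this is not the paper's decision-process model.)"""
    p11 = p1 * p2 + rho * np.sqrt(p1 * (1 - p1) * p2 * (1 - p2))
    p10 = p1 - p11
    p01 = p2 - p11
    p00 = 1.0 - p11 - p10 - p01
    cells = rng.multinomial(n, [p11, p10, p01, p00])
    # Assemble the 2x2 table: rows = rater A, cols = rater B.
    return np.array([[cells[3], cells[2]],
                     [cells[1], cells[0]]])

if __name__ == "__main__":
    rng = np.random.default_rng(2024)
    table = simulate_correlated_raters(n=500, p1=0.3, p2=0.3, rho=0.6, rng=rng)
    print("agreement table:\n", table)
    print("Cohen's kappa:", round(cohens_kappa(table), 3))
```

Varying p1 and p2 in such a simulator is one simple way to see the prevalence sensitivity of kappa that motivates the alternative coefficients compared in the paper.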