Interrater agreement statistics under the two-rater dichotomous-response case with correlated decisions (2402.08069v1)

Published 12 Feb 2024 in stat.ME

Abstract: Measurement of the interrater agreement (IRA) is critical in various disciplines. To correct for potential confounding chance agreement in IRA, Cohen's kappa and many other methods have been proposed. However, owing to the varied strategies and assumptions across these methods, there is a lack of practical guidance on which of these methods should be preferred, even for the common two-rater dichotomous rating case. To fill the gaps in the literature, we systematically review nine IRA methods and propose a generalized framework that can simulate the correlated decision processes behind the two raters, allowing the reviewed methods to be compared under comprehensive practical scenarios. Based on the new framework, an estimand of "true" chance-corrected IRA is defined by accounting for the "probabilistic certainty" and serves as the comparison benchmark. We carry out extensive simulations to evaluate the performance of the reviewed IRA measures, and an agglomerative hierarchical clustering analysis is conducted to assess the inter-relationships among the included methods and the benchmark metric. Recommendations for selecting appropriate IRA statistics in different practical conditions are provided, and the need for further advancements in IRA estimation methodologies is emphasized.
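
As a reading aid, the sketch below illustrates the standard two-rater dichotomous setting the abstract refers to: Cohen's kappa computed from a 2x2 contingency table, applied to ratings simulated from a latent correlated-decision model. The bivariate-normal latent mechanism, the threshold, and the correlation value are illustrative assumptions made for this sketch only; they are not the paper's generalized simulation framework or its benchmark "true" chance-corrected estimand.

```python
import numpy as np

# Cohen's kappa for two raters with dichotomous ratings, from a 2x2 table
# laid out as [[n_11, n_10], [n_01, n_00]] (rows: rater 1, columns: rater 2).
def cohens_kappa(table):
    table = np.asarray(table, dtype=float)
    n = table.sum()
    p_o = np.trace(table) / n          # observed agreement (diagonal cells)
    row = table.sum(axis=1) / n        # rater 1 marginal proportions
    col = table.sum(axis=0) / n        # rater 2 marginal proportions
    p_e = float(row @ col)             # chance agreement under independence
    return (p_o - p_e) / (1.0 - p_e)

# Illustrative (assumed) data-generating process: each subject yields a latent
# bivariate-normal score per rater; a rating of 1 is given when the latent
# score exceeds a threshold, so the two raters' decisions are correlated.
rng = np.random.default_rng(0)
rho, threshold, n_subjects = 0.6, 0.0, 1000
latent = rng.multivariate_normal([0.0, 0.0], [[1.0, rho], [rho, 1.0]], size=n_subjects)
ratings = (latent > threshold).astype(int)

# Tabulate the joint decisions into the 2x2 layout expected above.
table = np.zeros((2, 2))
for a, b in ratings:
    table[1 - a, 1 - b] += 1

print("Cohen's kappa:", round(cohens_kappa(table), 3))
```

Re-running with a different `rho`, or moving `threshold` away from zero to skew the prevalence, shows how chance-corrected agreement can diverge from raw percent agreement under correlated decisions and unbalanced marginals, which is the kind of practical scenario the paper's simulation study varies when comparing the nine reviewed statistics.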
