
A Toolbox for Surfacing Health Equity Harms and Biases in Large Language Models (2403.12025v2)

Published 18 Mar 2024 in cs.CY, cs.CL, and cs.LG

Abstract: LLMs hold promise to serve complex health information needs but also have the potential to introduce harm and exacerbate health disparities. Reliably evaluating equity-related model failures is a critical step toward developing systems that promote health equity. We present resources and methodologies for surfacing biases with potential to precipitate equity-related harms in long-form, LLM-generated answers to medical questions and conduct a large-scale empirical case study with the Med-PaLM 2 LLM. Our contributions include a multifactorial framework for human assessment of LLM-generated answers for biases, and EquityMedQA, a collection of seven datasets enriched for adversarial queries. Both our human assessment framework and dataset design process are grounded in an iterative participatory approach and review of Med-PaLM 2 answers. Through our empirical study, we find that our approach surfaces biases that may be missed via narrower evaluation approaches. Our experience underscores the importance of using diverse assessment methodologies and involving raters of varying backgrounds and expertise. While our approach is not sufficient to holistically assess whether the deployment of an AI system promotes equitable health outcomes, we hope that it can be leveraged and built upon towards a shared goal of LLMs that promote accessible and equitable healthcare.

Introducing a Framework and Datasets for Evaluating Health Equity Harms in LLMs

Overview of Proposed Framework and Datasets

LLMs in healthcare have demonstrated considerable potential for enhancing access to medical information and improving patient care. Alongside these opportunities, however, lie significant challenges, particularly the risk of perpetuating biases and exacerbating health disparities. Addressing these challenges requires a systematic approach to evaluating and identifying biases embedded in LLM-generated content. To that end, the paper presents a comprehensive framework, alongside a collection of newly released datasets, for surfacing health equity-related biases in the outputs of medical LLMs. This effort, grounded in an iterative and participatory approach, encompasses multifactorial assessment rubrics for bias evaluation and an empirical case study with Med-PaLM 2, contributing valuable insights into the identification and mitigation of equity-related harms in LLMs.

Multifactorial Assessment Rubrics

The assessment rubrics detailed in this paper were designed to evaluate bias within LLM-generated answers to medical queries. They incorporate dimensions of bias developed in collaboration with equity experts, reflecting a nuanced approach to understanding bias beyond conventional metrics. Three types of rubrics are introduced:

  • Independent Assessment: Evaluates bias in a single answer to a question, allowing raters to identify various forms of bias including inaccuracies across identity axes, lack of inclusivity, and stereotyping.
  • Pairwise Assessment: Compares the presence or degree of bias between two answers to a single question, providing a relative measure of bias between model outputs.
  • Counterfactual Assessment: Focuses on answers to pairs of questions that differ only by identifiers of demographics or other context, helping identify biases introduced by changes in the specified identities or contexts.
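To make the distinction between the three rubric types concrete, the records they produce can be sketched as simple data structures. This is an illustrative schema only; the field names and label values here are hypothetical, not the paper's actual rating instrument.

```python
from dataclasses import dataclass

@dataclass
class IndependentAssessment:
    """One rater's bias judgment of a single answer (hypothetical schema)."""
    question: str
    answer: str
    bias_present: bool
    bias_dimensions: list[str]  # e.g. ["inaccuracy across identity axes", "stereotyping"]

@dataclass
class PairwiseAssessment:
    """Relative bias judgment between two answers to the same question."""
    question: str
    answer_a: str
    answer_b: str
    preference: str  # "A", "B", or "Tie": which answer shows less bias

@dataclass
class CounterfactualAssessment:
    """Judgment over answers to two questions differing only in identifiers or context."""
    question_a: str
    question_b: str
    answer_a: str
    answer_b: str
    unjustified_difference: bool  # does the identity change alter answer quality or content?
```

Structuring ratings this way makes the unit of analysis explicit: one answer for independent assessment, one question with two answers for pairwise, and two minimally differing questions for counterfactual.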

EquityMedQA Datasets

EquityMedQA comprises seven datasets designed to facilitate adversarial testing of health equity issues in medical LLMs. These datasets span various aspects of medical information queries, from explicitly adversarial questions to inquiries enriched for content related to known health disparities. The diversity of collection methodologies, including human curation, LLM-generated queries, and a focus on global health topics, underscores the comprehensive nature of these datasets in targeting different forms of potential bias. Notably, the datasets include:

  • OMAQ: Features human-curated, explicitly adversarial queries across multiple health topics.
  • EHAI: Targets implicitly adversarial queries related to health disparities in the United States.
  • FBRT-Manual and FBRT-LLM: Contain questions derived through failure-based red teaming of Med-PaLM 2.
  • TRINDS: Centers on tropical and infectious diseases, emphasizing the global context.
  • CC-Manual and CC-LLM: Include counterfactual query pairs with adjustments for identity or context, aiding in a deeper understanding of bias generation.
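The counterfactual pairs in CC-Manual and CC-LLM follow a pattern that can be sketched as template substitution: fill an identity slot with each identifier of interest, then pair up the resulting question variants. This is a minimal sketch of the general idea; the slot name, template, and identifier list below are hypothetical, not drawn from the paper's data.

```python
from itertools import combinations

def make_counterfactual_pairs(template: str, identifiers: list[str]) -> list[tuple[str, str]]:
    """Fill an {identity} slot with each identifier and return all unordered
    pairs of the resulting question variants (illustrative sketch)."""
    variants = [template.format(identity=i) for i in identifiers]
    return list(combinations(variants, 2))

pairs = make_counterfactual_pairs(
    "What screening tests should a {identity} adult discuss with their doctor?",
    ["Black", "white", "South Asian"],
)
# 3 identifiers yield 3 unordered pairs; each pair differs only in the identifier
```

Because each pair differs only in the substituted identifier, any systematic difference in the model's two answers can be attributed to the identity change rather than to the question content.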

Empirical Results and Implications

Through an extensive empirical study using the developed rubrics and datasets, several key findings emerged:

  • Bias in LLM Outputs: The study revealed biases in Med-PaLM 2 outputs across multiple dimensions, demonstrating the necessity of diverse methodologies for bias evaluation.
  • Role of Rater Groups: Variation in bias reporting between physician, health equity expert, and consumer rater groups highlighted the importance of including diverse perspectives in bias evaluation efforts.
  • Utility of Counterfactual Analysis: The counterfactual assessment rubric surfaced biases introduced by changes in demographic identifiers or context, offering insight into subtler forms of bias.
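Comparing bias-report rates across rater groups requires some measure of uncertainty, since each group rates a finite sample of answers. The sketch below shows a plain percentile bootstrap for a single group's report rate; it is a minimal illustration under the assumption of independent ratings, and the paper's actual procedure (for example, a bootstrap clustered by question or rater) may differ.

```python
import random

def bootstrap_ci(labels: list[int], n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Percentile bootstrap CI for the rate of bias reports (1s) among a
    rater group's binary labels (illustrative sketch)."""
    rng = random.Random(seed)
    rates = []
    for _ in range(n_boot):
        resample = [rng.choice(labels) for _ in labels]  # sample with replacement
        rates.append(sum(resample) / len(resample))
    rates.sort()
    lo = rates[int((alpha / 2) * n_boot)]
    hi = rates[int((1 - alpha / 2) * n_boot) - 1]
    return sum(labels) / len(labels), (lo, hi)

# Hypothetical labels: 1 = rater reported bias, 0 = no bias reported
rate, (lo, hi) = bootstrap_ci([1, 0, 0, 1, 0, 0, 0, 1, 0, 0])
```

Overlap (or lack of it) between such intervals for physician, health equity expert, and consumer raters is one way to judge whether observed differences in reporting are more than sampling noise.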

Concluding Remarks

The proposed framework and datasets mark a significant advancement in the ongoing efforts to mitigate health equity harms within medical LLMs. The results underscore the multifaceted nature of bias in LLM outputs and the critical need for diverse evaluative approaches and stakeholder engagement. Future research directions include refining the evaluation rubrics, extending the datasets to cover wider global contexts, and developing methodologies to mitigate identified biases effectively.

Authors (30)
  1. Stephen R. Pfohl
  2. Heather Cole-Lewis
  3. Rory Sayres
  4. Darlene Neal
  5. Mercy Asiedu
  6. Awa Dieng
  7. Nenad Tomasev
  8. Qazi Mamunur Rashid
  9. Shekoofeh Azizi
  10. Negar Rostamzadeh
  11. Liam G. McCoy
  12. Leo Anthony Celi
  13. Yun Liu
  14. Mike Schaekermann
  15. Alanna Walton
  16. Alicia Parrish
  17. Chirag Nagpal
  18. Preeti Singh
  19. Akeiylah Dewitt
  20. Philip Mansfield