Introducing a Framework and Datasets for Evaluating Health Equity Harms in LLMs
Overview of Proposed Framework and Datasets
The use of LLMs in healthcare shows considerable promise for expanding access to medical information and improving patient care. Alongside these opportunities, however, are significant challenges, particularly the risk of perpetuating biases and exacerbating health disparities. Addressing these challenges requires a systematic approach to evaluating and identifying biases embedded in LLM-generated content. To that end, the paper presents a comprehensive framework and a collection of newly released datasets for surfacing health-equity-related biases in the outputs of medical LLMs. This effort, grounded in an iterative and participatory approach, encompasses multifactorial assessment rubrics for bias evaluation and an empirical case study with Med-PaLM 2, contributing valuable insights into the identification and mitigation of equity-related harms in LLMs.
Multifactorial Assessment Rubrics
The assessment rubrics detailed in this paper were designed to evaluate bias in LLM-generated answers to medical questions. Developed in collaboration with health equity experts, they reflect a nuanced understanding of bias that goes beyond conventional metrics. Three types of rubrics are introduced (a minimal schema sketch follows the list):
- Independent Assessment: Evaluates bias in a single answer to a question, allowing raters to identify various forms of bias including inaccuracies across identity axes, lack of inclusivity, and stereotyping.
- Pairwise Assessment: Compares the presence or degree of bias between two answers to a single question, providing a relative measure of bias between model outputs.
- Counterfactual Assessment: Focuses on answers to pairs of questions that differ only by identifiers of demographics or other context, helping identify biases introduced by changes in the specified identities or contexts.
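To make the three rubric types concrete, the following is a minimal sketch, in Python, of how ratings under each rubric might be recorded. The class names, fields, and bias dimensions are illustrative assumptions for exposition, not the paper's exact rubric schema.

```python
from dataclasses import dataclass, field
from enum import Enum

# Illustrative bias dimensions; the paper's rubrics define their own set.
class BiasDimension(Enum):
    INACCURACY_FOR_IDENTITY = "inaccuracy across axes of identity"
    LACK_OF_INCLUSIVITY = "lack of inclusivity"
    STEREOTYPING = "stereotypical characterization"

@dataclass
class IndependentRating:
    """One rater's judgment of bias in a single answer to a question."""
    question: str
    answer: str
    bias_present: bool
    dimensions: list[BiasDimension] = field(default_factory=list)

@dataclass
class PairwiseRating:
    """Relative judgment of bias between two answers to the same question."""
    question: str
    answer_a: str
    answer_b: str
    less_biased: str  # "A", "B", or "tie"

@dataclass
class CounterfactualRating:
    """Judgment over answers to two questions differing only in identifiers."""
    question_a: str  # identical to question_b except for an identity/context marker
    question_b: str
    answer_a: str
    answer_b: str
    unjustified_difference: bool  # does the identity change alone drive a difference?
```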
EquityMedQA Datasets
EquityMedQA comprises seven datasets designed to support adversarial testing of medical LLMs for health equity issues. The datasets span a range of medical information queries, from explicitly adversarial questions to queries enriched for content related to known health disparities. Their varied collection methods, including human curation, LLM-based generation, and a focus on global health topics, allow them to target different forms of potential bias. The seven datasets are:
- OMAQ: Features human-curated, explicitly adversarial queries across multiple health topics.
- EHAI: Targets implicitly adversarial queries related to health disparities in the United States.
- FBRT-Manual and FBRT-LLM: Contain questions derived through failure-based red teaming of Med-PaLM 2, curated manually and generated by an LLM, respectively.
- TRINDS: Centers on tropical and infectious diseases, emphasizing the global context.
- CC-Manual and CC-LLM: Include counterfactual query pairs that differ only in identity or other context, aiding a deeper understanding of how bias arises; a sketch of how such pairs can be constructed follows this list.
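The counterfactual datasets pair queries that are identical except for an identity or context marker. Below is a minimal sketch of one way such pairs could be constructed by template substitution; the template, identifiers, and function name are hypothetical and not drawn from the CC-Manual or CC-LLM construction procedures.

```python
from itertools import combinations

def counterfactual_pairs(template: str, identifiers: list[str]) -> list[tuple[str, str]]:
    """Fill a query template with each identifier, then pair up all variants.

    Each returned pair differs only in the substituted identifier, so any
    systematic difference in the model's answers can be attributed to the
    identity change rather than to the medical content of the query.
    """
    variants = [template.format(identity=i) for i in identifiers]
    return list(combinations(variants, 2))

# Illustrative template and identifiers only.
template = "What should a {identity} patient know about managing hypertension?"
for q_a, q_b in counterfactual_pairs(template, ["Black", "white", "Hispanic"]):
    print(q_a, "<->", q_b)
```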
Empirical Results and Implications
Through an extensive empirical study using the developed rubrics and datasets, several key findings emerged:
- Bias in LLM Outputs: The study surfaced biases in Med-PaLM 2 outputs across multiple dimensions, underscoring the need for diverse methodologies in bias evaluation.
- Role of Rater Groups: Physician, health equity expert, and consumer rater groups reported bias at different rates, highlighting the importance of including diverse perspectives in bias evaluation efforts (a simple aggregation sketch follows this list).
- Utility of Counterfactual Analysis: The counterfactual assessment rubric surfaced biases tied to changes in demographic identifiers or context, offering insight into subtle forms of bias.
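One simple way to quantify the rater-group variation described above is to compare per-group rates of bias reports. The sketch below assumes a hypothetical flat record format with a rater_group label and a boolean bias_present flag; it is not the paper's analysis code.

```python
from collections import defaultdict

def bias_report_rate_by_group(ratings: list[dict]) -> dict[str, float]:
    """Fraction of ratings flagging bias, broken out by rater group.

    Each rating is assumed to be a dict with keys "rater_group"
    (e.g., "physician", "equity_expert", "consumer") and "bias_present" (bool).
    """
    counts = defaultdict(lambda: [0, 0])  # group -> [flagged, total]
    for r in ratings:
        counts[r["rater_group"]][0] += int(r["bias_present"])
        counts[r["rater_group"]][1] += 1
    return {group: flagged / total for group, (flagged, total) in counts.items()}

# Illustrative data only.
ratings = [
    {"rater_group": "physician", "bias_present": False},
    {"rater_group": "equity_expert", "bias_present": True},
    {"rater_group": "consumer", "bias_present": False},
]
print(bias_report_rate_by_group(ratings))
```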
Concluding Remarks
The proposed framework and datasets mark a significant advance in ongoing efforts to mitigate health equity harms in medical LLMs. The results underscore the multifaceted nature of bias in LLM outputs and the critical need for diverse evaluative approaches and stakeholder engagement. Future directions include refining the evaluation rubrics, extending the datasets to broader global contexts, and developing methods to effectively mitigate the biases identified.