Addressing cognitive bias in medical language models (2402.08113v3)
Abstract: There is increasing interest in the application of LLMs to the medical field, in part because of their impressive performance on medical exam questions. While promising, exam questions do not reflect the complexity of real patient-doctor interactions. In reality, physicians' decisions are shaped by many complex factors, such as patient compliance, personal experience, ethical beliefs, and cognitive bias. As a step toward understanding this, we hypothesize that when LLMs are confronted with clinical questions containing cognitive biases, they will yield significantly less accurate responses than when the same questions are presented without such biases. In this study, we developed BiasMedQA, a benchmark for evaluating cognitive biases in LLMs applied to medical tasks. Using BiasMedQA, we evaluated six LLMs: GPT-4, Mixtral-8x7B, GPT-3.5, PaLM-2, Llama 2 70B-chat, and the medically specialized PMC Llama 13B. We tested these models on 1,273 questions from the US Medical Licensing Exam (USMLE) Steps 1, 2, and 3, modified to replicate common clinically relevant cognitive biases. Our analysis revealed varying effects of bias across these LLMs, with GPT-4 standing out for its resilience, in contrast to Llama 2 70B-chat and PMC Llama 13B, which were disproportionately affected by cognitive bias. Our findings highlight the critical need for bias mitigation in the development of medical LLMs, pointing towards safer and more reliable applications in healthcare.
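The evaluation protocol the abstract describes, comparing accuracy on unmodified USMLE-style questions against the same questions with a cognitive-bias cue injected, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical rendering, not the authors' code: `query_model`, the bias templates, and the dataset fields are assumptions for illustration only, and the paper's actual prompt wording differs.

```python
# Minimal sketch (not the BiasMedQA implementation): measure accuracy on
# plain questions vs. the same questions with a cognitive-bias sentence added.
from typing import Callable, Dict, List

# Hypothetical bias templates; the benchmark's exact wording differs.
BIAS_PROMPTS: Dict[str, str] = {
    "none": "",
    "recency": " You recently saw several similar patients whose diagnosis was option {distractor}.",
    "confirmation": " You are initially convinced the answer is option {distractor}.",
}

def make_prompt(question: str, options: Dict[str, str], bias: str, distractor: str) -> str:
    """Format a multiple-choice question, optionally appending a bias cue."""
    opts = "\n".join(f"{k}. {v}" for k, v in options.items())
    suffix = BIAS_PROMPTS[bias].format(distractor=distractor)
    return f"{question}{suffix}\n{opts}\nAnswer with the single letter of the best option."

def accuracy(dataset: List[dict], bias: str, query_model: Callable[[str], str]) -> float:
    """Fraction of questions the model answers correctly under a given bias condition."""
    correct = 0
    for item in dataset:
        # Each item is assumed to carry the question text, lettered options,
        # the gold answer letter, and a plausible-but-wrong option ("distractor")
        # used to anchor the bias cue.
        prompt = make_prompt(item["question"], item["options"], bias, item["distractor"])
        reply = query_model(prompt).strip().upper()[:1]
        correct += (reply == item["answer"])
    return correct / len(dataset)
```

Under these assumptions, the bias effect for a model is simply the gap between `accuracy(items, "none", query_model)` and, say, `accuracy(items, "confirmation", query_model)`.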
- Samuel Schmidgall
- Carl Harris
- Ime Essien
- Daniel Olshvang
- Tawsifur Rahman
- Ji Woong Kim
- Rojin Ziaei
- Jason Eshraghian
- Peter Abadir
- Rama Chellappa