Combining Insights From Multiple Large Language Models Improves Diagnostic Accuracy
Abstract: Background: LLMs such as OpenAI's GPT-4 or Google's PaLM 2 have been proposed as viable diagnostic support tools and even described as replacements for "curbside consults". However, even LLMs specifically trained on medical topics may lack sufficient diagnostic accuracy for real-life applications. Methods: Using collective intelligence methods and a dataset of 200 clinical vignettes of real-life cases, we compared the accuracy of differential diagnoses obtained from individual commercial LLMs (OpenAI GPT-4, Google PaLM 2, Cohere Command, Meta Llama 2) with the accuracy of differential diagnoses synthesized by aggregating responses from combinations of the same LLMs. Results: We find that aggregating responses from multiple different LLMs leads to more accurate differential diagnoses (average accuracy for 3 LLMs: $75.3\% \pm 1.6$ pp) than those produced by single LLMs (average accuracy for single LLMs: $59.0\% \pm 6.1$ pp). Discussion: Using collective intelligence methods to synthesize differential diagnoses from the responses of different LLMs achieves two of the steps necessary to advance acceptance of LLMs as a diagnostic support tool: (1) demonstrating high diagnostic accuracy and (2) eliminating dependence on a single commercial vendor.
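The abstract does not say which aggregation rule is used, so the following is only a minimal sketch of one way ranked differential diagnoses from several models could be combined. The reciprocal-rank weighting, the function names `aggregate_differentials` and `top_k_accuracy`, and the example diagnoses are all illustrative assumptions, not the authors' method or data.

```python
from collections import defaultdict


def aggregate_differentials(ranked_lists, top_k=5):
    """Merge ranked differential diagnoses from several models.

    ASSUMPTION: each diagnosis earns 1/rank in every list where it appears,
    and scores are summed across lists. The paper's actual rule may differ.
    """
    scores = defaultdict(float)
    for diagnoses in ranked_lists:
        seen = set()
        for rank, name in enumerate(diagnoses, start=1):
            key = name.strip().lower()  # naive normalization of the label
            if key in seen:
                continue  # count a repeated diagnosis once per model
            seen.add(key)
            scores[key] += 1.0 / rank
    # Highest score first; ties broken alphabetically for determinism.
    ordered = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [name for name, _ in ordered[:top_k]]


def top_k_accuracy(predictions, truths, k=5):
    """Fraction of cases whose true diagnosis is among the first k predictions."""
    hits = sum(
        truth.strip().lower() in pred[:k]
        for pred, truth in zip(predictions, truths)
    )
    return hits / len(truths)


# Hypothetical outputs from three models for a single vignette.
model_a = ["pneumonia", "pulmonary embolism", "bronchitis"]
model_b = ["pulmonary embolism", "pneumonia", "heart failure"]
model_c = ["pulmonary embolism", "asthma", "pneumonia"]

combined = aggregate_differentials([model_a, model_b, model_c], top_k=3)
print(combined)
```

In practice, free-text diagnoses from different models would need more careful matching (synonyms, abbreviations) than the lowercase comparison used here.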