
Combining Insights From Multiple Large Language Models Improves Diagnostic Accuracy

Published 13 Feb 2024 in cs.AI (arXiv:2402.08806v1)

Abstract: Background: Large language models (LLMs) such as OpenAI's GPT-4 or Google's PaLM 2 have been proposed as viable diagnostic support tools, and even as replacements for "curbside consults". However, even LLMs trained specifically on medical topics may lack sufficient diagnostic accuracy for real-life applications. Methods: Using collective intelligence methods and a dataset of 200 clinical vignettes of real-life cases, we compared the accuracy of differential diagnoses obtained from individual commercial LLMs (OpenAI GPT-4, Google PaLM 2, Cohere Command, Meta Llama 2) with the accuracy of differential diagnoses synthesized by aggregating responses from combinations of the same LLMs. Results: Aggregating responses from multiple different LLMs leads to more accurate differential diagnoses (average accuracy for 3 LLMs: $75.3\% \pm 1.6$ pp) than those produced by single LLMs (average accuracy for single LLMs: $59.0\% \pm 6.1$ pp). Discussion: Using collective intelligence methods to synthesize differential diagnoses from the responses of different LLMs achieves two of the steps necessary for advancing acceptance of LLMs as a diagnostic support tool: (1) demonstrating high diagnostic accuracy and (2) eliminating dependence on a single commercial vendor.
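The abstract does not spell out how the individual responses are combined. As an illustration only, the following sketch shows one common way to merge ranked differential diagnoses: each diagnosis earns a score of 1/rank from every model that lists it, and the merged list is ordered by total score. The reciprocal-rank weighting, the function name `aggregate_differentials`, the `top_k` parameter, and the example diagnoses are all assumptions made for this sketch, not details taken from the paper.

```python
from collections import defaultdict


def aggregate_differentials(differentials, top_k=5):
    """Merge several ranked diagnosis lists into one ranked list.

    Hypothetical helper: each diagnosis receives 1/rank from every list
    that contains it (an assumed weighting, not the paper's stated rule).
    """
    scores = defaultdict(float)
    for ranked in differentials:
        seen = set()
        for rank, diagnosis in enumerate(ranked, start=1):
            key = diagnosis.strip().lower()  # naive normalization of names
            if key in seen:
                continue  # count each diagnosis once per model
            seen.add(key)
            scores[key] += 1.0 / rank
    # Highest score first; ties are broken alphabetically for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [diagnosis for diagnosis, _ in ordered[:top_k]]


# Made-up example responses from three hypothetical models.
model_a = ["Pneumonia", "Bronchitis", "Pulmonary embolism"]
model_b = ["Pulmonary embolism", "Pneumonia", "Heart failure"]
model_c = ["Pneumonia", "Heart failure", "Bronchitis"]

combined = aggregate_differentials([model_a, model_b, model_c], top_k=2)
print(combined)
```

In this made-up example "pneumonia" scores 1 + 1/2 + 1 = 2.5 and "pulmonary embolism" scores 1/3 + 1 ≈ 1.33, so they lead the merged list. A real pipeline would also need to map synonymous diagnosis names to a common term before scoring; the lowercase comparison here is only a placeholder for that step.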
