Overview of LLMs in Medicine
The paper "A Survey of LLMs in Medicine: Progress, Application, and Challenge" provides a comprehensive review of the development, deployment, and challenges faced by LLMs in the medical domain. Considering the transformative potential of models such as GPT-4 and ChatGPT, the authors meticulously examine how these LLMs have been adapted for medical tasks, highlight their applications, and address the hurdles involved in their deployment.
Development and Structuring of Medical LLMs
The authors categorize the development of medical LLMs into three main strategies: pre-training, fine-tuning, and prompting.
- Pre-training: Models such as BioBERT and ClinicalBERT are pre-trained on large-scale medical corpora such as PubMed abstracts and MIMIC-III clinical notes, using objectives like masked language modeling (a minimal sketch of this objective follows the list). This approach aims to imbue the models with rich medical knowledge, making them suitable for specialized tasks.
- Fine-tuning: This strategy starts from an existing general-purpose LLM and refines it on medical data through techniques such as Supervised Fine-Tuning (SFT) and Instruction Fine-Tuning (IFT). Models such as MedAlpaca and ClinicalCamel exemplify this approach, training on curated medical instruction datasets for tighter domain alignment (see the instruction-formatting sketch below).
- Prompting: Methods such as zero-/few-shot prompting and Chain-of-Thought (CoT) prompting adapt models to medical contexts without any additional training, as demonstrated by Med-PaLM and MedPrompt (see the prompting sketch below).
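To make the pre-training objective concrete, here is a minimal Python sketch of how masked-language-modeling training pairs could be constructed from a medical sentence. The whitespace tokenization, the 15% masking rate, and the example sentence are illustrative assumptions following the standard BERT recipe, not the exact pipelines used by BioBERT or ClinicalBERT.

```python
import random

MASK_TOKEN = "[MASK]"
MASK_PROB = 0.15  # standard BERT masking rate; illustrative assumption

def make_mlm_example(sentence, rng=random):
    """Turn a whitespace-tokenized sentence into a masked-LM training pair.

    Returns (masked_tokens, labels), where labels hold the original token at
    masked positions and None elsewhere. Real pipelines use subword tokenizers
    and the 80/10/10 mask/replace/keep scheme; this is a simplified sketch.
    """
    tokens = sentence.split()
    masked, labels = [], []
    for tok in tokens:
        if rng.random() < MASK_PROB:
            masked.append(MASK_TOKEN)
            labels.append(tok)    # the model must recover the original token
        else:
            masked.append(tok)
            labels.append(None)   # no loss is computed at unmasked positions
    return masked, labels

if __name__ == "__main__":
    random.seed(0)
    sent = "Metformin is a first-line therapy for type 2 diabetes mellitus"
    masked, labels = make_mlm_example(sent)
    print(masked)
    print(labels)
```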
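For fine-tuning, the instruction data format matters as much as the optimizer. Below is a hedged sketch of an Alpaca-style instruction record and prompt template, the kind of format used by instruction-tuned medical models such as MedAlpaca; the field names, template text, and example record are assumptions for illustration, not the survey's exact setup.

```python
# Hypothetical instruction-fine-tuning (IFT) record: an instruction, an
# optional input, and the desired output, rendered into one training string.
RECORD = {
    "instruction": "Explain the indication for the following drug in one sentence.",
    "input": "Metformin",
    "output": "Metformin is used as first-line pharmacologic therapy for type 2 diabetes mellitus.",
}

TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def render(record: dict) -> str:
    """Render one supervised fine-tuning example as a single training string."""
    return TEMPLATE.format(**record)

if __name__ == "__main__":
    print(render(RECORD))
```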
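Finally, prompting changes only the input, not the model weights. The sketch below contrasts a zero-shot prompt with a chain-of-thought prompt for a hypothetical medical multiple-choice question; the templates and the question are invented for illustration, and the actual model call is deliberately left abstract.

```python
# Illustrative prompt templates contrasting zero-shot and chain-of-thought
# (CoT) prompting for a hypothetical medical multiple-choice question.

QUESTION = (
    "A 55-year-old man with crushing chest pain radiating to the left arm "
    "most likely has which condition?\n"
    "A) Pulmonary embolism  B) Myocardial infarction  "
    "C) Pericarditis  D) Aortic dissection"
)

def zero_shot_prompt(question: str) -> str:
    # The model answers directly, with no examples and no extra training.
    return f"Answer the following medical question with a single letter.\n\n{question}\nAnswer:"

def cot_prompt(question: str) -> str:
    # CoT prompting asks the model to reason step by step before committing
    # to an answer, which helps on multi-step clinical reasoning.
    return (
        "Answer the following medical question. First think step by step "
        "about the key clinical findings, then give the final letter.\n\n"
        f"{question}\nLet's think step by step:"
    )

if __name__ == "__main__":
    print(zero_shot_prompt(QUESTION))
    print("---")
    print(cot_prompt(QUESTION))
```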
Evaluation on Medical Tasks
The authors evaluate LLMs on a spectrum of discriminative and generative tasks:
- Discriminative Tasks: These include Question Answering, Entity Extraction, and Relation Extraction, all of which benefit from the contextual understanding of LLMs. Notably, GPT-4 shows strong performance on medical QA, often surpassing fine-tuned task-specific models (a minimal scoring sketch follows this list).
- Generative Tasks: Tasks such as Text Summarization and Text Generation demonstrate the models' ability to produce coherent, clinically relevant text, significantly aiding clinical report generation.
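As a concrete picture of how the QA results above are typically scored, the snippet below computes plain accuracy over multiple-choice items. The two example items and the dummy_predict stub are hypothetical stand-ins for a real benchmark such as MedQA and a real model call.

```python
from typing import Callable

# Hypothetical MedQA-style items: a question, lettered options, and a gold key.
ITEMS = [
    {"question": "First-line therapy for type 2 diabetes?",
     "options": {"A": "Insulin", "B": "Metformin", "C": "Sulfonylurea"},
     "answer": "B"},
    {"question": "Most common cause of community-acquired pneumonia?",
     "options": {"A": "S. pneumoniae", "B": "E. coli", "C": "P. aeruginosa"},
     "answer": "A"},
]

def accuracy(items, predict: Callable[[dict], str]) -> float:
    """Fraction of items where the predicted option letter matches the gold key."""
    correct = sum(1 for item in items if predict(item) == item["answer"])
    return correct / len(items)

def dummy_predict(item: dict) -> str:
    # Stand-in for an LLM call that returns an option letter.
    return "B"

if __name__ == "__main__":
    print(f"accuracy = {accuracy(ITEMS, dummy_predict):.2%}")
```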
Key Numerical and Performance Insights
The paper highlights that GPT-4 achieves 86.5% accuracy on the MedQA (USMLE) benchmark, closely approaching the reported human-expert level of 87.0%. However, on non-QA tasks, traditional fine-tuned models often retain a performance edge.
Challenges and Barriers
Several challenges in deploying medical LLMs are addressed:
- Hallucination: The risk of generating plausible but inaccurate medical information calls for mitigation strategies such as reinforcement learning with factual-consistency rewards (an illustrative sketch follows this list).
- Data Limitations: The constrained availability of domain-specific data hinders comprehensive model training and evaluation.
- Ethical and Safety Concerns: These include data privacy, leakage of personally identifiable information (PII), and the broader ethical implications of relying on AI in critical healthcare settings.
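To illustrate the kind of mitigation the hallucination bullet refers to, the sketch below scores a generated statement against reference evidence, producing a value that could serve as part of a reward signal during reinforcement learning. The content-word-overlap scorer is a deliberately crude stand-in for the entailment or retrieval-based checkers used in practice, and it does not reproduce any specific method from the survey.

```python
def factual_consistency_reward(generated: str, evidence: str) -> float:
    """Crude factual-consistency score in [0, 1].

    Measures how much of the generated content is supported by the evidence
    via content-word overlap. In an RL setup this score would contribute to
    the reward, penalizing unsupported claims; real systems would replace it
    with an entailment model or a retrieval-grounded check.
    """
    stop = {"the", "a", "an", "of", "for", "is", "are", "in", "as", "and", "to", "with"}
    gen_words = {w.lower().strip(".,") for w in generated.split()} - stop
    ev_words = {w.lower().strip(".,") for w in evidence.split()} - stop
    if not gen_words:
        return 0.0
    return len(gen_words & ev_words) / len(gen_words)

if __name__ == "__main__":
    evidence = "Metformin is recommended as first-line pharmacologic therapy for type 2 diabetes."
    supported = "Metformin is first-line therapy for type 2 diabetes."
    hallucinated = "Metformin cures type 1 diabetes in children."
    print(factual_consistency_reward(supported, evidence))     # higher reward: claim supported by evidence
    print(factual_consistency_reward(hallucinated, evidence))  # lower reward: unsupported claim
```

In a realistic pipeline such a factuality reward would be combined with fluency and task rewards so the model is not pushed toward trivially copying the evidence.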
Future Directions
To advance the integration of LLMs in medicine, the paper calls for new benchmarks that better evaluate models' clinical competence. It also advocates multimodal inputs, combining text with images and other data modalities to enrich model outputs. Finally, it stresses interdisciplinary collaboration as essential for reflecting real-world clinical scenarios and mitigating deployment risks.
Conclusion
This survey underscores the immense promise of LLMs in transforming medical practice but simultaneously cautions about the complexities involved. By addressing the outlined challenges, LLMs can be effectively harnessed to augment medical research and healthcare delivery, promoting significant societal benefits. This paper is an essential resource for researchers aiming to navigate the cutting-edge intersections of AI and medicine.