ClinicalGPT: Large Language Models Finetuned with Diverse Medical Data and Comprehensive Evaluation (2306.09968v1)

Published 16 Jun 2023 in cs.CL

Abstract: LLMs have exhibited exceptional performance on various NLP tasks, leveraging techniques such as pre-training and instruction fine-tuning. Despite these advances, their effectiveness in medical applications is limited by challenges such as factual inaccuracies, weak reasoning abilities, and a lack of grounding in real-world experience. In this study, we present ClinicalGPT, an LLM explicitly designed and optimized for clinical scenarios. By incorporating extensive and diverse real-world data, such as medical records, domain-specific knowledge, and multi-round dialogue consultations, in the training process, ClinicalGPT is better prepared to handle multiple clinical tasks. Furthermore, we introduce a comprehensive evaluation framework that includes medical knowledge question-answering, medical exams, patient consultations, and diagnostic analysis of medical records. Our results demonstrate that ClinicalGPT significantly outperforms other models on these tasks, highlighting the effectiveness of our approach in adapting LLMs to the critical domain of healthcare.

ClinicalGPT: Advancements in Medical LLMs

The paper "ClinicalGPT: LLMs Finetuned with Diverse Medical Data and Comprehensive Evaluation" introduces a specialized LLM named ClinicalGPT, aimed at addressing the unique challenges posited by medical domain applications. The paper highlights that while LLMs like BERT, GPT-3, and PALM have showcased robust performance across various NLP tasks, their effectiveness in medical contexts remains constrained by factual inaccuracies, reasoning deficits, and insufficient real-world grounding.

Methodology

ClinicalGPT is constructed on the foundation of BLOOM-7B, selected for its open-source status and multilingual support. The training of ClinicalGPT differs significantly from general-purpose LLMs in that it incorporates extensive real-world medical datasets, including cMedQA2, cMedQA-KG, MD-EHR, MEDQA-MCMLE, and MedDialog. These datasets encompass Chinese medical Q&A forums, knowledge graph-derived question-answer pairs, medical exam questions, multi-turn dialogues mimicking real doctor-patient interactions, and comprehensive electronic health records.
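
The paper does not specify its preprocessing pipeline, so the sketch below only illustrates how such heterogeneous sources could be normalized into a single instruction-response format before fine-tuning; the record layouts and converter names are hypothetical, not taken from the paper.

```python
# Minimal sketch: normalizing heterogeneous medical datasets into one
# instruction-response format for supervised fine-tuning. The record
# layouts and helper names here are hypothetical, not from the paper.
from dataclasses import dataclass

@dataclass
class TrainingExample:
    instruction: str  # task framing shown to the model
    response: str     # target text the model should produce

def from_qa(record: dict) -> TrainingExample:
    # e.g. a cMedQA2-style forum question with its accepted answer
    return TrainingExample(record["question"], record["answer"])

def from_dialogue(record: dict) -> TrainingExample:
    # e.g. a MedDialog-style multi-turn consultation: all prior turns
    # become context, the final doctor turn becomes the target
    *history, last = record["turns"]
    context = "\n".join(f"{t['role']}: {t['text']}" for t in history)
    return TrainingExample(context, last["text"])

def from_ehr(record: dict) -> TrainingExample:
    # e.g. an MD-EHR-style record: clinical findings -> diagnosis
    return TrainingExample(
        f"Patient record:\n{record['findings']}\nDiagnosis:",
        record["diagnosis"],
    )
```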

The fine-tuning process casts tasks in the unified text-to-text format popularized by T5 and combines instruction tuning via supervised fine-tuning (SFT) with parameter-efficient methods such as LoRA to improve computational efficiency. A reinforcement learning (RL) stage built around a trained reward model then refines ClinicalGPT with a human feedback loop, aligning its outputs with the expectations of medical practitioners.
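
The paper does not release training code, so the following is a minimal sketch of the SFT-plus-LoRA stage using the Hugging Face `peft` library; the base checkpoint name and all hyperparameters (rank, scaling, dropout) are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal sketch of LoRA-based supervised fine-tuning with Hugging Face
# PEFT. Hyperparameters (r, alpha, dropout) are illustrative, not the
# paper's reported configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

model_name = "bigscience/bloom-7b1"  # a BLOOM-7B base, as in the paper
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)

lora_config = LoraConfig(
    r=8,                                 # rank of the low-rank update matrices
    lora_alpha=16,                       # scaling factor for the update
    target_modules=["query_key_value"],  # BLOOM's fused attention projection
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the LoRA adapters are trainable
```

For the RL stage, reward models of this kind are typically trained with the pairwise ranking objective of reference 7, $\mathcal{L}_{\mathrm{RM}}(\theta) = -\mathbb{E}_{(x,\,y_w,\,y_l)}\left[\log \sigma\big(r_\theta(x, y_w) - r_\theta(x, y_l)\big)\right]$, where $y_w$ is the response ranked above $y_l$ for prompt $x$; the paper describes a reward model trained on ranked outputs, but this exact form is an assumption.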

Evaluation

The evaluation of ClinicalGPT was multifaceted, assessing its performance across four primary tasks: medical conversation, medical examinations, diagnosis, and medical question answering. The model's performance was benchmarked against other fine-tuned models, namely ChatGLM-6B, LLaMA-7B, and BLOOM-7B.

  1. Medical Conversation: The evaluation used BLEU, ROUGE, and GLEU to assess the quality of model-generated dialogue on MedDialog's test set. ClinicalGPT excelled on BLEU-1 and the ROUGE metrics, indicating its capability to generate comprehensive and contextually relevant responses (a minimal sketch of these metrics follows this list).
  2. Medical Examination: The model was tested using the MEDQA-MCMLE dataset across categories like ethics, respiratory, and digestive systems. ClinicalGPT surpassed comparative models with an average accuracy of 38.4%, particularly excelling in rheumatic immune diseases.
  3. Diagnosis: On the MD-EHR dataset, ClinicalGPT's diagnostic accuracy was evaluated. Notably, it achieved a commendable average accuracy of 80.9%, significantly outperforming its competitors.
  4. Medical Question Answering: Using a subset of cMedQA2, ClinicalGPT was judged by GPT-4 on accuracy, helpfulness, and safety. ClinicalGPT showed superior performance, winning the majority of pairwise comparisons against LLMs such as BLOOM-7B and LLaMA-7B.
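
As a concrete illustration of the conversation metrics above, here is a minimal sketch computing BLEU-1, GLEU, and ROUGE with NLTK and the rouge-score package; the example strings are invented and the whitespace tokenization is simplified, whereas the paper evaluates Chinese dialogue.

```python
# Minimal sketch of the surface-overlap metrics named above, computed
# with NLTK and the rouge-score package. Example strings are invented.
from nltk.translate.bleu_score import sentence_bleu, SmoothingFunction
from nltk.translate.gleu_score import sentence_gleu
from rouge_score import rouge_scorer

reference = "take ibuprofen with food to reduce stomach irritation".split()
candidate = "take ibuprofen with food to avoid stomach upset".split()

smooth = SmoothingFunction().method1
bleu1 = sentence_bleu([reference], candidate, weights=(1, 0, 0, 0),
                      smoothing_function=smooth)  # unigram-only BLEU
gleu = sentence_gleu([reference], candidate)

scorer = rouge_scorer.RougeScorer(["rouge1", "rougeL"])
rouge = scorer.score(" ".join(reference), " ".join(candidate))

print(f"BLEU-1: {bleu1:.3f}  GLEU: {gleu:.3f}")
print(f"ROUGE-1 F1: {rouge['rouge1'].fmeasure:.3f}  "
      f"ROUGE-L F1: {rouge['rougeL'].fmeasure:.3f}")
```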

Implications

ClinicalGPT’s development signifies an important step forward in integrating AI within the medical field. The model's ability to process and generate accurate, relevant medical information has practical applications in clinical decision support, patient interaction, and health data management. Theoretically, it highlights the potential for LLMs to evolve as domain-specific experts, offering reliable support in complex fields like medicine.

The success of ClinicalGPT could stimulate further research into domain-specific fine-tuning of LLMs, leveraging diverse data sources and advanced training methodologies to overcome the limitations of general-purpose models. Future developments may focus on expanding ClinicalGPT's linguistic and cultural adaptability, refining its reasoning capabilities, and integrating additional safety measures to further enhance its reliability in clinical settings.

References (23)
  1. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
  2. Aligning books and movies: Towards story-like visual explanations by watching movies and reading books. In Proceedings of the IEEE International Conference on Computer Vision, pages 19–27, 2015.
  3. Exploring the limits of transfer learning with a unified text-to-text transformer. The Journal of Machine Learning Research, 21(1):5485–5551, 2020.
  4. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020.
  5. PaLM: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022.
  6. LaMDA: Language models for dialog applications. arXiv preprint arXiv:2201.08239, 2022.
  7. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35:27730–27744, 2022.
  8. Christian Baumgartner. The potential impact of ChatGPT in clinical and translational medicine. Clinical and Translational Medicine, 13(3), 2023.
  9. Tyler Cowen. The AI revolution in medicine: GPT-4 and beyond. 2023.
  10. Foundation models for generalist medical artificial intelligence. Nature, 616(7956):259–265, 2023.
  11. On the dangers of stochastic parrots: Can language models be too big? In Proceedings of the 2021 ACM Conference on Fairness, Accountability, and Transparency, pages 610–623, 2021.
  12. Multi-scale attentive interaction networks for Chinese medical question answer selection. IEEE Access, 6:74061–74071, 2018.
  13. What disease does this patient have? A large-scale open domain question answering dataset from medical exams. Applied Sciences, 11(14):6421, 2021.
  14. MedDialog: Two large-scale medical dialogue datasets. arXiv preprint arXiv:2004.03329, 2020.
  15. WebGPT: Browser-assisted question-answering with human feedback. arXiv preprint arXiv:2112.09332, 2021.
  16. Learning to summarize with human feedback. Advances in Neural Information Processing Systems, 33:3008–3021, 2020.
  17. BLOOM: A 176B-parameter open-access multilingual language model. arXiv preprint arXiv:2211.05100, 2022.
  18. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.
  19. ZeRO: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 1–16. IEEE, 2020.
  20. GLM-130B: An open bilingual pre-trained model. arXiv preprint arXiv:2210.02414, 2022.
  21. LLaMA: Open and efficient foundation language models, 2023.
  22. BLEU: A method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002.
  23. Chin-Yew Lin. ROUGE: A package for automatic evaluation of summaries. In Text Summarization Branches Out, pages 74–81, 2004.
Authors (5)
  1. Guangyu Wang (25 papers)
  2. Guoxing Yang (11 papers)
  3. Zongxin Du (1 paper)
  4. Longjun Fan (1 paper)
  5. Xiaohu Li (26 papers)
Citations (68)