Adapting Open-Source Large Language Models for Cost-Effective, Expert-Level Clinical Note Generation with On-Policy Reinforcement Learning (2405.00715v4)
Abstract: Proprietary LLMs such as GPT-4 and Gemini have demonstrated promising capabilities in clinical text summarization tasks. However, due to patient data privacy concerns and computational costs, many healthcare providers prefer using small, locally hosted models over external generic LLMs. This study presents a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13 billion parameter model, enabling it to generate high-quality clinical notes from outpatient patient-doctor dialogues. Our process incorporates continued pre-training, supervised fine-tuning, and reinforcement learning from both AI and human feedback. We introduce a new approach, DistillDirect, for performing on-policy reinforcement learning with Gemini 1.0 Pro as the teacher model. Our resulting model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians. In a blinded physician reader study, the majority (90.4%) of individual evaluations rated the notes generated by LLaMA-Clinic as "acceptable" or higher across all three criteria: real-world readiness, completeness, and accuracy. In the more challenging "Assessment and Plan" section, LLaMA-Clinic scored higher (4.2/5) in real-world readiness than physician-authored notes (4.1/5). Our inference cost analysis shows that LLaMA-Clinic achieves a 3.75-fold cost reduction compared to an external generic LLM service. Additionally, we highlight key considerations for future clinical note-generation tasks, emphasizing the importance of pre-defining a best-practice note format rather than relying on LLMs to determine one for clinical practice. We have made our newly created synthetic clinic dialogue-note dataset and the physician feedback dataset publicly available to foster future research.
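As a rough illustration of how a teacher-distilled, on-policy preference step like DistillDirect could be wired up, below is a minimal sketch. It assumes (beyond what the abstract states) that the student's own draft note serves as the "rejected" response and a Gemini 1.0 Pro rewrite serves as the "chosen" one, with the resulting pairs optimized via direct preference optimization using HuggingFace TRL's `DPOTrainer` (the pre-`DPOConfig` API of that era). `teacher_generate` is a hypothetical stub for the Gemini API call, and all hyperparameters are placeholders, not the paper's settings.

```python
# Sketch of a DistillDirect-style on-policy preference loop (assumed mechanics,
# not the authors' released code): student drafts a note, teacher provides the
# preferred alternative, and the pairs are trained with DPO.
from datasets import Dataset
from transformers import AutoModelForCausalLM, AutoTokenizer, TrainingArguments
from trl import DPOTrainer

STUDENT = "meta-llama/Llama-2-13b-hf"  # base model named in the abstract
tokenizer = AutoTokenizer.from_pretrained(STUDENT)
policy = AutoModelForCausalLM.from_pretrained(STUDENT)
ref_model = AutoModelForCausalLM.from_pretrained(STUDENT)  # frozen DPO reference

def teacher_generate(prompt: str) -> str:
    """Hypothetical wrapper around Gemini 1.0 Pro; the paper's exact teacher
    prompting and post-processing are not described in the abstract."""
    raise NotImplementedError

def build_pairs(dialogues: list[str]) -> Dataset:
    rows = []
    for prompt in dialogues:
        # Sample a draft note from the *current* policy: this is what makes
        # the preference data on-policy rather than a static offline dataset.
        inputs = tokenizer(prompt, return_tensors="pt")
        out = policy.generate(**inputs, max_new_tokens=512, do_sample=True)
        draft = tokenizer.decode(out[0][inputs.input_ids.shape[1]:],
                                 skip_special_tokens=True)
        rows.append({
            "prompt": prompt,
            "chosen": teacher_generate(prompt),  # teacher note is preferred
            "rejected": draft,                   # student's own draft
        })
    return Dataset.from_list(rows)

pairs = build_pairs(["<dialogue transcript 1>", "<dialogue transcript 2>"])
trainer = DPOTrainer(
    model=policy,
    ref_model=ref_model,
    beta=0.1,  # illustrative KL-penalty strength
    train_dataset=pairs,
    tokenizer=tokenizer,
    args=TrainingArguments(
        output_dir="llama-clinic-dpo",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        remove_unused_columns=False,  # DPOTrainer tokenizes prompt/chosen/rejected itself
    ),
)
trainer.train()
```

Drawing the rejected sample from the current policy at each round, rather than from a fixed corpus, is the distinguishing on-policy element the abstract emphasizes; everything else above is a conventional DPO setup.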
Authors: Hanyin Wang, Chufan Gao, Bolun Liu, Qiping Xu, Guleid Hussein, Mohamad El Labban, Kingsley Iheasirim, Hariprasad Korsapati, Jimeng Sun, Chuck Outcalt