Adapting Open-Source Large Language Models for Cost-Effective, Expert-Level Clinical Note Generation with On-Policy Reinforcement Learning (2405.00715v4)

Published 25 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Proprietary LLMs such as GPT-4 and Gemini have demonstrated promising capabilities in clinical text summarization tasks. However, due to patient data privacy concerns and computational costs, many healthcare providers prefer using small, locally-hosted models over external generic LLMs. This study presents a comprehensive domain- and task-specific adaptation process for the open-source LLaMA-2 13 billion parameter model, enabling it to generate high-quality clinical notes from outpatient patient-doctor dialogues. Our process incorporates continued pre-training, supervised fine-tuning, and reinforcement learning from both AI and human feedback. We introduce a new approach, DistillDirect, for performing on-policy reinforcement learning with Gemini 1.0 Pro as the teacher model. Our resulting model, LLaMA-Clinic, can generate clinical notes comparable in quality to those authored by physicians. In a blinded physician reader study, the majority (90.4%) of individual evaluations rated the notes generated by LLaMA-Clinic as "acceptable" or higher across all three criteria: real-world readiness, completeness, and accuracy. In the more challenging "Assessment and Plan" section, LLaMA-Clinic scored higher (4.2/5) in real-world readiness than physician-authored notes (4.1/5). Our cost analysis for inference shows that our LLaMA-Clinic model achieves a 3.75-fold cost reduction compared to an external generic LLM service. Additionally, we highlight key considerations for future clinical note-generation tasks, emphasizing the importance of pre-defining a best-practice note format, rather than relying on LLMs to determine this for clinical practice. We have made our newly created synthetic clinic dialogue-note dataset and the physician feedback dataset publicly available to foster future research.
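
The abstract describes DistillDirect as on-policy reinforcement learning with Gemini 1.0 Pro acting as the teacher. As an illustrative sketch only (not the authors' released implementation), the snippet below assumes a DPO-style preference objective in which each training pair contrasts a clinical note sampled from the student policy itself (treated as "rejected") with the teacher's note for the same dialogue (treated as "chosen"); all function and variable names here are hypothetical.

```python
# Hypothetical sketch of a DistillDirect-style preference step (an assumption,
# not the paper's code): the student's own on-policy draft note is "rejected"
# and the teacher (Gemini 1.0 Pro) note for the same dialogue is "chosen".
import torch
import torch.nn.functional as F


def completion_logprob(model, tokenizer, prompt: str, completion: str) -> torch.Tensor:
    """Sum of token log-probabilities of `completion` conditioned on `prompt`.
    Approximation: assumes tokenizing prompt + completion keeps the prompt tokens
    as an unchanged prefix."""
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    logits = model(ids).logits[:, :-1, :]                  # logits predicting token t+1
    targets = ids[:, 1:]
    token_logps = torch.log_softmax(logits, dim=-1).gather(
        -1, targets.unsqueeze(-1)).squeeze(-1)
    return token_logps[:, prompt_len - 1:].sum()            # keep only completion tokens


def distilldirect_dpo_loss(policy, reference, tokenizer, dialogue: str,
                           teacher_note: str, student_note: str,
                           beta: float = 0.1) -> torch.Tensor:
    """DPO objective with the teacher note preferred over the on-policy student note."""
    with torch.no_grad():                                    # frozen reference (SFT) model
        ref_chosen = completion_logprob(reference, tokenizer, dialogue, teacher_note)
        ref_rejected = completion_logprob(reference, tokenizer, dialogue, student_note)
    pol_chosen = completion_logprob(policy, tokenizer, dialogue, teacher_note)
    pol_rejected = completion_logprob(policy, tokenizer, dialogue, student_note)
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    return -F.logsigmoid(beta * margin)
```

In this reading, `policy` would be the supervised-fine-tuned LLaMA-2 13B student, `reference` a frozen copy of it, and training would loop over dialogue-note pairs, regenerating the "rejected" notes from the current policy so the data stays on-policy; these specifics are assumptions made for illustration.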

Authors (10)
  1. Hanyin Wang (13 papers)
  2. Chufan Gao (14 papers)
  3. Bolun Liu (4 papers)
  4. Qiping Xu (2 papers)
  5. Guleid Hussein (2 papers)
  6. Mohamad El Labban (2 papers)
  7. Kingsley Iheasirim (3 papers)
  8. Hariprasad Korsapati (3 papers)
  9. Jimeng Sun (181 papers)
  10. Chuck Outcalt (1 paper)