
Does Biomedical Training Lead to Better Medical Performance? (2404.04067v4)

Published 5 Apr 2024 in cs.CL, cs.AI, and cs.LG

Abstract: LLMs are expected to significantly contribute to patient care, diagnostics, and administrative processes. Emerging biomedical LLMs aim to address healthcare-specific challenges, including privacy demands and computational constraints. Assessing the models' suitability for this sensitive application area is of the utmost importance. However, biomedical training has not been systematically evaluated on medical tasks. This study investigates the effect of biomedical training in the context of six practical medical tasks evaluating 25 models. In contrast to previous evaluations, our results reveal a performance decline in nine out of twelve biomedical models after fine-tuning, particularly on tasks involving hallucinations, ICD10 coding, and instruction adherence. General-domain models like Meta-Llama-3.1-70B-Instruct outperformed their biomedical counterparts, indicating a trade-off between domain-specific fine-tuning and general medical task performance. We open-source all evaluation scripts and datasets at https://github.com/TIO-IKIM/CLUE to support further research in this critical area.

Authors (7)
  1. Amin Dada (9 papers)
  2. Marie Bauer (3 papers)
  3. Amanda Butler Contreras (2 papers)
  4. Osman Alperen Koraş (5 papers)
  5. Constantin Marc Seibold (2 papers)
  6. Kaleb E Smith (14 papers)
  7. Jens Kleesiek (80 papers)
Citations (3)