A Comprehensive Survey on Evaluating Large Language Model Applications in the Medical Industry (2404.15777v4)
Abstract: Since the introduction of the Transformer architecture in 2017, large language models (LLMs) such as GPT and BERT have evolved rapidly, and their advanced capabilities in language understanding and generation are reshaping many industries. These models show particular promise for transforming the medical field, which makes specialized evaluation frameworks essential for their effective and ethical deployment. This comprehensive survey delineates the applications of LLMs in healthcare and the evaluation they require, emphasizing the need for empirical validation if their potential to improve healthcare outcomes is to be fully realized. The survey provides an in-depth analysis of LLM applications across clinical settings, medical text data processing, research, education, and public health awareness. We begin by examining the roles LLMs play in these applications and how they are evaluated on tasks such as clinical diagnosis, medical text data processing, information retrieval, data analysis, and educational content generation. Subsequent sections discuss the evaluation methods and metrics employed, including models, evaluators, and comparative experiments. We then examine the benchmarks and datasets used in these evaluations, categorizing them into benchmarks for question answering, summarization, information extraction, bioinformatics, and information retrieval, alongside general comprehensive benchmarks. This structure supports a thorough understanding of how LLMs are assessed for effectiveness, accuracy, usability, and ethical alignment in the medical domain. ...
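To make the evaluation workflow described above more concrete, the sketch below illustrates one of the simplest metrics reported in the surveyed studies: accuracy on multiple-choice medical question answering, in the spirit of MedQA-style benchmarks. This is a minimal, hypothetical example: the items and the `answer_question` stub are placeholders invented for illustration, not drawn from the survey or any real benchmark, and an actual evaluation would query the LLM under test in place of the stub.

```python
from __future__ import annotations

from typing import Callable

# Hypothetical evaluation items in a multiple-choice medical QA format
# (placeholders for illustration only, not from any real benchmark).
ITEMS = [
    {
        "question": "Deficiency of which vitamin causes scurvy?",
        "options": {"A": "Vitamin A", "B": "Vitamin C", "C": "Vitamin D", "D": "Vitamin K"},
        "answer": "B",
    },
    {
        "question": "Which organ produces insulin?",
        "options": {"A": "Liver", "B": "Spleen", "C": "Pancreas", "D": "Kidney"},
        "answer": "C",
    },
]


def answer_question(question: str, options: dict[str, str]) -> str:
    """Placeholder for the LLM under test; a real study would call the model here."""
    return "A"  # deliberately naive baseline, for demonstration only


def accuracy(model: Callable[[str, dict[str, str]], str], items: list[dict]) -> float:
    """Fraction of items where the model's chosen option letter matches the answer key."""
    correct = sum(model(item["question"], item["options"]) == item["answer"] for item in items)
    return correct / len(items)


if __name__ == "__main__":
    print(f"Accuracy: {accuracy(answer_question, ITEMS):.2f}")
```

In practice, the studies covered by the survey complement such automatic scoring with human evaluators and comparative experiments between models, as the abstract notes.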
Authors: Yining Huang, Keke Tang, Meilian Chen, Boyuan Wang