Large Language Models in the Clinic: A Comprehensive Benchmark (2405.00716v4)

Published 25 Apr 2024 in cs.CL and cs.AI

Abstract: The adoption of LLMs to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs. The benchmark data is available at https://github.com/AI-in-Health/ClinicBench.

Authors (19)
  1. Hongjian Zhou
  2. Yining Hua
  3. Omid Rohanian
  4. Lei Clifton
  5. David A. Clifton
  6. Fenglin Liu
  7. Zheng Li
  8. Qingyu Yin
  9. Jingfeng Yang
  10. Xianfeng Tang
  11. Chen Luo
  12. Ming Zeng
  13. Haoming Jiang
  14. Yifan Gao
  15. Priyanka Nigam
  16. Sreyashi Nag
  17. Bing Yin
  18. Xuan Zhou
  19. Anshul Thakur
Citations (4)