Large Language Models in the Clinic: A Comprehensive Benchmark (2405.00716v4)

Published 25 Apr 2024 in cs.CL and cs.AI

Abstract: The adoption of LLMs to assist clinicians has attracted remarkable attention. Existing works mainly adopt the close-ended question-answering (QA) task with answer options for evaluation. However, many clinical decisions involve answering open-ended questions without pre-set options. To better understand LLMs in the clinic, we construct a benchmark ClinicBench. We first collect eleven existing datasets covering diverse clinical language generation, understanding, and reasoning tasks. Furthermore, we construct six novel datasets and clinical tasks that are complex but common in real-world practice, e.g., open-ended decision-making, long document processing, and emerging drug analysis. We conduct an extensive evaluation of twenty-two LLMs under both zero-shot and few-shot settings. Finally, we invite medical experts to evaluate the clinical usefulness of LLMs. The benchmark data is available at https://github.com/AI-in-Health/ClinicBench.

Authors (19)
  1. Hongjian Zhou
  2. Yining Hua
  3. Omid Rohanian
  4. Lei Clifton
  5. David A. Clifton
  6. Fenglin Liu
  7. Zheng Li
  8. Qingyu Yin
  9. Jingfeng Yang
  10. Xianfeng Tang
  11. Chen Luo
  12. Ming Zeng
  13. Haoming Jiang
  14. Yifan Gao
  15. Priyanka Nigam
  16. Sreyashi Nag
  17. Bing Yin
  18. Xuan Zhou
  19. Anshul Thakur
Citations (4)