Summary of ClinicalBench: A Comparative Analysis of LLMs and Traditional ML Models in Clinical Prediction
The paper "ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?" introduces ClinicalBench, a novel benchmark designed to rigorously evaluate the performance of both general-purpose and medical LLMs against traditional ML models in clinical predictive tasks. Given the increasing interest in applying LLMs to healthcare, this paper critically examines and compares their efficacy in clinical prediction tasks, specifically focusing on Length-of-Stay Prediction, Mortality Prediction, and Readmission Prediction.
Research Approach
The benchmark is built on two large, widely used datasets, MIMIC-III and MIMIC-IV, which are standard resources in health data science for modeling intensive care unit (ICU) admissions. ClinicalBench covers 11 traditional ML models, including logistic regression, SVM, XGBoost, and neural architectures such as MLP, Transformer, and RNN, and compares them against 22 LLMs (14 general-purpose and 8 medical). The LLMs are assessed under a range of configurations that vary parameter scale, decoding temperature, prompt-engineering strategy, and fine-tuning approach.
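To make the comparison concrete, below is a minimal sketch of the kind of traditional baseline ClinicalBench evaluates: a logistic regression classifier for a binary outcome such as in-hospital mortality. The synthetic tabular features and labels are placeholders, not the paper's actual MIMIC-III/MIMIC-IV preprocessing pipeline.

```python
# Minimal sketch of a traditional ML baseline (logistic regression) for a binary
# clinical outcome such as in-hospital mortality. Synthetic features stand in for
# MIMIC-derived inputs; this is not ClinicalBench's actual data pipeline.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n_patients = 5_000
X = rng.normal(size=(n_patients, 20))  # hypothetical admission-level features
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n_patients) > 1.5).astype(int)  # hypothetical label

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1_000, class_weight="balanced")
clf.fit(X_train, y_train)

prob = clf.predict_proba(X_test)[:, 1]
pred = clf.predict(X_test)
print(f"AUROC:    {roc_auc_score(y_test, prob):.3f}")
print(f"Macro F1: {f1_score(y_test, pred, average='macro'):.3f}")
```

The same scaffold extends to the benchmark's other baselines (e.g., swapping in an SVM or XGBoost classifier) while keeping the train/evaluate loop unchanged.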
Key Findings
- Direct Prompting: Despite the extensive medical knowledge encoded in LLMs, direct (zero-shot) prompting leaves them behind traditional ML models on all three tasks; no LLM consistently reached the performance of baselines such as XGBoost and SVM.
- Prompt Engineering: More sophisticated prompting strategies, including Chain-of-Thought, Self-Reflection, and In-Context Learning, were also evaluated. In-Context Learning helped in some settings, notably on Length-of-Stay Prediction, but none of these strategies was enough to lift LLMs past the traditional models (see the prompting sketch after this list).
- Fine-Tuning Efficacy: Fine-tuning LLMs produced measurable gains on some tasks, particularly Length-of-Stay and Mortality Prediction, yet these gains were still not enough to overtake the conventional ML baselines. Models such as Gemma2-9B improved in some settings but did not consistently dominate their traditional counterparts (a fine-tuning sketch follows this list).
- Impact of Parameter Scale and Temperature: Scaling up parameter counts and adjusting decoding temperature had inconsistent effects, suggesting that added model capacity can help in specific instances but does not guarantee better outcomes than traditional ML techniques in clinical prediction (a decoding-temperature sketch also follows this list).
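To illustrate the prompting setups above, the sketch below contrasts direct (zero-shot) prompting with in-context learning for a mortality-style prediction. The prompt wording is hypothetical rather than ClinicalBench's exact templates, and `call_llm` is a placeholder for whatever chat-completion client is used.

```python
# Illustrative prompt construction for direct prompting vs. in-context learning.
# Wording is hypothetical, not ClinicalBench's exact templates; `call_llm` is a
# stand-in for the actual LLM client.
from typing import Callable

def build_direct_prompt(note: str) -> str:
    """Zero-shot ("direct") prompt asking for a binary mortality prediction."""
    return (
        "You are a clinical assistant. Based on the ICU admission note below, "
        "predict whether the patient will die in hospital.\n"
        f"Note: {note}\n"
        "Answer with exactly one word: Yes or No."
    )

def build_icl_prompt(note: str, examples: list[tuple[str, str]]) -> str:
    """In-context learning: prepend a few labeled examples before the query."""
    shots = "\n\n".join(
        f"Note: {ex_note}\nAnswer: {ex_label}" for ex_note, ex_label in examples
    )
    return (
        "You are a clinical assistant. Predict in-hospital mortality (Yes/No).\n\n"
        f"{shots}\n\nNote: {note}\nAnswer:"
    )

def predict(note: str, call_llm: Callable[[str], str]) -> int:
    """Map the model's free-text answer back to a binary label."""
    answer = call_llm(build_direct_prompt(note)).strip().lower()
    return 1 if answer.startswith("yes") else 0
```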
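The fine-tuning finding can likewise be made concrete. Below is a minimal sketch of adapting an LLM with parameter-efficient (LoRA) fine-tuning via the `peft` library; the model name, target modules, and hyperparameters are illustrative assumptions, not the configuration reported in the paper.

```python
# Sketch of a parameter-efficient (LoRA) fine-tuning setup for a causal LM.
# Model name and hyperparameters are assumptions for illustration only.
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-9b-it"  # placeholder; any open LLM from the study
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    lora_dropout=0.05,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # only the low-rank adapters are trainable

# Training itself (e.g., transformers.Trainer over prompt/label pairs built from
# the clinical prediction tasks) is omitted here.
```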
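Finally, the decoding-temperature comparison can be sketched with the Hugging Face `transformers` generation API; the model name and prompt below are placeholders, and this is not the benchmark's actual evaluation harness.

```python
# Sketch of varying decoding temperature at generation time. Placeholders only;
# not ClinicalBench's evaluation harness.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-9b-it"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)

prompt = "Note: <ICU admission note>\nWill the patient die in hospital? Answer Yes or No."
inputs = tokenizer(prompt, return_tensors="pt")

for temperature in (0.2, 0.7, 1.0):
    output = model.generate(
        **inputs,
        do_sample=True,            # sampling must be on for temperature to take effect
        temperature=temperature,
        max_new_tokens=5,
    )
    answer = tokenizer.decode(
        output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True
    )
    print(f"T={temperature}: {answer.strip()}")
```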
Implications and Considerations
The paper underscores the difficulty of integrating LLMs into real-world clinical workflows: despite strong results on medical knowledge assessments, they still struggle to emulate the clinical reasoning these prediction tasks demand. It nevertheless calls for further research into improving LLMs' handling of the intricacies of healthcare data, possibly through advances in data synthesis and emerging digital twin technologies.
Future Research Directions
To close the performance gap between LLMs and traditional ML models in clinical prediction, future research should explore deeper integration of contextually rich clinical data during training, along with robust post-training methods tailored to healthcare. ClinicalBench offers a structured framework for testing emerging models on realistic clinical prediction scenarios and can serve as a tool for tracking such progress.
In conclusion, while LLMs hold potential for transforming healthcare, ClinicalBench provides crucial insight into their current capabilities and limitations, cautioning practitioners against premature adoption in clinical prediction and underscoring the continued reliability of traditional ML models in these applications.