ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction? (2411.06469v1)

Published 10 Nov 2024 in cs.CL

Abstract: LLMs hold great promise to revolutionize current clinical systems for their superior capacities on medical text processing tasks and medical licensing exams. Meanwhile, traditional ML models such as SVM and XGBoost have still been mainly adopted in clinical prediction tasks. An emerging question is Can LLMs beat traditional ML models in clinical prediction? Thus, we build a new benchmark ClinicalBench to comprehensively study the clinical predictive modeling capacities of both general-purpose and medical LLMs, and compare them with traditional ML models. ClinicalBench embraces three common clinical prediction tasks, two databases, 14 general-purpose LLMs, 8 medical LLMs, and 11 traditional ML models. Through extensive empirical investigation, we discover that both general-purpose and medical LLMs, even with different model scales, diverse prompting or fine-tuning strategies, still cannot beat traditional ML models in clinical prediction yet, shedding light on their potential deficiency in clinical reasoning and decision-making. We call for caution when practitioners adopt LLMs in clinical applications. ClinicalBench can be utilized to bridge the gap between LLMs' development for healthcare and real-world clinical practice.

Summary of ClinicalBench: A Comparative Analysis of LLMs and Traditional ML Models in Clinical Prediction

The paper "ClinicalBench: Can LLMs Beat Traditional ML Models in Clinical Prediction?" introduces ClinicalBench, a novel benchmark designed to rigorously evaluate the performance of both general-purpose and medical LLMs against traditional ML models in clinical predictive tasks. Given the increasing interest in applying LLMs to healthcare, this paper critically examines and compares their efficacy in clinical prediction tasks, specifically focusing on Length-of-Stay Prediction, Mortality Prediction, and Readmission Prediction.

Research Approach

The paper employs two large, widely used databases, MIMIC-III and MIMIC-IV, which are standard resources in health data science for modeling intensive care unit (ICU) admissions. ClinicalBench incorporates 11 traditional ML models, including logistic regression, SVM, XGBoost, and neural network architectures such as MLP, Transformer, and RNN, and contrasts their performance with 22 LLMs comprising 14 general-purpose and 8 medical-specific models. The LLMs are assessed under several configurations with varying parameter scales, decoding temperatures, prompt engineering strategies, and fine-tuning approaches.
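As a rough illustration of the kind of traditional baseline ClinicalBench pits the LLMs against, the sketch below trains a TF-IDF plus logistic regression classifier on toy admission notes for a binary mortality label. The toy data, feature choice, and metric are assumptions made here for illustration only; the paper's actual experiments run on MIMIC-III and MIMIC-IV admissions.

```python
# Illustrative sketch (not the authors' pipeline): a traditional ML baseline
# of the kind ClinicalBench compares against. Toy notes and labels stand in
# for MIMIC-III/IV admissions; SVM, XGBoost, MLP, etc. would slot in the same
# way as the classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split

# Hypothetical toy data: admission notes with in-hospital mortality labels.
notes = [
    "elderly patient admitted with septic shock, intubated on arrival",
    "routine post-operative observation, stable vitals, discharged day 2",
    "acute myocardial infarction, emergent catheterization performed",
    "minor laceration repaired in the ED, kept overnight for observation",
] * 25
labels = [1, 0, 1, 0] * 25  # 1 = in-hospital mortality (toy labels)

X_train, X_test, y_train, y_test = train_test_split(
    notes, labels, test_size=0.2, random_state=0, stratify=labels
)

# TF-IDF features + logistic regression: one of the classic baselines.
vectorizer = TfidfVectorizer(max_features=5000)
clf = LogisticRegression(max_iter=1000)
clf.fit(vectorizer.fit_transform(X_train), y_train)

preds = clf.predict(vectorizer.transform(X_test))
print("macro-F1:", f1_score(y_test, preds, average="macro"))
```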

Key Findings

  1. Direct Prompting: Despite LLMs’ extensive encoded medical knowledge, the paper shows that under direct prompting both general-purpose and medical LLMs underperform traditional ML models across all three tasks; they do not reach the performance levels consistently achieved by models such as XGBoost and SVM.
  2. Prompt Engineering: More sophisticated prompting strategies, including Chain-of-Thought, Self-Reflection, and In-Context Learning, were also examined (see the prompting sketch after this list). Although In-Context Learning yielded some improvement in specific settings, especially on Length-of-Stay Prediction, these strategies generally failed to lift LLMs’ performance past the traditional models.
  3. Fine-Tuning Efficacy: Fine-tuning LLMs yielded noticeable gains on certain tasks, particularly Length-of-Stay and Mortality Prediction, but these gains were still insufficient to overtake the traditional ML methods. Models like Gemma2-9B performed better in some settings yet showed no consistent advantage over the traditional counterparts.
  4. Impact of Parameter Scale and Temperature: Scaling up LLM parameter counts and adjusting decoding temperatures produced mixed results, suggesting that while greater model capacity can help in specific instances, it does not guarantee better outcomes than traditional ML techniques in clinical prediction.
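
To make the prompting comparison concrete, the sketch below constructs a direct (zero-shot) prompt and an in-context (few-shot) prompt for Length-of-Stay Prediction. The prompt wording, label buckets, and helper functions are assumptions for illustration rather than the paper's exact templates; the resulting strings would be sent to whichever general-purpose or medical LLM is under evaluation, and the response parsed back into a label.

```python
# Illustrative sketch of two prompting setups the benchmark varies; the exact
# wording and label buckets below are assumptions, not the paper's templates.

def direct_prompt(note: str) -> str:
    """Direct (zero-shot) prompting: ask for a categorical prediction only."""
    return (
        "You are a clinical assistant. Based on the admission note below, "
        "predict the length of stay as one of: <3 days, 3-7 days, >7 days.\n\n"
        f"Note: {note}\nAnswer with the category only."
    )

def in_context_prompt(note: str, examples: list[tuple[str, str]]) -> str:
    """In-Context Learning: prepend a few labelled admissions as demonstrations."""
    demos = "\n\n".join(f"Note: {n}\nAnswer: {y}" for n, y in examples)
    return (
        "Predict the length of stay (<3 days, 3-7 days, >7 days) from the note.\n\n"
        f"{demos}\n\nNote: {note}\nAnswer:"
    )

if __name__ == "__main__":
    new_note = "45-year-old admitted for elective knee replacement, no complications"
    few_shot = [
        ("septic shock, prolonged ICU course with ventilator support", ">7 days"),
        ("overnight observation after a minor procedure, stable", "<3 days"),
    ]
    # The returned strings would be passed to the LLM under test; its reply is
    # then mapped onto the label set for scoring against the ML baselines.
    print(direct_prompt(new_note))
    print(in_context_prompt(new_note, few_shot))
```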

Implications and Considerations

The paper underscores the challenges of integrating LLMs into real-world clinical workflows: despite their strong showing on medical knowledge assessments, they remain limited in the clinical reasoning these prediction tasks require. It also highlights the need for further research to improve LLMs’ handling of the intricacies of healthcare data, possibly through advances in data synthesis and emerging digital twin technologies.

Future Research Directions

To close the performance gap between LLMs and traditional ML models in clinical prediction, future research should explore deeper integration of contextually rich clinical data during training, along with robust post-training methods tailored to healthcare. ClinicalBench offers a structured framework for testing emerging models across realistic healthcare scenarios, facilitating this ongoing work.

In conclusion, while LLMs hold potential for transforming healthcare, ClinicalBench provides crucial insight into their current capabilities and limitations, cautioning practitioners about those limitations and underscoring that traditional ML models remain the more reliable choice for clinical prediction today.

Authors (8)
  1. Canyu Chen (26 papers)
  2. Jian Yu (42 papers)
  3. Shan Chen (31 papers)
  4. Che Liu (59 papers)
  5. Zhongwei Wan (39 papers)
  6. Danielle Bitterman (11 papers)
  7. Fei Wang (573 papers)
  8. Kai Shu (88 papers)