Assessing The Potential Of Mid-Sized Language Models For Clinical QA (2404.15894v1)

Published 24 Apr 2024 in cs.CL and cs.AI

Abstract: LLMs, such as GPT-4 and Med-PaLM, have shown impressive performance on clinical tasks; however, they require access to compute, are closed-source, and cannot be deployed on device. Mid-size models such as BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B avoid these drawbacks, but their capacity for clinical tasks has been understudied. To help assess their potential for clinical use and help researchers decide which model they should use, we compare their performance on two clinical question-answering (QA) tasks: MedQA and consumer query answering. We find that Mistral 7B is the best performing model, winning on all benchmarks and outperforming models trained specifically for the biomedical domain. While Mistral 7B's MedQA score of 63.0% approaches the original Med-PaLM, and it often can produce plausible responses to consumer health queries, room for improvement still exists. This study provides the first head-to-head assessment of open source mid-sized models on clinical tasks.

Assessing the Efficacy of Mid-Sized LLMs on Clinical QA Tasks

Introduction

The use of LLMs in healthcare has attracted significant interest due to their promising applications in clinical question-answering (QA) tasks. The field has traditionally been dominated by large-scale models such as GPT-4 and Med-PaLM, which offer impressive capabilities but pose challenges including heavy computational demands, closed-source architectures, and unsuitability for on-device deployment. This paper evaluates four mid-sized models, BioGPT-large, BioMedLM, LLaMA 2, and Mistral 7B, on healthcare QA tasks to gauge their utility in clinical contexts without the constraints posed by larger models.

Evaluation Setup and Methods

Clinical QA Datasets

The models were rigorously tested on two primary QA benchmarks:

  • MedQA: Focuses on USMLE-style multiple-choice questions assessing medical knowledge and clinical decision-making (a prompt-format sketch follows this list).
  • MultiMedQA Long Form Answering: Requires models to generate paragraph-length responses to consumer health queries that simulate real-world questions a patient might ask.

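To make the MedQA setting concrete, the sketch below shows one way a USMLE-style item could be rendered into a prompt for a causal language model. This is an illustration rather than the authors' exact prompt format; the helper function and the example vignette are invented.

```python
# Hypothetical sketch: formatting a MedQA-style multiple-choice item as a prompt
# for a causal language model. The clinical vignette below is invented.
def format_medqa_prompt(question: str, options: dict[str, str]) -> str:
    """Render a USMLE-style stem plus lettered options into a single prompt string."""
    lines = [f"Question: {question}"]
    for letter, text in sorted(options.items()):
        lines.append(f"{letter}. {text}")
    lines.append("Answer:")
    return "\n".join(lines)

example = format_medqa_prompt(
    question=("A 55-year-old man presents with crushing substernal chest pain. "
              "Which initial test is most appropriate?"),
    options={"A": "Electrocardiogram", "B": "Chest X-ray",
             "C": "Echocardiogram", "D": "Exercise stress test"},
)
print(example)
```
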
Model Training and Tuning

All models underwent fine-tuning on specific clinical datasets:

  • MedQA Training: Models were tuned to select the correct answer from multiple-choice options given a clinical prompt. Fine-tuning was kept uniform across all models to maintain comparability (a minimal fine-tuning sketch follows this list).
  • MultiMedQA Training: Because no dedicated training data exists for this task, a new dataset was curated from online medical resources by translating medical content into a question-response format suitable for fine-tuning.

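The paper's exact training code is not reproduced here; the following is a minimal sketch, assuming the HuggingFace Transformers Trainer API and a standard causal-language-modeling objective, of how a mid-sized model such as Mistral 7B might be fine-tuned on prompt/answer pairs. The dataset path, sequence length, and hyperparameters are illustrative placeholders, not the authors' settings.

```python
# Minimal supervised fine-tuning sketch (not the authors' code) for a mid-sized
# causal LM on prompt/answer pairs, using HuggingFace Transformers and Datasets.
from datasets import load_dataset
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "mistralai/Mistral-7B-v0.1"           # any mid-sized causal LM
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token          # Mistral defines no pad token
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical JSONL file with {"prompt": ..., "answer": ...} records.
raw = load_dataset("json", data_files="medqa_train.jsonl", split="train")

def tokenize(example):
    text = example["prompt"] + " " + example["answer"] + tokenizer.eos_token
    return tokenizer(text, truncation=True, max_length=1024)

tokenized = raw.map(tokenize, remove_columns=raw.column_names)

args = TrainingArguments(
    output_dir="medqa-finetune",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=16,
    num_train_epochs=3,
    learning_rate=2e-5,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    # mlm=False gives a plain next-token-prediction objective.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```
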
Results and Performance

MedQA Task Performance

The best-performing model on the MedQA task was Mistral 7B, which reached 63.0% after additional training on an expanded dataset. This compares favorably to dedicated biomedical models trained exclusively on domain-specific data, yet remains well below the scores achieved by the largest models such as GPT-4.

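The 63.0% figure is a multiple-choice accuracy. As a sketch of how such a score can be computed from raw generations, the toy code below extracts the predicted option letter from each output and compares it to the gold label; the record schema and the answer-extraction heuristic are assumptions, not the paper's evaluation code.

```python
# Hypothetical sketch: scoring MedQA-style multiple-choice predictions. The record
# schema, answer-extraction heuristic, and toy data are assumptions for illustration.
import re

def extract_choice(generation: str) -> str:
    """Return the first standalone A-E option letter in a model generation."""
    match = re.search(r"\b([A-E])\b", generation.upper())
    return match.group(1) if match else ""

def medqa_accuracy(records: list[dict]) -> float:
    correct = sum(extract_choice(r["generation"]) == r["gold"] for r in records)
    return correct / len(records)

toy = [{"generation": "B. Chest X-ray", "gold": "B"},
       {"generation": "The answer is A", "gold": "C"}]
print(f"accuracy = {medqa_accuracy(toy):.1%}")   # 50.0% on this toy input
```
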
MultiMedQA Long Form Task Performance

The evaluation involved a detailed clinician review of generated responses across several metrics:

  • Completeness and Medical Accuracy: Mistral 7B demonstrated the highest competence, often generating the most comprehensive and medically appropriate responses.
  • Safety Metrics: Responses were assessed for potential harm and likelihood of error; although mid-sized models such as Mistral 7B performed well, they still fall short of leading models such as Med-PaLM 2 (a toy rating-aggregation sketch follows this list).

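As an illustration only (the paper reports clinician review results, not code), the sketch below shows how per-axis ratings from multiple clinicians could be aggregated per model. The rating scale, field names, and numbers are invented.

```python
# Hypothetical sketch: aggregating clinician ratings of long-form answers along
# axes like those used in the study. All values below are invented.
from collections import defaultdict
from statistics import mean

ratings = [  # one record per (response, clinician) pair; 1 = worst, 5 = best
    {"model": "Mistral 7B", "completeness": 4, "accuracy": 4, "harm_free": 5},
    {"model": "Mistral 7B", "completeness": 3, "accuracy": 5, "harm_free": 5},
    {"model": "BioMedLM",   "completeness": 2, "accuracy": 3, "harm_free": 4},
]

by_model = defaultdict(list)
for r in ratings:
    by_model[r["model"]].append(r)

for model, recs in by_model.items():
    summary = {axis: mean(rec[axis] for rec in recs)
               for axis in ("completeness", "accuracy", "harm_free")}
    print(model, summary)
```
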
Discussion

Despite Mistral 7B's leading performance among mid-sized models, the results underscore the gap that remains relative to the largest models. The findings suggest that while mid-sized models offer practical alternatives with lower environmental and economic cost, they do not yet match the performance of models with tens of billions of parameters or more.

Conclusion and Future Directions

The paper establishes an important benchmark for mid-sized models on clinical QA tasks, suggesting that they hold potential but require further advances. Future work could investigate incorporating a biomedical domain focus during pretraining, applying advanced training strategies such as reinforcement learning, and scaling models further within computational feasibility.

This analysis leaves the field poised for continued innovation, where the accessibility of open-source, moderately scaled models may democratize advanced AI tools in clinical settings, providing substantial utility while managing cost and computational overhead.

Authors (7)
  1. Elliot Bolton (2 papers)
  2. Betty Xiong (5 papers)
  3. Vijaytha Muralidharan (2 papers)
  4. Joel Schamroth (1 paper)
  5. Vivek Muralidharan (4 papers)
  6. Christopher D. Manning (169 papers)
  7. Roxana Daneshjou (19 papers)