
Medical Adaptation of Large Language and Vision-Language Models: Are We Making Progress? (2411.04118v2)

Published 6 Nov 2024 in cs.CL, cs.AI, and cs.LG

Abstract: Several recent works seek to develop foundation models specifically for medical applications, adapting general-purpose large language models (LLMs) and vision-language models (VLMs) via continued pretraining on publicly available biomedical corpora. These works typically claim that such domain-adaptive pretraining (DAPT) improves performance on downstream medical tasks, such as answering medical licensing exam questions. In this paper, we compare seven public "medical" LLMs and two VLMs against their corresponding base models, arriving at a different conclusion: all medical VLMs and nearly all medical LLMs fail to consistently improve over their base models in the zero-/few-shot prompting regime for medical question-answering (QA) tasks. For instance, across the tasks and model pairs we consider in the 3-shot setting, medical LLMs only outperform their base models in 12.1% of cases, reach a (statistical) tie in 49.8% of cases, and are significantly worse than their base models in the remaining 38.2% of cases. Our conclusions are based on (i) comparing each medical model head-to-head, directly against the corresponding base model; (ii) optimizing the prompts for each model separately; and (iii) accounting for statistical uncertainty in comparisons. While these basic practices are not consistently adopted in the literature, our ablations show that they substantially impact conclusions. Our findings suggest that state-of-the-art general-domain models may already exhibit strong medical knowledge and reasoning capabilities, and offer recommendations to strengthen the conclusions of future studies.

Evaluation of Medical Adaptation in LLMs and VLMs

The paper "Medical Adaptation of Large Language and Vision-LLMs: Are We Making Progress?" provides a critical analysis of the current state of domain-adaptive pretraining (DAPT) for developing medical-specific LLMs and vision-LLMs (VLMs). The authors compare several medically adapted models to their general-domain counterparts, revealing that the effectiveness of DAPT might not be as substantial as previously claimed.

Methodology Overview

The authors evaluate seven medical LLMs and two medical VLMs against their corresponding base models in head-to-head comparisons. The models are tested across a range of medical question-answering (QA) tasks drawn from datasets such as MedQA, MedMCQA, PubMedQA, and MMLU-Medical. To ensure a fair comparison, the prompt format and the choice of few-shot examples are optimized independently for each model, and statistical testing is used to account for uncertainty in the resulting accuracy differences.
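To make the comparison protocol concrete, the sketch below shows one common way to test a paired accuracy difference between a medical model and its base model: a paired bootstrap over per-question correctness on a shared QA benchmark. This is a minimal illustration under assumed inputs (0/1 correctness arrays), not the paper's actual code; the authors' statistical procedure may differ in detail.

```python
import numpy as np

def paired_bootstrap_ci(correct_med, correct_base, n_boot=10_000, seed=0):
    """Paired bootstrap over per-question correctness (0/1 arrays) for a
    medical model and its base model on the same QA benchmark.

    Returns the observed accuracy difference (medical - base) and a 95%
    confidence interval on that difference. Illustrative helper only; the
    paper's exact test may differ.
    """
    rng = np.random.default_rng(seed)
    correct_med = np.asarray(correct_med, dtype=float)
    correct_base = np.asarray(correct_base, dtype=float)
    n = len(correct_med)

    observed_diff = correct_med.mean() - correct_base.mean()
    diffs = np.empty(n_boot)
    for b in range(n_boot):
        idx = rng.integers(0, n, size=n)  # resample questions with replacement
        diffs[b] = correct_med[idx].mean() - correct_base[idx].mean()

    lo, hi = np.percentile(diffs, [2.5, 97.5])  # 95% CI on the accuracy gap
    return observed_diff, (lo, hi)
```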

Key Findings

The paper reports that, across the evaluated tasks, medical adaptation provides little to no improvement in zero- and few-shot prompting. In the 3-shot setting, medical LLMs outperform their base models in only 12.1% of the task-model comparisons, statistically tie with them in 49.8%, and are significantly worse in the remaining 38.2%. The medical VLMs show no consistent, statistically significant improvements over their base models.
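Given a confidence interval on the accuracy gap for every (model pair, task) comparison, a win/tie/loss breakdown like the one above follows from a simple tally. The snippet below is a hypothetical illustration of that aggregation, assuming 95% intervals on the per-comparison accuracy difference; it is not the paper's analysis code.

```python
def tally_outcomes(ci_by_comparison):
    """Classify each (model pair, task) comparison as a win, tie, or loss for
    the medical model from a 95% CI on (medical accuracy - base accuracy),
    then report each outcome's share as a percentage."""
    outcomes = {"win": 0, "tie": 0, "loss": 0}
    for lo, hi in ci_by_comparison:
        if lo > 0:
            outcomes["win"] += 1    # CI entirely above zero: medical model better
        elif hi < 0:
            outcomes["loss"] += 1   # CI entirely below zero: medical model worse
        else:
            outcomes["tie"] += 1    # CI straddles zero: statistical tie
    total = sum(outcomes.values())
    return {k: 100.0 * v / total for k, v in outcomes.items()}

# Example with three comparisons: one clear win, one tie, one clear loss.
print(tally_outcomes([(0.01, 0.05), (-0.02, 0.03), (-0.06, -0.01)]))
```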

Implications

These findings highlight important implications for the field:

  • Model Training and Pretraining Corpora: General-domain models are typically pretrained on corpora that already include publicly available biomedical text such as PubMed, so they may already possess substantial medical knowledge; continued pretraining on additional medical corpora can therefore be largely redundant.
  • Evaluation Practices: The paper emphasizes the need for rigorous experimental setups that account for LLM/VLM sensitivity to prompt formats and for the statistical significance of performance differences; without these safeguards, the impact of medical adaptation is easily overestimated. A minimal illustration of per-model prompt selection appears after this list.
  • Healthcare Applications: From a practical standpoint, the results suggest that practitioners may be able to rely on strong general-domain models with careful prompt engineering, rather than investing in extensive domain-specific adaptation, for the kinds of QA tasks studied here.
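As a rough illustration of the prompt-optimization practice mentioned above, the sketch below selects the best prompt format for each model independently on a validation split. The format names and accuracies are made up for demonstration and do not come from the paper.

```python
def pick_best_prompt(val_accuracy_by_format):
    """Return the prompt format with the highest validation accuracy for ONE
    model. Illustrates optimizing prompts independently per model rather than
    reusing a single format for both; names and numbers are hypothetical."""
    return max(val_accuracy_by_format, key=val_accuracy_by_format.get)

# The medical model and its base model may each prefer a different format.
medical_val = {"format_A": 0.61, "format_B": 0.66, "format_C": 0.63}
base_val    = {"format_A": 0.68, "format_B": 0.64, "format_C": 0.65}

print(pick_best_prompt(medical_val))  # -> format_B
print(pick_best_prompt(base_val))     # -> format_A
```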

Future Directions

The paper implies several avenues for future research:

  1. Broader Task Evaluation: Future analyses could extend beyond classical medical QA tasks to include diverse clinical tasks such as medical report generation, diagnostic assistance, and personalized patient interaction to determine contexts where medical DAPT might show benefits.
  2. Exploring Fine-Tuning Benefits: The potential advantage of DAPT might lie in better initialization for domain-specific fine-tuning rather than zero- or few-shot performance; this aspect warrants further exploration.
  3. In-Depth Analysis of Model Architectures: A deeper understanding of how different architectures benefit (or not) from DAPT could guide more efficient model designs and task-specific adaptations.

Conclusion

This paper offers a valuable critique of the assumed advantages of DAPT in medical contexts, urging refinements in how researchers claim and evaluate improvements on specialized NLP tasks. Rigorous evaluation frameworks, together with a careful accounting of what general-domain models can already do, are essential for responsible progress in the development of medically oriented AI models.

Authors (4)
  1. Daniel P. Jeong
  2. Saurabh Garg
  3. Michael Oberst
  4. Zachary C. Lipton