Open (Clinical) LLMs are Sensitive to Instruction Phrasings
The paper by Ceballos Arroyo et al. tackles a significant issue in the use of LLMs within the clinical domain: the sensitivity of these models to variations in instruction phrasing. The authors conduct a comprehensive evaluation of the robustness of seven LLMs, both general-purpose and domain-specific (clinical), using instructions written by medical professionals for a range of clinical tasks.
Introduction and Motivation
The paper examines the robustness of instruction-tuned LLMs under natural variations in instruction phrasing, focusing on clinical NLP tasks. That LLMs are sensitive to how instructions are phrased is not a new discovery, but the implications are particularly concerning in healthcare, where clinician-written prompts directly affect model outputs and, potentially, patient outcomes.
Methodology
The authors designed an experimental setup encompassing ten clinical classification tasks and six information extraction tasks drawn from well-established sources such as MIMIC-III and the i2b2/n2c2 challenges. They recruited 20 medical professionals from diverse backgrounds to write prompts for each task, and instructions from 12 practitioners were ultimately used to test the robustness of the seven LLMs. The models evaluated included both general-domain models (e.g., Llama 2, Alpaca) and domain-specific clinical models (e.g., Clinical Camel, Asclepius, MedAlpaca).
The evaluation examined each model's performance across the full range of provided prompts, summarizing results as the mean and standard deviation of scores on the classification and information extraction tasks. The authors also explored how instruction phrasing affects the fairness of the models' predictions across demographic subgroups (e.g., race and sex) within the clinical tasks.
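To make this robustness summary concrete, here is a minimal sketch of the kind of aggregation described above. It is not the authors' code; the model names and per-prompt scores are illustrative placeholders for scores (e.g., F1 or AUROC) obtained under different clinician-written phrasings of the same task.

```python
import numpy as np

# Hypothetical per-prompt scores for one task: each entry is a model's score
# under one clinician-written prompt phrasing (placeholder values).
scores_by_model = {
    "general-model": [0.71, 0.68, 0.74, 0.66, 0.72],
    "clinical-model": [0.70, 0.52, 0.75, 0.48, 0.61],
}

for model, prompt_scores in scores_by_model.items():
    s = np.asarray(prompt_scores)
    # Mean captures typical performance; std and the best-worst gap capture
    # how sensitive the model is to instruction phrasing.
    print(f"{model}: mean={s.mean():.3f}, std={s.std():.3f}, "
          f"best-worst gap={s.max() - s.min():.3f}")
```

A small mean difference paired with a large standard deviation or best-worst gap is exactly the brittleness pattern the paper highlights.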
Key Findings
- Variability in Performance: The analysis revealed substantial differences in model performance across instruction phrasings for both classification and extraction tasks. Notably, the domain-specific models trained on clinical data were more brittle than their general-domain counterparts.
- Best vs. Worst Case Performance: General models such as Llama 2 (7B) performed consistently better across prompts than their clinical analogs. For instance, on the mortality prediction task, Llama 2 (13B) achieved better results than Clinical Camel while showing less variability across phrasings.
- Fairness: The authors found significant discrepancies in model performance across demographic subgroups. On the mortality prediction task, for example, performance differed by up to 0.35 AUROC points between White and Non-White patients and by up to 0.19 AUROC points between male and female patients (a sketch of this kind of subgroup comparison follows this list).
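The sketch below illustrates the kind of subgroup comparison referenced above. It uses synthetic data; the labels, scores, and group assignments are placeholders rather than values from the paper, and it simply computes AUROC per demographic group and the gap between groups.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
n = 200
y_true = rng.integers(0, 2, size=n)                     # e.g., mortality labels (synthetic)
y_score = np.clip(0.3 * y_true + rng.normal(0.4, 0.2, size=n), 0.0, 1.0)  # model scores
group = rng.choice(["white", "non_white"], size=n)      # demographic subgroup (synthetic)

# AUROC computed separately within each subgroup.
aurocs = {
    g: roc_auc_score(y_true[group == g], y_score[group == g])
    for g in np.unique(group)
}
gap = max(aurocs.values()) - min(aurocs.values())
print(aurocs)
print(f"AUROC gap between subgroups: {gap:.3f}")
```

In the paper's setting, this comparison is repeated for every prompt phrasing, so both the average gap and how much it moves with the prompt matter.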
Implications and Future Directions
The findings have significant practical and theoretical implications. Practically, the brittleness of clinical LLMs to instruction variations suggests that deploying these models in real-world settings could lead to inconsistent outcomes, influencing patient care based on arbitrary differences in prompt wording. Theoretically, the work underscores the need for more robust LLMs capable of maintaining stable performance across varying natural language instructions.
From a fairness perspective, the paper highlights the need to address demographic biases in current LLMs. Given the observed performance differences across race and sex, more research is needed to ensure equitable outcomes in clinical settings, where bias can have far-reaching consequences.
Conclusion
The research presented by Ceballos Arroyo et al. offers a rigorous analysis of how sensitive LLMs are to instruction phrasing in clinical environments. The key takeaway is the evident lack of robustness of current models to variations in phrasing, which raises concerns about their applicability in high-stakes domains such as healthcare. Future efforts should focus on improving the stability and fairness of LLMs so that performance remains consistent and reliable regardless of how instructions are worded, ultimately promoting safer and more dependable AI systems in clinical practice.