MedREQAL: Examining Medical Knowledge Recall of LLMs via Question Answering
"MedREQAL: Examining Medical Knowledge Recall of LLMs via Question Answering" provides a comprehensive investigation into the proficiency of various LLMs in recalling and generating accurate medical knowledge. This paper, authored by Juraj Vladika, Phillip Schneider, and Florian Matthes from the Technical University of Munich, evaluates the performance of LLMs using a novel dataset constructed from systematic reviews, which are recognized for synthesizing high-quality evidence-based conclusions in the field of healthcare.
Introduction
The research was motivated by the impressive yet under-explored capability of LLMs to encode domain-specific knowledge during pre-training on extensive text corpora. Given the growing potential and application of LLMs in healthcare, it is imperative to scrutinize the quality and recall effectiveness of the medical knowledge embedded within these models. Systematic reviews serve as an optimal source for this assessment due to their structured and rigorous approach to addressing clinical questions with synthesized evidence.
MedREQAL Dataset
The MedREQAL dataset was constructed from systematic reviews, specifically those conducted by the Cochrane Collaboration, a prominent organization dedicated to evidence-based healthcare. The dataset consists of 2,786 question-answer pairs, derived by automatically generating questions from the objectives of the systematic reviews and extracting conclusions as answers. Furthermore, a classification label (supported, refuted, or not enough information) was assigned to each question-answer pair to facilitate a structured evaluation.
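To make the task format concrete, the sketch below shows what a single entry in such a dataset might look like. The field names, the placeholder question, and the verdict are illustrative assumptions only, not the dataset's actual schema or content.

```python
# Hypothetical illustration of a MedREQAL-style entry (not the real schema).
# Each record pairs a clinical question derived from a review's objective with
# the review's conclusion and a verdict label used for the classification task.
example_entry = {
    "question": "Does intervention X reduce the risk of outcome Y in adults?",
    "answer": (
        "The available evidence is insufficient to determine whether "
        "intervention X reduces the risk of outcome Y in adults."
    ),
    "label": "NOT ENOUGH INFORMATION",  # or "SUPPORTED" / "REFUTED"
}
```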
Methodology
The evaluation involved six LLMs, covering both general-purpose and biomedical-specific models: GPT-4, Mistral-7B, Mixtral, PMC-LLaMA 13B, MedAlpaca 7B, and ChatDoctor 7B. Each model was evaluated in a zero-shot setting, generating an answer and classifying each question without any additional context. Performance was measured using classification metrics (accuracy and F1 score) and natural language generation (NLG) metrics (ROUGE and BERTScore).
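A minimal sketch of how such an evaluation could be wired up is shown below, assuming model predictions and reference answers have already been collected into parallel Python lists. This is not the authors' released code; the function name and the specific metric configuration (macro F1, ROUGE-L F-measure, BERTScore F1) are assumptions for illustration.

```python
# Sketch of a MedREQAL-style evaluation (assumed setup, not the paper's code).
from sklearn.metrics import accuracy_score, f1_score
from rouge_score import rouge_scorer
from bert_score import score as bert_score


def evaluate(pred_labels, gold_labels, pred_answers, gold_answers):
    # Classification metrics over the three verdict classes.
    acc = accuracy_score(gold_labels, pred_labels)
    macro_f1 = f1_score(gold_labels, pred_labels, average="macro")

    # ROUGE-L F-measure between generated answers and review conclusions.
    scorer = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
    rouge_l = sum(
        scorer.score(gold, pred)["rougeL"].fmeasure
        for gold, pred in zip(gold_answers, pred_answers)
    ) / len(gold_answers)

    # BERTScore F1 for semantic similarity of the generated answers.
    _, _, f1 = bert_score(pred_answers, gold_answers, lang="en")

    return {
        "accuracy": acc,
        "macro_f1": macro_f1,
        "rouge_l": rouge_l,
        "bertscore_f1": f1.mean().item(),
    }
```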
Results
The results highlighted distinct capabilities and limitations of the evaluated models. Notably, Mixtral performed best in both classification (accuracy: 62.0, F1: 34.8) and generation (ROUGE-L: 21.1, BERTScore: 85.6), surpassing the otherwise highly regarded GPT-4. The disparity in performance was attributed to some models' inclination to generate affirmative answers without sufficient evidence, as observed with GPT-4 and ChatDoctor. Mistral and Mixtral showed a stronger tendency toward cautious, evidence-based responses, which contributed to their superior performance.
Interestingly, biomedical models such as PMC-LLaMA showed an ability to refer to specific randomized controlled trials but struggled with comprehensive evidence synthesis, indicating the need for further refinement. The paper also revealed a fundamental challenge for LLMs in distinguishing between the "refuted" and "not enough information" classes: both are often expressed with similar negative phrasing, which can lead to classification errors.
Implications and Future Work
The findings from this paper have several practical and theoretical implications. The moderate success in recalling medical evidence points towards the potential integration of LLMs in clinical decision support systems, provided that the models' propensity for generating definitive answers is mitigated. This underscores the importance of continuous improvement and frequent updating of the models' knowledge base to reflect the latest scientific evidence.
Furthermore, the MedREQAL dataset itself serves as a valuable resource for advancing research in biomedical question answering, providing a benchmark for developing more sophisticated retrieval-augmented generation methodologies and multi-document summarization techniques. Future research should explore refining the in-context learning capabilities of LLMs and developing mechanisms to ensure the currency of medical knowledge encoded within these models. Addressing the challenge of distinguishing between lack of evidence and evidence refutation remains a critical area for model enhancement.
Conclusion
The MedREQAL paper presents a robust framework for evaluating the medical knowledge recall abilities of LLMs and illuminates both their strengths and limitations. By leveraging a high-quality dataset derived from systematic reviews, this paper provides a rigorous and insightful analysis, advancing our understanding of LLM capabilities in the complex domain of healthcare. The results highlight the necessity for ongoing model refinement and pave the way for future innovations in medical AI applications.