RareBench: Evaluating LLMs in Rare Disease Diagnoses
Introduction
LLMs like GPT-4 have exhibited promising capabilities in various domains, including healthcare. Given their extensive knowledge base, these models have the potential to assist in diagnosing rare diseases, a task that remains difficult because of the diseases' low prevalence and the scarcity of specialist knowledge among general practitioners. This paper introduces "RareBench," a pioneering benchmark designed to systematically evaluate the capabilities of LLMs in diagnosing rare diseases.
Methodology
RareBench is built on the largest open-source dataset of rare disease patient cases. It assesses LLMs across four dimensions: phenotype extraction from electronic health records (EHRs), screening for specific rare diseases, comparative analysis of common and rare diseases, and differential diagnosis among universal rare diseases. The paper also introduces a dynamic few-shot prompting method that draws on a comprehensive rare disease knowledge graph synthesized from multiple knowledge bases; by grounding prompts in the relationships between phenotypes and rare diseases, the approach substantially improves diagnostic performance. A minimal sketch of the idea follows.
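To make the idea concrete, here is a minimal sketch of dynamic few-shot selection under two assumptions not spelled out above: that patient phenotypes are coded as standardized terms (e.g., HPO codes) and that case similarity is an information-content-weighted overlap between the query patient and a library of previously diagnosed cases. The function names, the case-library structure, and the add-one smoothing are illustrative, not the paper's implementation; the actual method derives its weighting from the integrated knowledge graph.

```python
import math
from typing import Dict, List, Set, Tuple

def information_content(term: str, term_counts: Dict[str, int], n_cases: int) -> float:
    """IC of a phenotype term: rarer terms carry more diagnostic signal."""
    freq = term_counts.get(term, 0) + 1  # add-one smoothing for unseen terms
    return -math.log(freq / (n_cases + 1))

def case_similarity(query: Set[str], case: Set[str],
                    term_counts: Dict[str, int], n_cases: int) -> float:
    """Similarity = summed IC of the phenotype terms two patients share."""
    return sum(information_content(t, term_counts, n_cases) for t in query & case)

def build_dynamic_fewshot_prompt(query_phenotypes: Set[str],
                                 case_library: List[Tuple[Set[str], str]],
                                 term_counts: Dict[str, int],
                                 k: int = 3) -> str:
    """Pick the k diagnosed cases most similar to the query; format them as exemplars."""
    n = len(case_library)
    ranked = sorted(case_library,
                    key=lambda c: case_similarity(query_phenotypes, c[0], term_counts, n),
                    reverse=True)
    parts = ["You are a rare disease specialist. Given a patient's phenotypes, "
             "propose the most likely diagnoses."]
    for phenotypes, diagnosis in ranked[:k]:
        parts.append(f"Phenotypes: {', '.join(sorted(phenotypes))}\nDiagnosis: {diagnosis}")
    parts.append(f"Phenotypes: {', '.join(sorted(query_phenotypes))}\nDiagnosis:")
    return "\n\n".join(parts)
```

The key design choice is that the exemplars change per patient: rather than a fixed few-shot set, each query is paired with the cases whose phenotype profiles overlap it most informatively, which is what ties the prompting strategy to the knowledge graph's phenotype statistics.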
Evaluation and Results
The paper's experimental findings underscore the potential of integrating LLMs into the clinical diagnostic process for rare diseases. GPT-4, in particular, demonstrated capabilities on par with senior doctors across several specialties in the differential diagnosis of rare diseases. This result was achieved with a dynamic few-shot prompting strategy that integrates rare disease knowledge, a notable step toward applying LLMs in complex clinical scenarios.
- Phenotype Extraction: All models performed only modestly, with GPT-4 leading; accurately extracting and standardizing phenotypes from EHRs remains an open challenge.
- Screening for Specific Rare Diseases: GPT-4 achieved the highest recall, indicating it is effective at flagging risk factors and symptoms of specific rare diseases.
- Comparative Analysis: GPT-4 performed best at distinguishing rare from common diseases, demonstrating nuanced understanding and reasoning.
- Differential Diagnosis among Universal Rare Diseases: Notably, GPT-4's diagnoses, supported by dynamic few-shot prompting, compared favorably against those of a panel of specialist physicians (a sketch of a recall-based evaluation follows this list).
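The bullet points report recall without pinning down the metric, so the following is a minimal sketch of a top-k recall computation of the kind such evaluations typically use. The patient IDs, disease names, and exact-string matching are illustrative assumptions; the benchmark's precise metric definitions are not specified here.

```python
from typing import Dict, List

def recall_at_k(predictions: Dict[str, List[str]],
                gold: Dict[str, str], k: int = 10) -> float:
    """Fraction of patients whose true diagnosis appears in the model's top-k list."""
    hits = sum(1 for pid, truth in gold.items()
               if truth in predictions.get(pid, [])[:k])
    return hits / len(gold)

# Illustrative example: ranked differential lists for two hypothetical patients.
preds = {"p1": ["Fabry disease", "Gaucher disease"],
         "p2": ["Marfan syndrome", "Ehlers-Danlos syndrome"]}
gold = {"p1": "Gaucher disease", "p2": "Loeys-Dietz syndrome"}
print(recall_at_k(preds, gold, k=2))  # 0.5: p1 hit at rank 2, p2 missed
```

In practice, predictions should be normalized to canonical disease identifiers (e.g., Orphanet codes) before matching, so a model is not penalized for naming the correct disease by a synonym.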
Discussion
This evaluation yields insights with implications for broader AI applications in healthcare. RareBench offers a structured framework for rigorously assessing and refining the diagnostic acumen of LLMs on rare diseases. The results highlight GPT-4's competency, which rivals that of experienced specialists, particularly when augmented by dynamic few-shot prompts grounded in an integrated rare disease knowledge graph. This suggests a potential shift in rare disease diagnosis, in which LLMs could augment specialist knowledge and extend its reach into generalist settings.
Future Perspectives
The integration of LLMs like GPT-4 in aiding the diagnosis of rare diseases opens new avenues for research and application. Future developments could focus on:
- Enhancing LLMs' understanding and extraction of medical entities from EHRs to improve diagnostic accuracy.
- Expanding the coverage and depth of the rare disease knowledge graph to encompass a broader array of diseases and phenotypes.
- Leveraging multimodal LLMs to incorporate diagnostic imaging and laboratory results for a more holistic approach to rare disease diagnosis.
Conclusion
RareBench marks a significant step forward in evaluating the utility of LLMs for diagnosing rare diseases. Its findings pave the way for future work on AI-assisted diagnostics, with the promise of narrowing the gap in medical expertise and improving outcomes for patients with rare diseases.