RareBench: Evaluating LLMs in Rare Disease Diagnoses
Introduction
LLMs like GPT-4 have exhibited promising capabilities in various domains, including healthcare. Given their extensive knowledge base, these models have the potential to assist in diagnosing rare diseases, a task that remains difficult because of the diseases' low prevalence and the scarcity of specialist knowledge among general practitioners. This paper introduces "RareBench," a pioneering benchmark designed to systematically evaluate the capabilities of LLMs in diagnosing rare diseases.
Methodology
RareBench is built on the largest open-source dataset of rare disease patient cases. It assesses LLMs across four dimensions: phenotype extraction from electronic health records (EHRs), screening for specific rare diseases, comparative analysis of common and rare diseases, and differential diagnosis among universal rare diseases. The paper also introduces a dynamic few-shot prompting method that draws on a comprehensive rare disease knowledge graph synthesized from multiple knowledge bases; by grounding prompts in the relationships between phenotypes and rare diseases, the approach substantially improves diagnostic performance. A minimal sketch of the idea follows.
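To make the idea concrete, here is a minimal sketch of dynamic few-shot selection under two assumptions not spelled out above: that patient phenotypes are coded as standardized terms (e.g., HPO codes) and that case similarity is an information-content-weighted overlap between the query patient and a library of previously diagnosed cases. The function names, the case-library structure, and the add-one smoothing are illustrative, not the paper's implementation; the actual method derives its weighting from the integrated knowledge graph.

```python
import math
from typing import Dict, List, Set, Tuple

def information_content(term: str, term_counts: Dict[str, int], n_cases: int) -> float:
    """IC of a phenotype term: rarer terms carry more diagnostic signal."""
    freq = term_counts.get(term, 0) + 1  # add-one smoothing for unseen terms
    return -math.log(freq / (n_cases + 1))

def case_similarity(query: Set[str], case: Set[str],
                    term_counts: Dict[str, int], n_cases: int) -> float:
    """Similarity = summed IC of the phenotype terms two patients share."""
    return sum(information_content(t, term_counts, n_cases) for t in query & case)

def build_dynamic_fewshot_prompt(query_phenotypes: Set[str],
                                 case_library: List[Tuple[Set[str], str]],
                                 term_counts: Dict[str, int],
                                 k: int = 3) -> str:
    """Pick the k diagnosed cases most similar to the query; format them as exemplars."""
    n = len(case_library)
    ranked = sorted(case_library,
                    key=lambda c: case_similarity(query_phenotypes, c[0], term_counts, n),
                    reverse=True)
    parts = ["You are a rare disease specialist. Given a patient's phenotypes, "
             "propose the most likely diagnoses."]
    for phenotypes, diagnosis in ranked[:k]:
        parts.append(f"Phenotypes: {', '.join(sorted(phenotypes))}\nDiagnosis: {diagnosis}")
    parts.append(f"Phenotypes: {', '.join(sorted(query_phenotypes))}\nDiagnosis:")
    return "\n\n".join(parts)
```

The key design choice is that the exemplars change per patient: rather than a fixed few-shot set, each query is paired with the cases whose phenotype profiles overlap it most informatively, which is what ties the prompting strategy to the knowledge graph's phenotype statistics.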
Evaluation and Results
The paper's experimental findings underscore the potential of integrating LLMs into the clinical diagnostic process for rare diseases. GPT-4, in particular, demonstrated capabilities on par with senior doctors across several specialties in the differential diagnosis of rare diseases. This result was achieved with a dynamic few-shot prompting strategy that integrates rare disease knowledge, a notable step toward applying LLMs in complex clinical scenarios.
- Phenotype Extraction: All models performed only modestly, with GPT-4 leading; accurately extracting and standardizing phenotypes from EHRs remains an open challenge.
- Screening for Specific Rare Diseases: GPT-4 achieved the highest recall, indicating it is effective at flagging risk factors and symptoms of specific rare diseases.
- Comparative Analysis: GPT-4 performed best at distinguishing rare from common diseases, demonstrating nuanced understanding and reasoning.
- Differential Diagnosis among Universal Rare Diseases: Notably, GPT-4's diagnoses, supported by dynamic few-shot prompting, compared favorably against those of a panel of specialist physicians (a sketch of a recall-based evaluation follows this list).
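The bullet points report recall without pinning down the metric, so the following is a minimal sketch of a top-k recall computation of the kind such evaluations typically use. The patient IDs, disease names, and exact-string matching are illustrative assumptions; the benchmark's precise metric definitions are not specified here.

```python
from typing import Dict, List

def recall_at_k(predictions: Dict[str, List[str]],
                gold: Dict[str, str], k: int = 10) -> float:
    """Fraction of patients whose true diagnosis appears in the model's top-k list."""
    hits = sum(1 for pid, truth in gold.items()
               if truth in predictions.get(pid, [])[:k])
    return hits / len(gold)

# Illustrative example: ranked differential lists for two hypothetical patients.
preds = {"p1": ["Fabry disease", "Gaucher disease"],
         "p2": ["Marfan syndrome", "Ehlers-Danlos syndrome"]}
gold = {"p1": "Gaucher disease", "p2": "Loeys-Dietz syndrome"}
print(recall_at_k(preds, gold, k=2))  # 0.5: p1 hit at rank 2, p2 missed
```

In practice, predictions should be normalized to canonical disease identifiers (e.g., Orphanet codes) before matching, so a model is not penalized for naming the correct disease by a synonym.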
Discussion
This evaluation yields insights with implications for broader AI applications in healthcare. RareBench offers a structured framework for rigorously assessing and refining the diagnostic acumen of LLMs on rare diseases. The results highlight GPT-4's competency, which rivals that of experienced specialists, particularly when augmented by dynamic few-shot prompts grounded in an integrated rare disease knowledge graph. This suggests a potential shift in rare disease diagnosis, in which LLMs could augment specialist knowledge and extend its reach into generalist settings.
Future Perspectives
The integration of LLMs like GPT-4 in aiding the diagnosis of rare diseases opens new avenues for research and application. Future developments could focus on:
- Enhancing LLMs' understanding and extraction of medical entities from EHRs to improve diagnostic accuracy.
- Expanding the coverage and depth of the rare disease knowledge graph to encompass a broader array of diseases and phenotypes.
- Leveraging multimodal LLMs to incorporate diagnostic imaging and laboratory results for a more holistic approach to rare disease diagnosis.
Conclusion
RareBench marks a significant step forward in evaluating the utility of LLMs for diagnosing rare diseases. Its findings pave the way for future work on AI-assisted diagnostics, with the promise of narrowing the gap in medical expertise and improving outcomes for patients with rare diseases.