Rare Disease Diagnosis Agentic System
- Rare disease diagnosis agentic systems are advanced frameworks that integrate AI, modular agents, and clinical repositories to enable precise diagnostic reasoning.
- DeepRare employs a modular architecture with a central LLM host, specialized servers, and external data sources to aggregate multimodal information.
- The system’s self-reflective workflow and evidence-linked reasoning provide clinicians with transparent, validated diagnostic insights across diverse cases.
A rare disease diagnosis agentic system is a computational framework designed to facilitate the accurate and efficient diagnosis of rare diseases by integrating advanced AI, long-term structured memory, modular domain-specific agents, and up-to-date multimodal clinical knowledge. Such systems address the fundamental challenges associated with the clinical heterogeneity, low prevalence, and knowledge sparsity characteristic of rare conditions. DeepRare, as introduced in (2506.20430), operationalizes these principles in a fully modular, scalable, and interpretable architecture able to provide transparent diagnostic reasoning for a wide spectrum of rare disorders.
1. System Architecture and Agents
DeepRare is structured as a modular, agent-oriented platform that separates the diagnostic process into orchestrated subtasks across three main components:
- Central Host with Long-Term Memory: This core coordinating entity is powered by an LLM and maintains a persistent system context, orchestrating the diagnostic workflow by managing case information, assigning subtasks, synthesizing results, and iteratively refining diagnostic hypotheses.
- Specialized Agent Servers: Over 40 agent servers handle domain-specific analytics including phenotype extraction (normalizing free-text and structured sources to HPO terms), disease name normalization, patient similarity search, retrieval-augmented evidence collection (including web-scale databases, literature, clinical guidelines, and curated case libraries), and bioinformatics analysis for both phenotypic and genotypic data.
- External Data Sources: Integration spans web-scale repositories and primary biomedical resources such as OMIM, Orphanet, Human Phenotype Ontology (HPO), PubMed, gnomAD, ClinVar, and others, ensuring access to the most current clinical information.
Interaction Workflow: When a user submits a case—including free-text, structured HPO terms, and/or genomic variant files—the central host decomposes the problem, dynamically delegates subtasks to agents, aggregates all returned evidence into memory, and produces an iteratively refined set of diagnostic hypotheses accompanied by transparent justification.
This architecture supports high scalability, modularity, and continual extensibility for new data types, knowledge sources, or analytical tasks.
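The host-and-agents pattern described above can be illustrated with a minimal sketch. Everything here is hypothetical and chosen for illustration only: the `Evidence` and `Host` classes, the task names, and the toy keyword lookup are not DeepRare's actual interfaces, and a real phenotype agent would perform full HPO normalization rather than a one-entry lookup.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Evidence:
    source: str   # name of the agent that produced this item
    content: str  # short textual finding


# Hypothetical agent signature: takes case text, returns evidence items.
Agent = Callable[[str], List[Evidence]]


def toy_phenotype_agent(case_text: str) -> List[Evidence]:
    """Toy keyword lookup standing in for real HPO normalization."""
    lookup = {"seizure": "HP:0001250"}  # illustrative single-entry table
    text = case_text.lower()
    return [Evidence("phenotype_extraction", hpo)
            for word, hpo in lookup.items() if word in text]


@dataclass
class Host:
    agents: Dict[str, Agent]                               # task name -> agent
    memory: List[Evidence] = field(default_factory=list)   # accumulated evidence

    def run(self, case_text: str, tasks: List[str]) -> List[Evidence]:
        """Delegate each task to its agent and store the results in memory."""
        for task in tasks:
            agent = self.agents.get(task)
            if agent is None:
                continue  # no agent registered for this task
            self.memory.extend(agent(case_text))
        return self.memory


host = Host(agents={"phenotype_extraction": toy_phenotype_agent})
collected = host.run("Infant with recurrent seizures",
                     ["phenotype_extraction", "variant_annotation"])
```

Keeping agents behind a plain name-to-callable registry is one simple way to get the extensibility the text describes: adding a new data source means registering one more entry, without touching the host loop.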
2. Multimodal Diagnostic Reasoning and Algorithms
DeepRare’s diagnostic process is designed for heterogeneous (multimodal) input and implements a staged, self-reflective workflow:
- Information Collection:
  - Phenotype Handling: Free-text is standardized to HPO using both LLM-based and BioLORD embedding-based approaches. Structured features are expanded to include synonyms and semantically similar HPO concepts.
  - Genotype Handling: If present, VCF files are annotated and integrated with phenotype features using Exomiser, gnomAD, ClinVar, and related resources. Variant prioritization is contextually merged with the patient phenotype.
  - Evidence Gathering: Agents retrieve relevant evidence via web-scale search (MedCPT, PubCaseFinder, PhenoBrain, similar case embedding, etc.) and aggregate all results into a structured long-term memory module.
- Tentative Diagnosis:
  - The central host (LLM) proposes an initial ranked diagnostic list by synthesizing all phenotypic, genotypic, and literature-informed evidence.
- Self-Reflective Diagnosis:
  - Each candidate disease is normalized, and agent servers conduct targeted evidence retrieval. The diagnostic hypothesis is recursively challenged, reviewed against all collected supporting information, and revised as necessary until a confidence threshold is reached.
  - The output of this process is a ranked disease list together with the corresponding chain of reasoning for each candidate.
- Traceable Reasoning Generation:
  - Each diagnostic hypothesis is accompanied by a stepwise explanation linking every inference to explicit in-system references (literature, curated cases, tool outputs), ensuring a fully transparent and auditable diagnostic rationale.
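The self-reflective stage can be sketched as a bounded loop that retrieves evidence per candidate, scores it, and asks for new hypotheses when nothing clears a threshold. This is a simplified illustration under stated assumptions: the function name, the `retrieve`, `score`, and `propose` callables, the threshold of 0.7, and the round limit are all hypothetical, since the source does not specify how confidence is computed or when the loop stops.

```python
from typing import Callable, List, Tuple

Ranked = List[Tuple[str, float, List[str]]]  # (disease, confidence, evidence)


def self_reflective_diagnosis(
    candidates: List[str],
    retrieve: Callable[[str], List[str]],      # targeted evidence lookup
    score: Callable[[str, List[str]], float],  # confidence in [0, 1]
    propose: Callable[[List[str]], List[str]], # new hypotheses after rejection
    threshold: float = 0.7,                    # assumed value, not from the source
    max_rounds: int = 3,                       # assumed bound on reflection rounds
) -> Ranked:
    accepted: Ranked = []
    for _ in range(max_rounds):
        accepted = []
        for disease in candidates:
            evidence = retrieve(disease)
            confidence = score(disease, evidence)
            if confidence >= threshold:
                accepted.append((disease, confidence, evidence))
        if accepted:
            break  # at least one hypothesis survived the review
        candidates = propose(candidates)  # all rejected: request new candidates
    return sorted(accepted, key=lambda item: item[1], reverse=True)


# Toy stand-ins for the agents and the host LLM.
evidence_db = {"Disease A": ["case report 1"],
               "Disease B": ["guideline 2", "paper 3"]}
result = self_reflective_diagnosis(
    ["Disease A"],
    retrieve=lambda d: evidence_db.get(d, []),
    score=lambda d, ev: min(1.0, 0.4 * len(ev)),
    propose=lambda rejected: ["Disease B"],
)
```

In this toy run, "Disease A" is rejected in the first round because a single evidence item scores 0.4, and the replacement candidate "Disease B" is accepted in the second round. Each returned tuple carries its evidence list, which is what makes the final ranking traceable.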
Mathematical and Algorithmic Details
- Recall@k Evaluation:
$\text{Recall@}k = \frac{\text{Number of cases with the correct diagnosis in the top } k}{\text{Total number of cases}}$
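- Phenotype Similarity Search (via embedding/cosine similarity): candidates are ranked by the cosine similarity between embedding vectors, $\text{sim}(\mathbf{u}, \mathbf{v}) = \frac{\mathbf{u} \cdot \mathbf{v}}{\lVert \mathbf{u} \rVert \, \lVert \mathbf{v} \rVert}$, where $\mathbf{u}$ and $\mathbf{v}$ denote the two embeddings being compared (symbols chosen here for illustration).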
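Embedding-based similarity search of this kind can be sketched in a few lines. The example below is an illustration only: the two-dimensional vectors and case identifiers are made up, and in practice the embeddings would come from a text encoder rather than being written by hand.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k_similar(query: Sequence[float],
                  library: Dict[str, List[float]],
                  k: int = 3) -> List[Tuple[str, float]]:
    """Return the k library entries most similar to the query embedding."""
    scored = [(case_id, cosine_similarity(query, vec))
              for case_id, vec in library.items()]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]


# Hand-written toy embeddings (hypothetical case identifiers).
library = {"case_a": [1.0, 0.0], "case_b": [0.0, 1.0], "case_c": [1.0, 1.0]}
matches = top_k_similar([1.0, 0.0], library, k=2)
```

Here `matches` ranks `case_a` first (similarity 1.0) and `case_c` second (about 0.707), while the orthogonal `case_b` is dropped.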
3. Traceable Reasoning and Evidence Synthesis
A defining feature of DeepRare is its emphasis on traceable, evidence-linked diagnostic reasoning:
- Evidence-Linked Explanation: Every diagnostic suggestion contains a summary of the logical/clinical reasoning, with all assertions hyperlinked to supporting references, which include literature entries (PubMed links), bioinformatics tool results (e.g., variant annotation), prior case reports, and database records.
- Structured Output Example:
    ## DISEASE NAME (Rank #N)
    ### Diagnostic Reasoning:
    - Symptom match (reference [1]).
    - Relevant gene variant (reference [2]).
    - Case match (reference [3]).
    - Pathophysiology (OMIM reference [4]).
    ## References
    [1] PubMed: ...
    [2] Exomiser result: ...
- No Hallucinated Citations: The system prohibits fabricated references, constraining LLM outputs to verified records retrieved during the analysis pipeline.
- Expert Validation: Manual review yielded 95.4% agreement between clinical experts and DeepRare's references and inferences across sampled cases, supporting the system's factuality and reliability.
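One simple way to enforce a constraint like "no fabricated references" is to check every citation marker in the generated reasoning against the set of records actually retrieved during the run. The sketch below assumes bracketed numeric markers such as `[1]` and a dictionary of retrieved records keyed by reference number; both the function names and this data layout are hypothetical, and the source does not describe how DeepRare implements the check.

```python
import re
from typing import Dict, List, Set


def cited_markers(reasoning: str) -> Set[int]:
    """Collect every bracketed numeric marker, e.g. [3], in the text."""
    return {int(number) for number in re.findall(r"\[(\d+)\]", reasoning)}


def unverified_citations(reasoning: str, retrieved: Dict[int, str]) -> List[int]:
    """Return cited reference numbers that have no retrieved record."""
    return sorted(n for n in cited_markers(reasoning) if n not in retrieved)


reasoning = (
    "- Symptom match (reference [1]).\n"
    "- Relevant gene variant (reference [2]).\n"
    "- Case match (reference [3])."
)
# Records gathered during the pipeline (illustrative placeholders).
retrieved = {1: "PubMed record", 2: "Exomiser result"}
missing = unverified_citations(reasoning, retrieved)
```

In this example `missing` is `[3]`, so a report generator could reject or regenerate the third claim instead of emitting a reference that was never retrieved.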
4. Performance Evaluation and Benchmarking
DeepRare demonstrates robust performance across eight benchmark datasets encompassing 6,401 cases and 2,919 rare diseases:
- Recall@1 (Phenotype Only): 57.18% (average; best LLM setting), compared to the next-best Reasoning LLM at 33.39%.
- Recall@1 (HPO + Gene): 70.60% on a set of 109 exome-supported cases (Xinhua Hospital), compared to 53.20% for Exomiser.
- Disease-Level Accuracy: The correct diagnosis is ranked first in 100% of cases for 1,013 of the 2,919 tracked diseases.
- Robustness and Modular Advantage: Outperforms 15 bioinformatics, LLM, and agentic system baselines across all specialties and sites. Ablation studies confirm that the agentic design and tool aggregation yield up to +28% accuracy gains over single-model approaches.
- Traceability: Manual checking by experts found over 95% citation accuracy and evidence alignment.
| Dataset | DeepRare Recall@1 | Best Baseline Recall@1 | Δ (percentage points) |
|---|---|---|---|
| RareBench MME | 70.0% | 40.0% | +30.0 |
| MyGene2 | 74.0% | 39.7% | +34.3 |
| MIMIC-IV-Rare | 29.2% | 14.6% | +14.6 |
| Xinhua Hospital | 58.3% | 43.3% | +15.0 |
Performance is consistently strong across specialties, data sources, and multi-modal (phenotype + genotype) cases.
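For readers who want to reproduce this kind of metric on their own data, Recall@k can be computed as the share of cases whose correct diagnosis appears among the top k ranked predictions. The sketch below uses made-up disease labels and is not tied to any of the benchmark datasets; the function name and input layout are assumptions for illustration.

```python
from typing import Sequence


def recall_at_k(ranked_predictions: Sequence[Sequence[str]],
                truths: Sequence[str],
                k: int) -> float:
    """Fraction of cases whose true diagnosis is in the top-k predictions."""
    if len(ranked_predictions) != len(truths):
        raise ValueError("predictions and truths must have the same length")
    if not truths:
        return 0.0  # convention chosen here for an empty evaluation set
    hits = sum(1 for preds, truth in zip(ranked_predictions, truths)
               if truth in preds[:k])
    return hits / len(truths)


# Four toy cases, each with a ranked list of three candidate diagnoses.
predictions = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"], ["D", "E", "F"]]
truths = ["A", "A", "A", "A"]
r1 = recall_at_k(predictions, truths, k=1)
r3 = recall_at_k(predictions, truths, k=3)
```

With these toy inputs, `r1` is 0.25 (only the first case ranks the truth first) and `r3` is 0.75 (the last case never lists the true diagnosis).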
5. Clinical Application and Implementation
DeepRare is implemented as a secure, clinician-facing web platform (http://raredx.cn/doctor), designed for real-world deployment in hospital settings. Key operational features include:
- Multi-modal Input Support: Enables upload and structured entry of demographic data, free-text notes, laboratory results, imaging, and VCF files.
- Interactive Symptom Mapping: Automatically maps free text to HPO; supports manual correction and review.
- Explainable Output: Provides downloadable, structured diagnostic reports with reasoning chains and references.
- No Special AI Expertise Required: Designed to be used by both rare disease specialists and non-specialists, supporting clinician workflow with AI-powered diagnostic copilot features.
- Privacy and Integration: Supports local deployment for privacy-sensitive clinical environments.
6. Impact and Prospects
DeepRare establishes a new functional and epistemological benchmark for rare disease agentic systems by combining:
- Transparent, verifiable reasoning directly referenced to primary evidence.
- Modular, scalable orchestration of LLM and domain-specific agents, supporting integration of new knowledge resources, tools, and analytical modalities.
- Empirically validated performance across diverse clinical realities with strong generalizability and robustness.
- Clinical accessibility via user-friendly web apps suited for real hospital deployment.
- Support for non-specialist practitioners, narrowing the diagnostic expertise gap and helping to shorten time-to-diagnosis for rare disease patients.
A plausible implication is that agentic systems with transparent, tool-integrated reasoning pipelines could serve as foundational platforms for future AI-driven clinical support in rare and other complex disease domains, pending continued validation and responsible integration into healthcare settings.