RareSeek R1 Overview
- RareSeek R1 is a specialized large language model designed for clinical reasoning in rare diseases, integrating multi-modal inputs such as EHR narratives, genomic data, and imaging findings.
- It utilizes parameter-efficient fine-tuning via LoRA and chain-of-thought learning to enhance interpretability and boost diagnostic accuracy.
- Graph-augmented retrieval from a Neo4j-based knowledge graph underpins its state-of-the-art performance, improving benchmark outcomes in complex clinical scenarios.
RareSeek R1 is a specialized LLM for clinical reasoning and diagnosis in rare diseases. It combines a domain-tailored, parameter-efficient transformer architecture with multi-modal input capacity, chain-of-thought (CoT) learning, and graph-augmented retrieval to deliver state-of-the-art interpretability and performance on real-world clinical narratives, especially for challenging rare-disease cases (Yang et al., 18 Nov 2025).
1. Model Architecture and Input Representation
RareSeek R1 builds on DeepSeek-R1-Distill-LLaMA-70B, a 70B-parameter decoder-only transformer (LLaMA-3.3 architecture) distilled from the 671B-parameter DeepSeek-R1 teacher. Parameter-efficient fine-tuning uses Low-Rank Adaptation (LoRA) modules (rank = 8, α = 32) inserted into each linear layer; the base weights are frozen and approximately 0.25B new parameters are tuned.
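To make the LoRA configuration concrete, the following minimal sketch counts the trainable parameters that rank-8 adapters would add to a set of linear layers and computes the α/r scaling factor. The layer shapes are hypothetical placeholders rather than the actual backbone dimensions, and the helper names are illustrative only.

```python
from typing import Iterable, Tuple


def lora_param_count(shapes: Iterable[Tuple[int, int]], rank: int) -> int:
    """Trainable parameters added by LoRA across a set of linear layers.

    Each (d_out, d_in) weight matrix gets two low-rank factors:
    A with shape (rank, d_in) and B with shape (d_out, rank).
    """
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)


def lora_scaling(alpha: float, rank: int) -> float:
    """Multiplier applied to the low-rank update B @ A."""
    return alpha / rank


# Hypothetical layer shapes, for illustration only.
example_shapes = [(1024, 1024), (4096, 1024), (1024, 4096)]

added = lora_param_count(example_shapes, rank=8)
scale = lora_scaling(alpha=32, rank=8)
print(added, scale)
```

With rank 8 and α = 32, the low-rank update is scaled by a factor of 4, and the number of added parameters grows linearly with the rank rather than with the product of the layer dimensions.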
Input representations are highly structured:
- EHR Narratives: Tokenized, concatenated by clinical section (chief complaint, history of present illness, family history, physical exam, specialty consults, ancillary testing).
- Phenotypes: Extracted using PhenoTagger, mapped to Human Phenotype Ontology (HPO) identifiers.
- Non-HPO features: Imaging findings, interventions/procedures, functional assessments, laboratory/pathology results, environmental exposures are marked as text spans or encoded as special tokens.
- Genomic Variants: Provided as standardized transcript/codon notation (e.g., NM_000053.4:c.715T>G), linked on-the-fly to the knowledge graph (KG) via GraphRAG.
- Graph-Grounded Retrieval: At each generation step, the model issues Cypher queries to a Neo4j-based rare-disease KG, retrieving and serializing relevant subgraphs as knowledge prompts. Unlike conventional vector retrieval, graph-indexed node retrieval is used for exactness.
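The structured inputs listed above can be pictured as a single case record that is serialized into a prompt. The sketch below is an assumption about how such a record might look: the class name, the bracketed section tags, and the simplified variant pattern (coding-sequence substitutions only) are all hypothetical and are not taken from the source.

```python
import re
from dataclasses import dataclass, field
from typing import Dict, List

# Simplified pattern for coding-sequence substitutions only (an assumption;
# real HGVS notation covers many more variant types).
HGVS_SUBSTITUTION = re.compile(r"^NM_\d+\.\d+:c\.\d+[ACGT]>[ACGT]$")

# Clinical sections in the order given in the text.
SECTION_ORDER = [
    "chief complaint",
    "history of present illness",
    "family history",
    "physical exam",
    "specialty consults",
    "ancillary testing",
]


@dataclass
class RareDiseaseCase:
    sections: Dict[str, str]
    hpo_ids: List[str] = field(default_factory=list)
    variants: List[str] = field(default_factory=list)


def build_prompt(case: RareDiseaseCase) -> str:
    """Concatenate sections in clinical order, then append HPO IDs and variants."""
    parts = [
        f"[{name.upper()}] {case.sections[name]}"
        for name in SECTION_ORDER
        if name in case.sections
    ]
    if case.hpo_ids:
        parts.append("[HPO] " + ", ".join(case.hpo_ids))
    # Keep only variants that match the simplified notation check.
    valid = [v for v in case.variants if HGVS_SUBSTITUTION.match(v)]
    if valid:
        parts.append("[VARIANTS] " + ", ".join(valid))
    return "\n".join(parts)


# Illustrative case; the text and identifiers are placeholders.
case = RareDiseaseCase(
    sections={
        "family history": "Sibling with liver disease",
        "chief complaint": "Tremor and jaundice",
    },
    hpo_ids=["HP:0001337", "HP:0000952"],
    variants=["NM_000053.4:c.715T>G", "not-a-variant"],
)
prompt = build_prompt(case)
print(prompt)
```

Ordering the sections explicitly keeps the prompt layout stable regardless of how the record was assembled, and malformed variant strings are dropped before they reach the retrieval step.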
2. Training Corpus and Knowledge Graph Integration
Three core resources are used for training:
- RareMed-Corpus: 1.49×10⁵ documents (~500M tokens), consisting of 48,852 de-identified, clinician-confirmed EHRs, 35,722 guidelines and medical texts (ChARD, NORD, Orphanet, OMIM), 30,101 PubMed case reports, and 34,666 phenotype-driven synthetic cases assembled from HPO and Orphanet (Yang et al., 18 Nov 2025).
- RareMed-CoT: 17,477 reasoning chains, initially seeded by 500 expert-annotated EHR cases and then expanded via LLM self-generation and expert curation with inter-annotator agreement checks.
- RareMed-RAG: Neo4j KG fusing ClinVar, HGMD, HPO, OMIM, Orphanet—totaling 8,200 diseases, 16,700 phenotypes, 5,200 genes, and 630,000 variants.
Graph nodes encode diseases, phenotypes, genes, and variants, with edges and frequency annotations curated from major rare-disease sources. The Graph Cypher Retriever expands 1–2-hop neighborhoods starting from phenotype/gene/variant identifiers; subgraphs are ranked by information content (IC) for phenotypes and serialized into the prompt at inference.
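The retrieval behavior described here (1–2-hop expansion, IC-based ranking, serialization into the prompt) can be sketched on a small in-memory graph. This is a minimal illustration under stated assumptions, not the actual Graph Cypher Retriever: the node names, relation labels, Cypher text, and the IC definition (negative log of the share of diseases annotated with a term) are all hypothetical.

```python
import math
from collections import deque
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str, str]  # (source, relation, target)

# Roughly equivalent Cypher for a hypothetical schema with an `id` property.
HYPOTHETICAL_CYPHER = (
    "MATCH path = (s)-[*1..2]-(n) WHERE s.id IN $seed_ids RETURN path"
)


def expand(edges: List[Edge], seed: str, max_hops: int = 2) -> List[Edge]:
    """Collect edges within max_hops of seed, treating edges as undirected."""
    adjacency: Dict[str, List[Edge]] = {}
    for edge in edges:
        adjacency.setdefault(edge[0], []).append(edge)
        adjacency.setdefault(edge[2], []).append(edge)
    seen: Set[str] = {seed}
    collected: List[Edge] = []
    queue = deque([(seed, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_hops:
            continue  # do not expand beyond the hop limit
        for edge in adjacency.get(node, []):
            if edge not in collected:
                collected.append(edge)
            neighbor = edge[2] if edge[0] == node else edge[0]
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return collected


def information_content(annotated: int, total: int) -> float:
    """IC = -log(p), with p the share of diseases annotated with the term."""
    return -math.log(annotated / total)


def retrieve(
    edges: List[Edge],
    seeds_with_counts: List[Tuple[str, int]],
    total_diseases: int,
    max_hops: int = 2,
) -> str:
    """Expand each seed, most informative first, and serialize unique edges."""
    ranked = sorted(
        seeds_with_counts,
        key=lambda item: information_content(item[1], total_diseases),
        reverse=True,
    )
    lines: List[str] = []
    for seed, _ in ranked:
        for src, rel, dst in expand(edges, seed, max_hops):
            line = f"{src} -[{rel}]-> {dst}"
            if line not in lines:
                lines.append(line)
    return "\n".join(lines)


# Toy graph with placeholder identifiers.
toy_edges = [
    ("HP:A", "PHENOTYPE_OF", "Disease X"),
    ("GENE1", "CAUSES", "Disease X"),
    ("HP:B", "PHENOTYPE_OF", "Disease Y"),
    ("GENE2", "CAUSES", "Disease Y"),
    ("VAR1", "IN_GENE", "GENE1"),
]
# HP:A annotates 2 of 100 diseases (rare, high IC); HP:B annotates 50 of 100.
prompt_block = retrieve(toy_edges, [("HP:B", 50), ("HP:A", 2)], total_diseases=100)
print(prompt_block)
```

In this toy run the rarer phenotype is expanded first, and the variant node three hops away from it is excluded by the hop limit.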
3. Instruction Tuning and Chain-of-Thought Learning
RareSeek R1 employs a three-phase progressive transfer learning regime:
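- Domain-Specific Instruction Tuning: Models are optimized with AdamW (batch size 4, 3 epochs). Each sample, an instruction-and-input prompt $x$ paired with a target response $y$, is framed as maximizing the standard autoregressive log-likelihood
$$\mathcal{L}_{\text{IT}}(\theta) = \sum_{t=1}^{|y|} \log p_\theta\left(y_t \mid x, y_{<t}\right),$$
using diverse clinical and case-based instructions.
- Chain-of-Thought Fine-Tuning: For each instance $(x, r, y)$, the model learns to generate the reasoning chain $r$ and the diagnosis $y$ by maximizing
$$\mathcal{L}_{\text{CoT}}(\theta) = \sum_{t=1}^{|z|} \log p_\theta\left(z_t \mid x, z_{<t}\right),$$
where $z = [r; y]$ is the reasoning chain followed by the diagnosis.
- GraphRAG-Augmented Reasoning: KG facts are retrieved and prepended to the prompt. Retrieval-augmented scoring combines semantic and graph-based priors: one term is the vector similarity between the query and a candidate, and the other quantifies the presence and weight of KG connections.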
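A score that combines a semantic term with a graph-based prior could be sketched as a weighted sum. The linear form, the weight `lam`, the cosine similarity, and the `kg_weight` field are assumptions made for illustration; the source does not specify the exact combination used here.

```python
import math
from typing import Dict, List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def hybrid_score(
    query_vec: Sequence[float],
    cand_vec: Sequence[float],
    kg_weight: float,
    lam: float = 0.5,
) -> float:
    """Assumed form: lam * vector similarity + (1 - lam) * KG connection weight."""
    return lam * cosine(query_vec, cand_vec) + (1.0 - lam) * kg_weight


def rank_candidates(
    query_vec: Sequence[float], candidates: List[Dict], lam: float = 0.5
) -> List[str]:
    """Return candidate names ordered from highest to lowest hybrid score."""
    scored = [
        (hybrid_score(query_vec, c["vec"], c["kg_weight"], lam), c["name"])
        for c in candidates
    ]
    return [name for _, name in sorted(scored, reverse=True)]


# Toy candidates with placeholder embeddings and KG weights in [0, 1].
query = [1.0, 0.0]
candidates = [
    {"name": "Disease A", "vec": [1.0, 0.0], "kg_weight": 0.2},
    {"name": "Disease B", "vec": [0.6, 0.8], "kg_weight": 0.9},
    {"name": "Disease C", "vec": [0.0, 1.0], "kg_weight": 0.0},
]

mixed = rank_candidates(query, candidates, lam=0.5)
semantic_only = rank_candidates(query, candidates, lam=1.0)
print(mixed, semantic_only)
```

In this toy setup the candidate with strong KG support overtakes the closest embedding match once the graph term is given weight, which is the kind of reordering a graph prior is meant to produce.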
Ablation studies show instruction tuning yields a +11.8% (EHR-Internal) and +9.6% (EHR-External) Top-1 improvement, with CoT fine-tuning yielding an additional +9.5% and +7.9% respectively (Yang et al., 18 Nov 2025).
4. Performance and Benchmarking
RareSeek R1 achieves state-of-the-art diagnostic accuracy across multiple public and internal benchmarks:
| Benchmark | Top-1 Accuracy | Top-5/Top-10 Accuracy | Notable Features |
|---|---|---|---|
| EHR-Internal (n=4,306) | 0.684 ± 0.014 | Top-5: 0.839 ± 0.011 | In-domain complex hospital records |
| EHR-External (n=283) | 0.719 ± 0.025 | | Out-of-domain hospital data |
| RareBench (n=1,197) | 0.392 ± 0.025 (vs. GPT-5 0.353) | Top-10: near-identical | Outperforms SOTA on rare-disease cases |
| MedEHR-Variant (n=147) | 0.770 (+17.0% with GraphRAG) | | KG/variant retrieval has the largest impact |
| Phenopacket-Store (n=5,213) | | Top-10: 0.910 ± 0.007 | Multi-modal cohort, post-GraphRAG |
Robustness to noisy or overlapping phenotypes is documented: Top-5 accuracy stays between 0.787 and 0.840 from sparse to rich key-phenotype cases, while traditional tools (e.g., Exomiser) register Top-5 < 0.15 across all tiers. On the 300 hardest complex-phenotype cases, Top-1 is 0.520 (Yang et al., 18 Nov 2025).
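For reference, Top-k accuracy counts a case as correct when the true diagnosis appears among the first k ranked predictions. The sketch below uses made-up labels and a hypothetical function name; it illustrates the metric and is not the evaluation code behind the reported numbers.

```python
from typing import Sequence


def top_k_accuracy(
    ranked_predictions: Sequence[Sequence[str]],
    truths: Sequence[str],
    k: int,
) -> float:
    """Fraction of cases whose true diagnosis is in the first k predictions."""
    if len(ranked_predictions) != len(truths) or not truths:
        raise ValueError("need one non-empty ranked list per ground-truth label")
    hits = sum(
        truth in preds[:k] for preds, truth in zip(ranked_predictions, truths)
    )
    return hits / len(truths)


# Four toy cases with placeholder disease labels.
ranked = [
    ["A", "B", "C"],
    ["B", "A", "C"],
    ["C", "B", "A"],
    ["A", "C", "B"],
]
truths = ["A", "A", "A", "B"]

top1 = top_k_accuracy(ranked, truths, k=1)
top2 = top_k_accuracy(ranked, truths, k=2)
top3 = top_k_accuracy(ranked, truths, k=3)
print(top1, top2, top3)
```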
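Standard diagnostic metrics (precision, recall, and $F_1$) are also reported, with the usual definitions in terms of true positives (TP), false positives (FP), and false negatives (FN):
$$\text{Precision} = \frac{TP}{TP + FP}, \qquad \text{Recall} = \frac{TP}{TP + FN}, \qquad F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}$$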
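Precision, recall, and F1 for a single diagnosis can be computed from Top-1 predictions in a one-vs-rest fashion. How the source aggregates these metrics across diseases is not stated here, so the per-label treatment and the function name below are assumptions.

```python
from typing import Sequence, Tuple


def precision_recall_f1(
    predicted: Sequence[str], actual: Sequence[str], label: str
) -> Tuple[float, float, float]:
    """One-vs-rest precision, recall, and F1 for a single diagnosis label."""
    pairs = list(zip(predicted, actual))
    tp = sum(p == label and a == label for p, a in pairs)
    fp = sum(p == label and a != label for p, a in pairs)
    fn = sum(p != label and a == label for p, a in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Toy Top-1 predictions and ground truth with placeholder labels.
predicted = ["A", "A", "B", "A", "B"]
actual = ["A", "B", "B", "A", "A"]

p, r, f1 = precision_recall_f1(predicted, actual, label="A")
print(p, r, f1)
```

For label "A" in this toy data there are two true positives, one false positive, and one false negative, so all three metrics come out to two thirds.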
5. Human Studies and Interpretability
In 110-case matched studies spanning neurologic and metabolic disorders, RareSeek R1 achieves Top-1 accuracy of 0.473 (mid-level physician parity), with GraphRAG retrieval elevating Top-1 to 0.582 (senior-attending performance range). Clinician-AI collaboration boosts junior, mid-level, and senior Top-1 accuracy by Δ=+0.169, +0.118, and +0.094, respectively.
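As a quick sanity check, the reported accuracies can be converted back to case counts, assuming each is a simple fraction of the 110 matched cases. The counts below are inferred from the rounded figures and are not reported in the source.

```python
N_CASES = 110  # size of the matched study, from the text


def implied_correct(accuracy: float, n: int = N_CASES) -> int:
    """Nearest whole number of correct cases implied by a rounded accuracy."""
    return round(accuracy * n)


baseline = implied_correct(0.473)        # model alone
with_graphrag = implied_correct(0.582)   # model with GraphRAG retrieval
extra_cases = with_graphrag - baseline

# Confirm the inferred counts round back to the reported accuracies.
roundtrip_ok = (
    round(baseline / N_CASES, 3) == 0.473
    and round(with_graphrag / N_CASES, 3) == 0.582
)
print(baseline, with_graphrag, extra_cases, roundtrip_ok)
```

Under this assumption, GraphRAG retrieval corresponds to roughly a dozen additional correct diagnoses out of 110.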
Auditable decision paths are enabled by explicit CoT reasoning chains, which closely align with established clinical guidelines. Notably, decisive non-phenotypic (non-HPO) evidence constitutes a median of 23.1% (IQR 11.8%–40.0%) of features in correct diagnoses. The breakdown by category is as follows:
| Category | Fraction (%) |
|---|---|
| Imaging findings | 26.5 |
| Interventions | 23.0 |
| Functional tests | 17.7 |
| Laboratory results | 9.1 |
| Molecular tests | 8.2 |
| Pathology | 7.2 |
| Environmental exposures | 5.4 |
This highlights the multi-modal and evidence-integrated nature of correct rare-disease diagnostic reasoning in practice (Yang et al., 18 Nov 2025).
6. Clinical Impact and Future Directions
RareSeek R1 substantiates a narrative-first, knowledge-integrated paradigm for rare disease diagnosis, significantly shortening the diagnostic odyssey and supporting clinicians with interpretable, knowledge-grounded decision support. Its integration of graph-augmented retrieval and multi-modal reasoning consistently improves both standalone and assistive diagnostic metrics, with auditability and benchmarking rigorously documented.
A plausible implication is that RareSeek R1's methodology—progressive domain tuning, explicit multi-modal CoT, and dynamic use of curated KGs—could serve as a template for future clinical LLMs tasked with similarly heterogeneous, high-stakes domains.
Ongoing and future developments may involve scaling the KG with additional population-genomic and longitudinal evidence, extending synthetic reasoning chains to rarer disease subtypes, and adopting more adaptable prompt and fine-tuning protocols to further boost clinical domain generalization and real-world deployment (Yang et al., 18 Nov 2025).