
RareSeek R1 Overview

Updated 25 November 2025
  • RareSeek R1 is a specialized large language model designed for clinical reasoning in rare diseases, integrating multi-modal inputs such as EHR narratives, genomic data, and imaging findings.
  • It utilizes parameter-efficient fine-tuning via LoRA and chain-of-thought learning to enhance interpretability and boost diagnostic accuracy.
  • Graph-augmented retrieval from a Neo4j-based knowledge graph underpins its state-of-the-art performance, improving benchmark outcomes in complex clinical scenarios.

RareSeek R1 is a specialized LLM for clinical reasoning and diagnosis in rare diseases. It combines a domain-tailored, parameter-efficient transformer architecture with multi-modal input capacity, chain-of-thought (CoT) learning, and graph-augmented retrieval to deliver state-of-the-art interpretability and performance on real-world clinical narratives, especially for challenging rare-disease cases (Yang et al., 18 Nov 2025).

1. Model Architecture and Input Representation

RareSeek R1 is based on a 70B-parameter decoder-only transformer architecture (LLaMA-3.3), distilled from a 671B-parameter teacher (DeepSeek-R1) using DeepSeek-R1-Distill-LLaMA-70B as the backbone. Parameter-efficient fine-tuning is implemented via Low-Rank Adaptation (LoRA) modules (rank=8, α=32), which are inserted into each linear layer, freezing base weights and tuning approximately 0.25B new parameters.
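
As a concrete illustration, a minimal sketch of such a LoRA setup, assuming the Hugging Face transformers and peft libraries (the source does not name the training stack, and the dropout value is illustrative):

```python
# Minimal sketch: rank-8 LoRA adapters with alpha=32 attached to the linear
# layers of a frozen DeepSeek-R1-Distill-Llama-70B backbone. The library
# choices and dropout are assumptions, not taken from the source.
from transformers import AutoModelForCausalLM
from peft import LoraConfig, get_peft_model

base = AutoModelForCausalLM.from_pretrained(
    "deepseek-ai/DeepSeek-R1-Distill-Llama-70B")  # backbone named in the text

lora_cfg = LoraConfig(
    r=8,                          # LoRA rank reported for RareSeek R1
    lora_alpha=32,                # scaling factor reported for RareSeek R1
    target_modules="all-linear",  # adapters inserted into each linear layer
    lora_dropout=0.05,            # illustrative; not stated in the source
    task_type="CAUSAL_LM",
)

model = get_peft_model(base, lora_cfg)  # base weights stay frozen
model.print_trainable_parameters()      # expect roughly 0.25B trainable parameters
```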

Input representations are highly structured:

  • EHR Narratives: Tokenized, concatenated by clinical section (chief complaint, history of present illness, family history, physical exam, specialty consults, ancillary testing).
  • Phenotypes: Extracted using PhenoTagger, mapped to Human Phenotype Ontology (HPO) identifiers.
  • Non-HPO features: Imaging findings, interventions/procedures, functional assessments, laboratory/pathology results, environmental exposures are marked as text spans or encoded as special tokens.
  • Genomic Variants: Provided as standardized transcript/codon notation (e.g., NM_000053.4:c.715T>G), linked on-the-fly to the knowledge graph (KG) via GraphRAG.
  • Graph-Grounded Retrieval: At each generation step, the model issues Cypher queries to a Neo4j-based rare-disease KG, retrieving and serializing relevant subgraphs as knowledge prompts. Unlike conventional vector retrieval (e.g., cosine scoring $r(q,d) = \frac{q \cdot d}{\|q\|\,\|d\|}$), graph-indexed node retrieval is used for exactness (a minimal retrieval sketch follows this list).
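
A minimal sketch of such graph-grounded retrieval, assuming the official neo4j Python driver and a hypothetical node/relationship schema (the overview does not publish the actual KG schema or Cypher templates):

```python
# Minimal sketch: expand 1-2-hop neighborhoods around matched HPO identifiers
# in a Neo4j rare-disease KG and serialize the hits as a knowledge prompt.
# Node labels, property names, and credentials are hypothetical.
from neo4j import GraphDatabase

driver = GraphDatabase.driver("bolt://localhost:7687", auth=("neo4j", "password"))

CYPHER = """
MATCH (p:Phenotype)-[*1..2]-(n)
WHERE p.hpo_id IN $hpo_ids
RETURN p, n
LIMIT $limit
"""

def retrieve_subgraph(hpo_ids, limit=50):
    """Return a text serialization of the 1-2-hop neighborhood of the HPO terms."""
    with driver.session() as session:
        result = session.run(CYPHER, hpo_ids=hpo_ids, limit=limit)
        facts = [f"{rec['p'].get('name')} -- related to -- {rec['n'].get('name')}"
                 for rec in result]
    return "\n".join(facts)

# Example HPO identifiers (Seizure, Status epilepticus); the resulting string is
# prepended to the model input as a knowledge prompt at generation time.
knowledge_prompt = retrieve_subgraph(["HP:0001250", "HP:0002133"])
```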

2. Training Corpus and Knowledge Graph Integration

Three core resources are used for training:

  • RareMed-Corpus: 1.49×10⁵ documents (~500M tokens), consisting of 48,852 de-identified, clinician-confirmed EHRs, 35,722 guidelines and medical texts (ChARD, NORD, Orphanet, OMIM), 30,101 PubMed case reports, and 34,666 phenotype-driven synthetic cases assembled from HPO and Orphanet (Yang et al., 18 Nov 2025).
  • RareMed-CoT: 17,477 reasoning chains, initially seeded by 500 expert-annotated EHR cases, expanded via LLM self-generation and expert curation (inter-annotator $\kappa = 0.91$).
  • RareMed-RAG: Neo4j KG fusing ClinVar, HGMD, HPO, OMIM, and Orphanet, totaling approximately 8,200 diseases, 16,700 phenotypes, 5,200 genes, and 630,000 variants.

Graph nodes encode diseases, phenotypes, genes, and variants, with edges and frequency annotations curated from major rare-disease sources. The Graph Cypher Retriever expands 1–2-hop neighborhoods starting from phenotype/gene/variant identifiers; subgraphs are ranked by information content (IC) for phenotypes and serialized into the prompt at inference.
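
As a sketch of the information-content ranking step, assuming the standard ontology definition $\mathrm{IC}(p) = -\log P(p)$, with $P(p)$ estimated from disease annotation frequencies (the counts below are illustrative):

```python
# Minimal sketch: rank retrieved phenotype seeds by information content.
# Rarer phenotypes (annotated to fewer diseases) carry more information and
# are placed earlier in the serialized knowledge prompt.
import math

def information_content(n_diseases_with_p: int, n_diseases_total: int) -> float:
    """IC(p) = -log P(p), with P(p) the fraction of diseases annotated with p."""
    return -math.log(n_diseases_with_p / n_diseases_total)

annotations = {"HP:0001250": 1200, "HP:0000365": 900}  # hypothetical counts
total_diseases = 8200                                   # approximate KG disease count
ranked = sorted(annotations,
                key=lambda hpo: information_content(annotations[hpo], total_diseases),
                reverse=True)
```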

3. Instruction Tuning and Chain-of-Thought Learning

RareSeek R1 employs a three-phase progressive transfer learning regime:

  1. Domain-Specific Instruction Tuning: Models are optimized with AdamW (lr $=10^{-4}$, batch size 4, 3 epochs). Each sample $(x_i, y_i)$ is trained by minimizing:

$$\mathcal{L}_\mathrm{inst} = -\sum_i \log P_\theta\left(y_i \mid x_i,\ \texttt{<instruction>}\right)$$

using diverse clinical and case-based instructions.
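
A minimal sketch of this objective, assuming a Hugging Face-style causal LM in which the prompt prefix is masked with label -100 so the cross-entropy covers only the response $y_i$ (the source does not name the training stack):

```python
# Minimal sketch: instruction-tuning loss as next-token cross-entropy over the
# response only. The instruction + case text (x_i) prefix is excluded from the
# labels; -100 is the ignore index used by Hugging Face causal-LM heads.
import torch

def build_labels(input_ids: torch.Tensor, prompt_len: int) -> torch.Tensor:
    """Clone the token ids and exclude the prompt prefix from the loss."""
    labels = input_ids.clone()
    labels[:, :prompt_len] = -100
    return labels

# Hypothetical usage with a peft-wrapped model and AdamW (lr=1e-4, as reported):
# optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)
# outputs = model(input_ids=input_ids, labels=build_labels(input_ids, prompt_len))
# outputs.loss.backward(); optimizer.step()   # loss corresponds to L_inst
```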

  2. Chain-of-Thought Fine-Tuning: For each instance $(x_i, r_i, y_i)$, the model learns to generate the reasoning chain $r_i$ and diagnosis $y_i$:

$$\mathcal{L}_\mathrm{CoT} = -\sum_i \sum_{t=1}^{T_i} \log P_\theta\left(c_{i,t} \mid x_i, c_{i,<t}\right)$$

where the chain-plus-diagnosis token sequence factorizes as $P(\mathrm{CoT} \mid x) = \prod_t P(c_t \mid x, c_{<t})$.
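
A minimal sketch of how a CoT training instance $(x_i, r_i, y_i)$ can be assembled, with hypothetical prompt and tag conventions (the exact template is not specified in the overview):

```python
# Minimal sketch: the supervision target concatenates the reasoning chain r_i
# and the final diagnosis y_i, so the token-level loss L_CoT runs over both.
# The <think> tag and prompt wording are illustrative assumptions.
def build_cot_example(case_text: str, reasoning_chain: str, diagnosis: str) -> dict:
    prompt = (
        "<instruction>Provide step-by-step diagnostic reasoning, "
        "then the most likely rare-disease diagnosis.</instruction>\n"
        f"{case_text}\n"
    )
    target = f"<think>{reasoning_chain}</think>\nFinal diagnosis: {diagnosis}"
    return {"prompt": prompt, "target": target}  # loss is computed on target tokens
```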

  3. GraphRAG-Augmented Reasoning: KG facts are retrieved and prepended to the prompt. Retrieval-augmented scoring combines semantic and graph-based priors:

$$\mathrm{score}(q,d) = \alpha\,\mathrm{sim}(q,d) + (1-\alpha)\,\mathrm{Prior}_\mathrm{graph}(q,d)$$

with $\mathrm{sim}(q,d)$ as vector similarity and $\mathrm{Prior}_\mathrm{graph}(q,d)$ quantifying the presence/weight of KG connections.
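
A minimal sketch of this combined score, with an illustrative value of $\alpha$ and a toy graph prior (the source specifies only the general form):

```python
# Minimal sketch: convex combination of vector similarity and a graph prior.
# The prior here is a toy normalized edge count; the real system derives it
# from curated KG connections and frequency annotations.
import numpy as np

def cosine_sim(q: np.ndarray, d: np.ndarray) -> float:
    return float(q @ d / (np.linalg.norm(q) * np.linalg.norm(d)))

def graph_prior(n_connecting_edges: int, max_edges: int = 10) -> float:
    """Normalized count of KG connections between query and candidate."""
    return min(n_connecting_edges, max_edges) / max_edges

def retrieval_score(q: np.ndarray, d: np.ndarray, n_connecting_edges: int,
                    alpha: float = 0.7) -> float:
    return alpha * cosine_sim(q, d) + (1 - alpha) * graph_prior(n_connecting_edges)
```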

Ablation studies show instruction tuning yields Top-1 improvements of +11.8% (EHR-Internal) and +9.6% (EHR-External), with CoT fine-tuning contributing an additional +9.5% and +7.9%, respectively (Yang et al., 18 Nov 2025).

4. Performance and Benchmarking

RareSeek R1 achieves state-of-the-art diagnostic accuracy across multiple public and internal benchmarks:

| Benchmark | Top-1 Accuracy | Top-5/Top-10 Accuracy | Notable Features |
|---|---|---|---|
| EHR-Internal (n=4,306) | 0.684 ± 0.014 | Top-5: 0.839 ± 0.011 | In-domain complex hospital records |
| EHR-External (n=283) | 0.719 ± 0.025 | — | Out-of-domain hospital data |
| RareBench (n=1,197) | 0.392 ± 0.025 (vs. GPT-5 0.353) | Top-10: near-identical | Outperforms SOTA on rare-disease cases |
| MedEHR-Variant (n=147) | 0.770 (+17.0% with GraphRAG) | — | KG/variant retrieval has the largest impact |
| Phenopacket-Store (n=5,213) | — | Top-10: 0.910 ± 0.007 | Multi-modal cohort, post-GraphRAG |

Robustness against noisy/overlapping phenotypes is documented—Top-5 accuracies are maintained (0.787–0.840) from sparse to rich key-phenotype cases, while traditional tools (e.g., Exomiser) register Top-5 <0.15 across all tiers. On the 300-hardest complex-phenotype cases, Top-1 is 0.520 (Yang et al., 18 Nov 2025).

Standard diagnostic metrics (precision, recall, $F_1$) are reported as:

$$\mathrm{Precision} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FP}}, \qquad \mathrm{Recall} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN}}, \qquad F_1 = \frac{2\,\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$
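
A compact sketch of these metrics computed from raw confusion counts:

```python
# Minimal sketch: precision, recall, and F1 from true/false positive and
# false negative counts, guarding against empty denominators.
def precision_recall_f1(tp: int, fp: int, fn: int) -> tuple[float, float, float]:
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1
```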

5. Human Studies and Interpretability

In 110-case matched studies spanning neurologic and metabolic disorders, RareSeek R1 achieves Top-1 accuracy of 0.473 (mid-level physician parity), with GraphRAG retrieval elevating Top-1 to 0.582 (senior-attending performance range). Clinician-AI collaboration boosts junior, mid-level, and senior Top-1 accuracy by Δ=+0.169, +0.118, and +0.094, respectively.

Auditable decision paths are enabled by explicit CoT reasoning chains, which closely align with established clinical guidelines. Notably, decisive non-phenotypic (non-HPO) evidence constitutes a median of 23.1% (IQR 11.8%–40.0%) of features in correct diagnoses. The breakdown by category is as follows:

| Category | Fraction (%) |
|---|---|
| Imaging findings | 26.5 |
| Interventions | 23.0 |
| Functional tests | 17.7 |
| Laboratory results | 9.1 |
| Molecular tests | 8.2 |
| Pathology | 7.2 |
| Environmental exposures | 5.4 |

This highlights the multi-modal and evidence-integrated nature of correct rare-disease diagnostic reasoning in practice (Yang et al., 18 Nov 2025).

6. Clinical Impact and Future Directions

RareSeek R1 substantiates a narrative-first, knowledge-integrated paradigm for rare disease diagnosis, with the potential to shorten the diagnostic odyssey by supporting clinicians with interpretable, knowledge-grounded decision support. Its integration of graph-augmented retrieval and multi-modal reasoning consistently improves both standalone and assistive diagnostic metrics, with auditability and benchmarking rigorously documented.

A plausible implication is that RareSeek R1's methodology—progressive domain tuning, explicit multi-modal CoT, and dynamic use of curated KGs—could serve as a template for future clinical LLMs tasked with similarly heterogeneous, high-stakes domains.

Ongoing and future developments may involve scaling the KG with additional population-genomic and longitudinal evidence, extending synthetic reasoning chains to rarer disease subtypes, and adopting more adaptable prompt and fine-tuning protocols to further boost clinical domain generalization and real-world deployment (Yang et al., 18 Nov 2025).
