Doctor-R1: AI Clinical Inquiry Agent

Updated 12 October 2025
  • Doctor-R1 is an AI clinical inquiry agent that uses a multi-agent interactive environment to simulate realistic patient consultations.
  • It employs a two-tiered reward architecture to optimize both empathetic communication and accurate diagnostic decision-making.
  • The system leverages an experience-driven learning pipeline and outperforms much larger models on strong clinical dialogue benchmarks.

Doctor-R1 is an AI doctor agent designed to master professional clinical inquiry by simultaneously optimizing accurate medical decision-making and strategic, empathetic multi-turn patient consultation. The system is built around a multi-agent interactive environment, a two-tiered reward structure, and an experience-driven learning regimen, and it achieves state-of-the-art performance on key clinical dialogue benchmarks along with consistent human preference in pairwise evaluations.

1. Multi-Agent Interactive Clinical Environment

Doctor-R1 operates within a simulated multi-agent environment that mirrors a realistic outpatient consultation. The doctor agent, implemented as a policy model, interacts dynamically with a simulated patient agent, with the overall clinical exchange formalized as a Partially Observable Markov Decision Process (POMDP). A dedicated Consultation Evaluator agent monitors the interaction and supplies turn-wise and episode-level feedback. This environment captures the full temporal structure and partial observability of true clinical consultations, enforcing the need for the agent to gather information across multiple turns and adapt to encountered uncertainty and patient responses.
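
The source does not provide an implementation, but the interaction loop can be illustrated with a minimal Python sketch. The class and method names (the doctor policy's act, the patient agent's respond, and the evaluator's scoring methods) are assumptions for illustration only, not the paper's actual interfaces.

```python
from dataclasses import dataclass, field


@dataclass
class ConsultationState:
    """Dialogue history visible to the doctor agent (its partial observation)."""
    turns: list = field(default_factory=list)  # alternating ("doctor"/"patient", utterance) pairs
    done: bool = False


def run_consultation(doctor_policy, patient_agent, evaluator, max_turns=10):
    """Roll out one simulated outpatient consultation.

    The doctor agent only sees the dialogue so far, never the patient's
    latent condition, which is what makes the exchange a POMDP.
    """
    state = ConsultationState()
    process_rewards = []

    for _ in range(max_turns):
        # Doctor acts on its partial observation of the dialogue.
        doctor_utterance = doctor_policy.act(state.turns)
        state.turns.append(("doctor", doctor_utterance))

        # Simulated patient responds based on its hidden case profile.
        patient_utterance, state.done = patient_agent.respond(state.turns)
        state.turns.append(("patient", patient_utterance))

        # Consultation Evaluator supplies turn-wise feedback.
        process_rewards.append(evaluator.score_turn(state.turns))
        if state.done:
            break

    # Episode-level feedback once the dialogue terminates.
    outcome = evaluator.score_outcome(state.turns)
    return state.turns, process_rewards, outcome
```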

2. Two-Tiered Reward Architecture

A core innovation in Doctor-R1 is the separation of optimization objectives via a two-tiered reward system:

  • Process Rewards: At each dialogue turn, the agent receives a “process” reward scoring communication and inquiry skills along multiple axes, including safety, logical reasoning, medical accuracy, completeness, quality of information gathering, faithfulness, empathy, and humility. This feedback enables learning of soft skills and strategic questioning, not just recall or factual correctness.
  • Outcome Rewards: Upon dialogue completion, a distinct “outcome” reward evaluates the correctness and completeness of the final diagnostic decision against gold-standard ground truth (with rewards of 1.0 for correct, 0.5 for partially correct, and 0 for incorrect). This decoupling ensures that the agent not only delivers correct diagnoses but learns to arrive at them via detailed, patient-centered inquiry.

This architecture allows Doctor-R1 to simultaneously optimize communicative competence and diagnostic accuracy, properties that are often only loosely coupled in existing LLM-based systems.
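
As a rough sketch of how such a two-tiered reward could be computed, the following Python fragment scores each turn along the listed axes and grades the final diagnosis against ground truth. The axis names and the 1.0/0.5/0 outcome values follow the description above; the judge and matcher interfaces and the equal-weight averaging are assumptions, not the paper's exact formulation.

```python
# Axis names follow the paper's description of the process reward.
PROCESS_AXES = [
    "safety", "logical_reasoning", "medical_accuracy", "completeness",
    "information_gathering", "faithfulness", "empathy", "humility",
]


def process_reward(turn, judge):
    """Turn-level reward: average of per-axis scores (assumed to lie in [0, 1])."""
    scores = {axis: judge.score(turn, axis) for axis in PROCESS_AXES}
    return sum(scores.values()) / len(scores)


def outcome_reward(predicted_diagnosis, gold_diagnosis, matcher):
    """Episode-level reward graded against gold-standard ground truth."""
    verdict = matcher.compare(predicted_diagnosis, gold_diagnosis)
    if verdict == "correct":
        return 1.0
    if verdict == "partial":
        return 0.5
    return 0.0
```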

3. Experience Repository and Learning Pipeline

Doctor-R1 grounds its policy learning in a dynamically curated experience repository that stores high-quality prior consultation trajectories. The experience retrieval pipeline operates in multiple stages:

  • Stage 1: Semantic Retrieval. Dense embedding models compute the cosine similarity between the current state/trajectory and all previously stored experiences, scoring candidates by both similarity and past trajectory reward.
  • Stage 2: Reranking. Retrieved candidates are re-ranked by a cross-encoder reranker model that attends to token-level matches and context, improving retrieval precision for complex cases.
  • Stage 3: Novelty and Reward Filtering. Only novel, high-reward (i.e., high-quality) trajectories are retained for subsequent learning steps, preventing the agent from overfitting to suboptimal strategies and ensuring continual policy improvement.

This repository enables experience replay and retrieval-augmented policy updates, supporting more rapid convergence to high-quality inquiry policies and robust strategic adaptation to rare or challenging cases.
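
A minimal sketch of the three-stage retrieval pipeline is shown below. The stages mirror the description above, but the experience record layout, the reranker interface, and all weights and thresholds are illustrative assumptions rather than values from the paper.

```python
import numpy as np


def retrieve_experiences(query_embedding, query_text, repository, reranker,
                         top_k=20, keep_k=5,
                         sim_weight=0.7, reward_weight=0.3,
                         novelty_threshold=0.95, min_reward=0.5):
    """Three-stage retrieval over stored consultation trajectories.

    Each repository entry is assumed to look like
    {"text": str, "embedding": np.ndarray, "reward": float}.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

    # Stage 1: semantic retrieval, weighting similarity and past trajectory reward.
    scores = [
        sim_weight * cosine(query_embedding, exp["embedding"]) + reward_weight * exp["reward"]
        for exp in repository
    ]
    order = sorted(range(len(repository)), key=lambda i: scores[i], reverse=True)
    candidates = [repository[i] for i in order[:top_k]]

    # Stage 2: cross-encoder reranking over (query, candidate) text pairs.
    pair_scores = reranker.score([(query_text, exp["text"]) for exp in candidates])
    rerank_order = sorted(range(len(candidates)), key=lambda i: pair_scores[i], reverse=True)
    reranked = [candidates[i] for i in rerank_order]

    # Stage 3: keep only novel, high-reward trajectories.
    kept, kept_embeddings = [], []
    for exp in reranked:
        if exp["reward"] < min_reward:
            continue  # discard low-quality trajectories
        if any(cosine(exp["embedding"], e) > novelty_threshold for e in kept_embeddings):
            continue  # near-duplicate of an already-kept trajectory
        kept.append(exp)
        kept_embeddings.append(exp["embedding"])
        if len(kept) == keep_k:
            break
    return kept
```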

4. Evaluation Benchmarks and Metrics

Doctor-R1 is assessed using two high-fidelity clinical dialogue evaluation suites:

  • HealthBench: Evaluates across Themes (e.g., emergency, health data, communication, global health, hedging, context seeking, complex response) and Axes (e.g., factual accuracy, instruction following, communication quality, context awareness, completeness).
  • MAQuE: Emphasizes multi-faceted qualities across task success (accuracy, robustness), inquiry proficiency (coverage, relevance), dialogue competence (adherence, coherence), and patient experience (clarity, empathy).

Across these metrics, Doctor-R1 is reported to improve substantially over strong open-source and proprietary baselines, including UltraMedical-70B and models with significantly higher parameter counts.

Benchmark            UltraMedical-70B    Doctor-R1 (Avg.)    Delta
HealthBench          26.38               36.29               +9.91
MAQuE (Accuracy)     52.00               60.00               +8.00

The comparison shows consistent gains of roughly 8-10 points over UltraMedical-70B on both benchmarks.

5. Human-Centric Evaluation and Preferred Dialogue Quality

In addition to automated metrics, Doctor-R1 underwent human preference testing via pairwise dialogue comparisons. Annotators assessed model outputs on coherence, adherence to clinical role, clarity, and empathy. Doctor-R1 was consistently preferred, with particular strengths cited in natural, human-like empathy and the ability to structure communication in a way that supported patient understanding and addressed risk checks without resorting to formulaic or rigid scripting.
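
The source reports only that Doctor-R1 was consistently preferred in these comparisons. For illustration, pairwise judgments of this kind are commonly aggregated into per-model win rates; the sketch below assumes a hypothetical judgment record format and counts ties as half a win for each side, neither of which is specified in the source.

```python
from collections import Counter


def win_rates(judgments):
    """Aggregate pairwise preference judgments into per-model win rates.

    Each judgment is assumed to look like
    {"model_a": "Doctor-R1", "model_b": "baseline", "preferred": "model_a"}.
    """
    wins, totals = Counter(), Counter()
    for j in judgments:
        a, b = j["model_a"], j["model_b"]
        totals[a] += 1
        totals[b] += 1
        if j["preferred"] == "model_a":
            wins[a] += 1
        elif j["preferred"] == "model_b":
            wins[b] += 1
        else:  # tie: split the credit
            wins[a] += 0.5
            wins[b] += 0.5
    return {model: wins[model] / totals[model] for model in totals}
```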

6. Implications for Clinical Practice

Doctor-R1 addresses key shortcomings in prior LLM-based doctor agents by combining dynamic, multi-turn interactive inquiry with robust medical decision-making. Its dual focus on process and outcome rewards enables adaptive, safety-conscious decision policies and patient-centered communication—essential for real-world deployment in outpatient or triage settings. The high parameter efficiency (an 8B model outperforming 32B/70B baselines) suggests potential for scalable, cost-effective deployment without sacrificing quality. This architecture also supports rapid adaptation to new medical contexts by updating the experience repository or refining the reward schemas.

A plausible implication is that frameworks similar to Doctor-R1 could serve as front-line assistants in clinical intake or telehealth, supporting clinicians by pre-gathering high-yield information and triaging cases more safely and thoroughly than script-based or decision-tree systems.

7. Summary and Outlook

Doctor-R1 establishes a new state-of-the-art in LLM-based clinical inquiry by integrating a realistic multi-agent environment, a dual reward structure separating communication and diagnostic objectives, and an experiential learning pipeline for strategic inquiry refinement. Its performance on benchmarks and human evaluations demonstrates that the agent outperforms leading open-source and powerful proprietary models while using fewer parameters. These findings highlight the value of agentic reinforcement learning and experience-driven retrieval in mastering the nuances of professional doctor–patient interaction, setting a foundation for further advances in autonomous, human-preferred AI clinical agents (Lai et al., 5 Oct 2025).
