- The paper introduces DiagGym and DiagAgent, a framework leveraging reinforcement learning for multi-turn diagnostic reasoning in simulated clinical settings.
- The methodology combines dynamic examination recommendation, reward shaping grounded in real-world clinical metrics, and interactive rollouts, achieving up to 85.03% diagnostic accuracy.
- Results demonstrate significant improvements over baseline models with robust full-chain consistency and efficient diagnostic decision-making.
Evolving Diagnostic Agents in a Virtual Clinical Environment
Introduction and Motivation
The paper presents a comprehensive framework for training LLMs as diagnostic agents capable of managing multi-turn diagnostic processes in a virtual clinical environment. The central thesis is that clinical diagnosis is inherently a sequential, long-horizon decision-making task, requiring adaptive examination selection and dynamic hypothesis revision. Existing LLMs, predominantly trained on static, instruction-style corpora, lack the ability to interactively explore diagnostic trajectories, resulting in suboptimal performance in real-world diagnostic workflows.
To address these limitations, the authors introduce DiagGym, a high-fidelity diagnostics world model trained on electronic health records (EHRs), and DiagAgent, a diagnostic agent trained via end-to-end multi-turn reinforcement learning (RL) within DiagGym. The framework is evaluated on DiagBench, a new benchmark comprising 750 physician-validated cases and 99 cases annotated with 973 physician-written rubrics, enabling granular assessment of multi-turn diagnostic reasoning.
Figure 1: Overview of the proposed method, including the DiagGym virtual clinical environment, the diagnostics world model, and the end-to-end RL training of DiagAgent.
DiagGym: High-Fidelity Diagnostics World Model
DiagGym is formulated as a conditional generative model Φ_env that emits synthetic examination results conditioned on a dynamically evolving patient state. At each step, the model receives the patient profile, past examinations, and the next examination query, and generates plausible examination outcomes. The training objective minimizes the negative log-likelihood of ground-truth examination results, treating all outputs as free text, regardless of modality.
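The training objective above can be made concrete with a minimal sketch. This is an illustrative pure-Python rendering of the negative log-likelihood over a tokenized examination result, not the paper's actual implementation; the `nll_loss` helper and the toy distributions are assumptions for exposition.

```python
import math

def nll_loss(token_probs, target_tokens):
    """Negative log-likelihood of a ground-truth examination result.

    token_probs: list of dicts, one per target position, mapping each
        candidate token to the model's conditional probability
        P(token | patient profile, past exams, exam query, previous tokens).
    target_tokens: the ground-truth examination result as a token list.
    """
    assert len(token_probs) == len(target_tokens)
    return -sum(math.log(dist[tok])
                for dist, tok in zip(token_probs, target_tokens))

# Toy example: a two-token ground-truth result "WBC: elevated".
dists = [
    {"WBC:": 0.5, "Hgb:": 0.5},        # unsure which field comes first
    {"elevated": 0.9, "normal": 0.1},  # confident in the value given the field
]
loss = nll_loss(dists, ["WBC:", "elevated"])
```

Treating every modality as free text means the same token-level loss applies uniformly, whether the examination yields a number, a lab panel, or a narrative report.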
The evaluation of DiagGym focuses on both instance-wise and examination-wise metrics:
- Instance-wise metrics: Step-level similarity and full-chain consistency, assessed by both automated (GPT-4o) and physician raters.
- Examination-wise metrics: Fidelity (1-Wasserstein distance for numerical, FID for free-text) and diversity (normalized variance, Intra-LPIPS).
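For the numerical-fidelity metric, the empirical 1-Wasserstein distance between two one-dimensional samples of equal size reduces to sorting both and averaging the absolute differences of matched quantiles. A minimal sketch (the function name and equal-size assumption are illustrative, not the paper's code):

```python
def wasserstein_1d(real, generated):
    # Empirical 1-Wasserstein distance between two equal-size 1-D samples:
    # sort both, then average absolute differences of matched quantiles.
    assert len(real) == len(generated)
    r, g = sorted(real), sorted(generated)
    return sum(abs(a - b) for a, b in zip(r, g)) / len(r)

# Toy example: real vs. simulated lab values for one examination type.
d = wasserstein_1d([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

A small distance (such as the reported 0.128) indicates that the distribution of generated numerical results closely tracks the real-world EHR distribution.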
DiagGym achieves superior performance compared to strong open-source LLM baselines (DeepSeek-v3, Qwen2.5, MedGemma), with 96.91% full-chain consistency and a 1-Wasserstein distance of 0.128, closely matching real-world distributions. Computational efficiency is also notable, requiring only a single A100 GPU and 0.52s per simulation, enabling scalable RL training.
Figure 2: Simulator evaluation settings, including instance-wise and examination-wise metrics for assessing fidelity and diversity of generated examination results.
DiagAgent: Reinforcement Learning for Multi-Turn Diagnostic Reasoning
DiagAgent is trained within DiagGym using RL, where the agent's policy network π_θ maps the current patient state to the next optimal action (examination recommendation or final diagnosis). The reward function is a weighted sum of diagnostic accuracy, examination recommendation quality, and turn efficiency. The RL training leverages the GRPO algorithm, with a cold-start phase for output format alignment followed by policy optimization via interactive rollouts.
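The weighted-sum reward can be sketched as follows. The weights, the efficiency term, and the function signature are assumptions for illustration; the paper's actual coefficients and shaping details may differ.

```python
def diagnostic_reward(diagnosis_correct, exam_quality, num_turns,
                      w_diag=1.0, w_exam=0.5, w_eff=0.1, max_turns=10):
    """Illustrative weighted-sum reward for one diagnostic trajectory.

    diagnosis_correct: whether the final diagnosis matched the reference.
    exam_quality: score in [0, 1] for examination recommendation quality.
    num_turns: number of interaction turns used (fewer is more efficient).
    """
    efficiency = 1.0 - num_turns / max_turns  # reward shorter trajectories
    return (w_diag * float(diagnosis_correct)
            + w_exam * exam_quality
            + w_eff * efficiency)

# A correct diagnosis with perfect exam choices in 5 of 10 allowed turns.
r = diagnostic_reward(True, 1.0, 5)
```

Because the reward scores whole trajectories rather than single responses, GRPO can compare groups of interactive rollouts against each other and push the policy toward both accurate and efficient diagnostic paths.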
The agent is evaluated in two settings:
- Single-turn evaluation: The agent recommends an examination or renders a diagnosis based on partial ground-truth trajectory.
- End-to-end evaluation: The agent autonomously interacts with DiagGym to complete the full diagnostic workflow.
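The end-to-end setting can be summarized as a simple interaction loop: at each turn the agent either requests an examination, which DiagGym answers with a synthetic result, or commits to a final diagnosis. The interface below (action dicts, `rollout`, the toy policy and environment) is a hypothetical sketch, not the paper's API.

```python
def rollout(agent_policy, env, max_turns=10):
    """Minimal agent-environment loop for end-to-end evaluation."""
    history = []  # list of (examination, result) pairs gathered so far
    for _ in range(max_turns):
        action = agent_policy(history)
        if action["type"] == "diagnose":
            return action["diagnosis"], history
        result = env(action["exam"], history)  # world model generates result
        history.append((action["exam"], result))
    return None, history  # turn budget exhausted without committing

# Toy stand-ins for DiagGym and DiagAgent.
def toy_env(exam, history):
    return {"CBC": "WBC elevated"}.get(exam, "normal")

def toy_policy(history):
    if not history:
        return {"type": "exam", "exam": "CBC"}
    return {"type": "diagnose", "diagnosis": "infection"}

dx, traj = rollout(toy_policy, toy_env)
```

The same loop serves both RL training (with rewards computed over the finished trajectory) and end-to-end evaluation (with the final diagnosis scored against the reference).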
DiagAgent demonstrates substantial improvements over 10 state-of-the-art LLMs and 2 agentic systems (MedAgents, MDAgents):
- Single-turn: Up to 71.12% hit ratio and 85.03% diagnostic accuracy (DiagAgent-7B), outperforming MedGemma (27.07%/68.90%) and DeepSeek-v3 (20.08%/72.27%).
- End-to-end: Up to 61.63% diagnostic accuracy and 47.89 F1-score (DiagAgent-14B), with longer, more comprehensive interaction trajectories.
Figure 3: Single-turn evaluation settings and results, comparing DiagAgent variants against leading LLMs and agentic systems.
Figure 4: End-to-end evaluation pipeline and results, including automatic and rubric-based metrics for examination recommendation and diagnostic accuracy.
Ablation Studies
Ablation experiments confirm the superiority of RL-based training over supervised fine-tuning (SFT), the necessity of dual reward shaping (diagnosis and examination recommendation), and the generality of the approach across model sizes and families. RL consistently yields higher diagnostic accuracy and F1 scores, with larger base models achieving higher performance ceilings.
Case Studies and Qualitative Analysis
Case studies illustrate the fidelity of DiagGym in generating clinically plausible examination results and the robustness of DiagAgent in dynamic diagnostic reasoning. Successful cases demonstrate efficient evidence gathering and alignment with physician-curated rubrics, while failure cases highlight current limitations in acute management actions, reflecting the diagnostic focus of the agent.
Figure 5: Example case study from DiagGym, comparing predicted and ground truth examinations.
Figure 6: Interactive diagnostic case study with DiagAgent, showing model trajectory and reference timeline.
Figure 7: Success case of DiagAgent evaluated by physician-curated rubrics, demonstrating procedural integrity.
Figure 8: Failure case of DiagAgent, showing diagnostic strength but a management deficit.
Implications and Future Directions
The framework establishes DiagGym as a scalable in-silico testbed for optimizing diagnostic management strategies prior to clinical validation. RL-based training in interactive environments confers dynamic, clinically meaningful long-term diagnostic management abilities, unattainable through passive training. The results indicate that process-aware, trajectory-level optimization is essential for safe and effective clinical decision support.
Key limitations include the modest scale of evaluated models (up to 14B parameters), the diagnostic (not therapeutic) focus of DiagAgent, and the scope of DiagGym relative to real-world clinical complexity. Future work should explore scaling to larger foundation models, integrating treatment planning, and expanding the virtual environment to encompass broader clinical tasks.
Conclusion
This work presents a rigorous framework for evolving diagnostic agents via RL in a virtual clinical environment, demonstrating strong empirical gains in multi-turn diagnostic reasoning and examination recommendation. The approach advances the state-of-the-art in clinical AI by enabling dynamic, process-aware decision-making, and provides a robust platform for future research in agentic medical AI.