- The paper introduces DiagGym and DiagAgent, a framework leveraging reinforcement learning for multi-turn diagnostic reasoning in simulated clinical settings.
- The methodology combines dynamic examination recommendation, reward shaping grounded in real-world clinical metrics, and interactive rollouts, achieving up to 85.03% diagnostic accuracy.
- Results demonstrate significant improvements over baseline models with robust full-chain consistency and efficient diagnostic decision-making.
Evolving Diagnostic Agents in a Virtual Clinical Environment
Introduction and Motivation
The paper presents a comprehensive framework for training LLMs as diagnostic agents capable of managing multi-turn diagnostic processes in a virtual clinical environment. The central thesis is that clinical diagnosis is inherently a sequential, long-horizon decision-making task, requiring adaptive examination selection and dynamic hypothesis revision. Existing LLMs, predominantly trained on static, instruction-style corpora, lack the ability to interactively explore diagnostic trajectories, resulting in suboptimal performance in real-world diagnostic workflows.
To address these limitations, the authors introduce DiagGym, a high-fidelity diagnostics world model trained on electronic health records (EHRs), and DiagAgent, a diagnostic agent trained via end-to-end multi-turn reinforcement learning (RL) within DiagGym. The framework is evaluated on DiagBench, a new benchmark comprising 750 physician-validated cases and 99 cases annotated with 973 physician-written rubrics, enabling granular assessment of multi-turn diagnostic reasoning.
Figure 1: Overview of the proposed method, including the DiagGym virtual clinical environment, the diagnostics world model, and the end-to-end RL training of DiagAgent.
DiagGym: High-Fidelity Diagnostics World Model
DiagGym is formulated as a conditional generative model Φ_env that emits synthetic examination results conditioned on a dynamically evolving patient state. At each step, the model receives the patient profile, past examinations, and the next examination query, and generates plausible examination outcomes. The training objective minimizes the negative log-likelihood of ground-truth examination results, treating all outputs as free text, regardless of modality.
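The training objective above can be made concrete with a minimal sketch. This is an illustrative pure-Python rendering of the negative log-likelihood over a tokenized examination result, not the paper's actual implementation; the `nll_loss` helper and the toy distributions are assumptions for exposition.

```python
import math

def nll_loss(token_probs, target_tokens):
    """Negative log-likelihood of a ground-truth examination result.

    token_probs: list of dicts, one per target position, mapping each
        candidate token to the model's conditional probability
        P(token | patient profile, past exams, exam query, previous tokens).
    target_tokens: the ground-truth examination result as a token list.
    """
    assert len(token_probs) == len(target_tokens)
    return -sum(math.log(dist[tok])
                for dist, tok in zip(token_probs, target_tokens))

# Toy example: a two-token ground-truth result "WBC: elevated".
dists = [
    {"WBC:": 0.5, "Hgb:": 0.5},        # unsure which field comes first
    {"elevated": 0.9, "normal": 0.1},  # confident in the value given the field
]
loss = nll_loss(dists, ["WBC:", "elevated"])
```

Treating every modality as free text means the same token-level loss applies uniformly, whether the examination yields a number, a lab panel, or a narrative report.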
The evaluation of DiagGym focuses on both instance-wise and examination-wise metrics:
- Instance-wise metrics: Step-level similarity and full-chain consistency, assessed by both automated (GPT-4o) and physician raters.
- Examination-wise metrics: Fidelity (1-Wasserstein distance for numerical, FID for free-text) and diversity (normalized variance, Intra-LPIPS).
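For the numerical-fidelity metric, the empirical 1-Wasserstein distance between two one-dimensional samples of equal size reduces to sorting both and averaging the absolute differences of matched quantiles. A minimal sketch (the function name and equal-size assumption are illustrative, not the paper's code):

```python
def wasserstein_1d(real, generated):
    # Empirical 1-Wasserstein distance between two equal-size 1-D samples:
    # sort both, then average absolute differences of matched quantiles.
    assert len(real) == len(generated)
    r, g = sorted(real), sorted(generated)
    return sum(abs(a - b) for a, b in zip(r, g)) / len(r)

# Toy example: real vs. simulated lab values for one examination type.
d = wasserstein_1d([1.0, 2.0, 3.0], [1.0, 2.0, 4.0])
```

A small distance (such as the reported 0.128) indicates that the distribution of generated numerical results closely tracks the real-world EHR distribution.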
DiagGym achieves superior performance compared to strong open-source LLM baselines (DeepSeek-v3, Qwen2.5, MedGemma), with 96.91% full-chain consistency and a 1-Wasserstein distance of 0.128, closely matching real-world distributions. Computational efficiency is also notable, requiring only a single A100 GPU and 0.52s per simulation, enabling scalable RL training.
Figure 2: Simulator evaluation settings, including instance-wise and examination-wise metrics for assessing fidelity and diversity of generated examination results.
DiagAgent: Reinforcement Learning for Multi-Turn Diagnostic Reasoning
DiagAgent is trained within DiagGym using RL, where the agent's policy network π_θ maps the current patient state to the next optimal action (examination recommendation or final diagnosis). The reward function is a weighted sum of diagnostic accuracy, examination recommendation quality, and turn efficiency. The RL training leverages the GRPO algorithm, with a cold-start phase for output format alignment followed by policy optimization via interactive rollouts.
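The weighted-sum reward can be sketched as follows. The weights, the efficiency term, and the function signature are assumptions for illustration; the paper's actual coefficients and shaping details may differ.

```python
def diagnostic_reward(diagnosis_correct, exam_quality, num_turns,
                      w_diag=1.0, w_exam=0.5, w_eff=0.1, max_turns=10):
    """Illustrative weighted-sum reward for one diagnostic trajectory.

    diagnosis_correct: whether the final diagnosis matched the reference.
    exam_quality: score in [0, 1] for examination recommendation quality.
    num_turns: number of interaction turns used (fewer is more efficient).
    """
    efficiency = 1.0 - num_turns / max_turns  # reward shorter trajectories
    return (w_diag * float(diagnosis_correct)
            + w_exam * exam_quality
            + w_eff * efficiency)

# A correct diagnosis with perfect exam choices in 5 of 10 allowed turns.
r = diagnostic_reward(True, 1.0, 5)
```

Because the reward scores whole trajectories rather than single responses, GRPO can compare groups of interactive rollouts against each other and push the policy toward both accurate and efficient diagnostic paths.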
The agent is evaluated in two settings:
- Single-turn evaluation: The agent recommends an examination or renders a diagnosis based on partial ground-truth trajectory.
- End-to-end evaluation: The agent autonomously interacts with DiagGym to complete the full diagnostic workflow.
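The end-to-end setting can be summarized as a simple interaction loop: at each turn the agent either requests an examination, which DiagGym answers with a synthetic result, or commits to a final diagnosis. The interface below (action dicts, `rollout`, the toy policy and environment) is a hypothetical sketch, not the paper's API.

```python
def rollout(agent_policy, env, max_turns=10):
    """Minimal agent-environment loop for end-to-end evaluation."""
    history = []  # list of (examination, result) pairs gathered so far
    for _ in range(max_turns):
        action = agent_policy(history)
        if action["type"] == "diagnose":
            return action["diagnosis"], history
        result = env(action["exam"], history)  # world model generates result
        history.append((action["exam"], result))
    return None, history  # turn budget exhausted without committing

# Toy stand-ins for DiagGym and DiagAgent.
def toy_env(exam, history):
    return {"CBC": "WBC elevated"}.get(exam, "normal")

def toy_policy(history):
    if not history:
        return {"type": "exam", "exam": "CBC"}
    return {"type": "diagnose", "diagnosis": "infection"}

dx, traj = rollout(toy_policy, toy_env)
```

The same loop serves both RL training (with rewards computed over the finished trajectory) and end-to-end evaluation (with the final diagnosis scored against the reference).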
DiagAgent demonstrates substantial improvements over 10 state-of-the-art LLMs and 2 agentic systems (MedAgents, MDAgents):
- Single-turn: Up to 71.12% hit ratio and 85.03% diagnostic accuracy (DiagAgent-7B), outperforming MedGemma (27.07%/68.90%) and DeepSeek-v3 (20.08%/72.27%).
- End-to-end: Up to 61.63% diagnostic accuracy and 47.89 F1-score (DiagAgent-14B), with longer, more comprehensive interaction trajectories.
Figure 3: Single-turn evaluation settings and results, comparing DiagAgent variants against leading LLMs and agentic systems.
Figure 4: End-to-end evaluation pipeline and results, including automatic and rubric-based metrics for examination recommendation and diagnostic accuracy.
Ablation Studies
Ablation experiments confirm the superiority of RL-based training over supervised fine-tuning (SFT), the necessity of dual reward shaping (diagnosis and examination recommendation), and the generality of the approach across model sizes and families. RL consistently yields higher diagnostic accuracy and F1 scores, with larger base models achieving higher performance ceilings.
Case Studies and Qualitative Analysis
Case studies illustrate the fidelity of DiagGym in generating clinically plausible examination results and the robustness of DiagAgent in dynamic diagnostic reasoning. Successful cases demonstrate efficient evidence gathering and alignment with physician-curated rubrics, while failure cases highlight current limitations in acute management actions, reflecting the diagnostic focus of the agent.
Figure 5: Example case study from DiagGym, comparing predicted and ground truth examinations.
Figure 6: Interactive diagnostic case study with DiagAgent, showing model trajectory and reference timeline.
Figure 7: Success case of DiagAgent evaluated by physician-curated rubrics, demonstrating procedural integrity.
Figure 8: Failure case of DiagAgent, showing diagnostic strength but a management deficit.
Implications and Future Directions
The framework establishes DiagGym as a scalable in-silico testbed for optimizing diagnostic management strategies prior to clinical validation. RL-based training in interactive environments confers dynamic, clinically meaningful long-term diagnostic management abilities, unattainable through passive training. The results indicate that process-aware, trajectory-level optimization is essential for safe and effective clinical decision support.
Key limitations include the modest scale of evaluated models (up to 14B parameters), the diagnostic (not therapeutic) focus of DiagAgent, and the scope of DiagGym relative to real-world clinical complexity. Future work should explore scaling to larger foundation models, integrating treatment planning, and expanding the virtual environment to encompass broader clinical tasks.
Conclusion
This work presents a rigorous framework for evolving diagnostic agents via RL in a virtual clinical environment, demonstrating strong empirical gains in multi-turn diagnostic reasoning and examination recommendation. The approach advances the state-of-the-art in clinical AI by enabling dynamic, process-aware decision-making, and provides a robust platform for future research in agentic medical AI.