
MedAgentBench: Evaluating Agentic Medical AI

Updated 28 November 2025
  • MedAgentBench is a simulation-based benchmark suite that evaluates agentic capabilities of large language models in realistic, FHIR-compliant clinical environments.
  • It employs rigorous task suites designed by clinicians to test high-order skills such as planning, tool invocation, and multi-agent collaboration in virtual medical workflows.
  • Evaluation metrics include task success rates, safety scoring, and multi-modal performance assessments, setting new standards for developing agentic medical AI at scale.

MedAgentBench is a family of benchmarks and simulation environments designed to rigorously evaluate the agentic capabilities of LLMs and agent frameworks in realistic, complex, and clinically grounded medical contexts. Unlike conventional medical QA datasets, MedAgentBench targets high-order skills: planning, tool invocation via APIs, interaction with structured and multimodal data, and safety-critical orchestration in virtual EHRs and simulated clinical settings. The suite encompasses the original Stanford MedAgentBench, its extension in MedBench v4 for Chinese clinical agents, and subsequent adaptations to multi-agent clinical workflows and federated learning coordination. Together, these resources address the urgent need for robust, unsaturated benchmarks for medical LLM agents, establishing new standards for evaluating and developing agentic medical AI at scale (Jiang et al., 24 Jan 2025, Ding et al., 18 Nov 2025, Almansoori et al., 28 Mar 2025, Tang et al., 10 Mar 2025, Zhu et al., 18 May 2025, Saha et al., 28 Sep 2025).

1. System Architecture and Environment Design

The foundational MedAgentBench framework implements an interactive, FHIR-compliant virtual EHR environment, capable of emulating modern EMR systems at the API level. The server core is a Dockerized HAPI FHIR JPA instance backed by an H2 relational database, supporting RESTful GET and POST operations for standard FHIR resource types: Patient, Observation, Condition, Procedure, and MedicationRequest (Jiang et al., 24 Jan 2025). Data are serialized as FHIR-compliant JSONs, allowing precise emulation of EHR workflows and simplified migration to production environments.
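
As a concrete sketch of this API surface, the snippet below issues a FHIR REST search against such a server using only the Python standard library. The base URL, MRN, and LOINC code are illustrative placeholders, not values from the benchmark:

```python
import json
import urllib.parse
import urllib.request

FHIR_BASE = "http://localhost:8080/fhir"  # hypothetical local HAPI FHIR endpoint

def observation_search_url(patient_mrn, loinc_code):
    """Build a FHIR search URL for a patient's Observations, filtered by LOINC code."""
    query = urllib.parse.urlencode({"patient": patient_mrn, "code": loinc_code})
    return f"{FHIR_BASE}/Observation?{query}"

def search_observations(patient_mrn, loinc_code):
    """GET the matching Observation resources; returns the decoded FHIR Bundle."""
    url = observation_search_url(patient_mrn, loinc_code)
    with urllib.request.urlopen(url, timeout=10) as resp:
        return json.load(resp)

def extract_values(bundle):
    """Pull (value, unit) pairs out of a FHIR searchset Bundle."""
    return [
        (q["value"], q.get("unit"))
        for entry in bundle.get("entry", [])
        for q in [entry["resource"].get("valueQuantity", {})]
        if "value" in q
    ]
```

A POST of a FHIR-compliant JSON body to the same base URL would create a resource (e.g., a MedicationRequest), mirroring the write half of the benchmark's action tasks.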

Patient data are sampled and de-identified from real clinical systems (e.g., 100 patients with >700,000 data elements from Stanford STARR), preserving realistic temporal structures (jittered timestamps), event sequences, and standardized medical codes (LOINC, CPT, ICD-10, SNOMED, NDC). Demographic realism is maintained via synthetic MRNs and Faker-generated metadata. The infrastructure supports up to 8 interaction rounds per agent task, enabling iterative, multi-step clinical workflows and tool use.
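
A minimal sketch of the timestamp-jitter idea, a common de-identification technique: shift a patient's entire event timeline by one random offset so absolute dates are obscured while ordering and intervals survive. The per-patient seeded shift shown here is an assumption about the exact scheme, not the benchmark's published code:

```python
import random
from datetime import datetime, timedelta

def jitter_timestamps(events, max_shift_days=30, seed=0):
    """Shift a whole event timeline by one random offset, preserving
    relative order and inter-event intervals."""
    rng = random.Random(seed)  # per-patient seed keeps the shift reproducible
    shift = timedelta(days=rng.randint(-max_shift_days, max_shift_days))
    return [(name, ts + shift) for name, ts in events]

events = [("admission", datetime(2020, 1, 1)), ("lab_draw", datetime(2020, 1, 3))]
jittered = jitter_timestamps(events, seed=7)
```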

MedAgentBench variants such as MedAgentSim (Almansoori et al., 28 Mar 2025) extend this design to multi-agent, game-style hospital environments implemented with Phaser/Tiled and Python APIs, introducing active physician–patient–measurement agent dialogs and visual reasoning on simulated or real cases. The Chinese MedBench v4 agent track (Ding et al., 18 Nov 2025) implements cloud-based orchestration pipelines and concurrent safety/role adaptation modules, supporting >700,000 expert-authored clinical scenarios across 24+91 medical specialties.

2. Agent Task Suites and Interaction Protocols

Tasks are authored and validated by practicing physicians to ensure clinical relevance and verifiability. The original MedAgentBench comprises 100 core tasks across 10 categories including information retrieval, laboratory and data aggregation, recording new data, test and referral ordering, medication management, patient communication, documentation, and analytic reporting (Jiang et al., 24 Jan 2025). Each prompt details the required FHIR resource, patient MRN, time context, and expected output structure. Approximately half of the tasks are query-only and evaluated on exact/numeric match, while action tasks require correct resource creation (POST) and are judged by JSON payload validity and FHIR compliance.
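
The two grading modes described above might be implemented roughly as follows; the numeric tolerance and the required-field list are illustrative assumptions, not the benchmark's exact criteria:

```python
import json
import math

def grade_query(agent_answer, reference, tol=1e-6):
    """Query tasks: numeric match within tolerance, else exact string match."""
    try:
        return math.isclose(float(agent_answer), float(reference), abs_tol=tol)
    except (TypeError, ValueError):
        return str(agent_answer).strip() == str(reference).strip()

def grade_action(posted_payload, required_fields=("resourceType", "subject", "code")):
    """Action tasks: the POSTed body must parse as JSON and carry the
    FHIR fields the task specifies (this field list is illustrative)."""
    try:
        resource = json.loads(posted_payload)
    except (TypeError, json.JSONDecodeError):
        return False
    return isinstance(resource, dict) and all(f in resource for f in required_fields)
```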

Advanced MedAgentBench frameworks (e.g., MedBench v4, MedAgentSim) expand this suite to incorporate clinical dialogue, goal decomposition, tool/API operation, long-horizon memory, multi-agent cooperation, and explicit safety/adversarial challenge tasks (Ding et al., 18 Nov 2025, Almansoori et al., 28 Mar 2025). Tasks are stratified by specialty, workflow category, and complexity, including multi-modal input (e.g., image/lab/clinical report chains) and open-ended real-world scenarios (“plan a multidisciplinary oncology workflow”).

Interaction protocols are grounded in function libraries—agents select from a set of JSON-schema-defined API calls (e.g., patient.search, lab.search, procedure.create) with controlled iteration and finalization steps. All interaction history, API responses, and outputs are fed back via context windows for iterative navigation (Jiang et al., 24 Jan 2025).
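
A sketch of such a function library and its dispatch step: the tool names mirror those quoted above, but the schemas and the call envelope (`{"name": ..., "arguments": {...}}`) are assumptions modeled on common function-calling conventions, not the benchmark's exact definitions:

```python
import json

# Illustrative JSON-schema tool registry (schemas are assumptions)
TOOLS = {
    "patient.search": {
        "description": "Look up a patient record by MRN",
        "parameters": {
            "type": "object",
            "properties": {"mrn": {"type": "string"}},
            "required": ["mrn"],
        },
    },
    "lab.search": {
        "description": "Retrieve lab Observations for a patient by LOINC code",
        "parameters": {
            "type": "object",
            "properties": {"mrn": {"type": "string"}, "loinc": {"type": "string"}},
            "required": ["mrn", "loinc"],
        },
    },
}

def dispatch(call_json, handlers):
    """Parse one agent tool call, validate the tool name, run the handler."""
    call = json.loads(call_json)
    name = call.get("name")
    if name not in TOOLS or name not in handlers:
        return {"error": f"unknown tool: {name}"}
    return handlers[name](**call.get("arguments", {}))
```

In the benchmark loop, each handler's output would be appended to the agent's context window before the next of the (up to 8) iterations.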

3. Evaluation Metrics and Methodologies

Evaluation is centered on task-level success rates and stratified subgroup metrics (query vs action), with primary success calculated as the proportion of correctly completed tasks:

\text{SuccessRate} = \frac{N_{\text{successful tasks}}}{N_{\text{total tasks}}} \times 100\%

(Jiang et al., 24 Jan 2025)
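
In code, the overall and stratified (query vs. action) rates reduce to simple proportions:

```python
def success_rate(outcomes):
    """Percentage of tasks completed correctly (outcomes: list of booleans)."""
    return 100.0 * sum(outcomes) / len(outcomes)

def stratified_rates(tasks):
    """Per-category success rates; tasks is a list of (category, passed) pairs."""
    by_category = {}
    for category, passed in tasks:
        by_category.setdefault(category, []).append(passed)
    return {c: success_rate(v) for c, v in by_category.items()}
```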

Advanced tracks such as MedBench v4 introduce multi-dimensional, LLM-as-a-Judge scoring on clinical correctness, planning & decomposition, tool execution, and safety/governance, each mapped 0–5 and rescaled to 0–100 (Ding et al., 18 Nov 2025). Judging is calibrated by cross-validation with licensed clinicians (Cohen’s κ > 0.82), and a rotating evaluation pool prevents overfitting.
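
The rescaling is a linear map, and the agreement statistic is standard unweighted Cohen's κ; both snippets below implement those generic formulas and are not MedBench v4's published code:

```python
from collections import Counter

def rescale_to_100(score_0_to_5):
    """Map a 0-5 judge rating onto the 0-100 reporting scale."""
    return score_0_to_5 * 20.0

def cohens_kappa(rater_a, rater_b):
    """Chance-corrected agreement between two raters over the same items:
    kappa = (p_o - p_e) / (1 - p_e)."""
    n = len(rater_a)
    p_o = sum(a == b for a, b in zip(rater_a, rater_b)) / n  # observed agreement
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    p_e = sum(counts_a[k] * counts_b[k] for k in counts_a) / (n * n)  # chance
    return (p_o - p_e) / (1 - p_e)
```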

MedAgentSim extensions bring in standard diagnostic metrics (Accuracy, Precision, Recall, F1), composite indices for self-evolution (memory gain after replay), and ablation studies dissecting the impact of measurement, memory, chain-of-thought, and ensembling modules on scenario-level performance (Almansoori et al., 28 Mar 2025).
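
The standard diagnostic metrics can be computed from scratch; the binary-label case is shown for brevity:

```python
def precision_recall_f1(y_true, y_pred, positive=1):
    """Precision, recall, and F1 for one positive class over paired labels."""
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```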

Table: Illustrative Model Performance on MedAgentBench (Stanford)

Model                  Overall SR   Query SR   Action SR
GPT-4o                 72%          76%        68%
Claude 3.5 Sonnet v2   70%          84%        56%
DeepSeek V3            56%          60%        52%
Llama 3.3              49%          54%        44%
Gemma 2                28%          40%        16%

(Jiang et al., 24 Jan 2025)

In MedBench v4 (Agent Track), top agents reach 85.3/100 average task score, with safety rising from 18.4/100 for vanilla LLMs to 88.9/100 via explicit agentic controls (Ding et al., 18 Nov 2025). MedAgentSim measures 10–15% absolute accuracy gains from explicit memory and CoT/ensembling pipelines (Almansoori et al., 28 Mar 2025).

4. Algorithmic Frameworks and Agent Orchestration

MedAgentBench supports diverse agentic paradigms:

  • Zero-shot Planning: Agent LLMs solve tasks by following prompt templates and context windows, demonstrating planning ability without additional fine-tuning (Jiang et al., 24 Jan 2025).
  • Chain-of-thought (CoT), Multi-persona, and Self-consistency: Used in MedAgentsBench and MedAgentBoard for multi-step reasoning and ensembling among multiple agent roles, promoting robust consensus solutions (Tang et al., 10 Mar 2025, Zhu et al., 18 May 2025).
  • Explicit Tool Managers and Safety Layers: Pipelines in MedBench v4 parse prompts, decompose plans, dispatch tool/API calls, store subgoal context, and invoke safety/governance filters (e.g., dosing checks, adversarial prompt detection, escalation protocols) (Ding et al., 18 Nov 2025).
  • Multi-Agent Collaboration: MedAgentSim and MedAgentBoard implement multi-agent dialog, majority voting, round-table debate, and role-specific pipelines (e.g., planner, coder, validator), demonstrating benefits in workflow automation and complex code generation but not reliably in standalone QA or EHR prediction (Zhu et al., 18 May 2025, Almansoori et al., 28 Mar 2025).
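
The majority-voting step used in these ensembles can be sketched in a few lines; the first-seen tie-break is an implementation assumption:

```python
from collections import Counter

def majority_vote(answers):
    """Consensus answer across agent roles; ties resolve to the answer
    that appeared first, making the result deterministic."""
    counts = Counter(answers)
    best = max(counts.values())
    for answer in answers:  # iterate in arrival order for the tie-break
        if counts[answer] == best:
            return answer
```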

MedAgentBench agent interfaces expose FHIR/API functions as robust JSON schemas, with iterated GET/POST cycles and explicit output-form enforcement for reliable downstream integration (Jiang et al., 24 Jan 2025).

5. Comparative Results and Empirical Analysis

State-of-the-art LLM agents (GPT-4o, Claude 3.5, DeepSeek V3) exhibit success rates of 56–72% in realistic agentic EHR environments. Queries (retrieval) remain consistently easier than actions (write, modify), with closed-weight API models outperforming open-source alternatives (Jiang et al., 24 Jan 2025). In MedBench v4, agent framework orchestration yields 20–25 point gains over vanilla backbone models, especially in safety/ethics (Ding et al., 18 Nov 2025).

Self-evolving agent benchmarks (MedAgentSim) extend these advantages to multi-turn, context-aware diagnostic dialogues and multi-modal scenarios, showing improved accuracy, reduced cognitive bias, and higher reproducibility in open-world tasks (Almansoori et al., 28 Mar 2025). MedAgentsBench differentiates model families on hard, multi-step reasoning, showing open-source thinking models (DeepSeek-R1, o3-mini) reach 30–45% accuracy at an order of magnitude lower cost per sample than closed-source LLMs (Tang et al., 10 Mar 2025).

Multi-agent collaboration frameworks provide incremental gains in workflow automation and complex multi-stage pipelines, particularly for data extraction and reporting, but do not universally outperform advanced single-LLM or conventional methods in QA or EHR prediction domains (Zhu et al., 18 May 2025). Agentic orchestration is most justified for tasks requiring explicit decomposition, tool integration, and iterative validation.

6. Extensions: Federated Learning and Distributed Medical Workflows

FedAgentBench operationalizes agentic LLM orchestration for real-world federated medical image analysis, using server-client agent roles to autonomously coordinate across institutionally separated datasets. Seven specialized agent roles cover task parsing, client selection, data preprocessing, label harmonization, algorithm selection, and federated model training using a registry of 40 FL algorithms (FedAvg, FedProx, SCAFFOLD, Ditto, personalized FL, etc.) (Saha et al., 28 Sep 2025).
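
Of the registry's algorithms, FedAvg is the simplest: a dataset-size-weighted average of client model parameters. A from-scratch sketch, with flat parameter vectors assumed for clarity:

```python
def fedavg(client_weights, client_sizes):
    """FedAvg aggregation: each client's parameters are weighted by its
    local dataset size, then averaged into the global model."""
    total = sum(client_sizes)
    dim = len(client_weights[0])
    return [
        sum(w[i] * n for w, n in zip(client_weights, client_sizes)) / total
        for i in range(dim)
    ]
```

Variants such as FedProx or SCAFFOLD modify the local objective or add control variates, but keep this same server-side aggregation shape.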

FedAgentBench benchmarks 24 LLMs for success rate, client selection metrics (Precision/Recall/F1), data-schema compliance, duplicate removal, label mapping, and training initiation. GPT-4.1 and DeepSeek V3 automate most phases under fine-grained guidance (99–100% and 94% overall success in dermatology, respectively), though label harmonization remains a limiting challenge even for the strongest LLMs. Large, proprietary models consistently outperform mid-sized open-source agents, but guidance granularity and agent architecture/instruction-following are more predictive of success than model size alone.

7. Limitations, Challenges, and Future Directions

Key limitations identified include single-site or regional cohort bias, a task scope focused on EHR and diagnostic workflows (omitting longitudinal care and interprofessional workflows), and the current lack of production-grade security and auditing. Agentic frameworks need further development in multi-modal and full-workflow orchestration (imaging, genomics, longitudinal management), compliance auditing, adversarial robustness, and resource-adaptive planning (Jiang et al., 24 Jan 2025, Ding et al., 18 Nov 2025, Saha et al., 28 Sep 2025).

Future work will expand MedAgentBench to additional specialties (e.g., surgery, nursing), integrate advanced agentic strategies (hierarchical planners, knowledge graph grounding), adapt to prospective validation in real clinical settings, and extend functional coverage to include privacy audit agents, dynamic resource orchestration, and human–agent cooperative workflows.

MedAgentBench establishes a rigorous, extensible foundation for measurable progress in agentic medical AI, acting as a reference for researchers, vendors, and regulatory stakeholders in certifying medical LLM agents for clinical deployment (Jiang et al., 24 Jan 2025, Ding et al., 18 Nov 2025, Almansoori et al., 28 Mar 2025, Tang et al., 10 Mar 2025, Zhu et al., 18 May 2025, Saha et al., 28 Sep 2025).
