AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows

Published 16 Jun 2026 in cs.CL and cs.AI | (2606.17474v1)

Abstract: LLMs are increasingly considered for use in clinical consultation tasks, yet most medical evaluations remain static, single-turn, or narrowly outcome-based, limiting their ability to reflect the sequential, uncertain, and interactive nature of real-world care. Here, we propose AIPatient Arena, an EHR-grounded evaluation framework for assessing the clinical utility of LLMs across eight dimensions of clinical competence. The framework integrates EHR data into patient-specific knowledge graphs, enabling multi-turn physician-patient interactions. We applied AIPatient Arena to a primary cohort of 437 patients and two out-of-distribution validation cohorts of 119 and 67 patients. LLMs performed well in medical interview questioning skills (QS; mean scores, 4.43-4.99/5), ethical and professional conduct (ET; 4.38-4.93/5), and clarity and transparency of clinical explanations (EX; 3.80-4.72/5). Performance was moderate in information integration (II; 3.19-4.21/5) and medication safety and justification (MS; 3.13-3.78/5), but persistent weaknesses were observed in handling of ambiguous patient responses (HR; 2.57-3.32/5), information coverage (IC; 2.08-3.02/5), and diagnostic accuracy and reasoning (Dx; 2.63-3.55/5). Process-based evaluation revealed recurrent interaction failures, including repetitive questioning, omission of past medical history, and inadequate handling of uncertainty. Richer conversational context improved diagnostic reasoning but yielded limited gains in treatment planning. These findings indicate that final-answer accuracy alone is insufficient for evaluating clinical readiness and highlight the importance of assessing how models gather, interpret, and communicate information throughout a consultation. AIPatient Arena provides an EHR-grounded framework for workflow-oriented pre-deployment evaluation of medical LLMs.

Authors (14)

Summary

The paper introduces a multidimensional, EHR-grounded evaluation framework that systematically assesses LLMs’ clinical consultation skills across eight defined competence dimensions.
The methodology utilizes structured and unstructured data from MIMIC-III to generate patient-specific knowledge graphs, enabling realistic, zero-shot simulations of physician-patient dialogues.
Findings reveal significant trade-offs where models exhibit high conversational fluency but struggle with ambiguous responses, diagnostic accuracy, and medication safety, underscoring the need for human oversight.

EHR-Grounded Multidimensional Assessment of LLMs in Clinical Consultation: An Analysis of "AIPatient Arena" (2606.17474)

Introduction

The paper "AIPatient Arena: EHR-grounded evaluation of large language models in end-to-end clinical consultation workflows" introduces a workflow-oriented framework for evaluating LLMs in simulated clinical consultations. Current medical AI benchmarks rely predominantly on static tasks, often single-turn or multiple-choice, which do not reflect the sequential, ambiguous, and interactive demands of real-world clinical consultations. This work operationalizes an electronic health record (EHR)-grounded, knowledge-graph-based evaluation protocol, AIPatient Arena, which systematically examines the clinical process, not just outcome accuracy, across eight explicit dimensions of clinical competence.

Framework Architecture and Methodology

AIPatient Arena builds on structured and unstructured EHR data, primarily leveraging MIMIC-III records, to generate patient-specific knowledge graphs representing symptoms, medical history, treatments, lab values, and other salient clinical entities. These knowledge graphs instantiate highly realistic, complex clinical profiles for simulated consultations. The framework orchestrates interactions where LLMs act as virtual physicians in multi-turn dialogs, executing core tasks: clinical communication, information gathering, summarization, diagnosis, and treatment plan generation.
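To make the idea of a patient-specific knowledge graph concrete, the following is a minimal sketch, not the paper's implementation. The class `PatientGraph`, the relation names (`HAS_SYMPTOM`, `HAS_HISTORY`), the patient identifier, and the fallback reply are all hypothetical; the source does not specify the graph schema or how the simulated patient phrases its answers.

```python
from collections import defaultdict


class PatientGraph:
    """Toy patient-specific knowledge graph: relation -> set of entities."""

    def __init__(self, patient_id):
        self.patient_id = patient_id
        self.edges = defaultdict(set)

    def add(self, relation, entity):
        self.edges[relation].add(entity)

    def query(self, relation):
        # Sorted for deterministic output; empty list if nothing is recorded.
        return sorted(self.edges.get(relation, ()))


def answer(graph, relation):
    """Simulated-patient reply grounded only in what the graph records."""
    facts = graph.query(relation)
    if not facts:
        # Hypothetical fallback: the record holds nothing for this relation.
        return "I'm not sure."
    return ", ".join(facts)


# Illustrative record; the entries are invented, not taken from MIMIC-III.
graph = PatientGraph("hypothetical-001")
graph.add("HAS_SYMPTOM", "chest pain")
graph.add("HAS_SYMPTOM", "shortness of breath")
graph.add("HAS_HISTORY", "hypertension")

print(answer(graph, "HAS_SYMPTOM"))
print(answer(graph, "HAS_ALLERGY"))
```

Restricting replies to facts stored in the graph is what keeps such a simulated patient tied to the underlying record instead of free generation.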

Clinical competence is dissected along eight axes:

Medical Interview Questioning Skills (QS)
Information Coverage (IC)
Handling of Ambiguous Responses (HR)
Ethical and Professional Conduct (ET)
Clarity and Transparency of Explanations (EX)
Information Integration (II)
Diagnostic Accuracy and Reasoning (Dx)
Medication Safety and Justification (MS)

Each dimension is anchored by well-defined failure patterns, such as repetitive questioning, omission of past medical history, inadequate clarification, unsafe drug recommendations, and lack of empathy, among others. The evaluation process is rubric-driven and incorporates penalty schemes for each failure pattern, facilitating granular process analysis and cross-model comparisons.
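A penalty-based rubric of this kind can be sketched as follows. This is an illustration under stated assumptions: the penalty values, the assignment of each failure pattern to a dimension, and the 1-to-5 clamping are hypothetical, since the summary does not report the actual rubric weights.

```python
# Hypothetical rubric: dimension -> {failure pattern: penalty}.
# Neither the mapping nor the penalty sizes come from the paper.
RUBRIC = {
    "QS": {"repetitive_questioning": 1.0},
    "IC": {"omitted_past_medical_history": 1.5},
    "HR": {"inadequate_clarification": 1.0},
}


def score_consultation(observed, max_score=5.0, min_score=1.0):
    """Start each dimension at max_score and subtract a penalty for every
    failure pattern observed in the consultation, clamped at min_score."""
    scores = {}
    for dim, penalties in RUBRIC.items():
        total = sum(p for name, p in penalties.items() if name in observed)
        scores[dim] = max(min_score, max_score - total)
    return scores


print(score_consultation({"repetitive_questioning"}))
```

Keeping the observed failure patterns alongside the scores is what allows the per-pattern process analysis described above, in addition to per-dimension comparison.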

LLMs are evaluated in a standardized zero-shot paradigm. The models range from general-purpose (e.g., GPT-5.5, Claude 4.x, DeepSeek, Qwen) to medically specialized, and each is subjected to the same prompt protocol and patient cohorts. Three datasets are used: a 437-patient MIMIC-III-derived main cohort (CCQA) and two out-of-distribution cohorts of 119 and 67 patients for external validation.
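The multi-turn protocol can be pictured as a simple loop in which the same harness is applied to every model. The sketch below is hypothetical: `run_consultation`, the `FINAL:` stop marker, the turn limit, and the scripted stand-ins for the physician model and the simulated patient are assumptions made for illustration, and no real model API is called.

```python
def run_consultation(physician, patient, max_turns=10):
    """Alternate physician and patient turns until the physician emits a
    final assessment (assumed marker 'FINAL:') or the turn limit is hit."""
    transcript = []
    for _ in range(max_turns):
        utterance = physician(transcript)
        transcript.append(("physician", utterance))
        if utterance.startswith("FINAL:"):
            break
        transcript.append(("patient", patient(utterance)))
    return transcript


def scripted_physician(transcript):
    # Stand-in for an LLM: picks the next line based on turns taken so far.
    script = [
        "What brings you in today?",
        "Any past medical history?",
        "FINAL: summary, diagnosis, plan",
    ]
    asked = sum(1 for role, _ in transcript if role == "physician")
    return script[min(asked, len(script) - 1)]


def scripted_patient(question):
    # Stand-in for the graph-backed patient, including one ambiguous reply.
    if "history" in question:
        return "I'm not sure."
    return "Chest pain since yesterday."


transcript = run_consultation(scripted_physician, scripted_patient)
for role, text in transcript:
    print(f"{role}: {text}")
```

Because the physician is just a callable that sees the transcript so far, swapping in a different model leaves the protocol and the patient unchanged, which is the point of a standardized comparison.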

Numerical Results and Failure Mode Analysis

LLMs demonstrate marked variability in clinical performance profiles across dimensions. Notably:

High mean scores are consistently observed in interaction-centric metrics—QS (4.43–4.99/5), ET (4.38–4.93/5), and EX (3.80–4.72/5)—indicating robust command of conversational structure and professional demeanor.
Information Integration (3.19–4.21/5) and Medication Safety (3.13–3.78/5) achieve only intermediate performance.
Persistent, clinically critical weaknesses are evident in handling ambiguity (HR: 2.57–3.32/5), information coverage (IC: 2.08–3.02/5), and diagnostic accuracy and reasoning (Dx: 2.63–3.55/5). Failure to address uncertain patient answers occurs at a rate exceeding 90% across nearly all models.

GPT-5.5 attains the highest overall weighted average (3.76), reflecting balanced (though not complete) clinical proficiency. However, no evaluated LLM approaches ceiling in all dimensions. For example, DeepSeek-V3 maximizes EX, Qwen3.5 excels in MS, and HuatuoGPT-o1 has the highest HR, yet all models display severe trade-offs. Model competence across axes exhibits minimal pairwise correlation (< 0.13 in all cases), evidencing the necessity of multidimensional evaluation; aggregate or uniaxial scores obscure domain-critical failure modes.
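The two statistics mentioned here, a weighted overall score and pairwise correlation between dimensions, can be sketched as below. The dimension weights are not given in this summary, so equal weights are assumed, and it is not stated whether the correlation is computed per consultation or per model; the functions and the sample numbers are illustrative only.

```python
from math import sqrt


def weighted_average(scores, weights):
    """Weighted mean over the dimensions present in `scores`."""
    total_weight = sum(weights[d] for d in scores)
    return sum(scores[d] * weights[d] for d in scores) / total_weight


def pearson(xs, ys):
    """Pearson correlation between two equally long score lists."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Invented scores for one model on two dimensions, equal weights assumed.
overall = weighted_average({"QS": 5.0, "Dx": 3.0}, {"QS": 1.0, "Dx": 1.0})
print(overall)

# Invented per-consultation scores on two dimensions.
print(pearson([4.0, 5.0, 4.5, 5.0], [3.0, 2.5, 3.5, 3.0]))
```

A correlation near zero between two dimensions means that knowing a score on one says little about the other, which is why a single aggregate can hide a weak dimension.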

Process analysis identifies repetitive questioning (up to 52% in HuatuoGPT-o1) and omission of past medical history (rates >90% in several models) as dominant communication failures. Ethical/professional impairments mainly entail lack of empathy, with HuatuoGPT-o1 and several large generalist LLMs showing especially high deficits. Drug recommendation analysis shows that high medication-F1 does not equate to medication safety—highlighting the non-equivalence of outcome-level and process-based metrics.
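The gap between medication-level F1 and medication safety can be illustrated with a small set-based calculation. This is a sketch under assumptions: the paper's exact F1 definition and its safety criteria are not given here, the drug names are placeholders, and the `contraindicated` set is a hypothetical stand-in for whatever safety check the rubric applies.

```python
def medication_f1(predicted, reference):
    """Set-based F1 between recommended and reference medications."""
    pred, ref = set(predicted), set(reference)
    true_pos = len(pred & ref)
    if true_pos == 0:
        return 0.0
    precision = true_pos / len(pred)
    recall = true_pos / len(ref)
    return 2 * precision * recall / (precision + recall)


def unsafe_recommendations(predicted, contraindicated):
    """Recommended drugs that are unsafe for this (hypothetical) patient."""
    return sorted(set(predicted) & set(contraindicated))


# Placeholder names only; not clinical guidance.
reference = ["drug_a", "drug_b", "drug_c", "drug_d"]
predicted = ["drug_a", "drug_b", "drug_c", "drug_d", "drug_e"]
contraindicated = ["drug_e"]

print(round(medication_f1(predicted, reference), 3))
print(unsafe_recommendations(predicted, contraindicated))
```

In this toy case the overlap score is high because every reference drug is recovered, yet the single extra recommendation is the one that matters for safety, which an outcome-level metric alone would not reveal.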

Importantly, restricting downstream evaluation to "dialogue-sufficient" consultations—where information sufficiency for diagnosis is explicitly satisfied—raises Dx scores across all models (ΔDx: 0.15–0.28); MS, however, remains static, indicating that correct information gathering alone does not guarantee improved medication safety or appropriateness.
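The subset comparison behind ΔDx amounts to a difference of means, sketched below. The record layout (`dx`, `sufficient`) and the sample values are hypothetical, and the sketch assumes ΔDx is the mean Dx on dialogue-sufficient consultations minus the mean Dx on all consultations, which the summary implies but does not define explicitly.

```python
from statistics import mean


def delta_dx(consultations):
    """Mean Dx on the dialogue-sufficient subset minus mean Dx overall.
    Returns None when no consultation is flagged as sufficient."""
    sufficient = [c["dx"] for c in consultations if c["sufficient"]]
    if not sufficient:
        return None
    return mean(sufficient) - mean(c["dx"] for c in consultations)


# Invented scores for three consultations by one model.
consultations = [
    {"dx": 2.0, "sufficient": False},
    {"dx": 3.0, "sufficient": True},
    {"dx": 4.0, "sufficient": True},
]
print(delta_dx(consultations))
```

Running the same computation on MS scores would show whether medication safety moves with information sufficiency; according to the results above, it does not.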

Human Assessment and External Validity

Expert review confirms high alignment of the rubric-driven automated evaluator with seasoned clinician judgment across all interaction-centric and most reasoning dimensions (median human satisfaction >4.2/5). Score variances increase for reasoning and medication axes, reflecting inherent subjectivity and the high stakes of clinical judgment.

AIPatient Arena exhibits stable ranking and profiling across out-of-distribution datasets (PMC-Patients, PCI), preserving relative model strengths and revealing context-dependent performance shifts—most notably, substantially lower IC on psychological counseling records versus general medicine datasets. This stability substantiates the robustness and discriminatory power of the framework, while context sensitivity confirms clinical realism.
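One common way to quantify ranking stability across cohorts is Spearman's rank correlation between the model orderings. The summary does not say which statistic the authors used, so the sketch below is an assumption, with placeholder model names and invented scores, and it uses the no-ties form of the formula.

```python
def ranks(scores):
    """Rank models from best (1) to worst by score; assumes no ties."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {name: position + 1 for position, name in enumerate(ordered)}


def spearman(scores_a, scores_b):
    """Spearman rank correlation for two score tables over the same models."""
    rank_a, rank_b = ranks(scores_a), ranks(scores_b)
    n = len(rank_a)
    d_squared = sum((rank_a[m] - rank_b[m]) ** 2 for m in rank_a)
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Invented overall scores on a primary and an out-of-distribution cohort.
primary = {"model_a": 3.7, "model_b": 3.4, "model_c": 3.1, "model_d": 2.9}
external = {"model_a": 3.5, "model_b": 3.0, "model_c": 3.2, "model_d": 2.6}

print(spearman(primary, external))
```

A value near 1 indicates that relative model strengths are preserved even when absolute scores shift between cohorts.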

Implications for Evaluation and Deployment

The core empirical claim is that LLMs presently display a dissociation between conversational fluency and robust, clinically consequential reasoning. Over-reliance on endpoint accuracy metrics, such as diagnostic accuracy or medication-level F1, risks overstating real-world readiness; plausible outputs can mask breakdowns in information integration, context management, or treatment safety. The multidimensional, process-oriented rubrics of AIPatient Arena surface model weaknesses that outcome-only metrics obscure. For instance, a model with high medication-level F1 can still score poorly on medication safety, highlighting the multifactorial nature of clinical quality.

From a deployment perspective, current LLMs may augment specific clinical workflow fragments (e.g., structured history-taking, documentation, preliminary summarization, or patient explanation) but should not be deployed for unsupervised autonomous consultation, especially in roles entailing ambiguity resolution, longitudinal context synthesis, or therapeutic decision-making. Human-in-the-loop oversight is essential. Contextual, process-level stress-testing, as enabled by AIPatient Arena, is necessary to align LLM capabilities with clinical task requirements.

Directions for Future Research

AIPatient Arena's architecture suggests several future research directions:

Subgroup, equity, and fairness analysis: Extension to more diverse patient populations, specialties, and care settings.
Regression and audit frameworks: Longitudinal model evaluation amid version changes, domain shift, or prompt drift.
Prospective and continual deployment assessment: Beyond pre-deployment simulation, incorporating real-world post-deployment monitoring and error feedback loops.
Benchmarking for skill alignment: Mapping rubric failures to concrete model or prompt engineering interventions, data augmentation strategies, or hybrid approaches integrating structured knowledge sources.

Conclusion

AIPatient Arena represents a significant advancement in the clinical evaluation of LLMs—eschewing static, knowledge-focused benchmarks in favor of EHR-grounded, workflow-oriented, multidimensional assessment. The findings draw a clear boundary between conversational proficiency and clinical trustworthiness, revealing that LLMs are not yet fit for autonomous use in genuine patient-facing consultation tasks. Careful, multiparametric stress-testing and detailed failure analysis must precede claims of real-world readiness, and clinical deployment must remain highly circumscribed and supervised. The framework outlined in this paper sets a new standard for model evaluation in healthcare AI and frames the path toward more reliable, safe, and contextually aligned medical LLMs.
