- The paper introduces a multidimensional, EHR-grounded evaluation framework that systematically assesses LLMs’ clinical consultation skills across eight defined competence dimensions.
- The methodology utilizes structured and unstructured data from MIMIC-III to generate patient-specific knowledge graphs, enabling realistic, zero-shot simulations of physician-patient dialogues.
- Findings reveal significant trade-offs where models exhibit high conversational fluency but struggle with ambiguous responses, diagnostic accuracy, and medication safety, underscoring the need for human oversight.
EHR-Grounded Multidimensional Assessment of LLMs in Clinical Consultation: An Analysis of "AIPatient Arena" (2606.17474)
Introduction
The paper "AIPatient Arena: EHR-grounded evaluation of LLMs in end-to-end clinical consultation workflows" introduces a workflow-oriented framework for evaluating LLMs in simulated clinical consultations. Current medical AI benchmarks rely predominantly on static, often single-turn or multiple-choice tasks, which do not reflect the sequential, ambiguous, and interactive demands of real consultations of the kind documented in electronic health records (EHRs). This work operationalizes an EHR-grounded, knowledge-graph-based evaluation protocol, the AIPatient Arena, which systematically interrogates the clinical process, not just outcome accuracy, across eight explicit dimensions of clinical competence.
Framework Architecture and Methodology
AIPatient Arena builds on structured and unstructured EHR data, primarily leveraging MIMIC-III records, to generate patient-specific knowledge graphs representing symptoms, medical history, treatments, lab values, and other salient clinical entities. These knowledge graphs instantiate highly realistic, complex clinical profiles for simulated consultations. The framework orchestrates interactions where LLMs act as virtual physicians in multi-turn dialogs, executing core tasks: clinical communication, information gathering, summarization, diagnosis, and treatment plan generation.
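As a rough illustration of the patient-specific knowledge graph described above, the sketch below stores a profile as relation-to-value edges and answers the kind of lookup a simulated patient would need. The class name, relation labels, and example record are hypothetical; they are not taken from the paper or from MIMIC-III.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class PatientGraph:
    """Toy patient knowledge graph: (patient) -[relation]-> value edges."""
    patient_id: str
    edges: dict = field(default_factory=lambda: defaultdict(list))

    def add(self, relation: str, value: str) -> None:
        # Store each fact once, preserving insertion order.
        if value not in self.edges[relation]:
            self.edges[relation].append(value)

    def query(self, relation: str) -> list:
        # Unknown relations return an empty list, which a simulated
        # patient could verbalize as an uncertain answer.
        return list(self.edges.get(relation, []))


# Hypothetical record, invented for illustration only.
graph = PatientGraph("demo-001")
graph.add("HAS_SYMPTOM", "chest pain")
graph.add("HAS_SYMPTOM", "shortness of breath")
graph.add("HAS_HISTORY", "hypertension")
graph.add("HAS_SYMPTOM", "chest pain")  # duplicate is ignored

print(graph.query("HAS_SYMPTOM"))
print(graph.query("HAS_ALLERGY"))
```

Keeping the graph as plain edges makes it easy to check afterwards which facts the virtual physician actually elicited during the dialogue.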
Clinical competence is dissected along eight axes:
- Medical Interview Questioning Skills (QS)
- Information Coverage (IC)
- Handling of Ambiguous Responses (HR)
- Ethical and Professional Conduct (ET)
- Clarity and Transparency of Explanations (EX)
- Information Integration (II)
- Diagnostic Accuracy and Reasoning (Dx)
- Medication Safety and Justification (MS)
Each dimension is anchored by well-defined failure patterns, such as repetitive questioning, omission of past medical history, inadequate clarification, unsafe drug recommendations, and lack of empathy, among others. The evaluation process is rubric-driven and incorporates penalty schemes for each failure pattern, facilitating granular process analysis and cross-model comparisons.
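A penalty-based rubric of this kind can be sketched as follows. This is a minimal illustration only: the penalty magnitudes, the 1-to-5 score bounds, and the function name are assumptions, since the summary above does not give the paper's actual values.

```python
# Hypothetical penalty table: failure pattern -> points deducted.
# The failure names come from the text; the magnitudes are invented.
PENALTIES = {
    "repetitive_questioning": 1.0,
    "omitted_past_medical_history": 1.5,
    "inadequate_clarification": 1.0,
    "unsafe_drug_recommendation": 2.0,
    "lack_of_empathy": 0.5,
}

MAX_SCORE = 5.0  # assumed ceiling of the rubric scale
MIN_SCORE = 1.0  # assumed floor of the rubric scale


def score_dimension(observed_failures):
    """Start from the maximum and subtract one penalty per distinct failure."""
    deduction = sum(PENALTIES[f] for f in set(observed_failures))
    return max(MIN_SCORE, MAX_SCORE - deduction)


print(score_dimension([]))  # 5.0
print(score_dimension(["repetitive_questioning", "lack_of_empathy"]))  # 3.5
print(score_dimension([
    "unsafe_drug_recommendation",
    "omitted_past_medical_history",
    "inadequate_clarification",
]))  # 1.0 (clamped at the assumed floor)
```

Recording which failure patterns fired, not only the final score, is what allows the per-pattern failure rates reported later to be computed.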
LLMs are evaluated in a standardized zero-shot paradigm, with models ranging from general-purpose (e.g., GPT-5.5, Claude 4.x, DeepSeek, Qwen) to medically specialized, each subjected to the same prompt protocol and patient cohorts. Three datasets are used: a 437-patient MIMIC-III-derived main cohort (CCQA), and two out-of-distribution cohorts for external validation.
Numerical Results and Failure Mode Analysis
LLMs demonstrate marked variability in clinical performance profiles across dimensions. Notably:
- High mean scores are consistently observed in interaction-centric metrics—QS (4.43–4.99/5), ET (4.38–4.93/5), and EX (3.80–4.72/5)—indicating robust command of conversational structure and professional demeanor.
- Information Integration (3.19–4.21/5) and Medication Safety (3.13–3.78/5) achieve only intermediate performance.
- Persistent, clinically critical weaknesses are evident in handling ambiguity (HR: 2.57–3.32/5), information coverage (IC: 2.08–3.02/5), and especially diagnostic accuracy and reasoning (Dx: 2.63–3.55/5). Nearly all models fail to address uncertain patient answers in more than 90% of cases.
GPT-5.5 attains the highest overall weighted average (3.76), reflecting balanced (though not complete) clinical proficiency. However, no evaluated LLM approaches the ceiling on every dimension. For example, DeepSeek-V3 scores highest on EX, Qwen3.5 on MS, and HuatuoGPT-o1 on HR, yet all models display severe trade-offs. Model competence across axes exhibits minimal pairwise correlation (< 0.13 in all cases), which shows why multidimensional evaluation is necessary: aggregate or single-axis scores obscure domain-critical failure modes.
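To make the point about aggregates concrete, the following sketch computes a weighted overall score and pairwise Pearson correlations between dimensions. The scores are invented, the equal weights are an assumption (the paper's weighting is not reproduced here), and computing the correlation across models is only one possible reading of how the reported coefficients were obtained.

```python
from itertools import combinations
from math import sqrt

# Invented scores for illustration; these are not results from the paper.
scores = {
    "model_a": {"QS": 5.0, "IC": 2.0, "Dx": 2.0},
    "model_b": {"QS": 3.0, "IC": 3.0, "Dx": 3.0},
    "model_c": {"QS": 4.0, "IC": 2.5, "Dx": 3.5},
}
# Equal weights are an assumption made for this sketch.
weights = {"QS": 1.0, "IC": 1.0, "Dx": 1.0}


def weighted_average(dim_scores, weights):
    """Weighted mean over the dimensions present in dim_scores."""
    total = sum(weights[d] for d in dim_scores)
    return sum(dim_scores[d] * weights[d] for d in dim_scores) / total


def pearson(xs, ys):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


overall = {m: weighted_average(s, weights) for m, s in scores.items()}
print(overall)

models = sorted(scores)
for a, b in combinations(weights, 2):
    r = pearson([scores[m][a] for m in models], [scores[m][b] for m in models])
    print(a, b, round(r, 2))
```

In this toy data, model_a and model_b receive the same overall score despite very different profiles, which is the kind of masking a single aggregate produces.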
Process analysis identifies repetitive questioning (up to 52% in HuatuoGPT-o1) and omission of past medical history (rates >90% in several models) as dominant communication failures. Ethical/professional impairments mainly entail lack of empathy, with HuatuoGPT-o1 and several large generalist LLMs showing especially high deficits. Drug recommendation analysis shows that high medication-F1 does not equate to medication safety—highlighting the non-equivalence of outcome-level and process-based metrics.
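The gap between medication-F1 and medication safety can be illustrated with a small sketch. It assumes a set-overlap definition of medication-F1, uses placeholder drug names, and relies on a hypothetical per-patient list of unsafe drugs; none of these details are taken from the paper.

```python
def medication_f1(predicted, reference):
    """Set-overlap F1 between predicted and reference medications."""
    pred, ref = set(predicted), set(reference)
    tp = len(pred & ref)
    if tp == 0:
        return 0.0
    precision = tp / len(pred)
    recall = tp / len(ref)
    return 2 * precision * recall / (precision + recall)


def unsafe_drugs(predicted, unsafe_for_patient):
    """Predicted drugs that appear on the patient's unsafe list, sorted."""
    return sorted(set(predicted) & set(unsafe_for_patient))


# Placeholder names; no real pharmacology is implied.
reference = ["drug_a", "drug_b", "drug_c"]
unsafe_for_patient = ["drug_x"]  # hypothetical, e.g. a documented allergy

plan_safe = ["drug_a", "drug_b", "drug_d"]
plan_unsafe = ["drug_a", "drug_b", "drug_x"]

for name, plan in [("safe", plan_safe), ("unsafe", plan_unsafe)]:
    print(name, round(medication_f1(plan, reference), 3),
          unsafe_drugs(plan, unsafe_for_patient))
```

Both plans overlap the reference equally and so receive the same F1, yet only one of them recommends a drug flagged as unsafe for this patient.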
Importantly, restricting downstream evaluation to "dialogue-sufficient" consultations—where information sufficiency for diagnosis is explicitly satisfied—raises Dx scores across all models (ΔDx: 0.15–0.28); MS, however, remains static, indicating that correct information gathering alone does not guarantee improved medication safety or appropriateness.
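The subset comparison behind ΔDx can be sketched as below, assuming each consultation record carries a boolean sufficiency flag and per-dimension scores. The field names and values are invented for illustration.

```python
from statistics import mean

# Invented consultation records; "sufficient" marks dialogues that
# gathered enough information for diagnosis (hypothetical field name).
consultations = [
    {"dx": 2.0, "ms": 3.0, "sufficient": False},
    {"dx": 4.0, "ms": 3.0, "sufficient": True},
    {"dx": 3.0, "ms": 4.0, "sufficient": True},
    {"dx": 3.0, "ms": 4.0, "sufficient": False},
]


def delta_on_sufficient(records, key):
    """Mean score on the dialogue-sufficient subset minus the overall mean.

    Raises statistics.StatisticsError if no record is dialogue-sufficient.
    """
    overall = mean(r[key] for r in records)
    subset = mean(r[key] for r in records if r["sufficient"])
    return subset - overall


print(delta_on_sufficient(consultations, "dx"))  # 0.5
print(delta_on_sufficient(consultations, "ms"))  # 0.0
```

In this toy data the diagnostic score rises on the sufficient subset while the medication score does not move, mirroring the pattern reported above.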
Human Assessment and External Validity
Expert review confirms high alignment of the rubric-driven automated evaluator with seasoned clinician judgment across all interaction-centric and most reasoning dimensions (median human satisfaction >4.2/5). Score variances increase for reasoning and medication axes, reflecting inherent subjectivity and the high stakes of clinical judgment.
AIPatient Arena exhibits stable ranking and profiling across out-of-distribution datasets (PMC-Patients, PCI), preserving relative model strengths while revealing context-dependent performance shifts, most notably substantially lower IC on psychological counseling records than on general medicine datasets. This stability supports the robustness and discriminatory power of the framework, and the context sensitivity is consistent with clinical realism.
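One common way to quantify ranking stability across cohorts is Spearman's rank correlation. The sketch below is an assumption about how such a check could be done, not the paper's stated procedure; it uses invented scores and the simplified formula that is valid only when there are no ties.

```python
def ranks(scores):
    """Map model -> rank (1 = best), assuming no tied scores."""
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {model: i + 1 for i, model in enumerate(ordered)}


def spearman_rho(scores_a, scores_b):
    """Spearman correlation via 1 - 6*sum(d^2) / (n*(n^2 - 1)).

    Requires the same models in both inputs, at least two models,
    and no ties within either set of scores.
    """
    ra, rb = ranks(scores_a), ranks(scores_b)
    n = len(ra)
    d2 = sum((ra[m] - rb[m]) ** 2 for m in ra)
    return 1 - 6 * d2 / (n * (n ** 2 - 1))


# Invented overall scores on a main cohort and an external cohort.
main_cohort = {"model_a": 3.7, "model_b": 3.4, "model_c": 3.1, "model_d": 2.9}
external = {"model_a": 3.5, "model_b": 3.0, "model_c": 3.2, "model_d": 2.6}

print(spearman_rho(main_cohort, external))
```

A value near 1 indicates that the relative ordering of models is preserved on the external cohort even if absolute scores shift.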
Implications for Evaluation and Deployment
The core empirical claim is that LLMs presently display a dissociation between conversational fluency and robust, clinically consequential reasoning. Over-reliance on endpoint accuracy metrics, such as diagnostic accuracy or medication-level F1, risks overstating real-world readiness; plausible outputs can mask breakdowns in information integration, context management, or treatment safety. The multidimensional, process-oriented rubrics of AIPatient Arena surface model weaknesses that outcome-only metrics hide. For instance, models attaining similar medication-F1 can diverge meaningfully in medication safety, highlighting the multifactorial nature of clinical quality.
From a deployment perspective, current LLMs may augment specific clinical workflow fragments (e.g., structured history-taking, documentation, preliminary summarization, or patient explanation) but should not be deployed for unsupervised autonomous consultation, especially in roles entailing ambiguity resolution, longitudinal context synthesis, or therapeutic decision-making. Human-in-the-loop oversight is essential. Contextual, process-level stress-testing, as enabled by AIPatient Arena, is necessary to align LLM capabilities with clinical task requirements.
Directions for Future Research
AIPatient Arena's architecture suggests several future research directions:
- Subgroup, equity, and fairness analysis: Extension to more diverse patient populations, specialties, and care settings.
- Regression and audit frameworks: Longitudinal model evaluation amid version changes, domain shift, or prompt drift.
- Prospective and continual deployment assessment: Beyond pre-deployment simulation, incorporating real-world post-deployment monitoring and error feedback loops.
- Benchmarking for skill alignment: Mapping rubric failures to concrete model or prompt engineering interventions, data augmentation strategies, or hybrid approaches integrating structured knowledge sources.
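As one possible shape for the regression and audit direction listed above, the sketch below compares per-dimension scores between two model versions and flags drops beyond a tolerance. The function name, the tolerance, and the scores are all hypothetical.

```python
def find_regressions(baseline, candidate, tolerance=0.1):
    """Return {dimension: score change} for drops larger than tolerance."""
    return {
        dim: round(candidate[dim] - baseline[dim], 3)
        for dim in baseline
        if candidate[dim] < baseline[dim] - tolerance
    }


# Invented per-dimension scores for two versions of the same model.
baseline = {"QS": 4.8, "Dx": 3.2, "MS": 3.5}
candidate = {"QS": 4.9, "Dx": 2.9, "MS": 3.45}

print(find_regressions(baseline, candidate))  # {'Dx': -0.3}
```

Checking each dimension separately keeps a gain in conversational scores from hiding a loss in diagnostic or medication-safety scores.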
Conclusion
AIPatient Arena moves the clinical evaluation of LLMs away from static, knowledge-focused benchmarks and toward EHR-grounded, workflow-oriented, multidimensional assessment. The findings draw a clear boundary between conversational proficiency and clinical trustworthiness, indicating that LLMs are not yet fit for autonomous use in genuine patient-facing consultation tasks. Careful, multiparametric stress-testing and detailed failure analysis must precede claims of real-world readiness, and clinical deployment must remain highly circumscribed and supervised. The framework offers a template for process-level model evaluation in healthcare AI and frames the path toward more reliable, safe, and contextually aligned medical LLMs.