ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

Published 1 Jun 2026 in cs.AI, cs.CL, cs.ET, and cs.MA | (2606.02568v1)

Abstract: Clinical practice is not the selection of an answer from enumerated options: a physician gathers heterogeneous information incrementally and commits to sequential, irreversible decisions under uncertainty. Static benchmarks cannot probe these properties, and existing interactive medical benchmarks each compromise on at least one of them. We present ClinEnv, an interactive benchmark that evaluates LLMs as attending physicians over real inpatient admissions under a paradigm we term Longitudinal Inpatient Simulation. Each case is automatically constructed into an ordered sequence of decision stages; at every stage the model must actively query four specialized agents before committing to medications, procedures, and diagnoses. ClinEnv scores both what the model decides, through deterministic ontology-grounded matching, and how it gathers information. Across seven models, the strongest reaches only 0.31 decision F1, and outcome quality is sharply decoupled from process quality. Difficulty concentrates in management decisions and later stages, where models recover discharge diagnoses far more reliably than management actions (0.51 vs. 0.17 F1) and continue to issue redundant queries as cases progress. ClinEnv makes this information-acquisition gap, invisible to outcome-only evaluation, directly measurable.

Summary

  • The paper introduces ClinEnv, a multi-stage simulation environment using real EHR data to decompose patient admissions into sequential decision stages, enabling evaluation of LLMs as attending physicians.
  • Its methodology leverages automated case construction via MIMIC-IV and interactive agent querying, with dual metrics that assess both clinical outcomes and process efficiency.
  • Key results reveal a significant performance gap in long-horizon decision-making, underscoring the need for improved multi-step planning and resource-efficient information gathering.

Problem Setting and Motivation

LLMs have achieved notable performance on static multiple-choice and knowledge-centric benchmarks within the medical domain. However, these benchmarks are fundamentally limited in their abstraction of clinical tasks: real-world inpatient practice involves making sequential, irreversible decisions under uncertainty, active information search, and continuous management over long event horizons. Static question answering does not probe the agent's ability to plan, contextualize, or selectively acquire information in temporally extended patient trajectories.

Existing interactive and medical agent benchmarks (e.g., EHRSQL, AgentClinic) offer partial engagement with information seeking or simulated interaction, but none instantiate the full challenge of attending-physician-level longitudinal management grounded in real EHR (Electronic Health Record) data. Benchmarks either collapse clinical reasoning into atomized database operations or restrict ‘ground truth’ to simulated, limited vignettes, thus failing to test process-aware agentic competence in hospital care.

The ClinEnv Benchmark: Architecture and Pipeline

ClinEnv introduces the Longitudinal Inpatient Simulation (LIS) paradigm, which operationalizes a multi-stage, interactive simulation for evaluating LLMs as attending physicians on real-world EHR data. The environment applies the following pipeline:

  1. Automated Case Construction: Using MIMIC-IV, each admission is decomposed via a pipeline into an ordered sequence of decision stages. Decisions are automatically extracted from discharge notes and temporally anchored using structured EHR events, without manual annotation. Each stage is classified by type (diagnosis, medication, procedure, plan) and then enriched deterministically from EHR tables. This process ensures complete traceability of ground-truth actions and highly granular temporal localization.
  2. Interactive Multi-Agent Environment: At each stage, clinical information is siloed between specialized agents (patient, nurse, lab, history), and the model must actively query these sources to gather facts before making management decisions. The environment only exposes agent roles with available context in the current stage.
  3. Dual Evaluation Metrics: Outcome quality is scored deterministically using ontology-grounded schemas (ATC for medications, the ICD hierarchy for diagnoses and procedures) via Hungarian matching, enabling type-specific partial credit. Process quality is quantified by information coverage, efficiency (coverage per agent query), and financial/resource costs (e.g., unnecessary laboratory expenditure).

    Figure 1: Overview of ClinEnv. Admissions are processed into event timelines, segmented by a stage-construction pipeline, and evaluated interactively with agent querying and dual scoring.
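The outcome-scoring step in item 3 can be illustrated with a minimal sketch. It assumes a simple prefix-based similarity between ICD-style codes and finds the best one-to-one pairing by brute force, which gives the same optimum as Hungarian matching on small sets. The function names (`prefix_similarity`, `best_assignment_score`, `soft_f1`) and the similarity rule are hypothetical and are not taken from the ClinEnv implementation.

```python
from itertools import permutations


def prefix_similarity(pred: str, gold: str) -> float:
    """Share of leading characters two codes have in common (assumed rule)."""
    shared = 0
    for p, g in zip(pred, gold):
        if p != g:
            break
        shared += 1
    return shared / max(len(pred), len(gold))


def best_assignment_score(preds: list, golds: list) -> float:
    """Maximum total similarity over one-to-one pairings (brute force)."""
    if not preds or not golds:
        return 0.0
    # Similarity is symmetric, so always permute the longer list.
    small, large = (preds, golds) if len(preds) <= len(golds) else (golds, preds)
    best = 0.0
    for perm in permutations(large, len(small)):
        total = sum(prefix_similarity(a, b) for a, b in zip(small, perm))
        best = max(best, total)
    return best


def soft_f1(preds: list, golds: list) -> float:
    """F1 in which each matched pair contributes its partial-credit score."""
    matched = best_assignment_score(preds, golds)
    precision = matched / len(preds) if preds else 0.0
    recall = matched / len(golds) if golds else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


example_f1 = soft_f1(["I10", "E119"], ["E119", "I10", "N179"])
```

In this example both predictions match a ground-truth code exactly, so precision is 1 and recall is 2/3. For realistically sized code sets, a polynomial-time solver such as `scipy.optimize.linear_sum_assignment` would replace the permutation loop.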

Experimental Protocol, Data, and Composition

ClinEnv is constructed from 3,509 admissions in MIMIC-IV, producing 9,297 decision stages and 26,043 ground-truth decision points (71.7% diagnosis, 21.4% medication, 6.9% procedure). The mean case horizon is 2.65 stages per admission, with cases stratified to cover both short and long admissions.
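The composition figures can be cross-checked with simple arithmetic. The snippet below uses only the numbers quoted above; the per-type counts are approximations implied by the rounded percentages, not counts reported by the paper.

```python
admissions, stages, decisions = 3509, 9297, 26043

# Mean number of decision stages per admission (the "case horizon").
mean_horizon = stages / admissions

# Mean number of ground-truth decision points per stage.
decisions_per_stage = decisions / stages

# Approximate per-type counts implied by the rounded percentages.
shares = {"diagnosis": 0.717, "medication": 0.214, "procedure": 0.069}
approx_counts = {name: round(decisions * share) for name, share in shares.items()}

print(round(mean_horizon, 2), round(decisions_per_stage, 2), approx_counts)
```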

The benchmark supports two evaluation modes: direct (oracle access to all context) and interactive (context is accessible only through agent querying; models begin with zero data and must accumulate observations through explicit actions).
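The difference between the two modes can be sketched as follows. This is an illustrative toy, not the ClinEnv API: the class names `Stage` and `StageSession` are hypothetical, and a query here simply returns every fact held by a role, whereas the real environment mediates access through the specialized agents.

```python
from dataclasses import dataclass, field

AGENT_ROLES = ("patient", "nurse", "lab", "history")


@dataclass
class Stage:
    context: dict  # role name -> list of facts available at this stage


@dataclass
class StageSession:
    stage: Stage
    mode: str = "interactive"
    observed: list = field(default_factory=list)
    queries: int = 0

    def __post_init__(self):
        if self.mode not in ("direct", "interactive"):
            raise ValueError(f"unknown mode: {self.mode}")
        if self.mode == "direct":
            # Oracle access: all context is visible without any query.
            for role in self.available_roles():
                self.observed.extend(self.stage.context[role])

    def available_roles(self) -> list:
        """Only roles that hold context in the current stage are exposed."""
        return [r for r in AGENT_ROLES if self.stage.context.get(r)]

    def query(self, role: str) -> list:
        """Interactive access: each call counts as one query."""
        if role not in self.available_roles():
            raise KeyError(f"role not available in this stage: {role}")
        self.queries += 1
        facts = self.stage.context[role]
        self.observed.extend(f for f in facts if f not in self.observed)
        return facts


stage = Stage({"patient": ["chest pain"], "lab": ["troponin elevated"], "nurse": []})
interactive = StageSession(stage)               # starts with zero data
lab_facts = interactive.query("lab")            # one explicit action
direct = StageSession(stage, mode="direct")     # sees everything up front
```

Tracking `queries` and `observed` per session is what makes process metrics computable alongside the final decisions.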

Figure 2: Distribution of frequent diagnoses, medications, and procedures in structured ground truth across ClinEnv.

Figure 3: Decision density as a function of timeline span. Longer trajectories concentrate information, requiring integration of broader context per decision.

Key Results

Overall Model Performance

Seven LLM variants, including proprietary (GPT-5.4) and open-source (Llama-3.1, Gemma) families, are benchmarked. The strongest model (GPT-5.4) reaches an overall decision F1 of only 0.31, with 0.51 on diagnosis and 0.097 on medication stages. Importantly, outcome quality is decoupled from process quality: Llama-3.1-70B achieves the top medication match but with the lowest information coverage and highest lab waste, while GPT-5.4-nano achieves competitive accuracy with minimal queries and waste.

Long-Horizon and Stage-Wise Analysis

Case difficulty is sharply horizon-sensitive. In long-horizon cases (≥3 decision stages), F1 drops substantially (GPT-5.4: 0.306 → 0.235). Performance also tracks stage index: decision F1 declines monotonically across successive management stages. The cause is not reduced information access (coverage remains stable or climbs) but efficiency decay: models make increasingly redundant or futile queries as the clinical trajectory extends.

Figure 4: Per-stage analysis: Decision F1 declines over stages, information coverage rises, but coverage efficiency collapses—highlighting mounting difficulty in clinical reasoning relative to information access.
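How coverage can rise while efficiency collapses is easiest to see with numbers. The sketch below assumes coverage is the fraction of stage-relevant facts retrieved and efficiency is coverage divided by the number of queries; both definitions and all values are illustrative assumptions rather than the paper's exact formulas or data.

```python
def coverage(observed: set, relevant: set) -> float:
    """Fraction of relevant facts the model has retrieved (assumed definition)."""
    if not relevant:
        return 0.0
    return len(observed & relevant) / len(relevant)


def efficiency(observed: set, relevant: set, n_queries: int) -> float:
    """Coverage gained per query issued (assumed definition)."""
    if n_queries == 0:
        return 0.0
    return coverage(observed, relevant) / n_queries


relevant = {"bp", "troponin", "ecg", "med_history"}

# Hypothetical early stage: 2 queries retrieve 2 of 4 relevant facts.
early = (coverage({"bp", "ecg"}, relevant), efficiency({"bp", "ecg"}, relevant, 2))

# Hypothetical later stage: 6 queries retrieve 3 of 4 relevant facts.
late_obs = {"bp", "ecg", "troponin", "diet"}
late = (coverage(late_obs, relevant), efficiency(late_obs, relevant, 6))
```

In this toy case coverage rises from 0.5 to 0.75 while efficiency halves from 0.25 to 0.125, the same qualitative pattern the per-stage analysis describes.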

Information Seeking and Resource Utilization

A core claim is that clinically targeted information gathering both increases the acquisition of relevant evidence and reduces resource waste. Models with higher coverage systematically achieve a lower wasted laboratory cost ratio, rather than simply ordering more tests.

Figure 5: Information coverage-waste frontier; higher coverage correlates with lower laboratory waste, identifying the desirable frontier for clinically competent agents.
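A wasted laboratory cost ratio can be sketched as the share of total lab spending that went to tests not relevant to the stage. The definition, the function name `wasted_lab_cost_ratio`, and the prices below are assumptions made for illustration; the source does not spell out the exact formula.

```python
def wasted_lab_cost_ratio(ordered: dict, relevant: set) -> float:
    """Cost of irrelevant tests divided by the cost of all ordered tests."""
    total = sum(ordered.values())
    if total == 0:
        return 0.0
    wasted = sum(cost for test, cost in ordered.items() if test not in relevant)
    return wasted / total


# Hypothetical order set with made-up prices.
ordered_tests = {"troponin": 20.0, "cbc": 10.0, "tsh": 30.0}
relevant_tests = {"troponin", "cbc"}
waste = wasted_lab_cost_ratio(ordered_tests, relevant_tests)
```

Under this definition, a model that orders only relevant tests scores 0 regardless of how many it orders, which is consistent with higher coverage and lower waste occurring together.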

Decision Type Decomposition

Diagnosis recovery is achievable (GPT-5.4: 0.51), but management actions remain far harder (medication/procedure F1 ≈ 0.10–0.17). The primary failure is not action selection (e.g., start/stop) but selection of the specific therapeutic agent: the models frequently propose reasonable drug classes but seldom match the ground-truth drugs, even with partial (ATC) credit. Increased information retrieval does not translate to improved management decisions, underscoring the distinction between ‘knowing’ and ‘deciding’ under constraints.
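The gap between a right drug class and a wrong drug can be made concrete with the ATC hierarchy, whose five levels correspond to code prefixes of length 1, 3, 4, 5, and 7. The sketch below assigns credit in proportion to the deepest shared level; that linear weighting is an assumption for illustration, not the scoring rule used by ClinEnv.

```python
ATC_LEVEL_LENGTHS = (1, 3, 4, 5, 7)  # prefix length at each ATC level


def atc_shared_level(pred: str, gold: str) -> int:
    """Deepest ATC level (0 to 5) at which two codes agree."""
    level = 0
    for i, n in enumerate(ATC_LEVEL_LENGTHS, start=1):
        if len(pred) >= n and len(gold) >= n and pred[:n] == gold[:n]:
            level = i
        else:
            break
    return level


def atc_partial_credit(pred: str, gold: str) -> float:
    """Linear credit by shared level (assumed weighting)."""
    return atc_shared_level(pred, gold) / len(ATC_LEVEL_LENGTHS)


METOPROLOL = "C07AB02"
ATENOLOL = "C07AB03"
LISINOPRIL = "C09AA03"

same_class = atc_partial_credit(ATENOLOL, METOPROLOL)        # same subgroup, different drug
different_class = atc_partial_credit(LISINOPRIL, METOPROLOL)  # only the anatomical group matches
exact = atc_partial_credit(METOPROLOL, METOPROLOL)
```

Here a same-class substitution earns 4 of 5 levels while a different cardiovascular class earns 1 of 5, so a model can be pharmacologically reasonable and still fall short of the ground-truth drug.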

Practical and Theoretical Implications

ClinEnv demonstrates that static and outcome-only medical LLM benchmarks substantially overestimate clinical readiness. Benchmarking that ignores information-seeking process and sequential reasoning not only fails to audit resource stewardship but also misjudges real-world medical competence, especially under long time horizons.

Decisive progress in agentic clinical LLMs requires improvement in multi-step planning and context-resolved management selection, not only factual recall or entity extraction. The persistent gap between information retrieval and decision execution underlines the need for research into agentized LLMs capable of sample-efficient strategy acquisition, adaptive plan revision, and explicit cost-benefit reasoning—in effect, models must engage in process-oriented evidence management at parity with practicing clinicians.

Limitations and Future Directions

ClinEnv measures alignment with observed (not always optimal) clinical practice; a competent but alternative management strategy may not be captured by the ground truth. Data is drawn from a single US academic medical center with localized coding and billing practices, warranting evaluation in broader, international hospital systems. Further, while construction automates EHR translation to cases, some typological decisions rely on LLM classifiers, inviting scrutiny for potential pipeline bias.

Prospective research includes (1) training LLM-based agents with integrated process feedback from ClinEnv, (2) developing more granular scoring schemas for sub-decision logic (e.g., diagnostic-therapeutic concordance), and (3) expanding the agent framework to interdisciplinary, team-based clinical scenarios.

Figure 6: Stage-level runtime example: The agent queries information channels before committing a medication action; both decision error and process metrics are recorded.

Conclusion

ClinEnv sets a rigorous new standard for evaluating LLMs as agentic clinicians, emphasizing the dissociation between what models can recognize and what they can decide under realistic, sequential, and information-limited settings. By revealing the persistent gap in simulated management and process efficiency, ClinEnv provides a critical infrastructure for the iterative advancement of clinically aligned, process-aware medical foundation models.
