HealthAdminBench: Evaluating Computer-Use Agents on Healthcare Administration Tasks

Published 10 Apr 2026 in cs.AI | (2604.09937v1)

Abstract: Healthcare administration accounts for over $1 trillion in annual spending, making it a promising target for LLM-based computer-use agents (CUAs). While clinical applications of LLMs have received significant attention, no benchmark exists for evaluating CUAs on end-to-end administrative workflows. To address this gap, we introduce HealthAdminBench, a benchmark comprising four realistic GUI environments: an EHR, two payer portals, and a fax system, and 135 expert-defined tasks spanning three administrative task types: Prior Authorization, Appeals and Denials Management, and Durable Medical Equipment (DME) Order Processing. Each task is decomposed into fine-grained, verifiable subtasks, yielding 1,698 evaluation points. We evaluate seven agent configurations under multiple prompting and observation settings and find that, despite strong subtask performance, end-to-end reliability remains low: the best-performing agent (Claude Opus 4.6 CUA) achieves only 36.3 percent task success, while GPT-5.4 CUA attains the highest subtask success rate (82.8 percent). These results reveal a substantial gap between current agent capabilities and the demands of real-world administrative workflows. HealthAdminBench provides a rigorous foundation for evaluating progress toward safe and reliable automation of healthcare administrative workflows.


Authors (15)

Summary

The paper presents a novel benchmark that rigorously evaluates LLM-based computer-use agents on complex healthcare administration tasks, highlighting a gap between subtask-level and end-to-end performance.
It uses four deterministic interactive environments and 135 curated tasks to simulate real-world workflows such as prior authorization, claims appeals, and DME order processing.
Significant performance differences across agents and subtasks underscore the challenges of multi-step reasoning, information retention, and cross-system coordination in healthcare settings.

HealthAdminBench: Rigorous Evaluation of Computer-Use Agents on Healthcare Administration Tasks

Motivation and Benchmark Design

HealthAdminBench is introduced as a rigorous, comprehensive benchmark for evaluating LLM-based computer-use agents (CUAs) on real-world healthcare administrative workflows (2604.09937). Unlike existing web and enterprise agent benchmarks—which focus on single-site navigation or generic enterprise systems—HealthAdminBench targets the high-value, under-explored domain of healthcare administration. Administrative operations like prior authorization, claims appeals, and durable medical equipment (DME) order processing require complex multi-step and cross-system workflows that are largely absent from the current agent benchmarking landscape.

HealthAdminBench abstracts four deterministic interactive environments modeled after dominant platforms in healthcare administration: an electronic health record (EHR) system, two distinct payer portals, and a faxing system. Tasks are grounded in extensive observational research and validated by domain experts. Task objectives are decomposed into fine-grained, verifiable subtasks, supporting precise evaluation and actionable error analysis. CUAs interact with these environments via either screenshot-based or accessibility-tree (DOM-based) observation modalities, simulating both constrained real-world settings and more tractable research scenarios.

Figure 1: HealthAdminBench evaluation loop: an agent iteratively perceives environment state, selects actions, and interacts with simulated EHR, payer portals, and fax; success is determined by a combination of deterministic and LLM-based verifiers.
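The perceive-act-verify loop in Figure 1 can be sketched in a few lines. This is a minimal illustration, not the benchmark's harness: `ToyPortal`, `run_episode`, `scripted_policy`, and the action names are all hypothetical, and the scripted policy stands in for an LLM agent.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class ToyPortal:
    """Hypothetical stand-in for one simulated environment; it only logs actions."""
    log: List[str] = field(default_factory=list)

    def observe(self) -> str:
        # A real observation would be a screenshot or an accessibility tree.
        return f"{len(self.log)} actions taken"

    def step(self, action: str) -> None:
        self.log.append(action)


def run_episode(env: ToyPortal,
                policy: Callable[[str], str],
                verifiers: List[Callable[[ToyPortal], bool]],
                max_steps: int = 10) -> List[bool]:
    """Perceive, act, repeat; then score the final state with each verifier."""
    for _ in range(max_steps):
        action = policy(env.observe())
        if action == "DONE":  # the agent decides it has finished
            break
        env.step(action)
    return [check(env) for check in verifiers]


def scripted_policy(observation: str) -> str:
    """Placeholder for an LLM agent: replays a fixed two-step plan."""
    plan = ["open_referral", "submit_form"]
    taken = int(observation.split()[0])
    return plan[taken] if taken < len(plan) else "DONE"


results = run_episode(
    ToyPortal(),
    scripted_policy,
    verifiers=[
        lambda env: "submit_form" in env.log,      # satisfied by the plan
        lambda env: "upload_document" in env.log,  # never performed
    ],
)
print(results)  # [True, False]
```

The second verifier fails because the scripted plan never uploads a document, which mirrors how a single skipped step shows up as a failed evaluation point.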

Environment and Task Composition

The four web-based environments simulate heterogeneous, schema-constrained interfaces, enforcing realistic input validation and operational logic. The EHR provides granular modules for prior authorization and denials management, while payer portals enforce authentic workflow constraints, including information gating and documentation transfer.

Figure 2: HealthAdminBench environments: simulated EHR, two payer portals, and an eFax system emulate standard administrative applications following the REAL framework.

Collectively, HealthAdminBench defines 135 expertly curated tasks spanning three broad administrative categories:

Prior Authorization (60 tasks): Ranging from eligibility checks to complex multi-system submissions.
Appeals and Denials Management (60 tasks): Encompassing denial review, triage, and appeal documentation.
DME Order Processing (15 tasks): Focusing on cross-environment document retrieval, compliance verification, and supplier submission.

Tasks are further decomposed into 1,698 subtasks, categorized as Information Retrieval, Documentation, Form Completion, Task Resolution, Document Handling, or Clinical Reasoning. Deterministic programmatic verifiers handle structured subtasks, while 521 subtasks require rubric-driven LLM evaluation, which has demonstrated human-level agreement.
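One way to picture the split between programmatic and rubric-based verification is a dispatcher keyed on the subtask kind. The sketch below is an assumption about structure, not the paper's implementation: `Subtask`, `verify`, the field names, and the example values are hypothetical, and `keyword_judge` is a trivial placeholder where a rubric-driven LLM call would go.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class Subtask:
    name: str           # key of the value to inspect in the final state
    kind: str           # "deterministic" or "rubric" (hypothetical labels)
    expected: str = ""  # exact value for deterministic checks
    rubric: str = ""    # comma-separated criteria for rubric checks


def keyword_judge(rubric: str, answer: str) -> bool:
    """Placeholder for an LLM judge: passes if every rubric phrase appears."""
    phrases = [p.strip().lower() for p in rubric.split(",")]
    return all(p in answer.lower() for p in phrases)


def verify(subtask: Subtask,
           final_state: Dict[str, str],
           judge: Callable[[str, str], bool]) -> bool:
    """Route a subtask to an exact-match check or to the rubric judge."""
    value = final_state.get(subtask.name, "")
    if subtask.kind == "deterministic":
        return value == subtask.expected
    return judge(subtask.rubric, value)


# Hypothetical final state after an appeal workflow.
final_state = {
    "member_id": "A123",
    "appeal_letter": "Appeal cites medical necessity and rebuts the denial reason.",
}
subtasks = [
    Subtask("member_id", "deterministic", expected="A123"),
    Subtask("appeal_letter", "rubric", rubric="medical necessity, denial reason"),
]
outcomes = [verify(s, final_state, keyword_judge) for s in subtasks]
print(outcomes)  # [True, True]
```

Keeping the judge as an injected callable makes it easy to swap the placeholder for a real model call without touching the deterministic path.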

Agent Evaluation, Metrics, and Findings

Seven agent configurations based on five leading LLMs (Claude Opus 4.6, GPT-5.4, Gemini 3.1 Pro, Kimi K2.5, and Qwen-3.5-27B) are evaluated using both a standardized harness and providers’ native CUA implementations. The benchmark enforces a strict binary definition for task success: all (not merely some) subtasks must be completed correctly for a workflow to be considered successful, mirroring the unforgiving requirements of real healthcare administration.
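The strict all-subtasks rule and the subtask-level rate can be computed from the same per-subtask outcomes. The sketch below uses made-up results for three hypothetical tasks; the function names are illustrative and not taken from the benchmark's code.

```python
from typing import Dict, List, Tuple


def task_success(subtask_results: List[bool]) -> bool:
    """Strict rule: a task counts only if every one of its subtasks passed."""
    return bool(subtask_results) and all(subtask_results)


def score(results: Dict[str, List[bool]]) -> Tuple[float, float]:
    """Return (task success rate, subtask success rate) over all tasks."""
    n_tasks = len(results)
    n_subtasks = sum(len(r) for r in results.values())
    tasks_passed = sum(task_success(r) for r in results.values())
    subtasks_passed = sum(sum(r) for r in results.values())
    return tasks_passed / n_tasks, subtasks_passed / n_subtasks


# Made-up outcomes: two of the three tasks miss exactly one subtask.
toy_results = {
    "task_a": [True, True, True, True],
    "task_b": [True, True, True, False],
    "task_c": [True, False, True, True],
}
task_rate, subtask_rate = score(toy_results)
print(f"task success: {task_rate:.1%}, subtask success: {subtask_rate:.1%}")
```

In this toy case 10 of 12 subtasks pass, yet only one of three tasks succeeds, which is the same qualitative pattern the benchmark reports.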

Figure 3 presents aggregate agent performance:

The best full-task success rate is 36.3% (Claude Opus 4.6 in CUA mode).
The highest subtask success rate is 82.8% (GPT-5.4 CUA).
There is a consistent gap: agents routinely perform individual subtasks correctly but fail to maintain end-to-end reliability across multi-step, cross-system workflows.

Figure 3: Comparison of agent performance in HealthAdminBench; (a) task-level and (b) subtask-level success rates reveal strong subtask competence but low end-to-end reliability.
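A back-of-the-envelope calculation shows why high subtask accuracy does not translate into high task success under all-or-nothing scoring. The sketch below assumes, purely for illustration, that every task has the average number of subtasks and that subtasks succeed independently with the same probability; neither assumption comes from the paper, and real errors are likely correlated, so this is not a prediction of any agent's score.

```python
N_TASKS = 135        # from the benchmark description
N_SUBTASKS = 1698    # from the benchmark description
BEST_SUBTASK_RATE = 0.828  # highest reported subtask success rate

avg_subtasks_per_task = N_SUBTASKS / N_TASKS


def independent_task_success(p: float, n: float) -> float:
    """Probability that n independent subtasks all succeed, each with probability p."""
    return p ** n


estimate = independent_task_success(BEST_SUBTASK_RATE, avg_subtasks_per_task)
print(f"average subtasks per task: {avg_subtasks_per_task:.2f}")
print(f"all-subtasks-correct probability under independence: {estimate:.1%}")
```

With about 12.6 subtasks per task on average, an 82.8% per-subtask rate compounds to roughly 9% under these simplifying assumptions, which illustrates how quickly strict end-to-end requirements erode per-step accuracy.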

Analysis by Task and Subtask Type

A breakdown by task and subtask type shows that task complexity, required reasoning, and coordination are major determinants of agent performance:

DME Order Processing is less complex and yields the highest success rates.
Appeals and Denials Management tasks—requiring robust clinical reasoning and nuanced workflow navigation—are most challenging.
Information retrieval is generally tractable for all agents, while clinical reasoning, task resolution, and document handling subtasks exhibit substantially lower success rates and higher variance across models.

Prompting and Observation Modality Effects

Ablation studies indicate that both domain-specific prompting (portal guidance) and structured, accessibility-tree-based observations yield significant improvements in reliability and efficiency. The gap between strict end-to-end evaluation and subtask-level success persists even under idealized settings, implying that deficits are not solely a product of prompt or observation limitations but reflect deeper difficulties in long-horizon credit assignment, information retention, and multi-system coordination.

Figure 4: Task success rates across agent, prompting, and observation settings demonstrate the additive benefits of domain guidance and structured observations.

Resource Utilization

Step-count and cost analyses reveal the practical trade-offs of deploying CUAs. Models like Gemini 3.1 Pro tend to give up early on complex tasks, lowering step counts at the expense of task completion, while agents like Claude Opus 4.6 incur substantial cost due to persistent attempts and large input sizes.

Figure 5: Step counts for agents by task difficulty and model, illustrating agent persistence and efficiency characteristics.

Figure 6: API cost analysis (non-CUA models) across task difficulties, reflecting resource intensiveness for complex workflows.

Failure Modes and Error Analysis

Qualitative auditing reveals three predominant failure modes:

Hidden long-term dependencies: Agents fail to gather requisite information early and then cannot recover when policy-constrained sequences require backtracking.
Avoidance of file operations: Frequent omission or mishandling of downloads/uploads critically undermines cross-portal coordination.
Information loss over long horizons: Limited context windows and infrequent use of explicit memory lead to omitted values and propagation of early errors.
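The third failure mode concerns values that are seen early and needed much later. One common mitigation, sketched here as an assumption rather than anything the paper evaluates, is an explicit scratchpad that the agent writes to when it first sees a value and re-reads before filling a form. The `Scratchpad` class, its methods, and the example values are hypothetical.

```python
from typing import Dict, Tuple


class Scratchpad:
    """Hypothetical explicit memory: stores each fact with the system it came from."""

    def __init__(self) -> None:
        self._facts: Dict[str, Tuple[str, str]] = {}

    def remember(self, key: str, value: str, source: str) -> None:
        self._facts[key] = (value, source)

    def recall(self, key: str) -> str:
        # Raising on a missing key makes a forgotten value an explicit error
        # instead of a silently blank form field.
        if key not in self._facts:
            raise KeyError(f"'{key}' was never recorded; revisit the source system")
        return self._facts[key][0]

    def render(self) -> str:
        """Serialize the facts so they can be re-inserted into the agent's prompt."""
        return "\n".join(
            f"{key} = {value} (from {source})"
            for key, (value, source) in self._facts.items()
        )


pad = Scratchpad()
pad.remember("member_id", "A123", source="EHR")
pad.remember("auth_number", "PA-0001", source="payer portal")
print(pad.render())
```

Re-rendering the scratchpad at each step keeps critical values in view regardless of how much earlier context has been truncated.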

Impact of Domain-Specific Training

Fine-tuning an open model (Qwen-3.5-27B) on 100 HealthAdminBench trajectories yields a 23-percentage-point absolute improvement in held-out task success, even exceeding closed-source competitors. This suggests that explicit exposure to domain workflows, even at modest scale, can substantially improve CUA task success, though the experiment’s limited scale warrants caution in generalizing.

Toward Reliable Administrative Automation: Implications and Outlook

HealthAdminBench sets a new standard for agent evaluation in healthcare administration, surfacing the persistent and quantitatively large reliability gap between step-level and end-to-end workflow automation. The findings have several implications:

Research Implications: Multi-system memory, explicit state management, and cross-application coordination represent rich directions for advancing CUA sequence modeling. The fine-tuning result suggests substantial headroom for domain adaptation, and the observed persistent error modes reveal that stateful planning, workflow grounding, and robust handling of non-stationary environments remain open research areas.
Practical Implications: Despite high subtask-level accuracy, CUAs are not deployment-ready for safety-critical and high-liability healthcare settings—unsupervised automation of administrative tasks could result in operational and legal risk due to brittle error recovery, file-handling lapses, and non-compliance with evolving workflows.
Future Directions: Progress on HealthAdminBench will require integration of external memory, continual learning over workflow data, adaptive prompting, and improved recovery from execution errors. Future benchmarks could extend to include real interface drift, policy evolution, and active deployment constraints (MFA, CAPTCHAs), as well as more nuanced partial credit and reward shaping for reinforcement learning.

Figure 7: Subtask success rates across prompting/observation regimes: strong performance improvements are achieved with structured representations and domain guidance.

Conclusion

HealthAdminBench provides an actionable, fine-grained, and reproducible testbed for CUA development in one of the costliest and most error-prone domains of the U.S. economy. The benchmark’s quantitative results and the limitations they reveal underscore the gap between current CUA competence and the rigorous demands of real-world healthcare administration. This work is intended to serve as a foundation for developing safer, more reliable, and ultimately deployable administrative AI systems.
