UI-CUBE: Computer Use Benchmark
- UI-CUBE is a systematic benchmark assessing both functional correctness and operational reliability of Computer Use Agents through 226 tasks spanning simple interactions to complex workflows.
- It employs an execution-based framework with JavaScript oracles and multi-resolution UI variations to rigorously validate task performance and business logic compliance.
- Empirical results reveal a significant performance gap between agent models and human operators, highlighting critical architectural challenges such as memory management and state coordination.
UI-CUBE (UiPath Computer Use BEnchmark) is a systematic benchmark designed to evaluate Computer Use Agents (CUAs) not only on functional correctness, as traditional benchmarks do, but on their operational reliability in enterprise settings. Spanning 226 tasks and two difficulty tiers, UI-CUBE foregrounds the discontinuity between simple user interface (UI) manipulations and real-world business workflows—highlighting architectural limitations inherent in current agent designs and establishing concrete performance ceilings relative to human operators (Cristescu et al., 21 Nov 2025).
1. Motivation and Design Objectives
UI-CUBE was developed to address the insufficiency of existing CUA benchmarks, which predominantly measure whether an agent can eventually execute a target action (e.g., clicking a button) but neglect the system-level attributes demanded by production enterprise environments. These include robustness to unseen UI configurations, variability in screen resolutions, and maintenance of coherent state throughout multi-step workflows. UI-CUBE extends the evaluation axis from mere “task accuracy” to “operational reliability,” defined as the agent’s capacity to repeatedly succeed across systematically varied interfaces and resolutions, and to reliably coordinate multi-stage enterprise procedures. Rigor is ensured by automatically executing oracle tests against the application state, rather than relying on agent self-reports or trajectory- or LLM-based judging.
2. Benchmark Composition and Task Taxonomy
The benchmark encompasses 226 tasks partitioned into two main tiers:
- Tier 1 – Simple UI Interactions: 136 atomic control manipulation tasks, systematically spanning 22 control types (including buttons, combo boxes, and date pickers), 27 layout motifs (forms, tables, modals, wizards, custom), and 27 unique action types (select, navigate, type, scroll, extract, etc.).
- Tier 2 – Complex Workflows: 90 tasks, comprising 50 copy-paste/business-process procedures and 40 enterprise application scenarios directly reflecting real business software (Salesforce, SAP, Concur, Workday, Kanban boards).
Within Tier 2, tasks present advanced requirements such as aggregation, conditional logic, and iterative state tracking, emulating core enterprise functions like sales forecasting or expense report generation. All tasks are executed across three resolution settings (1024×768, 1920×1080, 3840×2160), with these variations frequently producing 10–20 percentage point performance differentials in current agent models.
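For concreteness, a Tier 1 task can be pictured as a point in this design space: a control type, a layout motif, an action type, and the three target resolutions. The descriptor below is a hypothetical illustration of that structure; the field names and identifier format are assumptions, not UI-CUBE's published schema.

```javascript
// Hypothetical task descriptor illustrating the Tier 1 design axes.
// Field names are illustrative; UI-CUBE's actual task schema is not reproduced here.
const exampleTask = {
  id: "tier1-datepicker-042",          // assumed identifier format
  tier: 1,                              // 1 = simple UI interaction, 2 = complex workflow
  controlType: "date-picker",           // one of the 22 control types
  layout: "modal",                      // one of the 27 layout motifs (form, table, modal, wizard, custom)
  action: "select",                     // one of the 27 action types (select, navigate, type, scroll, extract, ...)
  resolutions: ["1024x768", "1920x1080", "3840x2160"], // every task runs at all three resolutions
};
```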
3. Methodological Framework and Validation Protocol
Evaluation in UI-CUBE is governed by an execution-based validation infrastructure:
- Each benchmarked task specifies a JavaScript `test()` oracle function, which inspects a global `window.app_state` object to deterministically verify application conditions post-execution.
- Atomic tasks invoke strict equality checks on fields or selection states.
- Workflow tasks are validated via required submission of structured JSON followed by shape and content assertions.
- Enterprise scenarios enforce business logic (e.g., filtering, batch processing, or error recovery) using semantically targeted, regex-tolerant assertions, thereby capturing both correctness and contextual compliance.
- Systematic UI variation is enforced by parameterizing controls along 8–9 discrete design-space axes (for instance, date-pickers differing in range, embedding context, or input modality).
This process eliminates ambiguity inherent to behavior- or LLM-annotated success, ensuring reproducible and rigorous evaluation.
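As a minimal sketch of this validation style, the JavaScript below shows what such oracles might look like: an atomic-task `test()` applying a strict equality check to `window.app_state`, and a workflow oracle asserting the shape and content of a submitted JSON payload with a regex-tolerant value check. The specific state fields (`datePicker`, `submission`, `lineItems`, `total`) and expected values are invented for illustration.

```javascript
// Illustrative oracles in the style described above; field names are assumptions.

// Atomic task: strict equality check on a single control's post-execution state.
function testAtomic() {
  const state = window.app_state;
  return state.datePicker.selectedDate === "2024-03-15";
}

// Workflow task: validate a submitted structured-JSON payload by shape and content.
function testWorkflow() {
  const submission = window.app_state.submission;        // assumed location of the agent's JSON output
  if (!Array.isArray(submission?.lineItems)) return false; // shape assertion
  if (submission.lineItems.length !== 3) return false;     // expected item count
  // Regex-tolerant content assertion, e.g. tolerating variations in currency formatting.
  return /^\$?1,?250(\.00)?$/.test(String(submission.total));
}
```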
4. Formal Metrics and Evaluation Criteria
Let $N_{\text{simple}} = 136$, $N_{\text{complex}} = 90$, and $N = N_{\text{simple}} + N_{\text{complex}} = 226$. Denote by $S_A(t)$ an agent's successful completions and by $S_H(t)$ those of humans on task type $t \in \{\text{simple}, \text{complex}\}$. Key metrics include:
- Overall agent success rate: $\mathrm{SR}_A = \frac{S_A(\text{simple}) + S_A(\text{complex})}{N}$
- Per-tier agent rates: $\mathrm{SR}_A(t) = \frac{S_A(t)}{N_t}$ for $t \in \{\text{simple}, \text{complex}\}$
- Human baseline (“performance ceiling”): $\mathrm{SR}_H(t) = \frac{S_H(t)}{N_t}$
- Capability-cliff gap: $\Delta_{\text{cliff}} = \mathrm{SR}_A(\text{simple}) - \mathrm{SR}_A(\text{complex})$
- Capability-cliff ratio (relative to humans): $\rho(t) = \frac{\mathrm{SR}_A(t)}{\mathrm{SR}_H(t)}$
These formalisms allow precise quantification of both absolute and normalized agent weaknesses compared to human performance.
5. Empirical Results and Capability-Cliff Analysis
Five state-of-the-art CUAs have been systematically assessed on UI-CUBE. The following table summarizes average results at each tier:
| Model | Simple (%) | Complex (%) |
|---|---|---|
| Claude Computer Use 4.0 | 66.7 | 9.5 |
| OpenAI-computer-use-preview | 70.3 | 10.5 |
| UIPathScreenAgent – Gemini 2.5 Flash | 68.6 | 11.9 |
| UIPathScreenAgent – GPT-5 mini | 77.0 | 18.4 |
| UIPathScreenAgent – GPT-5 | 84.8 | 19.4 |
When compared to the human ceiling (97.9% on simple, 61.2% on complex), mean agent results (averaged over the five models; reproduced in the sketch below) are:
- Absolute cliff: $\Delta_{\text{cliff}} \approx 73.5 - 13.9 \approx 59.5$ percentage points
- Relative to humans:
  - Simple: agents reach $\approx 75\%$ of human performance
  - Complex: agents reach $\approx 23\%$ of human performance
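These aggregates follow directly from the table above; the short script below is a sketch that recomputes them, assuming an unweighted mean over the five evaluated models.

```javascript
// Recompute the aggregate figures from the per-model results table.
// Assumes an unweighted mean over the five evaluated models.
const simpleRates = [66.7, 70.3, 68.6, 77.0, 84.8];   // Tier 1 success rates (%)
const complexRates = [9.5, 10.5, 11.9, 18.4, 19.4];   // Tier 2 success rates (%)
const humanSimple = 97.9, humanComplex = 61.2;        // human baselines (%)

const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

const meanSimple = mean(simpleRates);                 // ≈ 73.5
const meanComplex = mean(complexRates);               // ≈ 13.9
const absoluteCliff = meanSimple - meanComplex;       // ≈ 59.5 percentage points
const relSimple = meanSimple / humanSimple;           // ≈ 0.75 of human performance
const relComplex = meanComplex / humanComplex;        // ≈ 0.23 of human performance

console.log({ meanSimple, meanComplex, absoluteCliff, relSimple, relComplex });
```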
Human evaluators with no prior application familiarity also exhibit performance cliffs, suggesting that the complex tier realistically bounds upper performance for agents lacking substantial domain knowledge.
6. Diagnosed Architectural Limitations
Analysis of error traces and model failures identifies three primary deficiencies contributing to the observed performance cliff:
- Memory management: Inadequate persistent, structured working memory causes agents to lose track of processed items within iterative loops, yielding redundancy or omission.
- Hierarchical planning: Current CUAs lack stable subgoal decomposition, often “drifting” from intended sequence in the presence of unanticipated UI events, such as error dialogs or dynamic layout changes, with insufficient mechanisms for task replanning or context recovery.
- State coordination: Agents exhibit poor cross-resolution grounding, resulting in misaligned actions when confronted with occlusions or altered visual geometry. The absence of safe-exploration and rollback policies leads to trapping in unrecoverable states.
These architectural bottlenecks render the completion of multi-step enterprise workflows brittle; a single error or missed state transition can precipitate total failure, explaining the dramatic drop in complex task success.
7. Implications for Enterprise Deployment and Future Agent Design
UI-CUBE functions as a diagnostic platform, revealing not only where agents fail but also why—pointing towards architectural innovation as the necessary path to enterprise-grade reliability. Recommendations for advancing CUA design to close the gap to the 61% human ceiling on complex workflows include:
- Integration of persistent, structured working memory to track loops, intermediate results, and subtask delineation (a minimal sketch follows this list).
- Deployment of hierarchical planners with explicit subgoal management and robust error recovery logic.
- Advancement of visual grounding to achieve resolution invariance, enabling semantic binding of coordinates across differently scaled environments.
- Embedding comprehensive error-handling subsystems, capable of detecting drift, invoking replanning, and rolling back partial execution.
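As one illustration of the first recommendation, a persistent working-memory component could track which items of an iterative workflow have already been handled, preventing the redundancy and omission failures noted above. The sketch below is a minimal, hypothetical design and is not drawn from the paper.

```javascript
// Minimal sketch of a persistent, structured working memory for iterative workflows.
// Hypothetical design; the paper recommends such a component but does not specify one.
class WorkingMemory {
  constructor() {
    this.processed = new Set();   // identifiers of items already handled
    this.partialResults = [];     // intermediate results accumulated across steps
    this.currentSubgoal = null;   // current position in the subgoal decomposition
  }

  markProcessed(itemId, result) {
    this.processed.add(itemId);
    this.partialResults.push({ itemId, result });
  }

  nextUnprocessed(itemIds) {
    // Returns the first item not yet handled, preventing repetition or omission.
    return itemIds.find((id) => !this.processed.has(id)) ?? null;
  }
}

// Usage: an agent loop consults memory before acting on each item (e.g., expense rows).
const memory = new WorkingMemory();
const rows = ["row-1", "row-2", "row-3"];
let next;
while ((next = memory.nextUnprocessed(rows)) !== null) {
  memory.markProcessed(next, { status: "submitted" }); // placeholder for the actual UI action
}
```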
By quantifying both absolute and human-relative capability discontinuities, UI-CUBE provides a targeted research agenda emphasizing architectural remedies over marginal improvements in training dataset scale or model size (Cristescu et al., 21 Nov 2025).