Computer-Use Agent (CUA)
- A Computer-Use Agent (CUA) is an autonomous system that interprets user goals and issues low-level GUI actions, driven by LLM reasoning and visual grounding.
- CUAs automate end-to-end business processes by coordinating heterogeneous workflows and adapting robustly to varied UI environments.
- Benchmark evaluations reveal a steep performance drop in complex workflows, highlighting challenges in memory management, hierarchical planning, and security.
A Computer-Use Agent (CUA) is an autonomous software system that operates graphical user interfaces (GUIs) by issuing low-level actions such as mouse movements, clicks, keystrokes, drags, and scrolls, analogous to a human user but powered by LLM reasoning and/or vision-based grounding. CUAs generalize across traditional applications and operating systems, targeting the automation of end-to-end business processes, coordination across heterogeneous workflows, and robust adaptation to UI and environmental variations. Their emergence marks a paradigm shift in human-computer interaction, but also surfaces new dimensions in benchmarking, architecture, operational robustness, and security (Cristescu et al., 21 Nov 2025, Sumyk et al., 11 Mar 2026, Liao et al., 28 May 2025, Mohanbabu et al., 10 Feb 2026, Lin et al., 19 Nov 2025, Zhang et al., 19 Jan 2026, Mei et al., 24 May 2025, Bonatti et al., 31 Jan 2025, Luo et al., 8 Oct 2025, Jones et al., 9 Feb 2026, Sumyk et al., 25 Nov 2025, Chen et al., 28 Jan 2026, Wang et al., 12 Aug 2025, Aggarwal et al., 7 Apr 2026, Ren et al., 14 Oct 2025, Feizi et al., 10 Nov 2025, Sun et al., 6 Aug 2025, Jones et al., 7 Jul 2025, Lee et al., 19 Feb 2026, Chen et al., 16 May 2025).
1. Formal Definition and Fundamental Capabilities
A CUA is an autonomous agent that interprets a user-level goal and, given access to a set of applications whose state is exposed through visual or DOM representations, produces a sequence of low-level simulated GUI actions (e.g., click, type, scroll) to achieve that goal without further human intervention. Formally, CUAs are often cast as partially observable Markov decision processes (POMDPs):
- States capture the full GUI and system configuration (window layouts, DOM, file system).
- Observations represent screenshots, accessibility trees, and extracted metadata.
- Actions represent parameterized GUI operations (e.g., click(x, y), type(text)).
- Transitions describe the result of issuing an action to the environment.
- Policy selects actions conditioned on historical observations and actions.
- Success is defined as reaching a user-specified goal region (Bonatti et al., 31 Jan 2025, Cristescu et al., 21 Nov 2025, Mei et al., 24 May 2025, Sumyk et al., 11 Mar 2026, Wang et al., 12 Aug 2025, Aggarwal et al., 7 Apr 2026, Lin et al., 19 Nov 2025, Luo et al., 8 Oct 2025, Jones et al., 7 Jul 2025, Chen et al., 16 May 2025).
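The bullets above correspond to the standard POMDP tuple; writing the success condition via a goal region is taken directly from the text, while the exact notation below is an illustrative convention:

```latex
\langle \mathcal{S}, \mathcal{A}, \Omega, T, O \rangle, \qquad
T(s_{t+1} \mid s_t, a_t), \qquad
O(o_t \mid s_t), \qquad
\pi(a_t \mid o_{1:t},\, a_{1:t-1}),
```

with success defined as reaching the goal region $G \subseteq \mathcal{S}$, i.e., an episode succeeds iff its terminal state satisfies $s_T \in G$.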
CUA system architectures generally follow a three-stage pipeline:
- Perception: Ingest screen pixels and/or structured data (DOM, accessibility trees), employing computer vision, OCR, and UI element detection.
- Planning/Reasoning: LLMs or multimodal models parse user intent, decompose tasks into subgoals, and generate stepwise action sequences using internal memory and chain-of-thought reasoning.
- Action/Execution: Output low-level HID events, monitor environment state, and adapt subsequent actions via feedback and error recovery routines (Sumyk et al., 11 Mar 2026, Cristescu et al., 21 Nov 2025, Lin et al., 19 Nov 2025, Zhang et al., 19 Jan 2026, Wang et al., 12 Aug 2025, Feizi et al., 10 Nov 2025, Chen et al., 28 Jan 2026, Lee et al., 19 Feb 2026).
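The three-stage pipeline can be sketched as a minimal observe-plan-act loop. All component names here (`perceive`, `plan_next_action`, `run_agent`) and the dict-based observation format are illustrative assumptions, not any specific framework's API:

```python
# Minimal perception -> planning -> action loop for a CUA.
# Component names and data shapes are illustrative stand-ins.

def perceive(screen_pixels, dom=None):
    """Fuse raw pixels and structured data into a single observation."""
    return {"pixels": screen_pixels, "dom": dom}

def plan_next_action(goal, observation, history):
    """Stub planner: declare success once the goal string appears in the DOM;
    otherwise emit a placeholder click action."""
    if observation["dom"] and goal in observation["dom"]:
        return {"op": "done"}
    return {"op": "click", "x": 100, "y": 200}

def run_agent(goal, environment, max_steps=50):
    """Run the loop until the planner reports success or the step budget ends."""
    history = []
    for _ in range(max_steps):
        obs = perceive(*environment.snapshot())
        action = plan_next_action(goal, obs, history)
        if action["op"] == "done":
            return True, history
        environment.execute(action)      # issue the low-level HID event
        history.append((obs, action))    # retained as memory for later steps
    return False, history
```

The feedback-and-recovery aspect is captured by re-perceiving the environment after every action rather than trusting the plan blindly.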
Key domains of application include enterprise-grade workflow automation (ERP/CRM/HCM), copy-paste and cross-application data manipulation, multi-step business logic execution, and even acting as autonomous judges in iterative GUI design or self-improvement loops (Cristescu et al., 21 Nov 2025, Lin et al., 19 Nov 2025, Aggarwal et al., 7 Apr 2026).
2. Evaluation Paradigms, Benchmarks, and the "Capability Cliff"
The operational reliability and readiness of CUAs are assessed via domain-specific, multi-layered benchmarks. UI-CUBE (UiPath Computer Use BEnchmark) exemplifies an enterprise-grade approach, comprising 226 tasks across:
- Tier 1 ("Simple UI Interactions"): 136 atomic control tasks (button clicks, menu selections), systematically varying 22 control types, 27 layouts, and 27 action types.
- Tier 2 ("Complex Workflows"): 90 scenario-driven tasks including copy-paste, aggregation, form processing, and realistic enterprise context (e.g., Salesforce, SAP, Concur mocks) (Cristescu et al., 21 Nov 2025).
Tasks are executed at enforced resolutions (1024×768, 1920×1080, 3840×2160) to probe grounding robustness. Empirical evaluations reveal a discontinuity—or "capability cliff":
- Simple Interactions: agent success rates approach human performance.
- Complex Workflows: agent success rates fall sharply, well below human performance on the same tasks.
- The drop is abrupt rather than gradual, defining the "capability cliff" (Cristescu et al., 21 Nov 2025).
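The enforced resolution sweep (1024×768 through 3840×2160) stresses visual grounding; one generic mitigation is to emit clicks in normalized coordinates and scale them at execution time. This is a sketch of that convention, not the UI-CUBE harness:

```python
# Emit clicks in resolution-independent normalized coordinates and scale
# them to the active display at execution time. A generic sketch only.

RESOLUTIONS = [(1024, 768), (1920, 1080), (3840, 2160)]

def to_pixels(norm_xy, resolution):
    """Map (x, y) in [0, 1]^2 to integer pixel coordinates on the display."""
    nx, ny = norm_xy
    w, h = resolution
    return (round(nx * (w - 1)), round(ny * (h - 1)))
```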
This cliff highlights categorical limitations not addressable by incremental data scaling or prompting, but rooted in memory management, hierarchical planning, and state coordination.
Major open benchmarks include OSWorld, WindowsAgentArena, CUA-World (10K+ tasks, 200 software environments), and A11y-CUA for accessibility, with evaluation metrics spanning simple task accuracy, efficiency (step count), environment coverage, and error recovery (Cristescu et al., 21 Nov 2025, Sumyk et al., 11 Mar 2026, Aggarwal et al., 7 Apr 2026, Mohanbabu et al., 10 Feb 2026, Wang et al., 12 Aug 2025).
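The metrics named above (task accuracy, step-count efficiency) can be aggregated as follows; the exact definitions vary per benchmark, so these formulas are generic conventions rather than any one benchmark's scoring code:

```python
# Aggregate two common CUA benchmark metrics: task success rate and step
# efficiency (reference steps relative to agent steps on solved tasks).
# Generic conventions, not a specific benchmark's official scorer.

def success_rate(results):
    """results: list of (succeeded: bool, steps_taken: int, ref_steps: int)."""
    return sum(1 for ok, _, _ in results if ok) / len(results)

def step_efficiency(results):
    """Mean ratio of reference steps to agent steps over successful episodes.
    1.0 means reference-level efficiency; lower means the agent wandered."""
    solved = [(steps, ref) for ok, steps, ref in results if ok]
    if not solved:
        return 0.0
    return sum(ref / steps for steps, ref in solved) / len(solved)
```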
3. System Architectures, Skill Abstraction, and Design Patterns
Robust CUA architectures employ modular decomposition:
- Planner: High-level LLM policy for intent interpretation and task decomposition.
- Grounder: Maps symbolic actions to GUI coordinates using vision or structured environment inspectors.
- State Manager: Maintains onboard representation of the UI and tracks dynamic state changes.
- Memory Buffer: Persistent storage for processed items, actions, and intermediate results.
- Skill Libraries: Abstraction layers (e.g., CUA-Skill) encapsulate reusable and parameterized execution graphs, supporting composition, memory-aware failure recovery, and cross-application generalization (Cristescu et al., 21 Nov 2025, Chen et al., 28 Jan 2026, Lee et al., 19 Feb 2026).
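A parameterized, composable skill with memory-aware failure recovery can be sketched as below. The class and function names (`Skill`, `run`, `compose`) are illustrative and do not reflect CUA-Skill's actual API:

```python
# Sketch of a reusable, parameterized skill with simple retry-based
# failure recovery and shared memory. Names are illustrative only.

class Skill:
    def __init__(self, name, steps, max_retries=1):
        self.name = name
        self.steps = steps            # callables: (params, memory) -> bool
        self.max_retries = max_retries

    def run(self, params, memory):
        for step in self.steps:
            for _attempt in range(self.max_retries + 1):
                if step(params, memory):
                    break             # step succeeded; move on
            else:
                memory["failed_at"] = step.__name__   # memory-aware recovery hook
                return False
        return True

def compose(name, *skills):
    """Chain skills into a larger execution graph; sub-skills share memory."""
    return Skill(name, [lambda p, m, s=s: s.run(p, m) for s in skills])
```

Recording the failing step in shared memory is what lets a planner resume or repair a workflow instead of restarting it.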
Emerging paradigms further include intent-aligned plan memory for long-horizon stability (IntentCUA), modular auditor roles for self-correction (as in CUA-auditor and vision-based Judge agents), and orchestration protocols for collaborative agent–coder GUI co-design (Lin et al., 19 Nov 2025, Sumyk et al., 11 Mar 2026, Sumyk et al., 25 Nov 2025).
Skill-based and intent-level abstractions yield substantial gains: CUA-Skill improves best-of-3 success rates on WindowsAgentArena, and IntentCUA demonstrates higher task success with better step efficiency, outperforming RL-only systems by wide margins (Chen et al., 28 Jan 2026, Lee et al., 19 Feb 2026). Memory and planning modularity are critical for bridging the capability gap (Cristescu et al., 21 Nov 2025, Zhang et al., 19 Jan 2026, Mei et al., 24 May 2025).
4. Security, Safety, and Adversarial Robustness
CUAs introduce unique security risks stemming from their expanded perception surface (screenshots, DOM, logs), interface-action coupling, and persistent delegation. Systematically cataloged adversarial risks and flaws include:
- UI deception and perceptual mismatch: time-of-check-to-time-of-use (TOCTOU) races, clickjacking via overlays, and perceptual hallucination.
- Indirect prompt injection: Adversarial content delivered via environment channels (chat, forums) hijacks agent control flow, causing critical actions (e.g., file exfiltration, privilege escalation) (Liao et al., 28 May 2025, Luo et al., 8 Oct 2025, Jones et al., 7 Jul 2025).
- Remote code execution (RCE) via chained actions: Long-term, multi-step vulnerabilities (e.g., tool orchestration, MIME handler exploits) circumvent naive gating.
- Chain-of-thought (CoT) exposure: Reasoning output spills to observable logs or external files, violating privacy or leaking context.
Attack Success Rates (ASR) on comprehensive benchmarks (RTC-Bench, AdvCUA) remain high for advanced CUAs even under best-case defenses, and Attempt Rates (AR) are higher still, exposing insufficient separation between reasoning and actuator modules (Liao et al., 28 May 2025, Luo et al., 8 Oct 2025, Jones et al., 9 Feb 2026).
Contemporary defense strategies include modular simulation-to-real reasoning corrections (e.g., MirrorGuard substantially reduces unsafe action rates while keeping false refusal rates low), explicit provenance tracking, cryptographic delegation verification, runtime planning audits, and ephemeral containerization. Fine-grained adversarial evaluation and risk taxonomies are now considered essential for any CUA deployment in adversarial or high-stakes contexts (Zhang et al., 19 Jan 2026, Jones et al., 7 Jul 2025, Chen et al., 16 May 2025).
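Provenance tracking combined with runtime audits can be illustrated with a toy gate that blocks high-risk actions whose motivating context includes untrusted channels (the prompt-injection vector above). The channel labels and risk classes here are illustrative assumptions, not a standardized taxonomy:

```python
# Toy runtime gate: high-risk actions pass only if every channel that fed
# the planner's context is trusted. Labels and risk classes are assumed
# for illustration, not drawn from a real policy standard.

TRUSTED = {"user_prompt", "system_policy"}
HIGH_RISK = {"run_shell", "upload_file", "grant_permission"}

def gate(action, provenance):
    """provenance: set of channels the planner's current context came from."""
    if action["op"] not in HIGH_RISK:
        return True                  # low-risk actions pass through
    return provenance <= TRUSTED     # any untrusted source blocks the action
```

The gate captures the core defensive idea: the actuator refuses to execute a dangerous operation the reasoner requested when that request may have been steered by injected content.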
5. Accessibility, Generalization, and Collaborative Control
Research targeting accessibility, such as A11y-CUA, underscores that current CUAs mirror sighted user interaction styles (dominated by point-and-click), failing under the keyboard-only or magnifier-mediated conditions typical of blind and low-vision users (BLVUs). CUAs perform best in sighted mode but degrade markedly under keyboard-only and magnifier conditions, whereas trained BLVUs achieve near-perfect performance in their native workflows (Mohanbabu et al., 10 Feb 2026). Key gaps are:
- Perception: Absence of screen-reader state and off-viewport awareness.
- Cognition: Weak long-horizon state tracking and unreliable final-state checks.
- Action: Inefficient hotkey utilization and lack of collaborative protocol for dynamic error correction.
Proposed solutions include ingestion of assistive technology signals, adaptation to keyboard-first navigation, explicitly narratable plans, and collaborative overlays enabling BLVUs to interrupt or override agent execution.
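The keyboard-first adaptation above can be sketched as a lookup that prefers hotkey sequences and falls back to the pointer only when no keyboard path exists. The intent vocabulary and hotkey table are illustrative assumptions, not a real assistive-technology mapping:

```python
# Prefer keyboard-first equivalents over point-and-click, falling back to
# the pointer only when no hotkey path exists. The hotkey table is an
# illustrative assumption for this sketch.

HOTKEYS = {
    "open_menu": ["alt"],
    "next_field": ["tab"],
    "activate": ["enter"],
}

def keyboard_first(intent, target_xy):
    """Return a keys action when a hotkey path exists, else a click action."""
    keys = HOTKEYS.get(intent)
    if keys:
        return {"op": "keys", "keys": keys}
    return {"op": "click", "x": target_xy[0], "y": target_xy[1]}
```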
CUAs are also evolving toward collaborative and cross-agent interaction paradigms, as evidenced by frameworks supporting multi-agent planning, intent abstraction, skill sharing, and integrated agent–human/tutor workflows (Lin et al., 19 Nov 2025, Bonatti et al., 31 Jan 2025, Lee et al., 19 Feb 2026, Zhang et al., 19 Jan 2026).
6. Open Challenges, Limitations, and Future Directions
Despite significant progress, CUAs face persistent challenges:
- Long-horizon generalization: Most approaches still falter on workflows demanding stable memory, adaptive hierarchical planning, and robust state tracking across hundreds of steps and dynamic environments (Cristescu et al., 21 Nov 2025, Aggarwal et al., 7 Apr 2026, Chen et al., 28 Jan 2026).
- Security–capability trade-off: Defensive measures (e.g., accessibility trees, containerization) may reduce attack surfaces but at the cost of agent efficiency or generality (Liao et al., 28 May 2025, Chen et al., 16 May 2025).
- Model and evaluation transparency: Benchmark transparency, standard protocols for agent provenance, and third-party auditability remain underdeveloped (Chen et al., 16 May 2025).
- Continuous learning and adaptation: Few current approaches enable self-improving CUAs via autonomous curriculum generation, experiential reinforcement, or on-the-fly skill abstraction (notable exception: SEAgent) (Sun et al., 6 Aug 2025).
- Scalability and coverage: Most benchmarks underrepresent multi-modal, long-horizon, and interdisciplinary workflows crucial for economic impact and real-world deployment (Aggarwal et al., 7 Apr 2026, Wang et al., 12 Aug 2025).
Prospective research directions encompass memory-augmented architecture, hierarchical RL for compositional reasoning, meta-learning of interface variation, collaborative human-agent paradigms, and formal safety certification.
References: Primary references include (Cristescu et al., 21 Nov 2025, Sumyk et al., 11 Mar 2026, Liao et al., 28 May 2025, Mohanbabu et al., 10 Feb 2026, Lin et al., 19 Nov 2025, Zhang et al., 19 Jan 2026, Mei et al., 24 May 2025, Bonatti et al., 31 Jan 2025, Luo et al., 8 Oct 2025, Jones et al., 9 Feb 2026, Sumyk et al., 25 Nov 2025, Chen et al., 28 Jan 2026, Wang et al., 12 Aug 2025, Aggarwal et al., 7 Apr 2026, Ren et al., 14 Oct 2025, Feizi et al., 10 Nov 2025, Sun et al., 6 Aug 2025, Jones et al., 7 Jul 2025, Lee et al., 19 Feb 2026, Chen et al., 16 May 2025).