
Computer-Use Agent (CUA)

Updated 18 April 2026
  • Computer-Use Agent (CUA) is an autonomous system that interprets user goals and issues low-level GUI actions using LLM reasoning and visual grounding.
  • CUAs automate end-to-end business processes by coordinating heterogeneous workflows and adapting robustly to varied UI environments.
  • Benchmark evaluations reveal a steep performance drop in complex workflows, highlighting challenges in memory management, hierarchical planning, and security.

A Computer-Use Agent (CUA) is an autonomous software system that operates graphical user interfaces (GUIs) by issuing low-level actions such as mouse movements, clicks, keystrokes, drags, and scrolls, analogous to a human user but powered by LLM reasoning and/or vision-based grounding. CUAs generalize across traditional applications and operating systems, targeting the automation of end-to-end business processes, coordination across heterogeneous workflows, and robust adaptation to UI and environmental variations. Their emergence marks a paradigm shift in human-computer interaction, but also surfaces new dimensions in benchmarking, architecture, operational robustness, and security (Cristescu et al., 21 Nov 2025, Sumyk et al., 11 Mar 2026, Liao et al., 28 May 2025, Mohanbabu et al., 10 Feb 2026, Lin et al., 19 Nov 2025, Zhang et al., 19 Jan 2026, Mei et al., 24 May 2025, Bonatti et al., 31 Jan 2025, Luo et al., 8 Oct 2025, Jones et al., 9 Feb 2026, Sumyk et al., 25 Nov 2025, Chen et al., 28 Jan 2026, Wang et al., 12 Aug 2025, Aggarwal et al., 7 Apr 2026, Ren et al., 14 Oct 2025, Feizi et al., 10 Nov 2025, Sun et al., 6 Aug 2025, Jones et al., 7 Jul 2025, Lee et al., 19 Feb 2026, Chen et al., 16 May 2025).

1. Formal Definition and Fundamental Capabilities

A CUA is an autonomous agent that interprets a user-level goal g and, given access to a set of applications whose state is exposed through visual or DOM representations, produces a sequence of low-level simulated GUI actions (e.g., click, type, scroll) to achieve g, without further human intervention. Formally, CUAs are often cast as partially observable Markov decision processes (POMDPs), specified in standard notation by the tuple (S, A, O, T, Ω, R): S is the set of environment (UI) states, A the set of GUI actions, O the set of observations (screenshots, DOM or accessibility-tree snapshots), T(s′ | s, a) the transition dynamics, Ω(o | s′, a) the observation function, and R(s, a) the reward signaling progress toward g.
CUA system architectures generally follow a three-stage pipeline:

  1. Perception: Ingest screen pixels and/or structured data (DOM, accessibility trees), employing computer vision, OCR, and UI element detection.
  2. Planning/Reasoning: LLMs or multimodal models parse user intent, decompose tasks into subgoals, and generate stepwise action sequences using internal memory and chain-of-thought reasoning.
  3. Action/Execution: Output low-level HID events, monitor environment state, and adapt subsequent actions via feedback and error recovery routines (Sumyk et al., 11 Mar 2026, Cristescu et al., 21 Nov 2025, Lin et al., 19 Nov 2025, Zhang et al., 19 Jan 2026, Wang et al., 12 Aug 2025, Feizi et al., 10 Nov 2025, Chen et al., 28 Jan 2026, Lee et al., 19 Feb 2026).

Key domains of application include enterprise-grade workflow automation (ERP/CRM/HCM), copy-paste and cross-application data manipulation, multi-step business logic execution, and even acting as autonomous judges in iterative GUI design or self-improvement loops (Cristescu et al., 21 Nov 2025, Lin et al., 19 Nov 2025, Aggarwal et al., 7 Apr 2026).

2. Evaluation Paradigms, Benchmarks, and the "Capability Cliff"

The operational reliability and readiness of CUAs are assessed via domain-specific, multi-layered benchmarks. UI-CUBE (UiPath Computer Use BEnchmark) exemplifies an enterprise-grade approach, comprising 226 tasks across:

  • Tier 1 ("Simple UI Interactions"): 136 atomic control tasks (button clicks, menu selections), systematically varying 22 control types, 27 layouts, and 27 action types.
  • Tier 2 ("Complex Workflows"): 90 scenario-driven tasks including copy-paste, aggregation, form processing, and realistic enterprise context (e.g., Salesforce, SAP, Concur mocks) (Cristescu et al., 21 Nov 2025).

Tasks are executed at enforced resolutions (1024×768, 1920×1080, 3840×2160) to probe grounding robustness. Empirical evaluations reveal a sharp discontinuity, termed the "capability cliff":

  • Simple Interactions: Agent success rates of 66.7%–84.8%, versus 97.9% for humans.
  • Complex Workflows: Agent success rates fall far below human performance, which remains high.
  • The absolute drop in agent success between tiers is substantial (Cristescu et al., 21 Nov 2025).

This cliff highlights categorical limitations not addressable by incremental data scaling or prompting, but rooted in memory management, hierarchical planning, and state coordination.

Major open benchmarks include OSWorld, WindowsAgentArena, CUA-World (10K+ tasks, 200 software environments), and A11y-CUA for accessibility, with evaluation metrics spanning simple task accuracy, efficiency (step count), environment coverage, and error recovery (Cristescu et al., 21 Nov 2025, Sumyk et al., 11 Mar 2026, Aggarwal et al., 7 Apr 2026, Mohanbabu et al., 10 Feb 2026, Wang et al., 12 Aug 2025).
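The metrics named above (task accuracy, step-count efficiency) reduce to simple aggregates over episode logs. The sketch below assumes a hypothetical record format and is not tied to any specific benchmark harness.

```python
from dataclasses import dataclass

@dataclass
class Episode:
    task_id: str
    success: bool
    steps_taken: int
    optimal_steps: int   # reference step count, e.g. from a human demo

def success_rate(episodes):
    """Fraction of tasks completed successfully."""
    return sum(e.success for e in episodes) / len(episodes)

def step_efficiency(episodes):
    """Mean ratio of reference steps to steps taken, over successful
    episodes only; 1.0 means the agent matched the reference trajectory."""
    ok = [e for e in episodes if e.success]
    if not ok:
        return 0.0
    return sum(e.optimal_steps / e.steps_taken for e in ok) / len(ok)
```

Computing these per tier (simple vs. complex) is what exposes the capability cliff described above.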

3. System Architectures, Skill Abstraction, and Design Patterns

Robust CUA architectures employ modular decomposition:

  • Planner: High-level LLM policy for intent interpretation and task decomposition.
  • Grounder: Maps symbolic actions to GUI coordinates using vision or structured environment inspectors.
  • State Manager: Maintains onboard representation of the UI and tracks dynamic state changes.
  • Memory Buffer: Persistent storage for processed items, actions, and intermediate results.
  • Skill Libraries: Abstraction layers (e.g., CUA-Skill) encapsulate reusable and parameterized execution graphs, supporting composition, memory-aware failure recovery, and cross-application generalization (Cristescu et al., 21 Nov 2025, Chen et al., 28 Jan 2026, Lee et al., 19 Feb 2026).
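The modular decomposition above can be made concrete as a set of narrow interfaces. These `Protocol` definitions are an illustrative sketch of the division of responsibility, not the actual API of any cited system.

```python
from typing import Any, Protocol

class Planner(Protocol):
    def decompose(self, goal: str, context: dict) -> list[str]:
        """Interpret intent and break the goal into ordered subgoals."""
        ...

class Grounder(Protocol):
    def locate(self, symbolic_action: str, screen: Any) -> tuple[int, int]:
        """Resolve a symbolic action (e.g. 'click Submit') to coordinates."""
        ...

class StateManager(Protocol):
    def update(self, observation: Any) -> None: ...
    def current(self) -> dict: ...

class MemoryBuffer(Protocol):
    def append(self, item: Any) -> None: ...
    def recall(self, query: str) -> list[Any]: ...
```

Keeping the grounder behind a coordinate-returning interface is what lets a planner swap between vision-based and structured (DOM/accessibility-tree) inspectors without change.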

Emerging paradigms further include intent-aligned plan memory for long-horizon stability (IntentCUA), modular auditor roles for self-correction (as in CUA-auditor and vision-based Judge agents), and orchestration protocols for collaborative agent–coder GUI co-design (Lin et al., 19 Nov 2025, Sumyk et al., 11 Mar 2026, Sumyk et al., 25 Nov 2025).

Skill-based and intent-level abstractions yield substantial gains: CUA-Skill markedly improves best-of-3 success on WindowsAgentArena, and IntentCUA demonstrates strong success rates with high step efficiency, outperforming RL-only systems by wide margins (Chen et al., 28 Jan 2026, Lee et al., 19 Feb 2026). Memory and planning modularity are critical for bridging the capability gap (Cristescu et al., 21 Nov 2025, Zhang et al., 19 Jan 2026, Mei et al., 24 May 2025).

4. Security, Safety, and Adversarial Robustness

CUAs introduce unique security risks stemming from their expanded perception surface (screenshots, DOM, logs), interface-action coupling, and persistent delegation. Systematically cataloged adversarial risks include:

  • UI deception and perceptual mismatch: time-of-check-to-time-of-use (TOCTOU) races, clickjacking via overlays, and perceptual hallucination.
  • Indirect prompt injection: Adversarial content delivered via environment channels (chat, forums) hijacks agent control flow, causing critical actions (e.g., file exfiltration, privilege escalation) (Liao et al., 28 May 2025, Luo et al., 8 Oct 2025, Jones et al., 7 Jul 2025).
  • Remote code execution (RCE) via chained actions: Long-term, multi-step vulnerabilities (e.g., tool orchestration, MIME handler exploits) circumvent naive gating.
  • Chain-of-thought (CoT) exposure: Reasoning output spills to observable logs or external files, violating privacy or leaking context.

Attack Success Rates (ASR) on comprehensive benchmarks (RTC-Bench, AdvCUA) remain high for advanced CUAs even under best-case defenses, and Attempt Rates (AR) can be higher still, exposing insufficient separation between reasoning and actuator modules (Liao et al., 28 May 2025, Luo et al., 8 Oct 2025, Jones et al., 9 Feb 2026).

Contemporary defense strategies include modular simulation-to-real reasoning corrections (e.g., MirrorGuard substantially reduces unsafe action rates while keeping false refusal rates low), explicit provenance tracking, cryptographic delegation verification, runtime planning audits, and ephemeral containerization. Fine-grained adversarial evaluation and risk taxonomies are now considered essential for any CUA deployment in adversarial or high-stakes contexts (Zhang et al., 19 Jan 2026, Jones et al., 7 Jul 2025, Chen et al., 16 May 2025).
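One of the defenses named above, explicit provenance tracking, can be illustrated by tagging every piece of observed content with its origin and refusing to treat environment-derived text as instructions. This is a minimal sketch of the idea, not the implementation of MirrorGuard or any cited system.

```python
from dataclasses import dataclass
from enum import Enum

class Provenance(Enum):
    USER = "user"                # the delegating principal
    SYSTEM = "system"            # the agent's own configuration
    ENVIRONMENT = "environment"  # web pages, chat messages, file contents

@dataclass
class Span:
    text: str
    origin: Provenance

def build_prompt(spans: list[Span]) -> str:
    """Only USER/SYSTEM spans may carry instructions; environment-derived
    text is fenced as inert data, blunting indirect prompt injection."""
    parts = []
    for s in spans:
        if s.origin is Provenance.ENVIRONMENT:
            parts.append(f'<data untrusted="true">{s.text}</data>')
        else:
            parts.append(s.text)
    return "\n".join(parts)
```

A downstream policy would then refuse any planned critical action whose justification traces only to untrusted spans.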

5. Accessibility, Generalization, and Collaborative Control

Research targeting accessibility, such as A11y-CUA, underscores that current CUAs mirror sighted user interaction styles (dominated by point-and-click), failing under keyboard-only or magnifier-mediated conditions typical of blind or low-vision users. CUAs achieve their highest success rates in sighted mode but degrade sharply under keyboard-only and magnifier conditions, in contrast to the near-perfect performance of trained blind and low-vision users (BLVUs) in their native workflows (Mohanbabu et al., 10 Feb 2026). Key gaps are:

  • Perception: Absence of screen-reader state and off-viewport awareness.
  • Cognition: Weak long-horizon state tracking and unreliable final-state checks.
  • Action: Inefficient hotkey utilization and lack of collaborative protocol for dynamic error correction.

Proposed solutions include ingestion of assistive technology signals, adaptation to keyboard-first navigation, explicitly narratable plans, and collaborative overlays enabling BLVUs to interrupt or override agent execution.
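The proposed collaborative overlay can be sketched as an execution loop that narrates each step and checks a user-controlled interrupt flag before acting; the flag mechanism below is an illustrative assumption, not a protocol from the cited work.

```python
import threading

class InterruptibleExecutor:
    """Runs agent actions one at a time, letting a human (e.g. a BLVU
    supervising via screen reader) pause or override between steps."""

    def __init__(self):
        self._paused = threading.Event()

    def pause(self):
        self._paused.set()

    def resume(self):
        self._paused.clear()

    def run(self, actions, execute, narrate=print):
        executed = []
        for action in actions:
            if self._paused.is_set():
                narrate(f"paused before: {action}")  # hand control to the user
                break
            narrate(f"executing: {action}")          # narratable plan step
            execute(action)
            executed.append(action)
        return executed
```

The per-step narration doubles as the "explicitly narratable plan" proposed above, since a screen reader can voice it before each action lands.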

CUAs are also evolving toward collaborative and cross-agent interaction paradigms, as evidenced by frameworks supporting multi-agent planning, intent abstraction, skill sharing, and integrated agent–human/tutor workflows (Lin et al., 19 Nov 2025, Bonatti et al., 31 Jan 2025, Lee et al., 19 Feb 2026, Zhang et al., 19 Jan 2026).

6. Open Challenges, Limitations, and Future Directions

Despite significant progress, CUAs face persistent challenges:

  • Long-horizon generalization: Most approaches still falter on workflows demanding stable memory, adaptive hierarchical planning, and robust state tracking across hundreds of steps and dynamic environments (Cristescu et al., 21 Nov 2025, Aggarwal et al., 7 Apr 2026, Chen et al., 28 Jan 2026).
  • Security–capability trade-off: Defensive measures (e.g., accessibility trees, containerization) may reduce attack surfaces but at the cost of agent efficiency or generality (Liao et al., 28 May 2025, Chen et al., 16 May 2025).
  • Model and evaluation transparency: Benchmark transparency, standard protocols for agent provenance, and third-party auditability remain underdeveloped (Chen et al., 16 May 2025).
  • Continuous learning and adaptation: Few current approaches enable self-improving CUAs via autonomous curriculum generation, experiential reinforcement, or on-the-fly skill abstraction (notable exception: SEAgent) (Sun et al., 6 Aug 2025).
  • Scalability and coverage: Most benchmarks underrepresent multi-modal, long-horizon, and interdisciplinary workflows crucial for economic impact and real-world deployment (Aggarwal et al., 7 Apr 2026, Wang et al., 12 Aug 2025).

Prospective research directions encompass memory-augmented architecture, hierarchical RL for compositional reasoning, meta-learning of interface variation, collaborative human-agent paradigms, and formal safety certification.


References: Primary references include (Cristescu et al., 21 Nov 2025, Sumyk et al., 11 Mar 2026, Liao et al., 28 May 2025, Mohanbabu et al., 10 Feb 2026, Lin et al., 19 Nov 2025, Zhang et al., 19 Jan 2026, Mei et al., 24 May 2025, Bonatti et al., 31 Jan 2025, Luo et al., 8 Oct 2025, Jones et al., 9 Feb 2026, Sumyk et al., 25 Nov 2025, Chen et al., 28 Jan 2026, Wang et al., 12 Aug 2025, Aggarwal et al., 7 Apr 2026, Ren et al., 14 Oct 2025, Feizi et al., 10 Nov 2025, Sun et al., 6 Aug 2025, Jones et al., 7 Jul 2025, Lee et al., 19 Feb 2026, Chen et al., 16 May 2025).
