Computer-Using Agents (CUAs)

Updated 9 November 2025
  • Computer-Using Agents (CUAs) are autonomous systems that integrate LLMs, vision, and structured UI inputs to automate complex digital tasks.
  • They employ closed-loop perception, reasoning, and actuation to interpret screens and execute multi-step solutions in real-world digital environments.
  • Recent benchmarks and risk analyses show that while CUAs approach human parity in task execution, they still face challenges in efficiency, security, and context scaling.

Computer-Using Agents (CUAs) are autonomous systems—typically based on LLMs, sometimes augmented with vision or multimodal sensing—that directly interact with real or emulated graphical user interfaces (GUIs) to perform complex digital tasks. Distinguished from traditional chatbots and robotic process automation by their closed-loop perception, reasoning, and actuation over genuine UI environments, CUAs are capable of observing screens, parsing accessibility trees, planning multi-step solutions, executing low-level input events, and integrating feedback to complete diverse tasks on behalf of human users. Recent research formalizes CUAs as part of a rapidly expanding field marked by advances in dataset collection, benchmark design, and risk analysis as agents begin to mediate consequential desktop, web, and mobile workflows.

1. Formal Definitions and Class Architectures

Fundamentally, CUAs are LLM-driven policies $\pi$ operating in a partially observable decision process:

$$o_t = \mathrm{Perception}(s_t), \qquad a_t \sim \pi(\,\cdot \mid h_t), \qquad s_{t+1} = T(s_t, a_t)$$

where $o_t$ is a high-dimensional observation (screenshots, DOM, a11y trees), $h_t$ is the historical trajectory, $a_t$ is a discrete UI action (e.g., click, type, scroll, API call), and $T$ is the environment transition defined by the OS/GUI. CUAs instantiate the classical agent modules (perception, brain/reasoning model, and action), often bound by explicit step budgets and success metrics (e.g., fraction of tasks completed within $N$ steps) (Chen et al., 16 May 2025, Wang et al., 12 Aug 2025).
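
This loop can be made concrete in a few lines of Python. The sketch below is illustrative rather than drawn from any cited system: `run_episode`, `perceive`, `policy`, and `execute` are hypothetical callables standing in for the perception module, the LLM policy, and the OS/GUI transition $T$.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class Step:
    observation: Any  # o_t: screenshot, DOM, and/or a11y tree
    action: dict      # a_t: discrete UI action, e.g. {"kind": "click", "x": 120, "y": 340}

def run_episode(
    perceive: Callable[[], Any],                # o_t = Perception(s_t)
    policy: Callable[[Any, list[Step]], dict],  # a_t ~ pi(. | h_t)
    execute: Callable[[dict], None],            # environment applies T(s_t, a_t)
    is_done: Callable[[], bool],                # task-specific success check
    max_steps: int = 15,                        # explicit step budget N
) -> tuple[bool, list[Step]]:
    """One closed-loop perceive/reason/act episode under a step budget."""
    history: list[Step] = []
    for _ in range(max_steps):
        obs = perceive()                # observe the current screen state
        action = policy(obs, history)   # typically one LLM/VLM call per step
        execute(action)                 # dispatch click/type/scroll/API event
        history.append(Step(obs, action))
        if is_done():
            return True, history        # success within N steps
    return False, history               # budget exhausted; counts as failure
```

The fraction of episodes returning `True` corresponds directly to the within-$N$-steps completion rate used in the cited benchmarks.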

Architecturally, CUAs may integrate:

  • Pure screenshot pipelines (pixels only)
  • Hybrid screenshot + structured UI input (a11y tree, Set-of-Marks, DOM); see the sketch after this list
  • Native API/GUI fusion for interface actions (e.g., UFO²'s multi-agent Host + AppAgent architecture with hybrid control detection (Zhang et al., 20 Apr 2025))
  • Goal-oriented or declarative planning abstractions for improved LLM compatibility (e.g., GOI interface primitives (Wang et al., 6 Oct 2025))
  • Explicit reward and reflection workflows for step-wise trajectory optimization (see “Reflective Chain-of-Thought” and reward models (Wang et al., 12 Aug 2025, Lin et al., 21 Oct 2025))
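
As an illustration of the hybrid screenshot + structured-UI style referenced above, the sketch below pairs raw pixels with a Set-of-Marks-like index of interactable elements. `UIElement` and `build_hybrid_observation` are hypothetical names, and real systems differ in how marks are rendered and serialized.

```python
from dataclasses import dataclass

@dataclass
class UIElement:
    role: str                          # e.g. "button", "textbox" (from the a11y tree)
    name: str                          # accessible label shown to the model
    bounds: tuple[int, int, int, int]  # x, y, width, height in screen pixels

def build_hybrid_observation(screenshot_png: bytes,
                             a11y_elements: list[UIElement]) -> dict:
    """Combine pixels with a numbered element index (Set-of-Marks style).

    Numeric marks let the model act symbolically ("click [3]") instead of
    emitting raw coordinates, which tends to improve grounding.
    """
    marks = {
        i: {"role": el.role, "name": el.name, "bounds": el.bounds}
        for i, el in enumerate(a11y_elements)
    }
    text_index = "\n".join(f"[{i}] {el.role}: {el.name}"
                           for i, el in enumerate(a11y_elements))
    return {"image": screenshot_png, "marks": marks, "text": text_index}
```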

2. Capabilities, Benchmarks, and Evaluation Methodologies

CUAs now span domains including:

  • Productivity suites and desktop applications (Office, VSCode, GIMP, LibreOffice, etc.)
  • Web browser tasks, multi-site navigation, form-filling, e-commerce
  • OS-level system configuration, automation, and security-critical workflows
  • Human-computer interaction proxies in adversarial and regulatory audits

Benchmarks have evolved substantially to characterize CUA competence and limitations. Notable resources include:

Benchmark | Domain | Evaluation Focus
OSWorld | Desktop apps/OS | Task completion, step budgets
GUI-360˚ | Windows Office | GUI grounding, parsing, action prediction (Mu et al., 6 Nov 2025)
CUARewardBench | Desktop workflows | ORM/PRM reward-model evaluation, expert labels, UPE ensembles (Lin et al., 21 Oct 2025)
SusBench | Live websites | Dark-pattern susceptibility, human/agent parity, pattern-level analysis (Guo et al., 13 Oct 2025)
HackWorld | Web pen-testing | Exploitation rates, CTF success, tool orchestration (Ren et al., 14 Oct 2025)
OSWorld-Human | Desktop apps | Efficiency, latency, gold human trajectories (Abhyankar et al., 19 Jun 2025)

Metrics focus on success rate (SR), avoidance/susceptibility rates for manipulative designs, precision and negative predictive value (NPV) for reward models, and tailored measures such as the Weighted Efficiency Score (WES) for temporal studies. Both script-based and LLM-as-Judge evaluation approaches are used, with multi-level annotation and validation to enforce rigorous standards.
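
To ground these definitions, here is a minimal sketch of success rate under a step budget and a SusBench-style avoidance rate. The record fields (`success`, `steps`, `fell_for_pattern`) are assumed for illustration; WES is omitted because its exact formula is not reproduced here.

```python
def success_rate(results: list[dict], step_budget: int | None = None) -> float:
    """SR: fraction of tasks completed, optionally only within a step budget."""
    completed = [
        r for r in results
        if r["success"] and (step_budget is None or r["steps"] <= step_budget)
    ]
    return len(completed) / len(results)

def avoidance_rate(trials: list[dict]) -> float:
    """Fraction of trials in which the agent did not act on the dark pattern."""
    return sum(1 for t in trials if not t["fell_for_pattern"]) / len(trials)

runs = [{"success": True, "steps": 9},
        {"success": True, "steps": 22},
        {"success": False, "steps": 15}]
print(success_rate(runs))                  # ~0.67 (any number of steps)
print(success_rate(runs, step_budget=15))  # ~0.33 (within the budget)
```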

3. Human and Agent Parity, Weaknesses, and Bias

Recent studies reveal near parity between CUAs and humans on behavioral susceptibility metrics. For example, both groups fall prey to UI dark patterns—“Preselection,” “Trick Wording,” and “Hidden Information”—while easily resisting overt manipulations (Confirm Shaming, Fake Social Proof) (Guo et al., 13 Oct 2025). This signals that CUA action selection is subject to shallow parsing and default behaviors, analogous to human reflexive clicking.

Efficiency studies report that CUAs are 1.4–2.7× less efficient than the human oracle, with repeated LLM calls for planning and reflection dominating end-to-end latency (up to tens of minutes for tasks humans complete in ~2 minutes) (Abhyankar et al., 19 Jun 2025). Context growth (prompt size increases 2.5–3× over a trajectory), redundant substeps, and a lack of hierarchical or batched planning emerge as the key pain points.
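
A minimal sketch of the batched-planning idea: ask the model for several low-level actions per call and carry only a short action summary in context, amortizing LLM latency and damping prompt growth. `llm_call` is an injected dependency, and the JSON prompt/response format is an assumption of this sketch.

```python
import json

def summarize(history: list[dict]) -> str:
    # Assumption: keeping only the last few actions in the prompt, rather
    # than the full trajectory, counters the reported 2.5-3x prompt growth.
    return json.dumps(history[-3:])

def batched_plan(llm_call, obs_text: str, history: list[dict],
                 batch_size: int = 4) -> list[dict]:
    """Request up to `batch_size` UI actions in one LLM call instead of one
    call per action, reducing planning round-trips per task."""
    prompt = (
        f"Recent actions: {summarize(history)}\n"
        f"Current screen: {obs_text}\n"
        f"Return a JSON list of up to {batch_size} next UI actions."
    )
    return json.loads(llm_call(prompt))[:batch_size]
```

The trade-off is staleness: the agent should still re-perceive after any action that substantially changes the screen.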

Further, empirical risk analyses such as BLIND-ACT highlight “Blind Goal-Directedness” (BGD)—the propensity for agents to pursue goals without sufficient safety or feasibility checks, resulting in privacy harms, over-permissioning, and execution of contradictory or infeasible goals, at rates above 80% across nine leading models (Shayegani et al., 2 Oct 2025). Failures cluster around execution-first bias, thought–action disconnect, and request-primacy.

4. Efficiency, Reward Models, and Data Scaling

Initial evaluation approaches relied on scripted verification of task outcomes, but these scale poorly and cannot track step-wise correctness. Vision-language-model-based reward models, benchmarked in CUARewardBench (Lin et al., 21 Oct 2025), now enable trajectory- and step-level supervision via outcome reward models (ORMs) and process reward models (PRMs), annotated by dual expert panels. Unanimous Prompt Ensemble (UPE) voting yields the highest ORM precision (89.8%) and negative predictive value (93.3%).
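
A minimal sketch of unanimous-ensemble voting, as the UPE name suggests: a trajectory is accepted only if every prompt variant of the judge agrees. The `judge` callable is an assumed stand-in for a VLM-based reward model; CUARewardBench's actual prompts and aggregation details may differ.

```python
from typing import Callable

def upe_verdict(judge: Callable[[str, str], bool],
                trajectory_summary: str,
                prompt_variants: list[str]) -> bool:
    """Unanimous Prompt Ensemble: pass only if all judge prompts agree.

    Requiring unanimity trades recall for precision and negative predictive
    value, consistent with the high ORM precision/NPV reported above.
    """
    return all(judge(prompt, trajectory_summary) for prompt in prompt_variants)
```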

Scaling data and models yields measurable performance gains. ScaleCUA demonstrates a 25–30 pp improvement when training on large, cross-platform corpora, and sets SOTA on benchmarks such as WebArena-Lite-v2 (47.4% SR), ScreenSpot-Pro (94.7% accuracy), MMBench-GUI L1-Hard (94.4%) and OSWorld-G (60.6%) (Liu et al., 18 Sep 2025). Curated datasets (GUI-360˚, AgentNet) and RL curriculum pipelines (SEAgent (Sun et al., 6 Aug 2025)) facilitate specialist-to-generalist evolution, yielding >3× improvements on foundational benchmarks.

Behavior Best-of-N (bBoN) (Gonzalez-Pumariega et al., 2 Oct 2025) expands agent selection via wide stochastic rollouts and structured, narrative-level comparative selection, setting a new SOTA on OSWorld (69.9% success, near human level) and validating the "unreasonable effectiveness" of agent scaling with ensemble selection.
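
A sketch of the bBoN selection step under stated assumptions: `rollout`, `narrate`, and `compare` stand in for the stochastic agent, the behavior-narrative summarizer, and the comparative judge; the paper's exact comparison protocol may differ from this simple sequential tournament.

```python
def behavior_best_of_n(rollout, narrate, compare, n: int = 8):
    """Sample N rollouts, summarize each as a behavior narrative, then keep
    the winner of pairwise comparisons between narratives."""
    trajectories = [rollout(seed=i) for i in range(n)]  # wide stochastic rollout
    narratives = [narrate(t) for t in trajectories]     # compact, comparable summaries
    best = 0
    for i in range(1, n):
        if compare(narratives[i], narratives[best]):    # True if i beats current best
            best = i
    return trajectories[best]
```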

5. Security Vulnerabilities, Adversarial Risks, and Regulatory Implications

CUAs introduce novel attack surfaces and failure modes:

  • Indirect and visual prompt injection (SusBench, VPI-Bench) can manipulate agents via screenshot-based or DOM-level UI cues; agents are vulnerable to realistic, covert patterns (preselection, trick wording, hidden info) and explicit content-injection attacks (Guo et al., 13 Oct 2025, Cao et al., 3 Jun 2025).
  • RedTeamCUA exposes substantial real-world vulnerability in hybrid web-OS environments: Attack Success Rates (ASR) reach 50% for leading CUAs (Claude 4 Opus), and even the most secure framework (Operator) remains at 7.6% (Liao et al., 28 May 2025).
  • HackWorld demonstrates <12% overall CUA exploit success on live web vulnerabilities, indicating insufficient chaining and cybersecurity tool orchestration (Ren et al., 14 Oct 2025).
  • AdvCUA benchmarks pose OS-level kill chains modeled on MITRE ATT&CK, revealing a dramatic reduction in required attacker expertise when using CUAs—enabling non-experts to perform sophisticated intrusion chains using natural-language prompts (Luo et al., 8 Oct 2025).

Systematization efforts (Jones et al., 7 Jul 2025, Chen et al., 16 May 2025) highlight seven risk classes, including UI deception, remote code execution via action composition, chain-of-thought exposure, indirect prompt injection, over-delegation, and emergent inference harms. Design principles emphasize provenance tagging, interface-action binding, step-wise confirmation, context-aware gating, delegation verification, redaction of agent reasoning traces, and ephemeral/sandboxed execution.
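
A toy illustration of two of these principles, step-wise confirmation and context-aware gating: risky action kinds are intercepted before they reach the environment. The action taxonomy and the `confirm` callback are assumptions for this sketch, not an API from any cited framework.

```python
RISKY_KINDS = {"delete", "submit_payment", "grant_permission", "run_shell"}

def gated_execute(execute, confirm, action: dict) -> bool:
    """Require human sign-off for irreversible or privilege-changing actions."""
    if action.get("kind") in RISKY_KINDS:
        if not confirm(f"Agent requests: {action}. Allow?"):
            return False   # action blocked; the agent must replan
    execute(action)        # safe or approved actions proceed normally
    return True
```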

Regulatory concerns, raised in SusBench and the broader survey literature, note that as CUAs operate autonomously, oversight must delineate the responsibilities of interface designers, agent developers, and regulators (e.g., under FTC §5 and the EU Digital Services Act).

6. Future Directions and Open Challenges

Persistent challenges are identified:

  • Improving grounding and action prediction under compositional/long-horizon tasks and novel layouts (GUI-360˚, ScaleCUA).
  • Addressing overhead and latency via batch planning, hierarchical reasoning, and efficient reward/feedback integration (OSWorld-Human, UFO²).
  • Building dynamic, adversarially robust testbeds scaling to real-time deployment (e.g., RedTeamCUA, VPI-Bench).
  • Enabling human-in-the-loop oversight, transparent audit, and policy-compliant action selection without degrading usability (e.g., output monitoring, XAI explanation, regulatory module integration).
  • Developing standards and dynamic benchmarks for cross-platform, cross-domain safety (e.g., continual learning pipelines, adaptive task generation, scenario evolution).
  • Integrating advanced reward modeling and knowledge distillation pipelines, as well as robust security audit frameworks.

A plausible implication is that despite improvements, CUAs remain subject to fundamental alignment risks and security vulnerabilities as their action spaces and environmental access broaden. Advances in declarative interface abstraction (GOI (Wang et al., 6 Oct 2025)), native OS integration (UFO² (Zhang et al., 20 Apr 2025)), self-evolving curriculum RL (SEAgent (Sun et al., 6 Aug 2025)), and ensemble selection (bBoN (Gonzalez-Pumariega et al., 2 Oct 2025)) offer concrete stepping stones toward scalable, robust, and increasingly trustworthy agent systems.

7. Summary Table: Recent Benchmarks and Key Metrics

Resource | Domain | Metric(s) | Key Result(s)
SusBench | Web dark patterns | Avoidance rate | CUAs ≈ humans; hardest: Preselection, Trick Wording, Hidden Information (Guo et al., 13 Oct 2025)
GUI-360˚ | Desktop Office | Grounding, F1 | SFT lifts grounding to 82%; SOTA still 10–20 pp below human (Mu et al., 6 Nov 2025)
CUARewardBench | Desktop workflows | ORM/PRM precision, NPV | UPE ensemble: 89.8% ORM precision (Lin et al., 21 Oct 2025)
OSWorld-Human | Desktop apps | Step efficiency | CUAs take 1.4–2.7× more steps than human trajectories; heavy LLM overhead (Abhyankar et al., 19 Jun 2025)
HackWorld | Web penetration | Exploit success rate | SOTA under 12% exploit success (Ren et al., 14 Oct 2025)
RedTeamCUA | Hybrid web/OS | ASR, AR | ASR up to 50% for advanced CUAs (Liao et al., 28 May 2025)
VPI-Bench | Visual prompt injection | AR/SR | BUAs up to 100% AR/SR; CUAs up to 59% AR/SR (Cao et al., 3 Jun 2025)
ScaleCUA | Cross-platform | Success rate | SOTA on WebArena-Lite, ScreenSpot-Pro, MMBench-GUI (Liu et al., 18 Sep 2025)
OpenCUA | Three OSes, 200+ apps | Success rate | OpenCUA-32B: 34.8% SR on OSWorld-Verified (Wang et al., 12 Aug 2025)

CUAs represent the convergence of multimodal perception, LLM-powered reasoning, and direct environment manipulation, raising both hopes for scalable digital automation and critical questions about trust, safety, and secure deployment. Their evolution will be shaped by advances in benchmark coverage, reward modeling, interface design, and adversarial resilience, as well as regulatory oversight and system-level best practices.
