
EvoCUA: Evolving Computer-Use Agents

Updated 23 January 2026
  • EvoCUA is a native computer-use agent model that integrates verifiable synthetic experience generation with dynamic evolutionary policy updates to overcome static data limitations.
  • It employs a closed-loop system featuring a dual-stream synthesis engine, parallel sandbox rollouts, and on-policy policy optimization for scalable, adaptive learning.
  • Empirical evaluations on the OSWorld benchmark demonstrate state-of-the-art performance, with significant gains in efficiency and success rates over traditional models.

EvoCUA is a native computer-use agentic model that integrates verifiable synthetic experience generation and iterative policy evolution to advance the capabilities of multimodal AI agents operating in open-ended desktop environments. Unlike prior paradigms that rely on passive imitation of static human demonstration datasets, EvoCUA employs a self-sustaining evolutionary cycle that couples large-scale synthetic data generation, massive asynchronous sandbox rollouts, and dynamic policy refinement. This methodology enables EvoCUA to address intrinsic limitations of static data scaling—most notably, the lack of causal feedback and the rapid onset of diminishing returns in complex, long-horizon computer tasks. Empirical results on the OSWorld benchmark establish EvoCUA as the state-of-the-art among open-source computer use agents, with demonstrated effectiveness and generalizability across foundation model scales (Xue et al., 22 Jan 2026).

1. Conceptual Foundations and Motivation

Native computer-use agents (CUAs) interpret screen pixels, generate low-level input sequences (mouse and keyboard), and complete complex workflows in desktop GUIs. Formally, the computer-use task is modeled as a partially observable Markov decision process (POMDP):

$$(\mathcal{S}, \mathcal{A}, \mathcal{O}, \mathcal{P}, \mathcal{R}), \quad \mathcal{R}(s_T; g) = \mathbb{I}[V_g(s_T) = \mathrm{True}]$$

where $g$ is a natural-language instruction and $V_g$ is an executable validator mapping the terminal state $s_T$ to success or failure.
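The binary reward can be illustrated with a minimal sketch. The state representation (a dictionary with a `files` list), the example task, and the rule that a crashing validator counts as failure are all assumptions made for illustration; the source does not specify them.

```python
from typing import Any, Callable, Dict

State = Dict[str, Any]               # hypothetical stand-in for a terminal desktop state s_T
Validator = Callable[[State], bool]  # executable validator V_g


def reward(terminal_state: State, validator: Validator) -> float:
    """Binary reward R(s_T; g) = I[V_g(s_T) = True]."""
    try:
        return 1.0 if validator(terminal_state) is True else 0.0
    except Exception:
        # Assumption: a validator that raises is treated as a failed task.
        return 0.0


# Hypothetical instruction g: "Save the report as report.pdf on the Desktop".
def v_g(state: State) -> bool:
    return "Desktop/report.pdf" in state.get("files", [])


solved = reward({"files": ["Desktop/report.pdf"]}, v_g)
unsolved = reward({"files": []}, v_g)
```

Because the reward depends only on the terminal state, any action sequence that leaves the environment in a validated state is credited equally.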

Traditional CUA training via static imitation learning clones human demonstrations on fixed datasets. Two major drawbacks are documented:

  • Absence of causal feedback: The agent never observes the GUI state that results from its own actions, so it cannot adapt based on its errors.
  • Diminishing gains on scale: Expanding passive datasets fails to improve agent competence for novel or long-horizon workflows.

EvoCUA introduces an evolving agentic paradigm, continuously self-generating new verifiable tasks, collecting rich experience through on-policy rollouts, and focusing learning at the agent’s emerging capability boundary (Xue et al., 22 Jan 2026).

2. System Architecture

The EvoCUA architecture consists of three primary modules operating in a closed evolutionary loop:

| Module | Core Function | Key Outputs |
|---|---|---|
| Verifiable Synthesis Engine | Samples diverse task specifications $g$ and validators $V_g$ | Task instruction, executable checker |
| Asynchronous Sandbox Rollouts | Conducts $N$ parallel GUI simulations under $\pi_\theta$ | Trajectory pool $\{\tau_i\}$ |
| Policy Optimizer | Updates model weights from success/failure traces | Refined policy $\theta$ |

  • Verifiable Synthesis Engine ($\mathcal{T}_{syn}$): Samples instructive and executable task-validators through a VLM-based dual-generation process and rigorous filtering.
  • Massively Parallel Sandboxing: Up to $10^5$ QEMU-KVM VMs execute agent policies in parallel, producing high-throughput trajectory data.
  • Dynamic On-Policy Optimizer: Consumes successful/failed rollouts, applying supervised and preference-based loss terms to refine policy parameters.

This system constructs a feedback cycle: each policy iteration informs the task generator, which yields harder or novel exemplars guiding further policy adaptation.

3. Synthetic Experience Generation

The synthesis pipeline comprises the following stages:

  • Structured Task Space Construction: Automatically combines hierarchical taxonomies of desktop apps (Excel, Word, Browser) with atomic UI capabilities (e.g., formula editing, table manipulation). Both parametric (e.g., synthetic tables) and non-parametric (Internet images, slides) content contribute to task realism.
  • Agentic Dual-Stream Synthesis: A large vision-language model (VLM) operating in a ReAct loop generates paired instructions $g$ and validators $V_g$, where $V_g$ is executable (Python or shell). Validator syntax errors are fixed through a closed feedback loop between synthesis and execution.
  • Quality Assurance Pipeline: Multi-stage consistency checks remove unsolvable samples, duplicate configurations, and overlap with OSWorld benchmarks. Validators are tested for executability and novelty.

The canonical sampling procedure:

$$\text{TaskSample}(k):\; z_k \sim \text{ScenarioSampler},\quad (g_k, V_{g_k}) \leftarrow \text{Gen}(z_k),\quad \text{Execute}(V_{g_k});\quad \text{if pass: } \mathcal{D} \leftarrow \mathcal{D} \cup \{(g_k, V_{g_k})\}$$

This approach ensures high task quality and diverse, open-ended supervision.
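The sampling procedure can be sketched as a sample, generate, execute, and filter loop. Everything concrete here is hypothetical: `scenario_sampler` and `gen` are stand-ins for the taxonomy sampler and the VLM, the `validate(state)` entry point is an invented convention, and the pass criterion (the validator must reject an empty state and accept a solved one) is only one plausible executability check. A real pipeline would run generated validators inside a sandbox rather than with `exec`.

```python
import random
from typing import Callable, Dict, List, Optional, Tuple

Task = Tuple[str, str]  # (instruction g, validator source V_g)


def scenario_sampler(rng: random.Random) -> Dict[str, str]:
    """Hypothetical sampler over (application, capability) pairs."""
    apps = ["spreadsheet", "document", "browser"]
    skills = ["formula editing", "table manipulation"]
    return {"app": rng.choice(apps), "skill": rng.choice(skills)}


def gen(z: Dict[str, str]) -> Task:
    """Stub for the VLM that writes an instruction and its validator."""
    instruction = f"In the {z['app']}, perform {z['skill']}."
    validator_src = "def validate(state):\n    return state.get('done') is True\n"
    return instruction, validator_src


def compile_validator(src: str) -> Optional[Callable]:
    """Return the validate() function, or None if the source is broken."""
    namespace: dict = {}
    try:
        exec(compile(src, "<validator>", "exec"), namespace)
    except Exception:
        return None
    fn = namespace.get("validate")
    return fn if callable(fn) else None


def execute_check(src: str) -> bool:
    """Pass only if the validator runs, rejects {} and accepts a solved state."""
    fn = compile_validator(src)
    if fn is None:
        return False
    try:
        return fn({}) is False and fn({"done": True}) is True
    except Exception:
        return False


def build_dataset(n: int, seed: int = 0) -> List[Task]:
    rng = random.Random(seed)
    dataset: List[Task] = []
    seen = set()
    for _ in range(n):
        g, v = gen(scenario_sampler(rng))
        if g in seen:          # drop duplicate configurations
            continue
        if execute_check(v):   # keep only tasks whose validator passes
            dataset.append((g, v))
            seen.add(g)
    return dataset


dataset = build_dataset(50)
```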

4. Asynchronous Rollout Infrastructure

Scalable experience accrual is facilitated via:

  • Distributed Scheduling: Up to $10^5$ sandboxes are managed via microservice gateways, each simulating an independent GUI environment.
  • Streamed Trajectories: Each VM captures tuples $(o_t, z_t, a_t, o_{t+1})$ at every step; full rollouts are aggregated into an experience pool $\mathcal{B}$.
  • Dataset Scale: The infrastructure supports collection of millions of trajectory steps per day.

This configuration supports real-time learning from both successful and failed agent interactions, substantially enriching the training corpus and enabling direct observation of causal consequences for action sequences.
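A toy version of this collection pattern can be written with `asyncio`. The environment and policy below are stubs, the semaphore stands in for the scheduling gateway, and the names `Step`, `Trajectory`, `rollout`, and `collect` are illustrative assumptions rather than parts of the described system.

```python
import asyncio
import random
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    obs: str       # o_t
    thought: str   # z_t
    action: str    # a_t
    next_obs: str  # o_{t+1}


@dataclass
class Trajectory:
    task_id: int
    steps: List[Step]
    success: bool


async def rollout(task_id: int, max_steps: int, sem: asyncio.Semaphore) -> Trajectory:
    """Run one stubbed sandbox episode and record a tuple at every step."""
    rng = random.Random(task_id)
    async with sem:  # cap how many sandboxes run at the same time
        steps: List[Step] = []
        obs = "screen_0"
        for t in range(max_steps):
            await asyncio.sleep(0)  # yield control; stands in for VM and model latency
            next_obs = f"screen_{t + 1}"
            steps.append(Step(obs, f"thought_{t}", f"click_{t}", next_obs))
            obs = next_obs
        # Stubbed outcome; a real system would call the task validator here.
        return Trajectory(task_id, steps, success=rng.random() < 0.5)


async def collect(num_tasks: int, concurrency: int = 8, max_steps: int = 5) -> List[Trajectory]:
    sem = asyncio.Semaphore(concurrency)
    jobs = [rollout(i, max_steps, sem) for i in range(num_tasks)]
    return list(await asyncio.gather(*jobs))


pool = asyncio.run(collect(32))
```

Both successful and failed trajectories stay in the pool, since the later training phases consume each kind differently.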

5. Iterative Experience-Driven Policy Evolution

EvoCUA’s iterative learning strategy alternates between consolidation of success and exploitation of failure. The two main phases are:

  • Rejection Sampling Fine-Tuning (RFT): Samples a compute-targeted subset of successful trajectories, applies step-masking to filter out redundant actions, and performs supervised updates:

$$\mathcal{L}_{\mathrm{RFT}}(\theta) = -\frac{1}{M} \sum_{j=1}^{M} \log \pi_\theta(z_j, a_j \mid h_j, o_j)$$

A capability-boundary detector shifts rollout allocation toward tasks at the agent’s current limits.
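The RFT objective reduces to a mean negative log-likelihood over the retained steps. The sketch below assumes per-step probabilities are already available and that step-masking is a boolean keep/drop flag. The `boundary_weights` heuristic, which weights each task by p(1 - p) for its estimated success rate p, is purely an illustrative guess at how allocation might favor tasks near the capability boundary; the source does not describe the detector's internals.

```python
import math
from typing import List, Sequence


def rft_loss(step_probs: Sequence[float], keep_mask: Sequence[bool]) -> float:
    """Mean negative log-likelihood over the M steps that survive masking."""
    kept = [p for p, keep in zip(step_probs, keep_mask) if keep]
    if not kept:
        raise ValueError("no steps retained after masking")
    return -sum(math.log(p) for p in kept) / len(kept)


def boundary_weights(success_rates: Sequence[float]) -> List[float]:
    """Hypothetical allocation: favor tasks that are neither solved nor hopeless."""
    raw = [p * (1.0 - p) for p in success_rates]
    total = sum(raw)
    if total == 0.0:
        # Every task is always solved or never solved: fall back to uniform.
        return [1.0 / len(raw)] * len(raw)
    return [r / total for r in raw]


# Three steps from a successful trajectory; the middle one is masked as redundant.
loss = rft_loss([0.5, 0.25, 0.5], [True, False, True])  # equals ln 2
weights = boundary_weights([0.0, 0.5, 1.0])
```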

  • Step-Level Direct Preference Optimization (DPO): For failure trajectories, identifies the first divergence point $t^*$ compared to a successful trace, and forms action-correction and reflection preference pairs. The joint DPO loss is:

$$\mathcal{J}(\theta) = -\mathbb{E}\left[\log \sigma\left(\beta \log \frac{\pi_\theta(z_w, a_w)}{\pi_{\mathrm{ref}}(z_w, a_w)} - \beta \log \frac{\pi_\theta(z_l, a_l)}{\pi_{\mathrm{ref}}(z_l, a_l)}\right)\right]$$
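The two ingredients of this phase, locating the divergence step and scoring one preference pair, can be sketched as follows. Comparing actions by string equality is an assumption made for illustration (the source does not say how divergence is detected), and `beta=0.1` is an arbitrary placeholder value.

```python
import math
from typing import Sequence


def first_divergence(failed: Sequence[str], succeeded: Sequence[str]) -> int:
    """Index t* of the first step where the two action sequences differ."""
    for t, (a_fail, a_ok) in enumerate(zip(failed, succeeded)):
        if a_fail != a_ok:
            return t
    # One trace is a prefix of the other: they diverge where the shorter ends.
    return min(len(failed), len(succeeded))


def dpo_pair_loss(logp_w: float, ref_logp_w: float,
                  logp_l: float, ref_logp_l: float,
                  beta: float = 0.1) -> float:
    """-log sigmoid(beta * [(logp_w - ref_logp_w) - (logp_l - ref_logp_l)])."""
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) = log(1 + exp(-m)); branch to avoid overflow in exp.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


t_star = first_divergence(["open", "click_a", "type"], ["open", "click_b", "type"])
neutral = dpo_pair_loss(0.0, 0.0, 0.0, 0.0)  # policy equals reference: loss is ln 2
```

The loss falls below ln 2 once the policy raises the preferred step's likelihood relative to the reference by more than it raises the rejected step's.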

The net policy update aggregates RFT, DPO, and KL regularization terms:

$$\mathcal{L}(\theta) = \mathcal{L}_{\mathrm{RFT}}(\theta) + \lambda\,\mathcal{J}(\theta) + \eta\,\mathrm{KL}(\pi_\theta \,\|\, \pi_{\mathrm{old}})$$

This evolving learning loop dynamically reinforces successful routines and transforms boundary errors into structured supervision, thereby promoting robust generalization.
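As a small numerical illustration of the combined objective, the sketch below computes a KL term between two categorical action distributions and adds the three components. The distributions and the weights `lam` and `eta` are placeholder values; the source does not report the coefficients used.

```python
import math
from typing import Sequence


def kl_divergence(p_new: Sequence[float], p_old: Sequence[float]) -> float:
    """KL(p_new || p_old) for categorical distributions over the same actions."""
    return sum(p * math.log(p / q) for p, q in zip(p_new, p_old) if p > 0.0)


def total_loss(l_rft: float, j_dpo: float, kl: float,
               lam: float = 1.0, eta: float = 0.1) -> float:
    """L = L_RFT + lam * J + eta * KL, with hypothetical weights lam and eta."""
    return l_rft + lam * j_dpo + eta * kl


kl = kl_divergence([0.7, 0.3], [0.5, 0.5])
combined = total_loss(l_rft=1.0, j_dpo=2.0, kl=0.5, lam=0.5, eta=0.2)  # 1.0 + 1.0 + 0.1
```

A larger `eta` keeps the updated policy closer to the previous iteration, trading learning speed for stability.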

6. Empirical Results and Comparative Performance

EvoCUA’s empirical evaluation on the OSWorld-Verified benchmark demonstrates:

  • State-of-the-Art Performance: EvoCUA-32B achieves a 56.7% success rate (50-step cap), outperforming prior open-source models (OpenCUA-72B at 45.0%) and even surpassing leading closed-source systems (UI-TARS-2 at 53.1%).
  • Scalability: Smaller variants benefit from the approach. EvoCUA-8B attains 46.1%, outperforming Step-GUI-8B (40.2%) and OpenCUA-72B (45.0%).
  • Consistent Gains Across Metrics: EvoCUA maintains +3–5 pp improvement over baselines across pass-at-k statistics, e.g., +4.93 pp at k = 16.
  • Efficiency with Inference Scaling: Increasing inference steps from 15 to 50 yields a +16.25 pp absolute gain for the 32B model.
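For readers unfamiliar with pass-at-k, the sketch below implements the widely used unbiased estimator 1 - C(n-c, k) / C(n, k), where n attempts are sampled per task and c of them succeed. Whether the EvoCUA evaluation uses this exact estimator is an assumption; the source reports only the resulting numbers.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Probability that at least one of k attempts drawn from n succeeds,
    given that c of the n attempts succeeded."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k failures exist, giving 1.0.
    return 1.0 - comb(n - c, k) / comb(n, k)


single_try = pass_at_k(n=4, c=1, k=1)  # one success in four attempts, one draw
all_tries = pass_at_k(n=4, c=1, k=4)   # drawing all four must include the success
```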

Ablation studies indicate that each evolutionary stage—Unified Action Space, Cold Start, RFT, DPO, Iterative Training—contributes monotonic improvements (e.g., RFT +3.13%, DPO +3.02% on Qwen3-VL-32B), with multi-round RFT giving +8.12% on OpenCUA-72B (Xue et al., 22 Jan 2026).

7. Generalizability, Key Insights, and Extensions

EvoCUA establishes that evolutionary, synthetic-experience-driven training is generally applicable and consistently beneficial across architectures and scales. Insights include:

  • Dual-Processing of Trajectories: Filtering low-noise successful traces and extracting high-signal error steps is critical.
  • Cold Start Patterns: Lightweight, pattern-centric initialization procedures promote stability before reinforcement learning.
  • On-Policy Data Integrity: Strict on-policy data curation prevents performance drift in extended task horizons.
  • Diagnostic Tooling: Visualization tools aligning $o_t$, $z_t$, and $a_t$ are essential for analyzing hallucinations and failure modes.

Future extensions noted by the authors include online agentic RL regimes, in particular Step-Level Policy Optimization (STEPO), which would allow direct training with verifiable rewards in the policy loop and bring agents closer to human-level reliability.

In summary, EvoCUA's approach, an evolving cycle of verifiable synthetic task generation, large-scale sandbox rollouts, and iterative experience-driven learning, constitutes a robust and scalable foundation for the advancement of native computer-use agents (Xue et al., 22 Jan 2026).
