ComputerRL: Unified RL for Desktop Automation

Updated 21 August 2025
  • The paper introduces a unified API-GUI paradigm that merges precise API calls with flexible GUI actions to improve agent-environment interaction.
  • It leverages a large-scale distributed RL infrastructure with containerized desktops and multi-node coordination to scale training efficiently.
  • The framework employs the Entropulse strategy to mitigate entropy collapse, achieving superior performance on the OSWorld benchmark with fewer execution steps.

ComputerRL is a reinforcement learning (RL) framework developed to enable large-scale, robust, and generalizable training of autonomous agents for complex digital desktop tasks. By introducing the API-GUI paradigm, ComputerRL unifies both programmatic API invocation and flexible GUI manipulation, addressing the inherent mismatch between machine agents and human-centric desktop environments. The framework is designed for scalable, end-to-end online RL training over distributed virtual desktops, and incorporates a novel entropy-preserving optimization regime ("Entropulse") to support stable and continued policy improvement. State-of-the-art results are demonstrated on the OSWorld benchmark, underscoring the framework's significance for modern desktop automation and general agent research (Lai et al., 19 Aug 2025).

1. API-GUI Unified Action Paradigm

ComputerRL's architectural core is the API-GUI paradigm, which integrates two operational modalities for agent actions:

  • API Mode: Agents can utilize programmatically exposed APIs, automatically generated with LLM assistance. During API construction, requirements are derived from exemplar tasks, library-specific APIs are implemented (with error handling and verification), and comprehensive tests are generated to ensure coverage. This enables high-efficiency, low-variance operations whenever possible.
  • GUI Mode: For scenarios where APIs are unavailable, agents revert to GUI-based manipulation (e.g., mouse, keyboard, drag/drop), preserving generality and coverage over legacy, third-party, or poorly-documented software.

This dual action space addresses the agent-environment interface mismatch by providing both precise, semantically robust API actions and flexible, human-like GUI actions. The paradigm is illustrated in the framework overview figure of the source publication, which shows unified agent control across the virtual desktop.
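
As a concrete illustration, the following Python sketch shows how such a dual action space can be represented and dispatched. The names (ApiCall, GuiAction, DesktopAgentEnv) are illustrative assumptions, not identifiers from the ComputerRL codebase; the point is only that API invocations and GUI primitives can share a single step interface.

```python
# Minimal sketch of a unified API-GUI action space (hypothetical names,
# not the ComputerRL implementation).
from dataclasses import dataclass, field
from typing import Any, Dict, Union


@dataclass
class ApiCall:
    """High-level, semantically precise action exposed by a generated API."""
    name: str                                  # e.g. "calc.set_cell" (illustrative)
    kwargs: Dict[str, Any] = field(default_factory=dict)


@dataclass
class GuiAction:
    """Low-level, human-like fallback action on the desktop GUI."""
    kind: str                                  # "click", "type", "drag", "hotkey", ...
    payload: Dict[str, Any] = field(default_factory=dict)


Action = Union[ApiCall, GuiAction]


class DesktopAgentEnv:
    """Dispatches both action modalities against one virtual desktop."""

    def __init__(self, api_registry: Dict[str, Any], gui_backend: Any):
        self.api_registry = api_registry       # LLM-generated, tested API functions
        self.gui_backend = gui_backend         # mouse/keyboard controller

    def step(self, action: Action) -> Dict[str, Any]:
        if isinstance(action, ApiCall):
            fn = self.api_registry[action.name]
            result = fn(**action.kwargs)       # precise, low-variance path
        else:
            result = self.gui_backend.execute(action.kind, **action.payload)
        return {"observation": self._observe(), "result": result}

    def _observe(self) -> Dict[str, Any]:
        # Screenshot and accessibility-tree capture would go here.
        return {}
```

Under this arrangement, an agent policy emits ApiCall actions wherever a generated API covers the task and falls back to GuiAction primitives for uncovered or legacy software, matching the coverage argument above.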

2. Large-Scale Distributed RL Infrastructure

To scale end-to-end RL training to the necessary magnitude for generalization and robustness, ComputerRL introduces a distributed training infrastructure capable of orchestrating thousands of parallel virtual desktops:

  • Containerized Ubuntu Desktops: The environments are hosted using lightweight qemu-in-docker images, facilitating rapid deployment and minimal per-instance resource usage.
  • Multi-node Coordination: Clusters of CPUs/GPUs are coordinated via the gRPC protocol; a centralized controller manages job assignment and environment orchestration, dynamically balancing the load.
  • AgentBench API Interface: The RL infrastructure standardizes environment interfaces using the AgentBench API, providing compatibility with varying agent architectures and RL algorithms.

The system supports training at cloud scale, enabling sustained experimentation across a diverse array of desktop environments and operating system states, as depicted in the architectural diagrams of the publication.
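
To make the orchestration pattern concrete, the sketch below shows a simplified controller that fans rollout tasks out over a pool of remote desktop environments with asyncio. All class and method names are assumptions for illustration; the actual system coordinates workers through gRPC services and the AgentBench API rather than the stubs shown here.

```python
# Simplified controller/worker pattern for parallel desktop rollouts.
# RemoteDesktopEnv stands in for a client to one qemu-in-docker instance;
# the method bodies are stubs, not the real protocol.
import asyncio
from typing import Any, List


class RemoteDesktopEnv:
    def __init__(self, address: str):
        self.address = address  # e.g. "10.0.0.7:50051" (hypothetical endpoint)

    async def reset(self, task_id: str) -> dict:
        # Real implementation: RPC that restores a VM snapshot and loads the task.
        return {"task_id": task_id, "done": False}

    async def step(self, action: Any) -> dict:
        # Real implementation: RPC that executes an API or GUI action remotely.
        return {"done": False}


async def collect_rollout(env: RemoteDesktopEnv, policy: Any, task_id: str,
                          max_steps: int = 50) -> list:
    """Run one episode on a remote desktop and return its trajectory."""
    trajectory, obs = [], await env.reset(task_id)
    for _ in range(max_steps):
        action = policy.act(obs)
        obs = await env.step(action)
        trajectory.append((action, obs))
        if obs["done"]:
            break
    return trajectory


async def controller(envs: List[RemoteDesktopEnv], policy: Any,
                     tasks: List[str]) -> list:
    """Centralized controller: distribute tasks across all parallel desktops."""
    queue: asyncio.Queue = asyncio.Queue()
    for task_id in tasks:
        queue.put_nowait(task_id)

    async def worker(env: RemoteDesktopEnv) -> list:
        rollouts = []
        while not queue.empty():
            rollouts.append(await collect_rollout(env, policy, await queue.get()))
        return rollouts

    return await asyncio.gather(*(worker(env) for env in envs))
```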

3. Entropulse Alternating Training Strategy

A critical problem in long-horizon RL over complex environments is entropy collapse—i.e., diminishing exploration and premature convergence. ComputerRL mitigates this via the Entropulse strategy, a hybrid training loop alternating between reinforcement learning (RL) and supervised fine-tuning (SFT) phases:

  • RL Phase: Agents collect successful trajectories via RL policy optimization, aggregating them as high-quality rollouts.
  • SFT Phase: Rollouts from the RL phase are incorporated into a supervised training regimen, temporarily restoring entropy and exploration diversity by re-exposing the agent to its own successful behaviors.
  • Alternating Cycle: Multiple cycles of RL and SFT are executed, ensuring entropy recovery and continued improvement.
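
A minimal control-flow sketch of this alternation is given below. The phase routines are passed in as callables because the paper's actual training stack is not reproduced here; only the RL-then-SFT cycling and the reuse of successful rollouts are illustrated.

```python
# Sketch of the Entropulse alternation; run_rl_phase and run_sft_phase are
# stand-ins for the real optimization routines (e.g. StepGRPO and standard SFT).
from typing import Callable, List, Tuple


def entropulse_training(policy,
                        run_rl_phase: Callable[[object], Tuple[object, List[dict]]],
                        run_sft_phase: Callable[[object, List[dict]], object],
                        num_cycles: int = 3):
    replay_buffer: List[dict] = []          # successful rollouts kept across cycles
    for _ in range(num_cycles):
        # RL phase: online policy optimization; returns the updated policy
        # and the trajectories it collected.
        policy, rollouts = run_rl_phase(policy)
        replay_buffer.extend(traj for traj in rollouts if traj.get("success"))

        # SFT phase: supervised fine-tuning on the agent's own successful
        # trajectories, restoring entropy before the next RL phase.
        policy = run_sft_phase(policy, replay_buffer)
    return policy
```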

Optimization employs a step-level Group Relative Policy Optimization (StepGRPO) algorithm. The loss is defined by:

$$
J_{\text{StepGRPO}}(\theta) = \mathbb{E}_t\left\{ \frac{1}{\sum_{i=1}^{G} L_i} \sum_{i=1}^{G} \sum_{j=1}^{L_i} \left[ \min\!\left( r_{ij}\,\frac{\pi_\theta(o_{ij} \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_{ij} \mid q)},\ \operatorname{clip}\!\left( \frac{\pi_\theta(o_{ij} \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_{ij} \mid q)},\, 1-\epsilon,\, 1+\epsilon \right) r_{ij} \right) - \beta\, D_{\mathrm{KL}}\!\left( \pi_\theta \,\|\, \pi_{\mathrm{ref}} \right) \right] \right\}
$$

The step-wise advantage is normalized as

$$
A_{i,j} = \frac{r_{i,j} - \operatorname{mean}(\mathcal{R})}{\operatorname{std}(\mathcal{R})}, \qquad \mathcal{R} = \{\, r_{u,v} \mid \forall\,(u,v) \,\}
$$

This training regime demonstrably prevents entropy collapse, facilitating robust policy refinement through extended training horizons.
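
For readers who prefer code, the sketch below combines the two formulas in a PyTorch-style loss: group-normalized step rewards stand in for r_{ij}, and a simple per-step log-ratio serves as the KL penalty. The tensor shapes, masking scheme, and KL estimator are simplifying assumptions for illustration, not the paper's exact implementation.

```python
import torch


def stepgrpo_loss(logp_new, logp_old, logp_ref, rewards, step_mask,
                  eps: float = 0.2, beta: float = 0.01):
    """Sketch of the StepGRPO objective (negated for gradient descent).

    logp_new/old/ref: (G, L_max) per-step log-probs under the current, rollout,
                      and reference policies; step_mask is a 0/1 float mask of
                      valid steps; rewards holds per-step rewards r_{ij} for the
                      group of G rollouts.
    """
    # Group-relative normalization of step rewards over all valid steps.
    valid = step_mask.bool()
    r = rewards[valid]
    adv = torch.zeros_like(rewards)
    adv[valid] = (r - r.mean()) / (r.std() + 1e-8)

    # Clipped importance-weighted surrogate, as in the objective above.
    ratio = torch.exp(logp_new - logp_old)
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * adv
    surrogate = torch.minimum(unclipped, clipped)

    # Crude per-step estimate of the KL penalty toward the reference policy.
    kl = logp_new - logp_ref

    objective = ((surrogate - beta * kl) * step_mask).sum() / step_mask.sum()
    return -objective  # minimize the negative of the objective
```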

4. Evaluation Protocols and Empirical Results

ComputerRL is validated on the OSWorld benchmark, which evaluates agent proficiency across a suite of desktop tasks. In experiments, the framework is paired with open models such as GLM-4-9B-0414 (yielding AutoGLM-OS-9B) and Qwen2.5-14B. Key results include:

  • AutoGLM-OS-9B achieves a success rate of 48.1% ± 1.0 on OSWorld.
  • Agents trained with ComputerRL require at most one-third of the execution steps needed by strong baselines such as OpenAI CUA and UI-TARS.
  • Multi-stage training (combining behavior cloning, RL, and Entropulse cycles) consistently yields gains in both reward curves and recovered policy entropy.

Performance is reported relative to strong baselines, with improvements documented in Table 1 and illustrated in reward/entropy trend figures in the publication.

5. Applications and Implications

Adoption of ComputerRL has multiple practical implications:

  • Desktop Automation: Agents trained in this scheme can carry out various desktop tasks—document formatting, file management, cross-application workflows—leveraging hybrid API-GUI control for precision and coverage.
  • Generalization: The infrastructure and training paradigm enable the development of agents capable of handling previously unseen desktop states, software combinations, and task requirements.
  • Operational Efficiency: The use of programmatic APIs dramatically reduces necessary interaction steps, improving both training and inference efficiency.
  • Scalability: The distributed design supports massive parallelism, crucial for training state-of-the-art, large capacity policies.
  • Autonomous Agents: The technology underlies frameworks such as AutoGLM, contributing to the advancement of intelligent desktop assistants and office automation agents.

A plausible implication is that ComputerRL may accelerate the practical deployment of general computer-use agents in real enterprise and productivity settings.

6. Framework Integration and Broader Impact

ComputerRL's modular design and standardized interfaces (e.g., AgentBench) foster interoperability with various RL and LLM architectures. The approach has implications for research trends at the intersection of LLMs, RL, and user-interface automation. By supporting both direct, high-level compositional actions (APIs) and fallback granular interface manipulations (GUI), ComputerRL is positioned to serve as a foundational infrastructure in the evolving landscape of intelligent human-computer collaboration.

The framework’s demonstrated empirical advances, robust distributed engineering, and training methodology represent a substantial step toward scalable, general-purpose, desktop agent research. Its integration into AutoGLM and adoption for the OSWorld benchmark signal its relevance and impact within the applied RL and AI systems community (Lai et al., 19 Aug 2025).
