- The paper introduces a hybrid API-GUI paradigm and a scalable distributed RL framework that together enable efficient training of autonomous desktop automation agents.
- It employs a three-stage training methodology combining behavior cloning, step-level GRPO, and the Entropulse phase to overcome entropy collapse and enhance exploration.
- Empirical evaluations on the OSWorld benchmark show a 48.1% success rate, a 64% relative gain from RL training, along with improved task-completion efficiency and fewer operational steps.
ComputerRL: End-to-End Online Reinforcement Learning for Desktop Automation Agents
Introduction and Motivation
The paper presents ComputerRL, a comprehensive framework for training autonomous agents to operate complex desktop environments using end-to-end online reinforcement learning (RL). The motivation stems from the limitations of existing GUI agents, which are constrained by human-centric interface designs and the inefficiency of behavior cloning (BC) and model distillation approaches. ComputerRL addresses these challenges by introducing a hybrid API-GUI interaction paradigm, a scalable distributed RL infrastructure, and a novel training strategy—Entropulse—to mitigate entropy collapse and sustain learning in extended RL runs.
Framework Architecture and API-GUI Paradigm
ComputerRL's architecture is built around three pillars: the API-GUI action space, a large-scale parallel desktop environment, and a fully asynchronous RL training framework.
The API-GUI paradigm unifies programmatic API calls with direct GUI actions, allowing agents to leverage the efficiency of APIs while retaining the flexibility of GUI operations. API construction is automated via LLMs, which analyze user-supplied task exemplars, generate interface definitions, implement APIs using application-specific Python libraries, and produce test cases for validation. This workflow significantly reduces manual effort and enables rapid expansion of agent capabilities across diverse applications.
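The sketch below shows one way such a hybrid action space could be wired up: the agent emits either a programmatic API call or a raw GUI primitive, and a dispatcher routes each to the right executor. The class and function names (e.g. `HybridActionSpace`, `calc_set_cell`) are illustrative assumptions, not ComputerRL's actual interface.

```python
# A minimal sketch of a hybrid API-GUI action space; names are illustrative,
# not ComputerRL's actual interface.
from dataclasses import dataclass
from typing import Callable


@dataclass
class GUIAction:
    kind: str            # e.g. "click", "type", "scroll"
    x: int = 0
    y: int = 0
    text: str = ""


@dataclass
class APIAction:
    name: str            # e.g. "calc_set_cell", produced by the LLM workflow
    kwargs: dict


class HybridActionSpace:
    """Routes agent outputs to either a programmatic API or GUI primitives."""

    def __init__(self) -> None:
        self._apis: dict[str, Callable[..., str]] = {}

    def register_api(self, name: str, fn: Callable[..., str]) -> None:
        # APIs are generated per application (LLM-written Python wrappers)
        # and validated with test cases before being registered.
        self._apis[name] = fn

    def execute(self, action: GUIAction | APIAction) -> str:
        if isinstance(action, APIAction):
            return self._apis[action.name](**action.kwargs)
        # Fall back to raw GUI control for anything without an API.
        return f"gui:{action.kind}@({action.x},{action.y})"


# Example: a hypothetical LLM-generated API for LibreOffice Calc.
space = HybridActionSpace()
space.register_api("calc_set_cell", lambda cell, value: f"set {cell}={value}")
print(space.execute(APIAction("calc_set_cell", {"cell": "A1", "value": 42})))
print(space.execute(GUIAction(kind="click", x=120, y=340)))
```

The design choice this illustrates is that API calls and GUI actions share a single action space, so the policy can fall back to pixel-level control whenever no suitable API exists.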
Figure 1: Overview of ComputerRL framework, integrating API-GUI actions and large-scale parallel desktop environments for efficient agent training.
The infrastructure supports thousands of concurrent Ubuntu desktop instances, orchestrated via Docker and gRPC, with a web-based controller for real-time monitoring. The environment is decoupled from the computational backend, enabling flexible resource management and seamless integration with learning algorithms.
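Under the stated design, each containerized desktop exposes a remote reset/step endpoint and rollout workers fan out over the pool asynchronously. The sketch below illustrates that decoupling; the `EnvClient` interface and addresses are placeholders, not the framework's actual gRPC service.

```python
# A minimal sketch of stepping many remote desktop environments concurrently,
# assuming each Docker-hosted instance exposes a reset/step RPC. EnvClient is
# a placeholder for ComputerRL's actual gRPC service.
import asyncio
from dataclasses import dataclass


@dataclass
class EnvClient:
    address: str  # e.g. "10.0.0.7:50051", one endpoint per container

    async def reset(self, task_id: str) -> str:
        # In the real system this would be an RPC into the desktop instance.
        return f"obs0[{self.address}:{task_id}]"

    async def step(self, action: str) -> tuple[str, float, bool]:
        return f"obs[{self.address}]", 0.0, False


async def collect_rollouts(clients: list[EnvClient], task_id: str) -> list[str]:
    """Reset all environments in parallel; workers proceed asynchronously."""
    return await asyncio.gather(*(c.reset(task_id) for c in clients))


if __name__ == "__main__":
    pool = [EnvClient(f"10.0.0.{i}:50051") for i in range(4)]
    print(asyncio.run(collect_rollouts(pool, "osworld-demo")))
```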
Training Methodology: Behavior Cloning, RL, and Entropulse
Training in ComputerRL proceeds in three stages:
- Behavior Cloning (BC) Cold-Start: Initial policy is trained via supervised imitation of trajectories collected from multiple general LLMs. Stratified sampling and model collaboration are used to maximize trajectory diversity and coverage, especially for hard or unsolved tasks.
- Step-Level Group Relative Policy Optimization (GRPO): RL is performed using a step-level extension of GRPO, where each action in a trajectory receives a reward based on verifiable, rule-based task completion. The loss aggregates step advantages, normalized across all actions in a batch, and includes a KL penalty to regularize policy updates; a minimal sketch of this step-advantage computation follows the list.
- Entropulse: To counteract entropy collapse and stagnation in RL, Entropulse alternates RL with supervised fine-tuning (SFT) on successful rollouts. This restores policy entropy, enhances exploration, and enables further performance gains in subsequent RL phases; the alternation schedule is also sketched below.
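Under one plausible reading of the GRPO description above, every step in a trajectory inherits the trajectory's rule-based outcome reward, and advantages are normalized over all steps in the batch before a clipped policy-gradient update with a KL penalty against a reference policy. The sketch below illustrates that reading; it is not the paper's exact loss.

```python
# Illustrative step-level advantage computation and GRPO-style loss,
# assuming each step inherits its trajectory's rule-based outcome reward.
import torch


def step_level_advantages(trajectory_rewards: list[float],
                          steps_per_traj: list[int]) -> torch.Tensor:
    """Broadcast each trajectory's terminal reward to its steps, then
    normalize across the whole batch of steps (zero mean, unit variance)."""
    step_rewards = torch.cat([
        torch.full((n,), r) for r, n in zip(trajectory_rewards, steps_per_traj)
    ])
    return (step_rewards - step_rewards.mean()) / (step_rewards.std() + 1e-8)


def grpo_step_loss(logp_new: torch.Tensor, logp_old: torch.Tensor,
                   logp_ref: torch.Tensor, adv: torch.Tensor,
                   clip_eps: float = 0.2, kl_coef: float = 0.01) -> torch.Tensor:
    """Clipped policy-gradient objective over steps plus a KL penalty
    against the reference policy (standard GRPO-style regularization)."""
    ratio = torch.exp(logp_new - logp_old)
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps)
    pg = torch.min(ratio * adv, clipped * adv)
    # k3 estimator of KL(pi_new || pi_ref), as commonly used with GRPO.
    kl = torch.exp(logp_ref - logp_new) - (logp_ref - logp_new) - 1
    return -(pg - kl_coef * kl).mean()
```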
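The Entropulse alternation itself reduces to a simple schedule: run RL until progress stalls, then fine-tune on the successful rollouts gathered so far, then resume RL. The sketch is schematic, with function names assumed for illustration.

```python
# Schematic Entropulse schedule; train_rl_phase / train_sft_phase are
# assumed callables, not ComputerRL's actual training entry points.
def entropulse_schedule(policy, train_rl_phase, train_sft_phase, num_cycles: int = 2):
    successful_rollouts = []
    for _ in range(num_cycles):
        # 1) Run RL (step-level GRPO) until reward plateaus / entropy collapses;
        #    assume it returns the successful trajectories it produced.
        successful_rollouts.extend(train_rl_phase(policy))
        # 2) SFT on accumulated successful trajectories restores policy entropy
        #    and re-enables exploration for the next RL phase.
        train_sft_phase(policy, successful_rollouts)
    return policy
```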
Figure 2: ComputerRL training pipeline, illustrating BC cold-start, RL with step-level GRPO, and Entropulse for entropy recovery.
Figure 3: Training curves showing reward and entropy dynamics, with Entropulse restoring exploration and enabling continued improvement.
Experimental Evaluation on OSWorld
ComputerRL is evaluated on the OSWorld benchmark using open-source LLMs (GLM-4-9B-0414 and Qwen2.5-14B). The AutoGLM-OS-9B agent achieves a state-of-the-art success rate of 48.1% on OSWorld, representing a 64% performance gain from RL and outperforming proprietary and open baselines such as OpenAI CUA o3 (42.9%), UI-TARS-1.5 (42.5%), and Claude Sonnet 4 (30.7%).

Figure 4: Success rates of agents on OSWorld, with AutoGLM-OS trained via ComputerRL surpassing prior models.
Ablation studies demonstrate that the API-GUI paradigm yields a 134% improvement over GUI-only approaches, with the largest gains in Office and Professional domains. Multi-stage training, particularly the Entropulse phase, is critical for maintaining exploration and achieving optimal performance. The agent exhibits strong proficiency in long-range planning and complex reasoning, completing tasks with up to 1/3 fewer steps than baselines.
Qualitative Case Studies and Error Analysis
Case studies illustrate the agent's ability to execute diverse desktop tasks, including table creation in LibreOffice Calc, text formatting in Writer, system resource monitoring via Terminal, and image transfer between GIMP and Writer.
Figure 5: Examples of AutoGLM-OS executing multi-application user tasks in the Ubuntu desktop environment.
Figure 6: Stepwise execution of a table creation task in LibreOffice Calc.
Figure 7: Execution of a text case conversion task in LibreOffice Writer.
Figure 8: System resource monitoring and report generation via Terminal.
Figure 9: Image transfer from GIMP to LibreOffice Writer and document saving.
Error analysis reveals that failures are primarily due to visual perception errors (25.8%), multi-application coordination (34.4%), operational illusions (14.2%), and other factors (25.6%). Representative failure cases include misunderstanding task requirements and incorrect click operations.
Figure 10: Failure due to question misunderstanding in a text processing task.
Figure 11: Failure due to incorrect click operations when changing GIMP's theme.
Implications and Future Directions
ComputerRL establishes a scalable and robust foundation for training generalist desktop agents. The hybrid API-GUI paradigm and distributed RL infrastructure enable efficient learning and generalization across heterogeneous applications. The Entropulse strategy provides a principled solution to entropy collapse, facilitating sustained policy improvement.
Practical implications include the potential for persistent, autonomous desktop assistants capable of orchestrating complex workflows, adapting to novel interfaces, and integrating multimodal perception. Theoretical implications involve the design of RL algorithms and training curricula that balance exploration and exploitation in high-dimensional, long-horizon environments.
Future research should focus on expanding training diversity, enhancing multimodal perception, developing hierarchical planning for long-horizon autonomy, and establishing safety and alignment protocols for agent actions in sensitive digital environments.
Conclusion
ComputerRL advances the state of the art in desktop automation agents by integrating API-based and GUI actions, scalable data collection, and multi-stage RL training. The framework demonstrates superior reasoning, accuracy, and generalization on OSWorld, laying the groundwork for more capable and persistent autonomous computer use agents. The work highlights promising directions for future research in intelligent human-computer interaction and RL-based agent training.