ClawGUI-RL: Open-Source GUI RL Framework
- ClawGUI-RL is a reinforcement learning framework that enables GUI agents to operate complex software by interacting through user interfaces on both emulated and real devices.
- It integrates an Environment Manager, a dual reward system providing dense per-step and binary outcome rewards, and supports policy-gradient algorithms like GiGPO for efficient training.
- On the MobileWorld GUI-Only benchmark, the ClawGUI-2B agent trained with it reaches a 17.1% success rate, ahead of the same-scale MAI-UI-2B (11.1%) and of larger models such as UI-Venus-72B (16.4%).
ClawGUI-RL is the reinforcement learning (RL) backbone of the ClawGUI open-source framework, designed to enable training, evaluation, and deployment of GUI agents capable of interacting with complex software applications through their user interfaces rather than through programmatic APIs. This system is the first validated, open-source infrastructure supporting both parallel virtual environments and real physical devices for GUI online RL, integrating advanced advantage estimation and dense reward supervision to address environment instability, non-standardized evaluation, and deployment bottlenecks in the field (Tang et al., 13 Apr 2026).
1. Architecture and Data Flow
ClawGUI-RL is structured around three primary subsystems that together form a scalable, unified RL pipeline:
- Environment Manager: Abstracts both emulator-based (Dockerized Android) and real-device backends. The interface unifies environment reset, step execution, and rendering operations via a Python API. Key features include health monitoring, automated crash recovery, and spare-server management.
- Reward Module: Computes both a sparse, binary episode-level outcome reward ($R_{\text{outcome}}$) and a dense per-step reward via the Process Reward Model (PRM). Success at the episode level is determined either by system checks (on rooted emulators) or by LLM-based judgement (on real devices).
- RL Trainer: Supports a policy-gradient loop with REINFORCE++, PPO, GRPO, or GiGPO for policy optimization. It aggregates trajectories, estimates hierarchical advantages (as required by GiGPO), and applies gradient-based updates to policy parameters.
Data flows as follows: parallel workers sample tasks from the Environment Manager, execute policy actions, and receive observations and dense rewards; when an episode completes, the outcome reward is added. The resulting trajectories feed hierarchical advantage computation and batched policy updates.
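The unified backend interface and spare rotation described above can be illustrated with a minimal sketch. All names here (`GuiBackend`, `StepResult`, `FakeEmulator`, `EnvironmentManager.recover`) are hypothetical and are not taken from the ClawGUI-RL codebase; the in-memory `FakeEmulator` only stands in for a Dockerized emulator or a real device so that the example runs without external services.

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import Any


@dataclass
class StepResult:
    observation: Any      # e.g. a screenshot; a string stands in for it here
    step_reward: float    # dense per-step reward (would come from the PRM)
    done: bool            # True once the episode has ended


class GuiBackend(ABC):
    """Hypothetical common interface for emulator and real-device backends."""

    @abstractmethod
    def reset(self, task: str) -> Any: ...

    @abstractmethod
    def step(self, action: dict) -> StepResult: ...

    @abstractmethod
    def healthy(self) -> bool: ...


class FakeEmulator(GuiBackend):
    """In-memory stand-in so the sketch runs without Docker or ADB."""

    def __init__(self, name: str, max_steps: int = 3, crashed: bool = False):
        self.name, self.max_steps, self.crashed = name, max_steps, crashed
        self.t = 0

    def reset(self, task: str) -> Any:
        self.t = 0
        return f"{self.name}:{task}:screen0"

    def step(self, action: dict) -> StepResult:
        self.t += 1
        return StepResult(f"{self.name}:screen{self.t}", 0.0, self.t >= self.max_steps)

    def healthy(self) -> bool:
        return not self.crashed


class EnvironmentManager:
    """Tracks active backends and swaps in spares when one is unhealthy."""

    def __init__(self, active: list[GuiBackend], spares: list[GuiBackend]):
        self.active, self.spares = list(active), list(spares)

    def recover(self) -> int:
        """Replace unhealthy active backends with spares; return the swap count."""
        swapped = 0
        for i, env in enumerate(self.active):
            if not env.healthy() and self.spares:
                self.active[i] = self.spares.pop(0)
                swapped += 1
        return swapped


manager = EnvironmentManager(
    active=[FakeEmulator("emu-0"), FakeEmulator("emu-1", crashed=True)],
    spares=[FakeEmulator("spare-0")],
)
swaps = manager.recover()                      # emu-1 is replaced by spare-0
first_obs = manager.active[1].reset("open settings")
```

Keeping `reset`, `step`, and `healthy` on one abstract class is what lets a trainer treat emulator and real-device workers interchangeably; a real backend would replace the string observations with screenshots.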
2. Formal MDP Specification for GUI Agents
ClawGUI-RL defines GUI interaction as a Markov Decision Process (MDP) where:
- $\mathcal{S}$: Internal states (e.g., device memory, screen image)
- $\mathcal{A}$: Discrete action space (tap, swipe, text input, navigation)
- $\mathcal{O}$: Observation space ($o_t \in \mathcal{O}$, RGB screenshot)
- $P(s_{t+1} \mid s_t, a_t)$: State-action transition function
- $Z(o_t \mid s_t)$: State-to-observation mapping
- $R(s_t, a_t)$: Reward function
- $\gamma$: Discount factor, $\gamma \in [0, 1]$
At each timestep $t$, the agent observes $o_t$, executes $a_t$, transitions according to $P$, and receives $r_t$. Episodes terminate after $T$ steps or at success/failure.
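To make the per-step interaction concrete, the sketch below records an episode as a list of transitions and computes its discounted return. The `Transition` container, the string observations, and the reward values are illustrative assumptions, not part of the ClawGUI-RL specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transition:
    observation: str   # stand-in for the RGB screenshot seen at this step
    action: str        # one of the discrete GUI actions
    reward: float      # scalar reward received after the action


def discounted_return(rewards: list[float], gamma: float) -> float:
    """Compute sum_t gamma^t * r_t by folding from the last step backward."""
    total = 0.0
    for r in reversed(rewards):
        total = r + gamma * total
    return total


# Toy three-step episode with made-up rewards.
episode = [
    Transition("home_screen", "tap", 0.0),
    Transition("settings_menu", "swipe", 1.0),
    Transition("wifi_page", "tap", 1.0),
]
g = discounted_return([tr.reward for tr in episode], gamma=0.5)
# g = 0.0 + 0.5 * 1.0 + 0.25 * 1.0 = 0.75
```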
3. Support for Virtual and Real Devices
ClawGUI-RL's Environment Manager unifies two backend types:
- Virtual Environments: Each environment is a Docker-based emulator that exposes a REST API for state resets and action stepping. Full root access allows deterministic verification of task success via direct state or UI-tree queries, and spare containers are automatically rotated in on failure.
- Real Devices: Managed via ADB over USB/TCP, with task scenarios labeled for LLM-based judgement due to lack of root/system access. Each device responds to input commands, and visual state is captured via remote screencap. The outcome judge relies on LLM prompting against the goal state.
This design allows seamless interleaving of emulator and real-device workers within the same training job, covering both scalable simulation and deployment realism.
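For the real-device path, the sketch below builds the kind of ADB command lines such a backend could issue. The `adb -s <serial>`, `shell input tap`, `shell input swipe`, and `exec-out screencap -p` invocations are standard ADB usage; the helper function names and the example serial are hypothetical, and how ClawGUI-RL actually wraps ADB is not stated in the source. The commands are only constructed here, not executed.

```python
def adb_prefix(serial: str) -> list[str]:
    """Base argv targeting one device by serial (USB id or host:port for TCP)."""
    return ["adb", "-s", serial]


def tap_cmd(serial: str, x: int, y: int) -> list[str]:
    """Argv for a single tap at pixel coordinates (x, y)."""
    return adb_prefix(serial) + ["shell", "input", "tap", str(x), str(y)]


def swipe_cmd(serial: str, x1: int, y1: int, x2: int, y2: int,
              duration_ms: int = 300) -> list[str]:
    """Argv for a swipe from (x1, y1) to (x2, y2) lasting duration_ms."""
    coords = [str(v) for v in (x1, y1, x2, y2, duration_ms)]
    return adb_prefix(serial) + ["shell", "input", "swipe"] + coords


def screencap_cmd(serial: str) -> list[str]:
    """Argv that streams a PNG screenshot to stdout."""
    return adb_prefix(serial) + ["exec-out", "screencap", "-p"]


# Example serial for a device reached over TCP; purely illustrative.
serial = "192.168.1.20:5555"
tap = tap_cmd(serial, 540, 1200)
shot = screencap_cmd(serial)
```

Passing such a list to `subprocess.run(cmd, capture_output=True)` would execute it; for `screencap_cmd`, the PNG bytes arrive on standard output.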
4. Credit Assignment and Reward Shaping
GiGPO Advantage Estimation
Group-in-Group Policy Optimization (GiGPO) is centrally integrated, providing a hierarchical advantage estimator:
- Global (Episode-Level) Advantage: Across $N$ rollouts of a task, the total return $R(\tau_i)$ is normalized within the group to produce $A^{\text{global}}_i$.
- Micro (Step-Level) Advantage: Steps across rollouts are clustered by anchor state; within each micro-group, future returns are normalized to yield $A^{\text{micro}}_{i,t}$.
- Convex Combination: Final advantage is $A_{i,t} = \alpha A^{\text{global}}_i + (1 - \alpha) A^{\text{micro}}_{i,t}$, with tunable $\alpha \in [0, 1]$.
This facilitates fine-grained, step-level credit assignment without value functions, supporting dense and efficient policy learning.
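The two-level estimator can be sketched in a few lines of plain Python. This is an illustrative reading of the description above, not the reference GiGPO implementation: the mean/standard-deviation normalization, the use of a hashable `state_key` as the anchor state, and the mixing weight name `alpha` are all assumptions.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize(values, eps=1e-8):
    """Zero-mean, unit-variance normalization; a lone value maps to 0."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def returns_to_go(rewards, gamma):
    """Discounted future return from each step onward."""
    out, total = [], 0.0
    for r in reversed(rewards):
        total = r + gamma * total
        out.append(total)
    return out[::-1]


def hierarchical_advantages(rollouts, gamma=1.0, alpha=0.5):
    """rollouts: list of trajectories, each a list of (state_key, reward) pairs."""
    rtgs = [returns_to_go([r for _, r in traj], gamma) for traj in rollouts]

    # Episode level: normalize each rollout's total return within the group.
    global_adv = normalize([rtg[0] for rtg in rtgs])

    # Step level: bucket every step by its anchor state, across rollouts.
    buckets = defaultdict(list)   # state_key -> [(rollout idx, step idx, return-to-go)]
    for i, traj in enumerate(rollouts):
        for t, (state_key, _) in enumerate(traj):
            buckets[state_key].append((i, t, rtgs[i][t]))

    micro_adv = [[0.0] * len(traj) for traj in rollouts]
    for members in buckets.values():
        normed = normalize([g for _, _, g in members])
        for (i, t, _), a in zip(members, normed):
            micro_adv[i][t] = a

    # Convex combination of the two levels (alpha is an assumed name).
    return [
        [alpha * global_adv[i] + (1 - alpha) * micro_adv[i][t] for t in range(len(traj))]
        for i, traj in enumerate(rollouts)
    ]


rollouts = [
    [("home", 0.0), ("settings", 1.0)],   # reaches the goal
    [("home", 0.0), ("browser", 0.0)],    # wanders off
]
adv = hierarchical_advantages(rollouts)
```

Both rollouts share the anchor state `"home"`, so the first step of each is compared directly; the second steps visit states seen only once and therefore receive a zero step-level term, leaving only the episode-level signal.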
Process Reward Model and Reward Function
Rewards combine:
- Episode Outcome Reward: $R_{\text{outcome}} = 1$ for success or 0 otherwise (based on perfect verification or LLM judge).
- Step-Level PRM Reward: At each step, a pretrained LLM judge assesses whether $a_t$ advances the task, returning a judgement $\hat{y}_t$; $r_t^{\text{PRM}} > 0$ if $\hat{y}_t$ is positive, 0 otherwise.
Total return per episode:
$$R(\tau) = R_{\text{outcome}} + \sum_{t=1}^{T} r_t^{\text{PRM}}$$
If training a new PRM, the cross-entropy loss between LLM outputs $\hat{y}_t$ and human/script labels $y_t$ is minimized.
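As a small worked example of the two reward sources and the PRM training objective, the sketch below sums step rewards with the outcome reward and evaluates a binary cross-entropy loss. It assumes an unweighted sum, an outcome reward of 1 on success, a step reward of 1 for a positive judgement, and a PRM that outputs a probability per step; none of these specifics are confirmed by the source.

```python
import math


def episode_return(step_rewards: list[float], success: bool) -> float:
    """Outcome reward (assumed 1 on success) plus the sum of step rewards."""
    outcome = 1.0 if success else 0.0
    return outcome + sum(step_rewards)


def binary_cross_entropy(probs: list[float], labels: list[int],
                         eps: float = 1e-12) -> float:
    """Mean cross-entropy between predicted probabilities and 0/1 labels."""
    losses = []
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)          # clamp to avoid log(0)
        losses.append(-(y * math.log(p) + (1 - y) * math.log(1.0 - p)))
    return sum(losses) / len(losses)


# Three steps, two judged helpful, and the task succeeded.
total = episode_return([1.0, 0.0, 1.0], success=True)      # 1 + 2 = 3.0

# A maximally uncertain judge (p = 0.5) has loss ln 2 regardless of labels.
loss = binary_cross_entropy([0.5, 0.5], [1, 0])
```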
5. Training Protocol, Hyperparameters, and Empirical Results
End-to-End Training Loop
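The standard pipeline alternates between parallel trajectory collection and policy updates: workers roll out the current policy across environments, rewards and advantages are computed over the collected group, and the policy parameters are updated before the next round of collection.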
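The alternation between collection and update can be illustrated on a toy problem. The sketch below is not the ClawGUI-RL trainer: it uses a hypothetical two-action, single-step task, a softmax policy over two logits, a binary outcome reward, and plain group-normalized advantages in place of GiGPO and the PRM.

```python
import math
import random

ACTIONS = ["tap_ok", "tap_cancel"]      # toy action space
GOAL_ACTION = "tap_ok"                  # the toy task succeeds on this action


def softmax(logits):
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]


def collect_group(logits, group_size, rng):
    """Roll out one single-step episode per group member: (action idx, reward)."""
    probs = softmax(logits)
    group = []
    for _ in range(group_size):
        a = rng.choices(range(len(ACTIONS)), weights=probs)[0]
        reward = 1.0 if ACTIONS[a] == GOAL_ACTION else 0.0   # binary outcome
        group.append((a, reward))
    return group


def group_advantages(rewards, eps=1e-8):
    """Normalize rewards within the group; identical rewards give zeros."""
    mu = sum(rewards) / len(rewards)
    var = sum((r - mu) ** 2 for r in rewards) / len(rewards)
    return [(r - mu) / (math.sqrt(var) + eps) for r in rewards]


def policy_gradient_step(logits, group, lr):
    """One REINFORCE-style ascent step on the softmax logits."""
    probs = softmax(logits)
    advs = group_advantages([r for _, r in group])
    grad = [0.0] * len(logits)
    for (a, _), adv in zip(group, advs):
        for k in range(len(logits)):
            # d log pi(a) / d logit_k = 1[k == a] - pi(k)
            grad[k] += adv * ((1.0 if k == a else 0.0) - probs[k])
    return [x + lr * g / len(group) for x, g in zip(logits, grad)]


def train(iterations=50, group_size=8, lr=0.5, seed=0):
    rng = random.Random(seed)
    logits = [0.0, 0.0]                                     # uniform initial policy
    for _ in range(iterations):
        group = collect_group(logits, group_size, rng)      # 1. collect trajectories
        logits = policy_gradient_step(logits, group, lr)    # 2. advantages + update
    return softmax(logits)


final_probs = train()   # probability mass shifts toward "tap_ok"
```

In the full system, `collect_group` would be replaced by parallel emulator or device rollouts and `group_advantages` by the hierarchical estimator, but the outer loop keeps the same two-phase shape.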
Experimental Hyperparameters and Reward Ablation
The main ClawGUI-2B run employs:
- 64 Docker emulators (no real devices for this run)
- 8×A6000 GPUs (48 GB of GPU memory each)
- GiGPO with group size 8, 3
- Sampling temperature 4
- Learning rate 5
- Batch size: 8 trajectories per update
- Discount factor: 6
- 3 training epochs
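One way to carry these settings in code is a small configuration object. The sketch below is hypothetical (the class and field names are not from ClawGUI-RL); it fills in only the values that are legible in the list above and leaves the remaining hyperparameters unset rather than guessing them.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass
class RunConfig:
    # Values taken from the list above.
    num_emulators: int = 64
    num_gpus: int = 8
    gigpo_group_size: int = 8
    batch_trajectories: int = 8
    epochs: int = 3
    # Deliberately left unset in this sketch; supply them before launching.
    gigpo_weight: Optional[float] = None
    sampling_temperature: Optional[float] = None
    learning_rate: Optional[float] = None
    discount_factor: Optional[float] = None

    def missing(self) -> list[str]:
        """Names of hyperparameters that still need a value."""
        return [f.name for f in fields(self) if getattr(self, f.name) is None]

    def groups_per_update(self) -> int:
        """How many full rollout groups fit in one update batch."""
        return self.batch_trajectories // self.gigpo_group_size


cfg = RunConfig()
todo = cfg.missing()
groups = cfg.groups_per_update()   # 8 trajectories / group size 8 = 1
```

With a batch of 8 trajectories and a group size of 8, each update consumes exactly one rollout group.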
Ablation on reward types indicates that dense, step-level PRM rewards with GiGPO yield a +2.6 point absolute gain in MobileWorld GUI-Only Success Rate over binary (episode-level) rewards (17.1% vs. 14.5%).
Benchmark Comparison
| Model | MobileWorld GUI-Only SR (%) |
|---|---|
| MAI-UI-2B | 11.1 |
| Qwen3-VL-32B | 11.9 |
| UI-Venus-72B | 16.4 |
| ClawGUI-2B (ours) | 17.1 |
ClawGUI-2B exceeds the same-scale MAI-UI-2B baseline by 6.0 points and also outperforms the much larger Qwen3-VL-32B and UI-Venus-72B, neither of which was trained with ClawGUI-RL. This suggests that training infrastructure and credit shaping can matter more than model scale alone in this regime.
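The margins quoted here follow directly from the table; the short check below recomputes them from the reported success rates (the dictionary simply restates the table, and the variable names are illustrative).

```python
# MobileWorld GUI-Only success rates (%) as reported in the table.
scores = {
    "MAI-UI-2B": 11.1,
    "Qwen3-VL-32B": 11.9,
    "UI-Venus-72B": 16.4,
    "ClawGUI-2B": 17.1,
}

# Absolute margin of ClawGUI-2B over each baseline, in percentage points.
margins = {
    name: round(scores["ClawGUI-2B"] - sr, 1)
    for name, sr in scores.items()
    if name != "ClawGUI-2B"
}

# Gain of dense PRM rewards over binary rewards from the ablation (17.1 vs. 14.5).
ablation_gain = round(17.1 - 14.5, 1)
```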
6. Usage Example and Pipeline Launch
Launching a ClawGUI-RL training job requires minimal setup: a single launch script provisions all emulators, streams trajectories, computes dense rewards with LLMs, applies GiGPO, and updates the policy network over multiple epochs.
7. Limitations and Future Directions
ClawGUI-RL's open-source RL infrastructure is the first to address scalable training across both emulators and real devices. However, notable challenges include:
- On real devices, reward verification currently relies on LLM-based judgement in the absence of root/system signals.
- The RL loop is Android-only; extending it to iOS requires new device drivers.
- Agents are strictly reactive; there is no learned world model for multi-step look-ahead or planning.
- LLM-based PRM and outcome judges impose significant CPU/GPU overhead.
Proposed future enhancements include device-privacy-preserving RL, GUI-world model learning to enable planning, unified support for CLI and GUI interaction, and extending the environment abstraction to iOS, Linux, Windows, and HarmonyOS ecosystems. This suggests rapid advances are plausible as the community builds on the open ClawGUI-RL foundation (Tang et al., 13 Apr 2026).