ClawGUI: Open-Source GUI Agent Framework

Updated 4 July 2026

ClawGUI is an open-source, full-stack framework that integrates online reinforcement learning training, standardized evaluation, and real-device deployment for GUI agents using visual control.
It features three key modules—ClawGUI-RL, ClawGUI-Eval, and ClawGUI-Agent—that collectively tackle challenges like emulator drift, evaluation inconsistencies, and disconnected deployment.
The framework employs dense step-level rewards and a hierarchical credit assignment method (GiGPO) to achieve significant improvements in mobile GUI task success rates over baseline models.

ClawGUI is an open-source, full-stack framework for GUI agents that unifies three stages that are commonly separated in prior work: online reinforcement learning training, standardized evaluation, and real-device deployment. In the framework’s formulation, GUI agents operate software through visual interfaces rather than programmatic APIs, using taps, swipes, and keystrokes to reach applications that CLI-based agents cannot. ClawGUI organizes this lifecycle through three modules (ClawGUI-RL, ClawGUI-Eval, and ClawGUI-Agent) and validates the stack with ClawGUI-2B, which achieves a 17.1% Success Rate on MobileWorld GUI-Only and is reported to outperform the same-scale MAI-UI-2B baseline by 6.0 percentage points (Tang et al., 13 Apr 2026).

1. Conceptual basis and problem setting

ClawGUI is framed around a specific diagnosis of the GUI-agent field. The paper argues that the central bottleneck is not merely model capacity, but the absence of coherent infrastructure across training, evaluation, and deployment. Three gaps are emphasized. First, online RL for GUI agents is destabilized by environment management problems such as emulator drift, container crashes, unhealthy states, and sparse long-horizon rewards. Second, nominally standardized evaluations drift through differences in prompt formatting, coordinate normalization conventions, image resolution, sampling configuration, temperature, and post-processing rules. Third, trained agents rarely reach real users on real devices, leaving research loops disconnected from deployment (Tang et al., 13 Apr 2026).

This framing also clarifies what ClawGUI is not. It is not presented as a single grounding model, planner, or benchmark. It is instead a systems framework that treats GUI-agent development as an end-to-end lifecycle problem. A recurrent misconception in the area is that better architectures alone are sufficient; ClawGUI’s position is that unstable training backends, silently drifting evaluation protocols, and weak deployment paths can each dominate measured progress.

GUI agents themselves are defined through their interaction substrate. They act through the same visual layer humans use, rather than through structured APIs or shell-level abstractions. The paper treats this as the source of both their reach and their difficulty: GUI control can cover the long tail of applications with no exposed API, but it also incurs sequential decision-making, delayed consequences, and more expensive interaction trajectories than CLI mediation.

2. Architectural organization

At the top level, ClawGUI is divided into three integrated subsystems. Figure 1 in the paper presents them as a single open-source stack spanning scalable online RL, reproducible benchmarking, and user-facing deployment (Tang et al., 13 Apr 2026).

Component	Primary role	Salient details
ClawGUI-RL	Online RL training	Parallel virtual environments and real physical devices; GiGPO with a Process Reward Model
ClawGUI-Eval	Standardized evaluation	Infer $\rightarrow$ Judge $\rightarrow$ Metric across 6 benchmarks and 11+ models
ClawGUI-Agent	Real-device deployment	Android, HarmonyOS, and iOS via 12+ chat platforms; hybrid CLI-GUI control; persistent personalized memory

The architecture is explicitly cyclical rather than linear. Models trained inside ClawGUI-RL can be evaluated under the fixed ClawGUI-Eval protocol and then surfaced to users through ClawGUI-Agent. The deployment layer can also expose evaluation as an operable skill, so that benchmarking itself becomes a user-invokable function rather than a separate offline script.

Figure 2 further decomposes ClawGUI-RL into an RL Infrastructure coupled to a Real / Virtual Environment backend. Figure 3 depicts ClawGUI-Eval as a strict three-stage pipeline. Figure 4 shows ClawGUI-Agent as a message-driven agent loop with persistent memory and skills, controlling virtual and real devices across phone, web browser, and desktop. This suggests a deliberate separation between environment management, correctness measurement, and interactive orchestration rather than a monolithic agent runtime.

3. ClawGUI-RL and online GUI reinforcement learning

ClawGUI-RL is described as the first open-source GUI-agent RL infrastructure with validated support for both large-scale parallel virtual environments and real physical devices (Tang et al., 13 Apr 2026). It is built on top of verl and verl-agent and supports Reinforce++, PPO, GSPO, GRPO, and GiGPO. In the reported experiments, the principal comparison is between GRPO and GiGPO.

For virtual training, the framework launches dozens of Docker-based Android emulators in parallel via MobileWorld. The reported ClawGUI-2B run uses 64 parallel virtual environments on 8× A6000 (48GB) GPUs. The environment lifecycle has four explicit stages: Task Reset, Task Evaluation, Spare Server Rotation, and Teardown. In virtual environments, the system can use root-level inspection of app state and database records, complemented by an MLLM-as-judge that compares the final screen against the instruction. Spare server rotation is used because emulator containers can become unhealthy during long training runs, and periodic teardown is used to prevent state accumulation and fidelity degradation.
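The spare-rotation and teardown behavior described above can be illustrated with a minimal sketch. All names here (`EmulatorSlot`, `EnvPool`, `max_episodes`) are hypothetical and not taken from the paper or from MobileWorld; the teardown threshold is an assumed value chosen only for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class EmulatorSlot:
    """Hypothetical handle for one emulator container."""
    name: str
    healthy: bool = True
    episodes_run: int = 0


@dataclass
class EnvPool:
    """Active slots serve rollouts; spares replace slots that go bad."""
    active: list
    spares: deque
    max_episodes: int = 50  # assumed teardown threshold, not from the paper
    retired: list = field(default_factory=list)

    def acquire(self, index: int) -> EmulatorSlot:
        slot = self.active[index]
        # Rotate in a spare if the slot is unhealthy or due for teardown.
        if not slot.healthy or slot.episodes_run >= self.max_episodes:
            if not self.spares:
                raise RuntimeError("no spare emulator available")
            self.retired.append(slot)
            slot = self.spares.popleft()
            self.active[index] = slot
        slot.episodes_run += 1
        return slot


pool = EnvPool(
    active=[EmulatorSlot("emu-0"), EmulatorSlot("emu-1")],
    spares=deque([EmulatorSlot("spare-0")]),
)
pool.active[1].healthy = False  # simulate a container that became unhealthy
replacement = pool.acquire(1)   # the spare takes over slot 1
```

In this sketch a retired slot is only recorded; a real backend would presumably tear the container down and rebuild it before returning it to the spare queue.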

Real-device training is also supported, but under different assumptions. Because real devices do not expose the same procedural reset and root-level verification paths, tasks must be manually authored and final-state assessment relies on MLLM-as-judge rather than system-level inspection. This support is significant, but the paper is careful not to equate it with the automated scalability of emulator training.

Reward design combines sparse episode-level outcome reward with dense step-level supervision from a Process Reward Model (PRM). After each action, the PRM receives the previous screenshot, the current screenshot, and the full action history so far, then judges whether the step meaningfully contributes to task completion. The paper gives the reward combination as $R = R_{\text{outcome}} + R_{\text{step}}$, where $R_{\text{outcome}}$ is the binary episode-level reward and $R_{\text{step}}$ is the dense per-step process reward. This dense signal is important because GUI tasks are long-horizon and binary terminal rewards are extremely sparse.
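One way to read the combination of outcome and step rewards is as a per-step reward sequence whose sum equals the total. The sketch below assumes the PRM verdict is a boolean per step and that the outcome reward is credited on the final step. The function name `step_rewards` and the `step_bonus` magnitude are assumptions for illustration, not values from the paper.

```python
from typing import List, Sequence


def step_rewards(prm_verdicts: Sequence[bool], task_succeeded: bool,
                 step_bonus: float = 0.1) -> List[float]:
    """Return one reward per step: PRM bonus everywhere, outcome at the end.

    The sum of the returned list is R_outcome + R_step for the episode.
    """
    rewards = [step_bonus if ok else 0.0 for ok in prm_verdicts]
    if rewards:
        # Binary episode-level outcome is added to the terminal step.
        rewards[-1] += 1.0 if task_succeeded else 0.0
    return rewards


rewards = step_rewards([True, False, True], task_succeeded=True, step_bonus=0.5)
```

A failed episode still receives nonzero reward on steps the PRM judged useful, which is the sense in which the signal is dense.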

GiGPO is introduced as the main credit-assignment mechanism. GRPO normalizes returns within groups of rollouts sharing the same task, which means every step in a successful trajectory can receive essentially the same episode-level advantage. GiGPO instead uses a two-level hierarchical advantage scheme: episode-level macro advantage across trajectories and step-level micro advantage within anchor-state sub-groups. Steps from different rollouts that reach the same intermediate state are clustered into anchor-state groups, and discounted return normalization is applied within those sub-groups. The paper argues that this yields fine-grained credit assignment without a learned value network and without additional rollouts.
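The two-level scheme can be sketched in a few lines. This is an illustrative reading of the description above, not the verl-agent implementation: the additive combination with a weight `omega`, the use of a hashable `state_key` to detect identical intermediate states, and the zero advantage for singleton groups are all assumptions of the sketch.

```python
from collections import defaultdict
from statistics import mean, pstdev


def normalize(values, eps=1e-6):
    """Center and scale a list of returns within one group."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def discounted_returns(rewards, gamma):
    """Return-to-go for every step of one trajectory."""
    out, running = [], 0.0
    for r in reversed(rewards):
        running = r + gamma * running
        out.append(running)
    return out[::-1]


def two_level_advantages(rollouts, gamma=0.95, omega=1.0):
    """rollouts: list of trajectories, each a list of (state_key, reward).

    Each step gets its trajectory's episode-level advantage plus omega times
    its step-level advantage within the anchor-state group it belongs to.
    """
    episode_returns = [sum(r for _, r in traj) for traj in rollouts]
    macro = normalize(episode_returns)

    # Cluster steps from all rollouts by the state in which they were taken.
    groups = defaultdict(list)  # state_key -> [(rollout_idx, step_idx, return)]
    for i, traj in enumerate(rollouts):
        returns = discounted_returns([r for _, r in traj], gamma)
        for t, ((state, _), g) in enumerate(zip(traj, returns)):
            groups[state].append((i, t, g))

    micro = [[0.0] * len(traj) for traj in rollouts]
    for members in groups.values():
        if len(members) < 2:
            continue  # a singleton group carries no step-level signal
        scores = normalize([g for _, _, g in members])
        for (i, t, _), adv in zip(members, scores):
            micro[i][t] = adv

    return [[macro[i] + omega * micro[i][t] for t in range(len(traj))]
            for i, traj in enumerate(rollouts)]


# Two rollouts of the same task that share the "home" state, then diverge.
success = [("home", 0.0), ("settings", 1.0)]
failure = [("home", 0.0), ("browser", 0.0)]
adv = two_level_advantages([success, failure], gamma=1.0)
```

In the example, the first step of the successful rollout receives more credit than its second step, because only the shared "home" state allows a step-level comparison against the failed rollout. Under GRPO alone both steps would receive the same episode-level advantage.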

For the reported ClawGUI-2B experiment, the training configuration is: MAI-UI-2B as the base model, 64 parallel virtual environments, 8× A6000 (48GB) GPUs, GiGPO, rollout group size 8, sampling temperature 0.7, learning rate 1e-6, epochs 3, training batch size 8, and Qwen3.5-72B as the PRM judge model. These are the framework’s most concrete RL implementation details in the paper.

4. ClawGUI-Eval and evaluation standardization

ClawGUI-Eval is the subsystem intended to eliminate silent evaluation drift by pinning all evaluation choices per model and decomposing benchmarking into the fixed pipeline $\text{Infer} \rightarrow \text{Judge} \rightarrow \text{Metric}$ (Tang et al., 13 Apr 2026). This decomposition is both procedural and archival: the framework releases not only evaluation code but also inference outputs, enabling re-judging and metric recomputation without rerunning inference.

The Infer stage supports both local GPU inference through transformers and remote API inference through OpenAI-compatible endpoints. Multi-GPU inference is implemented through Python multiprocessing, with one process pinned to one GPU and shard-level checkpointing so interrupted runs can resume without recomputing completed shards. The Judge stage is benchmark-specific. It includes a point-in-box judge for standard GUI grounding, a polygon and refusal-aware judge for OSWorld-G, and a multi-action judge for AndroidControl. The Metric stage aggregates per-sample correctness into final benchmark metrics and also provides breakdowns by platform, UI element type, and task category.
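Because inference outputs are archived, the Judge and Metric stages can be rerun on stored predictions alone. The sketch below shows a point-in-box judge and a per-platform accuracy breakdown over such stored records. The `Prediction` record layout and the function names are hypothetical; the actual ClawGUI-Eval schema may differ.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    """Hypothetical stored output of the Infer stage for one sample."""
    sample_id: str
    platform: str
    point: tuple  # predicted (x, y) click in pixels
    box: tuple    # ground-truth (x_min, y_min, x_max, y_max) in pixels


def judge_point_in_box(pred: Prediction) -> bool:
    """Judge stage: correct if the predicted point lies inside the target box."""
    x, y = pred.point
    x0, y0, x1, y1 = pred.box
    return x0 <= x <= x1 and y0 <= y <= y1


def metric_by_platform(preds) -> dict:
    """Metric stage: aggregate per-sample correctness into accuracy per platform."""
    totals, hits = {}, {}
    for p in preds:
        totals[p.platform] = totals.get(p.platform, 0) + 1
        hits[p.platform] = hits.get(p.platform, 0) + int(judge_point_in_box(p))
    return {platform: hits[platform] / totals[platform] for platform in totals}


stored = [
    Prediction("a", "mobile", (10, 10), (0, 0, 20, 20)),
    Prediction("b", "mobile", (30, 10), (0, 0, 20, 20)),
    Prediction("c", "desktop", (5, 5), (0, 0, 20, 20)),
]
accuracy = metric_by_platform(stored)
```

Keeping the judge a pure function of the stored record is what allows re-judging under a different rule without touching the model.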

ClawGUI-Eval covers six benchmarks: ScreenSpot-Pro, ScreenSpot-V2, UI-Vision, MMBench-GUI, OSWorld-G, and AndroidControl. It supports 11+ model families, including Qwen3-VL, Qwen2.5-VL, UI-TARS, MAI-UI, GUI-G$^2$, UI-Venus, GUI-Owl, StepGUI, Gemini, and Seed 1.8. For closed frontier systems, the framework uses a Zoom paradigm, a two-stage crop-then-ground strategy, with 25% crop tiles for Gemini and 50% crop tiles for Seed.
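The core bookkeeping in any crop-then-ground strategy is mapping a point predicted inside the crop back to full-image coordinates. The following sketch assumes the crop fraction applies to each image side and that the crop is centered on a first-stage coarse guess, then clamped to the image bounds. The paper's exact tiling scheme is not described here, so these choices and the function names are assumptions.

```python
def crop_window(width: int, height: int, center: tuple, fraction: float) -> tuple:
    """Return (left, top, crop_width, crop_height) clamped inside the image."""
    crop_w, crop_h = int(width * fraction), int(height * fraction)
    cx, cy = center
    left = min(max(cx - crop_w // 2, 0), width - crop_w)
    top = min(max(cy - crop_h // 2, 0), height - crop_h)
    return left, top, crop_w, crop_h


def to_full_image(point_in_crop: tuple, window: tuple) -> tuple:
    """Translate a second-stage prediction back into full-image pixels."""
    left, top, _, _ = window
    return left + point_in_crop[0], top + point_in_crop[1]


# Coarse guess near the top-right corner of a 1000x800 screenshot.
window = crop_window(1000, 800, center=(950, 100), fraction=0.25)
final_point = to_full_image((10, 20), window)
```

The judge then evaluates `final_point` against the ground-truth box in the original resolution, so the benchmark protocol itself is unchanged by the zoom step.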

The headline reproducibility claim is a 95.8% reproduction rate, operationalized as 46 of the 48 cells for which official baselines exist. A reproduced result counts as successful if it is greater than or equal to the official value, or if the absolute difference is at most 2 percentage points. The paper reports 95.7% reproduction for open-source models and 100% for frontier models on ScreenSpot-Pro. The two failed reproductions are Qwen3-VL-2B on ScreenSpot-Pro and UI-TARS 1.5-7B on ScreenSpot-Pro, which the paper attributes to undisclosed official prompt or resolution settings. This is an important limitation: standardization inside ClawGUI-Eval cannot fully compensate for missing disclosure in external baselines.
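The success criterion is simple enough to state as code. The sketch below encodes the rule as described, with scores expressed in percent and the tolerance of 2 as a parameter. The function names are illustrative.

```python
def is_reproduced(reproduced: float, official: float, tolerance: float = 2.0) -> bool:
    """A cell passes if it meets the official score or falls within tolerance."""
    return reproduced >= official or abs(reproduced - official) <= tolerance


def reproduction_rate(cells) -> float:
    """cells: iterable of (reproduced, official) pairs; returns percent passing."""
    cells = list(cells)
    passed = sum(is_reproduced(r, o) for r, o in cells)
    return round(100 * passed / len(cells), 1)
```

With 46 passing cells out of 48, `reproduction_rate` returns 95.8, matching the headline figure. Note that the rule is asymmetric: a reproduction that exceeds the official number by any margin counts as a success.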

5. ClawGUI-Agent and user-facing deployment

ClawGUI-Agent is the deployment subsystem that operationalizes GUI agents on Android, HarmonyOS, and iOS through 12+ chat platforms (Tang et al., 13 Apr 2026). The platforms named explicitly are Feishu, DingTalk, Telegram, Discord, Slack, and QQ. The architecture is message-driven: users issue natural-language tasks through chat, and a server-side agent loop uses persistent memory and skills to plan and execute actions across real or virtual devices.

A defining feature is hybrid CLI-GUI control. The paper argues that CLI is fast and precise where supported, while GUI control is universal and interpretable when programmatic interfaces are absent. ClawGUI-Agent therefore routes between these two substrates rather than treating them as mutually exclusive. The paper does not specify a formal routing policy, but the architectural claim is that practical agents need both modalities.

Deployment supports both remote and local control modes. In remote mode, a user issues tasks from a separate device through a chat platform to control a target phone. In local mode, the user interacts with a chat application on the target phone itself, so no separate controlling device is required. This distinction matters because it shifts ClawGUI-Agent from being merely a remote-control stack to being a local assistant substrate.

Persistent personalized memory is another central component. During execution, the system extracts structured facts such as contact names and relationships, frequently used applications, and user habits and preferences. These are stored as vector embeddings in persistent storage. At inference time, the system retrieves the top-$k$ semantically similar memories, injects them into system context, and merges duplicates to avoid redundant accumulation. The paper does not specify the value of $k$, the embedding model, or the storage backend, but it clearly positions memory as a first-class operational resource rather than a loose conversational history.
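Since the paper leaves $k$, the embedding model, and the storage backend unspecified, the following is only a sketch of the retrieval-and-merge pattern. It uses cosine similarity over plain Python lists held in memory. The `MemoryStore` class, the `merge_threshold` value, and the policy of letting a newer fact replace a near-duplicate are all assumptions.

```python
import math


def cosine(a, b) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class MemoryStore:
    """In-memory stand-in for a persistent vector store."""

    def __init__(self, merge_threshold: float = 0.95):
        self.items = []  # list of (embedding, fact) pairs
        self.merge_threshold = merge_threshold

    def add(self, embedding, fact: str) -> None:
        # Merge near-duplicates instead of accumulating redundant entries.
        for i, (existing, _) in enumerate(self.items):
            if cosine(existing, embedding) >= self.merge_threshold:
                self.items[i] = (embedding, fact)
                return
        self.items.append((embedding, fact))

    def top_k(self, query, k: int = 3) -> list:
        """Return the k facts whose embeddings are most similar to the query."""
        ranked = sorted(self.items, key=lambda item: cosine(item[0], query),
                        reverse=True)
        return [fact for _, fact in ranked[:k]]


store = MemoryStore()
store.add([1.0, 0.0], "prefers app A for messaging")
store.add([0.0, 1.0], "contact: Sam (colleague)")
store.add([1.0, 0.01], "prefers app A for messaging and calls")  # merged
retrieved = store.top_k([1.0, 0.0], k=1)
```

The retrieved facts would then be injected into the system context before the agent plans its next action.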

ClawGUI-Agent also exposes ClawGUI-Eval as a deployable tool skill. A user command such as “benchmark Qwen3-VL on ScreenSpot-Pro” can trigger environment verification, multi-GPU inference, judging, metric computation, and a structured report against official baselines. This feature is significant because it turns evaluation into an interactively callable capability inside the same deployment layer used for ordinary agent tasks.

6. Empirical results, limitations, and position within adjacent research

The paper validates the stack through ClawGUI-2B, a model trained from MAI-UI-2B inside ClawGUI-RL and evaluated on the MobileWorld GUI-Only split, which contains 117 tasks with a maximum interaction horizon of 50 steps (Tang et al., 13 Apr 2026). ClawGUI-2B reaches 17.1% Success Rate, compared with 11.1% for MAI-UI-2B. The paper also reports that ClawGUI-2B surpasses several larger end-to-end models, including Qwen3-VL-32B at 11.9% and UI-Venus-72B at 16.4%, while remaining below proprietary agentic frameworks such as Gemini-3-Pro + UI-Ins-7B at 55.6% and GPT-5 + UI-Ins-7B at 54.0%.

An ablation compares GRPO with binary episode-level reward against GiGPO with episode-level plus dense step-level reward. The reported Success Rates are 14.5% for GRPO and 17.1% for GiGPO, a gain of 2.6 points, or a 17.9% relative improvement. The paper attributes this to better credit assignment and denser supervision rather than to a new base architecture, since MAI-UI-2B and ClawGUI-2B share the same base weights.
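The difference between an absolute gain in points and a relative improvement is easy to blur, so a short check of the arithmetic may help. The helper name `gains` is illustrative. The second call applies the same calculation to the 17.1% versus 11.1% comparison against MAI-UI-2B, and its relative figure is derived here rather than quoted from the paper.

```python
def gains(new: float, old: float) -> tuple:
    """Return (absolute gain in points, relative improvement in percent)."""
    absolute = round(new - old, 1)
    relative = round(100 * (new - old) / old, 1)
    return absolute, relative


gigpo_vs_grpo = gains(17.1, 14.5)    # (2.6, 17.9)
clawgui_vs_base = gains(17.1, 11.1)  # (6.0, 54.1)
```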

The framework also carries explicit limitations. Real-device RL requires manually authored tasks and relies on MLLM-as-judge because root-level verification is unavailable. Evaluation reproducibility remains imperfect when official settings are hidden. Deployment details are high-level: the paper does not provide low-level discussion of communication protocols, latency, safety policies, or human override mechanisms beyond the high-level hybrid CLI-GUI design. Future directions named in the paper include a unified GUI-CLI harness, scaling online RL beyond emulator-heavy settings, always-present on-device agents with stronger local-first privacy, and GUI world models for predictive planning.

Within adjacent research, ClawGUI occupies the infrastructure layer of a broader ecosystem. X-OmniClaw describes an edge-native Android agent architecture with hybrid UI grounding through XML metadata, OCR, and visual perception, whereas ClawGUI emphasizes the lifecycle infrastructure for training, evaluation, and deployment rather than a single mobile execution stack (Ren et al., 7 May 2026). Claw-Anything provides complementary benchmark evidence that cross-device GUI+CLI tasks are materially harder than CLI-only tasks, and reports that removing GUI access drops Pass@1 for cross-device tasks from 16.0 to 2.0 (Lin et al., 25 May 2026). Security and privacy concerns remain orthogonal but consequential for deployment: one OpenClaw study reports an average defense rate of only 17% for sandbox boundary attacks (Shan et al., 11 Mar 2026), and MaskClaw proposes edge-side Allow/Mask/Ask arbitration before raw screenshots leave a trusted environment (Zhao et al., 27 May 2026). These neighboring lines of work suggest that ClawGUI’s infrastructure agenda is necessary but not sufficient: scalable GUI-agent systems also require explicit security, privacy, and cross-device robustness layers.