ClawGUI-Eval Benchmarking Harness

Updated 9 May 2026

ClawGUI-Eval is a standardized, open evaluation harness that benchmarks GUI agents by decoupling inference, judging, and metric computation.
It eliminates protocol drift and sampling variance by pinning configurations and publishing all raw outputs for transparent evaluation.
The modular pipeline supports diverse benchmarks and both open- and closed-source models, enabling rigorous, cross-paper comparison and error analysis.

ClawGUI-Eval is a standardized, fully open evaluation harness for benchmarking the capabilities of graphical user interface (GUI) agents. It is the evaluation component of the ClawGUI full-stack system and targets key obstacles in GUI agent research by eliminating protocol drift across papers, pinning all configurations, and publishing all raw outputs. ClawGUI-Eval decouples inference, judging, and metric computation, ensuring reproducibility and comparability of results across multiple benchmarks and agent models (Tang et al., 13 Apr 2026).

1. Core Purpose and Motivation

ClawGUI-Eval is designed as the enforcement layer for rigorous, replicable evaluation of GUI agents. It supports the entire lifecycle: from running agent models on benchmark datasets, through granular output labeling, to aggregate reporting and error analysis. The module addresses three primary obstacles faced by the field:

Protocol Drift: Inconsistent prompt formats, image resolutions, and coordinate systems have historically rendered cross-paper results incomparable.
Sampling and Post-Processing Variance: Undocumented sampling strategies and result normalization affect accuracy by non-trivial margins.
Opaque Output: Absence of shared raw agent outputs prevents external validation or in-depth error analysis.

To mitigate these, ClawGUI-Eval pins all relevant parameters (prompt templates, image resolutions, temperature, etc.) and decouples the inference, judging, and metric computation stages, archiving all raw and judged data (Tang et al., 13 Apr 2026).

2. Pipeline Architecture and Workflow

The ClawGUI-Eval pipeline is structured into three strict phases: Infer, Judge, and Metric.

Inference Stage: Takes benchmark test sets (images, instructions, UI element metadata) as input and performs model inference—locally on GPU or via remote API—chunked by data shards. Each shard yields raw JSON prediction files.
Judging Stage: Parses prediction outputs and compares them to ground truth using benchmark-specific logic (e.g., point-in-box or action-sequence matching), emitting binary correctness labels.
Metric Calculation: Aggregates label data into top-line metrics (accuracy, success rate) and supports breakdowns by UI element type, action type, or platform.

Pseudocode reflecting the three-stage flow:

for benchmark in benchmarks:
    for model in models:
        predictions = Infer.run(...)
        save_json(predictions, ...)
        labels = Judge.run(...)
        save_json(labels, ...)
        metrics = Metric.compute(...)
        save_json(metrics, ...)

Shard-level checkpointing ensures that evaluation can be resumed after interruption, preventing recomputation of completed shards. The decoupled design allows, for example, the same raw predictions to be re-judged under modified rules or new error taxonomies (Tang et al., 13 Apr 2026).
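
The resume behaviour described above can be illustrated with a minimal sketch. It assumes one JSON file per shard and treats an existing file as a completed shard; the function name `run_sharded`, the file naming scheme, and the `infer_fn` callback are hypothetical and are not taken from the ClawGUI-Eval codebase.

```python
import json
from pathlib import Path
from typing import Callable, Iterable


def run_sharded(
    shards: Iterable[list],
    infer_fn: Callable[[list], list],
    out_dir: Path,
) -> list[Path]:
    """Run infer_fn on each shard, skipping shards whose output file exists.

    Hypothetical helper: the shard layout and file names are assumptions.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    written = []
    for idx, shard in enumerate(shards):
        target = out_dir / f"shard_{idx:05d}.json"
        if not target.exists():
            # Write under a temporary name and rename afterwards, so an
            # interrupted run never leaves a partial file that looks complete.
            tmp = target.parent / (target.name + ".tmp")
            tmp.write_text(json.dumps(infer_fn(shard)))
            tmp.replace(target)
        written.append(target)
    return written
```

Calling `run_sharded` a second time with the same output directory invokes `infer_fn` only for shards that have no output file yet, which is the property that makes an interrupted evaluation resumable.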

3. Supported Benchmarks and Agent Models

ClawGUI-Eval is benchmark-agnostic and supports plug-and-play evaluation on six established benchmarks:

Benchmark	Domain	Judge Type
ScreenSpot-Pro	High-res desktop grounding	Point-in-box
ScreenSpot-V2	Desktop grounding (updated)	Point-in-box
UI-Vision	Element localization (desktop)	Point-in-box
MMBench-GUI	Multi-platform GUI navigation	Point-in-box
OSWorld-G	Grounding, refusal detection	Polygon/tagged
AndroidControl	Mobile action control	Action-sequence

The evaluation harness accommodates open- and closed-source models:

Open-source: GUI-G², GUI-Owl (1.5B–8B), Qwen3-VL (2B–8B), Qwen2.5-VL, UI-TARS, UI-Venus, MAI-UI, StepGUI.
Closed-source: Gemini-3-Pro, Seed-1.8 (evaluated via a Zoom crop-then-ground workaround) (Tang et al., 13 Apr 2026).

Each benchmark prescribes a fixed test set, judge type, and recommended image resolution and coordinate normalization, thus eliminating confounders in comparative evaluation.
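
To show why a pinned coordinate convention matters for a point-in-box judge, the following sketch maps a model's predicted point onto pixel coordinates before testing it against a ground-truth box. The `scale` parameter (for example 1000 for a model that emits coordinates on a 0 to 1000 grid), the inclusive box boundaries, and both function names are assumptions made for illustration; the source does not specify these details.

```python
from typing import Tuple

Point = Tuple[float, float]
Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max) in pixels


def to_pixels(point: Point, image_size: Tuple[int, int], scale: float = 1.0) -> Point:
    """Map a point given on a [0, scale] grid to pixel coordinates."""
    x, y = point
    width, height = image_size
    return (x / scale * width, y / scale * height)


def point_in_box(point: Point, box: Box) -> bool:
    """Return True if the point lies inside the box (boundaries included)."""
    x, y = point
    x_min, y_min, x_max, y_max = box
    return x_min <= x <= x_max and y_min <= y <= y_max


# A prediction of (500, 250) on a 0-1000 grid for a 1920x1080 screenshot
# corresponds to the pixel (960.0, 270.0).
pixel_point = to_pixels((500, 250), (1920, 1080), scale=1000)
is_correct = point_in_box(pixel_point, (900, 200, 1000, 300))
```

If the same raw prediction were interpreted as pixels rather than as grid coordinates, it would fall outside this box, which is the kind of silent mismatch that fixing the normalization per benchmark is meant to rule out.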

4. Evaluation Metrics, Reproducibility, and Data Reporting

ClawGUI-Eval enforces rigorous metric conventions:

Per-sample correctness is determined by rule: for grounding tasks, a prediction is correct if the predicted point lies within the ground-truth box or polygon; for control tasks, only an exact match with the reference action sequence counts as correct.
Aggregate accuracy for grounding tasks:

$\mathrm{Acc} = \frac{1}{N}\sum_{i=1}^N \mathbf{1}\{\text{correct}_i\}$

Success Rate (SR) for multi-action tasks:

$\mathrm{SR} = \frac{N_\text{success}}{N_\text{total}} \times 100\%$

Reproduction Rate (RR), the proportion of cells with an official baseline for which ClawGUI-Eval's measured metric matches the baseline or falls within 2% of it:

$\mathrm{RR} = \frac{\#\,\text{cells reproduced}}{\#\,\text{cells with official baselines}} \times 100\%$

The reported overall value is $\mathrm{RR} \approx 95.8\%$.

Confidence intervals are estimated by resampling, enabling users to report mean accuracy with 95% CI.
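
As an illustration of the accuracy formula and of interval estimation by resampling, the sketch below computes accuracy from binary labels and a percentile bootstrap interval. The source says only that intervals are estimated by resampling; the choice of the percentile bootstrap, the number of resamples, and the function names are assumptions.

```python
import random
from typing import Sequence, Tuple


def accuracy(labels: Sequence[int]) -> float:
    """Mean of binary correctness labels (1 = correct, 0 = incorrect)."""
    return sum(labels) / len(labels)


def bootstrap_ci(
    labels: Sequence[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for accuracy (assumed method)."""
    rng = random.Random(seed)
    n = len(labels)
    # Resample the labels with replacement and record each resample's accuracy.
    stats = sorted(sum(rng.choices(labels, k=n)) / n for _ in range(n_resamples))
    lower = stats[int((alpha / 2) * n_resamples)]
    upper = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


labels = [1] * 80 + [0] * 20
acc = accuracy(labels)          # 0.8
low, high = bootstrap_ci(labels)
```

Fixing the seed keeps the reported interval itself reproducible, in line with the harness's emphasis on pinned settings.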

All outputs—predictions, judgment labels, and aggregate metrics—are persistently stored in structured JSON. This enables third-party re-judging, extension to new error typologies, and granular reporting (Tang et al., 13 Apr 2026).
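
Because raw predictions are archived separately from labels, a third party can apply a different judging rule without rerunning inference. The sketch below assumes a hypothetical record schema with `id`, `prediction`, and `ground_truth` fields; the actual JSON layout used by ClawGUI-Eval is not given in the source.

```python
import json
from typing import Callable


def exact_sequence_match(predicted: list, reference: list) -> bool:
    """Strict rule for control tasks: the whole action sequence must match."""
    return predicted == reference


def rejudge(predictions_json: str, judge: Callable[[list, list], bool]) -> list[dict]:
    """Apply a judge to archived predictions and emit binary labels.

    The field names used here are assumptions made for illustration.
    """
    records = json.loads(predictions_json)
    return [
        {"id": r["id"], "correct": int(judge(r["prediction"], r["ground_truth"]))}
        for r in records
    ]


archived = json.dumps([
    {"id": "a", "prediction": ["tap", "type"], "ground_truth": ["tap", "type"]},
    {"id": "b", "prediction": ["tap"], "ground_truth": ["tap", "type"]},
])
strict_labels = rejudge(archived, exact_sequence_match)
```

Swapping `exact_sequence_match` for another callable, such as a rule that gives credit for a correct prefix, yields a new label file from the same archived predictions.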

5. Usage, Integration, and Automation

ClawGUI-Eval exposes both a command-line and a programmatic interface. A typical CLI invocation proceeds in three steps:

clawgui-eval infer --benchmark ss-pro --model qwen3-vl-8b [options...]
clawgui-eval judge --benchmark ss-pro --raw ... --out ...
clawgui-eval metric --labels ... --out ...

The Python API allows fine-grained configuration:

from clawgui.eval import Evaluator

cfg = {
    'benchmark': 'ss-pro',
    'model': 'qwen3-vl-8b',
    'infer': {...},
    'judge': {'type':'point_in_box'},
    'metric': {'breakdowns':['element_type', 'layout_area']}
}

e = Evaluator(cfg)
results = e.run_all()
print(results['accuracy'])

All configuration parameters (benchmark, model, backend, devices, templating, evaluation rules) are explicit, ensuring full reproducibility of results. The system can be integrated as an automated regression test after each RL training run and as a backend skill for deployment or chat-based agent evaluation (Tang et al., 13 Apr 2026).
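
One way to make use of fully explicit configurations is to attach a fingerprint of the configuration to every result file, so that two runs can be checked for identical settings. The following is a sketch of that idea only; `config_fingerprint` is a hypothetical helper, and the source does not state that ClawGUI-Eval computes such a hash.

```python
import hashlib
import json


def config_fingerprint(cfg: dict) -> str:
    """Return a SHA-256 hex digest of a canonical JSON form of the config.

    Sorting keys makes the digest independent of dictionary ordering.
    Hypothetical helper, not part of the documented ClawGUI-Eval API.
    """
    canonical = json.dumps(cfg, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


run_a = {"benchmark": "ss-pro", "infer": {"temperature": 0.0, "max_tokens": 64}}
run_b = {"infer": {"max_tokens": 64, "temperature": 0.0}, "benchmark": "ss-pro"}
same_settings = config_fingerprint(run_a) == config_fingerprint(run_b)  # True
```

The `temperature` and `max_tokens` keys are placeholder settings chosen for the example. Any change to a pinned value, such as a different temperature, produces a different digest and so flags the run as not directly comparable.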

6. Key Empirical Findings and Impact

ClawGUI-Eval has demonstrated high reproduction fidelity, with an overall 95.8% reproduction rate against published results on open and frontier models (e.g., SS-Pro: 100% for frontier models, 44/46 for open-source). Discrepancy cases are attributed to documentation gaps, not implementation error.
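
The reproduction-rate computation can be sketched as follows. The sketch assumes that a cell is identified by a (model, benchmark) pair, that the 2% tolerance is measured in absolute percentage points, and that a cell with a baseline but no measurement counts as not reproduced; none of these conventions is stated explicitly in the source, and the numbers in the example are made up.

```python
from typing import Dict, Hashable, Optional


def reproduction_rate(
    measured: Dict[Hashable, float],
    official: Dict[Hashable, Optional[float]],
    tolerance: float = 2.0,
) -> float:
    """Percentage of cells with an official baseline that are reproduced."""
    baseline_cells = [k for k, v in official.items() if v is not None]
    if not baseline_cells:
        raise ValueError("no cells with official baselines")
    reproduced = sum(
        1
        for k in baseline_cells
        if k in measured and abs(measured[k] - official[k]) <= tolerance
    )
    return 100.0 * reproduced / len(baseline_cells)


# Illustrative values only; they are not results from the paper.
official = {("m1", "b1"): 50.0, ("m1", "b2"): 60.0, ("m2", "b1"): None, ("m2", "b2"): 70.0}
measured = {("m1", "b1"): 51.5, ("m1", "b2"): 63.0, ("m2", "b1"): 40.0, ("m2", "b2"): 70.0}
rr = reproduction_rate(measured, official)  # 2 of 3 baseline cells reproduced
```

Cells without an official baseline, such as `("m2", "b1")` here, are excluded from the denominator, matching the definition of RR over cells with official baselines.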

Decoupling inference, judgment, and metric stages exposes a small set of failure cases arising typically from model prompt or post-processing ambiguities. The system’s architecture has enabled:

Consistent, cross-paper comparison of agent capabilities.
Efficient extension to new models, datasets, or error taxonomies by reusing the same pinned configurations.
Transparent publication and granular reanalysis of both per-sample and aggregate outcomes.

By standardizing evaluation, ClawGUI-Eval is cited as raising the bar for transparency, reproducibility, and extensibility in GUI agent benchmarking (Tang et al., 13 Apr 2026).

7. Relationship to Environment Generation and Broader Context

ClawGUI-Eval, as the primary evaluation module of ClawGUI, pairs with environment generation pipelines such as ClawEnvKit—an automated, LLM-driven system for generating and validating diverse, parameterized environments for claw-like agents (Li et al., 20 Apr 2026). While ClawEnvKit produces the dynamic specification $E = (P, M, C)$ —where $P$ is a natural language task prompt, $M$ the tool interface, and $C$ a scored evaluation functional—ClawGUI-Eval provides the instantiation-agnostic test harness required to compare agent performance under tightly controlled, reproducible conditions.

A plausible implication is that harmonized pipelines such as ClawGUI-Eval and ClawEnvKit will facilitate rapid experimentation, analysis, and deployment of GUI agents at scale, while preserving rigor and comparability across models and research groups. This convergence is central to ongoing efforts to benchmark complex, language-driven agent systems on real-world GUI tasks.

References

ClawGUI: A Unified Framework for Training, Evaluating, and Deploying GUI Agents (2026)

ClawEnvKit: Automatic Environment Generation for Claw-Like Agents (2026)
