ClawGym-Bench: Evaluating Claw Agents

Updated 4 July 2026

ClawGym-Bench is a diagnostic benchmark consisting of 200 synthetic tasks designed to evaluate Claw agents on multi-step tool use, file manipulation, and stateful workflow execution.
It uses hybrid verification, combining deterministic code-based checks with rubric-based assessments, to evaluate workspace state transitions robustly.
The benchmark is built from persona-driven and skill-grounded task synthesis, with automated filtering and human review to calibrate task difficulty and maintain relevance.

ClawGym-Bench most precisely denotes the diagnostic evaluation component of the ClawGym framework, a system for developing and assessing Claw-style agents that operate over local files, tools, and persistent workspace state (Bai et al., 29 Apr 2026). In that canonical usage, it is a 200-instance benchmark calibrated through automated filtering and human-LLM review, designed to measure whether an agent can carry out multi-step tool use, precise file manipulation, and stateful workflow execution under task-specific verification. At the same time, the label is not stable across the 2026 Claw literature: adjacent papers use “ClawGym-Bench” as an internal mapping to SafeClawArena, as a prospective Gym-style wrapper for repository-repair evaluation, or as a non-canonical shorthand for broader always-on assistant benchmarks. This suggests that the term functions both as a specific benchmark title and as a looser label for gym-like evaluation in the Claw ecosystem (Niu et al., 29 Jun 2026).

1. Terminological status and scope

Within the literature, “ClawGym-Bench” has one canonical use and several non-canonical or prospective ones. The canonical usage is the one introduced in "ClawGym: A Scalable Framework for Building Effective Claw Agents" (Bai et al., 29 Apr 2026). Other papers either map the term to a different artifact or explicitly state that it is not their canonical name.

Usage in literature	Status	Description
ClawGym-Bench	Canonical	200-instance diagnostic benchmark within ClawGym
SafeClawArena	Internal mapping	406-task security benchmark; “ClawGym-Bench” maps to it in that work
Claw-SWE-Bench	Prospective	Foundation for a Gym-style benchmark over dockerized repository repair
Claw-Anything	Non-canonical	Authors state “ClawGym-Bench” is not the canonical name

The ambiguity is explicit in the source texts. "Understanding and Evaluating Claw-like Agent Security Through a Computer-Systems Lens" states that “ClawGym-Bench refers to SafeClawArena” and further clarifies that the paper itself consistently uses the name SafeClawArena (Niu et al., 29 Jun 2026). "Claw-SWE-Bench" presents a controlled SWE-bench-style protocol for OpenClaw-style harnesses and states that a “ClawGym-Bench” can be built by treating the dockerized repository workspace as an RL/Gym environment state (Zheng et al., 10 Jun 2026). "Claw-Anything" states that “ClawGym-Bench” is not the canonical name used in the paper and that the canonical artifact is Claw-Anything (Lin et al., 25 May 2026). For encyclopedic purposes, the unqualified term is therefore best anchored to the ClawGym paper, while the surrounding ambiguity remains important for disambiguation.

2. Canonical formulation inside ClawGym

In the canonical formulation, ClawGym-Bench is one component of a larger three-part framework: ClawGym-SynData, ClawGym-Agents, and ClawGym-Bench (Bai et al., 29 Apr 2026). ClawGym-SynData contains 13.5K verified, executable tasks synthesized via persona-driven intents and skill-grounded operations. ClawGym-Agents are models trained from black-box rollouts on SynData through supervised fine-tuning, with a lightweight RL pipeline also explored. ClawGym-Bench is the evaluation layer intended to provide trustworthy diagnostic assessment.

The benchmark is defined for “Claw-style environments,” described as harness-based computer-use settings such as OpenClaw in which an agent receives a user instruction $p$ and operates within an initialized local workspace $s_0$ using a set of tools or actions $A$, producing a final state $s_H$. The formal task instance is

$\tau = \langle p, s_0, A, F, V_\tau \rangle,$

with state transition

$s_t = F(s_{t-1}, a_t),$

and a task-specific verifier $V_\tau$ producing a score $v \in [0,1]$.
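The formalism can be made concrete with a small sketch. The following is a minimal illustration, not the paper's implementation: the workspace is assumed to be a mapping from file paths to contents, an action is assumed to be a single file write, and the names `Task`, `write_file`, and `rollout` are hypothetical. The action set $A$ is represented only implicitly by the `Action` type.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Illustrative types (assumptions): a workspace state maps relative file
# paths to contents; an action is a (path, content) file write.
State = Dict[str, str]
Action = Tuple[str, str]


@dataclass(frozen=True)
class Task:
    prompt: str                                   # user instruction p
    initial_state: State                          # initial workspace s_0
    transition: Callable[[State, Action], State]  # transition function F
    verifier: Callable[[State], float]            # V_tau, returns v in [0, 1]


def write_file(state: State, action: Action) -> State:
    """Toy transition F: return a new state with one file written."""
    path, content = action
    new_state = dict(state)
    new_state[path] = content
    return new_state


def rollout(task: Task, actions: List[Action]) -> float:
    """Apply s_t = F(s_{t-1}, a_t) for each action, then score s_H."""
    state = dict(task.initial_state)
    for action in actions:
        state = task.transition(state, action)
    return task.verifier(state)


task = Task(
    prompt="Create notes/summary.txt containing the word 'done'.",
    initial_state={"notes/todo.txt": "write summary"},
    transition=write_file,
    verifier=lambda s: 1.0 if "done" in s.get("notes/summary.txt", "") else 0.0,
)
score = rollout(task, [("notes/summary.txt", "done")])
print(score)
```

The verifier inspects only the final state, which mirrors the workspace-grounded framing: an empty action sequence leaves the workspace unchanged and scores 0.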

The benchmark objectives are correspondingly workspace-grounded rather than dialogue-grounded. ClawGym-Bench measures whether agents can perform multi-step tool use reliably, manipulate local files and directories with precision, manage and reason about persistent workspace state across runs, and satisfy robust verification through deterministic checks plus rubric-based qualitative criteria when needed (Bai et al., 29 Apr 2026). This design places final workspace state, rather than a single textual answer, at the center of evaluation.

3. Instance construction, coverage, and calibration

ClawGym-Bench contains 200 benchmark instances across six categories: Productivity and Collaboration (44), Systems and Automation (42), Analysis and Reasoning (35), Content and Domain Support (28), Planning and Knowledge (26), and Software Development (25) (Bai et al., 29 Apr 2026). The instances are drawn from ClawGym-SynData, not authored independently, which ties benchmark construction to the same synthesis pipeline used for training data generation.

That synthesis pipeline combines two sources of structure. The first is persona-driven, top-down intent generation, organized around user personas, scenario categories, and atomic operations. The second is skill-grounded, bottom-up operation synthesis, using 16K synthesizable skills curated from approximately 30K OpenClaw skills. The resulting candidate tasks are paired with realistic mock workspaces and hybrid verifiers. ClawGym-Bench is then selected from this larger pool through difficulty-aware filtering and review rather than by direct random sampling.

The difficulty calibration procedure uses rollout-based filtering with a strong agent and a smaller agent, each run with 4 rollouts per task. A task is retained only if it satisfies

$\bar{s}_{\text{strong}}(\tau) \ge 0.2,\qquad \bar{s}_{\text{small}}(\tau) \le 0.6,\qquad \bar{s}_{\text{strong}}(\tau) > \bar{s}_{\text{small}}(\tau).$

The stated purpose is to remove tasks that are trivial, unstable, or non-discriminative. After automated filtering, GPT-5.4 performs structured diagnostics on instruction, resources, and verifiers, and human reviewers accept, modify, or reject tasks to ensure clarity, feasibility, verifier alignment, and rubric complementarity.
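The retention rule can be sketched as a short predicate. This is an illustration under stated assumptions: the function name `keep_task` is hypothetical, the per-agent score is taken to be the arithmetic mean over rollouts, and the example scores are made up.

```python
from statistics import mean
from typing import Sequence


def keep_task(strong_scores: Sequence[float], small_scores: Sequence[float]) -> bool:
    """Return True if a task passes the rollout-based difficulty filter."""
    s_strong = mean(strong_scores)  # mean score of the strong agent
    s_small = mean(small_scores)    # mean score of the smaller agent
    return s_strong >= 0.2 and s_small <= 0.6 and s_strong > s_small


# Four rollouts per agent, as in the text; the scores are invented examples.
discriminative = keep_task([0.8, 0.7, 0.9, 0.6], [0.3, 0.2, 0.4, 0.1])
too_hard = keep_task([0.1, 0.0, 0.2, 0.1], [0.0, 0.0, 0.1, 0.0])
too_easy = keep_task([0.9, 1.0, 0.9, 1.0], [0.8, 0.9, 0.7, 0.8])
print(discriminative, too_hard, too_easy)
```

The first condition drops tasks that even the strong agent cannot make progress on, the second drops tasks the small agent already handles well, and the third drops tasks that do not separate the two agents.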

Each task includes a lightweight, task-specific mock workspace. Files may include CSV, JSON, YAML, Markdown, text requirements, or scripts placed at specified paths to instantiate $s_0$. These artifacts encode schemas, constraints, and values so that deterministic checking is possible. Representative benchmark instructions include: “Merge all ‘sales’ CSVs, add source_file column, save as output/sales_report.csv.”; “Compute reorder plan from inventory.csv and bulk_discounts.csv, then emit per-supplier orders as output/orders/*.json.”; and “Process support ticket batches into rewrites, metrics, notifications, and an idempotent state file” (Bai et al., 29 Apr 2026).
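To illustrate how a mock workspace makes such an instruction checkable, the sketch below instantiates a small workspace on disk and carries out the first representative instruction. The file names under `data/`, their contents, and the helper names are assumptions made for illustration; only the output path and the `source_file` column come from the instruction itself.

```python
import csv
import tempfile
from pathlib import Path

# Hypothetical mock workspace: relative path -> file content.
MOCK_WORKSPACE = {
    "data/sales_q1.csv": "region,amount\nnorth,10\nsouth,20\n",
    "data/sales_q2.csv": "region,amount\nnorth,15\n",
    "data/inventory.csv": "sku,qty\nA1,3\n",  # distractor, not a sales file
}


def init_workspace(root: Path) -> None:
    """Write the mock files to disk to instantiate the initial state."""
    for rel_path, content in MOCK_WORKSPACE.items():
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(content, encoding="utf-8")


def merge_sales(root: Path) -> Path:
    """Merge every data/sales*.csv, adding a source_file column."""
    rows = []
    for path in sorted((root / "data").glob("sales*.csv")):
        with path.open(newline="", encoding="utf-8") as handle:
            for row in csv.DictReader(handle):
                row["source_file"] = path.name
                rows.append(row)
    out_path = root / "output" / "sales_report.csv"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    with out_path.open("w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=["region", "amount", "source_file"])
        writer.writeheader()
        writer.writerows(rows)
    return out_path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    init_workspace(root)
    report = merge_sales(root)
    with report.open(newline="", encoding="utf-8") as handle:
        merged = list(csv.DictReader(handle))

print(len(merged), sorted({row["source_file"] for row in merged}))
```

Because the initial files have known rows and values, properties of the final state such as the row count and the set of source files can be checked deterministically.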

4. Verification, scoring, and execution protocol

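A defining feature of ClawGym-Bench is hybrid verification. Code-based verification decomposes requirements into atomic checks $c_1, \dots, c_n$, with each check returning

$c_i(s_H) \in \{0, 1\}$

and the code score

$v_{\text{code}} = \frac{1}{n} \sum_{i=1}^{n} c_i(s_H).$

Rubric-based verification uses qualitative rules $r_1, \dots, r_m$ with anchors that place each judgment $r_j(s_H)$ in $[0, 1]$ and score

$v_{\text{rubric}} = \sum_{j=1}^{m} w_j \, r_j(s_H),$

with equal weights $w_j = 1/m$ by default. If only code checks exist, then $v = v_{\text{code}}$; if both exist, then $v$ combines $v_{\text{code}}$ and $v_{\text{rubric}}$ into a single score in $[0, 1]$.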

Of the 200 tasks, 156 rely solely on code-based verification and 44 use hybrid verification (Bai et al., 29 Apr 2026).
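The scoring logic can be sketched as follows. This is a hedged illustration rather than the paper's code: it assumes the code score is the fraction of atomic checks passed, the rubric score is an equal-weight average of per-rule judge scores in $[0,1]$, and the two are mixed by a hypothetical `code_weight` parameter whose default of 0.5 is not taken from the source.

```python
from typing import Optional, Sequence


def code_score(check_results: Sequence[bool]) -> float:
    """Fraction of atomic checks that passed (assumed aggregation)."""
    if not check_results:
        raise ValueError("at least one atomic check is required")
    return sum(check_results) / len(check_results)


def rubric_score(rule_scores: Sequence[float]) -> float:
    """Equal-weight average of per-rule judge scores, each in [0, 1]."""
    return sum(rule_scores) / len(rule_scores)


def task_score(
    check_results: Sequence[bool],
    rule_scores: Optional[Sequence[float]] = None,
    code_weight: float = 0.5,  # hypothetical mixing weight, not from the paper
) -> float:
    """Code-only tasks use the code score; hybrid tasks mix both scores."""
    v_code = code_score(check_results)
    if not rule_scores:
        return v_code
    return code_weight * v_code + (1.0 - code_weight) * rubric_score(rule_scores)


code_only = task_score([True, True, False, True])
hybrid = task_score([True, True, False, True], rule_scores=[0.5, 0.0])
print(code_only, hybrid)
```

Any convex combination keeps the final score in $[0,1]$, which is the only property of the mixing step this sketch relies on.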

The primary reported metric is mean task score:

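$\text{Score} = \frac{1}{|\mathcal{T}|} \sum_{\tau \in \mathcal{T}} v_\tau,$

where $\mathcal{T}$ is the set of benchmark tasks and $v_\tau$ is the verifier score on task $\tau$. The paper also gives optional formulations for strict success rate,

$\text{SR} = \frac{1}{|\mathcal{T}|} \sum_{\tau \in \mathcal{T}} \mathbb{1}[v_\tau = 1],$

and pass@k over multiple rollouts,

$\text{pass@}k = \frac{1}{|\mathcal{T}|} \sum_{\tau \in \mathcal{T}} \mathbb{1}\left[\max_{1 \le i \le k} v_\tau^{(i)} = 1\right],$

where $v_\tau^{(i)}$ is the score of the $i$-th rollout on task $\tau$.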

The main reported results, however, use continuous scores rather than strict binary success (Bai et al., 29 Apr 2026).
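The three metrics can be sketched in a few lines. The sketch assumes that strict success means reaching the full score of 1 and that pass@k counts a task as solved when any of its first k rollouts reaches that score; this is a simple empirical variant, not the unbiased pass@k estimator used in some code-generation work. Function names and example scores are illustrative.

```python
from typing import Dict, List


def mean_task_score(scores: Dict[str, float]) -> float:
    """Average continuous verifier score over tasks."""
    return sum(scores.values()) / len(scores)


def strict_success_rate(scores: Dict[str, float], threshold: float = 1.0) -> float:
    """Fraction of tasks whose score reaches the threshold (assumed 1.0)."""
    return sum(v >= threshold for v in scores.values()) / len(scores)


def pass_at_k(rollouts: Dict[str, List[float]], k: int, threshold: float = 1.0) -> float:
    """Fraction of tasks solved by at least one of the first k rollouts."""
    solved = sum(any(v >= threshold for v in runs[:k]) for runs in rollouts.values())
    return solved / len(rollouts)


# Invented example scores for four tasks.
scores = {"t1": 1.0, "t2": 0.5, "t3": 0.0, "t4": 1.0}
rollouts = {
    "t1": [0.5, 1.0],
    "t2": [0.5, 0.75],
    "t3": [1.0, 0.0],
    "t4": [0.0, 0.0],
}
print(mean_task_score(scores), strict_success_rate(scores))
print(pass_at_k(rollouts, k=1), pass_at_k(rollouts, k=2))
```

The example shows why continuous scoring is more informative for diagnosis: task `t2` contributes partial credit to the mean score but counts as a plain failure under strict success.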

Execution is performed through the OpenClaw interface. Each evaluation run initializes the task-specific workspace and verifier, resets the workspace to a clean state, and captures tool invocations and file-level changes through a black-box proxy. The paper fixes context length to 64K tokens for evaluated models and uses identical verifiers and rubric prompts for all runs. Resource and step or time limits are governed by the underlying OpenClaw runtime, but the paper does not publish fixed step limits or global timeouts. The benchmark inherits OpenClaw’s execution guards, sandboxing, and tool-level restrictions. On a balanced 50-task subset, 5 repeated runs yield low variance: Qwen3-8B achieves mean 36.4% with std 0.3%, and Qwen3-30B-A3B achieves mean 42.6% with std 1.0%, which the paper presents as evidence of stable scoring (Bai et al., 29 Apr 2026).

5. Baselines, training effects, and observed failure modes

ClawGym-Bench is used to evaluate proprietary frontier models, open-weight frontier models, compact open-weight models, and ClawGym-Agents trained on SynData (Bai et al., 29 Apr 2026). The strongest proprietary model reported is Claude-4.7-Opus at 77.81% average score. Open-weight frontier models range approximately from 63.7% to 71.1%, with GLM-5.1 at 71.12%. Compact backbones without ClawGym fine-tuning are substantially lower, exemplified by Qwen3-8B at 35.02%.

The training results emphasize the effect of ClawGym’s synthesized data and rollout-based training pipeline. ClawGym-8B reaches 50.24% versus Qwen3-8B at 35.02%, a reported +43.46% relative gain. ClawGym-30A3B reaches 56.82% versus Qwen3-30B-A3B at 45.11%, a reported +25.96% relative gain. ClawGym-4B reaches 47.73%, which the paper describes as surprisingly strong for its size. On an external benchmark, PinchBench, using a 30-task April 10, 2026 slice with unimodal tasks only, ClawGym-30A3B reaches 86.00%, ClawGym-8B 75.70%, and ClawGym-4B 76.40%.
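The relative gains quoted above follow from the usual formula, (new − base) / base, applied to the reported average scores. A short check, with a hypothetical helper name:

```python
def relative_gain(new: float, base: float) -> float:
    """Relative improvement of `new` over `base`, in percent."""
    return (new - base) / base * 100.0


gain_8b = round(relative_gain(50.24, 35.02), 2)   # ClawGym-8B vs Qwen3-8B
gain_30b = round(relative_gain(56.82, 45.11), 2)  # ClawGym-30A3B vs Qwen3-30B-A3B
print(gain_8b, gain_30b)  # 43.46 25.96
```

In absolute terms the corresponding improvements are 15.22 and 11.71 percentage points, so the smaller backbone gains more on both measures.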

Several ablations and training dynamics are also reported. Mixed synthesis using both persona-driven and skill-grounded generation outperforms either source alone on both ClawGym-Bench and PinchBench. Supervised fine-tuning performance peaks around epoch 3 on SynData, after which further training slightly degrades results, interpreted in the paper as mild overfitting to the synthetic distribution. Reward thresholding for trajectory selection is best at 0.5: lower thresholds harm quality, while higher thresholds reduce useful diversity. A lightweight RL signal based on sandbox-parallel GRPO is reported as positive for both a vanilla 4B model and an SFT 30B model, although exact gains are not tabulated (Bai et al., 29 Apr 2026).

The benchmark analysis also identifies recurring failure modes. Weaker agents may use tools but fail to construct robust workflows. In CI artifact auditing, for example, stronger agents enumerate files, inspect JSON shape, run aggregation scripts, and verify outputs, whereas weaker agents recover from wildcard issues but still miss summary fields, grouping, and filtering semantics. In long-horizon tasks, stronger agents treat tool errors as recoverable signals and confirm idempotency across reruns, whereas weaker agents accumulate unresolved errors and stall. Fine-grained instruction following is another failure mode: agents can produce plausible artifacts that violate critical constraints, such as including items with Quantity $>$ ReorderPoint in reorder-plan tasks, which deterministic checks then penalize (Bai et al., 29 Apr 2026).

6. Relation to adjacent benchmarks and name diffusion

ClawGym-Bench occupies a particular position among Claw-style benchmarks because it is workspace-centric, synthetic in origin, and verifier-heavy. The ClawGym paper contrasts it with SWE-bench(-Verified), which focuses on patching real OSS repositories with test suites; with AgentBench, BrowseComp, and BrowserGym, which emphasize browser-like interaction loops; and with OSWorld and Windows Agent Arena, which are GUI-centric at OS scale. ClawGym-Bench instead centers local workspaces, persistent state, and final-state correctness over transient dialogues (Bai et al., 29 Apr 2026).

The ambiguity of the name becomes clearer when adjacent benchmarks are considered. SafeClawArena, introduced as a systems-first security benchmark for Claw-like agents, comprises 406 adversarial tasks spanning 24 sub-categories across four attack surfaces: Skill Supply-Chain Integrity, Persistent State Exploitation, Cross-Boundary Data Flow, and Indirect Prompt Injection. It evaluates containerized replicas of OpenClaw, NemoClaw, and SeClaw with canary-marked credentials and automated taint tracking across nine output channels; the reported overall ASR ranges from 20% to 70%, and malicious Plugins succeed in 100% of cases on unhardened platforms (Niu et al., 29 Jun 2026). That paper explicitly states that “ClawGym-Bench” maps to SafeClawArena in its own terminology, even though it does not use ClawGym-Bench as the paper’s main name.

Claw-SWE-Bench is a separate benchmark for coding tasks. It contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, together with a shared adapter layer that makes heterogeneous OpenClaw-style harnesses comparable under a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The paper then proposes that a Gym-style “ClawGym-Bench” could be built on top of this setup by exposing the dockerized repository workspace as environment state and the SWE-bench evaluator as reward and termination machinery (Zheng et al., 10 Jun 2026). Here the term is not an existing artifact but a design direction.
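The proposed design direction can be illustrated with a toy environment. The sketch below is an assumption-laden stand-in, not an interface from either paper: an in-memory dictionary replaces the dockerized repository workspace, a caller-supplied callable replaces the SWE-bench evaluator, and the class name `RepoRepairEnv` and its `reset`/`step` signatures are hypothetical.

```python
from typing import Callable, Dict, Tuple

Workspace = Dict[str, str]  # relative path -> file content


class RepoRepairEnv:
    """Toy Gym-style environment over an in-memory repository workspace."""

    def __init__(
        self,
        initial_files: Workspace,
        evaluator: Callable[[Workspace], bool],  # stand-in for a test-suite evaluator
        max_steps: int = 10,
    ) -> None:
        self._initial = dict(initial_files)
        self._evaluator = evaluator
        self._max_steps = max_steps
        self._files: Workspace = {}
        self._steps = 0

    def reset(self) -> Workspace:
        """Restore the workspace to its initial state and return it."""
        self._files = dict(self._initial)
        self._steps = 0
        return dict(self._files)

    def step(self, action: Tuple[str, str]) -> Tuple[Workspace, float, bool]:
        """Apply one file edit; the evaluator supplies reward and termination."""
        path, content = action
        self._files[path] = content
        self._steps += 1
        resolved = self._evaluator(self._files)
        done = resolved or self._steps >= self._max_steps
        return dict(self._files), (1.0 if resolved else 0.0), done


env = RepoRepairEnv(
    initial_files={"calc.py": "def add(a, b):\n    return a - b\n"},
    evaluator=lambda files: "return a + b" in files.get("calc.py", ""),
)
observation = env.reset()
observation, reward, done = env.step(("calc.py", "def add(a, b):\n    return a + b\n"))
print(reward, done)
```

The episode ends either when the evaluator reports the issue as resolved or when the step budget is exhausted, which corresponds to treating the evaluator as both reward and termination machinery.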

Claw-Anything is again distinct. It evaluates always-on personal assistants over long-horizon activity histories, interdependent backend services, and integrated GUI and CLI interaction across devices. The evaluation set contains 200 human-verified task environments, of which 150 are CLI-only and 50 are CLI+GUI, and GPT-5.5 reaches only 34.5% pass@1. The paper explicitly states that “ClawGym-Bench” is not its canonical name, though it notes that Claw-Anything can be viewed as the “Claw gym/bench” for always-on personal assistants (Lin et al., 25 May 2026).

These nearby uses do not erase the canonical ClawGym meaning, but they do show semantic diffusion. A plausible implication is that “ClawGym-Bench” has become a partially generic label for benchmarked, gym-like evaluation of Claw-style agents, even when the underlying artifacts differ substantially in task domain, threat model, and scoring.

7. Limitations and future directions

The canonical ClawGym-Bench has several stated limitations (Bai et al., 29 Apr 2026). Its tasks are single-modality and workspace-grounded; web-heavy, GUI-heavy, or multimodal scenarios are not the primary focus. Because the benchmark is derived from synthetic data, the authors note that domain biases may persist in format preferences, tool affordances, and other aspects of the generated distribution. Verification is robust but not fully self-contained: rubric-based tasks depend on an LLM judge, specifically GPT-5.4 in the reported setup, which introduces potential bias and variance.

The benchmark also privileges final-state correctness. It does not yet systematically score safety, efficiency, or process-level robustness beyond qualitative analysis. Harness execution is more expensive than text-only benchmarks, although the reported stability reduces the need for many repeated runs. The paper reports stability statistics but does not report inter-annotator agreement metrics, and it does not publish fixed global timeouts for evaluation. Code and data are described as “soon released” at https://github.com/ClawGym, so reproducibility, at least in the paper’s presentation, depends on eventual release plus an OpenClaw runtime.

The future directions implied by the paper are correspondingly concrete: expand task diversity, especially with richer stateful workflows; improve verifier quality and coverage, including trajectory-level properties such as safety and efficiency; and refine selection and thresholding for continuous verifier scores. In that sense, ClawGym-Bench is best understood not as a terminal benchmark definition but as the evaluation nucleus of a broader lifecycle framework whose central idea is that Claw-style agents should be trained and measured in executable workspaces with explicit state transitions and verifiable end conditions (Bai et al., 29 Apr 2026).