ClawGym: Scalable Claw Agent Framework

Updated 4 July 2026

ClawGym is a scalable framework that defines Claw-style environments where agents operate in persistent workspaces with integrated tools and multi-turn workflows.
It uses a dual-route synthetic task generation pipeline that combines persona-driven and skill-grounded methods to create a corpus of 13.5K executable tasks.
The framework’s effectiveness is validated through rigorous benchmarking with ClawGym-Bench, showing significant performance gains over baseline models.

ClawGym is a scalable framework for building Claw-style personal agents: LLM-based agents that operate in computer-like sandboxes with persistent workspace state, local files, tools, and multi-step workflows whose correctness is judged primarily by the final workspace state rather than by the final textual response (Bai et al., 29 Apr 2026). It combines three tightly coupled components: ClawGym-SynData, a synthesized corpus of 13.5K executable tasks; ClawGym-Agents, a family of agents trained from black-box rollout trajectories collected in OpenClaw; and ClawGym-Bench, a 200-task evaluation suite curated through automated filtering and human–LLM review (Bai et al., 29 Apr 2026). Within the broader Claw literature, the explicit named framework is introduced in "ClawGym: A Scalable Framework for Building Effective Claw Agents" (Bai et al., 29 Apr 2026); several contemporaneous Claw papers explicitly state that the term is not defined in those works, which helps delimit ClawGym as a distinct contribution rather than a generic label (Wu et al., 21 May 2026, Gan et al., 7 May 2026, Zhao et al., 13 Apr 2026).

1. Definition and formal task model

ClawGym is built around Claw-style environments, described as sandboxes in which an agent has access to a persistent workspace state, a set of tools or actions, and a transition process that updates the environment over many turns (Bai et al., 29 Apr 2026). These environments support workflows over directories, files, configs, and web-accessible endpoints, and evaluation is centered on whether the agent correctly transforms the workspace into a desired end state (Bai et al., 29 Apr 2026).

A task instance is formalized as

$\tau = \langle p, s_0, \mathcal{A}, \mathcal{F}, \mathcal{V}_\tau\rangle$

where $p$ is the user instruction, $s_0$ is the initial environment state, $\mathcal{A}$ is the action space, $\mathcal{F}$ is the transition function, and $\mathcal{V}_\tau$ is the task-specific verifier (Bai et al., 29 Apr 2026). Agent behavior is represented as a trajectory

$\xi = (A_1, O_1, A_2, O_2, \ldots, A_K, O_K),$

with each $A_k$ defined as a segment of one or more tool calls and each $O_k$ as a segment of environment observations, allowing bursts of tool use before feedback is processed (Bai et al., 29 Apr 2026). Flattened actions $a_1,\dots,a_H$ evolve the state by

$s_h = \mathcal{F}(s_{h-1}, a_h), \quad h = 1, \ldots, H,$

and the verifier returns

$r = \mathcal{V}_\tau(s_0, s_H, y),$

where $y$ is an optional final text response (Bai et al., 29 Apr 2026).
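The task model can be illustrated with a minimal Python sketch. It is not the ClawGym implementation: the dictionary-based workspace, the two tool names (`write_file`, `delete_file`), and the helper names `transition`, `Task`, and `run` are all assumptions made for illustration. The sketch only shows the shape of the formalism, in which flattened actions are folded over the state and a task-specific verifier compares the initial and final workspace.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

# Assumption: a workspace state is a mapping from file path to file content.
State = Dict[str, str]
# Assumption: a flattened action is a (tool_name, arguments) pair.
Action = Tuple[str, Dict[str, str]]


def transition(state: State, action: Action) -> State:
    """Toy transition function: apply one tool call and return the next state."""
    tool, args = action
    nxt = dict(state)  # copy so that earlier states stay intact
    if tool == "write_file":
        nxt[args["path"]] = args["content"]
    elif tool == "delete_file":
        nxt.pop(args["path"], None)
    # Unknown tools leave the workspace unchanged in this sketch.
    return nxt


@dataclass
class Task:
    instruction: str      # user instruction
    initial_state: State  # initial environment state
    verifier: Callable[[State, State, Optional[str]], float]  # task-specific verifier


def run(task: Task, actions: List[Action], final_text: Optional[str] = None) -> float:
    """Fold the flattened actions over the state, then score the final workspace."""
    state = task.initial_state
    for action in actions:
        state = transition(state, action)
    return task.verifier(task.initial_state, state, final_text)


def verify(s0: State, s_final: State, _text: Optional[str]) -> float:
    """Example verifier: report.md must exist and notes.txt must be unchanged."""
    ok = "report.md" in s_final and s_final.get("notes.txt") == s0.get("notes.txt")
    return 1.0 if ok else 0.0


task = Task("Summarize notes.txt into report.md", {"notes.txt": "a\nb"}, verify)
score = run(task, [("write_file", {"path": "report.md", "content": "summary"})])
```

Because the verifier receives both the initial and the final state, it can check that required artifacts were produced and that unrelated files were left alone, independently of what the agent says in its final message.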

This formulation places ClawGym in a distinct lineage from frameworks that emphasize prompt-only outputs or one-shot tool selection. A related but separate framework, ClawEnvKit, defines environments formally in terms of tool interfaces and explicit evaluation functionals (Li et al., 20 Apr 2026). By contrast, ClawGym centers its formalism on persistent workspace state and a verifier over initial and final environments, which reflects its emphasis on file-grounded and artifact-grounded agent behavior (Bai et al., 29 Apr 2026).

2. Synthetic task generation and workspace construction

ClawGym-SynData is generated through a dual-route synthesis pipeline designed to combine user realism with tool realism (Bai et al., 29 Apr 2026). The first route is persona-driven top-down synthesis. Each task seed is

$z = \langle u, \sigma, \Omega \rangle$

where $u$ is a user persona, $\sigma$ is a scenario category, and $\Omega$ is a set of atomic operations (Bai et al., 29 Apr 2026). Given $z$, a task generator $\mathcal{G}$ produces the instruction

$p = \mathcal{G}(T(z))$

where $T$ is a prompt template summarizing persona, scenario, and operations (Bai et al., 29 Apr 2026). The scenario space spans 9 macro classes and 43 subcategories, while the operation taxonomy spans 7 categories and 26 operations (Bai et al., 29 Apr 2026).
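As a rough illustration of the persona-driven route, the sketch below renders a seed into a generator prompt. The class name `TaskSeed`, the function `render_prompt`, the template wording, and the example persona, scenario, and operations are all hypothetical; the source specifies only that the template summarizes persona, scenario, and operations.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class TaskSeed:
    persona: str           # who is asking
    scenario: str          # scenario category
    operations: List[str]  # atomic operations the task should exercise


def render_prompt(seed: TaskSeed) -> str:
    """Hypothetical template summarizing persona, scenario, and operations."""
    ops = "\n".join(f"- {op}" for op in seed.operations)
    return (
        f"Persona: {seed.persona}\n"
        f"Scenario: {seed.scenario}\n"
        f"Required operations:\n{ops}\n"
        "Write one realistic, self-contained user instruction for this workspace."
    )


# Illustrative values only; none of them come from the ClawGym taxonomy.
seed = TaskSeed(
    persona="freelance accountant",
    scenario="invoice bookkeeping",
    operations=["read CSV", "aggregate by month", "write Markdown report"],
)
prompt = render_prompt(seed)  # this string would be sent to a task-generator LLM
```

Keeping the seed as structured data, and rendering it only at the last step, makes it straightforward to sample personas, scenarios, and operations independently before any model call is made.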

The second route is skill-grounded bottom-up synthesis, intended to anchor tasks in the actual tool ecosystem (Bai et al., 29 Apr 2026). Starting from approximately 30K raw skills, ClawGym annotates each skill via

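$m_i = \mathrm{Annotate}(k_i)$

where $m_i$ is the annotation of skill $k_i$ and includes a summary, core content, constraints, I/O description, and a binary synthesizability label $b_i \in \{0, 1\}$ (Bai et al., 29 Apr 2026). Retention is defined as

$\mathcal{K}^{+} = \{\, k_i : b_i = 1 \,\}$

yielding 16K synthesizable skills across categories such as Dev Tools, Data APIs, and Automation (Bai et al., 29 Apr 2026). Task generation then composes one primary skill $k^{\mathrm{pri}}$ and up to three supporting skills $k^{\mathrm{sup}}_1, \ldots, k^{\mathrm{sup}}_n$ into an instruction through a skill-conditioned generator $\mathcal{G}_{\mathrm{skill}}$:

$p = \mathcal{G}_{\mathrm{skill}}\big(k^{\mathrm{pri}}, \{k^{\mathrm{sup}}_j\}_{j=1}^{n}\big), \quad 0 \le n \le 3.$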
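The retention and composition steps of the skill-grounded route can be sketched as follows. The `Skill` fields, the uniform random sampling, and the example skill names are assumptions for illustration; the source states only that skills labeled synthesizable are kept and that a task combines one primary skill with up to three supporting skills.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Skill:
    name: str
    category: str
    synthesizable: bool  # binary label produced by the annotation step


def retain(skills: List[Skill]) -> List[Skill]:
    """Keep only skills annotated as synthesizable."""
    return [s for s in skills if s.synthesizable]


def compose(
    pool: List[Skill], rng: random.Random, max_support: int = 3
) -> Tuple[Skill, List[Skill]]:
    """Pick one primary skill and between zero and max_support supporting skills.

    Uniform sampling is an assumption; the source does not describe the sampler.
    """
    primary = rng.choice(pool)
    others = [s for s in pool if s is not primary]
    k = rng.randint(0, min(max_support, len(others)))
    return primary, rng.sample(others, k)


# Hypothetical skills; the categories are the examples named in the text.
skills = [
    Skill("csv-tools", "Data APIs", True),
    Skill("git-helper", "Dev Tools", True),
    Skill("cron-runner", "Automation", True),
    Skill("gui-only-app", "Dev Tools", False),
]
pool = retain(skills)
primary, supporting = compose(pool, random.Random(0))
```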

After instruction synthesis, ClawGym constructs an initial mock workspace $s_0$ through a resource specification

$\mathcal{R} = \{(f_j, t_j, x_j)\}_{j=1}^{M}$

where $f_j$ is a path, $t_j$ is a file type, and $x_j$ is a content specification (Bai et al., 29 Apr 2026). GPT-5 is then used to materialize realistic files such as CSVs, JSON configs, Markdown notes, and README files (Bai et al., 29 Apr 2026). The explicit design objective is that tasks remain self-contained, privacy-preserving, and verifiable at scale (Bai et al., 29 Apr 2026).
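A minimal sketch of workspace materialization from a resource specification is shown below. The `Resource` class and the `materialize` helper are hypothetical, and `placeholder_content` is a stand-in for the model call that would write realistic file content; here it emits only a stub so the example runs offline.

```python
import json
import tempfile
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class Resource:
    path: str          # path relative to the workspace root
    file_type: str     # for example "csv", "json", or "md"
    content_spec: str  # description of what the file should contain


def placeholder_content(res: Resource) -> str:
    """Stand-in for the LLM call that would generate realistic content."""
    if res.file_type == "json":
        return json.dumps({"_spec": res.content_spec}, indent=2)
    return f"# placeholder for: {res.content_spec}\n"


def materialize(root: Path, spec: List[Resource]) -> List[Path]:
    """Create every file in the specification under root; return written paths."""
    written = []
    for res in spec:
        target = root / res.path
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(placeholder_content(res), encoding="utf-8")
        written.append(target)
    return written


# Illustrative specification; the paths and descriptions are made up.
spec = [
    Resource("data/sales.csv", "csv", "monthly sales with columns month,revenue"),
    Resource("config/settings.json", "json", "report settings with a currency field"),
]
workspace = Path(tempfile.mkdtemp(prefix="mock_workspace_"))
files = materialize(workspace, spec)
```

Separating the specification from the content generator keeps the workspace layout deterministic and inspectable even when the file contents come from a stochastic model.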

A central empirical result is that the two synthesis routes are complementary rather than redundant. In the reported ablations, persona-driven only and skill-grounded only both underperform mixed synthesis, while mixed synthesis gives the best results on both ClawGym-Bench and PinchBench (Bai et al., 29 Apr 2026). This suggests that realistic user intent distributions and realistic operational affordances contribute distinct training signal.

3. Verification design and quality control

ClawGym adopts a hybrid verification mechanism that combines deterministic workspace checks with rubric-based judgment (Bai et al., 29 Apr 2026). For code-based verification, a task defines a set of atomic conditions

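$\mathcal{C} = \{c_1, c_2, \ldots, c_n\},$

each evaluated as

$v_i = \mathbb{1}[\,c_i \text{ holds in the final workspace}\,] \in \{0, 1\}.$

The resulting code score is

$S_{\mathrm{code}} = \frac{1}{n} \sum_{i=1}^{n} v_i.$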

This score measures the fraction of objective requirements satisfied, such as whether required files exist, whether schemas are correct, or whether computed values match the specification (Bai et al., 29 Apr 2026).

Rubric-based verification is defined over a set of rubric rules

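$\mathcal{Q} = \{q_1, q_2, \ldots, q_m\},$

with each criterion scored by an LLM judge as

$g_j = \mathrm{Judge}(q_j) \in [0, 1].$

The rubric score is then

$S_{\mathrm{rubric}} = \sum_{j=1}^{m} w_j\, g_j, \qquad \sum_{j=1}^{m} w_j = 1,$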

with equal weights by default unless otherwise specified (Bai et al., 29 Apr 2026). Rubrics are used for qualitative properties such as completeness, clarity, professionalism, and faithfulness to source data (Bai et al., 29 Apr 2026).

The final task score is defined as either

$S = S_{\mathrm{code}}$

for code-only tasks, or

$S = \alpha\, S_{\mathrm{code}} + (1 - \alpha)\, S_{\mathrm{rubric}}$

with $\alpha = 0.7$, for hybrid tasks (Bai et al., 29 Apr 2026). The weighting gives 70% emphasis to code-based checks and 30% to rubric-based checks, reflecting a deliberate preference for objective verification while retaining some coverage of qualitative outputs (Bai et al., 29 Apr 2026).
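The scoring rules described in this section can be sketched in a few lines of Python. The function names are hypothetical, and the sketch assumes that each judge score is already normalized to the range 0 to 1, which the source does not state explicitly. The 70/30 weighting and the equal default rubric weights follow the description above.

```python
from typing import Optional, Sequence


def code_score(check_results: Sequence[bool]) -> float:
    """Fraction of deterministic workspace checks that passed."""
    if not check_results:
        raise ValueError("a task needs at least one code check")
    return sum(check_results) / len(check_results)


def rubric_score(
    judge_scores: Sequence[float], weights: Optional[Sequence[float]] = None
) -> float:
    """Weighted mean of per-criterion judge scores; equal weights by default."""
    if not judge_scores:
        raise ValueError("a hybrid task needs at least one rubric criterion")
    if weights is None:
        weights = [1.0] * len(judge_scores)
    if len(weights) != len(judge_scores):
        raise ValueError("weights and scores must have the same length")
    return sum(w * s for w, s in zip(weights, judge_scores)) / sum(weights)


def task_score(
    checks: Sequence[bool],
    judge_scores: Optional[Sequence[float]] = None,
    code_weight: float = 0.7,
) -> float:
    """Code-only tasks use the code score; hybrid tasks blend code and rubric."""
    s_code = code_score(checks)
    if judge_scores is None:
        return s_code
    return code_weight * s_code + (1.0 - code_weight) * rubric_score(judge_scores)


code_only = task_score([True, True, False, True])
hybrid = task_score([True, True, False, True], judge_scores=[1.0, 0.5, 0.0])
```

In this example three of four checks pass, so the code-only score is 0.75. With judge scores averaging 0.5, the hybrid score is 0.7 × 0.75 + 0.3 × 0.5, or about 0.675.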

ClawGym also applies automated quality assessment to both tasks and verifiers before admitting them to the dataset (Bai et al., 29 Apr 2026). Task quality includes novelty filtering via embedding similarity, plausibility judgment via GPT-5.4, and difficulty estimation to maintain a mixture of easy, moderate, and hard tasks (Bai et al., 29 Apr 2026). Verification quality includes executable sanity checks, coverage and over-strictness assessment for code checkers, and rubric review to ensure that rubrics do not simply duplicate code checks (Bai et al., 29 Apr 2026). After these stages, the final synthetic corpus contains 13.5K executable tasks (Bai et al., 29 Apr 2026).
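Novelty filtering by embedding similarity could look like the greedy sketch below. It is an assumption-laden illustration: the source does not specify the embedding model, the similarity measure, or the threshold. Cosine similarity, the `max_sim` value of 0.95, and the toy two-dimensional vectors are placeholders.

```python
import math
from typing import List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def novelty_filter(embeddings: List[Sequence[float]], max_sim: float) -> List[int]:
    """Greedily keep indices whose embedding is not too close to any kept one."""
    kept: List[int] = []
    for i, emb in enumerate(embeddings):
        if all(cosine(emb, embeddings[j]) < max_sim for j in kept):
            kept.append(i)
    return kept


# Toy embeddings: the second vector is a near-duplicate of the first.
vectors = [[1.0, 0.0], [0.99, 0.05], [0.0, 1.0]]
kept = novelty_filter(vectors, max_sim=0.95)  # hypothetical threshold
```

Here the second task is dropped as a near-duplicate of the first, while the third is kept because it is orthogonal to it.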

A 50-task human inspection reports mean scores on a 1–5 scale of 4.46 for task reasonableness, 3.50 for execution feasibility, 4.36 for resource consistency, 3.92 for verification quality, and 4.06 overall (Bai et al., 29 Apr 2026). These figures indicate that the synthesis pipeline is not presented as error-free; rather, it is positioned as sufficiently coherent and executable to support large-scale training and benchmarking.

4. ClawGym-Agents and the training pipeline

ClawGym-Agents are trained from black-box rollouts collected in native OpenClaw instances rather than from a reimplemented agent loop (Bai et al., 29 Apr 2026). In this setup, multiple OpenClaw Docker containers are provisioned with synthesized workspaces, teacher models act through the original runtime, and a proxy logs prompts, model responses, tool invocations, and environment outputs (Bai et al., 29 Apr 2026). The teachers used for this stage are MiniMax-M2.5 and GLM-5.1 (Bai et al., 29 Apr 2026).

Raw logs are aggregated into coherent multi-turn trajectories by grouping requests with the same message prefix, concatenating them into task-level dialogues, removing auxiliary prompts such as cron or heartbeat messages, and filtering out trajectories that rely on unsupported tools (Bai et al., 29 Apr 2026). Each aggregated trajectory is then scored with the hybrid verifier, and only those exceeding a reward threshold are retained (Bai et al., 29 Apr 2026). The reported best threshold is 0.5, which empirically balances trajectory quality and diversity (Bai et al., 29 Apr 2026). This process yields 24.5K high-fidelity trajectories with average statistics of 13.0 rounds, 18.67K tokens, 15.82 tool calls, and 3.25 distinct tool types (Bai et al., 29 Apr 2026).
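The cleaning and filtering steps can be sketched as below. The `Turn` and `Trajectory` structures, the `kind` tag, and the set of supported tool names are hypothetical; only the 0.5 reward threshold and the removal of cron or heartbeat prompts come from the description above. Treating "exceeding" as a strict inequality is also an assumption.

```python
from dataclasses import dataclass
from typing import List, Set

SUPPORTED_TOOLS: Set[str] = {"read", "write", "exec"}  # illustrative tool names
AUXILIARY_KINDS = ("cron", "heartbeat")  # auxiliary prompt kinds named in the text


@dataclass
class Turn:
    role: str       # "user", "assistant", or "tool"
    content: str
    tool: str = ""  # tool name for tool calls, empty otherwise
    kind: str = ""  # hypothetical tag set by the logging proxy


@dataclass
class Trajectory:
    turns: List[Turn]
    reward: float   # score assigned by the hybrid verifier


def clean(traj: Trajectory) -> Trajectory:
    """Drop auxiliary prompts such as cron or heartbeat messages."""
    turns = [t for t in traj.turns if t.kind not in AUXILIARY_KINDS]
    return Trajectory(turns, traj.reward)


def keep(traj: Trajectory, threshold: float = 0.5) -> bool:
    """Retain trajectories that use only supported tools and exceed the threshold."""
    tools_ok = all(t.tool in SUPPORTED_TOOLS for t in traj.turns if t.tool)
    return tools_ok and traj.reward > threshold  # strict comparison is an assumption


trajs = [
    Trajectory([Turn("assistant", "ls", tool="exec")], reward=0.9),
    Trajectory(
        [Turn("user", "ping", kind="heartbeat"), Turn("assistant", "cat a", tool="read")],
        reward=0.4,
    ),
    Trajectory([Turn("assistant", "open page", tool="browser")], reward=1.0),
]
retained = [clean(t) for t in trajs if keep(clean(t))]
```

Only the first trajectory survives: the second falls below the threshold, and the third relies on a tool outside the supported set even though its reward is high.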

Supervised fine-tuning is performed on Qwen3 base models: Qwen3-4B-2507-Instruct, Qwen3-8B, and Qwen3-30B-A3B-2507-Instruct (Bai et al., 29 Apr 2026). For Qwen3-8B, the context window is extended from 32K to 64K using YaRN (Bai et al., 29 Apr 2026). A key training choice is multi-turn loss masking, in which loss is applied only to model-generated tokens, not to tool outputs or environment-returned filesystem content (Bai et al., 29 Apr 2026). This prevents the model from learning to imitate observations and instead concentrates learning on policy generation, planning, and tool use (Bai et al., 29 Apr 2026). The resulting fine-tuned models are ClawGym-4B, ClawGym-8B, and ClawGym-30B-A3B (Bai et al., 29 Apr 2026).
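Multi-turn loss masking can be illustrated with a small label-building routine. This is a generic sketch, not the ClawGym training code: the segment format and the function name `build_labels` are assumptions. The value -100 is used because it is the default `ignore_index` of PyTorch's cross-entropy loss, so positions carrying it contribute nothing to the loss.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of PyTorch's cross-entropy loss


def build_labels(segments: List[Tuple[str, List[int]]]) -> Tuple[List[int], List[int]]:
    """Concatenate (role, token_ids) segments and mask every non-assistant token."""
    input_ids: List[int] = []
    labels: List[int] = []
    for role, token_ids in segments:
        input_ids.extend(token_ids)
        if role == "assistant":
            # Model-generated tokens contribute to the loss.
            labels.extend(token_ids)
        else:
            # User prompts and tool or environment outputs are ignored.
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels


# Toy token ids for a two-round interaction.
segments = [
    ("user", [1, 2]),
    ("assistant", [3, 4, 5]),  # e.g. a tool call emitted by the model
    ("tool", [6, 7]),          # e.g. file content returned by the environment
    ("assistant", [8]),
]
input_ids, labels = build_labels(segments)
```

The model still conditions on the tool output tokens as context, but it is never trained to reproduce them.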

ClawGym further explores reinforcement learning through a sandbox-parallel rollout pipeline (Bai et al., 29 Apr 2026). Each task is instantiated in an isolated sandbox with its own filesystem, gateway, and verifier, enabling parallel rollouts across many tasks (Bai et al., 29 Apr 2026). RL training uses 2,000 sampled tasks, balanced by tool usage, and applies GRPO with learning rate $\mathcal{F}$ 0, batch size 8, 8 rollouts per prompt, 100 training steps, temperature 0.7, and maximum response length 64K (Bai et al., 29 Apr 2026). Reward is taken directly from the code-only verifier score (Bai et al., 29 Apr 2026). Reported training curves show that RL improves both a vanilla 4B model and an already SFT-trained 30B agent, supporting the claim that verifier-grounded RL is feasible in this setting (Bai et al., 29 Apr 2026).
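The core of GRPO is a group-relative advantage: each rollout's reward is standardized against the other rollouts sampled for the same prompt. The sketch below shows that step for one group of 8 rollouts. It is a generic illustration rather than the ClawGym training code; the use of the population standard deviation, the `eps` stabilizer, and the example rewards are assumptions.

```python
import statistics
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Standardize rewards within one prompt's rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std; implementations vary
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical code-only verifier scores for 8 rollouts of one task.
rewards = [0.0, 0.25, 0.5, 0.5, 0.75, 1.0, 1.0, 0.0]
advantages = group_relative_advantages(rewards)
```

Rollouts that satisfy more verifier checks than the group average receive positive advantages and are reinforced, while a group in which every rollout scores the same yields zero advantage and therefore no learning signal for that task.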

5. ClawGym-Bench and empirical performance

ClawGym-Bench is a 200-task benchmark derived from the synthetic task pool but further curated for discriminative difficulty and evaluation reliability (Bai et al., 29 Apr 2026). Candidate tasks are subjected to difficulty-aware filtering based on average scores of a strong model and a smaller model over four rollouts each, with retention criteria

$\mathcal{F}$ 1

This ensures that accepted tasks are neither impossible nor trivial and that they meaningfully separate stronger from weaker agents (Bai et al., 29 Apr 2026).

Tasks passing automated filtering are then reviewed through a combined GPT-5.4 and human process that examines the instruction, workspace, code checker, and rubric, and either accepts, revises, or discards the task (Bai et al., 29 Apr 2026). The final benchmark contains 156 code-only tasks and 44 hybrid tasks, distributed across six categories: Productivity & Collaboration (44), Systems & Automation (42), Analysis & Reasoning (35), Content & Domain Support (28), Planning & Knowledge (26), and Software Development (25) (Bai et al., 29 Apr 2026).

Benchmark stability is explicitly evaluated. On a 50-task balanced subset, five repeated runs produce standard deviations of 0.3% for Qwen3-8B at mean 36.4%, and 1.0% for Qwen3-30B-A3B at mean 42.6%, indicating relatively low variance without heavy repeated sampling (Bai et al., 29 Apr 2026). The benchmark also enforces verifiable solvability by confirming at least one full-score trajectory for every task, either from strong-agent rollouts or human completion (Bai et al., 29 Apr 2026).

The main empirical results show large gains from ClawGym-specific training (Bai et al., 29 Apr 2026). On ClawGym-Bench, baseline Qwen models score 35.02 for Qwen3-8B, 40.32 for Qwen3-32B, 45.11 for Qwen3-30B-A3B, and 54.48 for Qwen3-235B-A23B (Bai et al., 29 Apr 2026). The ClawGym-Agents score 47.73 for ClawGym-4B, 50.24 for ClawGym-8B, and 56.82 for ClawGym-30B-A3B (Bai et al., 29 Apr 2026). Thus, ClawGym-8B improves over Qwen3-8B by approximately 43.46% in relative terms on ClawGym-Bench (35.02 to 50.24) and from 54.5 to 75.7 on PinchBench, while ClawGym-30B-A3B improves over Qwen3-30B-A3B by approximately 25.96% in relative terms on ClawGym-Bench (45.11 to 56.82) and from 55.6 to 86.0 on PinchBench (Bai et al., 29 Apr 2026).
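The quoted percentage gains are relative improvements over the baseline score, which can be checked directly from the ClawGym-Bench numbers above. The helper name `relative_gain` is introduced here only for the check.

```python
def relative_gain(baseline: float, trained: float) -> float:
    """Relative improvement of trained over baseline, in percent."""
    return 100.0 * (trained - baseline) / baseline


gain_8b = relative_gain(35.02, 50.24)   # Qwen3-8B -> ClawGym-8B
gain_30b = relative_gain(45.11, 56.82)  # Qwen3-30B-A3B -> ClawGym-30B-A3B

print(f"{gain_8b:.2f}% {gain_30b:.2f}%")  # prints "43.46% 25.96%"
```

In absolute terms these correspond to gains of 15.22 and 11.71 points on ClawGym-Bench.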

A particularly notable result is that ClawGym-30B-A3B outperforms Qwen3-235B-A23B on ClawGym-Bench, scoring 56.82 versus 54.48 (Bai et al., 29 Apr 2026). This is presented as evidence that environment-grounded task specialization can exceed the benefits of scaling a general model within this domain (Bai et al., 29 Apr 2026). Among proprietary frontier models, the highest average ClawGym-Bench score reported is 77.81 for Claude-4.7-Opus, while open-weight frontier models such as GLM-5.1 and Qwen3.5-Plus reach 71.12 and 70.35, respectively (Bai et al., 29 Apr 2026).

6. Position in the Claw ecosystem, limitations, and interpretation

ClawGym occupies a specific niche within the broader Claw ecosystem: it is a data-centric framework for developing agents in persistent workspace environments, rather than an autonomous lab orchestration system, an environment-generation toolkit, or a security monitor (Bai et al., 29 Apr 2026). This distinction matters because several other Claw papers describe adjacent components but explicitly do not define ClawGym. The Claw AI Lab paper centers on a lab-native autonomous research platform with multi-agent workflows and a Claw-Code Harness (Wu et al., 21 May 2026). ClawEnvKit emphasizes automatic generation of verified environments for claw-like agents and introduces Auto-ClawEval (Li et al., 20 Apr 2026). ClawGuard and ClawKeeper focus on agent security, out-of-band monitoring, runtime rule enforcement, and watcher-based protection (Gan et al., 7 May 2026, Zhao et al., 13 Apr 2026, Liu et al., 25 Mar 2026, Niu et al., 29 Jun 2026). Several of those papers explicitly state that “ClawGym” is not a named artifact in their text (Wu et al., 21 May 2026, Gan et al., 7 May 2026, Zhao et al., 13 Apr 2026). This helps resolve a common ambiguity: the explicit framework called ClawGym is the one introduced by Bai et al. (29 Apr 2026).

The framework also has clear limitations. Its tasks are synthetic mock workspaces rather than real user desktops, which improves privacy and verifier design but may omit some real-world messiness (Bai et al., 29 Apr 2026). Evaluation focuses on final-state correctness, not on trajectory efficiency, safety, or intermediate harm (Bai et al., 29 Apr 2026). The training and benchmarking setup is tied to the OpenClaw ecosystem and the available ClawHub skill distribution (Bai et al., 29 Apr 2026). Reinforcement learning is presented as a lightweight extension rather than as a full exploration of RL methodology, using only 2,000 tasks, one algorithm, and outcome-only rewards (Bai et al., 29 Apr 2026).

These limitations suggest two plausible implications. First, ClawGym is best understood as infrastructure for artifact-grounded agent capability development, not as a complete account of safe or efficient personal-agent behavior. Second, the framework is complementary to adjacent lines of work: ClawEnvKit provides a different environment-construction formalism (Li et al., 20 Apr 2026), while SafeClawArena and ClawKeeper emphasize security surfaces and runtime governance that ClawGym does not directly benchmark (Niu et al., 29 Jun 2026, Liu et al., 25 Mar 2026). Taken together, these papers suggest an emerging division of labor in the Claw literature: ClawGym for large-scale task synthesis, rollout distillation, and capability evaluation; environment-generation systems for benchmark construction; and security frameworks for system-level hardening and adversarial testing (Bai et al., 29 Apr 2026, Li et al., 20 Apr 2026, Niu et al., 29 Jun 2026).

In that sense, ClawGym marks a shift from isolated task datasets toward a full lifecycle for Claw-style agents: task synthesis, workspace construction, verifier design, black-box trajectory collection, supervised training, lightweight RL, and benchmarked evaluation within persistent file-and-tool environments (Bai et al., 29 Apr 2026).