ClawGym-Agents for Workspace-Grounded Tasks

Updated 4 July 2026

ClawGym-Agents are Qwen3-based models specialized for multi-step, workspace-grounded tasks that require tool invocation and file operations.
They leverage a diverse synthesis of persona-driven and skill-grounded data to ensure realistic interactions and robust error recovery.
Their training combines supervised fine-tuning with reinforcement learning to optimize performance in long-horizon, tool-invoking environments.

ClawGym-Agents are the trained agent models produced within the broader ClawGym framework for environment-grounded “Claw-style” tasks, where an instruction is not solved by returning text alone, but by acting inside a persistent workspace: reading and writing local files, invoking tools, executing scripts, interacting with web-enabled services, and leaving behind a correct final environment state (Bai et al., 29 Apr 2026). In the paper introducing ClawGym, these agents are specifically Qwen3-based models fine-tuned on OpenClaw interaction trajectories, resulting in ClawGym-4B, ClawGym-8B, and ClawGym-30B-A3B. They are not presented as a standalone contribution detached from the rest of the system; rather, they are the trained-model component enabled by ClawGym-SynData and validated by ClawGym-Bench (Bai et al., 29 Apr 2026).

1. System identity and task formulation

ClawGym formalizes a Claw-style task as

$\tau = \langle p, s_0, \mathcal{A}, \mathcal{F}, \mathcal{V}_\tau\rangle,$

where $p$ is the user instruction, $s_0$ the initial workspace, $\mathcal{A}$ the available actions, $\mathcal{F}$ the environment transition induced by tool execution, and $\mathcal{V}_\tau$ the task verifier (Bai et al., 29 Apr 2026). The output is a trajectory

$\xi = (A_1, O_1, A_2, O_2, \ldots, A_K, O_K),$

with grouped action and observation segments, and the final state evolves through

$s_t = \mathcal{F}(s_{t-1}, a_t), \quad t = 1,\ldots,H.$

Success is judged primarily from final-state correctness rather than just a final textual answer, via

$v = \mathcal{V}_\tau(s_0, s_H, y), \quad v \in [0,1].$

This formulation distinguishes ClawGym-Agents from static instruction-following systems. The paper emphasizes three central difficulties motivating such agents. First, tasks must reflect personalized, realistic user needs across occupations and routines. Second, they are long-horizon and multi-step, involving file operations, tool calls, intermediate artifacts, and error recovery. Third, they require grounded local workspaces with realistic artifacts rather than abstract prompts (Bai et al., 29 Apr 2026). A plausible implication is that the benchmark target is not merely linguistic competence, but stable control over a persistent environment under verifier-constrained execution.

2. Data synthesis, workspace grounding, and verification

The training substrate for ClawGym-Agents is ClawGym-SynData, described as a diverse dataset of 13.5K filtered executable tasks (Bai et al., 29 Apr 2026). Its construction combines two synthesis routes.

In the persona-driven top-down route, each task begins with a seed

$z = (u, c, \mathcal{G}),$

where $p$ 0 is a user persona, $p$ 1 is a scenario category, and $p$ 2 is a set of basic operations. The paper states that the scenario taxonomy contains 9 major classes and 43 subcategories, and the atomic-operation taxonomy contains 7 categories and 26 distinct operations. A task generator, GPT-5, expands each seed into a concrete instruction via

$p$ 3

In the skill-grounded bottom-up route, the framework starts from OpenClaw skills from ClawHub rather than personas. Approximately 30K raw OpenClaw skills are annotated, and 16K synthesizable skills are retained (Bai et al., 29 Apr 2026). Table 1 reports 16,837 annotated skills distributed across categories: MCP Tools 411, Prompts 565, Workflows 1,972, Dev Tools 3,906, Data APIs 4,236, Security 993, Automation 1,221, Other 3,533. To generate a task, the framework samples one primary skill and up to three supporting skills, then asks GPT-5 to synthesize a user-facing instruction.

The two routes are deliberately complementary. Persona-driven generation broadens user-facing scenario coverage, while skill-grounded generation ensures executable grounding in actual OpenClaw capabilities (Bai et al., 29 Apr 2026). The paper later reports that mixed synthesis performs best, which it interprets as support for this synergy.

After generating instructions, ClawGym creates mock workspaces to instantiate $p$ 4. The resource specification is

$p$ 5

where $p$ 6 is the file path, $p$ 7 the file type, and $p$ 8 the file content specification. GPT-5 materializes these into concrete files. The paper stresses that these mock resources are lightweight but task-specific and realistic, and that structured files like JSON, CSV, and YAML include explicit schemas and values so that outputs can later be checked against source data (Bai et al., 29 Apr 2026).

Verification is hybrid. Code-based verification decomposes objective requirements into atomic checks

$p$ 9

each returning

$s_0$ 0

The code score is

$s_0$ 1

Rubric-based verification uses rubric rules

$s_0$ 2

with per-rule ordinal scores

$s_0$ 3

aggregated as

$s_0$ 4

with equal weights by default. For tasks with both code-based and rubric-based verification, the paper explicitly sets $s_0$ 5, giving more weight to objective workspace-grounded correctness (Bai et al., 29 Apr 2026).

The paper also reports a human evaluation of 50 sampled training tasks with average scores: Task Reasonableness 4.46, Execution Feasibility 3.50, Resource Consistency 4.36, Verification Quality 3.92, Overall 4.06 on a 1–5 scale (Bai et al., 29 Apr 2026). This suggests that execution feasibility, while acceptable, is the weakest rated dimension in the sampled training distribution.

3. Training pipeline and model family

Training ClawGym-Agents begins with black-box rollout collection. In this paper, “black-box rollout trajectories” means trajectories generated by running tasks inside the original OpenClaw harness without reconstructing or approximating the internal agent loop (Bai et al., 29 Apr 2026). OpenClaw is treated as an opaque executable system whose internals, such as context management and subagent sessions, remain inaccessible. The authors deploy multiple OpenClaw Docker environments on a distributed cluster, assign tasks and workspaces to containers, and record all request/response traffic through a proxy layer. Teacher models used for rollout are MiniMax-M2.5 and GLM-5.1.

Raw logs undergo aggregation and cleaning. Since logs are captured at the granularity of individual model calls, a single task may produce multiple partially overlapping traces. The authors reconstruct full trajectories by grouping requests with identical message prefixes and concatenating subsequent turns. They also remove OpenClaw-inserted auxiliary prompts such as cron or heartbeat messages and filter out trajectories involving unsupported tools (Bai et al., 29 Apr 2026).

Selection is reward-threshold based rather than best-of- $s_0$ 6, because task scores are continuous in $s_0$ 7 under the hybrid verifier. The paper states that trajectories are retained only if their final score exceeds a predefined reward threshold, and that the final supervised fine-tuning corpus contains 24.5K high-fidelity trajectories. The best threshold from experiments is 0.5. These trajectories have, on average, 13.00 interaction rounds, 18.67K tokens, 15.82 tool calls, and 3.25 distinct tool types (Bai et al., 29 Apr 2026).

A key training design is multi-turn loss masking: tokens corresponding to environment feedback from Docker execution are excluded from the supervised loss, so the model is not trained to imitate tool outputs or observations. Instead, it learns only from policy-generated content such as reasoning, decisions, and tool-call generation (Bai et al., 29 Apr 2026). This is one of the clearest algorithmic design choices in the paper.

The resulting model family is summarized below.

Model	Backbone	Training route
ClawGym-4B	Qwen3-4B-2507-Instruct	SFT on rollout trajectories
ClawGym-8B	Qwen3-8B	SFT on rollout trajectories
ClawGym-30B-A3B	Qwen3-30B-A3B-2507-Instruct	SFT on rollout trajectories

The paper further states that Qwen3-8B’s native 32K context is extended to 64K using YaRN, and that evaluation context length is uniformly set to 64K tokens (Bai et al., 29 Apr 2026). Training-dynamics analysis reports training across 5 epochs with 103 steps per epoch, evaluating checkpoints every 60 steps; performance peaks at the end of epoch 3, step 309, after which slight degradation indicates overfitting (Bai et al., 29 Apr 2026).

4. Reinforcement learning and sandbox-parallel optimization

ClawGym also explores reinforcement learning through a lightweight pipeline that parallelizes rollouts across per-task sandboxes (Bai et al., 29 Apr 2026). Each task is virtualized into an independent sandbox with its own filesystem, workspace, gateway, and verifier, allowing many rollouts to run concurrently without interference. The system supports both Docker-based and Docker-free backends depending on the cluster setup.

The RL algorithm is GRPO. Training is run from two initializations: a vanilla Qwen3-4B-2507-Instruct model and the SFT-trained ClawGym-30B-A3B model. The RL training set is a sample of 2,000 tasks from ClawGym-SynData, balanced across required tools to preserve diversity. Hyperparameters are explicitly listed as: learning rate $s_0$ 8, train batch size 8, rollouts per prompt 8, training steps 100, rollout temperature 0.7, and maximum response length 64K tokens (Bai et al., 29 Apr 2026).

Reward is outcome-only and comes directly from the code verifier, avoiding separate reward models or process annotations. Figure 1 plots RL training curves on ClawGym-Bench using only code-based verifiers with weight 1.0 and no rubric judgment. The paper reports that RL improves both the vanilla 4B model and the already fine-tuned ClawGym-30B-A3B, and characterizes the relationship between stages as follows: supervised fine-tuning learns from successful or high-scoring teacher trajectories, while RL directly optimizes behavior against verifier-derived task outcomes in the live sandboxed environment (Bai et al., 29 Apr 2026).

This suggests a deliberately lightweight RL design: the environment remains black-box OpenClaw, the reward is verifier-native, and the engineering emphasis is on scalable rollout infrastructure rather than on an elaborate learned reward stack.

5. Evaluation, benchmark design, and empirical performance

ClawGym-Bench is the evaluation benchmark built to diagnose Claw-style agents reliably. It contains 200 benchmark tasks drawn from the synthesized task pool but excluding tasks used in training (Bai et al., 29 Apr 2026). Benchmark construction uses stricter filtering than the main dataset. For each task $s_0$ 9, the framework performs $\mathcal{A}$ 0 rollouts with both a strong model and a small model, computes average completion scores $\mathcal{A}$ 1 and $\mathcal{A}$ 2, and retains only tasks satisfying

$\mathcal{A}$ 3

The final benchmark has 200 tasks across 6 categories: Productivity and Collaboration 44, Systems and Automation 42, Analysis and Reasoning 35, Content and Domain Support 28, Planning and Knowledge 26, and Software Development 25. Among these, 156 tasks are evaluated purely by code-based checkers and 44 use hybrid verification (Bai et al., 29 Apr 2026).

The benchmark is explicitly calibrated for reliability. On a fixed balanced subset of 50 tasks, repeated 5-run evaluation produced low standard deviations: Qwen3-8B had mean 36.4% with std. 0.3%, and Qwen3-30B-A3B had mean 42.6% with std. 1.0% (Bai et al., 29 Apr 2026).

The main results show substantial gains over the base models. On ClawGym-Bench, Qwen3-8B scores 35.02, while ClawGym-8B scores 50.24, a 43.46% improvement. Qwen3-30A3B scores 45.11, while ClawGym-30A3B scores 56.82, a 25.96% gain. On the external PinchBench benchmark, Qwen3-8B improves from 54.50 to 75.70 according to Table 4, while Qwen3-30A3B improves from 55.60 to 86.00 (Bai et al., 29 Apr 2026). The paper also states that ClawGym-4B reaches 47.73 on ClawGym-Bench and 76.40 on PinchBench.

Ablation studies support several design choices. For Qwen3-8B, Only Persona-driven scores 49.44 on ClawGym-Bench and 73.51 on PinchBench; Only Skill-grounded scores 49.06 and 68.23; Mixed Synthesis performs best at 50.24 and 75.68. For Qwen3-30A3B, Only Persona-driven scores 53.65 and 84.92, Only Skill-grounded 52.27 and 80.05, and Mixed Synthesis 56.82 and 86.00 (Bai et al., 29 Apr 2026). The reward-threshold ablation shows that 0.5 yields the best downstream results.

The paper also reports a clear model hierarchy. Among proprietary frontier models, Claude-4.7-Opus is best overall on ClawGym-Bench at 77.81 average. Among open-weight frontier models, GLM-5.1 reaches 71.12 and Qwen3.5-Plus 70.35 (Bai et al., 29 Apr 2026). ClawGym-trained compact models remain below the best frontier systems but substantially improve over their unadapted backbones.

The behavioral analyses identify three key capability bottlenecks. First, tool-use appropriateness: weaker models may call tools and recover from errors, but fail to organize those calls into a coherent discovery-inspection-computation-verification workflow. Second, long-horizon execution robustness: successful agents treat tool errors as recoverable feedback and preserve progress, while weaker agents accumulate unresolved failures and reach dead ends. Third, fine-grained instruction following: agents can generate plausible artifacts that nevertheless violate key constraints (Bai et al., 29 Apr 2026).

These remaining weaknesses are illustrated by the paper’s examples. In the CI artifact audit example, GPT-5.4 achieves reward 1.000, whereas the weaker 30A3 scores 0.308. In the support-ticket automation case, GPT-5.4 again achieves reward 1.000, whereas 30A3 receives 0.067. In the reorder-plan case, a weaker model produces outputs violating the required Quantity <= ReorderPoint rule and receives 0.429 (Bai et al., 29 Apr 2026). The paper interprets these failures as weaknesses in chaining tools coherently, maintaining long-horizon state, recovering from errors, and preserving detailed constraints across derived artifacts.

The framework also has explicit limitations. The paper omits conventional training details such as optimizer type, weight decay, learning-rate schedule for SFT, hardware, and total training compute (Bai et al., 29 Apr 2026). It notes that execution feasibility is the weakest human-rated quality dimension among sampled tasks, that benchmark evaluation is expensive, and that current evaluation focuses mainly on final-state correctness rather than trajectory-level safety, efficiency, or robustness (Bai et al., 29 Apr 2026).

Within the broader Claw ecosystem, adjacent papers help clarify the likely operational role of ClawGym-Agents. “Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning” presents a middleware layer that sits between heterogeneous agent runtimes and RL training backends, treating interaction traces as managed training assets rather than incidental runtime logs (Wang et al., 8 Jun 2026). This suggests a natural complement to ClawGym-Agents when step-level trajectory management, curation, and downstream RL consumption become primary concerns. Likewise, “Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents” proposes trajectory-aware grading with 300 human-verified tasks, three independent evidence channels, and 2,159 fine-grained rubric items, emphasizing Completion, Safety, and Robustness rather than final-output scoring alone (Ye et al., 7 Apr 2026). A plausible implication is that ClawGym-Agents, which are evaluated mainly through final-state correctness and verifier-based outcomes, would benefit from being studied alongside trajectory-aware evaluation and security-oriented infrastructures.

Security-oriented OpenClaw work further sharpens this context. “Defensible Design for OpenClaw: Securing Autonomous Tool-Invoking Agents” argues that OpenClaw-like systems are “insecure by default” because they combine untrusted inputs, autonomous continuation, extensibility, and privileged system access within a single execution loop (Li et al., 13 Mar 2026). Since ClawGym-Agents are explicitly trained for persistent, tool-using, workspace-grounded environments, this suggests that future work on such agents may need to extend beyond capability scaling toward bounded authority, runtime isolation, extension governance, and auditability.

In concise technical terms, ClawGym-Agents are Qwen3-based Claw-style personal agents trained end-to-end within a data-centric framework built specifically for persistent, workspace-grounded environments. Their performance gains are tied to scenario-diverse task synthesis, realistic workspace construction, hybrid verification, black-box trajectory collection in the real OpenClaw runtime, reward-threshold filtering, and sandbox-parallel RL (Bai et al., 29 Apr 2026). What remains unresolved are richer trajectory-level evaluation, deeper safety analysis, and continued improvement on long-horizon robustness, coherent tool orchestration, and fine-grained constraint satisfaction.