ClawGym-SynData: Verifiable Task Corpus
- The paper presents ClawGym-SynData, a synthesized task corpus using dual-route synthesis and hybrid verification for multi-step workflow evaluation.
- ClawGym-SynData is a dataset of 13.5K realistic, environment-grounded tasks that simulate file operations, document editing, and data transformation.
- Its methodology combines persona-driven and skill-grounded synthesis with code-based and rubric-based verifiers to ensure final-state correctness of complex workflows.
ClawGym-SynData is the synthesized, environment-grounded task corpus at the core of the ClawGym framework, designed specifically to train and evaluate “Claw-style” personal agents. In this setting, agents operate within real computer environments to execute multi-step workflows over local files, tools, and persistent workspace states, and success is defined primarily by the correctness of the changed workspace state rather than by a final text answer. The dataset addresses the lack of large-scale, verifiable, workspace-grounded training data for realistic multi-step, tool-mediated workflows by combining dual-route synthesis, realistic mock workspaces, and hybrid verification, yielding 13.5K filtered, executable tasks within OpenClaw environments (Bai et al., 29 Apr 2026).
1. Formal setting and role within ClawGym
ClawGym-SynData is situated in a task model where an agent receives a user instruction and an initial workspace, can call tools, and must drive the environment to a correct final state. A task instance is defined as

$$\mathcal{T} = (q, s_0, \mathcal{A}, \mathcal{P}, V)$$

with instruction $q$, initial state $s_0$, action set $\mathcal{A}$, state transition function $\mathcal{P}$, and task-specific verifier $V$. If the agent produces a trajectory $\tau$ and induces a final state $s_{\text{final}}$, success is verified by

$$\mathrm{success}(\tau) = V(s_{\text{final}}).$$
This formulation centers final-state correctness, including workspace artifacts, structured outputs, and file transformations, rather than treating the agent as a text-only responder (Bai et al., 29 Apr 2026).
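A minimal sketch of this final-state view, assuming a workspace can be modeled as a mapping from relative path to file text. The `Task` class, `run_task`, the toy instruction, and both policies are hypothetical illustrations, not part of ClawGym.

```python
from dataclasses import dataclass
from typing import Callable, Dict

# Assumption: a workspace state is a mapping from relative path to file text.
Workspace = Dict[str, str]


@dataclass
class Task:
    instruction: str
    initial_state: Workspace
    verifier: Callable[[Workspace], float]  # scores the final state only


def run_task(task: Task, policy: Callable[[str, Workspace], Workspace]) -> float:
    """Run a policy on a copy of the initial workspace and score the result."""
    final_state = policy(task.instruction, dict(task.initial_state))
    return task.verifier(final_state)


# Hypothetical toy task: the agent must create output/summary.txt.
toy = Task(
    instruction="Write a one-line summary to output/summary.txt",
    initial_state={"input/notes.txt": "alpha\nbeta\n"},
    verifier=lambda ws: 1.0 if ws.get("output/summary.txt", "").strip() else 0.0,
)


def answer_only_policy(instruction: str, ws: Workspace) -> Workspace:
    # Leaves the workspace untouched, as a text-only responder would.
    return ws


def writing_policy(instruction: str, ws: Workspace) -> Workspace:
    ws["output/summary.txt"] = "2 notes: alpha, beta"
    return ws
```

Under this sketch, `run_task(toy, answer_only_policy)` scores zero while `run_task(toy, writing_policy)` scores one, which is the distinction between answering and changing the workspace that the formulation emphasizes.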
Within the broader ClawGym lifecycle, SynData functions as the data substrate for task synthesis, black-box rollouts, supervised fine-tuning, reinforcement learning, and subsequent evaluation on ClawGym-Bench. The paper motivates synthetic, verifiable data on the grounds that real user workspaces are private, heterogeneous, and difficult to validate automatically, while Claw-style tasks require multi-step, tool-mediated operations with long-horizon dependencies and robust verifiers.
2. Synthesis architecture and dataset composition
ClawGym-SynData uses a dual-route diversification strategy: persona-driven top-down synthesis and skill-grounded bottom-up synthesis. In the persona-driven route, seeds combine a persona $p$, a scenario category $c$, and atomic operations $O$ as

$$z = (p, c, O),$$

after which task instructions are generated by

$$q \sim \mathrm{LLM}(\cdot \mid z).$$
The persona-driven configuration spans 9 major scenario classes, 43 subcategories, and 7 categories covering 26 atomic operations. Its reported distribution is broad: even the largest scenario category accounts for only 12.5% of tasks, and no single atomic action dominates (Bai et al., 29 Apr 2026).
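The persona-driven seed construction can be sketched as follows. This is an illustration under stated assumptions: the `Seed` class, `sample_seed`, `render_prompt`, the prompt wording, and all example personas, scenarios, and operations are hypothetical, and the actual LLM call is omitted.

```python
import random
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Seed:
    persona: str
    scenario: str
    operations: Tuple[str, ...]


def sample_seed(personas: Sequence[str], scenarios: Sequence[str],
                operations: Sequence[str], n_ops: int,
                rng: random.Random) -> Seed:
    """Draw one persona, one scenario, and n_ops distinct atomic operations."""
    return Seed(
        persona=rng.choice(list(personas)),
        scenario=rng.choice(list(scenarios)),
        operations=tuple(rng.sample(list(operations), n_ops)),
    )


def render_prompt(seed: Seed) -> str:
    """Build the text that would be sent to a task-writing LLM (call omitted)."""
    ops = ", ".join(seed.operations)
    return (
        f"Persona: {seed.persona}\n"
        f"Scenario: {seed.scenario}\n"
        f"Required operations: {ops}\n"
        "Write one realistic, verifiable workspace task."
    )


# Hypothetical placeholder vocabularies, not taken from the paper.
PERSONAS = ["release engineer", "lab manager", "freelance accountant"]
SCENARIOS = ["Systems and Automation", "Analysis and Reasoning"]
OPERATIONS = ["read_file", "merge_csv", "edit_markdown", "write_json"]
```

Sampling the three seed components independently is one simple way to keep any single scenario or operation from dominating, in the spirit of the distribution reported above.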
The skill-grounded route starts from approximately 30K raw skills collected from ClawHub. These are annotated and filtered by synthesizability, and about 16K synthesizable skills (16,837 in the breakdown below) are retained. Each task is then composed from one primary skill and up to three supporting skills. The annotated skills used for synthesis are distributed across the following categories:
| Skill category | Count | Share |
|---|---|---|
| MCP Tools | 411 | 2.44% |
| Prompts | 565 | 3.36% |
| Workflows | 1,972 | 11.71% |
| Dev Tools | 3,906 | 23.20% |
| Data APIs | 4,236 | 25.16% |
| Security | 993 | 5.90% |
| Automation | 1,221 | 7.25% |
| Other | 3,533 | 20.98% |
| Total | 16,837 | 100% |
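The composition rule of one primary skill plus up to three supporting skills can be sketched as below. The dictionary fields `name` and `synthesizable`, the uniform sampling, and the function `compose_bundle` are assumptions made for illustration; the paper's actual selection procedure is not described at this level of detail.

```python
import random
from typing import Dict, List


def compose_bundle(skills: List[Dict], rng: random.Random,
                   max_supporting: int = 3) -> Dict:
    """Pick one primary skill and 0..max_supporting distinct supporting skills."""
    # Only skills flagged as synthesizable are eligible (hypothetical field).
    pool = [s for s in skills if s["synthesizable"]]
    primary = rng.choice(pool)
    others = [s for s in pool if s is not primary]
    k = rng.randint(0, min(max_supporting, len(others)))
    supporting = rng.sample(others, k)
    return {
        "primary": primary["name"],
        "supporting": [s["name"] for s in supporting],
    }


# Hypothetical skill records, not real ClawHub entries.
SKILLS = [
    {"name": "csv-merge", "synthesizable": True},
    {"name": "json-audit", "synthesizable": True},
    {"name": "markdown-report", "synthesizable": True},
    {"name": "yaml-lint", "synthesizable": True},
    {"name": "needs-live-account", "synthesizable": False},
]
```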
The final corpus contains 13.5K filtered, executable tasks. Tasks involve local file operations, document editing, data analysis, script execution, and reporting, with resource types including text, markdown, JSON, CSV, and YAML. The paper does not specify exact counts per tool or file type, but the examples indicate multi-file workflows with realistic paths such as input/ci-artifacts/*.json and output/audit/*.json (Bai et al., 29 Apr 2026).
3. Workspace mocking and hybrid verification
A central design feature of ClawGym-SynData is that tasks are grounded in realistic mock workspaces. Resource preparation is specified as

$$\mathcal{R} = \{(f_i, t_i, d_i)\}_{i=1}^{n},$$

where $f_i$ is a file path, $t_i$ is a type, and $d_i$ is a content specification. Files are generated to be self-contained and verifiable, covering text, markdown, JSON, CSV, and YAML (Bai et al., 29 Apr 2026).
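A minimal sketch of turning such (path, type, content) triples into files on disk, using only the Python standard library. The `materialize` function, the concrete file names, and the file contents are hypothetical; YAML and markdown are written verbatim as text here to avoid a third-party dependency.

```python
import csv
import json
import tempfile
from pathlib import Path


def materialize(resources, root):
    """Write each (relative_path, file_type, content) triple under root."""
    root = Path(root)
    for rel_path, ftype, content in resources:
        target = root / rel_path
        target.parent.mkdir(parents=True, exist_ok=True)
        if ftype == "json":
            target.write_text(json.dumps(content, indent=2), encoding="utf-8")
        elif ftype == "csv":
            with target.open("w", newline="", encoding="utf-8") as fh:
                csv.writer(fh).writerows(content)
        else:
            # text, markdown, and yaml are stored as plain strings
            target.write_text(content, encoding="utf-8")
    return root


# Hypothetical resource specification for a small mock workspace.
RESOURCES = [
    ("input/ci-artifacts/build-1.json", "json", {"status": "passed", "tests": 42}),
    ("input/inventory.csv", "csv", [["sku", "stock"], ["A-1", "7"]]),
    ("README.md", "markdown", "# Workspace\nMock project notes.\n"),
]
```

A typical use is `materialize(RESOURCES, tempfile.mkdtemp())`, which gives each task its own isolated directory as the initial state.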
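Verification is hybrid. Code-based checks define atomic verification points $v_1, \dots, v_n$, producing binary scores

$$c_i \in \{0, 1\}, \quad i = 1, \dots, n,$$

which are aggregated as

$$S_{\text{code}} = \frac{1}{n} \sum_{i=1}^{n} c_i.$$

Rubric-based checks define rules $r_1, \dots, r_m$ with ordinal scores

$$g_j \in \mathcal{G} \subset [0, 1], \quad j = 1, \dots, m,$$

and a weighted average

$$S_{\text{rubric}} = \frac{\sum_{j=1}^{m} w_j \, g_j}{\sum_{j=1}^{m} w_j}.$$

Task scores are either code-only,

$$S = S_{\text{code}},$$

or hybrid,

$$S = \alpha \, S_{\text{code}} + (1 - \alpha) \, S_{\text{rubric}},$$

with a fixed weight $\alpha$ for hybrid tasks (Bai et al., 29 Apr 2026).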
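The hybrid scoring described above can be sketched in a few lines. This assumes code checks are averaged with equal weight, rubric scores are already normalized to [0, 1], and the two parts are mixed linearly; the mixing weight `alpha` is a caller-supplied parameter because its value is not given here, and all function names are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def code_score(checks: Sequence[bool]) -> float:
    """Fraction of binary verification points that pass (assumed unweighted)."""
    return sum(1 for passed in checks if passed) / len(checks)


def rubric_score(scores: Sequence[float], weights: Sequence[float]) -> float:
    """Weighted average of rubric scores, each assumed to lie in [0, 1]."""
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


def task_score(checks: Sequence[bool],
               rubric: Optional[Tuple[Sequence[float], Sequence[float]]] = None,
               alpha: Optional[float] = None) -> float:
    """Code-only score when no rubric is given, otherwise a linear mix."""
    s_code = code_score(checks)
    if rubric is None:
        return s_code
    if alpha is None:
        raise ValueError("alpha must be supplied for hybrid tasks")
    scores, weights = rubric
    return alpha * s_code + (1.0 - alpha) * rubric_score(scores, weights)
```

For example, three of four passing checks give a code score of 0.75; with rubric scores `[1.0, 0.5]`, weights `[2, 1]`, and an illustrative `alpha=0.6`, the hybrid score is `0.6 * 0.75 + 0.4 * (2.5 / 3)`.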
The code-based component is used for deterministic checks such as existence, schema validity, recomputed statistics, and cross-file consistency. Rubrics evaluate qualitative aspects including clarity, organization, and faithfulness, and are explicitly intended not to duplicate deterministic constraints. This hybrid construction is also applied to the validation of black-box rollout trajectories: trajectories are retained only if their final task score exceeds a filtering threshold.
Representative examples in the paper include a CSV merge task, where filenames containing “sales” are merged and enriched with a source_file column, a CI artifact audit producing aggregated JSON reports, and an inventory reorder task that computes TargetStock, OrderQty, and LineTotal before writing supplier JSON files under output/orders/ (Bai et al., 29 Apr 2026).
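To make the code-based side concrete, here is a hedged sketch of what a checker for the CSV merge example might look like. The output path `output/merged_sales.csv`, the input directory layout, the specific four checks, and the `demo` helper are all assumptions for illustration; the paper's actual checker is not reproduced here.

```python
import csv
import tempfile
from pathlib import Path


def read_rows(path: Path):
    with path.open(newline="", encoding="utf-8") as fh:
        return list(csv.DictReader(fh))


def check_sales_merge(root, out_rel="output/merged_sales.csv"):
    """Return four binary checks: exists, has column, row count, valid sources."""
    root = Path(root)
    inputs = sorted(p for p in (root / "input").glob("*.csv") if "sales" in p.name)
    out = root / out_rel
    if not out.is_file():
        return [False, False, False, False]
    rows = read_rows(out)
    has_col = bool(rows) and "source_file" in rows[0]
    expected_rows = sum(len(read_rows(p)) for p in inputs)
    names = {p.name for p in inputs}
    return [
        True,
        has_col,
        len(rows) == expected_rows,
        has_col and all(r["source_file"] in names for r in rows),
    ]


def demo():
    """Score a hypothetical workspace before and after a correct merge."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "input").mkdir()
        (root / "input" / "sales_q1.csv").write_text("sku,qty\nA,1\nB,2\n", encoding="utf-8")
        (root / "input" / "sales_q2.csv").write_text("sku,qty\nC,3\n", encoding="utf-8")
        (root / "input" / "returns.csv").write_text("sku,qty\nA,1\n", encoding="utf-8")
        before = check_sales_merge(root)
        (root / "output").mkdir()
        (root / "output" / "merged_sales.csv").write_text(
            "sku,qty,source_file\nA,1,sales_q1.csv\nB,2,sales_q1.csv\nC,3,sales_q2.csv\n",
            encoding="utf-8",
        )
        after = check_sales_merge(root)
    return before, after
```

In `demo`, every check fails on the untouched initial workspace and every check passes after the merged file is written, so the checker rewards only agent-produced state.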
4. Filtering, quality control, and statistical profile
ClawGym-SynData is filtered both for task validity and for verifier reliability. Task quality filters include novelty via cosine similarity, plausibility via an LLM judge using a binary decision against unrealistic tools or services, and difficulty via an LLM judge. Verification quality is checked by ensuring that code-based checkers do not award non-trivial score on the initial workspace without agent outputs, by reviewing task–checker alignment, and by assessing whether rubrics complement rather than duplicate code-based criteria (Bai et al., 29 Apr 2026).
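The novelty filter can be illustrated with a greedy cosine-similarity pass. This sketch substitutes bag-of-words count vectors for whatever embedding the paper uses, and the `max_sim` cutoff is a hypothetical parameter; only the overall idea of rejecting near-duplicates by cosine similarity comes from the text.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Stand-in embedding: lowercase token counts (an assumption, not the paper's)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def filter_novel(instructions: List[str], max_sim: float) -> List[str]:
    """Keep an instruction only if it is below max_sim to every kept one."""
    kept, vectors = [], []
    for text in instructions:
        vec = embed(text)
        if all(cosine(vec, other) < max_sim for other in vectors):
            kept.append(text)
            vectors.append(vec)
    return kept
```

The greedy pass is order-dependent: the first of two near-duplicates is kept and the later one is dropped.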
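For benchmark candidate selection, the paper uses difficulty-aware filtering with conditions of the form

$$\bar{S}_{\text{strong}} \ge \theta_{\min}, \qquad \bar{S}_{\text{small}} \le \theta_{\max}, \qquad \bar{S}_{\text{strong}} - \bar{S}_{\text{small}} \ge \Delta,$$

where $\bar{S}_{\text{strong}}$ and $\bar{S}_{\text{small}}$ denote the scores of a stronger and a smaller model on the task.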
The intent is to retain tasks that are neither trivially unsolved nor trivially solved, while preserving a performance gap between stronger and smaller models.
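One way to express this intent in code is a three-part predicate. The function name, the choice of comparing one strong and one small model, and the thresholds `min_strong`, `max_small`, and `min_gap` are all assumptions introduced for illustration; the paper's exact conditions and threshold values are not reproduced here.

```python
def is_benchmark_candidate(strong_score: float, small_score: float,
                           min_strong: float, max_small: float,
                           min_gap: float) -> bool:
    """Keep tasks that are solvable, not trivial, and separate the two models."""
    solvable = strong_score >= min_strong        # not trivially unsolved
    not_trivial = small_score <= max_small       # not trivially solved
    discriminative = (strong_score - small_score) >= min_gap
    return solvable and not_trivial and discriminative
```

With illustrative thresholds `min_strong=0.3`, `max_small=0.9`, and `min_gap=0.1`, a task scored 0.8 by the strong model and 0.4 by the small one is kept, while a task that both models solve almost perfectly is dropped.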
A human quality assessment of 50 randomly sampled training tasks reports the following scores on a 1–5 scale: Task Reasonableness 4.46, Execution Feasibility 3.50, Resource Consistency 4.36, Verification Quality 3.92, and Overall 4.06. The Execution Feasibility score is notably lower than the other dimensions, which the paper associates with the difficulty of long-horizon workflows (Bai et al., 29 Apr 2026).
Post-rollout selection statistics for the 24.5K trajectories used for supervised fine-tuning indicate an average of 13.00 rounds, 18.67K tokens, 15.82 tool calls, and 3.25 tool types per trajectory. Trajectory retention is governed by a reward-threshold rule on continuous scores in $[0, 1]$; the paper states that 0.5 is a good threshold balancing fidelity and diversity, while also noting that the exact threshold used in the core pipeline is a predefined value not otherwise specified.
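The retention rule and the summary statistics can be sketched together. The `Trajectory` fields, the strict `>` comparison (following the statement that scores must exceed the threshold), and the example data in the note below are assumptions for illustration.

```python
from dataclasses import dataclass
from statistics import fmean
from typing import Dict, List


@dataclass
class Trajectory:
    task_id: str
    score: float           # final task score, assumed to lie in [0, 1]
    rounds: int
    tool_calls: List[str]  # tool name per call, in order


def retain(trajectories: List[Trajectory], threshold: float) -> List[Trajectory]:
    """Keep trajectories whose final score strictly exceeds the threshold."""
    return [t for t in trajectories if t.score > threshold]


def profile(trajectories: List[Trajectory]) -> Dict[str, float]:
    """Average rounds, tool calls, and distinct tool types per trajectory."""
    return {
        "avg_rounds": fmean([t.rounds for t in trajectories]),
        "avg_tool_calls": fmean([len(t.tool_calls) for t in trajectories]),
        "avg_tool_types": fmean([len(set(t.tool_calls)) for t in trajectories]),
    }
```

With a threshold of 0.5, a trajectory scoring exactly 0.5 is discarded under this strict reading; whether the boundary is inclusive in the actual pipeline is not stated.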
5. Use in model training and benchmark construction
ClawGym-SynData is used to generate black-box rollout trajectories collected natively through the OpenClaw harness. Teacher models are MiniMax-M2.5 and GLM-5.1. The collection process proxies and intercepts all requests and responses, removes systematic cron and heartbeat prompts, filters unsupported tools, and aggregates fragments into coherent trajectories (Bai et al., 29 Apr 2026).
For supervised fine-tuning, the paper trains on Qwen3-4B-2507-Instruct, Qwen3-8B with context extended to 64K via YaRN, and Qwen3-30B-A3B-2507-Instruct. The resulting models are ClawGym-4B, ClawGym-8B, and ClawGym-30B-A3B. The training procedure uses multi-turn loss masking to exclude environment feedback tokens from loss, thereby focusing the optimization on policy actions such as reasoning, decisions, and tool calls.
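Multi-turn loss masking can be sketched as building a label sequence in which only assistant tokens carry supervision. The `(role, token_ids)` turn format and the function `build_labels` are hypothetical; the value -100 is the default `ignore_index` of PyTorch's `CrossEntropyLoss`, used here as a common convention rather than as a confirmed detail of the paper's training code.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(turns: Sequence[Tuple[str, List[int]]]) -> Tuple[List[int], List[int]]:
    """Concatenate turns; supervise assistant tokens, mask everything else."""
    input_ids: List[int] = []
    labels: List[int] = []
    for role, token_ids in turns:
        input_ids.extend(token_ids)
        if role == "assistant":
            # Reasoning, decisions, and tool calls contribute to the loss.
            labels.extend(token_ids)
        else:
            # System, user, and tool/environment feedback tokens are excluded.
            labels.extend([IGNORE_INDEX] * len(token_ids))
    return input_ids, labels
```

The model still conditions on the masked tokens as context; they are only removed from the loss, so the policy is not trained to predict environment output.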
The paper also explores reinforcement learning through a lightweight sandbox-parallel pipeline. It samples 2,000 SynData tasks and applies GRPO with a fixed learning rate, a train batch size of 8, 8 rollouts per prompt, 100 training steps, temperature 0.7, and a maximum response length of 64K. Rewards are taken directly from code verifier outputs, making the RL signal outcome-only.
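The outcome-only signal in GRPO is usually converted into group-relative advantages: each rollout's reward is standardized against the other rollouts for the same prompt. The sketch below follows that common formulation; the use of the population standard deviation, the `eps` constant, and the function name are assumptions, and the paper's exact variant is not specified here.

```python
import statistics
from typing import List, Sequence


def group_advantages(rewards: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Standardize verifier rewards within one prompt's rollout group."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    # eps keeps the division finite when all rollouts get the same reward
    return [(r - mean) / (std + eps) for r in rewards]
```

With 8 rollouts per prompt, a group in which every rollout receives the same verifier score yields all-zero advantages and therefore no gradient signal, which is one reason task difficulty matters for this kind of training.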
ClawGym-Bench is then constructed from SynData through difficulty-aware filtering and human-LLM review. It contains 200 instances:
| Benchmark category | Count |
|---|---|
| Productivity and Collaboration | 44 |
| Systems and Automation | 42 |
| Analysis and Reasoning | 35 |
| Content and Domain Support | 28 |
| Planning and Knowledge | 26 |
| Software Development | 25 |
Of these 200 tasks, 156 are code-only and 44 are hybrid, with hybrid tasks scored by the same weighted aggregation of code-based and rubric-based scores used in SynData. Repeated evaluation on 50 tasks over 5 runs shows low standard deviation, at most 1%; the paper reports, for example, Qwen3-8B at 36.4% mean with 0.3% standard deviation and Qwen3-30B-A3B at 42.6% mean with 1.0% standard deviation. Every benchmark task is stated to admit at least one path to full score, either through strong-agent rollouts or human-crafted references (Bai et al., 29 Apr 2026).
The paper attributes substantial downstream gains to training on SynData alone. ClawGym-30B-A3B reaches 86.00 on PinchBench and a 56.82 average on ClawGym-Bench, surpassing Qwen3-235B-A22B at 54.48. Category-wise results are described as discriminative, with no single model dominating across all categories.
6. Scope, distinctions, and limitations
ClawGym-SynData is distinct from several nearby artifacts that can be conflated because of overlapping terminology. It is not SafeClawArena, which is a security benchmark of 406 adversarial tasks for Claw-like agents organized around four attack surfaces—Skill Supply-Chain Integrity, Persistent State Exploitation, Cross-Boundary Data Flow, and Indirect Prompt Injection—and evaluated in containerized replicas of OpenClaw, NemoClaw, and SeClaw (Niu et al., 29 Jun 2026). It is also unrelated to the humanoid-robot pipeline “CLAW: Composable Language-Annotated Whole-body Motion Generation,” which produces language-annotated whole-body motion data for the Unitree G1 in MuJoCo (Cao et al., 13 Apr 2026).
Within agent benchmarking, ClawGym-SynData is also positioned as distinct from text-only reasoning benchmarks such as AIME and from structured agent-loop benchmarks such as SWE-Bench-Verified and BrowseComp, because its supervision is grounded in local workspaces, opaque interfaces, persistent state changes, multi-step tool use, and hybrid verifiers (Bai et al., 29 Apr 2026).
Its limitations are explicitly stated. Verification is focused on final-state correctness; trajectory-level safety, efficiency, and error recovery are acknowledged but left for future work. The paper does not exhaustively quantify exact coverage of tools, file types, or workspace complexity measures, and it does not analyze domain or persona bias in depth. The continuous-score filtering threshold is empirically chosen rather than derived from a more principled selection framework. Practical release details are also incomplete in the paper: licensing, versioning, detailed loaders, CLI commands, and compute requirements are not specified, although the resources are said to be “soon released” at https://github.com/ClawGym (Bai et al., 29 Apr 2026).
A plausible implication is that ClawGym-SynData should be understood as a scalable substrate for environment-grounded supervision and diagnostic evaluation of task execution, rather than as a complete account of deployment robustness. In the paper ecosystem represented here, that broader robustness question is addressed more directly by security-focused artifacts such as SafeClawArena (Niu et al., 29 Jun 2026).