Auto-ClawEval: Automated Agent Evaluation

Updated 4 July 2026

Auto-ClawEval is a design pattern for automated evaluation systems targeting autonomous, claw-like agents, replacing manual benchmarks with dynamic, reproducible assessments.
It employs pipelines that convert natural language specifications into executable tasks through parsing, generation, and validation for scalable environment construction.
The framework integrates stateful execution and evidence-based grading, using trajectory logs and multi-metric scoring to evaluate performance in diverse security, software, and workflow domains.

Auto-ClawEval denotes an automated evaluation paradigm for autonomous, tool-using, or “claw-like” agents in which task construction, environment provisioning, execution, evidence capture, and scoring are performed end to end by software rather than by manual benchmark curation. In the literature, the name is used in more than one sense: it names a concrete benchmark generated with ClawEnvKit, and it also appears as an automation-oriented implementation view of several related systems, including continuous software-evolution evaluation, evidence-grounded workflow benchmarking, state-based personal-computing evaluation, and security red-teaming pipelines. Across these variants, the unifying objective is to replace static, brittle, and labor-intensive evaluation with scalable, refreshable, trajectory-aware, and reproducible assessment of agent behavior under realistic operational constraints (Li et al., 20 Apr 2026).

1. Scope, meaning, and lineage

Across recent arXiv literature, Auto-ClawEval is not a single standardized benchmark with one task format and one metric suite. Rather, it is a family resemblance term for fully automated evaluation systems centered on claw-like agents and adjacent agentic settings. Some papers define it as a large-scale automatically generated benchmark; others define it as the operationalization of a benchmark’s construction and scoring pipeline; still others use it as the natural automation layer over security, coding, or state-based agent evaluation.

Domain	Auto-ClawEval realization	Representative source
Claw-like tool-use environments	Automatically generated executable environments	(Li et al., 20 Apr 2026)
Continuous software evolution	DeepCommit + EvoClaw runner over Milestone DAGs	(Deng et al., 13 Mar 2026)
Trustworthy workflow evaluation	Evidence-grounded, trajectory-aware grading suite	(Ye et al., 7 Apr 2026)
Live workflow benchmarking	Refreshable signal layer + fixed release snapshot	(Li et al., 30 Apr 2026)
State-based personal computing	Automated task authoring, validation, and final-state verification	(Liang et al., 9 Jun 2026)
Coding harness comparison	Adapter-based SWE-style evaluation for OpenClaw-like harnesses	(Zheng et al., 10 Jun 2026)
Security evaluation	MITM red-teaming, spec-driven synthesis, taint-tracked security benchmarking, and dual-boundary defense evaluation	(Zhao et al., 19 Mar 2026, Cheng et al., 1 Jun 2026, Niu et al., 29 Jun 2026, Ma et al., 8 Jun 2026)

This broad usage suggests that Auto-ClawEval is best understood as a design pattern rather than as a single artifact. The pattern combines automated task or environment generation, controlled execution substrates, and verifiable grading. A plausible implication is that the term functions as a convergence point for several previously separate benchmarking traditions: benchmark generation, stateful execution, trajectory logging, and automated judging.

Related antecedents outside the core “claw” nomenclature reinforce that interpretation. AutoEval for mobile agents automates reward-signal design through Structured Substate Representation and an autonomous Judge System, while AutoEval for real-world robot manipulation policies automates success detection, scene reset, and queue-based around-the-clock evaluation. Both articulate the same shift from hand-written evaluation code toward autonomous evaluation infrastructure (Sun et al., 4 Mar 2025, Zhou et al., 31 Mar 2025).

2. Automated task and environment construction

A central property of Auto-ClawEval systems is that they do not assume manually authored benchmark instances as the only source of evaluation data. Instead, they synthesize or reconstruct executable tasks from higher-level signals.

In ClawEnvKit, Auto-ClawEval is built from a three-module pipeline comprising a Parser, a Generator, and a Validator. The formal environment is defined as $E = (P, M, C)$, where $P$ is a natural-language task specification, $M = (\mathcal{T}, \mathcal{O})$ is the interaction interface of callable tools and audit logs, and $C = \{(c_i, w_i)\}$ is the evaluation functional. From natural-language requests such as a desired service mix or difficulty level, the Parser extracts structured “intent atoms”; the Generator instantiates prompts, tools, fixtures, and scoring; and the Validator enforces structural validity, atom coverage, feasibility, and consistency. Using this pipeline, the benchmark comprises 1,040 environments across 24 categories, built on an initial mock library of 15 services (Li et al., 20 Apr 2026).
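As a minimal sketch of the formal object, the tuple $E = (P, M, C)$ and two of the Validator's checks (structural validity and atom coverage) could be modeled as below. All names here (`Environment`, `Check`, `validate`) are hypothetical and are not ClawEnvKit's API, and the requirement that check weights sum to 1 is an assumption made for illustration.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Check:
    """One scoring criterion c_i with its weight w_i (hypothetical type)."""
    name: str
    weight: float
    atoms: frozenset = frozenset()  # intent atoms this check covers


@dataclass
class Environment:
    """E = (P, M, C): prompt, interaction interface, weighted checks."""
    prompt: str       # P: natural-language task specification
    tools: list       # T: names of callable tools
    audit_logs: list  # O: names of audit logs
    checks: list = field(default_factory=list)  # C = {(c_i, w_i)}


def validate(env, intent_atoms):
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if not env.prompt.strip():
        problems.append("empty prompt")
    if not env.tools:
        problems.append("no callable tools")
    if not env.checks:
        problems.append("no checks")
    elif abs(sum(c.weight for c in env.checks) - 1.0) > 1e-9:
        # Assumption: weights are normalized to sum to 1.
        problems.append("check weights do not sum to 1")
    covered = set()
    for check in env.checks:
        covered |= check.atoms
    missing = sorted(set(intent_atoms) - covered)
    if missing:
        problems.append("uncovered atoms: " + ", ".join(missing))
    return problems


env = Environment(
    prompt="Reschedule the meeting and notify attendees.",
    tools=["calendar.update", "mail.send"],
    audit_logs=["calendar", "mail"],
    checks=[
        Check("event_moved", 0.6, frozenset({"calendar"})),
        Check("mail_sent", 0.4, frozenset({"mail"})),
    ],
)
print(validate(env, {"calendar", "mail"}))
print(validate(env, {"calendar", "mail", "crm"}))
```

The second call reports an uncovered atom because no check references the hypothetical `crm` atom, which is the kind of gap an atom-coverage check is meant to catch. Feasibility and consistency checks are not modeled here.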

In EvoClaw, environment construction is not prompt-to-environment synthesis but history-to-benchmark reconstruction. DeepCommit models software evolution as a Directed Acyclic Graph $G = (V, E)$, where nodes are milestones and edges are strong or weak inter-milestone dependencies. It preprocesses commit histories, constructs Milestone DAGs through Seed Discovery, Milestone Consolidation, Dependency Inference, and Granularity Refinement, then resolves runtime environments by reconstructing repository states, generating Dockerfiles, and collecting Fail-to-Pass and Pass-to-Pass tests. The retained benchmark spans 7 repositories, 98 graded, human-verified milestones, and 124 inter-milestone dependencies (Deng et al., 13 Mar 2026).

STAGE-Claw automates a different construction problem: from a short task hint, it produces a realistic personal-computing scenario with environment configuration, user-facing prompt, hidden ground truth, and executable verifier. Its authoring stage outputs env_make.md, question.md, and Ground_truth/; an independent checker agent validates structure, reproducibility, verifiability, and difficulty, using a weighted 100-point checker score and an admission threshold of at least 80. In the reported release, 46 candidates yielded 40 accepted tasks, with an average of 2.67 repair iterations (Liang et al., 9 Jun 2026).
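The admission rule can be illustrated with a short sketch. The source states only that the checker score is a weighted 100-point total over structure, reproducibility, verifiability, and difficulty, with admission at 80 or above; the point allocation in `MAX_POINTS` and the function names are hypothetical.

```python
ADMISSION_THRESHOLD = 80  # from the text: admit at a checker score of at least 80

# Hypothetical point allocation. The source does not give the actual
# weights, only that the four dimensions total 100 points.
MAX_POINTS = {
    "structure": 20,
    "reproducibility": 30,
    "verifiability": 30,
    "difficulty": 20,
}


def checker_score(ratings):
    """Combine per-dimension ratings in [0, 1] into a 0-100 score."""
    if set(ratings) != set(MAX_POINTS):
        raise ValueError("ratings must cover exactly the four dimensions")
    for name, value in ratings.items():
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"rating for {name} is outside [0, 1]")
    return sum(MAX_POINTS[name] * ratings[name] for name in MAX_POINTS)


def is_admitted(ratings):
    """Apply the admission threshold to the weighted score."""
    return checker_score(ratings) >= ADMISSION_THRESHOLD


strong = {"structure": 1.0, "reproducibility": 1.0,
          "verifiability": 1.0, "difficulty": 0.5}
weak = {"structure": 1.0, "reproducibility": 0.5,
        "verifiability": 0.5, "difficulty": 1.0}
print(checker_score(strong), is_admitted(strong))
print(checker_score(weak), is_admitted(weak))
```

Under this allocation the first candidate scores 90 and is admitted, while the second scores 70 and would go back for repair. For scale, the reported 40 accepted tasks out of 46 candidates corresponds to an acceptance rate of roughly 87%.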

Claw-Eval-Live adds a refreshable upstream signal layer. It begins from a time-stamped ClawHub Top-500 snapshot, clusters signals into workflow patterns, derives family weights, expands seeds into runnable candidates, and then selects a public subset via mixed-integer linear programming. The reported release starts from 157 screened runnable candidates and publishes a fixed 105-task snapshot spanning 22 fine-grained families (Li et al., 30 Apr 2026).

These systems differ in source material—natural-language capability descriptions, commit logs, task hints, or workflow-demand signals—but they converge on the same architectural claim: benchmark construction itself can be automated while preserving executability and verification.

3. Stateful execution and evidence-grounded grading

Auto-ClawEval systems distinguish themselves from final-output-only benchmarks by insisting that evaluation be tied to what an agent actually did in a stateful environment.

Claw-Eval makes this principle explicit through trajectory-aware grading. Each trial runs in a fresh Docker sandbox, and every action is captured through three independent evidence channels: execution traces, service audit logs, and environment snapshots. A strict temporal firewall ensures that graders, reference data, and verification scripts are injected only after the agent finishes. The suite comprises 300 human-verified tasks across 9 categories, with grading grounded in 2,159 rubric items (Ye et al., 7 Apr 2026).

Claw-Eval-Live applies the same evidence-grounded logic to a live workflow benchmark. It records execution traces, service audit logs, service state, and post-run workspace artifacts. Deterministic checks dominate where evidence is sufficient, while structured LLM judging is reserved for semantic dimensions such as completeness, organization, or tone. This design supports a benchmark that is simultaneously refreshable in its upstream signal layer and reproducible in its published release snapshot (Li et al., 30 Apr 2026).

STAGE-Claw shifts the grading target from textual outputs to the correctness of the final system state. A benchmark instance is formalized as $\mathcal{B} = (q, E_0, G, R, V)$, where $q$ is the visible prompt, $E_0$ the initial environment, $G$ the target final-state specification, $R$ the rubric, and $V$ the executable verifier. Agents operate in isolated OS environments rebuilt from env_make.md, and success is adjudicated by executable programs over files, calendar, reminders, notes, email, and related tool states rather than by self-report (Liang et al., 9 Jun 2026).

EvoClaw supplies a stateful software-engineering analogue. A DAG-based scheduler unlocks a milestone only after all of its prerequisites have been completed, and the agent works over a persistent codebase that carries forward both correct modifications and technical debt. On milestone completion, the working repository is snapshotted into an isolated container for reproducible scoring, while the development environment remains continuous and stateful (Deng et al., 13 Mar 2026).
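The unlocking rule can be sketched in a few lines. This is an illustration under simplifying assumptions, not EvoClaw's scheduler: the function names and milestone identifiers are hypothetical, and every dependency is treated as blocking, whereas the source distinguishes strong from weak edges.

```python
def unlocked(prereqs, completed):
    """Milestones not yet completed whose prerequisites are all completed."""
    return sorted(
        milestone
        for milestone, deps in prereqs.items()
        if milestone not in completed and set(deps) <= completed
    )


def run_in_order(prereqs):
    """Complete milestones one at a time, always taking an unlocked one."""
    completed, order = set(), []
    while len(completed) < len(prereqs):
        ready = unlocked(prereqs, completed)
        if not ready:
            raise ValueError("cycle or unknown prerequisite in milestone graph")
        completed.add(ready[0])
        order.append(ready[0])
    return order


# Hypothetical milestone graph: m4 depends on both m2 and m3.
prereqs = {"m1": [], "m2": ["m1"], "m3": ["m1"], "m4": ["m2", "m3"]}
print(unlocked(prereqs, {"m1"}))
print(run_in_order(prereqs))
```

In a continuous run, each completed milestone would also leave its changes in the shared codebase, so a regression introduced at `m2` remains present when `m4` is attempted. The sketch models only the ordering constraint, not that persistence.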

Claw-SWE-Bench offers a repository-level version of the same principle. It standardizes a clean Docker workspace, fixed prompt, fixed runtime budget, git-state-based patch extraction, and the official SWE-bench evaluator. Crucially, it requires solutions to exist as edits under /testbed, with model_patch exported from repository state via git diff rather than from model text, thereby separating actual workspace evolution from textual claims about a patch (Zheng et al., 10 Jun 2026).

The resulting conception of evaluation is evidence-grounded, not narration-grounded. Auto-ClawEval systems therefore treat traces, logs, workspace diffs, service state, and snapshots as first-class benchmark artifacts.

4. Metrics and formal objectives

Because Auto-ClawEval is designed for stateful and safety-critical settings, its metric suites are typically multidimensional. Accuracy alone is not treated as sufficient.

In ClawEnvKit’s Auto-ClawEval, the reward structure is inherited from Claw-Eval and combines safety, completion, and robustness:

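$$\text{score} = \text{safety} \times \left(\alpha \cdot \text{completion} + \beta \cdot \text{robustness}\right)$$

Here safety is a multiplicative gate, completion aggregates weighted task checks, and robustness measures recovery from injected API errors; $\alpha$ and $\beta$ are the weights on completion and robustness. The benchmark reports safety, completion, robustness, and mean score, with Pass$^k$ aggregation used to reduce variance (Li et al., 20 Apr 2026).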

EvoClaw separates feature implementation from regression control. Its milestone-level metrics are Recall, Precision, and their harmonic Score, together with Resolve Rate as a strict success criterion requiring all Fail-to-Pass and all Pass-to-Pass tests for a milestone to pass. This decomposition is specifically intended to expose the tension between adding new functionality and preserving existing behavior under continuous software evolution (Deng et al., 13 Mar 2026).
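A small sketch makes the decomposition concrete. It rests on an interpretation that the text does not spell out: Recall is taken as the fraction of Fail-to-Pass tests that pass after the agent's change, and Precision as the fraction of Pass-to-Pass tests that still pass. The function name and the example counts are hypothetical.

```python
def milestone_metrics(f2p, p2p):
    """Compute milestone metrics from per-test outcomes.

    f2p: booleans for Fail-to-Pass tests (True = passes after the change)
    p2p: booleans for Pass-to-Pass tests (True = still passes)
    """
    if not f2p:
        raise ValueError("a milestone needs at least one Fail-to-Pass test")
    recall = sum(f2p) / len(f2p)
    # Assumption: with no Pass-to-Pass tests, nothing can regress.
    precision = sum(p2p) / len(p2p) if p2p else 1.0
    total = recall + precision
    score = 2 * recall * precision / total if total else 0.0  # harmonic mean
    resolved = all(f2p) and all(p2p)  # strict: every test must pass
    return {"recall": recall, "precision": precision,
            "score": score, "resolved": resolved}


# Hypothetical milestone: 3 of 4 new behaviors work, 1 of 10 old tests broke.
metrics = milestone_metrics(
    f2p=[True, True, True, False],
    p2p=[True] * 9 + [False],
)
print(metrics)
```

With these counts Recall is 0.75 and Precision is 0.9, yet the milestone is not resolved, which shows how the strict Resolve Rate can stay low even when the harmonic Score looks healthy.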

Claw-Eval scores each trial by combining Completion, Safety, and Robustness, and reports Average Score, Pass@$k$, and Pass$^k$. Pass@$k$ captures the capability ceiling (at least one successful trial out of $k$), whereas Pass$^k$ captures the reliability floor (success on every one of the $k$ trials). The gap between the two is used as a consistency diagnostic rather than treated as mere variance (Ye et al., 7 Apr 2026).
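The two aggregates follow directly from the definitions above: a task counts toward the capability ceiling if any of its trials succeeds, and toward the reliability floor only if all of them do. The sketch below uses hypothetical function names and made-up trial outcomes.

```python
def pass_at_k(trials):
    """Capability ceiling: at least one of the k trials succeeded."""
    return any(trials)


def pass_hat_k(trials):
    """Reliability floor: every one of the k trials succeeded."""
    return all(trials)


def summarize(results):
    """results maps a task id to its list of k boolean trial outcomes."""
    if not results or any(len(t) == 0 for t in results.values()):
        raise ValueError("every task needs at least one trial")
    n = len(results)
    at_k = sum(pass_at_k(t) for t in results.values()) / n
    hat_k = sum(pass_hat_k(t) for t in results.values()) / n
    return {"pass_at_k": at_k, "pass_hat_k": hat_k, "gap": at_k - hat_k}


# Made-up outcomes for four tasks with k = 3 trials each.
results = {
    "t1": [True, True, True],
    "t2": [True, False, True],
    "t3": [False, False, False],
    "t4": [False, True, False],
}
print(summarize(results))
```

Here three of four tasks succeed at least once but only one succeeds every time, so the gap of 0.5 flags inconsistency that neither number shows on its own.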

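Claw-Eval-Live adopts a public pass rule with threshold $\tau$ on the task score. Its headline metrics are Pass Rate,

$$\text{Pass Rate} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[s_i \ge \tau\right],$$

where $s_i$ is the score on task $i$ and $N$ is the number of tasks, and Overall Completion Score, a 0–100 average over task scores. The benchmark also defines task-level discrimination as the standard deviation of completion scores across models (Li et al., 30 Apr 2026).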
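These three quantities are simple to compute once per-task scores are available. The sketch below is illustrative only: the threshold value of 80, the model names, and the scores are invented, and the use of the population standard deviation is an assumption, since the text does not say which estimator is used.

```python
from statistics import mean, pstdev


def pass_rate(scores, threshold):
    """Fraction of tasks whose score meets the pass threshold."""
    return sum(s >= threshold for s in scores) / len(scores)


def overall_completion(scores):
    """Average task score on the 0-100 scale."""
    return mean(scores)


def discrimination(scores_by_model, task_index):
    """Spread of one task's scores across models (population std dev)."""
    return pstdev(scores[task_index] for scores in scores_by_model.values())


# Invented scores for three models on three tasks.
scores_by_model = {
    "model_a": [90, 40, 85],
    "model_b": [70, 40, 95],
    "model_c": [50, 40, 30],
}
THRESHOLD = 80  # hypothetical value, not taken from the benchmark

print(pass_rate(scores_by_model["model_a"], THRESHOLD))
print(overall_completion(scores_by_model["model_a"]))
print(discrimination(scores_by_model, 0), discrimination(scores_by_model, 1))
```

The second task has zero discrimination because every model scores 40 on it, so it cannot separate models, whereas the first task spreads them widely.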

Security-oriented variants introduce additional metrics. ClawTrap defines attack success rate, unsafe output rate, anomaly attribution rate, and optional task completion under attack. SeClaw formalizes sample-level judgments, coverage, attack success, and an overall harmonic score. SecureClaw’s common harness reports attack success rate and leak rate, including per-channel leakage over internal-relay and final-output channels (Zhao et al., 19 Mar 2026, Cheng et al., 1 Jun 2026, Ma et al., 8 Jun 2026).

The common theme is that Auto-ClawEval metrics are designed to preserve separability among distinct failure modes: unsafe actions, incomplete completion, weak recovery, regression introduction, unauthorized effects, and information leakage.

5. Continuous, live, and state-based evaluation domains

A major branch of Auto-ClawEval targets general task execution under realistic tool and service conditions. In ClawEnvKit, the benchmark’s scale permits large cross-model and cross-harness comparisons: the paper evaluates 4 model families through 8 agent harness frameworks and reports that harness engineering can improve the top-line mean by up to 15.7 points over a bare ReAct Agent Loop. It also reports that Auto-ClawEval matches or exceeds human-curated environments on coherence and clarity at 13,800x lower cost, and that completion, not safety or robustness, is the primary axis of variation because robustness is exactly 100.0 under the injected error protocol across reported harness and model summaries (Li et al., 20 Apr 2026).

A second branch targets continuous software maintenance rather than one-off task solving. EvoClaw shows that performance on isolated milestones, often above 80%, collapses in continuous mode to at most 38.03%, with Resolve Rate as low as 13.37%. The mechanism identified is asymmetric: cumulative Recall grows near-linearly, but Precision saturates, so regressions and technical debt accumulate and cap downstream capability. Error-chain analysis further reports domination by P1 Inherited and PX Missing in later milestones, with Logic Error as the dominant root cause at approximately 57% (Deng et al., 13 Mar 2026).

A third branch targets trustworthy evaluation of autonomous workflows. Claw-Eval demonstrates that a trajectory-opaque judge misses 44% of safety violations and 13% of robustness failures that its hybrid evidence pipeline catches. It also reports that increasing injected service-error rates primarily degrades consistency: Pass@3 remains relatively stable, whereas Pass$^3$ drops by as much as 24.2% in some settings. The result is a model of agent quality in which repeatability under perturbation is a separate capability axis from peak success (Ye et al., 7 Apr 2026).

Claw-Eval-Live emphasizes freshness of workload demand. Its first release contains 105 tasks, 87 service-backed workflows, and 18 workspace repair tasks. The leading model, Claude Opus 4.6, passes 66.7% of tasks, and no model reaches 70%. Family-level analysis identifies HR / People, Management, and multi-system business workflows as persistent bottlenecks, whereas local workspace repair is easier but still unsaturated (Li et al., 30 Apr 2026).

STAGE-Claw extends state-based evaluation into personal-computing environments. It reports 40 accepted tasks in a realistic macOS setting and finds that virtual textual simulation overestimates performance relative to state-based checks; for example, DeepSeek-V4-Pro rises from 59.78 in state-based evaluation to 66.71 in virtual evaluation, and Qwen3.5-Plus rises from 59.74 to 64.57. This supports the claim that real tool-state changes expose execution failures that artifact-only or text-only evaluation can hide (Liang et al., 9 Jun 2026).

Claw-SWE-Bench shows that Auto-ClawEval also applies to coding-harness evaluation. On its 350-instance multilingual benchmark, a minimal direct-diff adapter scores only 19.1% Pass@1 with GLM 5.1, whereas the full adapter reaches 73.4% with the same backbone. The paper attributes the gap to workspace hygiene, fair prompt and runtime contracts, and git-state-based patch export rather than to model capability changes alone. Under fixed models, harness choice changes Pass@1 by up to 27.4 percentage points, confirming that evaluation must treat harness engineering as a first-class variable (Zheng et al., 10 Jun 2026).

6. Security-focused Auto-ClawEval

Security evaluation constitutes one of the most technically distinctive Auto-ClawEval branches because it requires automated threat injection, stateful monitoring, and deterministic attribution of unsafe behavior.

ClawTrap introduces a MITM-based red-teaming framework for OpenClaw-style web agents. It intercepts HTTP(S) flows through a researcher-controlled proxy path and applies three attack modes (Static HTML Replacement, Iframe Popup Injection, and Dynamic Content Modification) through a rule-driven interception engine built around interceptor.py, matcher.py, and transformer.py. The framework’s metrics separate deception, safety, and provenance awareness through attack success rate, unsafe output rate, anomaly attribution rate, and optional task completion under attack. The reported findings are qualitative rather than aggregate-quantitative, but they show consistent model stratification: weaker models trust tampered observations, while stronger models attribute anomalies to possible interception or proxy rewrite and adopt safer fallback strategies (Zhao et al., 19 Mar 2026).

SeClaw generalizes security benchmarking from attack scripts to task synthesis. It combines spec-driven security task synthesis with execution-based evaluation in SeClaw Docker, using structured risk specifications over resource, task, environment, and intrinsic risk channels. Each instantiated task includes workspace configuration, MCP bindings, mock services, local files, initialization scripts, and task metadata. Evaluation is trajectory-aware rather than outcome-only, with structured logs over model, tool/service, and artifact views, and with security scoring based on sample-level judgments, coverage, attack success, and an overall harmonic score (Cheng et al., 1 Jun 2026).

SafeClawArena pushes the system-level view further by modeling a claw-like agent as an “agentic computer system” with an OS-like gateway runtime, Skills resembling installed applications, and Plugins resembling privileged extensions. Its benchmark contains 406 adversarial tasks across four attack surfaces—Skill Supply-Chain Integrity, Persistent State Exploitation, Cross-Boundary Data Flow, and Indirect Prompt Injection—and monitors nine output channels with canary-marked credentials. Empirically, the highest attack success rate reaches 70%; malicious Plugins succeed in 100% of cases regardless of the LLM on unhardened platforms; and SeClaw reduces GPT-5.4’s attack success rate from 70% to 22%, partly through utility-security tradeoffs rather than active defenses alone (Niu et al., 29 Jun 2026).
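Canary-based monitoring reduces, at its simplest, to scanning each monitored channel for marker strings planted in credentials. The sketch below assumes exact substring matching; the channel names, credential names, and marker values are all invented for illustration and are not taken from SafeClawArena.

```python
def find_leaks(channels, canaries):
    """Map each channel to the credential names whose canary appears in it.

    channels: channel name -> captured text
    canaries: credential name -> unique marker string planted in the secret
    """
    leaks = {}
    for channel, text in channels.items():
        hits = sorted(name for name, marker in canaries.items() if marker in text)
        if hits:
            leaks[channel] = hits
    return leaks


# Invented markers and channel captures.
canaries = {"api_key": "CANARY-7f3a", "db_password": "CANARY-91bc"}
channels = {
    "final_output": "The report has been generated.",
    "outbound_http": "POST /collect token=CANARY-7f3a",
    "file_write": "notes: nothing sensitive here",
}
print(find_leaks(channels, canaries))
```

Exact matching keeps attribution deterministic, but it will not detect a secret that has been encoded, split, or paraphrased before leaving the system.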

SecureClaw evaluates defenses rather than attacks alone. Its Auto-ClawEval harness implements a dual-boundary architecture: a trusted Gateway mediates reads by replacing raw secrets with opaque handles and bounded summaries, while a trusted Executor mediates writes through a PREVIEW $\rightarrow$ COMMIT protocol. In a common harness over AgentDojo, AgentLeak, and ASB, the reported results are 0% attack success rate on ASB, 0.64% attack success rate on AgentDojo, and 3.23% overall leak on AgentLeak’s attacked parity lane. The paper argues that single-boundary defenses are insufficient because effect authorization and plaintext confinement address distinct attack surfaces (Ma et al., 8 Jun 2026).
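The division of labor between the two boundaries can be sketched as follows. This is a toy model under stated assumptions, not SecureClaw's implementation: the class interfaces, the `handle:` prefix, the policy callable, and the example host are hypothetical, and bounded summaries are not modeled.

```python
import secrets


class Gateway:
    """Read boundary: the agent sees opaque handles, never raw secrets."""

    def __init__(self):
        self._vault = {}

    def wrap(self, secret_value):
        handle = "handle:" + secrets.token_hex(8)
        self._vault[handle] = secret_value
        return handle

    def resolve(self, handle):
        # Meant to be called only by trusted components such as the Executor.
        return self._vault[handle]


class Executor:
    """Write boundary: each effect is previewed, then explicitly committed."""

    def __init__(self, gateway, policy):
        self._gateway = gateway
        self._policy = policy  # callable(action dict) -> bool
        self._pending = {}
        self.effects = []      # stand-in for real side effects

    def preview(self, action):
        if not self._policy(action):
            raise PermissionError("action rejected at PREVIEW")
        token = secrets.token_hex(8)
        self._pending[token] = dict(action)
        return token

    def commit(self, token):
        action = self._pending.pop(token)  # KeyError if never previewed
        resolved = {}
        for key, value in action.items():
            is_handle = isinstance(value, str) and value.startswith("handle:")
            # Handles are turned back into plaintext only inside this boundary.
            resolved[key] = self._gateway.resolve(value) if is_handle else value
        self.effects.append(resolved)


gateway = Gateway()
handle = gateway.wrap("sk-live-123")
executor = Executor(gateway, policy=lambda a: a.get("host") == "api.example.com")
token = executor.preview({"host": "api.example.com", "auth": handle})
executor.commit(token)
print(handle.startswith("handle:"), len(executor.effects))
```

The agent only ever holds `handle`, so a prompt-injected instruction to print the credential has nothing to print, and a write to an unapproved host fails at `preview` before any effect occurs. The two checks are independent, which mirrors the argument that plaintext confinement and effect authorization cover different attack surfaces.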

Taken together, these security-oriented systems show that Auto-ClawEval is not limited to capability measurement. It also functions as an automated substrate for adversarial benchmarking, taint tracking, provenance checking, and defense evaluation.

7. Limitations, validity threats, and future directions

A recurrent limitation is that “Auto-ClawEval” lacks a single settled specification. The term spans several benchmark families with different task objects, state representations, and scoring semantics. This suggests a need for clearer interoperability layers, but the papers do not yet define one unified standard.

Construction pipelines remain imperfect. DeepCommit’s topology-driven reconstruction can fragment intent-defined milestones and under-represent process-level dependencies, with a scikit-learn case study reporting only moderate agreement. Its source-file whitelists and Fail-to-Pass requirements bias coverage toward tightly connected subgraphs, and retained milestones exclude certain maintenance-only changes from scoring paths (Deng et al., 13 Mar 2026). STAGE-Claw, despite executable verifiers and final-state grounding, is limited to 40 tasks and a first-valid-run protocol, with platform dependence and realistic OS brittleness explicitly acknowledged (Liang et al., 9 Jun 2026).

Environment fidelity is another persistent issue. ClawEnvKit’s mock services are deterministic and auditable, but the paper notes that they do not reproduce production messiness such as auth flows, schema drift, rate tiers, or external state. Its current scope—24 categories, 15 services, tasks of up to 20 tool-calling rounds, and 300-second timeouts—is substantial but incomplete. The same paper identifies voice, GUI automation, multi-agent delegation, and regulated workflows as not yet covered (Li et al., 20 Apr 2026). Claw-Eval-Live similarly grounds task mixtures in public workflow-demand signals, but those signals are explicitly a popularity prior rather than ground truth about real usage or economic value (Li et al., 30 Apr 2026).

Judging remains a technical fault line. Claw-Eval confines LLM judging to open-ended criteria and grounds it in collected artifacts, yet it still identifies judgment subjectivity and scaling rubric creation as live issues. Its release contains 2,159 human-verified rubric items, and maintaining that level of granular verification is labor-intensive (Ye et al., 7 Apr 2026). STAGE-Claw uses guarded LLM adjudication only as fallback, but the mere existence of that fallback indicates that deterministic verification does not cover every failure mode (Liang et al., 9 Jun 2026).

Security benchmarks face additional realism gaps. ClawTrap’s current findings are qualitative, and the paper leaves HTTPS interception specifics and site-specific defenses largely implicit. SeClaw notes coverage gaps for dynamic web and UI variability, long-running workflows, and multi-agent interactions. SafeClawArena’s taint tracking is intentionally precise for canary strings, but side channels and broader covert-channel behavior remain outside its measured scope (Zhao et al., 19 Mar 2026, Cheng et al., 1 Jun 2026, Niu et al., 29 Jun 2026).

The direction of travel is nonetheless clear. Future work described across the literature includes larger and more accurate itinerary harvesting for continuous software evolution, broader service libraries and live-sandbox hybrids for task generation, richer state and memory provenance controls, more robust verification scaffolds to reduce blind thrashing and regression accumulation, longer-horizon multi-session tasks, and more explicit treatment of utility-security tradeoffs in defended systems (Deng et al., 13 Mar 2026, Li et al., 20 Apr 2026, Ma et al., 8 Jun 2026). This suggests that Auto-ClawEval is evolving from benchmark automation into a general methodology for evaluating whether autonomous agents are not merely capable in isolated settings, but reliable under persistence, recoverable under perturbation, and auditable under real operational risk.