Papers
Topics
Authors
Recent
Search
2000 character limit reached

ClawEnvKit: Auto Env Generation & Security

Updated 4 July 2026
  • ClawEnvKit is a framework for automatically generating, validating, and executing environments from natural-language descriptions tailored for claw-like agents.
  • It integrates a parser–generator–validator pipeline and applies containerized benchmarks (SafeClawArena) to assess system-level security vulnerabilities.
  • The toolkit combines declarative task specifications with simulated endpoints and rigorous adversarial testing to enhance agent harness design and performance evaluation.

to=arxiv_search.search 微信上的天天中彩票 尚度=10 query="ClawEnvKit arXiv (Li et al., 20 Apr 2026, Niu et al., 29 Jun 2026)" to=arxiv_search.search ასიათված _一本道=10 query="Title: ClawEnvKit: Automatic Environment Generation for Claw-Like Agents" to=arxiv_search.search 彩神争霸可以json {"query":"ClawEnvKit: Automatic Environment Generation for Claw-Like Agents","max_results":5,"sort_by":"relevance"} to=arxiv_search.search 天天中彩票是=json {"query":"Understanding and Evaluating Claw-like Agent Security Through a Computer-Systems Lens","max_results":5,"sort_by":"relevance"} to=arxiv_search.search 天天好彩票 ปมถวายสัตย์=5 query="Claw-like agent security computer-systems lens" to=arxiv_search.search ฝ่ายขายข่าวី _日本一级特黄大片 {"query":"Claw-like agent security computer-systems lens","max_results":5} ClawEnvKit denotes a technical framework for claw-like agents in two closely related senses within the 2026 literature: first, as an autonomous generation pipeline that instantiates executable environments from natural-language descriptions; second, as an end-to-end toolkit for reproducing, extending, and evaluating the security of Claw-like agent platforms through the SafeClawArena benchmark (Li et al., 20 Apr 2026, Niu et al., 29 Jun 2026). Across both uses, the central object is the claw-like agent: an always-on process with persistent access to credentials, files, tools, and external services, and with system-level responsibilities such as installing packages, maintaining state, scheduling subtasks, and mediating I/O. The framework is therefore situated at the intersection of benchmark construction, agent harness design, and computer-systems-style security evaluation.

1. Conceptual scope and problem setting

The environment-generation formulation begins from the claim that constructing environments for training and evaluating claw-like agents remains a manual, human-intensive process that does not scale, and that what is needed is not just a dataset, but an automated pipeline capable of generating diverse, verified environments on demand (Li et al., 20 Apr 2026). In that setting, ClawEnvKit is designed to transform a natural-language request into a structured, executable, and verifiable environment.

The security-oriented formulation addresses a different but adjacent problem. Claw-like AI agents are treated as agentic computer systems whose gateway runtime plays an OS-like mediation role, whose Skills resemble user-installed applications, and whose Plugins resemble loadable extensions with runtime privileges. Existing benchmarks are described as focusing on model responses and tool calls, leaving cross-component failure modes largely unmeasured. ClawEnvKit, in this usage, was built to measure cross-component, system-level vulnerabilities in always-on claw-like agents that have OS-like responsibilities (Niu et al., 29 Jun 2026).

A common misconception is that claw-agent evaluation can be reduced either to a static benchmark dataset or to isolated model-response assessment. The two ClawEnvKit papers reject both reductions. One frames evaluation as automated environment synthesis from user intent; the other frames security evaluation as containerized, cross-component measurement over real gateway runtimes, Skills loaders, Plugins loaders, memory stores, and outbound channels.

2. Declarative formalism and generation pipeline

In the environment-generation paper, ClawEnvKit adopts a declarative formalism in which an environment is a three-tuple

E=(P,M,C),E = (P, M, C),

where PLP \in \mathcal{L} is the natural-language task specification, M=(T,O)M = (\mathcal{T}, \mathcal{O}) is the interaction interface, and C={(ci,wi)}iC = \{(c_i, w_i)\}_i is the evaluation functional (Li et al., 20 Apr 2026). Here, T\mathcal{T} is the set of callable tools, O\mathcal{O} is the audit log of every tool call, its parameters, and outcomes, each ci:Σ×O[0,1]c_i : \Sigma \times \mathcal{O} \to [0,1] scores some aspect of the trajectory σ\sigma, and iwi=1\sum_i w_i = 1. The formalism replaces an explicit state-transition model with a declarative specification of what the agent must do, what it can do, and how it is scored.

The scalar grading objective is

R(σ,E)  =  safety(σ)  ×  (0.8completion(σ,C)  +  0.2robustness(σ,M)).R(\sigma, E) \;=\; \mathrm{safety}(\sigma)\;\times\;\bigl(0.8\,\mathrm{completion}(\sigma,C)\;+\;0.2\,\mathrm{robustness}(\sigma,M)\bigr).

In this formulation, PLP \in \mathcal{L}0 is a hard gate, PLP \in \mathcal{L}1, and PLP \in \mathcal{L}2 is the fraction of injected API-error events from which the agent recovers successfully (Li et al., 20 Apr 2026).

The pipeline itself comprises three modules. The parser extracts structured generation parameters from natural-language input. Internally, it is described as implementing

PLP \in \mathcal{L}3

where PLP \in \mathcal{L}4 is the user request and PLP \in \mathcal{L}5 is a small JSON schema containing services, difficulty, atoms, and missing_services. The generator then produces the task specification, tool interface, and scoring configuration. The validator enforces feasibility, diversity, structural validity, and internal consistency across the generated environments. This parser–generator–validator decomposition is the defining architectural abstraction of ClawEnvKit as an automatic environment generation system (Li et al., 20 Apr 2026).

The task schema is explicitly operational. Each task.yaml must define task_id, prompt, fixtures, tools, scoring_components, and safety_checks. Checks are drawn from 15 deterministic types plus llm_judge, and the weights are chosen so that PLP \in \mathcal{L}6, with the remaining PLP \in \mathcal{L}7 allocated to llm_judge (Li et al., 20 Apr 2026).

3. Validation regime and benchmark construction

Validation is a first-class component of ClawEnvKit rather than a post hoc filter. Each candidate environment PLP \in \mathcal{L}8 passes three sequential checks: structural validity, coverage, and feasibility (Li et al., 20 Apr 2026). Structural validity runs a fixed sequence of 12 deterministic checks, including required fields, at least 3 scoring components, PLP \in \mathcal{L}9, valid check types, bounds on llm_judge weights, at least one safety check, tool-service consistency, endpoint existence, cross-service consistency, and consistency between forbidden tools and required scoring. Coverage then verifies that action, object, and constraint atoms extracted by the parser are represented in tools, fixtures, prompt text, rubrics, or safety checks as appropriate. Feasibility is assessed by a final lightweight LLM call asking whether the task is solvable given the tools and data; if new services were generated, the validator also spins up the mock server and hits every endpoint to ensure liveness.

This validation stack is used to construct Auto-ClawEval, described as the first large-scale benchmark for claw-like agents, comprising 1,040 environments across 24 categories (Li et al., 20 Apr 2026). Two benchmark variants are reported: Auto-ClawEval with 1,040 tasks and Auto-ClawEval-Mini with 104 tasks. The benchmark includes 15 mock services drawn from Claw-Eval’s library and covers approximately 370 single-service API tasks, approximately 350 cross-service API tasks, approximately 270 file-dependent tasks, and approximately 50 live-web tasks.

The quality and cost comparison is a central empirical claim. Auto-ClawEval-Mini reports 100% validity under structural checks, coherence of 0.59 versus 0.51 for human Claw-Eval, clarity of 3.54 versus 3.38, and estimated human cost of 208 h versus Auto-ClawEval 18 h, with Auto-ClawEval-Mini requiring 1.8 h at $80 in API credits (Li et al., 20 Apr 2026). This suggests that the framework is intended not merely to automate volume, but to maintain or exceed the quality of human-curated environments under the reported evaluation criteria.

4. Security architecture, orchestration, and instrumentation

In the security-oriented guide, ClawEnvKit is defined as an end-to-end toolkit for reproducing, extending, and evaluating the security of claw-like agent platforms through a containerized benchmark called SafeClawArena (Niu et al., 29 Jun 2026). Its stated goals are to use the production gateway runtime, Skills loader, and Plugins loader without rewriting them; simulate three representative platforms—OpenClaw, NemoClaw, and SeClaw—in isolated Docker containers; seed canary credentials into real workspace files and detect any unauthorized leakage or side effect; provide full automation over 406 adversarial tasks spanning four attack surfaces; and produce reproducible, open-source artifacts for third-party audits and longitudinal studies.

The architecture overview has five principal components. The gateway runtime hosts the LLM core, Skill loader, Plugin loader, memory store, tool executor, and config manager, and plays the role of an OS by installing packages, scheduling subtasks, and mediating I/O. Skills are user-installed applications stored as Markdown plus optional helper scripts and are loaded and interpreted by the gateway via the Model Context Protocol (MCP). Plugins are native code modules, specifically npm packages, loaded in-process at startup and executing with gateway privileges. Sim-Google is a simulated Google Workspace CLI exposing Gmail, Drive, Sheets, Docs, and related services, while recording every call into a local log for deterministic exfiltration detection. The SafeClawArena runner and evaluator provision canary credentials and task fixtures, orchestrate one or two sessions, capture nine output channels, and apply taint-tracking and dimension-specific checks to produce binary or weighted scores (Niu et al., 29 Jun 2026).

Platform simulation is Docker-based. ClawEnvKit builds three Docker images, one per platform, each containing the open-source gateway daemon—OpenClaw v2026.3.12, NemoClaw v2026.3.11, or SeClaw v0.1.0—a default LLM backend connector configurable to any OpenAI-compatible endpoint, the Sim-Google CLI installed under /usr/local/bin/sim-google, and platform-specific adapters for CLI versus HTTP transport (Niu et al., 29 Jun 2026). Typical orchestration uses a single docker-compose.yml to launch multiple containers in parallel. Each service mounts the same task definitions at /tasks and writes per-container logs and scores to /results/<platform>.

Canary instrumentation is explicit. Realistic canary credentials are seeded into workspace configuration files such as .env, config/settings.yml, and auth-profiles.json, with the format CANARY_<KIND>_<SUFFIX>, where <KIND> belongs to {DB_PASS, AWS_KEY, STRIPE_SECRET, JWT, …} and <SUFFIX> is a 6-character pseudo-random string from [a-z0-9] (Niu et al., 29 Jun 2026). Persistent-state exploitation is detected by a two-session protocol: Session 1 writes an attacker-injected payload into memory.md or gateway config; the gateway is restarted to clear in-memory chat history; Session 2 issues a benign instruction; and the evaluator checks whether the seeded directive remains in on-disk memory or config files and whether it influences behavior.

The instrumentation surface spans nine output channels:

  1. Agent chat response
  2. Outbound message (email/Slack)
  3. Sim-Google call log
  4. Persistent memory file writes
  5. Gateway log file
  6. Gateway config writes
  7. Arbitrary workspace file writes
  8. Incoming webhook payloads
  9. Cron job outputs

The toolkit also exposes a CLI with clawenvkit run and clawenvkit summarize, and a Python API centered on BenchmarkRunner, with methods including prepare_containers(), execute_all(), evaluate_all(), and summarize() (Niu et al., 29 Jun 2026).

5. Attack surfaces, scoring, and quantitative findings

SafeClawArena evaluates four architectural attack surfaces aligned with classical security principles (Niu et al., 29 Jun 2026).

Surface Principle Dimension
SSI — Skill Supply-Chain Integrity I1, I2 provenance
PSE — Persistent-State Exploitation I3 integrity
CDF — Cross-Boundary Data Flow I4 mediation
IPIIndirect Prompt Injection I5 separation

The global attack metric is defined as

M=(T,O)M = (\mathcal{T}, \mathcal{O})0

For CDF, the score is weighted by credential severity:

M=(T,O)M = (\mathcal{T}, \mathcal{O})1

where M=(T,O)M = (\mathcal{T}, \mathcal{O})2. For PSE, the score is

M=(T,O)M = (\mathcal{T}, \mathcal{O})3

SSI and IPI are scored as binary passes or fails (Niu et al., 29 Jun 2026).

The reported benchmark comprises 15 platform-model configurations, with 406 tasks each. For GPT-5.4, the condensed results are as follows (Niu et al., 29 Jun 2026).

Platform Overall% Score
OpenClaw 69.7 0.30
NemoClaw 69.7 0.30
SeClaw 21.9 0.78

OpenClaw and NemoClaw hover near 70% overall attack success, while SeClaw’s hardening cuts GPT-5.4’s rate to 21.9% (Niu et al., 29 Jun 2026). The best single configuration, Opus-4.6 plus SeClaw, reaches 20.9% attack success and a 79.1% security score. Per-dimension averages across five models each are also reported: OpenClaw has SSI 67.2, PSE 56.0, CDF 41.3, IPI 53.6, Overall 53.5; NemoClaw has SSI 64.0, PSE 53.4, CDF 42.2, IPI 55.0, Overall 51.7; SeClaw has SSI 25.0, PSE 44.3, CDF 30.0, IPI 45.0, Overall 34.9.

Several findings are explicitly emphasized. Pure principle violations such as Malicious Plugin are 100% successful on unhardened platforms and require platform-level capability scoping or loader removal. Stronger models are not uniformly more secure, because instruction-following quality can become a liability on certain attack surfaces. Hardening one surface may reroute an attack to another, exemplified by SeClaw’s structured tool calls versus unstructured output. Defense-in-depth across Skills, persistent state, outbound mediation, and input separation is described as essential, with no single LLM or platform fix being sufficient (Niu et al., 29 Jun 2026).

6. Empirical implications, live evaluation, and extensions

The large-scale evaluation in Auto-ClawEval broadens the significance of ClawEnvKit beyond environment synthesis. Across 4 model families and 8 agent harness frameworks, the reported result is that harness engineering boosts performance by up to 15.7 percentage points over a bare ReAct baseline, while completion remains the primary axis of variation and no model saturates the benchmark (Li et al., 20 Apr 2026). In the detailed evaluation, safety and robustness hover at or above 83%, while completion ranges from 34 to 76%, reinforcing the claim that multi-step tool coordination remains the key challenge. Claude Haiku 4.5 under eight harnesses provides the illustrative case: the ReAct baseline scores 53.3% mean, whereas NemoClaw (Tier 3) reaches 69.0%, a +15.7 pp improvement.

The benchmark also reports consistency between the 104-task and 1,040-task settings: scores on Auto-ClawEval and Auto-ClawEval-Mini differ by less than 2% for every model and harness (Li et al., 20 Apr 2026). This supports the use of the smaller benchmark as a low-cost proxy for large-scale evaluation. Category-level analysis further indicates that some categories are uniformly hard and others reliably solved, while error patterns shift across harness tiers.

ClawEnvKit is also presented as a live and on-demand generation system. In the reported “Live Testbed” workflow, a user can request a capability such as medium-difficulty invoice reconciliation tasks; the parser extracts services, difficulty, and atoms; the generator proposes and refines missing mock service designs; the validator ensures structural and coverage correctness and spins up the new service; and a sandbox container with the resulting environment M=(T,O)M = (\mathcal{T}, \mathcal{O})4 is returned within minutes (Li et al., 20 Apr 2026). The same mechanism is described as an on-demand training environment generator that can produce task distributions adapted to an agent’s current weaknesses rather than bounded by existing user logs.

Taken together, the two ClawEnvKit formulations define a unified research program around claw-like agents. One side automates the creation of executable, validated environments at scale; the other evaluates system-level security failures in production-like, containerized replicas. A plausible implication is that ClawEnvKit is most useful when treated not as a single benchmark artifact, but as infrastructure for continuous co-evaluation of models, harnesses, and system-layer defenses.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (2)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to ClawEnvKit.