CLAWSAFETY: LLM Agent Security

Updated 3 July 2026

CLAWSAFETY is a comprehensive framework defining and mitigating seven layers of system-level risks in LLM-powered agents with high-privilege access.
It employs empirical benchmarks, such as ATBench-Claw, to quantify attack success rates (e.g., 40–75%) and evaluate multi-channel vulnerabilities.
Layered mitigation strategies, including runtime policy enforcement and external monitoring, balance security with agent functionality.

ClawSafety (CLAWSAFETY) is both a methodological framework and a series of empirical benchmarks for evaluating and enforcing safety in LLM-powered agents—especially those, like OpenClaw, that possess high-privilege access to local environments and external services. Its scope encompasses architectural principles, attack taxonomies, lifecycle risk models, benchmark suites, and layered mitigation strategies, all designed to address the concrete system-level hazards introduced by tool-augmented, persistent, and extensible autonomous agents. CLAWSAFETY exposes not just the failure modes of individual models, but also the emergent security properties (or vulnerabilities) of agent frameworks when deployed in adversarial and realistic operational contexts (Wei et al., 1 Apr 2026, Ying et al., 13 Mar 2026, Niu et al., 29 Jun 2026, Li et al., 13 Mar 2026).

1. Systemic Threats in LLM Agent Frameworks

The move from prompt-only LLM interfaces to agent frameworks with tool, file, and network access dramatically enlarges the attack surface. Whereas traditional LLM safety focuses on text-level refusals and alignment, CLAWSAFETY recognizes that agentized settings must account for hazards at seven distinct layers: prompt injection, supply-chain compromise, context and memory poisoning, sequential tool attack chains, credential leakage, privilege escalation, and unintended destructive actions (Wei et al., 1 Apr 2026, Shan et al., 11 Mar 2026, Ying et al., 13 Mar 2026).

A formal threat model captures these risks by defining the agent as an autonomous loop operating over user goals $U$, mixed-trust inputs $I$ (web, email, workspace), persistent state $W$, installed extensions $E$, and system-level tools $T$, dispatching actions $A$ that can cause irreversible side effects. Adversaries may inject instructions through any input channel, poison agent memory, subvert supply chains, or exploit deployment misconfigurations. Layered risk taxonomies such as the Tri-Layered Model (Ying et al., 13 Mar 2026) and the CIK model (Capability, Identity, Knowledge) (Wang et al., 6 Apr 2026) provide systematic coverage of the sources of vulnerability.
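The components of this threat model can be written down as plain data types. The following is a minimal sketch, not taken from any of the cited papers: the class names, the two-level `Trust` enum, and the `needs_review` rule are all hypothetical and serve only to show how $U$, $I$, $W$, $E$, $T$, and $A$ relate to one another.

```python
from dataclasses import dataclass, field
from enum import Enum


class Trust(Enum):
    TRUSTED = "trusted"      # e.g. the user's own goal
    UNTRUSTED = "untrusted"  # e.g. web pages, inbound email


@dataclass(frozen=True)
class Input:
    channel: str   # "web", "email", "workspace", ...
    content: str
    trust: Trust


@dataclass(frozen=True)
class Action:
    tool: str
    args: tuple = ()
    irreversible: bool = False


@dataclass
class AgentState:
    user_goal: str                                  # U
    inputs: list = field(default_factory=list)      # I (mixed trust)
    memory: dict = field(default_factory=dict)      # W (persistent state)
    extensions: set = field(default_factory=set)    # E
    tools: set = field(default_factory=set)         # T


def needs_review(state: AgentState, action: Action) -> bool:
    """Hypothetical rule: flag an irreversible action whenever any
    untrusted input is present in the agent's context."""
    tainted = any(i.trust is Trust.UNTRUSTED for i in state.inputs)
    return action.irreversible and tainted


state = AgentState(user_goal="tidy inbox", tools={"delete_file", "read_file"})
state.inputs.append(Input("email", "ignore previous instructions", Trust.UNTRUSTED))
flagged = needs_review(state, Action("delete_file", ("~/notes",), irreversible=True))
```

The rule is deliberately coarse. It illustrates why mixed-trust inputs and irreversible actions are modeled separately: the hazard arises from their combination rather than from either alone.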

2. Benchmarking Architectures and Empirical Assessments

CLAWSAFETY benchmarks are characterized by their ecological validity, multi-channel attack vectors, and lifecycle-aware evaluation. Early benchmarks like ATBench-Claw (Yang et al., 16 Apr 2026) and ClawSafety (Wei et al., 1 Apr 2026) define taxonomies spanning harm domain, injection vector, and harmful action, embedding adversarial content in trusted channels such as skill files, emails, and web pages. Scenario designs typically establish long-lived, contextualized sessions that simulate genuine workflows before adversarial triggers are introduced.

Attack Success Rate (ASR), defined as the fraction of trials in which an unauthorized, irreversible, or policy-violating effect is achieved, is the primary metric. Empirical findings demonstrate elevated ASRs even for frontier models in high-privilege scaffolds: in one study spanning 2,520 sandboxed trials, overall ASR ranged from 40% (Claude Sonnet 4.6) to 75% (GPT-5.1), with skill-injection vectors showing the highest success rates (Wei et al., 1 Apr 2026). Layered analysis reveals that particular scaffolding choices (agent runtime, plugin orchestration, and extension load path) cause significant variance—sometimes exceeding the differences between LLM backbones themselves (Wei et al., 1 Apr 2026, Wang et al., 3 Apr 2026).
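As defined above, ASR is the ratio $\mathrm{ASR} = N_{\text{success}} / N_{\text{trials}}$, usually reported per model or per injection vector. The sketch below computes it from a list of trial records. The record format and the sample outcomes are made up for illustration and are not data from any cited study.

```python
from collections import defaultdict


def attack_success_rate(trials):
    """trials: iterable of (group, succeeded) pairs.
    Returns a dict mapping each group to its ASR in [0, 1]."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for group, succeeded in trials:
        totals[group] += 1
        hits[group] += int(succeeded)
    return {group: hits[group] / totals[group] for group in totals}


# Illustrative outcomes only, grouped by injection vector.
trials = [
    ("skill", True), ("skill", True), ("skill", False),
    ("email", True), ("email", False),
    ("web", False), ("web", False), ("web", True),
]
asr = attack_success_rate(trials)
```

Grouping by a second key (for example, the agent scaffold) in the same way makes it possible to compare variance across scaffolds against variance across model backbones.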

Trajectory-based audits (Chen et al., 16 Feb 2026, Ye et al., 7 Apr 2026) and state-diff evaluations in realistic productivity workspaces (Li et al., 6 Apr 2026) further reveal that even agents with strong safety alignment are vulnerable to compound failures arising from trajectory-level ambiguity, context poisoning, and multi-tool escalation.

3. Formal Security Properties and Architectural Controls

From a formal security perspective, CLAWSAFETY advances three core system invariants: integrity of external effects, confidentiality of sensitive values, and explicit, quantifiable declassification. SecureClaw (Ma et al., 8 Jun 2026) exemplifies a principled, dual-boundary enforcement architecture:

Read boundary (plaintext confinement): All accesses to sensitive values $v \in V$ are brokered by a trusted gateway, which replaces $v$ with an opaque handle $h(v) = \mathrm{HMAC}_{k_{\text{handle}}}(\cdot)$ and a bounded summary $D(v)$. The runtime only manipulates handles and summaries, never secrets directly.
Write boundary (effect authorization): All effectful actions $a \in A$ must pass through a PREVIEW $\rightarrow$ COMMIT protocol. Only the canonical request, as signed and authorized in the commit phase, can be executed, thwarting runtime-injected or mutated proposals.
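The read boundary can be illustrated with a small gateway that keeps plaintext to itself and hands the runtime only a handle and a summary. This is a sketch under stated assumptions, not SecureClaw's implementation: the `Gateway` class, the choice of HMAC-SHA256 over the value itself as the HMAC input, and the contents of the summary are all hypothetical.

```python
import hashlib
import hmac
import secrets


class Gateway:
    """Hypothetical trusted gateway: stores plaintext, exposes handles."""

    def __init__(self):
        self._key = secrets.token_bytes(32)  # k_handle, never leaves the gateway
        self._vault = {}                     # handle -> plaintext

    def register(self, value: str):
        # h(v): HMAC-SHA256 keyed with k_handle. Using the value itself as
        # the HMAC input is an assumption; the source leaves the input open.
        handle = hmac.new(self._key, value.encode(), hashlib.sha256).hexdigest()
        self._vault[handle] = value
        # D(v): a bounded summary exposing only coarse metadata.
        summary = {"length": len(value), "kind": "secret"}
        return handle, summary

    def resolve(self, handle: str) -> str:
        """Intended for the gateway-side executor only, at effect time."""
        return self._vault[handle]


gateway = Gateway()
handle, summary = gateway.register("sk-live-123")
```

The agent runtime would see only `handle` and `summary`. Because the key stays inside the gateway, the runtime cannot recompute or invert handles on its own.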

The integrity guarantee is stated formally in terms of the underlying cryptographic primitives, so the security of authorized effects rests on the security of those primitives.
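A PREVIEW-then-COMMIT flow can be sketched with a keyed MAC over a canonical encoding of the request. The `EffectGate` class and the JSON canonicalization below are assumptions chosen for illustration. They show how a token can bind approval to one exact request, and they are not the protocol as specified in the cited work.

```python
import hashlib
import hmac
import json
import secrets


class EffectGate:
    """Hypothetical write boundary: only a previewed request may execute."""

    def __init__(self):
        self._key = secrets.token_bytes(32)

    @staticmethod
    def _canonical(request: dict) -> bytes:
        # Canonical form: sorted keys, no insignificant whitespace.
        return json.dumps(request, sort_keys=True, separators=(",", ":")).encode()

    def preview(self, request: dict) -> str:
        """Return a token that binds approval to this exact request."""
        return hmac.new(self._key, self._canonical(request), hashlib.sha256).hexdigest()

    def commit(self, request: dict, token: str) -> bool:
        """Authorize execution only if the request matches the preview."""
        expected = self.preview(request)
        return hmac.compare_digest(expected, token)


gate = EffectGate()
request = {"tool": "send_email", "to": "a@example.com"}
token = gate.preview(request)
same_request_ok = gate.commit({"to": "a@example.com", "tool": "send_email"}, token)
mutated_ok = gate.commit({"tool": "send_email", "to": "evil@example.com"}, token)
```

A request mutated after preview fails the comparison, while key order does not matter because of the canonical encoding. The sketch omits replay protection; a real design would presumably add a nonce or expiry to each token.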

Policies, attachable at both the runtime and effect execution boundaries, enable deny-aware recovery, fine-grained authorization, and bounded declassification channels.

4. Evaluation of Layered and Lifecycle Security

Broad defense evaluation spans several mitigation types, from runtime policy enforcement and plugin/skill governance [ClawKeeper (Liu et al., 25 Mar 2026); SafeClaw-R (Wang et al., 28 Mar 2026)], to in-process interception with hybrid risk accumulation and time-decaying thresholds (PRISM (Li, 12 Mar 2026)), to external, tamper-evident audit chains and hot-reloadable configuration (AgentWall (Aravind, 24 Mar 2026), PRISM). Three-layer models like ClawKeeper interleave:

Skill-based protection: Contextual policy injection at the instruction level, interpreted by the LLM for hard and soft constraints.
Plugin-based protection: Runtime configuration and tool invocation guards, behavioral anomaly detection, and anchoring of allowed actions.
Watcher-based protection: Decoupled, external monitoring and intervention, capable of pausing or halting agent execution on risk elevation.
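The idea of accumulating risk over time and intervening when it rises can be sketched as follows. This is not PRISM's or ClawKeeper's algorithm: for simplicity the sketch decays the accumulated score against a fixed threshold rather than decaying the threshold, and the class name, half-life, and threshold values are all hypothetical.

```python
import math


class RiskAccumulator:
    """Hypothetical watcher state: an exponentially decaying risk score."""

    def __init__(self, threshold: float = 1.0, half_life_s: float = 60.0):
        self.threshold = threshold
        self.decay = math.log(2) / half_life_s
        self.score = 0.0
        self.last_t = None

    def observe(self, t: float, risk: float) -> bool:
        """Record a risk signal at time t (seconds, non-decreasing).
        Returns True when the agent should be paused."""
        if self.last_t is not None:
            # Older evidence counts for less: halve the score every half-life.
            self.score *= math.exp(-self.decay * (t - self.last_t))
        self.last_t = t
        self.score += risk
        return self.score >= self.threshold


watcher = RiskAccumulator(threshold=1.0, half_life_s=60.0)
first = watcher.observe(0.0, 0.6)    # isolated signal
second = watcher.observe(60.0, 0.6)  # one half-life later
third = watcher.observe(61.0, 0.6)   # burst right after
```

With these values, signals spaced a full half-life apart stay below the threshold, while two signals in quick succession cross it. That is the behavior a watcher would want for distinguishing isolated anomalies from an escalating chain.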

Evaluation demonstrates substantial, but incomplete, risk reduction. For example, empirical defense success rates for ClawKeeper reach 85–90% across OWASP-esque threat categories (Liu et al., 25 Mar 2026). SafeClaw-R enforcement nodes, wrapping every functional skill in an execution graph, yield 95.2% accuracy in productivity benchmarks and 100% in code execution scenarios (Wang et al., 28 Mar 2026). However, several studies highlight residual risk: file-system level protections often trade functionality for safety, and agent evolution (via persistent memory and code updates) reopens vulnerabilities unless explicitly governed (Wang et al., 6 Apr 2026, Niu et al., 29 Jun 2026).

5. Challenges, Open Problems, and Directions for Future Hardening

Despite layered mitigations, several systemic challenges persist:

Amplification and Cascade: Early-stage reconnaissance or prompt injection multiplies downstream risk, often by escalating through multi-tool attack chains and persistent memory poisoning. The probability of irreversible compromise is a non-linear function of early-stage leakage (Wang et al., 3 Apr 2026).
Evaluation-Governance Gap: Many safety checks are brittle to adversarial reframing; trajectory-opaque or output-only scoring misses up to 44% of actual safety violations (Ye et al., 7 Apr 2026).
Framework Dependence: Security properties depend as much on agent scaffolding and runtime orchestration (plugin interface, skill load audit, session state management) as on the LLM's own refusal behavior, necessitating evaluation of joint configurations (Wei et al., 1 Apr 2026, Niu et al., 29 Jun 2026, Li et al., 13 Mar 2026).
Tradeoff Boundaries: Strict policy or file-system protections block attacks but also significantly hinder legitimate agent evolution and personalization (Wang et al., 6 Apr 2026). Utility-security tradeoffs must be carefully balanced.
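The gap between output-only and trajectory-aware scoring noted above can be made concrete with a toy example. The step format, the `UNSAFE_TOOLS` policy, and both scoring functions are hypothetical; real evaluators use far richer judgments.

```python
UNSAFE_TOOLS = {"delete_file", "send_credentials"}  # hypothetical policy


def output_only_unsafe(trajectory) -> bool:
    """Judge only the final step, and only if it is a message."""
    final = trajectory[-1]
    return final["type"] == "message" and "password" in final["text"].lower()


def trajectory_unsafe(trajectory) -> bool:
    """Judge every step, including intermediate tool calls."""
    bad_call = any(
        step["type"] == "tool_call" and step["tool"] in UNSAFE_TOOLS
        for step in trajectory
    )
    return bad_call or output_only_unsafe(trajectory)


trajectory = [
    {"type": "tool_call", "tool": "delete_file"},
    {"type": "message", "text": "Done, the inbox is tidy."},
]
missed_by_output_only = not output_only_unsafe(trajectory)
caught_by_trajectory = trajectory_unsafe(trajectory)
```

Here the final message looks benign, so an output-only scorer passes a run in which a destructive tool call has already happened.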

Research recommendations include institution of code signing and sandboxing for all executable skills, operationalization of immutable audit trails for agent memory, enforcement of differential update and approval channels for persistent state, and adoption of defense-in-depth architectures combining static, runtime, and externalized monitoring layers (Wang et al., 6 Apr 2026, Liu et al., 25 Mar 2026, Ying et al., 13 Mar 2026, Li et al., 13 Mar 2026, Wang et al., 28 Mar 2026). Formal verification of agent plans and end-to-end benchmarks for risk escalation, extension governance, and workflow security remain critical open areas (Wang et al., 25 May 2026, Yang et al., 16 Apr 2026).
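One common way to make an audit trail tamper-evident is a hash chain, in which each entry commits to the hash of its predecessor. The sketch below assumes SHA-256 and JSON-serializable events. The function names and entry layout are hypothetical and are not drawn from AgentWall, PRISM, or any other cited system.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder hash preceding the first entry


def _digest(prev_hash: str, event: dict) -> str:
    payload = json.dumps(event, sort_keys=True).encode()
    return hashlib.sha256(prev_hash.encode() + payload).hexdigest()


def append(log: list, event: dict) -> None:
    """Append an event, chaining it to the previous entry's hash."""
    prev_hash = log[-1]["hash"] if log else GENESIS
    log.append({"event": event, "prev": prev_hash, "hash": _digest(prev_hash, event)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev_hash = GENESIS
    for entry in log:
        if entry["prev"] != prev_hash:
            return False
        if entry["hash"] != _digest(prev_hash, entry["event"]):
            return False
        prev_hash = entry["hash"]
    return True


log = []
append(log, {"op": "memory.write", "key": "preference"})
append(log, {"op": "tool.call", "tool": "shell"})
intact = verify(log)
```

A chain of this kind detects modification but does not prevent it. An attacker who can rewrite the whole log can recompute every hash, so the latest hash would need to be anchored somewhere the agent cannot write.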

6. Synthesis: CLAWSAFETY as a Lifecycle Security Paradigm

CLAWSAFETY advances the agent security field beyond isolated prompt-level refusal and static tool filtering, embedding security across the lifecycle of agent operation—from perception (input isolation, skill vetting), through reasoning and policy enforcement (dynamic intent verification, risk accumulation), to response, governance, and adaptive mitigation (audit chains, live threat feed integration, external Watchers). Benchmarks and architectures evaluated under this umbrella consistently reveal the necessity of cross-boundary, zero-trust execution policies, continuous runtime and state monitoring, and policies for human-in-the-loop escalation on irreversible or cross-boundary effects (Ma et al., 8 Jun 2026, Liu et al., 25 Mar 2026, Wang et al., 28 Mar 2026, Niu et al., 29 Jun 2026, Li et al., 13 Mar 2026). Empirical evidence shows that defense-in-depth, combined with continuous adaptation and auditability, is necessary but not sufficient; only coordinated architectural and governance-level advances can approach resilient, real-world deployment of agent frameworks.

References:

"ClawSafety: 'Safe' LLMs, Unsafe Agents" (Wei et al., 1 Apr 2026)
"A Systematic Security Evaluation of OpenClaw and Its Variants" (Wang et al., 3 Apr 2026)
"SecureClaw: Clawing Back Control of LLM Agents" (Ma et al., 8 Jun 2026)
"ClawsBench: Evaluating Capability and Safety of LLM Productivity Agents in Simulated Workspaces" (Li et al., 6 Apr 2026)
"ClawKeeper: Comprehensive Safety Protection for OpenClaw Agents Through Skills, Plugins, and Watchers" (Liu et al., 25 Mar 2026)
"SafeClaw-R: Towards Safe and Secure Multi-Agent Personal Assistants" (Wang et al., 28 Mar 2026)
"Understanding and Evaluating Claw-like Agent Security Through a Computer-Systems Lens" (Niu et al., 29 Jun 2026)
"Defensible Design for OpenClaw: Securing Autonomous Tool-Invoking Agents" (Li et al., 13 Mar 2026)
"Security of OpenClaw Agents: Fundamentals, Attacks, and Countermeasures" (Wang et al., 25 May 2026)
"OpenClaw PRISM: A Zero-Fork, Defense-in-Depth Runtime Security Layer for Tool-Augmented LLM Agents" (Li, 12 Mar 2026)
"ClawTrap: A MITM-Based Red-Teaming Framework for Real-World OpenClaw Security Evaluation" (Zhao et al., 19 Mar 2026)
"Claw-Eval: Toward Trustworthy Evaluation of Autonomous Agents" (Ye et al., 7 Apr 2026)
"A Trajectory-Based Safety Audit of Clawdbot (OpenClaw)" (Chen et al., 16 Feb 2026)
"Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw" (Wang et al., 6 Apr 2026)
"Benchmarks for Trajectory Safety Evaluation and Diagnosis in OpenClaw and Codex: ATBench-Claw and ATBench-CodeX" (Yang et al., 16 Apr 2026)
"AgentWall: A Runtime Safety Layer for Local AI Agents" (Aravind, 24 Mar 2026)