- The paper proposes ClawdGo, a framework enabling autonomous agents to train security awareness at runtime without modifying model parameters.
- It employs a weakest-first scheduling strategy and a persistent memory architecture (CSMA) to boost threat detection and cover 11 of 12 security dimensions.
- The study highlights a calibration challenge where excessive training induces defensive refusal bias, emphasizing the need for balanced training intensity.
ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents
Context and Motivation
The proliferation of autonomous AI agents, exemplified by widespread adoption of frameworks like OpenClaw, has introduced a multi-faceted attack surface distinct from that of earlier human-centric and traditional application architectures. The scale of deployment, indicated by OpenClaw’s rapid growth (250,000+ GitHub stars, 135,000+ public instances), has resulted in extensive ecosystem exposure, with confirmed vulnerabilities and supply-chain contamination. Conventional agent-level defenses focus almost exclusively on perimeter hardening (static code analysis, runtime filtering, and sandboxing), leaving the agent’s internal threat judgment unaddressed. Without endogenous, inference-time security awareness, agents remain exposed to prompt injection, memory poisoning, supply-chain attacks, and advanced social engineering. ClawdGo targets this gap with a framework for autonomous security awareness training that operates entirely at runtime, without modifying model parameters or requiring external infrastructure.
Framework Overview
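ClawdGo is organized around four principal elements: TLDT (Three-Layer Domain Taxonomy), ASAT (Autonomous Security Awareness Training), CSMA (Cross-Session Memory Accumulation), and SACP (Security Awareness Calibration Problem). The first three are mechanisms; SACP is the formalized tradeoff that constrains how intensively they should be applied.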
TLDT: Three-Layer Domain Taxonomy
TLDT organizes twelve trainable threat awareness dimensions into three hierarchical layers:
- Self-Defence (S1–S4): prompt injection, memory poisoning, supply-chain attacks, credential misuse.
- Owner-Protection (O1–O4): phishing relay, social engineering, privacy leakage, unsafe network exposure.
- Enterprise-Security (E1–E4): data handling, compliance, insider risk, incident response.
This taxonomy extends prior agent-security rubrics (e.g., OWASP LLM Top-10, MITRE ATLAS) and differs from them by incorporating owner-centric adversarial vectors, reflecting the emerging trend of AI agents being targeted as proxies in business email compromise (BEC) and social engineering campaigns.
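The three layers and twelve dimensions listed above can be written down as a small lookup table. This is a minimal sketch, not the paper's data model: the names `Layer`, `TLDT`, and `dimension_ids` are hypothetical, and the mapping of each topic to a numbered ID (S1, S2, ...) assumes the topics are listed in ID order.

```python
from enum import Enum


class Layer(Enum):
    """The three TLDT layers; the value is the ID prefix used in the text."""
    SELF_DEFENCE = "S"
    OWNER_PROTECTION = "O"
    ENTERPRISE_SECURITY = "E"


# Topic names are copied from the list above. Assumption: list order
# corresponds to the numeric suffix (first topic -> 1, and so on).
TLDT = {
    Layer.SELF_DEFENCE: [
        "prompt injection", "memory poisoning",
        "supply-chain attacks", "credential misuse",
    ],
    Layer.OWNER_PROTECTION: [
        "phishing relay", "social engineering",
        "privacy leakage", "unsafe network exposure",
    ],
    Layer.ENTERPRISE_SECURITY: [
        "data handling", "compliance",
        "insider risk", "incident response",
    ],
}


def dimension_ids():
    """Return a flat mapping such as {'S1': 'prompt injection', ...}."""
    return {
        f"{layer.value}{index}": topic
        for layer, topics in TLDT.items()
        for index, topic in enumerate(topics, start=1)
    }
```

A flat ID-to-topic mapping like this is convenient as the key space for per-dimension scores in the scheduling and memory components described next.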
ASAT: Autonomous Security Awareness Training
ASAT delivers endogenous training via a self-play loop in which the agent alternates between the roles of attacker, defender, and evaluator, guided by weakest-first curriculum scheduling. Each session selects the dimension with the lowest current proficiency, generates relevant scenarios, and reinforces threat modelling and response capabilities. This approach prevents the dimension fixation observed with uniform-random scheduling and is implemented at inference time as a standard skill invocation, avoiding model parameter modification.
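The difference between weakest-first and uniform-random scheduling can be illustrated with a toy simulation. This is a sketch under stated assumptions rather than the ASAT implementation: the function names are hypothetical, and the fixed `gain` per session is a stand-in for whatever score the evaluator role would actually assign.

```python
import random


def weakest_first(scores):
    """Pick the dimension with the lowest score (ties broken by ID order)."""
    return min(sorted(scores), key=lambda dim: scores[dim])


def uniform_random(scores, rng):
    """Pick any dimension with equal probability, ignoring proficiency."""
    return rng.choice(sorted(scores))


def run_curriculum(scores, pick, sessions, gain=5.0, cap=100.0):
    """Simulate training sessions; returns (final scores, dimensions visited).

    Assumption: every session raises the chosen dimension by a fixed `gain`,
    capped at `cap`. The input mapping is not modified.
    """
    scores = dict(scores)
    visited = set()
    for _ in range(sessions):
        dim = pick(scores)
        visited.add(dim)
        scores[dim] = min(cap, scores[dim] + gain)
    return scores, visited


if __name__ == "__main__":
    start = {"S1": 70.0, "S2": 90.0, "O1": 60.0}  # synthetic scores
    print(run_curriculum(start, weakest_first, sessions=3, gain=15.0))
    rng = random.Random(0)
    print(run_curriculum(start, lambda s: uniform_random(s, rng), sessions=3))
```

Because weakest-first always re-ranks after each session, a dimension stops being selected once it overtakes another, which is the property that spreads training across dimensions instead of revisiting already proficient ones.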
CSMA: Persistent Security Memory and ACP
CSMA implements a four-layer persistent memory architecture:
- L0: distilled axioms (“soul.md”)
- L1: per-dimension skill profiles
- L2: append-only episodic logs
- L3: scenario libraries
Axiom Crystallisation Promotion (ACP) governs the conversion of episodic experiences into durable axioms, with decay-based revision of axioms whose confidence falls below a threshold. Threat recognition and mitigation skill therefore accumulate across sessions through stored memory, in a manner analogous to semantic memory, rather than through model fine-tuning.
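The four layers and the promote-then-decay behaviour of ACP might be sketched as follows. Everything specific here is an assumption made for illustration: the class name `CSMAStore`, the rule that a lesson is promoted after `min_support` occurrences in the episodic log, the starting confidence, the multiplicative decay factor, and the choice to model "revision" as removal of axioms that fall below the threshold. The source does not specify any of these.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class CSMAStore:
    axioms: dict = field(default_factory=dict)     # L0: axiom text -> confidence
    profiles: dict = field(default_factory=dict)   # L1: dimension ID -> latest score
    episodes: list = field(default_factory=list)   # L2: append-only episodic log
    scenarios: list = field(default_factory=list)  # L3: scenario library

    def log_episode(self, dimension, lesson, score):
        """Append one episode (never rewritten) and update the skill profile."""
        self.episodes.append(
            {"dimension": dimension, "lesson": lesson, "score": score}
        )
        self.profiles[dimension] = score

    def crystallise(self, min_support=3, initial_conf=0.8):
        """Promote lessons seen at least `min_support` times to axioms.

        Hypothetical promotion rule; re-promotion refreshes confidence.
        """
        counts = Counter(episode["lesson"] for episode in self.episodes)
        for lesson, seen in counts.items():
            if seen >= min_support:
                current = self.axioms.get(lesson, 0.0)
                self.axioms[lesson] = max(current, initial_conf)

    def decay(self, factor=0.9, threshold=0.5):
        """Decay every axiom's confidence and drop those below `threshold`."""
        self.axioms = {
            axiom: conf * factor
            for axiom, conf in self.axioms.items()
            if conf * factor >= threshold
        }
```

In this sketch a caller would run `log_episode` during a session, then `crystallise` and `decay` at session end, so that frequently confirmed lessons persist while unreinforced ones fade.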
SACP: Security Awareness Calibration Problem
ClawdGo formalizes the tradeoff between training intensity and agent utility. Once training intensity passes an optimum (τ∗), recall continues to rise while precision declines, which manifests as defensive refusal bias: over-trained agents misclassify legitimate tasks as hostile. This phenomenon, previously observed at the model level, is characterized here as a measurable loss of task utility within agent training.
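To make the tradeoff concrete, the sketch below computes precision and recall at several training intensities and picks the intensity with the best balance. It is illustrative only: the confusion counts in `SYNTHETIC_SWEEP` are invented, the function names are hypothetical, and using F1 as the objective for locating τ∗ is an assumption, since the source does not state which utility measure defines the optimum.

```python
def precision(tp, fp):
    """Share of threat flags that were real threats."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp, fn):
    """Share of real threats that were flagged."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f1(p, r):
    """Harmonic mean of precision and recall (stand-in utility measure)."""
    return 2 * p * r / (p + r) if (p + r) else 0.0


# Synthetic data, not from the paper:
# intensity tau -> (true positives, false positives, false negatives)
SYNTHETIC_SWEEP = {
    1: (50, 2, 50),   # under-trained: few false alarms, many misses
    2: (80, 5, 20),   # balanced
    3: (95, 40, 5),   # over-trained: few misses, many false alarms
}


def best_intensity(sweep):
    """Return the intensity whose (tp, fp, fn) counts maximize F1."""
    def score(tau):
        tp, fp, fn = sweep[tau]
        return f1(precision(tp, fp), recall(tp, fn))
    return max(sorted(sweep), key=score)


if __name__ == "__main__":
    for tau, (tp, fp, fn) in sorted(SYNTHETIC_SWEEP.items()):
        print(tau, round(precision(tp, fp), 3), round(recall(tp, fn), 3))
    print("best tau under F1:", best_intensity(SYNTHETIC_SWEEP))
```

In the synthetic sweep, moving from the middle to the highest intensity raises recall but lowers precision, which is the pattern the text describes as defensive refusal bias.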
Empirical Findings
Experiments were conducted on a production OpenClaw instance (fixed seed profile, 47 prior sessions). Metrics include average TLDT dimension scores and curriculum coverage.
- ASAT Learning Dynamics: Weakest-first scheduling increased average TLDT scores from 80.9 to 96.9 (+15.9 over 16 sessions, covering 11/12 dimensions), outperforming uniform-random scheduling by 6.5 points and covering 4 more dimensions. Uniform-random training exhibited dimension fixation: it repeatedly selected already proficient dimensions and neglected weaker ones.
- CSMA Memory Ablation: Persistent CSMA memory retained full curriculum gains (96.9) across sessions. Cold-start ablation recovered only 83.3 (+2.4 from baseline), demonstrating a 13.6-point advantage for CSMA and establishing memory continuity as the primary driver of skill accumulation.
- Scenario Generation (E-mode): ClawdGo generated 32 TLDT-conformant scenarios spanning all 12 dimensions (schema validation: 100%). Scenarios included supply-chain hijack detection and sophisticated social engineering recognition, with broad coverage across dimensions and high scores.
- SACP Observation: Excessive training caused legitimate assessments to be misclassified as prompt injection (30 false-positive flags out of 160), indicating direct utility loss. Over-trained dimensions dominated scenario output, evidence of a self-reinforcing curriculum bias.
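The headline figures above can be cross-checked with a few lines of arithmetic. The inputs are the numbers quoted in the findings; the variable names are hypothetical. Note that the rounded endpoints 80.9 and 96.9 differ by 16.0, so the reported +15.9 presumably comes from unrounded underlying scores (an assumption, not something the source states).

```python
# Figures quoted in the findings above.
baseline = 80.9             # average TLDT score before training
weakest_first_final = 96.9  # after 16 weakest-first sessions (CSMA retained)
cold_start = 83.3           # score recovered in the cold-start ablation
false_positives = 30        # legitimate assessments flagged as prompt injection
total_items = 160           # denominator quoted in the text

csma_advantage = round(weakest_first_final - cold_start, 1)       # 13.6
cold_start_gain = round(cold_start - baseline, 1)                 # 2.4
endpoint_difference = round(weakest_first_final - baseline, 1)    # 16.0
false_positive_share = false_positives / total_items              # 0.1875

if __name__ == "__main__":
    print(f"CSMA advantage over cold start: {csma_advantage}")
    print(f"Cold-start gain over baseline:  {cold_start_gain}")
    print(f"Difference of rounded endpoints: {endpoint_difference}")
    print(f"False-positive share: {false_positive_share:.2%}")
```

The last value shows that the 30 false-positive flags amount to 18.75% of the 160 items, which gives a sense of the utility loss attributed to over-training.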
Implications
The ClawdGo framework demonstrates that endogenous security awareness training performed at inference time is a viable and effective augmentation for autonomous agent platforms, requiring neither model fine-tuning nor external services. The results indicate that curriculum design (weakest-first scheduling) and persistent memory architectures (CSMA) are indispensable for comprehensive skill acquisition and retention. The SACP results highlight a fundamental tension: there is a deployment-specific optimal training intensity beyond which increased vigilance degrades agent functionality through refusal bias. Early calibration and careful curriculum balancing are essential to ensure reliable agent performance.
Practically, ClawdGo’s methodology can be deployed rapidly in production environments and applied to a broad range of agent architectures. Theoretically, this line of research suggests a reframing of agent safety: skill acquisition and threat reasoning must be supported at the inference level in addition to perimeter controls. The calibration problem presents an avenue for further exploration, particularly regarding automated detection and correction of imbalanced training and attention bias.
Future Directions
Several open challenges remain:
- Systematic characterization of precision-recall tradeoffs (P(τ)–R(τ)) across varied deployment conditions.
- Security Vaccine transfer between agent instances (G-mode).
- Large-scale adversarial training in heterogeneous agent arenas (H-mode).
- Extension of TLDT to multi-platform, non-OpenClaw agent frameworks.
Continued research is required to refine curriculum allocation algorithms, memory management strategies, and calibration diagnostics to improve practical agent resilience and theoretical understanding of endogenous security training.
Conclusion
ClawdGo establishes a robust, inference-time approach to autonomous agent security awareness, advancing beyond perimeter-only defenses. The empirical results demonstrate the necessity of adaptive curriculum scheduling and persistent memory for agent skill acquisition, while the formalization of SACP underscores the intrinsic tradeoff between security vigilance and utility. This framework provides a foundation for the development of agent-centric, endogenous threat reasoning protocols and spurs ongoing research into the calibration and extension of security training in autonomous systems.
Reference: "Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents" (2604.24020)