Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

Published 27 Apr 2026 in cs.CR and cs.AI | (2604.24020v1)

Abstract: Autonomous AI agents deployed on platforms such as OpenClaw face prompt injection, memory poisoning, supply-chain attacks, and social engineering, yet existing defences address only the platform perimeter, leaving the agent's own threat judgement entirely untrained. We present ClawdGo, a framework for endogenous security awareness training: we teach the agent to recognise and reason about threats from the inside, at inference time, with no model modification. Four contributions are introduced: TLDT (Three-Layer Domain Taxonomy) organises 12 trainable dimensions across Self-Defence, Owner-Protection, and Enterprise-Security layers; ASAT (Autonomous Security Awareness Training) is a self-play loop where the agent alternates attacker, defender, and evaluator roles under weakest-first curriculum scheduling; CSMA (Cross-Session Memory Accumulation) compounds skill gains via a four-layer persistent memory architecture and Axiom Crystallisation Promotion (ACP); and SACP (Security Awareness Calibration Problem) formalises the precision-recall tradeoff introduced by endogenous training. Live experiments show weakest-first ASAT raises average TLDT score from 80.9 to 96.9 over 16 sessions, outperforming uniform-random scheduling by 6.5 points and covering 11 of 12 dimensions. CSMA retains the full gain across sessions; cold-start ablation recovers only 2.4 points, leaving a 13.6-point gap. E-mode generates 32 TLDT-conformant scenarios covering all 12 dimensions. SACP is observed when a heavily trained agent classifies a legitimate capability assessment as prompt injection (30/160).


Authors (6)

Summary

The paper proposes ClawdGo, a framework enabling autonomous agents to train security awareness at runtime without modifying model parameters.
It employs a weakest-first scheduling strategy and a persistent memory architecture (CSMA) to boost threat detection and cover 11 of 12 security dimensions.
The study highlights a calibration challenge where excessive training induces defensive refusal bias, emphasizing the need for balanced training intensity.

ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents

Context and Motivation

The proliferation of autonomous AI agents, exemplified by the widespread adoption of frameworks like OpenClaw, has introduced a multi-faceted attack surface distinct from those of earlier human-centric and traditional application architectures. The scale of deployment, as indicated by OpenClaw’s rapid growth (250,000+ GitHub stars, 135,000+ public instances), has produced extensive ecosystem exposure, leading to confirmed vulnerabilities and supply-chain contamination. Existing defenses for agent platforms focus almost exclusively on perimeter hardening (static code analysis, runtime filtering, and sandboxing), leaving the agent’s own threat judgment untrained. Without endogenous, inference-time security awareness, agents remain exposed to prompt injection, memory poisoning, supply-chain attacks, and advanced social engineering. ClawdGo targets this gap directly, proposing a framework for autonomous security awareness training that operates entirely at runtime, without modifying model parameters or requiring external infrastructure.

Framework Overview

ClawdGo consists of four principal components: TLDT (Three-Layer Domain Taxonomy), ASAT (Autonomous Security Awareness Training), CSMA (Cross-Session Memory Accumulation), and SACP (Security Awareness Calibration Problem).

TLDT: Three-Layer Domain Taxonomy

TLDT organizes twelve trainable threat awareness dimensions into three hierarchical layers:

Self-Defence (S1–S4): prompt injection, memory poisoning, supply-chain attacks, credential misuse.
Owner-Protection (O1–O4): phishing relay, social engineering, privacy leakage, unsafe network exposure.
Enterprise-Security (E1–E4): data handling, compliance, insider risk, incident response.

This taxonomy extends prior agent-security rubrics (e.g., OWASP LLM Top-10, MITRE ATLAS) and differs from them by incorporating owner-centric adversarial vectors, reflecting the emerging trend of AI agents being targeted as proxies in business email compromise (BEC) and social engineering campaigns.
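
The taxonomy can be encoded as a simple lookup table. The Python sketch below is illustrative only: the dimension IDs and names come from the list above, but the `Layer` enum, the `dimensions_in` helper, and the unweighted mean in `average_tldt_score` are assumptions, since the summary does not say how the reported average TLDT score is aggregated.

```python
from enum import Enum
from typing import Dict, List, Tuple


class Layer(Enum):
    SELF_DEFENCE = "Self-Defence"
    OWNER_PROTECTION = "Owner-Protection"
    ENTERPRISE_SECURITY = "Enterprise-Security"


# The twelve TLDT dimensions as listed in the summary, keyed by layer-prefixed ID.
TLDT: Dict[str, Tuple[Layer, str]] = {
    "S1": (Layer.SELF_DEFENCE, "prompt injection"),
    "S2": (Layer.SELF_DEFENCE, "memory poisoning"),
    "S3": (Layer.SELF_DEFENCE, "supply-chain attacks"),
    "S4": (Layer.SELF_DEFENCE, "credential misuse"),
    "O1": (Layer.OWNER_PROTECTION, "phishing relay"),
    "O2": (Layer.OWNER_PROTECTION, "social engineering"),
    "O3": (Layer.OWNER_PROTECTION, "privacy leakage"),
    "O4": (Layer.OWNER_PROTECTION, "unsafe network exposure"),
    "E1": (Layer.ENTERPRISE_SECURITY, "data handling"),
    "E2": (Layer.ENTERPRISE_SECURITY, "compliance"),
    "E3": (Layer.ENTERPRISE_SECURITY, "insider risk"),
    "E4": (Layer.ENTERPRISE_SECURITY, "incident response"),
}


def dimensions_in(layer: Layer) -> List[str]:
    """Return the dimension IDs belonging to one layer, in taxonomy order."""
    return [dim for dim, (lyr, _) in TLDT.items() if lyr is layer]


def average_tldt_score(scores: Dict[str, float]) -> float:
    """Unweighted mean over all twelve dimensions (assumed aggregation)."""
    missing = set(TLDT) - set(scores)
    if missing:
        raise ValueError(f"no score for dimensions: {sorted(missing)}")
    return sum(scores[dim] for dim in TLDT) / len(TLDT)
```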

ASAT: Autonomous Security Awareness Training

ASAT delivers endogenous training via a self-play loop wherein the agent alternates roles as attacker, defender, and evaluator, guided by weakest-first curriculum scheduling. Each session selects the lowest proficiency dimension, generates relevant scenarios, and reinforces threat modelling and response capabilities. This approach prevents the dimension fixation observed with uniform-random scheduling and is implemented at inference time as a standard skill invocation, avoiding model parameter modification.
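
The scheduling contrast can be sketched in a few lines of Python. This is a hypothetical illustration, not ClawdGo's implementation: the `attacker`, `defender`, and `evaluator` callables stand in for the agent playing each role, and the moving-average score update (`rate`) is an assumed rule. Weakest-first always trains the lowest-scoring dimension, whereas the uniform-random baseline ignores scores and can keep revisiting dimensions that are already strong.

```python
import random
from typing import Callable, Dict, Set, Tuple

Scores = Dict[str, float]
Roles = Tuple[Callable[[str], str], Callable[[str], str], Callable[[str, str], float]]


def weakest_first(scores: Scores) -> str:
    """Select the lowest-scoring dimension; ties are broken by dimension ID."""
    return min(scores, key=lambda dim: (scores[dim], dim))


def make_uniform_random(seed: int) -> Callable[[Scores], str]:
    """Baseline selector that ignores scores and picks any dimension uniformly."""
    rng = random.Random(seed)
    return lambda scores: rng.choice(sorted(scores))


def self_play(dim: str, score: float, roles: Roles, rate: float = 0.5) -> float:
    """One hypothetical episode in which the same agent plays all three roles."""
    attacker, defender, evaluator = roles
    scenario = attacker(dim)               # attacker role writes a threat scenario
    response = defender(scenario)          # defender role responds to it
    grade = evaluator(scenario, response)  # evaluator role grades the response, 0-100
    return score + rate * (grade - score)  # assumed moving-average score update


def run_asat(scores: Scores, select: Callable[[Scores], str],
             roles: Roles, n_sessions: int) -> Tuple[Scores, Set[str]]:
    """Run n sessions; return final scores and the set of dimensions trained."""
    scores = dict(scores)
    covered: Set[str] = set()
    for _ in range(n_sessions):
        dim = select(scores)
        scores[dim] = self_play(dim, scores[dim], roles)
        covered.add(dim)
    return scores, covered
```

With toy roles whose evaluator always returns 100, four weakest-first sessions starting from `{"S1": 60, "S2": 80, "O1": 70}` train every dimension and end at S1 = 90, S2 = 90, O1 = 85, because each session moves to whichever dimension is currently lowest.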

CSMA: Persistent Security Memory and ACP

CSMA implements a four-layer persistent memory architecture:

L0: distilled axioms (“soul.md”)
L1: per-dimension skill profiles
L2: append-only episodic logs
L3: scenario libraries
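
A minimal Python sketch of the four layers as one persistable record, assuming JSON serialisation between sessions. The class and field names are hypothetical; beyond L0 being stored as `soul.md`, the summary does not describe ClawdGo's on-disk format.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import Dict, List


@dataclass
class Axiom:
    text: str
    confidence: float


@dataclass
class SecurityMemory:
    axioms: List[Axiom] = field(default_factory=list)               # L0: distilled axioms
    skill_profiles: Dict[str, float] = field(default_factory=dict)  # L1: per-dimension skill
    episodes: List[dict] = field(default_factory=list)              # L2: episodic log
    scenarios: Dict[str, List[str]] = field(default_factory=dict)   # L3: scenario library

    def log_episode(self, episode: dict) -> None:
        """Append a copy of an episode; L2 is treated as append-only by convention."""
        self.episodes.append(dict(episode))

    def render_soul_md(self) -> str:
        """Render L0 as a bullet list, e.g. for writing to soul.md."""
        return "\n".join(f"- {a.text}" for a in self.axioms)

    def to_json(self) -> str:
        """Serialise all four layers so they survive the end of a session."""
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, raw: str) -> "SecurityMemory":
        """Restore memory at the start of a new session."""
        data = json.loads(raw)
        return cls(
            axioms=[Axiom(**a) for a in data["axioms"]],
            skill_profiles=data["skill_profiles"],
            episodes=data["episodes"],
            scenarios=data["scenarios"],
        )
```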

Axiom Crystallisation Promotion (ACP) governs the conversion of episodic experiences into durable axioms, with decay-based revision of axioms whose confidence falls below a threshold. Threat recognition and mitigation skill thus accumulates across sessions without model fine-tuning, with the persistent memory layers acting as an analogue of semantic memory.
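
ACP's promotion and decay steps might look like the following. Every threshold here (`min_support`, `initial_conf`, `decay`, `threshold`) is a hypothetical placeholder, and the rule that a lesson is promoted once it recurs in enough episodes is an assumption about how crystallisation works, not the paper's stated criterion.

```python
from collections import Counter
from typing import Dict, Iterable, List, Set, Tuple


def crystallise(lessons: Iterable[str], axioms: Dict[str, float],
                min_support: int = 3, initial_conf: float = 0.8) -> Dict[str, float]:
    """Promote lessons that recur in at least `min_support` episodes to axioms."""
    updated = dict(axioms)
    for lesson, count in Counter(lessons).items():
        if count >= min_support:
            # New axioms start at initial_conf; existing ones are never lowered here.
            updated[lesson] = max(updated.get(lesson, 0.0), initial_conf)
    return updated


def decay_and_revise(axioms: Dict[str, float], reinforced: Set[str],
                     decay: float = 0.9, threshold: float = 0.5
                     ) -> Tuple[Dict[str, float], List[str]]:
    """Decay axioms not reinforced this session; flag any that fall below threshold."""
    kept: Dict[str, float] = {}
    needs_revision: List[str] = []
    for text, conf in axioms.items():
        new_conf = conf if text in reinforced else conf * decay
        if new_conf < threshold:
            needs_revision.append(text)  # below threshold: hand back for revision
        else:
            kept[text] = new_conf
    return kept, needs_revision
```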

SACP: Security Awareness Calibration Problem

ClawdGo formalizes the tradeoff between training intensity and agent utility. Beyond an optimal training intensity $\tau^*$, additional training continues to raise recall but lowers precision, manifesting as defensive refusal bias: over-trained agents misclassify legitimate tasks as hostile. This phenomenon, previously observed at the model level, is characterized here as a measurable loss of task utility within agent training.
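
To make the tradeoff concrete, the sketch below computes precision and recall from detection counts at several training intensities and selects $\tau^*$ as the intensity with the highest $F_\beta$ score. The counts are invented for illustration, and $F_\beta$ is a stand-in selection criterion; the summary does not give the paper's exact formalisation of SACP. Varying `beta` shows why the optimum is deployment-specific.

```python
from typing import Dict, Tuple

Counts = Tuple[int, int, int]  # (true positives, false positives, false negatives)


def precision_recall(tp: int, fp: int, fn: int) -> Tuple[float, float]:
    """Precision and recall of threat flags; undefined ratios are reported as 0.0."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def f_beta(precision: float, recall: float, beta: float = 1.0) -> float:
    """Weighted harmonic mean; beta > 1 weights recall more than precision."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def best_intensity(counts_by_tau: Dict[int, Counts], beta: float = 1.0) -> int:
    """Return the training intensity tau whose counts maximise F-beta."""
    return max(counts_by_tau,
               key=lambda tau: f_beta(*precision_recall(*counts_by_tau[tau]), beta))


# Invented counts: precision falls while recall rises as training intensifies.
ILLUSTRATIVE_COUNTS: Dict[int, Counts] = {
    2: (50, 2, 30),
    8: (72, 6, 8),
    16: (78, 24, 2),
}
```

With these illustrative counts, $F_1$ selects $\tau = 8$, while $F_2$, which weights recall more heavily, selects $\tau = 16$ despite its lower precision (about 0.76 versus 0.92).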

Empirical Findings

Experiments were conducted on a production OpenClaw instance (fixed seed profile, 47 prior sessions). Metrics include average TLDT dimension scores and curriculum coverage.

ASAT Learning Dynamics: Weakest-first scheduling increased the average TLDT score from 80.9 to 96.9 (+15.9 over 16 sessions, covering 11/12 dimensions), outperforming uniform-random scheduling by 6.5 points and covering 4 more dimensions. Uniform-random training exhibited dimension fixation: it repeatedly selected already-proficient dimensions while neglecting weaker ones.
CSMA Memory Ablation: Persistent CSMA memory retained the full curriculum gain (96.9) across sessions. The cold-start ablation reached only 83.3 (+2.4 over baseline), a 13.6-point gap in favor of CSMA, indicating that memory continuity is the primary driver of skill accumulation.
Scenario Generation (E-mode): ClawdGo generated 32 TLDT-conformant scenarios spanning all 12 dimensions (schema validation: 100%), including supply-chain hijack detection and sophisticated social engineering recognition.
SACP Observation: Excessive training caused misclassification of legitimate assessments as prompt injection (30 false positive flags out of 160), indicating direct utility loss. Over-trained dimensions dominated scenario output, evidencing self-reinforcing curriculum bias.
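
The summary reports 100% schema validation for E-mode scenarios but does not publish the schema. The sketch below assumes a hypothetical schema with four required string fields, checks that each scenario names one of the twelve TLDT dimensions, and reports the pass rate along with which dimensions remain uncovered.

```python
from typing import Dict, List

# The twelve TLDT dimension IDs: S1-S4, O1-O4, E1-E4.
TLDT_IDS = {f"{layer}{i}" for layer in "SOE" for i in range(1, 5)}

# Hypothetical required fields; the paper's actual schema is not given.
REQUIRED_FIELDS = ("id", "dimension", "attack", "expected_behaviour")


def validate_scenario(scenario: dict) -> List[str]:
    """Return a list of schema violations (empty if the scenario conforms)."""
    errors = []
    for name in REQUIRED_FIELDS:
        value = scenario.get(name)
        if not isinstance(value, str) or not value.strip():
            errors.append(f"field '{name}' must be a non-empty string")
    dim = scenario.get("dimension")
    if isinstance(dim, str) and dim.strip() and dim not in TLDT_IDS:
        errors.append(f"unknown TLDT dimension: {dim}")
    return errors


def validation_report(scenarios: List[dict]) -> Dict[str, object]:
    """Schema pass rate plus which TLDT dimensions the valid scenarios cover."""
    valid = [s for s in scenarios if not validate_scenario(s)]
    covered = {s["dimension"] for s in valid}
    return {
        "pass_rate": len(valid) / len(scenarios) if scenarios else 0.0,
        "covered": len(covered),
        "missing": sorted(TLDT_IDS - covered),
    }
```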

Implications

The ClawdGo framework demonstrates that endogenous, inference-time security awareness training is a viable and effective augmentation for autonomous agent platforms, requiring neither model fine-tuning nor external services. The results indicate that curriculum design (weakest-first scheduling) and persistent memory architectures (CSMA) are indispensable for comprehensive skill acquisition and retention. The SACP results highlight a fundamental tension: there exists a deployment-specific optimal training intensity beyond which increased vigilance degrades agent functionality through refusal bias. Early calibration and careful curriculum balancing are therefore essential to reliable agent performance.

In practice, because ClawdGo’s methodology requires no model changes, it lends itself to rapid deployment in production environments and could be applied to other agent architectures, although extension beyond OpenClaw remains open work. Theoretically, this line of research suggests reframing agent safety: skill acquisition and threat reasoning must be supported at the inference level in addition to perimeter controls. The calibration problem presents an avenue for further exploration, particularly the automated detection and correction of imbalanced training and attention bias.

Future Directions

Several open challenges remain:

Systematic characterization of the precision-recall tradeoff, $P(\tau)$ versus $R(\tau)$, across varied deployment conditions.
Security Vaccine transfer between agent instances (G-mode).
Large-scale adversarial training in heterogeneous agent arenas (H-mode).
Extension of TLDT to multi-platform, non-OpenClaw agent frameworks.

Continued research is required to refine curriculum allocation algorithms, memory management strategies, and calibration diagnostics to improve practical agent resilience and theoretical understanding of endogenous security training.

Conclusion

ClawdGo establishes a robust, inference-time approach to autonomous agent security awareness, advancing beyond perimeter-only defenses. The empirical results demonstrate the necessity of adaptive curriculum scheduling and persistent memory for agent skill acquisition, while the formalization of SACP underscores the intrinsic tradeoff between security vigilance and utility. This framework provides a foundation for the development of agent-centric, endogenous threat reasoning protocols and spurs ongoing research into the calibration and extension of security training in autonomous systems.

Reference: "Poster: ClawdGo: Endogenous Security Awareness Training for Autonomous AI Agents" (2604.24020)
