Separating Legitimate Updates from Malicious Injections in Persistent Agent Files

Determine reliable criteria and mechanisms to separate legitimate updates from malicious injections in the persistent state files that enable evolution in OpenClaw (such as MEMORY.md, SOUL.md, IDENTITY.md, USER.md, and AGENTS.md), ensuring the agent can continue to learn and adapt without providing an attack surface for state poisoning.

Background

OpenClaw continuously evolves by modifying persistent state files that capture its capabilities, identity, and knowledge. This same persistent state also serves as an attack surface: an adversary who can inject malicious content into these files can steer the agent toward harmful actions in later sessions.

The authors evaluated a file-protection mechanism that instructs the agent to wait for approval before modifying its knowledge and identity files. While this substantially reduces successful injections, it also blocks legitimate updates at nearly the same rate, exposing a fundamental tradeoff between enabling evolution and maintaining safety. This motivates methods that distinguish benign updates from adversarial injections without freezing the agent's learning.
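To make the tradeoff concrete, here is a minimal sketch of such an approval gate. The file list comes from the problem statement; the `gated_write` function and the `approve` callback are hypothetical names introduced for illustration, not part of OpenClaw itself:

```python
from pathlib import Path

# Persistent state files named in the problem statement.
PROTECTED = {"MEMORY.md", "SOUL.md", "IDENTITY.md", "USER.md", "AGENTS.md"}

def gated_write(path: str, new_content: str, approve) -> bool:
    """Persist `new_content` to `path`, gating writes to protected files.

    `approve(name, old, new)` is a callback (e.g. a human reviewer or a
    policy model) that inspects the proposed change before it is applied.
    Returns True if the write went through, False if it was blocked.
    """
    p = Path(path)
    if p.name in PROTECTED:
        old = p.read_text() if p.exists() else ""
        if not approve(p.name, old, new_content):
            # The gate cannot tell a benign update from an injection by
            # itself; a conservative approver blocks both at similar rates.
            return False
    p.write_text(new_content)
    return True
```

The design choice this sketch exposes is the crux of the open problem: all of the discriminative power lives in the `approve` callback, and a callback conservative enough to stop injections tends to stop legitimate evolution too.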

References

A file-protection approach reduces injection rates by up to 97% but blocks legitimate agent updates at nearly the same rate, revealing a fundamental evolution-safety tradeoff: as long as the persistent files that enable evolution are also the attack surface, separating legitimate updates from injections remains an open problem.

Your Agent, Their Asset: A Real-World Safety Analysis of OpenClaw  (2604.04759 - Wang et al., 6 Apr 2026) in Abstract