Systematic quantitative evaluation of OpenClaw agent safety

Establish a systematic, quantitative evaluation of the OpenClaw personal AI assistant's safety under adversarial conditions, replacing prior qualitative or narrow-scope assessments with comprehensive measurement of how often harmful actions are executed in realistic deployments.

Background

Prior analyses of OpenClaw have largely been qualitative, case-based, or narrow in scope, leaving gaps in understanding how often and under what conditions adversarial prompt injections lead to harmful actions. The paper argues that a rigorous, quantitative methodology is needed to measure safety outcomes in realistic high-privilege settings where agents read local files, emails, and web content.

This open problem addresses the need to move beyond anecdotal or black-box findings by developing standardized, reproducible evaluation protocols that capture attack success rates and failure modes for OpenClaw across realistic tasks and environments.
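As a rough illustration of the kind of quantitative protocol called for here, the sketch below computes an overall and per-vector attack success rate (ASR) from logged adversarial trials. The trial records, field names (e.g. `injection_vector`, `harmful_action_executed`), and grouping are hypothetical assumptions for illustration, not the paper's actual evaluation harness.

```python
from collections import defaultdict

# Hypothetical trial records from adversarial prompt-injection runs against an
# OpenClaw-style agent; field names are illustrative, not from the paper.
trials = [
    {"injection_vector": "email", "harmful_action_executed": True},
    {"injection_vector": "email", "harmful_action_executed": False},
    {"injection_vector": "web_page", "harmful_action_executed": True},
    {"injection_vector": "local_file", "harmful_action_executed": False},
]

def attack_success_rates(trials):
    """Return the overall ASR and a per-injection-vector breakdown."""
    per_vector = defaultdict(lambda: [0, 0])  # vector -> [successes, total]
    for t in trials:
        stats = per_vector[t["injection_vector"]]
        stats[0] += int(t["harmful_action_executed"])
        stats[1] += 1
    overall = sum(s for s, _ in per_vector.values()) / max(len(trials), 1)
    return overall, {v: s / n for v, (s, n) in per_vector.items()}

overall_asr, per_vector_asr = attack_success_rates(trials)
print(f"overall ASR: {overall_asr:.2f}")  # 0.50 for the sample trials above
for vector, asr in sorted(per_vector_asr.items()):
    print(f"  {vector}: {asr:.2f}")
```

Reporting ASR broken down by injection vector (email, web content, local files) is one way such a protocol could expose failure modes that an aggregate number would hide; the actual metrics would be defined by the evaluation itself.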

References

\citet{shapira2026agents} take a step in this direction, documenting emergent OpenClaw failures in a live lab setting, and \citet{wang2026assistant} evaluate OpenClaw under black-box conditions, but both remain qualitative or narrow in scope, leaving systematic quantitative evaluation an open problem.

ClawSafety: "Safe" LLMs, Unsafe Agents (2604.01438 - Wei et al., 1 Apr 2026) in Section 1 (Introduction)