
Architectural Obsolescence of Unhardened Agentic-AI Runtimes

Published 3 May 2026 in cs.CR, cs.AI, and cs.MA | (2605.01740v1)

Abstract: An agentic-AI runtime issues tool calls, sends messages, and actuates devices on behalf of an LLM. Catching the four ways an action can diverge from its audit record -- F1 gate-bypass, F2 audit-forgery, F3 silent host failure, F4 wrong-target -- is a load-bearing safety property of any such runtime. We show that upstream OpenClaw, the most engineered single-user agentic-AI gateway in public release, catches none of them: recall is 0.000 on every cell of every confusion matrix, on a 1600-sample template baseline through OpenClaw's actual production command-line interface (CLI) and on a ten-LLM cross-model generalisation run. Detecting F1--F4 requires seven specific runtime structures absent from OpenClaw's source tree: a biconditional checker, a hash-chained audit log, an extension admission gate, a two-layer egress guard, a Bell-LaPadula classification policy, a module-signing trust root, and a bootstrap seal. enclawed-oss -- an MIT-licensed drop-in fork that ships all seven -- reaches $P = R = F_1 =$ accuracy $= 1.000$ on the same input. The gap is structural, not parametric: a six-line append-only widening of enclawed-oss's data-loss-prevention (DLP) regex catalog raises per-channel F3 detection by 14.6\% net at unchanged precision; the same edit on OpenClaw has nowhere to land. The harness deliberately exercises real Discord and Telegram channels -- plugin categories the first enclawed release deleted as unsafe -- to show F1--F4 detection extends to those previously-unsafe extensions. With architectural superiority for security and feature parity for extensions, we argue that unhardened agentic-AI runtimes are architecturally obsolete: a strictly better alternative exists, is adoptable today, and the gap requires re-architecture rather than configuration. We invite reviewers to apply the harness to any candidate runtime.

Authors (1)

Summary

  • The paper demonstrates that unhardened OpenClaw fails to detect critical failure modes, while enclawed-oss achieves perfect recall using seven explicit detection primitives.
  • A rigorous statistical adversarial harness employing 81,600 samples with Wilson score intervals and McNemar's test validates the empirical performance differences.
  • The study highlights that incorporating defined detection primitives is essential for ensuring regulatory compliance and enhanced runtime security in agentic-AI systems.

Architectural Obsolescence in Unhardened Agentic-AI Runtimes: An Empirical Study

Overview

This paper presents a comparative analysis of agentic-AI runtimes, focusing on the architectural gap between OpenClaw, taken as the representative unhardened runtime, and the hardened, MIT-licensed enclawed-oss fork. The evaluation measures detection of four biconditional failure modes: F1 (gate-bypass), F2 (audit forgery), F3 (silent host failure), and F4 (wrong-target). The authors treat catching these, on regulatory and practical grounds, as a safety property of any agentic-AI system. They define architectural obsolescence as the conjunction of three conditions: an alternative with strictly better recall exists, it is adoptable in practice, and the gap is structural (missing primitives) rather than parametric (tunable by configuration). Enclawed-oss ships the seven necessary detection primitives; OpenClaw ships none.

Failure Mode Taxonomy and Detection Primitives

The four biconditional failure modes underlie divergent accountability in agentic-AI runtimes:

  • F1 (gate bypass): Undetected state mutation.
  • F2 (audit forgery): Fabricated audit entries.
  • F3 (silent host failure): Partial mutation, unlogged.
  • F4 (wrong target): Authorized action misapplied.
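One way to read the taxonomy above is as the set of ways a biconditional ("an action took effect if and only if a matching audit record exists") can fail. The following is a minimal sketch under that reading, not the paper's checker: the `Effect` and `AuditRecord` types, their fields, and the `classify` function are hypothetical names introduced for illustration, and the mapping of each branch to F1 through F4 is an assumption based on the one-line definitions in the list.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class AuditRecord:
    """What the audit log claims happened (hypothetical shape)."""
    action_id: str
    target: str


@dataclass(frozen=True)
class Effect:
    """What the host actually observed (hypothetical shape)."""
    action_id: str
    target: str
    complete: bool  # False when the host applied the action only partially


def classify(effect: Optional[Effect], record: Optional[AuditRecord]) -> str:
    """Compare one observed effect with one audit record.

    The caller is assumed to have paired the two by action_id already.
    Returns "OK" when effect and record agree, else an F-category label.
    """
    if effect is None and record is None:
        return "OK"   # nothing happened and nothing was claimed
    if effect is None:
        return "F2"   # a record exists but no action took effect
    if record is None:
        # An effect exists with no record: partial mutations are treated
        # as silent host failures, complete ones as gate bypasses.
        return "F1" if effect.complete else "F3"
    if effect.target != record.target:
        return "F4"   # the recorded action landed on a different target
    if not effect.complete:
        return "F3"   # the record claims an action the host only half-applied
    return "OK"
```

For example, `classify(Effect("a1", "chan-2", True), AuditRecord("a1", "chan-1"))` returns `"F4"` because the logged target and the observed target differ.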

Detection mandates seven explicit primitives: (1) biconditional check, (2) hash-chained audit log, (3) extension admission gate, (4) two-layer egress guard, (5) Bell-LaPadula classification lattice, (6) module-signing trust root, (7) bootstrap seal. These structures are entirely absent in OpenClaw and comprehensively present in enclawed-oss, amounting to a joint architectural gap rather than a tunable weakness.
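To make the second primitive concrete, here is a minimal sketch of a hash-chained audit log using only the Python standard library. It illustrates the general technique and is not the enclawed-oss implementation: the entry layout, the `GENESIS` constant, and the `append` and `verify` helpers are assumptions made for this example.

```python
import hashlib
import json
from typing import Any, Dict, List, Optional

GENESIS = "0" * 64  # assumed placeholder hash preceding the first entry


def _digest(prev_hash: str, payload: Dict[str, Any]) -> str:
    """SHA-256 over the previous hash plus a canonical JSON payload."""
    body = json.dumps({"prev": prev_hash, "payload": payload}, sort_keys=True)
    return hashlib.sha256(body.encode("utf-8")).hexdigest()


def append(log: List[Dict[str, Any]], payload: Dict[str, Any]) -> Dict[str, Any]:
    """Append an entry whose hash commits to the entry before it."""
    prev = log[-1]["hash"] if log else GENESIS
    entry = {"prev": prev, "payload": payload, "hash": _digest(prev, payload)}
    log.append(entry)
    return entry


def verify(log: List[Dict[str, Any]]) -> Optional[int]:
    """Return the index of the first inconsistent entry, or None if intact."""
    prev = GENESIS
    for index, entry in enumerate(log):
        if entry["prev"] != prev:
            return index  # link to the previous entry is broken
        if entry["hash"] != _digest(entry["prev"], entry["payload"]):
            return index  # payload was altered after it was hashed
        prev = entry["hash"]
    return None
```

Editing or inserting an entry in the middle of the log changes a hash that every later entry depends on, so `verify` reports the first affected index. A chain on its own does not reveal truncation of the tail or a full rewrite from the genesis value; some external anchor for the chain head is needed for that, and this summary does not say how enclawed-oss provides one.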

Methodology

The experiment uses a statistical adversarial harness that runs a 1600-sample template-synthesized baseline and a ten-LLM cross-model generalisation set through both OpenClaw and enclawed-oss. Primitive availability is established by auditing each source tree and verifying the API surface. Wilson score intervals bound the reported rates, including the false-positive rate (FPR), and McNemar's test supports the paired comparison between the two runtimes on the same samples. The harness is described as reproducible, with random seed control, channel isolation, and hash-chained auditing.
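The two statistics named above can be sketched with the standard library alone. This is an illustration of the textbook formulas, not the paper's harness code; the function names `wilson_interval` and `mcnemar_exact` are hypothetical, and the exact (binomial) form of McNemar's test is an assumption, since the summary does not say which variant the authors use.

```python
import math
from typing import Tuple

Z_95 = 1.959963984540054  # two-sided 95% normal quantile


def wilson_interval(successes: int, n: int, z: float = Z_95) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need 0 <= successes <= n and n > 0")
    p = successes / n
    denom = 1.0 + z * z / n
    centre = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: samples detected by runtime A only; c: samples detected by B only.
    Under the null hypothesis each discordant pair is a fair coin flip.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    # Integer arithmetic avoids float underflow of 0.5 ** n for large n.
    tail = sum(math.comb(n, i) for i in range(k + 1))
    return min(1.0, 2 * tail / 2 ** n)
```

When one runtime detects every positive sample and the other detects none, every pair is discordant in the same direction, so `mcnemar_exact(0, n)` reduces to `2 / 2**n`, which is already about 0.002 at `n = 10`.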

Results

Baseline and Stress-Test

  • OpenClaw achieves recall = 0.000 (every cell, every confusion matrix).
  • Enclawed-oss and enclawed-enclaved reach $P = R = F_1 = 1.000$ (every cell).
  • The 80,000-sample stress test corroborates these results with a Wilson upper-bound FPR $< 4\times10^{-4}$.
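As a rough check on what such a bound implies, the Wilson upper limit has a simple closed form when no false positives are observed. The derivation below assumes zero observed false positives (consistent with the reported precision of 1.000) and leaves the number of negative samples $n$ symbolic, because this summary does not state how many of the 80,000 samples are negatives.

$$
U(\hat p, n) = \frac{\hat p + \frac{z^2}{2n} + z\sqrt{\frac{\hat p(1-\hat p)}{n} + \frac{z^2}{4n^2}}}{1 + \frac{z^2}{n}},
\qquad
U(0, n) = \frac{\frac{z^2}{2n} + \frac{z^2}{2n}}{1 + \frac{z^2}{n}} = \frac{z^2}{n + z^2}.
$$

Solving $U(0, n) < u$ for $n$ gives $n > z^2 (1 - u)/u$. With $z \approx 1.96$ and $u = 4\times10^{-4}$ this is roughly $n > 9.6\times10^{3}$, so under these assumptions a bound of that size needs on the order of ten thousand clean negative samples.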

LLM Generalization

  • Enclawed-oss maintains mean recall = 1.000 on all cooperating LLMs for all F-categories.
  • OpenClaw recall remains 0.000, regardless of adversarial LLM emission.
  • LLM cooperation is highly variable across model families; safety training does not reliably prevent adversarial content emission.

Architectural Improvability

A six-line append-only widening of the enclawed-oss DLP regex catalog provides a 14.6% increase in per-channel F3 detection (silent host failure) at unchanged precision. This constitutes evidence for practical improvability at the architectural layer; OpenClaw, lacking a detection primitive, offers no locus for such improvements.
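The idea of an append-only widening can be sketched as follows. This is an illustrative model, not the enclawed-oss catalog: the `BASE` entries, the pattern names, the regular expressions, and the `widen` and `scan` helpers are all hypothetical and chosen only to show the shape of the edit.

```python
import re
from typing import List, Tuple

Catalog = List[Tuple[str, "re.Pattern"]]

# Hypothetical starting catalog; these are not the patterns enclawed-oss ships.
BASE: Catalog = [
    ("aws_access_key", re.compile(r"\bAKIA[0-9A-Z]{16}\b")),
    ("us_ssn", re.compile(r"\b\d{3}-\d{2}-\d{4}\b")),
]


def widen(catalog: Catalog, extra: List[Tuple[str, str]]) -> Catalog:
    """Return a new catalog with extra patterns appended.

    Existing entries are never edited or removed, so anything the old
    catalog flagged is still flagged by the widened one.
    """
    names = {name for name, _ in catalog}
    widened = list(catalog)
    for name, pattern in extra:
        if name in names:
            raise ValueError(f"append-only: {name!r} already exists")
        widened.append((name, re.compile(pattern)))
        names.add(name)
    return widened


def scan(catalog: Catalog, text: str) -> List[str]:
    """Names of all catalog patterns that match the outbound text."""
    return [name for name, rx in catalog if rx.search(text)]
```

Because entries are only added, detections can only grow; precision is not guaranteed by construction and has to be re-measured after each widening, which is what the reported "unchanged precision" refers to. A runtime with no catalog has no equivalent list to append to.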

Implications

Regulatory and Operational Compliance

Enclawed-oss is presented as aligned with the regulatory frameworks the paper considers (PCI DSS, HIPAA, FedRAMP, NIST 800-53, GDPR, ISO/IEC 27001/27002), because it provides the tamper-evident audit trail and gate-level detection those frameworks call for. OpenClaw, lacking these primitives, is non-compliant by inspection.

Runtime Security Guarantees

LLM refusal is an unreliable and stochastic proxy defense (zero guarantees on adversarial input, variable across models and retraining), underscoring the necessity for host-side deterministic shape detection. Only agentic-AI runtimes shipping the full detection primitives provide safety coverage against F1–F4 direct actions—crucial for any deployments actuating real-world devices.

Extension and Generalization

The current enclawed-oss release attains feature parity with OpenClaw (126 bundled extensions) and extends F1–F4 detection to formerly unsafe plugin classes (e.g., Discord, Telegram). The architectural claim is strictly comparative: any agentic-AI runtime failing to detect F1–F4 and lacking these primitives is architecturally obsolete in this context.

Practical Improvability

Improvability is structural: the inclusion of explicit, testable primitives allows incremental improvement and empirical measurement of additional detection coverage. The absence of such primitives in unhardened runtimes constrains remediation to total re-architecture.

Theoretical Speculation and Future Directions

The architectural primitives and biconditional audit regime provide a formal basis for extensible hardening in agentic-AI runtimes. Given the rapid evolution of generative models and adversarial input vectors, strictly architectural guarantees are likely necessary for long-term deployability and regulatory approval. Future work should extend coverage towards system-level attacks (TOCTOU, side-channels), semantic verification, and multi-agent collusion scenarios.

Conclusion

The empirical and architectural evidence in this study fulfills the comparative obsolescence definition: OpenClaw (and, by extension, unhardened agentic-AI runtimes) is strictly dominated by enclawed-oss on F1–F4 detection, feature parity, and practical adoptability. The gap is structural—remediation demands the wholesale addition of the seven primitives. Reviewers are invited to apply the harness to any candidate runtime; obsolescence is determined by explicit empirical absence of detection primitives and null recall. The minimum viable structure for safe agentic-AI runtimes is now formalized and reproducibly established.
