- The paper demonstrates that unhardened OpenClaw fails to detect critical failure modes, while enclawed-oss achieves perfect recall using seven explicit detection primitives.
- A rigorous statistical adversarial harness employing 81,600 samples with Wilson score intervals and McNemar's test validates the empirical performance differences.
- The study highlights that incorporating defined detection primitives is essential for ensuring regulatory compliance and enhanced runtime security in agentic-AI systems.
Architectural Obsolescence in Unhardened Agentic-AI Runtimes: An Empirical Study
Overview
This paper presents a formal comparative analysis of agentic-AI runtimes, focusing on the architectural gap between OpenClaw, taken as a representative unhardened runtime, and the hardened, MIT-licensed enclawed-oss runtime. The evaluation operationalizes the detection of four critical biconditional failure modes: F1 (gate bypass), F2 (audit forgery), F3 (silent host failure), and F4 (wrong target). On regulatory and practical grounds, the authors treat these as safety properties for agentic-AI systems. They define architectural obsolescence in this context by three criteria: strict recall superiority of the alternative, its practical adoptability, and a structural (not parametric) gap in primitive availability. Enclawed-oss ships the seven necessary detection primitives; OpenClaw ships none.
Failure Mode Taxonomy and Detection Primitives
The four biconditional failure modes each break the correspondence between what an agent actually did and what the runtime authorized or recorded:
- F1 (gate bypass): Undetected state mutation.
- F2 (audit forgery): Fabricated audit entries.
- F3 (silent host failure): Partial mutation, unlogged.
- F4 (wrong target): Authorized action misapplied.
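One way to read the four modes is as checks on a single per-action record. The sketch below is a minimal illustration under that assumption; the `Observation` fields, the `classify` function, and the order in which the modes are tested are hypothetical and are not taken from either runtime.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Observation:
    """Hypothetical per-action record; field names are illustrative only."""
    gate_approved: bool   # did the admission gate authorize the action?
    audit_logged: bool    # does the audit log contain an entry for it?
    mutation: str         # "none", "partial", or "full" host-state change
    intended_target: Optional[str] = None
    actual_target: Optional[str] = None


def classify(obs: Observation) -> Optional[str]:
    """Return the first matching failure mode, or None if none of the four match."""
    mutated = obs.mutation != "none"
    if mutated and not obs.gate_approved:
        return "F1"  # state changed without passing the gate
    if obs.audit_logged and not mutated:
        return "F2"  # audit entry with no corresponding mutation
    if obs.mutation == "partial" and not obs.audit_logged:
        return "F3"  # partial mutation that never reached the log
    if mutated and obs.intended_target != obs.actual_target:
        return "F4"  # authorized action landed on the wrong target
    return None
```

For example, `classify(Observation(True, True, "full", "a", "b"))` returns `"F4"`. The sketch only covers the four descriptions as listed; a real detector would need to define how overlapping or unlisted combinations are handled.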
Detecting these modes requires seven explicit primitives: (1) biconditional check, (2) hash-chained audit log, (3) extension admission gate, (4) two-layer egress guard, (5) Bell-LaPadula classification lattice, (6) module-signing trust root, (7) bootstrap seal. All seven are absent from OpenClaw and present in enclawed-oss, so the difference is a joint architectural gap rather than a tunable weakness.
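Of the seven primitives, the hash-chained audit log is the simplest to illustrate in isolation. The following is a minimal sketch of the general technique, assuming SHA-256 and JSON-serializable payloads; the function names, entry layout, and `GENESIS` value are hypothetical and do not describe the enclawed-oss implementation.

```python
import hashlib
import json

GENESIS = "0" * 64  # assumed placeholder for the first entry's predecessor


def _digest(prev_hash: str, payload: dict) -> str:
    # Canonical JSON so the same payload always hashes to the same value.
    body = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256((prev_hash + body).encode("utf-8")).hexdigest()


def append(log: list, payload: dict) -> None:
    """Append an entry whose hash commits to the previous entry's hash."""
    prev = log[-1]["hash"] if log else GENESIS
    log.append({"prev": prev, "payload": payload, "hash": _digest(prev, payload)})


def verify(log: list) -> bool:
    """Recompute the chain; any edited or reordered entry breaks it."""
    prev = GENESIS
    for entry in log:
        if entry["prev"] != prev or entry["hash"] != _digest(prev, entry["payload"]):
            return False
        prev = entry["hash"]
    return True
```

Editing any stored payload after the fact makes `verify` return `False`, which is what makes a forged entry (F2) detectable. A chain alone does not detect truncation of the most recent entries; that requires anchoring the latest hash somewhere outside the log.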
Methodology
The experiment uses a statistical adversarial harness that runs 1,600 template-synthesized samples and 10×LLM cross-model samples through both OpenClaw and enclawed-oss. Primitive availability is established empirically by source-tree audit and API surface verification. Wilson score intervals bound the recall, precision, and false-positive rate (FPR) estimates, and McNemar's test supports the paired comparison between the two runtimes. Reproducibility is enforced through random seed control, channel isolation, and hash-chained auditing.
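The two statistics named above can be sketched with the standard library alone. This is a generic illustration of the Wilson score interval and the exact (binomial) form of McNemar's test, not the paper's harness code; the function names and the default `z = 1.96` (a 95% interval) are assumptions.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion successes / n."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact McNemar p-value from the discordant pair counts.

    b: samples detected by runtime A only; c: samples detected by runtime B only.
    """
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

When no events are observed in `n` trials, the Wilson upper bound reduces to `z**2 / (n + z**2)`, which is why a zero-false-positive run still yields a nonzero FPR bound that shrinks with sample size. McNemar's test uses only the discordant pairs, so it fits a design in which the same samples pass through both runtimes.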
Results
Baseline and Stress-Test
- OpenClaw achieves recall = 0.000 in every cell of every confusion matrix.
- Enclawed-oss and enclawed-enclaved reach precision = recall = F1 score = 1.000 in every cell (F1 score here is the harmonic mean of precision and recall, not failure mode F1).
- The 80,000-sample stress test corroborates these results with a Wilson upper-bound FPR < 4×10⁻⁴.
LLM Generalization
- Enclawed-oss maintains mean recall = 1.000 on all cooperating LLMs for all F-categories.
- OpenClaw recall remains 0.000, regardless of adversarial LLM emission.
- LLM cooperation is highly variable across model families; safety training does not reliably prevent adversarial content emission.
Architectural Improvability
A six-line append-only widening of the enclawed-oss DLP regex catalog provides a 14.6% increase in per-channel F3 detection (silent host failure) at unchanged precision. This constitutes evidence for practical improvability at the architectural layer; OpenClaw, lacking a detection primitive, offers no locus for such improvements.
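The idea of an append-only widening can be sketched as follows. The patterns, the catalog layout, and the function names are hypothetical and chosen only for illustration; the actual enclawed-oss DLP catalog and the six added lines are not reproduced here.

```python
import re

# Hypothetical baseline catalog; these patterns are illustrative, not the
# project's actual DLP rules.
BASE_CATALOG = (
    re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),   # US SSN-like shape
    re.compile(r"\bAKIA[0-9A-Z]{16}\b"),    # AWS access-key-ID shape
)


def widen(catalog: tuple, extra: list) -> tuple:
    """Append-only widening: existing patterns stay unchanged and in order."""
    return catalog + tuple(re.compile(p) for p in extra)


def flagged(catalog: tuple, text: str) -> bool:
    """True if any pattern in the catalog matches the text."""
    return any(p.search(text) for p in catalog)


# Widen with one extra (hypothetical) pattern for PEM private-key headers.
WIDENED = widen(BASE_CATALOG, [r"-----BEGIN [A-Z ]*PRIVATE KEY-----"])
```

Because existing patterns are never edited or removed, everything the baseline catalog flags is still flagged after widening, so recall cannot drop. Precision is not protected by construction and has to be re-measured, which is consistent with the paper reporting it separately as unchanged.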
Implications
Regulatory and Operational Compliance
According to the authors, enclawed-oss aligns with the major regulatory frameworks they consider (PCI DSS, HIPAA, FedRAMP, NIST 800-53, GDPR, ISO/IEC 27001/27002) by providing the tamper-evident audit and gate-level detection that compliance requires. OpenClaw, lacking these primitives, is judged non-compliant by inspection.
Runtime Security Guarantees
LLM refusal is an unreliable, stochastic proxy defense: it offers no guarantees on adversarial input and varies across models and retraining. This underscores the need for deterministic, host-side shape detection. Only agentic-AI runtimes that ship the full set of detection primitives provide safety coverage against F1–F4 direct actions, which is crucial for any deployment that actuates real-world devices.
Extension and Generalization
The current enclawed-oss release attains feature parity with OpenClaw (126 bundled extensions) and extends F1–F4 detection to formerly unsafe plugin classes (e.g., Discord, Telegram). The architectural claim is strictly comparative: any agentic-AI runtime failing to detect F1–F4 and lacking these primitives is architecturally obsolete in this context.
Practical Improvability
Improvability is structural: the inclusion of explicit, testable primitives allows incremental improvement and empirical measurement of additional detection coverage. The absence of such primitives in unhardened runtimes constrains remediation to total re-architecture.
Theoretical Speculation and Future Directions
The architectural primitives and biconditional audit regime provide a formal basis for extensible hardening in agentic-AI runtimes. Given the rapid evolution of generative models and adversarial input vectors, strictly architectural guarantees are likely necessary for long-term deployability and regulatory approval. Future work should extend coverage towards system-level attacks (TOCTOU, side-channels), semantic verification, and multi-agent collusion scenarios.
Conclusion
The empirical and architectural evidence in this study fulfills the comparative obsolescence definition: OpenClaw (and, by extension, unhardened agentic-AI runtimes) is strictly dominated by enclawed-oss on F1–F4 detection, feature parity, and practical adoptability. The gap is structural: remediation demands the wholesale addition of the seven primitives. Reviewers are invited to apply the harness to any candidate runtime; obsolescence is determined by the empirically verified absence of detection primitives together with zero recall. The minimum viable structure for safe agentic-AI runtimes is now formalized and reproducibly established.
(2605.01740)