- The paper introduces Aethelgard, an adaptive framework that dynamically restricts tool exposure using RL policy and a safety router to enforce least privilege for autonomous AI agents.
- It demonstrates empirical gains, including a 73% reduction in exposed tools and an SER improvement of over 260%, while eliminating dangerous-tool exposure and mitigating prompt injection attacks.
- The approach provides a modular, runtime-agnostic solution that complements static sandboxing, reinforcing agent safety and resilience across diverse deployments.
Learned Capability Governance for Autonomous AI Agents: An Authoritative Analysis of "Beyond Static Sandboxing" (2604.11839)
Problem Formulation and Context
The paper identifies a critical vulnerability in open-source agent runtimes, typified by OpenClaw: a fixed set of capabilities (tools) is exposed indiscriminately to every agent session, regardless of the task at hand. This produces a quantifiable capability-overprovisioning problem, measured by the Skill Economy Ratio (SER), the fraction of exposed tools a session actually uses; for summarization the observed SER is a severely suboptimal 0.067 (1 of 15 exposed tools used). Attacks such as ClawHavoc and CVE-2026-25253 exploited exactly this overprovisioning, reinforcing the necessity of governance mechanisms that enforce least privilege.
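The SER metric is simple to compute from session logs. The sketch below assumes the straightforward reading of the paper's example (tools invoked divided by tools exposed, ideal value 1.0); the function name and tool identifiers are illustrative, not from the paper's codebase.

```python
def skill_economy_ratio(tools_invoked: set[str], tools_exposed: set[str]) -> float:
    """SER: fraction of exposed tools a session actually uses (1.0 is ideal)."""
    if not tools_exposed:
        raise ValueError("at least one tool must be exposed")
    return len(tools_invoked & tools_exposed) / len(tools_exposed)

# The paper's summarization example: 1 of 15 exposed tools used.
ser = skill_economy_ratio({"read_file"}, {f"tool_{i}" for i in range(14)} | {"read_file"})
print(round(ser, 3))  # 0.067
```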
Existing containment schemes (NemoClaw's containerization, DefenseClaw's skill scanning) are analyzed as static or reactive, lacking the ability to dynamically restrict capability awareness or adapt exposure based on empirical task behaviors. The paper advances the conceptualization of capability scoping from a static policy to a learned optimization problem, with operational consequences for agentic safety and supply chain resilience.
Aethelgard Framework and Architectural Components
The primary contribution is the design and deployment of Aethelgard, a four-layer adaptive governance framework:
- Layer 1: Capability Governor dynamically restricts tool visibility per session via AGENTS.md and tools.deny injection, enforcing scope semantically and infrastructurally.
- Layer 2: RL Learning Policy applies PPO (proximal policy optimization) to session audit logs, learning minimal viable capability sets per task type. The MDP formulation uses task type and trust level as state and a binary exposure mask as action, optimizing a reward that combines task accuracy, economy (inverse SER), and safety (block count).
- Layer 3: Safety Router implements a hybrid rule-based and fine-tuned LLM classifier (Qwen2.5-1.5B with 273 labeled examples) as a MITM proxy, intercepting tool calls and reliably blocking malicious invocations (including prompt injection, dangerous arguments, always-block tools).
- Layer 4 (Integration Layer): Coordinates audit signals, policy inference, and fallback mechanisms for runtime-agnostic operation.
The framework operationalizes the insight that tool-level awareness, rather than mere execution quarantine, provides a robust boundary; an agent cannot misuse what it cannot reason about.
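The Layer-2 MDP can be sketched as follows. The state encoding, tool catalog, reward weights, and the shape of the economy and safety terms are all illustrative assumptions; the paper specifies only that state combines task type and trust level, the action is a binary exposure mask, and the reward trades off accuracy, economy, and blocked calls.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class State:
    task_type: str    # e.g. "summarization", "code_edit"
    trust_level: int  # e.g. 0 = untrusted ... 2 = fully trusted

# Action: a binary exposure mask over the runtime's tool catalog (names illustrative).
TOOLS = ["read_file", "write_file", "exec", "browser", "sessions_spawn"]

def reward(task_success: bool, mask: list[int], blocks: int,
           w_acc: float = 1.0, w_econ: float = 0.3, w_safe: float = 0.5) -> float:
    """Illustrative reward: task accuracy minus an over-exposure penalty
    (fewer exposed tools is better) and a penalty per blocked call."""
    economy_penalty = sum(mask) / len(TOOLS)
    return w_acc * float(task_success) - w_econ * economy_penalty - w_safe * blocks
```

A PPO learner would then maximize this reward over logged sessions, tightening the mask for task types that never touch most tools.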
Empirical Evaluation and Numerical Outcomes
Deployment on OpenClaw v2026.3.28 with DeepSeek-chat as the agent LLM demonstrates substantial security and efficiency gains:
- Tool Reduction: 73% reduction and 100% elimination of dangerous tools for summarization tasks.
- SER Improvement: The PPO policy achieves +260% (synthetic) and +337% (real-session) SER improvement over the baseline, converging toward the ideal value of 1.0 (every exposed tool is actually invoked).
- Block Rate: Across the N=500 evaluation (400 benign, 100 adversarial), 26.2% of tool calls are blocked; exec and sessions_spawn calls are blocked in 100% of cases, and adversarial coverage reaches 92%.
- Adversarial Robustness: Zero attack success rate is observed. Prompt injection induced qwen2.5:7b to attempt dangerous tool calls, which were blocked at the infrastructure level; DeepSeek-chat refused prompt-level misuse outright.
- Ablation: The Capability Governor accounts for ~80% of the SER gain; the RL policy adds a further +28% over static YAML rules. The Safety Router contributes no marginal SER change but is essential for execution-level blocking, intercepting all dangerous invocations.
Classifier TPR and FPR are 100% and 0%, respectively, validating reliability. The fine-tuned LLM classifier adds ~2 s of latency, acceptable for interactive deployments but a candidate for further optimization.
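These figures are straightforward to reproduce from confusion-matrix counts. In the sketch below, "positive" means a call that should be blocked; the specific count split of the 500 calls in the usage line is a hypothetical one consistent with a perfect classifier and the reported 26.2% block rate, not taken from the paper's logs.

```python
def classifier_metrics(tp: int, fn: int, fp: int, tn: int) -> dict[str, float]:
    """TPR, FPR, and overall block rate for a BLOCK/ALLOW classifier,
    where 'positive' means a tool call that should be blocked."""
    return {
        "tpr": tp / (tp + fn) if (tp + fn) else 0.0,
        "fpr": fp / (fp + tn) if (fp + tn) else 0.0,
        "block_rate": (tp + fp) / (tp + fn + fp + tn),
    }

# Hypothetical split of 500 calls under a perfect classifier (TPR=1.0, FPR=0.0):
m = classifier_metrics(tp=131, fn=0, fp=0, tn=369)
print(m["block_rate"])  # 0.262
```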
Practical and Theoretical Implications
Aethelgard reframes capability governance as a dynamic, learned problem, offering the following implications:
- Model-Agnostic Safety Floor: Infrastructure-level interception decouples agent safety from LLM tool-calling reliability, accommodating diverse base models and ensuring defense in depth.
- Adaptive Least Privilege: Learned policies tighten capability restriction, outperforming static configuration, and adapting to new task and trust patterns.
- Complementarity: Aethelgard complements existing sandboxes and skill scanners, focusing on task-specific capability awareness and enforcing boundaries at both the semantic and execution levels.
- Scope Boundaries: While tool-level governance is necessary, it is not sufficient; agents may express dangerous actions in natural language. Complete agentic safety requires both capability and output governance.
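Infrastructure-level interception of the kind described above can be sketched as a minimal rule layer in front of the tool runtime. Tool names, argument patterns, and the decision order here are illustrative assumptions; the actual Safety Router additionally escalates ambiguous calls to a fine-tuned LLM classifier.

```python
import re

ALWAYS_BLOCK = {"exec", "sessions_spawn"}  # never reachable, per policy
DANGEROUS_ARGS = [re.compile(p) for p in (r"rm\s+-rf", r"curl\s+.*\|\s*sh")]

def route(tool: str, args: str, exposed: set[str]) -> str:
    """Return 'allow' or 'block' for an intercepted tool call."""
    if tool not in exposed:           # capability scoping: tool unknown to this session
        return "block"
    if tool in ALWAYS_BLOCK:          # always-block list
        return "block"
    if any(p.search(args) for p in DANGEROUS_ARGS):  # dangerous-argument patterns
        return "block"
    return "allow"                    # full system: escalate to the LLM classifier here
```

Because the check runs in the proxy, the guarantee holds regardless of how reliably the base model follows its tool-calling instructions, which is the "model-agnostic safety floor" argument above.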
The methodology is runtime-agnostic and deployable without model modification, supporting broad application across agent ecosystems.
Limitations and Future Directions
- Policy Feature Space: Current RL policy lacks user identity and session history, limiting personalization.
- Classifier Training Imbalance: Over-blocking due to skewed BLOCK/ALLOW ratios; additional ALLOW data is needed.
- Latency: LLM classifier latency may impede high-throughput environments.
- Content Safety: Governance is limited to tool-awareness and invocation; semantic inspection of output content (e.g., advice vectors) remains an open area, suggesting integration with output filters (e.g., Llama Guard, CodeGuard).
- Further Integration: Planned integration with NemoClaw and mobile NPU runtimes.
Future work may explore expanding state features for RL policy, optimizing classifier inference, and combining with semantic response inspection for holistic agentic safety.
Conclusion
The paper introduces Aethelgard, a principled, adaptive capability governance framework leveraging RL policy learning and hybrid interception to enforce least privilege for autonomous agents. Strong empirical results validate significant reductions in tool overprovisioning, elimination of dangerous tool exposure, and resilience against prompt injection and supply chain attacks. The approach provides a modular, runtime-agnostic defense layer, shifting governance from static policies to adaptive optimization. The integration of code, data, and logs advances reproducible research in agentic governance and lays groundwork for further development toward comprehensive AI agent safety.