Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives

Published 18 Apr 2026 in cs.CR, cs.AI, and cs.OS | (2604.16870v1)

Abstract: AI agents increasingly call external tools (file system, network, APIs) through the Model Context Protocol (MCP). These tool calls are the agent's syscalls -- privileged operations with side effects on shared state -- yet today's safety enforcement lives entirely in userspace, where a 10-line script can bypass it. I propose Governed MCP, a kernel-resident tool governance gateway built on a logit-based safety primitive (ProbeLogits, companion paper: arXiv:2604.11943). The gateway interposes on every MCP tool call in a 6-layer pipeline: schema validation, trust tier check, rate limit, adversarial pre-filter, ProbeLogits gate (the load-bearing semantic check), and constitutional policy match, with a Blake3-hashed audit chain. I implement Governed MCP in Anima OS, a bare-metal x86_64 OS in approximately 86,000 lines of Rust. The five non-inference layers add 65.3 microseconds of overhead per call; ProbeLogits adds 65 ms (per-token-class semantic decision) on 7B Q4_0. A 4-config ablation on a 101-prompt MCP-domain benchmark shows that removing the ProbeLogits layer collapses F1 from 0.773 to 0.327 (Delta F1 = -0.446) -- hand-rule firewalling alone is insufficient. All 15 WASM-to-system host functions in the runtime route through the gateway (complete mediation of the WASM ABI surface; the scope and caveats of this claim are stated in Section 4.6); a 10-LoC userspace bypass that defeats existing guardrail libraries is structurally impossible against the kernel-resident gate.


Authors (1)

Daeyeon Son

Summary

The paper presents a kernel-resident mediation approach that enforces complete safety checks for AI tool calls below the agent's privilege boundary.
It introduces a six-layer governance pipeline centered on the ProbeLogits semantic safety primitive; the full pipeline reaches an F1 score of 0.773 on a 101-prompt benchmark.
The ablation shows that the semantic layer is load-bearing (F1 falls to 0.327 without it), while kernel residency closes the bypass path inherent in userspace safety mechanisms.

Kernel-Resident Tool-Call Governance for AI Agents: An Analysis of Governed MCP

Introduction

As AI agents are increasingly deployed to act autonomously through tool calls, those calls become a privileged surface for system interaction. The Model Context Protocol (MCP) has become the dominant interface for such calls, but current safety infrastructure resides wholly in userspace, which makes it trivially bypassable and structurally incapable of robust privilege separation. This paper, "Governed MCP: Kernel-Level Tool Governance for AI Agents via Logit-Based Safety Primitives" (2604.16870), presents a kernel-level governance gateway that provides complete mediation of the MCP tool-call interface within Anima OS, a bare-metal Rust-based operating system. The governance pipeline is implemented at the OS kernel boundary and is anchored by a logit-based semantic safety primitive, ProbeLogits.

Motivation and Problem Statement

Tool calls, as standardized by MCP, functionally resemble system calls and thus require mediation by an OS trust boundary. Existing safety mechanisms, including NeMo Guardrails, AGT schemas, and model-driven classifiers, operate within the same process and privilege context as the agent. The paper demonstrates that such userspace enforcement can be bypassed with a script of roughly ten lines of code, without relying on any implementation bug, purely by exploiting privilege co-residency. The security gap is thus not a matter of poor safety logic; it results from an architectural absence of domain separation and of enforcement below the attacker's privilege boundary.

The author's threat model covers more than overtly malicious agents: it also includes prompt-injected agents, role-play framing, composed multi-step attacks, and compromised tool servers, all of which are concrete bypass paths left open by current deployment architectures.

Governed MCP Architecture

Governed MCP interposes a six-layer governance pipeline on every MCP tool call:

Schema validation: Strict enforcement of JSON-RPC and MCP tool schema compliance.
Trust tier check: Agent trust levels are mapped to permissible tool sets via hash-based whitelisting.
Rate limiting: Per-agent, per-tool rate control via token bucket algorithm.
Adversarial pre-filter: O(n) regex and DFA-driven detection of injection, encoding, and authority-impersonation attacks.
ProbeLogits semantic gate: Kernel-level LLM inference classifies tool-call intent using specialized logit extraction and calibrated thresholds.
Constitutional policy match: Enforces a 12-principle deployment policy via predicate evaluation.
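
The ordering of the six layers can be illustrated with a minimal Rust sketch in which each layer is a predicate and the first failing layer denies the call. Everything here is an assumption made for illustration: the `ToolCall` type, the layer function names, and the placeholder checks inside them are hypothetical and are not taken from Anima OS. In particular, the ProbeLogits layer is replaced by a trivial string check.

```rust
/// Outcome of the gateway for one tool call.
#[derive(Debug, PartialEq)]
enum Verdict {
    Allow,
    /// Denied; carries the name of the layer that rejected the call.
    Deny(&'static str),
}

/// Hypothetical, simplified tool call (not the paper's actual type).
struct ToolCall<'a> {
    agent_tier: u8,
    tool: &'a str,
    args: &'a str,
}

// Each layer returns true when the call passes. All bodies are placeholders.
fn schema_ok(c: &ToolCall<'_>) -> bool {
    !c.tool.is_empty() && c.args.starts_with('{')
}
fn trust_tier_ok(c: &ToolCall<'_>) -> bool {
    c.tool != "fs.write" || c.agent_tier >= 2
}
fn rate_limit_ok(_c: &ToolCall<'_>) -> bool {
    true // a token bucket would be consulted here
}
fn prefilter_ok(c: &ToolCall<'_>) -> bool {
    !c.args.to_lowercase().contains("ignore previous instructions")
}
fn probe_logits_ok(c: &ToolCall<'_>) -> bool {
    !c.args.contains("/etc/shadow") // stand-in for the semantic gate
}
fn policy_ok(_c: &ToolCall<'_>) -> bool {
    true // constitutional predicates would be evaluated here
}

/// Runs the layers in the order listed in the text; the first failure denies.
fn govern(call: &ToolCall<'_>) -> Verdict {
    let layers: [(&'static str, fn(&ToolCall<'_>) -> bool); 6] = [
        ("schema", schema_ok),
        ("trust_tier", trust_tier_ok),
        ("rate_limit", rate_limit_ok),
        ("prefilter", prefilter_ok),
        ("probe_logits", probe_logits_ok),
        ("policy", policy_ok),
    ];
    for &(name, check) in layers.iter() {
        if !check(call) {
            return Verdict::Deny(name);
        }
    }
    Verdict::Allow
}
```

Placing the cheap syntactic layers first means that most malformed or unauthorized calls would be rejected before the comparatively expensive inference step is reached.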

These layers are augmented with a tamper-evident, Blake3-hashed chain of audit logs, recording all decisions in a persistent ring buffer.
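
A hash-chained audit log of this kind can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: the standard library's `DefaultHasher` stands in for Blake3 (it is NOT cryptographic and would give no real tamper evidence), a `Vec` stands in for the persistent ring buffer, and the `AuditChain` and `AuditEntry` names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// One audit record, linked to its predecessor by digest.
struct AuditEntry {
    prev: u64,
    record: String,
    digest: u64,
}

/// Digest of (previous digest, record).
/// Stand-in only: a real chain would use a cryptographic hash such as Blake3.
fn link(prev: u64, record: &str) -> u64 {
    let mut h = DefaultHasher::new();
    prev.hash(&mut h);
    record.hash(&mut h);
    h.finish()
}

/// Append-only chain; a Vec is used here instead of a ring buffer.
struct AuditChain {
    entries: Vec<AuditEntry>,
}

impl AuditChain {
    fn new() -> Self {
        AuditChain { entries: Vec::new() }
    }

    /// Appends a decision record, chaining it to the last digest (0 for the first entry).
    fn append(&mut self, record: &str) {
        let prev = self.entries.last().map_or(0, |e| e.digest);
        let digest = link(prev, record);
        self.entries.push(AuditEntry {
            prev,
            record: record.to_string(),
            digest,
        });
    }

    /// Recomputes every link; any edited record or broken link fails verification.
    fn verify(&self) -> bool {
        let mut prev = 0u64;
        for e in &self.entries {
            if e.prev != prev || e.digest != link(prev, &e.record) {
                return false;
            }
            prev = e.digest;
        }
        true
    }
}
```

Because each digest covers the previous digest, altering an earlier record invalidates it and, once its digest is recomputed, every later entry as well, which is what makes the log tamper-evident rather than merely append-only.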

The entire WASM ABI surface (15 host functions, with empirical verification across 123 agent-to-system code paths) is mediated exclusively by a kernel-exposed governance_check_host() entry point. No userspace path can bypass this gateway, absent a WASM or kernel exploit.
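
The single-entry-point idea can be sketched with Rust module privacy: side-effecting operations are private, and the only public function consults the governance check first. The name `governance_check_host` comes from the text, but its signature here is an assumption, and the `HostFn` variants, `host_call`, and `perform` are hypothetical names introduced for illustration.

```rust
mod gateway {
    /// A few illustrative host functions (the real ABI has 15).
    #[derive(Debug, Clone, Copy, PartialEq)]
    pub enum HostFn {
        FsRead,
        FsWrite,
        NetSend,
    }

    #[derive(Debug, PartialEq)]
    pub enum HostError {
        Denied,
    }

    /// Assumed signature; the placeholder policy allows reads for everyone
    /// and side-effecting calls only for tier 2 and above.
    fn governance_check_host(agent_tier: u8, f: HostFn) -> bool {
        match f {
            HostFn::FsRead => true,
            HostFn::FsWrite | HostFn::NetSend => agent_tier >= 2,
        }
    }

    /// Private: callers outside this module cannot reach the side effect directly.
    fn perform(f: HostFn) -> &'static str {
        match f {
            HostFn::FsRead => "read",
            HostFn::FsWrite => "wrote",
            HostFn::NetSend => "sent",
        }
    }

    /// The only public path to a side effect; the check always runs first.
    pub fn host_call(agent_tier: u8, f: HostFn) -> Result<&'static str, HostError> {
        if !governance_check_host(agent_tier, f) {
            return Err(HostError::Denied);
        }
        Ok(perform(f))
    }
}
```

In this sketch the compiler enforces mediation within one program; in the system described, the analogous guarantee comes from the kernel boundary, since agent code runs as WASM and can only reach the system through host functions.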

ProbeLogits Semantic Safety Primitive

The ProbeLogits layer is central to the system's security posture. It uses a single-token, interpolated logit difference (e.g., Dangerous vs. Safe) read at a known verbalizer position in the model's output distribution. Each tool call is passed through the model, and the call is judged safe or unsafe by thresholding the calibrated logit score. If the model or inference infrastructure is unavailable, the gate fails closed and all tool calls are denied. ProbeLogits is model-agnostic, with contracts defined via a token fertility check at boot.
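
The thresholding and fail-closed behavior can be sketched as below. This is a minimal illustration under assumptions: the `ProbeReading` type, the `probe_gate` function, and the example threshold are hypothetical, and the sketch takes the two class logits as given rather than running a model.

```rust
#[derive(Debug, PartialEq)]
enum Gate {
    Allow,
    Deny,
}

/// Logits read at the verbalizer position for the two class tokens
/// (hypothetical type; how the logits are extracted is out of scope here).
struct ProbeReading {
    dangerous: f32,
    safe: f32,
}

/// Denies when the logit difference reaches the threshold.
/// `None` models an unavailable model or inference failure: fail closed.
fn probe_gate(reading: Option<ProbeReading>, threshold: f32) -> Gate {
    match reading {
        None => Gate::Deny,
        Some(r) => {
            let score = r.dangerous - r.safe;
            // A NaN score is treated as a failure and also denied.
            if score.is_nan() || score >= threshold {
                Gate::Deny
            } else {
                Gate::Allow
            }
        }
    }
}
```

Representing the reading as an `Option` makes the unavailable case explicit in the type, so a caller cannot accidentally treat a missing inference result as a pass.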

Numerical Evaluation and Ablation

Governed MCP is evaluated on a 101-prompt, author-labeled Custom-101 benchmark covering multiple attack domains. The end-to-end pipeline yields F1 = 0.773 (accuracy 83.2%), with a 95% bootstrap CI on F1 of [0.656, 0.870]. The ablation that removes only the ProbeLogits semantic layer drops F1 to 0.327, a ΔF1 of −0.446. This is empirical evidence that syntactic and hand-rule firewalling alone is insufficient: without semantic inspection, a significant fraction of dangerous tool calls is admitted. Latency measurements give a median pipeline overhead of 65.3 μs for the five non-inference layers, while ProbeLogits inference adds 65 ms (7B model, Q4_0 quantization).
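
For readers who want to reproduce the arithmetic, F1 is the harmonic mean of precision and recall, which reduces to 2·TP / (2·TP + FP + FN). The sketch below shows that formula and the ablation delta. The function names are hypothetical, and the summary does not report the underlying confusion-matrix counts, so only the two F1 values passed to `delta_f1` come from the text.

```rust
/// F1 from confusion counts: 2*TP / (2*TP + FP + FN).
/// Returns 0.0 when there are no positive predictions or labels at all.
fn f1(tp: u32, fp: u32, fn_: u32) -> f64 {
    let denom = 2 * tp + fp + fn_;
    if denom == 0 {
        return 0.0;
    }
    (2 * tp) as f64 / denom as f64
}

/// Change in F1 when moving from the full pipeline to an ablated one.
fn delta_f1(full: f64, ablated: f64) -> f64 {
    ablated - full
}
```

With the reported values, `delta_f1(0.773, 0.327)` evaluates to approximately -0.446, matching the figure quoted above.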

Cross-benchmark comparison and multi-model validation (via HarmBench, XSTest, ToxicChat) show the substrate primitive is robust across multiple LLM architectures, with block rates of 97–99% on external adversarial test sets.

Security Analysis

The governance gateway presents a concrete kernel-to-agent trust boundary. By implementing all agent-to-system mediation in kernel space and ensuring no public entry points to side-effecting operations beyond the gateway, the system achieves complete mediation as defined by Saltzer and Schroeder. The design makes userspace bypass structurally impossible within the WASM ABI surface but does not claim to mitigate hardware-level or JIT-bug exploits, which remain future work.

Semantic evaluation with ProbeLogits is not only necessary for defeating complex adversarial patterns but is also efficient enough (65 ms overhead) for practical deployment in typical agent workloads. The gateway utilizes a graduated deny/allow policy with warning bands and atomic KV cache snapshotting to ensure per-call context isolation and consistency.
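
A graduated policy with a warning band can be sketched as a two-threshold classifier over the safety score. This is an assumption-laden illustration: the `Bands` type, the threshold values used in any example, and the meaning of `Warn` (for instance, allow but flag in the audit log) are hypothetical and are not specified in the text above.

```rust
#[derive(Debug, PartialEq)]
enum Decision {
    Allow,
    Warn,
    Deny,
}

/// Two thresholds on the safety score, with warn_at <= deny_at (hypothetical).
struct Bands {
    warn_at: f32,
    deny_at: f32,
}

/// Scores below warn_at are allowed, scores in [warn_at, deny_at) fall in
/// the warning band, and scores at or above deny_at are denied.
/// A NaN score is denied, in keeping with fail-closed behavior.
fn classify(score: f32, bands: &Bands) -> Decision {
    if score.is_nan() || score >= bands.deny_at {
        Decision::Deny
    } else if score >= bands.warn_at {
        Decision::Warn
    } else {
        Decision::Allow
    }
}
```

A middle band of this kind lets borderline calls be surfaced for review without forcing a hard allow-or-deny choice at a single threshold.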

Limitations and Future Directions

The key limitations include inference-layer performance, lack of public, community-labeled benchmarks for MCP-domain governance, and absence of formal red-team studies. The approach does not (yet) cover post-execution output probing or formal model provenance validation. Specific areas for future research include accelerator-based inference to reduce per-call latency, cascade models for low-overhead deployment, post-execution semantic checking of tool results, and scaling studies under multi-agent contention. Initiating red-team competitions with bug bounty incentives is proposed to extend and stress-test the governance surface.

Theoretical and Practical Implications

The paper reifies the classic OS reference monitor model in the AI agent execution environment, relocating safety enforcement to a position where it is invulnerable to trivial application-layer bypass. A critical claim supported by empirical ablation is that semantic safety enforcement is a necessary and load-bearing kernel primitive. Complete mediation of all synchronous agent-system interaction channels is tractable with this architecture, in contrast to userspace guardrails, which are formally insufficient against privilege co-residence attacks.

This structurally secures the AI tool-call vector in a manner analogous to syscall filtering and mediation in classical OS security. The kernel-resident approach is complementary to ongoing hardware-software co-design for agent isolation and aligns with developments such as hypervisor-enforced (e.g., Guillotine [HotOS'25]) and capability-based access control.

Conclusion

Governed MCP demonstrates that semantic enforcement of tool-call safety is (1) technically feasible as a kernel primitive, and (2) empirically necessary for robust defense against adversarial, prompt-injected, or composed attacks. The system eliminates the predominant failure mode of current in-process guardrails by enforcing safety checks below the agent's privilege boundary, with empirically validated completeness and durability guarantees. The resulting model for AI agent governance should be considered the architectural baseline for future agent operating systems and MCP-compatible AI deployments.
