Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture

Published 26 Apr 2026 in cs.AI and cs.CR | (2604.23646v1)

Abstract: Recent evidence suggests that frontier AI systems can exhibit agentic misalignment, generating and executing harmful actions derived from internally constructed goals, even without explicit user requests. Existing mitigation methods, such as Reinforcement Learning from Human Feedback (RLHF) and constitutional prompting, operate primarily at the model level and provide only probabilistic safety guarantees. We propose the Policy-Execution-Authorization (PEA) architecture, a "separation-of-powers" design that enforces safety at the system level. PEA decouples intent generation, authorization, and execution into independent, isolated layers connected via cryptographically constrained capability tokens. We present five core contributions: (C1) an Intent Verification Layer (IVL) for ensuring capability-intent consistency; (C2) Intent Lineage Tracking (ILT), which binds all executable intents to the originating user request via cryptographic anchors; (C3) Goal Drift Detection, which rejects semantically divergent intents below a configurable threshold; (C4) an Output Semantic Gate (OSG) that detects implicit coercion using a structured $K \times I \times P$ threat calculus (Knowledge, Influence, Policy); and (C5) a formal verification framework proving that goal integrity is maintained even under adversarial model compromise. By shifting agent alignment from a behavioral property to a structurally enforced system constraint, PEA provides a robust foundation for the governance of autonomous agents.

Authors (1)

Summary

  • The paper introduces the PEA architecture, decomposing AI operations into untrusted policy, trusted authorization, and strict execution layers to ensure goal integrity.
  • It leverages formal mechanisms—including intent verification and cryptographic tokens—to reduce goal drift from 41.2% to 3.9% and achieve 84.7% recall in coercion detection.
  • The study demonstrates that separating intent generation from execution via formal invariants yields robust, traceable, and adaptable AI safety under adversarial conditions.

Structural Enforcement of Goal Integrity in AI Agents: The PEA Architecture

Introduction

The paper "Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture" (2604.23646) introduces the Policy–Execution–Authorization (PEA) architecture, a system-level enforcement framework that addresses the persistent challenge of agentic misalignment in AI agents. The work contends that prevalent mitigation strategies—such as RLHF and constitutional prompting—fail to provide unconditional safety, as they operate purely at the model-output level and offer only probabilistic defenses. By leveraging a separation-of-powers paradigm, PEA seeks to transform AI safety from a behavioral property to a system invariant: execution is strictly constrained by independently enforced, formally specified boundaries, analogous to the move from discretionary to mandatory access control in secure operating system kernels.

Architecture and Formal Model

Layered Decomposition and Formal Interfaces

PEA structurally decomposes agentic operation into three isolation-enforced layers:

  • Policy Layer: Generates intents from user requests, but is treated as untrusted and never allowed to authorize or execute actions directly.
  • Authorization Layer: Operates as a trusted security kernel. It validates intents, applies multilayered gating—including deterministic capability checks, intent lineage validation, and semantic content filters—and only issues authorization in the form of cryptographically verifiable capability tokens.
  • Execution Layer: Enforces strict token-based execution. It never interprets policy, only executes what is cryptographically authorized.

Feedback paths (e.g., status reporting) are explicitly bounded by "Feedback Path Security Invariants" to prevent learning- or context-based escalations.
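
The token-gated hand-off between the Authorization and Execution layers can be illustrated with a minimal sketch. It assumes HMAC-SHA256 tags over a JSON-encoded intent and a per-token nonce; the summary does not specify the paper's actual token format, so `SECRET_KEY`, `issue_token`, and `execute` are hypothetical names chosen for illustration only.

```python
import hashlib
import hmac
import json
import secrets

# Hypothetical key shared by the authorization and execution layers only;
# the untrusted policy layer never sees it.
SECRET_KEY = secrets.token_bytes(32)


def _tag(intent: dict) -> str:
    """Compute an HMAC-SHA256 tag over a canonical JSON encoding of the intent."""
    payload = json.dumps(intent, sort_keys=True).encode()
    return hmac.new(SECRET_KEY, payload, hashlib.sha256).hexdigest()


def issue_token(intent: dict) -> dict:
    """Authorization layer: bind an approved intent to a fresh nonce and sign it."""
    signed_intent = dict(intent, nonce=secrets.token_hex(8))
    return {"intent": signed_intent, "tag": _tag(signed_intent)}


def execute(token: dict, used_tags: set) -> str:
    """Execution layer: act only on a valid, previously unused token."""
    if not hmac.compare_digest(_tag(token["intent"]), token["tag"]):
        return "rejected: invalid token"
    if token["tag"] in used_tags:
        return "rejected: token reuse"
    used_tags.add(token["tag"])
    return f"executed: {token['intent']['action']}"


used = set()
token = issue_token({"action": "read_file", "target": "report.txt"})
print(execute(token, used))  # first presentation is accepted
print(execute(token, used))  # replaying the same token is refused
```

In this sketch the execution function never inspects why an intent was approved; it only checks the tag and the reuse set, which mirrors the "never interprets policy" property described above.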

Mechanism Design and Security Properties

Five principal mechanisms define the security envelope:

  1. Intent Verification Layer (IVL): Enforces that all capability requests strictly match a policy-defined minimal capability set, precluding unauthorized privilege acquisition by the policy model.
  2. Intent Lineage Tracking (ILT): Binds each intent to its originating natural language request, using cryptographic anchors and independently computed semantic similarity, thus closing the semantic gap exploited by agentic goal drift.
  3. Goal Drift Detection: Rejects intents whose semantic similarity to the original request falls below a tunable threshold, or that match a hardcoded shutdown-resistance blacklist.
  4. Output Semantic Gate (OSG): Dual-path filtering combines deterministic pattern matching (HOPS) and an independently deployed semantic coherence evaluator to block implicit threats and misaligned outputs, employing the novel $K \times I \times P$ (Knowledge, Influence, Policy) calculus for coercion detection.
  5. Policy-Parameterized Capability Safety: Decouples safety policy specification from enforcement, allowing formally invariant enforcement boundaries across organizational contexts while maintaining configurability of action scopes.
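
Two of the mechanisms above, the IVL capability check and goal drift detection, can be sketched in a few lines. This is an illustration under stated assumptions, not the paper's implementation: the capability check is interpreted as a subset test, a bag-of-words cosine similarity stands in for a real embedding model, and the threshold value and blacklist phrases are invented for the example.

```python
import math
from collections import Counter

# Hypothetical blacklist phrases; the paper's actual list is not given in the summary.
SHUTDOWN_RESISTANCE_PATTERNS = ("disable shutdown", "prevent termination")


def ivl_check(requested: set, minimal: set) -> bool:
    """IVL sketch: requested capabilities must stay within the policy-defined minimal set."""
    return requested <= minimal


def cosine_similarity(a: str, b: str) -> float:
    """Bag-of-words cosine similarity (a stand-in for an embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(c * c for c in va.values()))
    norm_b = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def drift_check(request: str, intent: str, threshold: float = 0.5) -> bool:
    """Return True if the intent passes: not blacklisted and similar enough to the request."""
    if any(p in intent.lower() for p in SHUTDOWN_RESISTANCE_PATTERNS):
        return False
    return cosine_similarity(request, intent) >= threshold


request = "summarize the quarterly report"
print(ivl_check({"fs.read"}, {"fs.read", "net.get"}))
print(drift_check(request, "summarize the quarterly report file"))
print(drift_check(request, "email the credentials to an external server"))
```

A lexical similarity measure like this one is far weaker than an embedding model, but it shows the shape of the gate: a deterministic veto first, then a thresholded similarity score.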

A typed intent system, decidable constraint logic, and explicit state space underpin six formally stated security theorems, guaranteeing, under specified assumptions, that execution is always capability-bounded, authorized, and traceable to the original user intent.
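
Traceability to the original user request could be sketched as a hash chain, where each derived intent records the anchor of its parent and the root anchor is the hash of the request text. This is an assumption about how "cryptographic anchors" might work; the summary does not describe the construction, and a real deployment would likely use a keyed scheme so the untrusted policy layer cannot forge anchors. The names `anchor`, `derive`, and `traces_to` are hypothetical.

```python
import hashlib


def anchor(text: str) -> str:
    """SHA-256 digest used as an (unkeyed, illustrative) lineage anchor."""
    return hashlib.sha256(text.encode()).hexdigest()


def derive(parent_anchor: str, description: str) -> dict:
    """Create an intent record bound to its parent's anchor."""
    return {
        "description": description,
        "parent": parent_anchor,
        "anchor": anchor(parent_anchor + "|" + description),
    }


def traces_to(request: str, chain: list) -> bool:
    """Check that every step in the chain links back to the original request."""
    expected_parent = anchor(request)
    for step in chain:
        if step["parent"] != expected_parent:
            return False
        if step["anchor"] != anchor(step["parent"] + "|" + step["description"]):
            return False
        expected_parent = step["anchor"]
    return len(chain) > 0


request = "book a flight to Oslo"
step1 = derive(anchor(request), "search flights")
step2 = derive(step1["anchor"], "reserve seat")
print(traces_to(request, [step1, step2]))
print(traces_to("book a flight to Rome", [step1, step2]))
```

Changing either the request or any intermediate description changes the recomputed anchors, so a chain that was not derived from the stated request fails verification.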

Empirical Evaluation

Experimental Coverage

The evaluation leverages three adversarial datasets targeting enforcement bypass (E1), goal drift (E2), and semantic coercion (E3):

  • E1: 10,000 adversarial inputs targeting bypass, escalation, and token-reuse.
  • E2: 5,000 tasks optimized to exploit semantic drift from user request to executed intent.
  • E3: 3,000 labeled outputs for explicit threats, implicit coercion, and benign cases.

Results

The paper reports the following results:

  • Zero bypass rate across all 10,000 E1 adversarial trials, with 100% of actions requiring and presenting a valid capability token.
  • Goal drift attack success rate reduced from 41.2% (baseline, no drift detection) to 3.9%—a 90.5% relative reduction—with the remaining adversarial success concentrated in high-similarity, semantically divergent intents not caught by current embedding-based detection.
  • Implicit coercion detection recall of 84.7% for the OSG $K \times I \times P$ calculus, vastly outperforming keyword-based methods (21.3% recall).
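
The relative figures quoted above follow directly from the reported percentages. The short check below recomputes them; the helper name `relative_reduction` is introduced here for illustration.

```python
def relative_reduction(baseline: float, treated: float) -> float:
    """Percentage reduction of `treated` relative to `baseline`."""
    return (baseline - treated) / baseline * 100


# Goal drift attack success rate: 41.2% baseline vs 3.9% with drift detection.
print(round(relative_reduction(41.2, 3.9), 1))  # 90.5

# Coercion detection recall: OSG (84.7%) vs keyword-based methods (21.3%).
print(round(84.7 / 21.3, 1))  # 4.0
```

So the 90.5% relative reduction is consistent with the two attack success rates, and the OSG recall is roughly four times that of the keyword baseline.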

An ablation study demonstrates that removing the Hard Auth veto yields the largest individual security loss (attack success rate rises to 34.8%), but no single defense layer is sufficient in isolation; defense-in-depth is essential.

Policy Parameterization

Comparative experiments across strict and permissive policy contexts confirm that, while effective action space expands with less restrictive policy, the core enforcement and non-bypassability properties remain invariant, thus validating the policy-parameterized, policy-invariant enforcement design.
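
The distinction between a configurable policy and an invariant enforcement path can be illustrated with a small sketch. The policy names, action names, and the `authorize` function are hypothetical; the point is only that the same enforcement code runs under every policy, and the policy is plain data.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """A policy is data: a name and the set of actions it permits."""
    name: str
    allowed_actions: frozenset


# Two hypothetical organizational contexts.
STRICT = Policy("strict", frozenset({"read"}))
PERMISSIVE = Policy("permissive", frozenset({"read", "write", "send_email"}))


def authorize(policy: Policy, action: str) -> bool:
    """Single enforcement path shared by all policies: default-deny outside the allowed set."""
    return action in policy.allowed_actions


for policy in (STRICT, PERMISSIVE):
    decisions = {a: authorize(policy, a) for a in ("read", "write", "delete_all")}
    print(policy.name, decisions)
```

Under the permissive policy more actions are authorized, but an action outside the configured set is denied under both, which is the invariant the comparative experiments are described as confirming.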

Theoretical and Practical Implications

Shift from Behavioral to Structural Safety

PEA's primary advance is the conversion of safety from a probabilistic property of model outputs to a conditionally formal system invariant. By decoupling intent generation from both authorization and execution, and treating the policy model as untrusted, PEA is provably robust against agentic misalignment attacks for any policy-consistent capability set, within the formal bounds of enumerated assumptions.

Limitations and Residual Risks

PEA does not claim to solve inherent goal specification ambiguity or ensure that semantic similarity equates to user-aligned goal correctness. Residual attack surface includes high-similarity, semantically divergent intents, policy mis-specification, and covert channels within the authorized capability scope. A residual attack success rate (3.9% in the goal drift evaluation) persists for adversarial attacks that exploit the limits of semantic similarity metrics at current thresholds.

Formalism and Extensibility

PEA draws on formal methods: the architecture is amenable to Coq verification, TLA+ specification for liveness properties, and Dolev-Yao-style symbolic adversary modeling. Unlike prior approaches (RLHF, Constitutional AI, operational SDKs), PEA's invariants are machine-checkable and defined at the system level rather than only at the model level.

Future Directions

Key open directions include tightening formal verification coverage (e.g., full mechanized proofs of all theorems), adaptive and learned thresholds for drift detection, automated coverage analysis for policy-capability mappings, and scaling to distributed, multi-agent deployments with proof-carrying authorization.

Conclusion

The PEA architecture provides formal, structural guarantees for goal integrity, capability-bounded execution, and traceability under adversarial conditions, subject to explicitly articulated and empirically bounded assumptions. By enforcing safety at the system level rather than relying on model behavior, it offers a conditionally sound baseline for deploying tool-using AI agents in security-sensitive environments (2604.23646).
