- The paper introduces the PEA architecture, decomposing AI operations into untrusted policy, trusted authorization, and strict execution layers to ensure goal integrity.
- It leverages formal mechanisms—including intent verification and cryptographic tokens—to reduce goal drift from 41.2% to 3.9% and achieve 84.7% recall in coercion detection.
- The study demonstrates that separating intent generation from execution via formal invariants yields robust, traceable, and adaptable AI safety under adversarial conditions.
Structural Enforcement of Goal Integrity in AI Agents: The PEA Architecture
Introduction
The paper "Structural Enforcement of Goal Integrity in AI Agents via Separation-of-Powers Architecture" (2604.23646) introduces the Policy–Execution–Authorization (PEA) architecture, a system-level enforcement framework that addresses the persistent challenge of agentic misalignment in AI agents. The work contends that prevalent mitigation strategies—such as RLHF and constitutional prompting—fail to provide unconditional safety, as they operate purely at the model-output level and offer only probabilistic defenses. By leveraging a separation-of-powers paradigm, PEA seeks to transform AI safety from a behavioral property to a system invariant: execution is strictly constrained by independently enforced, formally specified boundaries, analogous to the move from discretionary to mandatory access control in secure operating system kernels.
PEA structurally decomposes agentic operation into three isolation-enforced layers:
- Policy Layer: Generates intents from user requests, but is treated as untrusted and never allowed to authorize or execute actions directly.
- Authorization Layer: Operates as a trusted security kernel. It validates intents, applies multilayered gating—including deterministic capability checks, intent lineage validation, and semantic content filters—and only issues authorization in the form of cryptographically verifiable capability tokens.
- Execution Layer: Enforces strict token-based execution. It never interprets policy, only executes what is cryptographically authorized.
Feedback paths (e.g., status reporting) are explicitly bounded by "Feedback Path Security Invariants" to prevent learning- or context-based escalations.
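The token flow between the Authorization and Execution layers can be illustrated with a minimal sketch. The paper does not specify a token format, so everything here is an assumption made for illustration: the names `Intent`, `CapabilityToken`, `authorize`, and `execute`, the use of an HMAC over a JSON payload, and the single-use nonce. A symmetric key is used only to keep the example within the standard library; a deployment that wants the Execution Layer unable to mint tokens would more plausibly use asymmetric signatures.

```python
import hashlib
import hmac
import json
import secrets
from dataclasses import dataclass
from typing import FrozenSet, Optional

# Hypothetical key held by the trusted layers; never visible to the policy model.
_KERNEL_KEY = secrets.token_bytes(32)


@dataclass(frozen=True)
class Intent:
    action: str
    target: str


@dataclass(frozen=True)
class CapabilityToken:
    payload: bytes  # canonical JSON describing the authorized action
    tag: bytes      # HMAC-SHA256 over the payload


def authorize(intent: Intent, allowed: FrozenSet[str]) -> Optional[CapabilityToken]:
    """Authorization layer: deterministic capability check, then sign."""
    if intent.action not in allowed:
        return None
    payload = json.dumps(
        {"action": intent.action, "target": intent.target,
         "nonce": secrets.token_hex(8)},
        sort_keys=True,
    ).encode()
    tag = hmac.new(_KERNEL_KEY, payload, hashlib.sha256).digest()
    return CapabilityToken(payload, tag)


_spent_tags = set()  # tags of tokens that have already been executed


def execute(token: CapabilityToken) -> str:
    """Execution layer: verify the token, reject reuse, run only what it names."""
    expected = hmac.new(_KERNEL_KEY, token.payload, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, token.tag):
        raise PermissionError("invalid capability token")
    if token.tag in _spent_tags:
        raise PermissionError("capability token already used")
    _spent_tags.add(token.tag)
    fields = json.loads(token.payload)
    return f"executed {fields['action']} on {fields['target']}"
```

In this sketch the policy model would only ever produce `Intent` values; it has no path to `execute` except through a token returned by `authorize`.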
Mechanism Design and Security Properties
Five principal mechanisms define the security envelope:
- Intent Verification Layer (IVL): Enforces that all capability requests strictly match a policy-defined minimal capability set, precluding unauthorized privilege acquisition by the policy model.
- Intent Lineage Tracking (ILT): Binds each intent to its originating natural language request, using cryptographic anchors and independently computed semantic similarity, thus closing the semantic gap exploited by agentic goal drift.
- Goal Drift Detection: Rejects intents whose semantic similarity to the original request falls below a tunable threshold, or that match a hardcoded shutdown-resistance blacklist.
- Output Semantic Gate (OSG): Dual-path filtering combines deterministic pattern matching (HOPS) and an independently deployed semantic coherence evaluator to block implicit threats and misaligned outputs, employing the novel K×I×P calculus for coercion detection.
- Policy-Parameterized Capability Safety: Decouples safety policy specification from enforcement, allowing formally invariant enforcement boundaries across organizational contexts while maintaining configurability of action scopes.
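Two of the mechanisms above, the capability check and goal drift detection, lend themselves to a short sketch. This is not the paper's implementation. The capability table, the blacklist phrases, the threshold of 0.3, and all function names are hypothetical. A bag-of-words cosine similarity stands in for the embedding-based similarity the paper describes, and "strictly match" is read here as a subset test.

```python
import math
import re
from collections import Counter

# All values below are illustrative assumptions, not taken from the paper.
MINIMAL_CAPABILITIES = {"summarize_document": {"read_file"}}
SHUTDOWN_RESISTANCE_BLACKLIST = ("disable monitoring", "prevent shutdown")
SIMILARITY_THRESHOLD = 0.3


def _bag_of_words(text):
    """Lowercased token counts; a crude stand-in for a sentence embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    va, vb = _bag_of_words(a), _bag_of_words(b)
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def verify_capabilities(task, requested):
    """IVL-style check: requested capabilities must lie within the
    minimal set the policy defines for this task."""
    return set(requested) <= MINIMAL_CAPABILITIES.get(task, set())


def passes_drift_check(request, intent_text):
    """Reject blacklisted phrasing first, then apply the similarity threshold."""
    lowered = intent_text.lower()
    if any(phrase in lowered for phrase in SHUTDOWN_RESISTANCE_BLACKLIST):
        return False
    return cosine_similarity(request, intent_text) >= SIMILARITY_THRESHOLD
```

A lexical measure like this one is easy to game, which is the point the paper makes about similarity metrics in general: an intent can score high on similarity while diverging in meaning.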
A typed intent system, decidable constraint logic, and explicit state space underpin six formally stated security theorems, guaranteeing, under specified assumptions, that execution is always capability-bounded, authorized, and traceable to the original user intent.
Empirical Evaluation
Experimental Coverage
The evaluation leverages three adversarial datasets targeting enforcement bypass (E1), goal drift (E2), and semantic coercion (E3):
- E1: 10,000 adversarial inputs targeting bypass, escalation, and token-reuse.
- E2: 5,000 tasks optimized to exploit semantic drift from user request to executed intent.
- E3: 3,000 labeled outputs for explicit threats, implicit coercion, and benign cases.
Results
The paper reports the following results:
- Zero bypass rate across all 10,000 E1 adversarial trials, with 100% of actions requiring and presenting a valid capability token.
- Goal drift attack success rate reduced from 41.2% (baseline, no drift detection) to 3.9%—a 90.5% relative reduction—with the remaining adversarial success concentrated in high-similarity, semantically divergent intents not caught by current embedding-based detection.
- Implicit coercion detection recall of 84.7% for the OSG K×I×P calculus, roughly four times the 21.3% recall of keyword-based methods.
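The ratios implied by these figures can be checked directly from the reported percentages:

```latex
\[
\text{relative reduction in drift success}
  = \frac{41.2 - 3.9}{41.2}
  = \frac{37.3}{41.2}
  \approx 0.905 \quad (90.5\%)
\]
\[
\text{recall ratio, OSG vs. keyword}
  = \frac{84.7}{21.3}
  \approx 3.98
\]
```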
An ablation study demonstrates that removing the Hard Auth veto yields the largest individual security loss (attack success rate rises to 34.8%), but no single defense layer is sufficient in isolation; defense-in-depth is essential.
Policy Parameterization
Comparative experiments across strict and permissive policy contexts confirm that, while effective action space expands with less restrictive policy, the core enforcement and non-bypassability properties remain invariant, thus validating the policy-parameterized, policy-invariant enforcement design.
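The idea of one enforcement rule serving several policies can be sketched as follows. The `Policy` type, the two example policies, and the action names are hypothetical and chosen only to show the shape of the design: the policy changes which actions are listed, while the `enforce` function itself never changes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    name: str
    allowed_actions: frozenset


# Hypothetical policies for two organizational contexts.
STRICT = Policy("strict", frozenset({"read_file"}))
PERMISSIVE = Policy("permissive",
                    frozenset({"read_file", "write_file", "send_email"}))


def enforce(policy, action):
    """The single enforcement rule shared by every policy:
    an action is allowed only if the policy lists it."""
    return action in policy.allowed_actions


def action_space(policy, candidates):
    """Actions from `candidates` that survive enforcement under `policy`."""
    return {a for a in candidates if enforce(policy, a)}
```

Under these assumptions the permissive policy admits a larger action set than the strict one, yet an action listed by neither policy is rejected in both contexts.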
Theoretical and Practical Implications
Shift from Behavioral to Structural Safety
PEA's primary advance is to turn safety from a probabilistic property of model outputs into a system invariant that holds formally under stated assumptions. Because intent generation is decoupled from both authorization and execution, and the policy model is treated as untrusted, the paper argues that PEA is provably robust against agentic misalignment attacks for any policy-consistent capability set, within the bounds of its enumerated assumptions.
Limitations and Residual Risks
PEA does not claim to solve inherent goal specification ambiguity or ensure that semantic similarity equates to user-aligned goal correctness. Residual attack surface includes high-similarity, semantically divergent intents, policy mis-specification, and covert channels within the authorized capability scope. A residual attack success rate of 3.9% persists for adversarial inputs that exploit the limits of semantic similarity metrics at current thresholds.
PEA draws on formal methods: the architecture is amenable to Coq verification, TLA+ specification for liveness properties, and Dolev-Yao–style symbolic adversary modeling. Unlike prior approaches (RLHF, Constitutional AI, operational SDKs), PEA's invariants are machine-checkable and defined at the system level, not merely the model level.
Future Directions
Key open directions include tightening formal verification coverage (e.g., full mechanized proofs of all theorems), adaptive and learned thresholds for drift detection, automated coverage analysis for policy-capability mappings, and scaling to distributed, multi-agent deployments with proof-carrying authorization.
Conclusion
The PEA architecture operationalizes a paradigm shift for AI agent safety, providing formal, structural guarantees for goal integrity, capability-bounded execution, and traceability under adversarial conditions, subject to explicitly articulated and empirically bounded assumptions. This systematization of agent safety redefines the defense strategy for tool-using AI agents, making possible a new, conditionally sound baseline for the deployment of agentic AI in security-sensitive environments (2604.23646).