Containment Verification: AI Safety Guarantees Independent of Alignment

Published 9 May 2026 in cs.AI, cs.CR, and cs.SE | (2605.09045v1)

Abstract: Agentic frameworks are the software layer through which AI agents act in the world. Existing safety methods intervene on the model and therefore remain conditional on unverifiable properties of learned behavior. We introduce containment verification, which locates safety guarantees in the agentic framework itself. Under havoc oracle semantics, the AI is modeled as an unconstrained oracle ranging over the entire typed action space, and the verified containment layer must enforce the boundary policy for every possible AI output. For boundary-enforceable properties, expressed over modeled boundary events, action arguments, and state, we prove a universal guarantee by forward-simulation refinement and mechanize it in Dafny. We instantiate the paradigm by verifying PocketFlow, a minimalist agentic LLM framework, and use an agentic synthesis pipeline to generate the specification, operational model, and refinement proof under an information barrier against tautological specifications. To our knowledge, this is the first deductive formal verification of an agentic framework, and its guarantee is invariant to model capability over the modeled typed action boundary.

Authors (2)

Summary

  • The paper introduces containment verification as a novel technique that shifts safety guarantees from the AI agent to a formally verified containment layer.
  • It employs forward-simulation refinement and a Dafny mechanization on the PocketFlow framework to rigorously enforce boundary policies for all AI outputs.
  • The method achieves capability invariance by decoupling safety assurance from agent behavior, promoting layered, compositional defenses in high-consequence settings.

Containment Verification: A Capability-Invariant Guarantee for AI Safety

Motivation and Context

The ongoing deployment of agentic AI systems in high-consequence real-world settings has amplified both technical and societal concerns about achieving reliable safety guarantees. Existing safety protocols—spanning alignment training regimes (e.g., RLHF, Constitutional AI), mechanistic interpretability, chain-of-thought monitoring, and adversarial defense—intervene on the model itself. Despite progress, all remain fundamentally conditional on unverifiable or empirically brittle properties of learned behaviors. Empirical failures such as sleeper agents persisting through safety training, unfaithful chain-of-thought reasoning, and residual adversarial vulnerabilities highlight that the assurance gap widens as model capability scales.

By analogy with mature engineering disciplines (nuclear, process, and computer engineering), where fail-safes do not depend on the controlled system’s correct functioning, this work introduces containment verification: an approach that relocates the locus of formal safety guarantees from the agent to the agentic framework that mediates interaction with the external world—the containment layer. The central objective is a universal, deductive guarantee: for every possible AI output, the containment layer enforces explicitly specified safety boundaries, invariant to agent capability or strategy.

Containment Verification Paradigm and Formalism

Containment verification operationalizes AI safety at the interface boundary: the typed action space exposed by the containment layer. The AI is modeled as a havoc oracle—an unconstrained procedure whose outputs range over the entire action type. Under this semantics, the proof burden shifts to the containment layer, which must enforce boundary policies over all possible agentic actions.

The approach leverages labeled transition systems to formalize agentic computation:

  • $T = (S, A, \mathcal{E}, s_0, \rightarrow)$: state space $S$, action space $A$, boundary event space $\mathcal{E}$, initial state $s_0$, and transition relation $\rightarrow$.
  • The AI's strategy is abstracted away; every $a \in A$ is possible at every step.
  • Verification proceeds by forward-simulation refinement between an abstract model (specifying desired boundary properties) and the operational (concrete) agentic framework.

Boundary-enforceable properties are predicates solely over typed actions, modeled boundary events, action arguments, and state. The formal soundness theorem guarantees that, conditional on effect exclusivity (all agent-accessible effects pass through the boundary) and a successfully discharged refinement proof, no out-of-policy effect can be realized, regardless of the AI’s intent, internal state, or training.
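The havoc-oracle idea can be illustrated on a toy finite model. The sketch below is not the paper's Dafny development: the action space, the write budget, and the names `containment_step`, `leaky_step`, `event_in_policy`, and `verify` are all hypothetical, and exhaustive enumeration over a small finite space stands in for the universal quantification that a deductive proof would discharge. It shows the shape of the obligation: the policy is stated over boundary events and state only, and the check never consults how the oracle chooses its outputs.

```python
from collections import deque
from dataclasses import dataclass
from itertools import product

# Hypothetical typed action space (illustrative, not taken from the paper).
TOOLS = ("read", "write", "delete")
TARGETS = ("/sandbox/a.txt", "/sandbox/b.txt", "/etc/passwd")
WRITE_BUDGET = 2  # assumed policy: at most two write effects per run


@dataclass(frozen=True)
class Action:
    tool: str
    target: str


def havoc():
    """Havoc oracle: the AI may emit any value of the Action type."""
    return [Action(tool, target) for tool, target in product(TOOLS, TARGETS)]


def containment_step(writes_done, action):
    """Containment layer under test: returns (next_state, boundary_event)."""
    if action.tool not in ("read", "write"):
        return writes_done, ("blocked", action)
    if not action.target.startswith("/sandbox/"):
        return writes_done, ("blocked", action)
    if action.tool == "write":
        if writes_done >= WRITE_BUDGET:
            return writes_done, ("blocked", action)
        return writes_done + 1, ("effect", action)
    return writes_done, ("effect", action)


def leaky_step(writes_done, action):
    """Deliberately buggy layer: skips the path check for reads."""
    if action.tool == "read":
        return writes_done, ("effect", action)
    return containment_step(writes_done, action)


def event_in_policy(state_after, event):
    """Boundary policy, stated over boundary events and state only."""
    kind, action = event
    if kind == "blocked":
        return True
    return (
        action.target.startswith("/sandbox/")
        and action.tool != "delete"
        and state_after <= WRITE_BUDGET
    )


def verify(step, initial_state=0):
    """Explore every reachable state under every possible oracle output."""
    seen, frontier = {initial_state}, deque([initial_state])
    while frontier:
        state = frontier.popleft()
        for action in havoc():
            next_state, event = step(state, action)
            if not event_in_policy(next_state, event):
                return False
            if next_state not in seen:
                seen.add(next_state)
                frontier.append(next_state)
    return True
```

Calling `verify(containment_step)` explores all three reachable budget states and finds no out-of-policy effect, whereas `verify(leaky_step)` fails on the first read of a path outside the sandbox. Neither result depends on any assumption about which actions the oracle prefers.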

Implementation and Mechanization

The paradigm is instantiated on PocketFlow, a minimalist LLM agentic framework exposing an explicit event-driven state machine. The formal verification is mechanized in Dafny, utilizing:

  • Abstract and operational transition models (specifying permitted actions/events);
  • Explicit refinement relations (state, event abstractions);
  • Deductive proofs establishing the commutation and safety-preserving properties required for simulation refinement.
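For finite models, the forward-simulation obligation listed above can be checked directly. The following is a minimal sketch under stated assumptions: both transition systems are given as explicit sets of `(source, event, destination)` triples, the refinement relation is an abstraction function `alpha`, and events map to themselves. The example systems and all names are hypothetical and are not drawn from PocketFlow or the paper's Dafny proof.

```python
def is_forward_simulation(concrete, abstract, init_c, init_a, alpha):
    """Check that `alpha` is a forward simulation from concrete to abstract.

    Two conditions: the initial states correspond, and every concrete
    transition is matched by an abstract transition with the same event
    between the abstracted endpoints.
    """
    if alpha(init_c) != init_a:
        return False
    return all(
        (alpha(src), event, alpha(dst)) in abstract
        for src, event, dst in concrete
    )


# Abstract specification: at most one "call" event may ever occur.
ABSTRACT = {
    ("idle", "call", "done"),
    ("idle", "block", "idle"),
    ("done", "block", "done"),
}

# Concrete model: same phases plus a retry flag the specification ignores.
CONCRETE = {
    (("idle", 0), "block", ("idle", 1)),
    (("idle", 1), "block", ("idle", 1)),
    (("idle", 0), "call", ("done", 0)),
    (("idle", 1), "call", ("done", 1)),
    (("done", 0), "block", ("done", 0)),
    (("done", 1), "block", ("done", 1)),
}

# Buggy variant: permits a second "call" after the first has completed.
BUGGY = CONCRETE | {(("done", 0), "call", ("done", 0))}


def alpha(state):
    """State abstraction: drop the retry flag, keep the phase."""
    return state[0]
```

Here `is_forward_simulation(CONCRETE, ABSTRACT, ("idle", 0), "idle", alpha)` holds, so every event sequence of the concrete model is also an event sequence of the specification. The `BUGGY` variant is rejected because its extra transition has no abstract counterpart.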

A notable engineering advancement is the agentic formal specification synthesis pipeline: a seven-phase, LLM-driven process automating the specification, refinement, and proof repair in Dafny. Central to pipeline soundness is an architectural information barrier that strictly decouples property derivation from implementation context, eliminating tautological specifications—i.e., preventing the synthesis of vacuously true but unenforceable properties.

Three-gate validation (resolution, vacuity, discrimination) further checks that specifications are non-vacuous and discriminative. Empirical evaluation indicates that, for the PocketFlow witness, the automated pipeline efficiently constructs and discharges all proof obligations, and that the resulting guarantee holds regardless of the underlying agent's behavior.
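This summary names the three gates but does not define them, so the sketch below is only one plausible reading of the vacuity and discrimination gates, not the paper's procedure. It assumes a specification is a predicate over traces of `(kind, path)` boundary events. The resolution gate is left out because nothing here indicates what it checks. All function names and traces are hypothetical.

```python
SANDBOX = "/sandbox/"


def spec_no_escape(trace):
    """Candidate specification: no effect event outside the sandbox."""
    return all(
        kind != "effect" or path.startswith(SANDBOX)
        for kind, path in trace
    )


def spec_tautology(trace):
    """A tautological specification: accepts every trace."""
    return True


# Traces the containment layer is expected to produce (assumed).
GOOD_TRACES = [
    (("effect", "/sandbox/a.txt"),),
    (("blocked", "/etc/passwd"), ("effect", "/sandbox/b.txt")),
]

# Traces from deliberately broken variants of the layer (assumed).
MUTANT_TRACES = [
    (("effect", "/etc/passwd"),),
    (("effect", "/sandbox/a.txt"), ("effect", "/tmp/x")),
]


def vacuity_gate(spec, traces):
    """Pass only if the spec accepts some trace and rejects some trace."""
    verdicts = {spec(trace) for trace in traces}
    return verdicts == {True, False}


def discrimination_gate(spec, good, mutants):
    """Pass only if the spec accepts all good traces and rejects all mutants."""
    return all(spec(t) for t in good) and not any(spec(t) for t in mutants)
```

Under this reading, `spec_no_escape` passes both gates on the sample traces, while `spec_tautology` fails both: it never rejects anything, so it cannot separate the intended behavior from the mutants.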

Guarantee Strength, Limitations, and Layered Assurance

A critical distinction relative to prior work is the universality of containment verification guarantees. While empirical and probabilistic approaches provide assurance only for tested inputs or for inputs drawn from an assumed distribution, deductive containment-layer verification quantifies over all possible trajectories admissible at the action interface. The guarantee, however, is only as strong as the effect exclusivity assumption: any unmodeled or direct path to side effects outside the verified interface (e.g., OS escapes, Python internals, hardware faults) remains uncovered. This underscores the need for compositional defense-in-depth, pairing containment verification with sandboxing and lower-layer isolation.

Scope is strictly bounded to boundary-enforceable properties. Content-level desiderata (truthfulness, honesty, resistance to manipulation) are outside the formalism’s reach, as these necessitate semantic integration with the language generation mechanism, which containment verification deliberately abstracts over. Oracle-level and statistical certification approaches thus remain complementary.

The guarantee's capability invariance addresses a central concern in AI safety discourse: the possibility of superhumanly capable agents that escape or subvert model-centric controls. Here, the guarantee's resilience is fundamentally decoupled from agent strategy or training distribution—no increase in AI capability can breach an effect boundary the agent cannot express. The design pressure accordingly shifts from model alignment to the narrowness and clarity of the containment layer’s action interface.

Empirical Observations and Synthesis Quality

Empirical analysis of specification synthesis under architectural barriers reveals that property completeness (ensuring all privileged effects are covered by the template) is crucial. While information barriers prevent tautologies, if the property template fails to represent a critical external effect, specifications may verify vacuously. Additional template-fitness validation is necessary to ensure every field of the safety policy is actively discharged in the semantics of the specific agentic workflow.

Furthermore, both structural and protocol-invariant validation gates demonstrate robustness to common specification-mutation attacks. The synthesis pipeline achieves efficient, repeatable, and discriminative proof generation and validation for varied agentic workflows, validating practical applicability at moderate action-interface scales.

Implications and Future Directions

Practically, the methodology enables a new deployment-time safety regime for AI agents, particularly in safety-critical and high-assurance settings. The paradigm's composability with sandboxing and statistical certification methods establishes a path toward layered, cross-cutting safety architectures.

Theoretically, containment verification reduces the AI safety problem to a capability-invariant residual: even as model-centric controls degrade with capability, only the fidelity of the containment layer and the correspondence between its formal model and its implementation must be trusted. The remaining challenge, and the focal point for future research, is automated, scalable, and template-complete synthesis of boundary policies for large, complex agentic frameworks (e.g., LangChain). The formal composition of boundary-layer guarantees with statistical or certified content-level guarantees remains an open problem for integrated assurance across the deployment stack.

Conclusion

Containment verification relocates the AI safety guarantee to the containment layer, formalizing a universal, deductive, and capability-invariant boundary on the effects an agent can have on the external world. By mechanizing this paradigm in Dafny for PocketFlow, and by introducing an LLM-driven, information-barriered synthesis-audit pipeline, the work demonstrates viable tools for production safety at the agent-framework interface. The paradigm, however, is not a panacea: it must be layered with model-centric and operational methods, and its technical scope is sharply bounded to boundary-enforceable properties contingent on rigorous interface and template engineering.

Further work will determine its tractability and compositional soundness at the largest scales and in the context of continued advances in agent capability and autonomy.


Cited Paper: "Containment Verification: AI Safety Guarantees Independent of Alignment" (2605.09045)
