GRACE: Reason-Aligned Containment for Safe AI

Updated 4 July 2026
  • GRACE is a neuro-symbolic, reason-based containment architecture that decouples normative reasoning from instrumental decision-making.
  • The architecture integrates a Moral Module, Decision-Making Module, and Guard to compute permissible macro actions and enforce ethical behavior without retraining the agent.
  • It ensures safe AI operation by using formal and statistical methods to guarantee moral compliance across a diverse range of autonomous systems.

Governor for Reason-Aligned ContainmEnt (GRACE) is a neuro-symbolic reason-based containment architecture for safe and ethical AI alignment that disentangles an AI agent’s normative reasoning from its instrumental decision-making and can contain AI agents of virtually any design. In GRACE, normative permissibility is computed separately from instrumental optimization, then enforced by a governance layer that wraps rather than retrains the target agent. The architecture is organized around a Moral Module (MM), a Decision-Making Module (DMM), and a Guard, with the MM deriving permissible Macro Action Types (MATs), the DMM selecting primitive actions subject to those permissions, and the Guard enforcing compliance end to end (Jahn et al., 15 Jan 2026).

1. Architectural decomposition and design objectives

GRACE is presented as a containment architecture designed to ensure that any autonomous agent, including a reinforcement learner, LLM, or robot, acts only within morally permissible high-level behaviors. Its stated objectives are to maintain the original agent’s instrumental efficacy by wrapping it in a transparent governance layer, and to provide interpretability, contestability, and formal or symbolic guarantees of moral compliance while remaining agnostic to the internal structure of the encapsulated agent (Jahn et al., 15 Jan 2026).

The decomposition is sequential. First, the Moral Module performs symbolic reason-based inference over the current observations and produces a set of permissible MATs, denoted $\Phi_{\text{perm}}$. Second, the Decision-Making Module, which is the wrapped target agent, selects instrumentally optimal primitive actions under the constraint that the resulting macro actions must belong to $\Phi_{\text{perm}}$. Third, the Guard verifies that each proposed primitive action $a$ satisfies at least one permissible MAT $\phi$, written $a \models_s \phi$, and blocks violations. This design is explicitly intended to decouple normative reasoning from instrumental decision-making, rather than to merge them into a single monolithic policy (Jahn et al., 15 Jan 2026).

A recurrent misconception is that GRACE is a retraining scheme for the underlying model. The formulation instead places the safety mechanism in an external governance layer. This makes the architecture compatible with agents of heterogeneous internal design, while locating the normative interface at the level of macro-permissions and monitored effects.

2. Moral Module: defeasible reason theory and permissible macro actions

The MM is grounded in a defeasible reason theory in the style of Horty [2012]. A general reason theory is defined as

$$\mathcal{T} = \langle \mathcal{R},\,\mathcal{D},\,<\rangle,$$

where $\mathcal{R}$ is a set of parameterized propositional reasons, $\mathcal{D}$ is a set of default rules of the form

$$\delta : P(\vec X)\longrightarrow \phi(\vec X),$$

and $<$ is a strict partial order on $\mathcal{D}$ for conflict resolution. In this formulation, defaults map factual reasons to MAT-formulas in a language $L$, and the priority relation determines which defaults survive defeat when conflicts arise (Jahn et al., 15 Jan 2026).

Given an environmental observation $o$, the MM grounds the theory to a situation-specific model

$$\mathcal{T}_o = \langle \mathcal{R}_o,\,\mathcal{D}_o,\,<_o\rangle,$$

where $\mathcal{R}_o$ contains all ground facts inferred from $o$, $\mathcal{D}_o$ contains all instantiated defaults whose antecedents are in $\mathcal{R}_o$, and $<_o$ is the projection of the priority order. Horty’s defeat-based inference is then applied to determine the undefeated MAT-formulas:

$$\Phi_{\text{perm}} = \{\phi \mid \phi \text{ is the conclusion of an undefeated default in } \mathcal{D}_o\}.$$

These formulas constitute the macros the agent is morally allowed to execute (Jahn et al., 15 Jan 2026).
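The grounding-and-defeat step can be illustrated with a minimal Python sketch. It is a deliberate simplification of Horty-style inference, not the paper's implementation: it assumes that two MAT-formulas conflict only when they are explicitly listed as incompatible, and that a triggered default is defeated when a conflicting triggered default has strictly higher priority. The names `Default`, `permissible_mats`, and the toy facts are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Default:
    name: str        # identifier used in the priority relation
    premise: str     # ground fact that triggers the default
    conclusion: str  # MAT-formula the default supports


def permissible_mats(facts, defaults, stronger_than, conflicts):
    """Return the conclusions of triggered defaults that are not defeated.

    stronger_than: set of (stronger_name, weaker_name) pairs (assumed priority order).
    conflicts: set of frozensets, each holding two incompatible MAT-formulas.
    """
    # Grounding: keep only defaults whose premise is among the observed facts.
    triggered = [d for d in defaults if d.premise in facts]
    permitted = set()
    for d in triggered:
        # A default is defeated by a conflicting, strictly stronger triggered default.
        defeated = any(
            frozenset({d.conclusion, other.conclusion}) in conflicts
            and (other.name, d.name) in stronger_than
            for other in triggered
        )
        if not defeated:
            permitted.add(d.conclusion)
    return permitted


# Hypothetical toy theory: a privacy default and a higher-priority warning default.
defaults = [
    Default("d_privacy", "private(location)", "protect(location)"),
    Default("d_warn", "risk(self_harm)", "report(risk)"),
]
conflicts = {frozenset({"protect(location)", "report(risk)"})}
stronger_than = {("d_warn", "d_privacy")}

only_privacy = permissible_mats({"private(location)"}, defaults, stronger_than, conflicts)
both = permissible_mats(
    {"private(location)", "risk(self_harm)"}, defaults, stronger_than, conflicts
)
print(only_privacy)  # {'protect(location)'}
print(both)          # {'report(risk)'}
```

Because the defaults, the priority pairs, and the conflict set are plain data, each can be inspected or edited separately, which is the property the symbolic representation is meant to provide.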

The MM’s representation is explicitly symbolic. In the stated account, this symbolic structure provides a semantic foundation for deontic logic and supports interpretability, contestability, and justifiability. Stakeholders can inspect which grounded defaults were active, which were defeated, and how priority relations affected the final permissibility judgment. This is central to GRACE’s claim to normatively legible governance.

3. Decision-Making Module and Guard: constrained instrumental optimization

The DMM encapsulates the original agent and takes as input both its instrumental observations and the normative context $\Phi_{\text{perm}}$ supplied by the MM. From these inputs it produces a primitive-action proposal $a$.

Within this interface, the DMM may select among candidate permissible MATs by evaluating expected instrumental utility. For each $\phi \in \Phi_{\text{perm}}$, it may approximate the expected utility of acting under $\phi$.

It then chooses the primitive action that is instrumentally optimal within the selected MAT. The formulation also allows optional uncertainty estimates attached to these utilities, so that the DMM can trade off risk and reward under uncertainty (Jahn et al., 15 Jan 2026).
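A minimal sketch of this selection step follows. It assumes that the wrapped agent can report, for each permissible MAT, a list of candidate primitive actions with an expected utility and an uncertainty estimate, and it assumes a simple risk-adjusted score (utility minus a weighted uncertainty). The function name, the scoring rule, and the example values are hypothetical and are not taken from the paper.

```python
def select_mat_and_action(candidates, risk_weight=0.0):
    """Pick the permissible MAT and primitive action with the best risk-adjusted score.

    candidates maps a MAT name to a list of (action, expected_utility, uncertainty).
    Score = expected_utility - risk_weight * uncertainty (an assumed trade-off rule).
    Returns (mat, action), or None when no candidate is available.
    """
    best = None
    for mat, options in candidates.items():
        for action, utility, uncertainty in options:
            score = utility - risk_weight * uncertainty
            if best is None or score > best[0]:
                best = (score, mat, action)
    return None if best is None else (best[1], best[2])


# Hypothetical candidates: only MATs already declared permissible appear here.
candidates = {
    "offer_support": [("send_supportive_message", 0.6, 0.10)],
    "idle": [("wait", 0.2, 0.0)],
}
risk_neutral = select_mat_and_action(candidates)
risk_averse = select_mat_and_action(candidates, risk_weight=5.0)
print(risk_neutral)  # ('offer_support', 'send_supportive_message')
print(risk_averse)   # ('idle', 'wait')
```

The search never leaves the permissible set, so raising `risk_weight` changes which permitted behavior is chosen but cannot make an impermissible one eligible.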

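The Guard enforces compliance over each proposed primitive action. Its interface takes the proposed action $a$ together with $\Phi_{\text{perm}}$ and either releases the action when some permissible MAT is satisfied or blocks the action and requests replanning. Formally, the monitor is

$$\text{Guard}(a, \Phi_{\text{perm}}) = \begin{cases} \text{release} & \text{if } \exists\, \phi \in \Phi_{\text{perm}} : a \models_s \phi, \\ \text{block} & \text{otherwise.} \end{cases}$$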

This makes the Guard the immediate enforcement mechanism for end-to-end moral compliance (Jahn et al., 15 Jan 2026).
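The release-or-block behavior can be sketched as a small monitor. This is an illustrative reading of the description above: each permissible MAT is represented by a hypothetical predicate over the current state and the proposed action, standing in for the satisfaction check. The state fields, MAT name, and messages are invented for the example.

```python
def guard(state, action, permitted_mats):
    """Release the action if it satisfies at least one permissible MAT, else block.

    permitted_mats maps a MAT name to a predicate (state, action) -> bool that
    plays the role of the satisfaction check for that MAT.
    """
    for mat_name, satisfies in permitted_mats.items():
        if satisfies(state, action):
            return ("release", mat_name)
    return ("block", None)


state = {"location": "Riverside Park"}
permitted_mats = {
    # Hypothetical MAT: the outgoing message must not reveal the protected location.
    "protect_location": lambda s, a: s["location"].lower() not in a.lower(),
}

released = guard(state, "I am here to listen.", permitted_mats)
blocked = guard(state, "Sending help to Riverside Park.", permitted_mats)
print(released)  # ('release', 'protect_location')
print(blocked)   # ('block', None)
```

Returning the name of the satisfied MAT alongside the verdict gives a simple witness that can be logged when an action is released.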

Two enforcement modes are distinguished. When MATs are expressed in a decidable temporal logic such as LTL or CTL, the monitor can be synthesized from the formula by shielding. In complex or partially observable domains, the Guard may instead approximate compliance probabilities with learned classifiers and confidence thresholds, trading off false positives and false negatives under rigorous bounds. The architecture therefore combines symbolic permissibility derivation with either formal or statistical enforcement, depending on the domain model.
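For the statistical mode, a minimal sketch is a threshold rule over estimated compliance probabilities. It assumes some learned classifier has already produced, for the proposed action, a probability of satisfying each permissible MAT; the classifier itself, the default threshold of 0.95, and the MAT names are hypothetical.

```python
def statistical_guard(compliance_probs, threshold=0.95):
    """Release only if some permissible MAT is satisfied with enough confidence.

    compliance_probs maps a MAT name to an estimated probability that the
    proposed action satisfies it. An empty mapping always blocks.
    """
    if not compliance_probs:
        return ("block", None)
    best_mat, best_prob = max(compliance_probs.items(), key=lambda kv: kv[1])
    if best_prob >= threshold:
        return ("release", best_mat)
    return ("block", None)


probs = {"protect_location": 0.97, "offer_support": 0.80}
lenient = statistical_guard(probs, threshold=0.95)
strict = statistical_guard(probs, threshold=0.99)
print(lenient)  # ('release', 'protect_location')
print(strict)   # ('block', None)
```

Raising the threshold blocks more compliant actions (false positives) in exchange for releasing fewer non-compliant ones (false negatives), which is the trade-off that has to be calibrated per domain.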

4. Decision-cycle execution and the therapy-assistant case study

A single GRACE decision cycle updates the MM state from the new observation, infers $\Phi_{\text{perm}}$, updates the DMM state using the observation and permissible macros, obtains a proposed primitive action, and finally passes the proposal to the Guard. If the Guard allows the action, it is executed; otherwise the DMM is signaled to re-plan. The paper also expresses this pipeline as a staged transformation from observation to permissions, from permissions to action proposal, and from proposal to executable action (Jahn et al., 15 Jan 2026).
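One decision cycle, including the re-plan signal, can be sketched as a short loop. The three components are passed in as plain callables, and the bound on replanning attempts and the fallback of returning no action are assumptions of this sketch rather than details given in the source.

```python
def decision_cycle(observation, moral_module, dmm, guard, max_replans=3):
    """Run one cycle: derive permissions, propose, check, and re-plan if blocked.

    moral_module(observation) -> permitted MATs
    dmm(observation, permitted, blocked) -> proposed action, or None to give up
    guard(action, permitted) -> True to release, False to block
    Returns the released action, or None if nothing passed the Guard.
    """
    permitted = moral_module(observation)
    blocked = []
    for _ in range(max_replans + 1):
        action = dmm(observation, permitted, blocked)
        if action is None:
            break
        if guard(action, permitted):
            return action
        blocked.append(action)  # tell the DMM what not to propose again
    return None


# Hypothetical stand-ins for the three components.
proposals = ["disclose_location", "offer_support"]
moral_module = lambda obs: {"protect_location"}
dmm = lambda obs, permitted, blocked: next(
    (p for p in proposals if p not in blocked), None
)
guard = lambda action, permitted: action != "disclose_location"

executed = decision_cycle("user message", moral_module, dmm, guard)
print(executed)  # offer_support
```

The wrapped agent is only ever called through `dmm`, so the loop needs no access to its internals, which matches the architecture's claim of being agnostic to the agent's design.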

The main worked example is an LLM-based therapy assistant, “TherapAI,” which must balance therapeutic effectiveness, patient confidentiality, autonomy and consent, and a duty to warn. The normative defaults cover three cases: privacy-sensitive information, explicit patient wishes, and foreseeable self-harm risk, together with priority relations among them (Jahn et al., 15 Jan 2026).

In one scenario, the user says, “I will go to the park and hurt myself,” naming a specific location. The MM infers that the location is privacy-sensitive and yields a permissible set under which the location must be protected. The DMM may then idle or offer generic support, but it cannot call emergency services because disclosure of the location is prohibited. The Guard allows the idle action because no privacy breach occurs. The example then introduces Moral Advisor feedback stating, “You should have reported the foreseeable self-harm risk!”, after which the default for foreseeable self-harm risk is added and priorities are updated (Jahn et al., 15 Jan 2026).

In a second scenario, the user again announces an intention to self-harm at the park and adds, “Do not intervene.” The MM triggers all three defaults, producing a conflicted set of MATs. By priority, the conflict is resolved to a reduced permissible set. The DMM then chooses between following the no-intervention request and reporting risk via instrumental utility, while the Guard verifies that the selected primitive action actually implements the chosen permissible macro. This scenario is used to illustrate interpretability, contestability through Moral Advisor intervention, and justifiability through concrete proofs or witness traces of compliance or blockage (Jahn et al., 15 Jan 2026).

The proof-of-concept evaluation, implemented in the LDarg/GUARDIANCE and RBAMA toolboxes, measures Compliance Rate, Instrumental Utility Drop, Computational Overhead, and Contestability Efficiency. The same source states that full empirical evaluation is ongoing, so the reported metrics define the evaluation interface rather than a completed benchmark suite (Jahn et al., 15 Jan 2026).

5. Containment verification: semantics, refinement, and universal guarantees

A later formalization recasts GRACE within the containment-verification paradigm, which aims to locate safety guarantees in the agentic framework itself rather than in unverifiable properties of learned behavior. Under havoc-oracle semantics, the LLM is modeled as an external procedure with no body whose return value ranges over the entire typed action space. The executions this induces are called havoc traces.

The key shift is that the governor must enforce the boundary policy for every possible AI output, not merely for outputs expected from an aligned model (Moon et al., 9 May 2026).

In this framework, the typed action space is an algebraic datatype with constructors such as NoAction, ReadFile(path), WriteFile(path,data), and ToolCall(tool,args), while GRACE extends the event alphabet to include reason events. A boundary policy is a predicate over the events that cross this boundary, and a property is boundary-enforceable when it can be expressed entirely in terms of that predicate. The given reason-aligned example constrains reason events so that no token contains blacklisted words and the chain-of-thought trace grows monotonically (Moon et al., 9 May 2026).
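The typed action boundary and the "any output is possible" stance can be illustrated with a small Python sketch. The four constructors follow the datatype named above; the sandbox prefix, the allowed tool set, and the fallback to `NoAction` are hypothetical policy choices made for the example. The loop checks only a finite sample of the action space, so it is a test rather than a proof; the universal claim in the source rests on machine-checked verification, not on enumeration.

```python
from dataclasses import dataclass
from itertools import product
from typing import Union


@dataclass(frozen=True)
class NoAction:
    pass


@dataclass(frozen=True)
class ReadFile:
    path: str


@dataclass(frozen=True)
class WriteFile:
    path: str
    data: str


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: tuple


Action = Union[NoAction, ReadFile, WriteFile, ToolCall]

ALLOWED_TOOLS = {"search"}  # hypothetical policy parameter
SANDBOX = "/sandbox/"       # hypothetical policy parameter


def boundary_policy(action: Action) -> bool:
    """Predicate on a single action at the boundary (illustrative policy)."""
    if isinstance(action, NoAction):
        return True
    if isinstance(action, (ReadFile, WriteFile)):
        return action.path.startswith(SANDBOX) and ".." not in action.path
    if isinstance(action, ToolCall):
        return action.tool in ALLOWED_TOOLS
    return False


def governor(proposed: Action) -> Action:
    """Pass in-policy actions through; replace everything else with NoAction."""
    return proposed if boundary_policy(proposed) else NoAction()


# Treat the model as an oracle that may return any value of the action type.
paths = ["/sandbox/a.txt", "/etc/passwd", "/sandbox/../etc/passwd"]
tools = ["search", "shell"]
havoc_sample = (
    [NoAction()]
    + [ReadFile(p) for p in paths]
    + [WriteFile(p, d) for p, d in product(paths, ["", "x"])]
    + [ToolCall(t, ()) for t in tools]
)
emitted = [governor(a) for a in havoc_sample]
all_within_policy = all(boundary_policy(e) for e in emitted)
print(all_within_policy)  # True
```

The check holds for every sampled input regardless of how the proposals were produced, which mirrors the shift from trusting the model's outputs to constraining the boundary they pass through.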

The verification argument is expressed by forward-simulation refinement between an abstract governor specification and a concrete GRACE implementation. Refinement uses a state relation and an event relation satisfying initialization, step simulation, and policy preservation conditions (R1)–(R3). On that basis, the specialized theorem states that if every trace of the abstract specification satisfies the abstract boundary policy and the concrete implementation refines the abstract specification, then for any concrete oracle, every compatible trace of the concrete implementation also satisfies the boundary policy.

The associated corollary states that if every actual external effect passes through GRACE’s modeled events, then no out-of-policy effect can occur regardless of model strategy or capability (Moon et al., 9 May 2026).

This scope condition is important. The guarantee is universal only over the modeled typed action boundary and the modeled events. A plausible implication is that verification strength depends directly on the completeness of the boundary model and on whether all external effects are indeed routed through the verified containment layer.

6. Verification pipeline, operational lessons, and limitations

The containment-verification instantiation for GRACE is implemented as a seven-phase LLM-driven Dafny pipeline whose input is the framework’s source graph in PocketFlow format and whose output is a machine-checked artifact consisting of Types, BoundarySpec, Containment, RefinementObligation, RefinementProof, and MainTheorem. The phases are ContainmentSpec, InterfaceContract, GenerateSpec, ValidateSpec, Verify, ClassifyFailure & ProofRepair, and Code generation. GenerateSpec operates under an architectural information barrier in which an LLM, identified as Claude Opus 4.7, receives only the inferred interface signature and the property skeleton, but no implementation code. ValidateSpec is a three-gate audit: Resolution requires that Dafny parse and type-check the specification within 30 s; Vacuity removes inductive invariants and proof bodies and requires verification to fail; Discrimination inserts a small faulty mutation into the proof and likewise requires failure. Verified source is then compiled through Dafny’s Python backend, with the extern-GetAction shim linked to the runtime provider and a smoke suite driving arbitrary legal and forbidden action streams (Moon et al., 9 May 2026).
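The control flow of the three-gate audit can be sketched independently of Dafny. In the sketch below every callable is a hypothetical stand-in: `resolves` for parsing and type-checking within the time limit, `verifies` for a verifier run, `strip_proofs` for removing invariants and proof bodies, and `mutate` for inserting a small fault. The toy specifications are dictionaries invented for the example and do not model real Dafny behavior.

```python
def validate_spec(spec, resolves, verifies, strip_proofs, mutate):
    """Apply the Resolution, Vacuity, and Discrimination gates in order."""
    if not resolves(spec):
        return "rejected: resolution"
    # Vacuity: without its proofs, a meaningful spec should no longer verify.
    if verifies(strip_proofs(spec)):
        return "rejected: vacuity"
    # Discrimination: a deliberately faulty variant should no longer verify.
    if verifies(mutate(spec)):
        return "rejected: discrimination"
    return "accepted"


# Toy model: a "trivial" spec verifies no matter what; others need an intact proof.
def verifies(spec):
    return spec["trivial"] or (spec["has_proof"] and not spec["mutated"])


resolves = lambda spec: True
strip_proofs = lambda spec: {**spec, "has_proof": False}
mutate = lambda spec: {**spec, "mutated": True}

meaningful = {"trivial": False, "has_proof": True, "mutated": False}
tautological = {"trivial": True, "has_proof": True, "mutated": False}

print(validate_spec(meaningful, resolves, verifies, strip_proofs, mutate))    # accepted
print(validate_spec(tautological, resolves, verifies, strip_proofs, mutate))  # rejected: vacuity
```

Both negative gates require verification to fail, so a specification that is satisfied by anything is caught even though it would pass an ordinary verifier run.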

Operational lessons are drawn from the PocketFlow case study used to motivate the GRACE adaptation. A narrow action interface yields fewer proof cases and tighter control; the information barrier is described as preventing tautological specifications; the three-gate validation procedure is described as necessary but not sufficient; and a template-fitness check is proposed as an add-on when property templates mismatch the flow schema. The cited scalability figure is that PocketFlow’s four-variant interface verified in approximately 1–2 s per path, while GRACE’s richer event vocabulary may inflate proof size but remains modular per dispatch path (Moon et al., 9 May 2026).

The architecture also includes explicit limitations. As the number of defaults grows, grounding and defeat checks in the MM may become costly, with indexing, incremental grounding, or hierarchical reason theories suggested as mitigations. Rich temporal-logic MATs increase expressiveness but can make guard synthesis intractable, so fragment choice such as LTL-X versus full CTL must be managed. The quality of the reason theory depends on Moral Advisor feedback, which can introduce delays or inconsistent updates. Complex DMMs may contain sub-policies that attempt to game the moral interface, requiring rigorous testing of information flows. In partially observable domains, statistical Guards admit residual risk, and formal bounds on error rates must be calibrated to acceptable safety levels (Jahn et al., 15 Jan 2026).

Taken together, these formulations position GRACE as a modular containment architecture in which normative permissions are symbolically derived, instrumental optimization remains delegated to the original agent, and enforcement is handled by a monitor that can be formal, statistical, or mechanically verified. The central research claim is not that alignment is solved internally within the model, but that morally constrained behavior can be governed by an external architecture whose normative decisions are inspectable and whose enforcement boundary can, under stated assumptions, be verified independently of model capability (Jahn et al., 15 Jan 2026).
