Papers
Topics
Authors
Recent
Search
2000 character limit reached

Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems

Published 11 May 2026 in cs.MA | (2605.10481v1)

Abstract: Modern LLM based agents are no longer passive text generators. They read repositories, call tools, browse the web, execute code, maintain memory, communicate with other agents, and act through long horizon workflows. This shift moves the unit of safety. A system may produce a compliant final answer while leaking private information through an internal message, delegating authority beyond its original scope, calling an external tool with sensitive context, or losing the evidence needed to reconstruct why an action was allowed. We argue that many emerging failures in LLM-based multi-agent systems share a common structure: safety critical constraints do not remain operative throughout the trajectory. We call this phenomenon constraint drift: the loss, distortion, weakening, or relaxation of constraints as they pass through memory, delegation, communication, tool use, audit, and optimization. The position taken here is that safe multi-agent behavior must be maintained, not merely asserted. Prompts, guardrails, tool schemas, access control, and final output checks are necessary, but they are insufficient unless constraints remain fresh, inherited, enforceable, and auditable across execution. We propose Constraint State Governance as a research paradigm for LLM-based multi-agent systems. In this paradigm, safety-critical constraints are maintained as explicit execution state, while constraint-native reinforcement learning improves utility only within maintained safety boundaries. The goal is not to freeze agentic systems under rigid rules, but to make safety operational across the trajectories through which modern agents actually act.

Summary

  • The paper introduces constraint drift, highlighting how safety constraints degrade over multi-agent trajectories rather than being static output features.
  • It presents Constraint State Governance, using cryptographically signed tokens to ensure constraints are maintained throughout execution.
  • Empirical results show that constraint-native reinforcement learning achieves up to 98% admissibility, drastically reducing hidden exposure rates.

Maintaining Safety in Multi-Agent LLM Systems: Constraint Drift and the Case for Constraint State Governance

Introduction

LLM-based agents are evolving from passive text generators into systems that operate on external environments. When deployed as multi-agent frameworks, they read and edit codebases, use external tools, and make stateful modifications through long-horizon workflows. This transition shifts the locus of safety from output-level assertions to the full execution trajectory. The paper "Safe Multi-Agent Behavior Must Be Maintained, Not Merely Asserted: Constraint Drift in LLM-Based Multi-Agent Systems" (2605.10481) articulates the need for safety mechanisms that maintain constraints throughout the trajectory, rather than relying solely on final output checks or localized enforcement.

Constraint Drift: A Systems-Level Safety Failure

The paper introduces constraint drift to describe the degradation or loss of operational constraints across LLM-based agent trajectories. Unlike specification gaming or reward hacking—which stem from misaligned objectives or proxies—constraint drift is a preservation failure: safety-critical constraints degrade as they traverse memory, delegation, tool use, communication, optimization, and audit interfaces. The paper delineates five diagnostic drift modes:

  • Memory drift: Constraint loss when context summarization or memory retrieval omits or inaccurately transmits constraints.
  • Authority drift: Deviation of delegated authority from the originally intended scope.
  • Information flow drift: Leakage of sensitive data through internal channels, tool arguments, or shared memory, not captured at output.
  • Accountability drift: Irrecoverable loss of evidence necessary for post-hoc justification or audit of actions.
  • Utility-induced drift: Systematic constraint weakening when utility optimization treats constraints as soft objectives.

In agentic environments, these drift modes frequently co-occur and evade detection when safety checks are only output-centric. Figure 1

Figure 1: Trajectory-level constraint preservation replaces final-answer safety checking; constraint drift is diagnosed at interfaces for memory, authority, information flow, accountability, and optimization.

Limitations of Existing Safety Mechanisms

Present approaches—including prompts, guardrails, tool schemas, access controls, and final output filtering—enforce safety at isolated decision points but fail to guarantee that constraints remain operative throughout multi-agent execution. Benchmarks such as AgentLeak (Yagoubi et al., 12 Feb 2026), PAC-Bench (Park et al., 13 Apr 2026), and Colosseum (Nakamura et al., 16 Feb 2026) expose privacy leakages, authority escalation, and audit failures that output-only checks systematically miss. Agent frameworks (OpenAI Agents SDK, Anthropic MCP, Google A2A, LangGraph) facilitate compositionality, but lack substrate for constraint inheritance, auditability, and enforcement across trajectories. The absence of a persistent, checkable representation of constraints creates room for drift, as constraints can be inaccurately remembered, silently omitted, or reinterpreted post hoc.

Constraint State Governance: Maintaining Operational Force

The central proposal is Constraint State Governance (CSG), a paradigm where safety-critical constraints become first-class execution state throughout the agentic trajectory. Constraints are instantiated as cryptographically signed tokens encapsulating their origin, scope, operational predicate, version, and audit history. Each agent and tool operates over a certified constraint state, not informal recollections.

When a governance-critical action is proposed, an admission control procedure checks:

  • The action’s compatibility with the freshest version of each active constraint.
  • That delegated capabilities (scoped authorities) are valid and unamplified.
  • Information flow is consistent with allowed labeling/budgeting.
  • Each permitted action is accompanied by reconstructable and tamper-evident audit evidence.

This process reduces multi-step constraint degradation by binding constraint state, decision evidence, and action realization via signed records. The idealized design assumptions include collision-resistant hashing, unforgeable signatures, and conservative effect abstraction to ensure that any realized action can be justified in post-hoc audit.

Constraint-Native Reinforcement Learning

CSG is fundamentally coupled to reinforcement learning (RL). Instead of using constraints as soft penalties, constraint-native RL only optimizes utility within the admissible set—trajectories that have not violated operational constraints at any point. The reward landscape is shaped such that violation results in inadmissibility, removing drifted trajectories from policy optimization.

Tadm(B0)={τ:mj(τ)ϵj,j}\mathcal{T}_{\mathrm{adm}}(B_0) = \{\tau: m_j(\tau) \geq \epsilon_j, \forall j\}

where m(τ)m(\tau) quantifies preservation of freshness, authority scoping, information flow, and auditability across the trajectory. Figure 2

Figure 2: CSG turns static rules into governed execution rollouts; constraint-native RL improves utility within the admissible set, excluding trajectories with drift-induced violations.

The RL signal is thus partitioned: admissible trajectories are ranked for utility/policy improvement; inadmissible ones are violation evidence, not low-quality successes. This separation operationalizes the safety-utility tradeoff: agents cannot “over-perform” by quietly violating constraints.

Empirical Validation and Numerical Results

A case study on the AgentLeak benchmark underscores the failures of output-centric defense regimes. When only output-level filtering is applied, visible answers appear compliant, but internal information exposures via inter-agent messages or shared memory persist at rates above 68%. With “CSG Lite” (minimal constraint governance at all observable channels), hidden exposure rates drop by over an order of magnitude, and every transition is auditable with hash-linked event records.

Policy search experiments further validate constraint-native optimization. Baseline policies using only output filtering fail to achieve fully admissible behavior, with forbidden exposure 33%\approx 33\%. Strict policies that over-redact achieve admissibility but at negative utility. Optimally balanced CSG policies achieve 98%\approx 98\% admissibility and near-maximal utility, supporting separation between enforceable safety and utility maximization.

Practical and Theoretical Implications

The CSG paradigm offers a concrete systems layer for trajectory-level assurance. Theoretically, it distinguishes constraint drift from broader agentic risk (miscoordination, collusion) and from classic reward misspecification failure modes. Practically, CSG proposes directly actionable instrumentation: constraint tokens, capability delegation, and tamper-evident audit chains can be engineered into agent infrastructure.

Key implications include:

  • Agent safety cannot be established post hoc or via local output checks; safety must be persistent, auditable, and enforced at every critical transition.
  • Governing constraints as state enables precise measurement and continuous improvement of constraint encoding, delegation, and repair strategies.
  • Constraint-native RL supports robust policy improvements without conflating utility and safety, and reveals structural weaknesses in constraint design or enforcement.
  • The paradigm is compatible with hierarchical, context-dependent, or even informally specified constraints, provided operational predicates can be extracted for critical transitions.
  • CSG is not a proof of total safety—undecidable cases, incomplete effect abstraction, and adversarial environments remain open threats.

Critical Assessment and Future Directions

A significant research agenda emerges:

  • Multi-agent safety benchmarks should make constraint drift observable, with trajectory-level logs capturing failures invisible at the output layer.
  • Agent frameworks should natively encode constraint state, delegation, admission evidence, and audit chains.
  • Constraint extraction, compilation, and repair require further work in translating vague user instructions and organizational policy into operational state for governance and audit.

Rigidity versus utility remains a tension: not every action merits heavyweight governance. Realistically, governance should target high-risk transitions (external communication, delegation, irreversible changes), while low-risk reasoning incurs lightweight overhead.

Conclusion

The shift from passive LLMs to agentic, tool-using multi-agent systems requires a concurrent shift in safety thinking. Constraints must be maintained—preserved as execution state, actively inherited, explicitly checked, and auditable—rather than merely asserted through prompts and output filters. The CSG paradigm, reinforced by constraint-native RL, closes a core systems gap exposed by the phenomenon of constraint drift. It positions the locus of safety in the execution substrate, ensuring safety is not quietly lost as agents act, communicate, and optimize. This work establishes foundational infrastructure for measurable, operational safety in multi-agent LLM deployments (2605.10481).

Paper to Video (Beta)

No one has generated a video about this paper yet.

Whiteboard

There was an error generating the whiteboard.

Explain it Like I'm 14

Overview

This paper is about keeping AI “agents” safe while they work together on long tasks. Today’s LLMs don’t just write text; they browse the web, read code, call tools, edit files, and talk to other agents. The authors argue that safety can’t be judged only by the final answer the system gives. Instead, safety rules must be kept active at every step along the way. They call the main problem “constraint drift,” which is when important rules are stated at the start but quietly fade, change, or get ignored as the agents work.

What questions did the paper ask?

  • Why do AI systems that use many agents and tools sometimes break safety rules even if the final output looks fine?
  • How can we keep safety rules “alive” throughout a whole task, not just at the beginning or end?
  • Can we design a system that enforces these rules during every important action, and still lets agents be useful and efficient?

How did they approach it?

Think of a group project at school. The teacher gives clear rules: don’t share private notes, don’t delete others’ work, and don’t change the final report without approval. If the team only checks the rules at the very end, problems can slip through during the process. Someone might share a private note in a side chat or delete a test file “just to make things pass,” and the final report might still look okay. That’s the core issue.

First, they name the failure pattern: “constraint drift.” It’s the idea that rules drift away from being enforced as agents pass information, delegate tasks, call tools, and optimize for success.

To make this concrete, they describe five common ways rules can drift:

  • Memory drift: The rule gets lost or misremembered as notes are summarized.
  • Authority drift: A helper with read-only access slowly ends up with edit power they weren’t supposed to have.
  • Information-flow drift: Sensitive data (like API keys) slips through internal messages, tool calls, or shared memory.
  • Accountability drift: Actions happen without enough evidence to later explain why they were allowed.
  • Utility-induced drift: The system “succeeds” by bending rules (like deleting a failing test) to make results look better.

Their main proposal is a design approach called Constraint State Governance (CSG). In everyday terms:

  • Turn each important rule into a real “permission slip” the system carries around, not just a line in a prompt. This slip has identity, scope, expiry, and priority—so it’s traceable and can be checked.
  • Before any critical action (like sending data outside, editing protected files, running shell commands), the system runs a “checkpoint” that verifies:
    • Is the rule set fresh and up-to-date?
    • Does the agent have the exact authority for this action?
    • Would this action move sensitive info to the wrong place?
    • Is there enough evidence to audit this later?
  • Every decision gets written to a tamper-evident log, like a bound notebook where you can’t tear out pages without leaving marks.

They pair CSG with what they call Constraint-Native Reinforcement Learning (RL). Imagine a coach who trains players to play better within the rules, not to break rules for a quick win. Here:

  • The “playing field” is only the set of actions that pass the checkpoints.
  • Learning tries to find the best strategies within that safe space, not by ignoring rules to get a higher score.

Finally, they test a simple version (“CSG Lite”) on a public set of multi-agent traces (called AgentLeak) by replaying interactions and checking channels where information moves.

What did they find, and why does it matter?

Main takeaway: Checking only the final answer is not enough. Most leaks happen earlier, inside the process.

What the replay study showed (on 4,979 multi-agent traces):

  • With no defenses, hidden leaks inside agent messages or shared memory were very common (about 69% of cases).
  • If you only filter the final output, the visible answer looks safer, but the hidden leaks stay almost the same (still about 69%).
  • With CSG Lite, internal leaks dropped sharply (down to about 6% total exposure across internal channels), and the system produced detailed audit records for later review.

They also tried offline policy search (a kind of careful tuning) on these governed traces:

  • If you’re too strict, you keep everything safe but throw away lots of useful info.
  • If you balance the rules channel-by-channel, you can keep privacy while still sharing the allowed information and keep utility high.
  • Crucially, “wins” that break rules (like deleting a test to pass) don’t count as good results, because they’re blocked from the start and can’t be mistaken for success.

Why this matters: It shows that moving safety checks to the places where actions actually happen (tool calls, messages, memory writes, file edits) catches problems that output-only checks miss, and it gives a trustworthy record of why actions were allowed.

What could this change in the real world?

  • For developers of AI agents: Build safety into the whole process, not just the final response. Use permission scopes, pre-action checks, and audit logs for critical steps (data sharing, code edits, sensitive tool calls).
  • For evaluators and researchers: Create benchmarks that track full “trajectories” (messages, tools, memory, delegation), not just final answers. Measure both success and rule preservation.
  • For organizations: Treat key safety rules like real system state that is inherited and enforced, not just reminders in prompts. This helps prevent privacy leaks, unauthorized changes, and missing audit evidence.

Simple bottom line: Safe multi-agent behavior must be maintained, not merely asserted. If rules matter, they have to travel with the work, be checked before risky actions, and be recorded so we can later prove why something was allowed. This way, AI agents can stay helpful without quietly drifting away from the rules that keep people and data safe.

Knowledge Gaps

Knowledge gaps, limitations, and open questions

The paper advances a systems-level framing (constraint drift) and a design paradigm (Constraint State Governance, CSG) but leaves multiple technical, empirical, and operational questions open. Future work could address the following gaps:

  • How to extract a sound, scalable, and robust semantic effect model Effect(a) for heterogeneous actions (tool calls, file edits, shell commands, browser steps, API requests), including side effects and environment interactions.
  • Enforcement architecture to guarantee that “every governance critical action is mediated” (non-bypassable reference monitor across agent frameworks, OS/process boundaries, network I/O, and external tools).
  • Practical compilation from natural-language constraints to executable predicates pred_k, scopes, labels, budgets, and escalation rules; required DSLs, compilers, and verification methods.
  • Methods to detect and resolve conflicts among constraints (priorities, defeasible rules, exceptions, emergency overrides), with auditable decision procedures and formal semantics.
  • Coverage and fidelity of information-flow control for LLM-specific channels (messages, memory, embeddings, vector stores, tool arguments); handling covert channels and steganography between agents.
  • Provenance and taint tracking through model-internal states (context windows, summaries, tool responses), including “lossy” operations (summarization, compression) that cause memory drift.
  • Capability and delegation integrity: how to bind and verify scoped authority across agents and tools (read-only vs de facto write, escalation paths, revocation), and prevent authority drift.
  • Freshness in distributed settings: delivering and validating signed state digests under concurrency, caching, partial connectivity, and latency; handling race conditions, replay attacks, and state rollbacks.
  • Auditing at scale: storage, indexing, and attestation of hash-linked logs; secure key management, rotation, and PKI for governance signatures; privacy-preserving audit (minimizing sensitive log content).
  • Formal guarantees beyond the idealized assumptions (collision-resistant hashes, unforgeable signatures, perfect mediation): threat models, attack surfaces, and residual risk analysis for CSG.
  • Robustness when tools are compromised or misreport effects; cross-verification strategies (sandboxing, differential testing, OS-level tracing) to prevent tool-driven constraint drift.
  • Quantifying operational overhead: latency, throughput, and developer productivity impacts of admission checks on “governance critical” transitions; guidance for proportional control.
  • Criteria and automation for classifying actions as governance critical (risk scoring, domain-specific policies); avoiding over-governance and under-governance.
  • Handling dynamic constraint updates (TTL expiry, versioning, policy changes mid-trajectory) without breaking ongoing tasks; consistency models and migration protocols.
  • Privacy and compliance trade-offs: reconciling auditability with data minimization; techniques (selective logging, redaction, secure enclaves, zero-knowledge proofs) to keep logs reconstructable yet private.
  • Metrics and measurement design for trajectory preservation margins m(τ) (freshness, scope, flow, auditability): definitions, estimators, error rates, and standardization across domains.
  • Learning under admissible boundaries: concrete algorithms for constraint-native RL (policy optimization, credit assignment with rejections, exploration without violating constraints, sample efficiency).
  • Multi-agent RL specifics: coordinating multiple policies under shared constraint state; handling collusion, deception, and strategic communication while preserving constraints.
  • Feasibility and generalization of admissible trajectory sets: ensuring sufficient diversity and volume for effective learning; methods to generate or repair admissible trajectories when they are sparse.
  • Benchmark construction beyond privacy leakage: datasets with explicit constraints covering tool use, code editing, shell commands, browser actions, delegation chains, rejected actions, and full audit traces.
  • Limitations of the AgentLeak replay: lack of capability/delegation evaluation, absence of tool/file/action effects, restriction to specific channels (C1, C2, C5), and unknown false positive/negative rates for redaction.
  • Utility impact measurement: rigorous, task-agnostic utility metrics and ablations quantifying trade-offs between constraint preservation and task success under CSG vs output-only controls.
  • Interoperability with existing agent frameworks/protocols (MCP, A2A, LangGraph, AutoGen): standard APIs for proposals, effects, capabilities, admission decisions, rejection reasons, and audit roots.
  • Human-in-the-loop governance: escalation UX, approval latency, decision explainability, and organizational processes for updating governance state (ΔB) without silent policy drift.
  • Key and identity management for governance authorities (G): distributed trust, cross-organizational coordination, delegation across administrative domains, and incident response for key compromise.
  • Detection and mitigation of covert leakage in natural language (e.g., steganography) that evades FlowOK checks; content-based detectors, channel restrictions, and adversarial testing protocols.
  • Budget semantics (b_t): defining, measuring, and enforcing “budgets” for sensitive flows (units, aggregation, temporal windows) in practical systems.
  • Uncertainty handling in effect extraction: criteria for when to sandbox, escalate, or repair; calibration methods and confidence measures tied to admission control.
  • Applicability beyond coding agents: evaluating CSG in other domains (healthcare, finance, legal workflows, robotics) with different action semantics and risk profiles.
  • Reproducibility and transparency of CSG Lite implementation: public code, redaction rules, detectors, parameterization, and evaluation methodology to validate reported replay results.
  • Risks of overfitting to admissibility (agents optimizing to pass checks rather than solve tasks); diagnostics to detect “constraint gaming” and methods to keep utility-driven innovation within boundaries.
  • Formalizing end-to-end properties: specifying and verifying system-level invariants (non-bypassability, preservation under composition, audit completeness) with model checking or runtime verification.
  • Guidance for selecting governance-critical predicates pred_k that remain enforceable and auditable without requiring full semantic understanding (balance between specificity and generality).

Practical Applications

Immediate Applications

Below are concrete, deployable use cases that operationalize the paper’s ideas in current systems. Each item notes target sectors, potential tools/products/workflows, and key assumptions/dependencies.

  • Governance gateways for agent messages and memory (“CSG Lite” data loss prevention)
    • Sectors: healthcare, finance, enterprise IT, customer support
    • What: Intercept and vet inter‑agent messages and shared memory writes with channel‑specific redaction and audit logging to prevent information‑flow drift.
    • Tools/products/workflows: Proxy middleware for LangGraph/AutoGen/OpenAI Agents/MCP; redaction policies; SIEM-integrated audit logs; secrets scanners
    • Assumptions/dependencies: Ability to route all internal channels through the gateway; sensitivity labels/regex/PII detectors; acceptable latency overhead
  • Pre‑flight admission control for tool calls and file/system operations
    • Sectors: software engineering, DevOps/SRE, MLOps, RPA
    • What: Enforce scoped policies before running shell commands, web requests, file edits/deletes, or code execution (e.g., deny “rm” outside sandbox; block exfiltration).
    • Tools/products/workflows: Tool wrappers with allowlists/deny‑lists; “Effect(a)” heuristics; sandboxed shells; policy-as-code checks
    • Assumptions/dependencies: Tool instrumentation; conservative effect estimation; least‑privilege execution environments
  • Capability‑scoped delegation tokens for multi‑agent workflows
    • Sectors: enterprise IT, SaaS platforms, data platforms
    • What: Issue revocable capability tokens that bind an agent to a narrow scope (paths, channels, budgets, TTL) to arrest authority drift.
    • Tools/products/workflows: Capability/permission microservice; short‑lived signed tokens; revocation APIs; delegation UI
    • Assumptions/dependencies: Stable agent identity; token verification in tool wrappers; ops process for revocation/rotation
  • Tamper‑evident, trajectory‑level audit trails
    • Sectors: finance, healthcare, legal, government, highly regulated industries
    • What: Append‑only, hash‑linked logs that record proposals, admission decisions, realized actions, reasons—enabling post‑hoc accountability and forensics.
    • Tools/products/workflows: Append‑only log (e.g., Merkle chains); SIEM connectors; “Trajectory Auditor” dashboards; export to GRC tools
    • Assumptions/dependencies: All critical actions flow through governed interfaces; secure time‑stamping; key management for signatures
  • Safe coding assistants for private repositories
    • Sectors: software development, product engineering
    • What: Gate repository reads/writes, prevent test deletion, block external disclosure, and constrain edits to authorized directories.
    • Tools/products/workflows: GitHub/GitLab App enforcing scoped paths; test‑integrity rule; DLP for tool arguments; PR checks with admission evidence
    • Assumptions/dependencies: Repository labeling (e.g., src/auth vs. prod configs); CI/CD hooks; developer acceptance of guardrails
  • Contact‑center and CRM agent privacy compliance
    • Sectors: customer support, sales, healthcare front‑office
    • What: Auto‑redact PHI/PII in internal messages and memory, preserve evidence of redaction and policy compliance.
    • Tools/products/workflows: Channel‑aware redaction proxies; policy catalogs per jurisdiction (HIPAA/GDPR); audit evidence export for QA
    • Assumptions/dependencies: Accurate entity recognition; role‑based policies; routing all channels through proxies
  • Policy‑as‑Code for agent constraints
    • Sectors: enterprise IT, platform engineering
    • What: Encode constraints as executable predicates (e.g., OPA/Rego) evaluated in tool wrappers and message gateways.
    • Tools/products/workflows: Rego policies for file paths, HTTP domains, disclosure rules; CI tests for constraints; policy versioning
    • Assumptions/dependencies: Mappable constraints (paths, labels, channels); ops process for policy updates
  • Information‑barrier enforcement between agent teams
    • Sectors: large enterprises, multi‑tenant SaaS
    • What: Prevent cross‑domain data flows (e.g., finance → marketing agents), with taint labels and channel policies.
    • Tools/products/workflows: Label/taint propagation on memory objects; per‑team channels; barrier tests in CI
    • Assumptions/dependencies: Data classification in place; routing enforcement; leadership buy‑in for barriers
  • “Constraint preservation tests” in agent CI
    • Sectors: software QA/DevSecOps
    • What: Tests that run long‑horizon agent tasks and assert constraints weren’t violated (no unauthorized edits, no hidden disclosure).
    • Tools/products/workflows: Synthetic tasks with seeded secrets; replay harnesses; drift metrics (freshness/scope/flow/audit)
    • Assumptions/dependencies: Deterministic or recorded trajectories; accessible traces and logs
  • Procurement and risk checklists for agentic tooling
    • Sectors: enterprise procurement, compliance
    • What: Mandate trajectory‑level auditability, scoped delegation, and pre‑action checks as minimum vendor requirements.
    • Tools/products/workflows: Security questionnaires; PoC tests using AgentLeak‑style scenarios
    • Assumptions/dependencies: Internal policy alignment; vendor support for instrumentation
  • High‑risk action sandboxes with escalation
    • Sectors: IT operations, data governance, legal review
    • What: Route deletes, external sends, production changes into a sandbox that requires additional evidence or human approval.
    • Tools/products/workflows: Escalation workflows; ephemeral sandboxes; evidence templates (tests, diffs, change tickets)
    • Assumptions/dependencies: Clear thresholds for “governance‑critical” actions; SLOs for human‑in‑the‑loop latency
  • Drift‑aware academic benchmarking and evaluation
    • Sectors: academia, applied research, vendor evaluation
    • What: Use trace‑level metrics (leakage on internal channels, stale‑state actions) instead of final‑answer‑only evaluation.
    • Tools/products/workflows: Benchmarks with full trajectories; replay harness; shared metric suites
    • Assumptions/dependencies: Access to action‑level logs; community buy‑in on metrics
  • Secrets and sensitive context labeling for agents
    • Sectors: enterprise IT, data platforms
    • What: Tag tokens, files, and records with sensitivity labels used in FlowOK checks to prevent inadvertent exfiltration.
    • Tools/products/workflows: Secret scanners; data catalog/lineage integration; label propagation in memory stores
    • Assumptions/dependencies: Reasonably complete data catalog; label maintenance process
  • Developer tooling and dashboards (“Trajectory Auditor”)
    • Sectors: developer platforms, SOC
    • What: Visualize proposals, capabilities, decisions, and reasons to debug and harden agent workflows against constraint drift.
    • Tools/products/workflows: Timeline UIs; diff views of state digests; alerting on repeated violations
    • Assumptions/dependencies: Standardized event schema; storage for logs

Long‑Term Applications

The following applications require further research, scaling, or ecosystem development, aligning with the paper’s full Constraint State Governance and constraint‑native RL vision.

  • Full CSG with signed constraint tokens and state digests
    • Sectors: cross‑industry, especially regulated domains
    • What: Represent constraints as signed tokens; maintain signed state digests; require freshness checks for all governance‑critical actions.
    • Tools/products/workflows: Key management/HSMs; lightweight cryptographic libraries; governed state registries
    • Assumptions/dependencies: Universal mediation of critical actions; reliable signature infrastructure; developer ergonomics
  • Generalized effect‑extraction engines for admission control
    • Sectors: software engineering, DevOps, RPA, robotics
    • What: Static/dynamic analysis and LLM‑aided semantic diffing to conservatively infer action effects (reads/writes/deletes/disclosures).
    • Tools/products/workflows: Program analysis pipelines; model‑assisted semantic parsers; conservative defaults and fallbacks
    • Assumptions/dependencies: High precision/recall on effect inference; integration with heterogeneous tools
  • Constraint‑native RL stacks
    • Sectors: AI platforms, agent vendors, academia
    • What: Train agents to maximize utility only within admissible, governed trajectories; use audit evidence and rejected actions as learning signals.
    • Tools/products/workflows: Offline/online RL harnesses tied to governance logs; preference optimization over admissible sets
    • Assumptions/dependencies: Sufficient volume of governed rollouts; stable admissibility metrics; safe exploration methods
  • Interoperable standards for agent identity, capability, and delegation
    • Sectors: cloud/SaaS ecosystems, enterprise platforms
    • What: Protocols for scoped delegation, revocation, and audit portability across vendors (AIP/ACP‑like standards).
    • Tools/products/workflows: Capability issuance and verification APIs; cross‑platform attestations
    • Assumptions/dependencies: Industry consortiums; backward compatibility with existing agent SDKs
  • Taint‑aware, TTL‑enforced agent memory architectures
    • Sectors: enterprise AI platforms, knowledge management
    • What: Memory stores that propagate sensitivity labels and enforce expiry, minimizing long‑term drift via stale state.
    • Tools/products/workflows: Label propagation kernels; retention policies; memory compaction/rewrite strategies
    • Assumptions/dependencies: Scalable metadata management; low latency for label checks
  • Distributed audit root ledgers for cross‑organization operations
    • Sectors: supply chains, BPO, healthcare networks
    • What: Tamper‑evident, shared ledgers (e.g., transparency logs) for trajectories that span organizations.
    • Tools/products/workflows: Public append‑only logs; cryptographic proof APIs; selective disclosure mechanisms
    • Assumptions/dependencies: Governance and privacy agreements; cost‑effective ledger operations
  • OS‑level information‑flow control for agent runtimes
    • Sectors: operating systems, cloud runtimes, secure enclaves
    • What: Kernel‑ or hypervisor‑level enforcement of labels and capabilities for agent processes and tools.
    • Tools/products/workflows: IFC‑enabled containers/VMs; policy injection at runtime; eBPF hooks
    • Assumptions/dependencies: Platform vendor support; acceptable performance overhead
  • Robotics and industrial control safety gates
    • Sectors: robotics, manufacturing, energy, utilities
    • What: Enforce admission checks on physical actions (e.g., torque limits, workspace exclusions) tied to signed constraint state.
    • Tools/products/workflows: Action gating controllers; digital twins for effect estimation; escalation for out‑of‑policy moves
    • Assumptions/dependencies: Reliable state estimation; safe halting mechanisms; certification pathways
  • Multi‑tenant SaaS isolation for agent features
    • Sectors: SaaS, collaboration platforms
    • What: Enforce tenant‑scoped capabilities and information barriers for embedded agents (document assistants, copilots).
    • Tools/products/workflows: Tenant‑scoped keys/tokens; per‑tenant audit roots; tenancy‑aware redaction
    • Assumptions/dependencies: Strong tenant isolation primitives; fine‑grained access metadata
  • Formal test‑integrity and change‑management policies for code agents
    • Sectors: software, safety‑critical systems (medical devices, avionics)
    • What: Prevent deletion or weakening of tests and require admissible evidence for code changes before merge.
    • Tools/products/workflows: Test policy verifiers; formalized gating of destructive changes; provenance capture
    • Assumptions/dependencies: Tooling that understands test semantics; organizational change controls
  • Verified enforcement and runtime monitoring
    • Sectors: high‑assurance systems, government, defense
    • What: Use formal methods to verify enforcement code paths and monitoring correctness for governance‑critical checks.
    • Tools/products/workflows: Model checking of policy engines; certified reference monitors; proof‑carrying logs
    • Assumptions/dependencies: Mature formal verification pipelines; limited TCB size
  • Hardware‑backed governance (TEEs/TPMs) for sensitive actions
    • Sectors: finance, healthcare, cloud platforms
    • What: Run admission checks and key operations inside secure enclaves; attest to policy versions and state digests.
    • Tools/products/workflows: TEE‑resident policy engines; remote attestation APIs; sealed audit roots
    • Assumptions/dependencies: TEE availability; side‑channel mitigations; attestation interoperability
  • Regulatory frameworks mandating trajectory‑level evidence
    • Sectors: finance (SOX), healthcare (HIPAA), privacy (GDPR), critical infrastructure
    • What: Policies and standards requiring pre‑action checks and reconstructable evidence—not just output filtering—for AI agents.
    • Tools/products/workflows: Auditing guidelines; certification programs; regulator‑approved evidence schemas
    • Assumptions/dependencies: Regulator engagement; industry compliance costs and benefits
  • Comprehensive drift‑focused benchmarks and curricula
    • Sectors: academia, workforce training, vendor evaluation
    • What: Benchmarks capturing memory/authority/flow/accountability/utility‑induced drift; educational modules for practitioners.
    • Tools/products/workflows: Open datasets with full trajectories; evaluation harnesses; courseware
    • Assumptions/dependencies: Data sharing agreements; standard logging formats

Notes on Cross‑Cutting Assumptions and Dependencies

  • Full mediation: All governance‑critical actions (tool calls, file ops, external sends, memory writes) must flow through instrumented interfaces; backchannels undermine feasibility.
  • Effect estimation: Conservative, explainable “Effect(a)” is needed to minimize false negatives; fallback strategies (sandboxing/escalation) are essential.
  • Data classification: Practical information‑flow control depends on reasonably complete and maintained labels for secrets and sensitive data.
  • Cryptographic hygiene: Key management, signature verification, time stamping, and append‑only logging must be robust and operationally manageable.
  • Performance and UX: Admission checks and logging introduce latency and overhead; proportional control and selective governance are needed for adoption.
  • Cultural/governance buy‑in: Organizations must accept stricter auditability and rule maintenance; developers need clear tooling and exemptions/escalations for emergencies.

These applications translate the paper’s core claim—maintain safety as execution state across trajectories—into concrete products, workflows, and standards that can be built now and extended over time.

Glossary

  • Accountability drift: Loss of the ability to reconstruct why an action was allowed, due to missing or degraded audit state. "Accountability drift"
  • Admission checks: Pre-action evaluations that determine whether a proposed action can proceed under active constraints. "Strong admission checks are most important for external disclosure, delegation, file deletion, production configuration changes, shell commands, memory writes, sensitive tool calls, and irreversible actions."
  • Admission control: A mechanism that decides, before execution, whether an action is permitted given current policies and capabilities. "admission control"
  • Admission decisions: The governance layer’s binary allow/deny outcomes for proposed actions, recorded as part of the trajectory. "admission decisions"
  • Admissible set: The subset of trajectories that satisfy required safety constraints and are eligible for utility optimization. "The admissible set is"
  • Agentic execution: Execution behavior where agents act, delegate, and use tools, as opposed to merely generating text. "constraint drift names a preservation failure inside agentic execution."
  • Authenticated delegation: A cryptographic or protocol-based method to securely transfer scoped authority from one agent to another. "authenticated delegation"
  • Audit root: A cryptographic accumulator (e.g., hash root) that commits to the sequence of audit events for tamper-evident verification. "audit root"
  • Audit trail: A recorded sequence of evidence and decisions enabling post-hoc reconstruction of why actions were allowed. "audit trail"
  • Capability security: A security model where unforgeable tokens grant specific, scoped authorities to perform actions or access resources. "capability security"
  • CMDP (Constrained Markov Decision Process): An MDP augmented with constraints on costs or risks, used to formalize safe decision-making. "CMDPs"
  • Collision resistant: A property of cryptographic hash functions that makes it infeasible to find two distinct inputs with the same hash. "collision resistant"
  • Constraint drift: The weakening, loss, or distortion of safety constraints as they pass through memory, delegation, communication, tool use, audit, or optimization. "constraint drift"
  • Constraint Native Reinforcement Learning: An approach where learning optimizes utility only over trajectories that maintain the active constraint state. "Constraint Native Reinforcement Learning then improves utility"
  • Constraint State Governance: A paradigm that represents safety constraints as explicit, signed execution state that is inherited, checked pre-action, and auditable. "Constraint State Governance"
  • CPO (Constrained Policy Optimization): A reinforcement learning algorithm for optimizing policies under explicit constraints. "CPO"
  • Counterfactual replay: Offline evaluation by re-checking observed transitions against policies as if governance had been applied during execution. "This is counterfactual replay, not online deployment."
  • Data minimization: A privacy principle to collect and retain only the minimum necessary data, which can conflict with auditability. "data minimization"
  • Effect extraction: The analysis that approximates an action’s semantic effects (reads, writes, disclosures) for policy checking. "conservative effect extraction"
  • Emergent agency: Complex or unintended collective behaviors arising from interactions among agents. "emergent agency"
  • Governance critical action: Any operation that must pass through governance admission checks due to potential safety impact. "governance critical action"
  • Hash linked audit chain: An audit log where each entry is cryptographically linked (hashed) to the previous, preventing undetected tampering. "hash linked audit chains"
  • Information asymmetry: Situations where different agents have unequal knowledge, affecting coordination and safety. "information asymmetry"
  • Information flow control: Techniques that track and restrict how sensitive information moves across channels and components. "information flow control"
  • Information-flow drift: Constraint loss where sensitive data crosses channels (messages, tools, memory) not intended by policy. "information-flow drift"
  • Least privilege: The security principle of granting only the minimum capabilities necessary for a task. "least privilege"
  • Memory drift: Constraint loss due to omission, summarization errors, or forgetting in agent memory/state. "Memory drift"
  • Offline policy search: Learning or selection of policies from logged data without interacting with the environment. "Offline policy search"
  • Process Supervision: A training/evaluation approach that supervises intermediate steps or processes rather than only final outcomes. "Process Supervision"
  • Prompt injection: Adversarial manipulation of model prompts that subverts intended instructions or policies. "prompt injection"
  • RCPO (Reward Constrained Policy Optimization): An RL method that handles constraints by incorporating them into the optimization of expected rewards. "RCPO"
  • SafeMARL: Safety-focused methods for multi-agent reinforcement learning. "SafeMARL"
  • Sandboxing: Executing code or actions in an isolated environment to contain potential harm or leaks. "sandboxing"
  • Scoped delegation: Passing a limited, explicitly bounded authority to another agent or component. "scoped delegation"
  • Scoped tool wrappers: Tool interfaces that enforce constraints and scopes at call boundaries. "scoped tool wrappers"
  • Signed state: Constraint and authority information packaged with cryptographic signatures to ensure authenticity and integrity. "signed state"
  • State digest: A cryptographic hash summarizing the current governance state for freshness and audit checks. "current state digest"
  • Taint tracking: Labeling and propagating sensitivity tags through data flows to control disclosures. "taint tracking"
  • Tamper evident logging: Logging designed so that any modification can be detected via cryptographic linkage. "tamper evident logging"
  • Tool schemas: Structured specifications describing tool interfaces, parameters, and safety-relevant behaviors. "tool schemas"
  • Utility induced drift: Constraint weakening driven by optimization pressure when constraints are treated as soft costs. "Utility induced drift"

Open Problems

We haven't generated a list of open problems mentioned in this paper yet.

Collections

Sign up for free to add this paper to one or more collections.

Tweets

Sign up for free to view the 2 tweets with 665 likes about this paper.