Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents

Published 29 Apr 2026 in cs.CR and cs.AI | (2604.26274v1)

Abstract: Structured-workflow agents driven by LLMs execute tool calls against sensitive external environments. We propose Praetor, a telemetry-driven behavioral anomaly detection firewall. Drawing on sequence-based intrusion detection, Praetor compiles verified benign tool-call telemetry into a parameterized deterministic finite automaton (pDFA). The model defines permitted tool sequences, sequential contexts, and parameter bounds. At runtime, a lightweight gateway enforces these boundaries via an $O(1)$ state-transition structural lookup, shifting computationally expensive analysis entirely offline. Evaluated on the Agent Security Bench (ASB), Praetor achieves a 5.6% macro-averaged attack success rate (ASR) across five scenarios. Within three structured workflows, ASR drops to 2.2%, outperforming Aegis, a state-of-the-art stateless scanner, at 12.8%. Praetor achieves 0% ASR on multi-step and context-sequential attacks in structured settings. Furthermore, against 1,000 algorithmically spliced exfiltration payloads, only 1.4% matched valid structural paths, all of which failed end-to-end string parameter guards (0 successes out of 14 surviving paths, 95% CI [0%, 23.2%]). Praetor introduces just 2.2 ms of per-call latency (a 3.7$\times$ speedup over Aegis) while maintaining a 2.0% benign task failure rate (BTFR) on benign workloads. Modeling the behavioral trajectory effectively collapses the available attack surface, but unmaintained continuous parameter bounds remain vulnerable to synonym-substitution attacks (18% evasion rate). Thus, exact-match whitelisting of sensitive parameters ultimately bears the final defensive load against execution.

Authors (1)

Summary

  • The paper introduces Praetor, which generates parameterized DFAs from benign execution traces to enforce valid tool-call sequences.
  • It employs telemetry profiling and structural plus parameter checks to block context-sequential exploits, reducing the attack success rate in structured workflows from 12.8% (Aegis) to 2.2%, with 0% success for multi-step and context-sequential attacks.
  • Praetor achieves constant-time enforcement with minimal latency (≈2.2 ms per call) and integrates as a sidecar layer without altering agent code.

Enforcing Behavioral Trajectories in Tool-Enabled AI Agents via Telemetry-Driven Stateful Firewalls

Problem Setting and Limitations of Stateless Enforcement

Tool-augmented LLM-based agents, increasingly deployed in sensitive workflow automation (e.g., clinical, enterprise, or customer-facing domains), directly manipulate external environments via structured tool calls. Standard defensive mechanisms such as Aegis (Yuan et al., 13 Mar 2026) implement stateless, per-call scanning pipelines at the point of tool-call emission. These pipelines use signature matching, heuristics, and content validation to filter known attack payloads. However, stateless inspection cannot capture context-sequential exploits: attacks that chain syntactically benign, semantically in-distribution tool invocations to achieve unauthorized results. Empirically, context-sequential payloads achieve evasion rates of up to 75% against these stateless firewalls, because each individual call remains nominally valid and only the overall tool-use trajectory is malicious.

Approach: Praetor—A Stateful Behavioral Firewall via pDFA Compilation

The proposed system, Praetor, directly addresses this architectural gap via sequence-aware behavioral enforcement. Rather than analyzing each tool call in isolation, Praetor profiles verified benign execution traces to compile a parameterized deterministic finite automaton (pDFA). This model encodes (1) permitted action sequences (structural envelope captured by $w$-gram context abstraction), and (2) allowed parameter value bounds for each call (numeric intervals, semantic string embeddings, or categorical exact-matches).

Praetor operates in two asynchronous phases:

  • Profiling: The telemetry profiler ingests a corpus $\mathcal{C}$ of benign traces, extracting unique sequential contexts and synthesizing parameter guards per transition. States are pairs of current tool and preceding context (sliding window of up to $w=3$), while transitions aggregate observed parameter schemas. String guards use centroid/radius bounds in embedding space (all-MiniLM-L6-v2), enabling resilience to minor paraphrases but not full adversarial synonym attacks.
  • Runtime Gateway: At inference time, each tool call (tool name and parameter vector) is evaluated by (i) a hash-map indexed structural state transition, and (ii) constant-time parameter schema checks. If the combined guard fails, the call is halted, the event is cryptographically logged, and the agent remains strictly at its previous state.
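The two phases can be illustrated with a minimal sketch. It is not the paper's implementation: the names `profile` and `Gateway`, the trace format, and the example tools are all assumptions made for illustration. The sketch covers only numeric interval guards and categorical exact-match guards (string embedding guards are omitted), and a plain list stands in for the cryptographic log.

```python
W = 3  # context window size, matching w = 3 in the text


def profile(traces, w=W):
    """Offline phase: compile benign traces into a transition table.

    Each trace is a list of (tool_name, params) pairs. The table is keyed by
    (preceding context, tool). Numeric parameters get a (min, max) interval
    guard; all other (hashable) values get an exact-match set.
    Assumes a parameter keeps the same type across traces.
    """
    table = {}
    for trace in traces:
        context = ()
        for tool, params in trace:
            guards = table.setdefault((context, tool), {})
            for name, value in params.items():
                if isinstance(value, (int, float)):
                    lo, hi = guards.get(name, (value, value))
                    guards[name] = (min(lo, value), max(hi, value))
                else:
                    guards.setdefault(name, set()).add(value)
            context = (context + (tool,))[-w:]
    return table


class Gateway:
    """Runtime phase: one dictionary lookup plus per-parameter checks."""

    def __init__(self, table, w=W):
        self.table = table
        self.w = w
        self.context = ()
        self.blocked = []  # hypothetical stand-in for the audit log

    def check(self, tool, params):
        guards = self.table.get((self.context, tool))
        if guards is None or not self._params_ok(guards, params):
            self.blocked.append((self.context, tool, params))
            return False  # blocked: the session state does not advance
        self.context = (self.context + (tool,))[-self.w:]
        return True

    @staticmethod
    def _params_ok(guards, params):
        for name, value in params.items():
            guard = guards.get(name)
            if guard is None:
                return False  # parameter never seen on this transition
            if isinstance(guard, tuple):
                lo, hi = guard
                if not isinstance(value, (int, float)) or not lo <= value <= hi:
                    return False
            elif value not in guard:
                return False
        return True


benign = [
    [("read_ticket", {"ticket_id": 101}),
     ("lookup_customer", {"customer_id": 7}),
     ("send_email", {"template": "status_update"})],
    [("read_ticket", {"ticket_id": 250}),
     ("lookup_customer", {"customer_id": 9}),
     ("send_email", {"template": "status_update"})],
]

gw = Gateway(profile(benign))
print(gw.check("read_ticket", {"ticket_id": 180}))            # True
print(gw.check("send_email", {"template": "status_update"}))  # False
print(gw.check("lookup_customer", {"customer_id": 8}))        # True
```

The second call is rejected even though its parameters are benign, because `send_email` was never observed directly after `read_ticket`. The gateway stays in its previous state, so the third call still succeeds.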

Distinguishing architectural features are: session statefulness, per-deployment profile compilation, and total decoupling of computationally expensive checks (all analysis is performed offline). Integration entails a sidecar-layer deployment, requiring no modifications to agent code or prompt logic.

Security Analysis and Adversarial Model

Praetor's primary contribution is the explicit enforcement of execution path integrity. Rather than considering only the legitimacy of individual actions, the system restricts agent behavior to the explicit language $\mathcal{L}_{\mathrm{permit}}$ over tool-call traces, as learned from observed benign telemetry. Adversarial tool call sequences that attempt to 'jump' over forbidden contexts (e.g., invoking send_email immediately after read_ticket) are blocked structurally, regardless of parameter values. The system collapses the per-step attack surface, reducing the mean out-degree to 1.8 from a vocabulary of 15 (88% local reduction). An adversary is therefore forced to produce action sequences that are indistinguishable from benign histories under both structural and parametric constraints.
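The 88% figure follows directly from the two numbers quoted above. The short sketch below reproduces the arithmetic and shows how a mean out-degree could be computed from a successor map. The toy map and the helper name `mean_out_degree` are illustrative assumptions, not data from the paper.

```python
def mean_out_degree(successors):
    """Average number of permitted next tools per state."""
    return sum(len(nxt) for nxt in successors.values()) / len(successors)


# Figures quoted in the text.
vocabulary_size = 15
reported_out_degree = 1.8
reduction = 1 - reported_out_degree / vocabulary_size
print(f"{reduction:.0%}")  # 88%

# Hypothetical toy automaton, for illustration only.
toy = {
    "start": {"read_ticket"},
    "read_ticket": {"lookup_customer", "close_ticket"},
    "lookup_customer": {"send_email", "close_ticket"},
    "send_email": {"close_ticket"},
}
print(mean_out_degree(toy))  # 1.5
```

In other words, at each step an injected call has on average 1.8 structurally valid choices rather than 15.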

Two adversarial models are analyzed:

  • Black-Box: Lacking access to the pDFA or corpus, the adversary can enumerate tool vocabularies and inject arbitrary calls. Empirical tests show that only 1.4% of 1,000 spliced multi-step exfiltration payloads (14 paths) are structurally valid, and all of these are then blocked by parameter guards.
  • Gray-Box: With access to partial historical telemetry, the adversary can attempt to reconstruct pDFA structure/parameter bounds. However, centroid drift induced by unseen traces renders boundary approximation difficult, and semantic parameter guards reject adversarial payloads with high probability.
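The black-box result leaves 14 structurally valid paths with 0 successes, for which the abstract reports a 95% CI of [0%, 23.2%]. The source does not name the interval method. The sketch below shows that the upper bound matches a two-sided exact (Clopper-Pearson) interval, which for zero successes has the closed form $1 - (\alpha/2)^{1/n}$. The function name is an assumption.

```python
def upper_bound_zero_successes(n, confidence=0.95):
    """Two-sided Clopper-Pearson upper bound when 0 of n trials succeed.

    With zero successes the bound p solves (1 - p)**n = alpha / 2.
    """
    alpha = 1 - confidence
    return 1 - (alpha / 2) ** (1 / n)


structural_pass_rate = 14 / 1000
upper = upper_bound_zero_successes(14)

print(f"{structural_pass_rate:.1%}")  # 1.4%
print(f"{upper:.1%}")                 # 23.2%
```

The wide interval reflects the small sample: with only 14 surviving paths, zero observed successes cannot rule out a true per-path success rate of up to roughly 23%.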

Praetor's residual threat model is strictly within-profile exploits—attacks that perfectly mimic allowed structure and utilize observed parameter values. These are mitigated via exact-match whitelisting of parameters in sensitive fields (e.g., *_path), as simple embedding-anchored radius checks are susceptible to synonym-substitution attacks (18% measured evasion on TextFooler). Under correct whitelisting, evasion falls to 0.4%.
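The difference between a radius check and an exact-match whitelist can be sketched as follows. The 2-D vectors are hand-made stand-ins chosen for illustration, not real all-MiniLM-L6-v2 embeddings. The paths, the function names, and the choice of radius (maximum benign distance from the centroid) are also assumptions.

```python
import math

# Hypothetical toy "embeddings"; a real deployment would call an embedding model.
TOY_EMBEDDINGS = {
    "/reports/q1.pdf": (0.90, 0.10),
    "/reports/q2.pdf": (0.88, 0.14),
    "/reports/q3.pdf": (0.92, 0.12),
    "/exports/q1.pdf": (0.91, 0.13),  # adversarial near-synonym
    "/etc/passwd": (0.10, 0.95),      # clearly out of distribution
}
BENIGN = ["/reports/q1.pdf", "/reports/q2.pdf", "/reports/q3.pdf"]


def build_radius_guard(values):
    """Centroid of the benign vectors plus the largest benign distance."""
    vectors = [TOY_EMBEDDINGS[v] for v in values]
    centroid = tuple(sum(axis) / len(vectors) for axis in zip(*vectors))
    radius = max(math.dist(centroid, v) for v in vectors)
    return centroid, radius


def passes_radius(guard, value):
    centroid, radius = guard
    return math.dist(centroid, TOY_EMBEDDINGS[value]) <= radius


def passes_whitelist(allowed, value):
    return value in allowed


guard = build_radius_guard(BENIGN)
allowed = set(BENIGN)

for candidate in ["/exports/q1.pdf", "/etc/passwd"]:
    print(candidate,
          passes_radius(guard, candidate),
          passes_whitelist(allowed, candidate))
```

In this toy setup the near-synonym path falls inside the benign radius and passes the embedding guard, while the whitelist rejects it. This is the failure mode that motivates exact matching for sensitive fields.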

Evaluation: Efficacy, Overhead, and Coverage Limits

Detection:

Empirical evaluation on Agent Security Bench (ASB) demonstrates a macro-averaged ASR (attack success rate) of 2.2% for Praetor across three structured scenarios, as opposed to 12.8% for Aegis. For context-sequential and multi-step exfiltration payloads, Praetor achieves zero successful attacks (0% ASR), while Aegis allows 35% of attacks in this class. Prompt-level defenses (e.g., PromptArmor (Shi et al., 21 Jul 2025)) contribute marginal ASR reduction in tool-poisoning and memory-manipulation classes.

Latency and Scalability:

Praetor's runtime overhead is sharply lower than inference-based or rule-matching systems. Per-call median latency is 2.2 to 2.6 ms (constant across vocabulary sizes up to 500 tools), yielding a 3.7–28.3$\times$ latency improvement versus Aegis. Throughput remains flat at 12,500 TPS across DFA sizes and agent vocabularies (up to 10,000 states).

False Positives, Drift, and Usability:

Benign task failure rate (BTFR) is 2% on held-out ToolBench traces at the default pruning threshold ($\theta=3$). Incremental updates are supported: when the agent evolves, blocked but human-approved transitions can be integrated through an offline re-profiling protocol, incurring moderate human-in-the-loop review work proportional to agent drift. This mitigates concept drift, a primary maintenance concern in whitelist-based enforcement.
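One way to picture the pruning threshold and the update path is the sketch below. The text does not define $\theta$ precisely, so the sketch assumes that a transition is kept only if it was observed at least $\theta$ times. It also simplifies the context to a single preceding tool. The function names and traces are hypothetical.

```python
from collections import Counter


def compile_transitions(traces, theta=3):
    """Keep transitions seen at least `theta` times (assumed meaning of theta)."""
    counts = Counter()
    for trace in traces:
        prev = "<start>"
        for tool in trace:
            counts[(prev, tool)] += 1
            prev = tool
    return {transition for transition, n in counts.items() if n >= theta}


def approve(permitted, reviewed):
    """Offline update: merge transitions a human reviewer has approved."""
    return permitted | set(reviewed)


traces = [["read_ticket", "lookup_customer", "send_email"]] * 3
traces.append(["read_ticket", "escalate"])  # rare path, seen only once

permitted = compile_transitions(traces)
print(("read_ticket", "escalate") in permitted)  # False

permitted = approve(permitted, [("read_ticket", "escalate")])
print(("read_ticket", "escalate") in permitted)  # True
```

Under this reading, a higher threshold prunes more rare transitions, which tightens the envelope but can raise the benign failure rate until reviewers approve the legitimate paths.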

Coverage Limits:

Praetor's structure is most effective for low-entropy, single-role agents with well-defined, stable workflows. Measured ASR increases (to 8-13%) in agents with high tool diversity and open task spaces, due to context window abstraction merging behavioral modes and exponential state growth. Operators must scale profiling corpora and tune structure/parameter thresholds accordingly.

Implications and Theoretical Advance

Praetor sets an operational and theoretical foundation for stateful, trace-based anomaly detection in LLM-driven agents. Security is no longer a per-action property but an explicit function of behavioral trajectory. By shrinking the available operational surface via an explicit, programmatically enforced trajectory envelope, new classes of exploits (including context-sequential and protocol-level attacks) are blocked. The practical $O(1)$ enforcement also addresses production latency constraints for high-frequency agent applications.

The approach draws on foundational HIDS principles (e.g., $n$-gram syscall modeling [FORREST96]) recontextualized in symbolic finite automata over structured tool actions, validated by recent automata-theoretic agentic models (Koohestani et al., 27 Oct 2025). Critically, the security guarantees are operational only as long as the training corpus remains secure and unpoisoned; future work involving differential privacy and robust online profile evolution is necessary.

Integration with information-flow control (IFC) or secondary anomaly detection (e.g., statistical baselining) would yield a combined defense capable of mitigating both control- and data-oriented attacks.

Conclusion

Praetor realizes a stateful, per-deployment behavioral firewall that bridges the critical sequential-blindness gap inherent in stateless firewall designs. By modeling and enforcing benign agent trajectories with parameterized DFAs, the system collapses the multi-step attack surface, reduces ASR in structured workflows from 12.8% (Aegis) to 2.2% (roughly a sixfold reduction), and adds only about 2.2 ms of per-call latency. String parameter guards are a residual vulnerability, and resilient embedding-based or ensemble validation techniques remain necessary for maximal protection. As agentic AI continues to expand into sensitive environments, trajectory-based security modeling will become a core architectural requirement.


Cite as: "Enforcing Benign Trajectories: A Behavioral Firewall for Structured-Workflow AI Agents" (2604.26274).