PlanGuard: Defending Agents against Indirect Prompt Injection via Planning-based Consistency Verification

Published 11 Apr 2026 in cs.CR | (2604.10134v1)

Abstract: LLM agents are increasingly integrated into critical systems, leveraging external tools to interact with the real world. However, this capability exposes them to Indirect Prompt Injection (IPI), where attackers embed malicious instructions into retrieved content to manipulate the agent into executing unauthorized or unintended actions. Existing defenses predominantly focus on the pre-processing stage, neglecting the monitoring of the model's actual behavior. In this paper, we propose PlanGuard, a training-free defense framework based on the principle of Context Isolation. Unlike prior methods, PlanGuard introduces an isolated Planner that generates a reference set of valid actions derived solely from user instructions. In addition, we design a Hierarchical Verification Mechanism that first enforces strict hard constraints to block unauthorized tool invocations, and subsequently employs an Intent Verifier to validate whether parameter deviations are benign formatting variances or malicious hijacking. Experiments on the InjecAgent benchmark demonstrate that PlanGuard effectively neutralizes these attacks, reducing the Attack Success Rate (ASR) from 72.8% to 0%, while maintaining an acceptable False Positive Rate of 1.49%. Furthermore, our method is model-agnostic and highly compatible.

Abstract PDF Upgrade to Chat

Authors (2)

Summary

The paper introduces a dual-path architecture that isolates trusted user instructions from untrusted external data to counter indirect prompt injection.
It employs a hierarchical verification process combining deterministic rule-based filtering with LLM-based intent verification to achieve a 0% attack success rate.
Experimental results demonstrate significant reductions in both attack success and false positives, highlighting practical enhancements in AI safety for tool-integrated LLM agents.

PlanGuard: A Formal Defense Architecture for Actionable Indirect Prompt Injection in Tool-Integrated LLM Agents

Motivation and Problem Setting

Modern tool-integrated LLM agents, empowered by advanced prompt engineering and API-access capabilities, present critical new attack surfaces via Indirect Prompt Injection (IPI). IPI manipulates agent actions by embedding adversarial instructions within external, untrusted content sources, compromising agent integrity during tool usage. These attacks exploit the architectural flaw of context mixing, where LLMs do not reliably differentiate between trusted user instructions and ancillary, potentially malicious data. The severe consequences of actionable IPI—unauthorized tool invocation and argument hijacking—demand security-by-design mitigations beyond probabilistic defenses or training-dependent alignment techniques.

PlanGuard Architecture and Context Isolation

PlanGuard introduces a principled, training-free approach grounded in architectural decoupling via Context Isolation. The framework comprises two decoupled paths: (1) an Isolated Planner that constructs a trusted reference plan based solely on the authenticated user instruction, and (2) an Agent path exposed to the full potentially-poisoned context, responsible for executing agent operations.

Figure 1: Overview of the PlanGuard architecture. The framework decouples the instruction processing into two paths: an Isolated Planner for generating a clean reference plan, and an Agent for executing user instructions.

The Isolated Planner is architecturally prohibited from observing any external retrieved data, ensuring that its reference action set is uncontaminated and maps directly to user intent. At execution, every agent tool call is intercepted and subject to a hierarchical verification process: a deterministic rule-based filter that strictly enforces authorized tool types and canonical arguments (mitigating unauthorized tool invocation), followed by an intent-verification module using an LLM-in-the-loop to semantically assess parameter deviations (mitigating argument hijacking in the presence of benign generation stochasticity).

Hierarchical Verification Mechanism

The defense pipeline operates as follows:

Reference Generation: The Planner ingests $(I, \mathcal{T})$ and emits a reference set of permissible tool invocations, $S_{ref}$ , reflecting user intent.
Action Capture and Verification: Agent-generated actions are matched against $S_{ref}$ $S_{re f}$ :
- Stage I (Hard Rules): String equality on tool and parameter. Mismatch on tool disqualifies and blocks the action (Type I), strict match passes. Parameter mismatch triggers review.
- Stage II (Intent Verifier): For parameter mismatches, an LLM-based verifier determines whether deviations are benign (e.g., formatting variance) or indicate malicious semantic drift (Type II).

Crucially, this split leverages both determinism for strong security boundaries and semantic flexibility for practical usability, especially in stochastic-generation regimes typical of current LLMs.

Experimental Results on Attack Mitigation

Experiments were conducted on the InjecAgent benchmark, covering 1,054 actionable IPI test cases with both Direct Harm (DH) and Data Stealing (DS) attacks. The DeepSeek-V3.2 model served as the backbone for both agent and defense components, and baseline “compliance-inducing” prompts were used to suppress intrinsic LLM safety training for a rigorous security test.

PlanGuard was compared to baseline agents and an ablation using only the hard-constraint Planner stage. The results are visualized in Figure 2.

Figure 2: Performance comparison between Vanilla Agent and PlanGuard on DH and DS subsets.

The baseline (Vanilla Agent) had an Attack Success Rate (ASR) of 56.90% (DH) and 88.67% (DS) under context mixing, demonstrating the severity of actionable IPI.
PlanGuard achieved 0.0% ASR in both attack classes—a strong, deterministic security guarantee attributable to full context isolation.
Ablation to Stage-I only also resulted in 0% ASR but incurred False Positive Rates (FPR) as high as 27%–38%, underscoring the necessity of the Stage II verifier for utility.
Full PlanGuard reduced FPR to 0.97% (DH) and 3.28% (DS)—restoring agent usability while maintaining strong compositional security.

This establishes a structural, rather than probabilistic, defense boundary. Unlike prior classifier or alignment-based methods, PlanGuard’s security derives from the impossibility of adversarial influence under input isolation, guaranteed by architectural design rather than detection heuristics.

Robustness and Limitations

PlanGuard’s security holds even under white-box attackers aware of the defense. Hard architectural isolation prohibits Planner corruption, and schema validation in typical tool integrations further limits parameter-injection attacks. The tool intent verifier, placed after strict rule enforcement, blocks semantically deviant argument attacks that may pass schema constraints.

However, limitations remain for context-dependent argument verification, especially when user intent references ambiguous or implicit, context-derived values (e.g., “Pay the bill in the email”). Because the Planner is context-isolated, PlanGuard can verify correct action type but cannot attest to the correctness of arguments in such cases. Addressing this requires contextual information extraction and rule-based augmentation, a direction for further research.

Implications and Future Work

PlanGuard demonstrates that architectural context isolation, coupled with composable, deterministic-verification logic, can eliminate actionable IPI attacks in LLM tool-use agents without reliance on brittle classifier training or model alignment. This marks a shift toward robust engineering for AI safety in open-world, tool-executing environments. The operational cost (two extra LLM inferences per action) represents a practical security–performance tradeoff, and further optimization—such as custom small-model verifiers—could reduce deployment overhead.

The formal security boundaries established by PlanGuard advance the ongoing theoretical understanding of agent-tool safety. In broader AI applications where LLMs orchestrate real-world actions via tool plugins, principled architectures like PlanGuard will become foundational. Future development may integrate context-extraction modules that balance information availability for deep intent verification while preserving the essential isolation property.

Conclusion

PlanGuard introduces an architectural paradigm for defending LLM agents against Indirect Prompt Injection—realizing a context isolation barrier, hierarchical verification, and structural security guarantees. Empirical validation shows that PlanGuard reduces the actionable IPI attack success rate from over 70% to 0% in rigorous settings, while maintaining low false-positive rates. This work highlights the efficacy and necessity of deterministic, architecture-level defenses in the safety-critical deployment of tool-integrated language agents (2604.10134).