Indirect Prompt Injection Attacks
- Indirect Prompt Injection (IPI) attacks are sophisticated threats where adversarial instructions embedded in external content hijack LLM behavior.
- They exploit untrusted data sources by blending malicious instructions with user prompts, achieving high success rates in various applications.
- Defenses use methods like behavioral state analysis, context isolation, and plan enforcement to significantly reduce attack success rates.
Indirect Prompt Injection (IPI) Attacks
Indirect Prompt Injection (IPI) attacks, also known as context or retrieval injection, constitute a sophisticated threat vector in which adversarial instructions are embedded within external content that is later ingested by LLM-integrated systems. Unlike direct prompt injection, where the attacker manipulates the user-supplied prompt or system message, IPI exploits the integration of untrusted data sources—such as documents, emails, web content, code descriptions, or GUI elements—into the LLM’s context. If not robustly isolated or filtered, these hidden instructions can hijack the model’s behavioral state, causing the LLM to follow attacker-designated imperatives in place of the legitimate user’s intent. IPI attacks are empirically demonstrated to be feasible, often highly effective, across applications ranging from retrieval-augmented generation (RAG) to multi-tool LLM agents and multimodal GUI agents (Wen et al., 8 May 2025, Xie et al., 27 Oct 2025, Greshake et al., 2023, Johnson et al., 20 Jul 2025, An et al., 21 Aug 2025).
1. Taxonomy, Formal Definition, and Threat Models
IPI attacks manifest wherever LLM applications concatenate retrieved, externally-sourced content with user prompts in a manner that is parsed indiscriminately as instructions (Yi et al., 2023, Greshake et al., 2023, Ji et al., 19 Nov 2025). The formal structure is as follows:
Given:
- : user’s instruction
- : external (retrieved) content
- : adversarial (hidden) instruction embedded in
- : model output
An IPI attack seeks to maximize
such that aligns with and subverts or overrides the behavioral policy induced by .
The attack surface encompasses multiple domains:
- Document-based RAG: Attacker inserts instructions anywhere in the external text pool (e.g., web, wiki, email) (Wen et al., 8 May 2025, Yi et al., 2023).
- Tool-integrated agents: Malicious instructions propagate via tool outputs, tool descriptions, or API returns, which are then interpreted as actionable commands (Xie et al., 27 Oct 2025, An et al., 21 Aug 2025).
- Multimodal agents and GUIs: Adversarial payloads appear in visual elements, HTML accessibility trees, audio overlays, or video frames, which multimodal LLM agents parse and may act on (Lu et al., 5 Dec 2025, Johnson et al., 20 Jul 2025, Lu et al., 20 May 2025).
Threat models are distinguished by attacker privileges (black-box vs. white-box knowledge, control over tool descriptions or only data, adaptive vs. static payloads) and by the defender’s trust assumptions about the input pipeline and agent logic (Ji et al., 19 Nov 2025, Zhan et al., 27 Feb 2025, Wen et al., 8 May 2025).
2. Mechanism of IPI Attack Success: Model Confusion and Instruction Hijacking
The core vulnerability exploited by IPI attacks is the LLM’s inability to reliably demarcate “data” (to be summarized, searched, answered about) from “instructions” (to be followed) (Yi et al., 2023, Wen et al., 8 May 2025, Wang et al., 29 Apr 2025). IPI payloads are seamlessly blended into context so that the forward behavioral state of the model is altered, switching the decision boundary from compliance with user to obedience to attacker :
- Context-injected instructions are often placed at salience-maximizing positions (beginning or, more effectively, end), which empirical measurement shows leads to higher Attack Success Rate (ASR) (Yi et al., 2023).
- Content with high “freedom” (free-form reviews, open-ended fields) further amplifies attack transfer; synthetic experiments demonstrate this holds for both open-source and API LLMs (Zhan et al., 5 Mar 2024).
- In coding agents, tool-integrated systems, and web agents, instructions can be “smuggled” in tool meta-data or HTML accessibility fields, bypassing front-end prompt validation and leveraging universal adversarial triggers (Johnson et al., 20 Jul 2025, Xie et al., 27 Oct 2025).
Empirical benchmarks show high ASRs for standard LLM configurations:
- GPT-4: up to 31% ASR on BIPIA (Yi et al., 2023), 47% on InjecAgent reinforced setting (Zhan et al., 5 Mar 2024)
- Llama2-70B: >75% ASR depending on attack modality (Zhan et al., 5 Mar 2024)
- In agentic benchmarks, ASR can approach or exceed 85% without robust defense (An et al., 21 Aug 2025, Xie et al., 27 Oct 2025)
3. Detection and Defense Methodologies
Defense against IPI attacks requires—at a minimum—mechanisms for discriminating actionable instructions from benign context, or structuring model execution so that untrusted data cannot alter intended behavior. Principal strategies include:
A. Behavioral State Detection via Internal Model Signals
- Extracting discriminative “fingerprints” from intermediate hidden states and backward gradients of the model, capturing how the LLM would need to “bend” to comply with an instruction (Wen et al., 8 May 2025).
- In the prototype, forward pass on yields last-token activations , while backprop of cross-entropy loss for a canonical reply (e.g., "Sure") exposes instruction-following gradient signatures from self-attention layers. These features, layer-normalized and fused, drive an MLP classifier trained on labeled context vs. instruction-injected documents, achieving up to 99.6% accuracy in-domain and 96.9% OOD, reducing ASR to 0.12% on the BIPIA benchmark.
B. Prompt Engineering and Context Isolation
- Use of explicit boundary tokens ("<data>", "</data>"), multi-turn dialogue (moving external content to a prior conversational turn), or in-context learning with adversarial examples demonstrating correct refusals, has measurable effect in reducing ASR (e.g., GPT-4 drops by 20–35%) but does not eliminate attack success (Yi et al., 2023).
- Explicit reminders and fine-tuned models that embed the policy “ignore instructions inside <data>...</data>” can drive ASR to near zero, at the expense of additional model modification or training resources.
C. System-Level Plan Enforcement
- IPIGuard introduces a tool dependency graph (TDG) that statically encodes the plan of tool calls and strictly forbids any tool invocation not present in the plan; all argument estimation, node expansion, and dummy responses are restricted to the DAG’s predefined, acyclic topology (An et al., 21 Aug 2025).
- This “plan-then-trust” paradigm achieves robust (<1% ASR) security with high utility, but may be overly restrictive for dynamic or unforeseen tool use.
D. Output Firewall and Sanitizer
- “Minimize … Sanitize” firewalls at the agent–tool interface strip dangerous instructions from tool outputs before they reach the LLM, substantially reducing ASR across multiple public benchmarks (Bhagwatkar et al., 6 Oct 2025).
E. Attention- and Attribution-Based Defenses
- Rennervate leverages token-level attention pattern signatures, using a two-step pooling over response tokens and heads, enabling precise sanitization in challenging scenarios with minimal false positives and high robustness to adaptive attacks (Zhong et al., 9 Dec 2025).
- CachePrune prunes task-triggering neurons in the key–value cache identified via feature attribution, driving the model to treat arbitrary context as pure data (Wang et al., 29 Apr 2025).
F. Semantic and Intent Analysis
- Recent advances focus on extracting the internal “intent” policy of the model—decoding which instructions the LLM plans to execute, independent of superficial pattern recognition. This is operationalized by querying or intervening on the LLM’s reasoning trace, then mapping intended actions to their origin (trusted/untrusted) and masking or alerting if an overlap is found (Kang et al., 30 Nov 2025).
4. Robustness, Evaluation, and Benchmarks
A meaningful evaluation of IPI defenses requires validated, adversarial benchmarks and the use of strong, adaptive attacks:
- The BIPIA and InjecAgent benchmarks provide thousands of test-cases systematically combining user and attacker payloads, covering both direct-harm and data-exfiltration objectives (Yi et al., 2023, Zhan et al., 5 Mar 2024).
- Tools such as the Greedy Coordinate Gradient (GCG) algorithm produce universal adversarial triggers for web agents, resulting in >90% ASR in held-out settings (Johnson et al., 20 Jul 2025).
- QueryIPI demonstrates that query-agnostic payloads—constructed via iterative, prompt-profiled mutation leveraging internal prompt leakage—dominate query-specific attacks in both coverage and transferability (Xie et al., 27 Oct 2025).
- Firewalls and robust plan enforcement saturate current static benchmarks, pushing ASR to near zero; however, Braille-encoded or obfuscated attack variants can still bypass defenses, motivating the need for ongoing benchmark strengthening (Bhagwatkar et al., 6 Oct 2025).
- Defenses should be validated against not only static and synthetic attacks but also adaptive, logic-driven payloads optimized specifically to evade deployed filters (Zhan et al., 27 Feb 2025, Ji et al., 19 Nov 2025).
Comparative Table: Defense Effectiveness (Selected Benchmarks/Papers)
| Defense/Metric | In-domain ASR | Out-of-domain ASR | Utility (UA) | Notes |
|---|---|---|---|---|
| BIPIA Baseline | 31% (GPT-4) | — | — | Standard LLMs, high vulnerability (Yi et al., 2023) |
| Boundary Awareness | 20–24% | — | -5% ROUGE | Context separation, text markers |
| Detector+Sanitizer | 0.12%–0.03% | 3% | 99% | Hidden+gradient fusion, BIPIA (Wen et al., 8 May 2025) |
| IPIGuard | 0.69% | <1% | 58.77% (UA) | Plan enforcement, AgentDojo (An et al., 21 Aug 2025) |
| FATH (Auth.-Based) | 0–0.01% | 0–0.01% | — | Hash-based output authentication (Wang et al., 28 Oct 2024) |
| MELON | 0.24% | <1% | 58.8–68.7% | Masked re-execution, AgentDojo (Zhu et al., 7 Feb 2025) |
| Adaptive Attacks | >50% | — | Varied | Most published static defenses are evaded (Zhan et al., 27 Feb 2025) |
5. Practical Impacts and Real-World Observations
IPI attacks have been observed in the wild across a diversity of LLM-empowered applications:
- Public search/chat agents (Bing Sidebar, customer support bots) can be trivially manipulated via external data (HTML comments, product reviews) (Greshake et al., 2023, Kaya et al., 8 Nov 2025).
- Coding agents and IDE plugins are highly vulnerable when tool descriptions or system prompts are inadvertently exposed (Xie et al., 27 Oct 2025).
- GUI agents parsing HTML accessibility trees inherit the full attack surface of the rendered DOM, with universal adversarial triggers shown to force unwanted clicks, exfiltration, or service denial, even when placed in visually hidden elements (Johnson et al., 20 Jul 2025).
- Multimodal attacks (images, audio, video) achieve near-complete action redirection if the LLM’s behavioral state is not actively steered or decoupled from attacker-modified media. ARGUS and related subspace-steering approaches show promise for modalities beyond text (Lu et al., 5 Dec 2025).
Industry studies of thousands of real-world chatbots confirm that a non-trivial percentage (≈13%) unintentionally scrape and reflect untrusted third-party user content, introducing latent IPI risk at scale (Kaya et al., 8 Nov 2025).
6. Limitations, Open Challenges, and Future Directions
Despite substantial progress, significant gaps remain:
- Existing defenses often incur computational, latency, or utility trade-offs ill-suited for ultra–low-latency or resource-constrained environments (Wen et al., 8 May 2025, Wang et al., 29 Apr 2025).
- Position-, token-, and LLM-agnostic attacks (e.g., braille homographs, syntactic variants, or multimodal cross-channel triggers) are insufficiently covered by current detection benchmarks, and empirical robustness is brittle in the face of adaptive attackers (Zhan et al., 27 Feb 2025, Bhagwatkar et al., 6 Oct 2025).
- Over-filtering can degrade benign utility in scenarios where legitimate instructions reside in external content, demanding fine-grained instruction-origination tracing (Kang et al., 30 Nov 2025, Zhong et al., 9 Dec 2025).
- Systematic evaluation and benchmarking must evolve toward dynamic test environments, with randomized, multi-turn, multi-agent, and co-evolving attack–defense arms races (Benchmark challenges in (Bhagwatkar et al., 6 Oct 2025, Ji et al., 19 Nov 2025)).
- Future work should emphasize cross-modal provenance tracking, semantic intent extraction, architectural isolation, and composite defense-in-depth strategies that supplement internal model signals with robust system-level guarantees (Lu et al., 5 Dec 2025, An et al., 21 Aug 2025, Bhagwatkar et al., 6 Oct 2025).
7. Summary of Research Directions and Best Practices
- Detection must move beyond surface pattern-matching to behavioral state analysis, attribution, and semantic tracing of instruction-following policies (Wen et al., 8 May 2025, Kang et al., 30 Nov 2025, Wang et al., 29 Apr 2025).
- Architectural separation, privilege boundaries, and systematic role isolation for untrusted data are necessary to contain the flow of a potentially attacker-controlled context (An et al., 21 Aug 2025, Kaya et al., 8 Nov 2025).
- Adversarial and adaptive attack evaluation is required for any defense to be meaningfully considered robust (Zhan et al., 27 Feb 2025, Ji et al., 19 Nov 2025).
- Benchmarks need to be enriched with stronger, obfuscated, and logic-driven attacks that more closely mimic real-world adversarial strategies (Bhagwatkar et al., 6 Oct 2025, Ji et al., 19 Nov 2025).
- Practical deployments should enforce fine-grained role boundaries, authenticate privileged control flows, employ model-level and architectural checks jointly, and treat all third-party or user-generated data as adversarial by default (Kaya et al., 8 Nov 2025, Wang et al., 28 Oct 2024).
IPI attacks constitute a critical risk surface for LLM-integrated systems. State-of-the-art detection and defense now draw on internal behavioral signatures, robust system architectures, and intent-centric tracing to achieve measurable reductions in attack success rates, but maintaining such security in the face of adaptive, cross-domain adversaries remains a rapidly evolving research frontier.