Papers
Topics
Authors
Recent
2000 character limit reached

Indirect Prompt Injection Attacks

Updated 14 December 2025
  • Indirect Prompt Injection (IPI) attacks are sophisticated threats where adversarial instructions embedded in external content hijack LLM behavior.
  • They exploit untrusted data sources by blending malicious instructions with user prompts, achieving high success rates in various applications.
  • Defenses use methods like behavioral state analysis, context isolation, and plan enforcement to significantly reduce attack success rates.

Indirect Prompt Injection (IPI) Attacks

Indirect Prompt Injection (IPI) attacks, also known as context or retrieval injection, constitute a sophisticated threat vector in which adversarial instructions are embedded within external content that is later ingested by LLM-integrated systems. Unlike direct prompt injection, where the attacker manipulates the user-supplied prompt or system message, IPI exploits the integration of untrusted data sources—such as documents, emails, web content, code descriptions, or GUI elements—into the LLM’s context. If not robustly isolated or filtered, these hidden instructions can hijack the model’s behavioral state, causing the LLM to follow attacker-designated imperatives in place of the legitimate user’s intent. IPI attacks are empirically demonstrated to be feasible, often highly effective, across applications ranging from retrieval-augmented generation (RAG) to multi-tool LLM agents and multimodal GUI agents (Wen et al., 8 May 2025, Xie et al., 27 Oct 2025, Greshake et al., 2023, Johnson et al., 20 Jul 2025, An et al., 21 Aug 2025).

1. Taxonomy, Formal Definition, and Threat Models

IPI attacks manifest wherever LLM applications concatenate retrieved, externally-sourced content with user prompts in a manner that is parsed indiscriminately as instructions (Yi et al., 2023, Greshake et al., 2023, Ji et al., 19 Nov 2025). The formal structure is as follows:

Given:

  • uu: user’s instruction
  • xretrx_{\mathrm{retr}}: external (retrieved) content
  • pp: adversarial (hidden) instruction embedded in xretrx_{\mathrm{retr}}
  • yy: model output

An IPI attack seeks to maximize

P(yu,xretrp)P(y \mid u, x_{\mathrm{retr}} \oplus p)

such that yy aligns with pp and subverts or overrides the behavioral policy induced by uu.

The attack surface encompasses multiple domains:

Threat models are distinguished by attacker privileges (black-box vs. white-box knowledge, control over tool descriptions or only data, adaptive vs. static payloads) and by the defender’s trust assumptions about the input pipeline and agent logic (Ji et al., 19 Nov 2025, Zhan et al., 27 Feb 2025, Wen et al., 8 May 2025).

2. Mechanism of IPI Attack Success: Model Confusion and Instruction Hijacking

The core vulnerability exploited by IPI attacks is the LLM’s inability to reliably demarcate “data” (to be summarized, searched, answered about) from “instructions” (to be followed) (Yi et al., 2023, Wen et al., 8 May 2025, Wang et al., 29 Apr 2025). IPI payloads are seamlessly blended into context so that the forward behavioral state of the model is altered, switching the decision boundary from compliance with user uu to obedience to attacker pp:

  • Context-injected instructions are often placed at salience-maximizing positions (beginning or, more effectively, end), which empirical measurement shows leads to higher Attack Success Rate (ASR) (Yi et al., 2023).
  • Content with high “freedom” (free-form reviews, open-ended fields) further amplifies attack transfer; synthetic experiments demonstrate this holds for both open-source and API LLMs (Zhan et al., 5 Mar 2024).
  • In coding agents, tool-integrated systems, and web agents, instructions can be “smuggled” in tool meta-data or HTML accessibility fields, bypassing front-end prompt validation and leveraging universal adversarial triggers (Johnson et al., 20 Jul 2025, Xie et al., 27 Oct 2025).

Empirical benchmarks show high ASRs for standard LLM configurations:

3. Detection and Defense Methodologies

Defense against IPI attacks requires—at a minimum—mechanisms for discriminating actionable instructions from benign context, or structuring model execution so that untrusted data cannot alter intended behavior. Principal strategies include:

A. Behavioral State Detection via Internal Model Signals

  • Extracting discriminative “fingerprints” from intermediate hidden states h()h^{(\ell)} and backward gradients g()g^{(\ell)} of the model, capturing how the LLM would need to “bend” to comply with an instruction (Wen et al., 8 May 2025).
  • In the prototype, forward pass on xx yields last-token activations h()Rdh^{(\ell)} \in \mathbb{R}^d, while backprop of cross-entropy loss for a canonical reply (e.g., "Sure") exposes instruction-following gradient signatures from self-attention layers. These features, layer-normalized and fused, drive an MLP classifier trained on labeled context vs. instruction-injected documents, achieving up to 99.6% accuracy in-domain and 96.9% OOD, reducing ASR to 0.12% on the BIPIA benchmark.

B. Prompt Engineering and Context Isolation

  • Use of explicit boundary tokens ("<data>", "</data>"), multi-turn dialogue (moving external content to a prior conversational turn), or in-context learning with adversarial examples demonstrating correct refusals, has measurable effect in reducing ASR (e.g., GPT-4 drops by 20–35%) but does not eliminate attack success (Yi et al., 2023).
  • Explicit reminders and fine-tuned models that embed the policy “ignore instructions inside <data>...</data>” can drive ASR to near zero, at the expense of additional model modification or training resources.

C. System-Level Plan Enforcement

  • IPIGuard introduces a tool dependency graph (TDG) that statically encodes the plan of tool calls and strictly forbids any tool invocation not present in the plan; all argument estimation, node expansion, and dummy responses are restricted to the DAG’s predefined, acyclic topology (An et al., 21 Aug 2025).
  • This “plan-then-trust” paradigm achieves robust (<1% ASR) security with high utility, but may be overly restrictive for dynamic or unforeseen tool use.

D. Output Firewall and Sanitizer

  • “Minimize … Sanitize” firewalls at the agent–tool interface strip dangerous instructions from tool outputs before they reach the LLM, substantially reducing ASR across multiple public benchmarks (Bhagwatkar et al., 6 Oct 2025).

E. Attention- and Attribution-Based Defenses

  • Rennervate leverages token-level attention pattern signatures, using a two-step pooling over response tokens and heads, enabling precise sanitization in challenging scenarios with minimal false positives and high robustness to adaptive attacks (Zhong et al., 9 Dec 2025).
  • CachePrune prunes task-triggering neurons in the key–value cache identified via feature attribution, driving the model to treat arbitrary context as pure data (Wang et al., 29 Apr 2025).

F. Semantic and Intent Analysis

  • Recent advances focus on extracting the internal “intent” policy of the model—decoding which instructions the LLM plans to execute, independent of superficial pattern recognition. This is operationalized by querying or intervening on the LLM’s reasoning trace, then mapping intended actions to their origin (trusted/untrusted) and masking or alerting if an overlap is found (Kang et al., 30 Nov 2025).

4. Robustness, Evaluation, and Benchmarks

A meaningful evaluation of IPI defenses requires validated, adversarial benchmarks and the use of strong, adaptive attacks:

  • The BIPIA and InjecAgent benchmarks provide thousands of test-cases systematically combining user and attacker payloads, covering both direct-harm and data-exfiltration objectives (Yi et al., 2023, Zhan et al., 5 Mar 2024).
  • Tools such as the Greedy Coordinate Gradient (GCG) algorithm produce universal adversarial triggers for web agents, resulting in >90% ASR in held-out settings (Johnson et al., 20 Jul 2025).
  • QueryIPI demonstrates that query-agnostic payloads—constructed via iterative, prompt-profiled mutation leveraging internal prompt leakage—dominate query-specific attacks in both coverage and transferability (Xie et al., 27 Oct 2025).
  • Firewalls and robust plan enforcement saturate current static benchmarks, pushing ASR to near zero; however, Braille-encoded or obfuscated attack variants can still bypass defenses, motivating the need for ongoing benchmark strengthening (Bhagwatkar et al., 6 Oct 2025).
  • Defenses should be validated against not only static and synthetic attacks but also adaptive, logic-driven payloads optimized specifically to evade deployed filters (Zhan et al., 27 Feb 2025, Ji et al., 19 Nov 2025).

Comparative Table: Defense Effectiveness (Selected Benchmarks/Papers)

Defense/Metric In-domain ASR Out-of-domain ASR Utility (UA) Notes
BIPIA Baseline 31% (GPT-4) Standard LLMs, high vulnerability (Yi et al., 2023)
Boundary Awareness 20–24% -5% ROUGE Context separation, text markers
Detector+Sanitizer 0.12%–0.03% 3% 99% Hidden+gradient fusion, BIPIA (Wen et al., 8 May 2025)
IPIGuard 0.69% <1% 58.77% (UA) Plan enforcement, AgentDojo (An et al., 21 Aug 2025)
FATH (Auth.-Based) 0–0.01% 0–0.01% Hash-based output authentication (Wang et al., 28 Oct 2024)
MELON 0.24% <1% 58.8–68.7% Masked re-execution, AgentDojo (Zhu et al., 7 Feb 2025)
Adaptive Attacks >50% Varied Most published static defenses are evaded (Zhan et al., 27 Feb 2025)

5. Practical Impacts and Real-World Observations

IPI attacks have been observed in the wild across a diversity of LLM-empowered applications:

  • Public search/chat agents (Bing Sidebar, customer support bots) can be trivially manipulated via external data (HTML comments, product reviews) (Greshake et al., 2023, Kaya et al., 8 Nov 2025).
  • Coding agents and IDE plugins are highly vulnerable when tool descriptions or system prompts are inadvertently exposed (Xie et al., 27 Oct 2025).
  • GUI agents parsing HTML accessibility trees inherit the full attack surface of the rendered DOM, with universal adversarial triggers shown to force unwanted clicks, exfiltration, or service denial, even when placed in visually hidden elements (Johnson et al., 20 Jul 2025).
  • Multimodal attacks (images, audio, video) achieve near-complete action redirection if the LLM’s behavioral state is not actively steered or decoupled from attacker-modified media. ARGUS and related subspace-steering approaches show promise for modalities beyond text (Lu et al., 5 Dec 2025).

Industry studies of thousands of real-world chatbots confirm that a non-trivial percentage (≈13%) unintentionally scrape and reflect untrusted third-party user content, introducing latent IPI risk at scale (Kaya et al., 8 Nov 2025).

6. Limitations, Open Challenges, and Future Directions

Despite substantial progress, significant gaps remain:

7. Summary of Research Directions and Best Practices

IPI attacks constitute a critical risk surface for LLM-integrated systems. State-of-the-art detection and defense now draw on internal behavioral signatures, robust system architectures, and intent-centric tracing to achieve measurable reductions in attack success rates, but maintaining such security in the face of adaptive, cross-domain adversaries remains a rapidly evolving research frontier.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (18)

Whiteboard

Follow Topic

Get notified by email when new papers are published related to Indirect Prompt Injection (IPI) Attacks.