Indirect Prompt Injection Attacks

Updated 7 December 2025

Indirect Prompt Injection Attacks (IPIAs) are security threats where adversaries inject malicious instructions via external data sources integrated into LLM inputs.
IPIAs exploit data retrieval and concatenation processes, enabling attackers to bypass direct input controls and achieving high attack success rates.
Defensive measures such as behavioral detection, prompt engineering, and execution structure controls are essential to mitigate IPIA vulnerabilities.

An indirect prompt injection attack (IPIA) is a security threat unique to LLM–integrated systems, in which an attacker manipulates external data sources—such as web documents, retrieved files, or tool outputs—embedding instructions that, when incorporated into the LLM’s input context, are interpreted and executed by the model as if they were bona fide user instructions. Unlike direct prompt injections, IPIAs do not require adversarial modification of the user’s immediate prompt or the system instruction, but instead exploit the model's inability to distinguish between “instruction” and “data” when presented with concatenated or interleaved inputs from multiple sources (Chen et al., 18 Jul 2025, Greshake et al., 2023, Yi et al., 2023). As a result, IPIAs expose critical vulnerabilities across retrieval-augmented generation systems, agentic tool pipelines, web-automation agents, and multimodal environments.

1. Formalization and Distinction from Direct Prompt Injection

Indirect prompt injection attacks are formally characterized by an adversary’s ability to plant adversarial instructions in data sources external to the LLM, which are subsequently retrieved by the victim system and concatenated to the original user prompt. If $T_b$ is benign data and $I_{\text{inj}}$ the attacker’s instruction, then the poisoned input is $T_{\text{inj}} = \text{Atk}(T_b, I_{\text{inj}})$ . The system applies any defenses $\text{Def}$ and queries the LLM $\mathcal{M}$ with input $\text{Def}(I_{\text{ori}}, T_{\text{inj}})$ , yielding response $R = \mathcal{M}(\text{Def}(I_{\text{ori}}, T_{\text{inj}}))$ (Chen et al., 18 Jul 2025).

Indirect prompt injection is successful if the LLM’s output $R$ contains a segment $r$ such that $r\in \text{Exec}(I_{\text{inj}})$ , i.e., the adversary’s payload is executed. This is contrasted with direct prompt injection, where the attacker appends $I_{\text{inj}}$ directly to the user/system prompt and retains full control over the immediate input; IPIAs, however, specifically target scenarios where the attacker’s only handle is external data ingested by the model, making the attack channel both more subtle and pervasive (Chen et al., 18 Jul 2025, Yi et al., 2023).

2. Threat Models and Attack Taxonomy

IPIAs are characterized by several threat models:

Content Delivery Vectors: External data sources under attacker control include web pages, retrieved documents, PDFs, HTML comments, plugin APIs, emails, and GUI overlays (Greshake et al., 2023, Lu et al., 20 May 2025).
Injection Modalities:
- Passive retrieval: Poisoning content expected to be fetched by the application (e.g., SEO-optimized web pages).
- Active delivery: Injecting into asynchronous feeds, chat messages, or email content.
- Environmental injection: Modifying the visual interface seen by an agent in GUIs (pop-ups, chat overlays) (Lu et al., 20 May 2025).
Attack Objectives: Encompass confidentiality violations (persuasion or exfiltration), integrity attacks (phishing, content manipulation, API misuse), and availability degradation (service denial, resource abuse) (Greshake et al., 2023).
Application Scenarios: Include search-augmented chat, retrieval-augmented generation (RAG), IDE coding agents, autonomous web agents, and customer-service chatbots (Chen et al., 18 Jul 2025, Xie et al., 27 Oct 2025, Johnson et al., 20 Jul 2025, Kaya et al., 8 Nov 2025).

The attack surface further expands in environments where data is indiscriminately concatenated (e.g., all website text, user-generated reviews, plugin tool outputs) and where clear differentiators between data and instruction are absent or easy to evade (Kaya et al., 8 Nov 2025, Yi et al., 2023).

3. Representative Attack Methodologies

3.1 TopicAttack (Topic Transition IPIA)

TopicAttack introduces a multi-turn conversational transition between benign content $T_b$ and malicious instruction $I_{\text{inj}}$ , thereby smoothing the topic shift and increasing semantic plausibility (Chen et al., 18 Jul 2025). The injection is constructed as

$T_{\text{inj}} = T_b \oplus T_t \oplus [\text{reminder prompt}] \oplus I_{\text{inj}}$

where $T_t$ is a topic-transition prompt generated by an auxiliary LLM, and the reminding prompt ensures attention is focused on $I_{\text{inj}}$ . Empirically, TopicAttack achieves state-of-the-art ASR exceeding 90% across a range of models and defenses, outperforming abrupt-injection baselines by significant margins.

3.2 QueryIPI (Query-Agnostic IPI for Coding Agents)

QueryIPI targets agents with function-calling or tool-descriptive capabilities, leveraging knowledge of internal prompts (possibly leaked) to optimize tool descriptions in a white-box fashion. The attack constructs a description that, regardless of the user’s query, robustly triggers a malicious action $a_{\text{mal}}$ (Xie et al., 27 Oct 2025). Iterative optimization uses a prompt-driven “mutation” LLM, with scoring by a judge LLM. QueryIPI achieves up to 87% success on simulated coding agents and outperforms prior baselines even in transfer and real-world conditions.

3.3 Environmental Injection in GUIs

Environmental IPIAs embed misleading textual cues into the GUI environment (e.g., pop-ups, chat overlays), which are perceived and acted upon by multimodal agents (Lu et al., 20 May 2025). The EVA framework leverages evolutionary optimization, guided by the agent's visual attention maps, to efficiently discover effective injection cues and layouts. Pop-up manipulations, especially those that concentrate attention on a confirmation button, yield attack success rates up to 80%.

3.4 HTML Accessibility Tree Injection

LLM-based web agents are susceptible to adversarial triggers inserted in the HTML Accessibility Tree. An attacker can embed token sequences (often via hidden <span> or ARIA attributes) that are included in the agent’s prompt context, thus overriding user intent for actions such as credential exfiltration or forced ad clicks (Johnson et al., 20 Jul 2025). Using discrete optimization (Greedy Coordinate Gradient, GCG), universal triggers with ASR over 80% across a variety of real websites and navigation tasks were synthesized.

4. Empirical Results and Metrics

Quantitative evaluation across tasks and domains reveals systemic vulnerabilities:

Attack	Model/Context	No Defense ASR	Defense ASR	Notes
TopicAttack	Llama3-8B, GPT-4o	>90%	60–80% (prompt defense)	Outperforms baselines by 20–50pp
QueryIPI	Coding agents (sim)	up to 87%	n/a	50% ASR on real-world closed agents
HTML Tree	Browser agents	83–96%	—	Universal triggers work cross-site
Chatbot Plugin	3rd-party web sites	up to 5–8× more likely	—	Combination of site design flaws and improper context handling

ASR (attack success rate) reflects the fraction of queries where the model follows the injected instruction. Position dependence (injection at end > middle > head) is commonly observed (Yi et al., 2023). Notably, adversarial fine-tuning (white-box) can reduce ASR close to zero, but prompt-based, black-box defenses are often only partially successful. Query-agnostic and universal attacks demonstrate that single, carefully constructed triggers can reliably subvert agent behavior across queries or tasks (Xie et al., 27 Oct 2025, Johnson et al., 20 Jul 2025).

5. Defensive Paradigms and Analysis of Failure Modes

5.1 Detection, Isolation, and Removal

Behavioral State and Attention-Ratio Detection: Detection leveraging hidden state and gradient features of LLMs achieves near-perfect accuracy in in-domain and out-of-domain settings, reducing ASR to 0.12% on standard benchmarks (Wen et al., 8 May 2025).
Explicit Reminder and Boundary-Awareness Prompts: Prompt engineering to instruct the LLM not to follow any instruction in external data can cut ASR by 20–35 percentage points, but does not fully mitigate adaptive or blending attacks (Yi et al., 2023, Chen et al., 18 Jul 2025).
Test-time Authentication (FATH): Hash-based authentication tags surround user instructions and external content, enabling response filtering strictly on authorized tags. FATH reduces ASR to 0–0.5% even under adaptive, worst-case attacks (Wang et al., 28 Oct 2024).
CachePrune (Neuron Masking): Identifies and prunes “task-triggering neurons” in the input context, substantially suppressing ASR without performance loss (Wang et al., 29 Apr 2025).
Execution Structure (IPIGuard): Imposes a tool-dependency graph (TDG), decoupling the agent’s planning phase from execution, and blocking any tool call not in the planned traversal, reducing ASR below 1% with only minor overhead (An et al., 21 Aug 2025).
Masked Re-execution (MELON): Runs the agent twice—once on the full prompt, once masking the user task—and flags an attack if tool calls match, exploiting the statistical decoupling seen in successful IPIAs (Zhu et al., 7 Feb 2025).

5.2 Analytical Findings on Defensive Weaknesses

Comprehensive systematization (Ji et al., 19 Nov 2025) and adaptive attack studies (Zhan et al., 27 Feb 2025) reveal recurrent root causes of defense circumvention:

Imprecise access control over tool selection/parameters.
Incomplete isolation of untrusted information (error-message leakage, cross-channel contamination).
Judgment errors in LLM-based runtime checks or detectors.
Insufficiently comprehensive security policies.
Limited generalization of fine-tuned or static pattern-based defenses.
Lack of robust integration beyond surface-level or static syntactic matching.

Adaptive attacks targeting these flaws can amplify ASR by 2–4×, highlighting the brittleness of defenses that do not operate at representation, structural, or execution-policy levels (Ji et al., 19 Nov 2025, Zhan et al., 27 Feb 2025).

6. Open Challenges and Future Directions

Despite diversified defensive mechanisms, IPIAs remain an open security risk:

Adaptive Attacks: Any static filtering, prompting, or local detection can be circumvented by adversarial optimization—addressing dynamic, logic-embedded, and cross-modal payloads is essential (Zhan et al., 27 Feb 2025, Ji et al., 19 Nov 2025).
Benchmark Rigorousness: Existing evaluations saturate with simple defenses; ongoing efforts focus on revising benchmarks to incorporate semantic goal completion, realistic attack vectors, and utility-aware metrics (Bhagwatkar et al., 6 Oct 2025).
Agentic Structural Enforcement: Defensive paradigms are pivoting toward execution-structure constraints (e.g., tool-dependency graphs, cryptographic tagging), model-internal state analysis, and intent analysis (Kang et al., 30 Nov 2025, An et al., 21 Aug 2025).
Generalization and Transfer: Robustness to out-of-domain and previously unseen attack modalities is not yet systemically solved; defenses must transfer across settings, models, and agentic architectures.
Safety vs. Utility Balance: Strong defenses risk utility collapse (disabled tool use, excessive task blocking); layered or hybrid approaches (e.g., MELON + prompt defense) offer better trade-offs (Zhu et al., 7 Feb 2025).

Future research emphasizes defense-in-depth, rigorous structural separation between data and instruction channels, and principled monitoring of model intent and execution pathways.

7. Impact and Security Implications

IPIAs are a persistent threat in any LLM-powered system that ingests or retrieves external content. Consequences range from data theft and privacy breaches (e.g., VortexPIA extracting user PII (Cui et al., 5 Oct 2025)) to tool misuse (bank transfers, content hijacking, ecosystem contamination). Even identification of adaptive defenses with near-zero ASR demonstrates no silver bullet; the continual arms race between attack and defense mandates regular security auditing and red-teaming using up-to-date benchmarks and multimodal scenarios (Ji et al., 19 Nov 2025, Johnson et al., 20 Jul 2025, Lu et al., 20 May 2025). Alignment and robust RLHF (reinforcement learning from human feedback) appear strongly correlated with resilience, but must be backed with structural and pipeline-level measures (Ganiuly et al., 3 Nov 2025, Yi et al., 2023).