Indirect Prompt Injections in LLMs
- Indirect prompt injection is the exploitation of external data feeds to embed adversarial instructions, leading LLMs to execute unintended commands.
- It involves a range of attack vectors including instruction override, template forging, and stealth tactics that hide malicious payloads in trusted contexts.
- Defensive approaches include input provenance markers, runtime masking, and adversarial training to mitigate the significant risks posed by these attacks.
Indirect prompt injection (IPI) refers to the class of attacks against LLM applications, agents, or pipelines in which an adversary embeds malicious instructions within external input channels—such as retrieved web documents, tool responses, cloud logs, or user-generated content—that are later ingested by the model as part of its prompt context. Unlike direct prompt injection, where the user or attacker directly enters the adversarial instruction via the primary input interface, IPI exploits the LLM’s inability to distinguish between trusted instructions and untrusted data injected along secondary information paths. This vulnerability has been empirically demonstrated across a wide range of agentic LLM systems, code assistants, retrieval-augmented generation (RAG) applications, web-integrated tools, and even vision–LLMs, representing a major emerging risk in autonomous language agent architectures (Greshake et al., 2023, Khodayari et al., 29 Apr 2026, Zhan et al., 2024, Chang et al., 26 Sep 2025, Shah, 15 Apr 2026).
1. Definitions and Threat Models
IPI attacks exploit the tendency of LLM-driven systems to integrate and interpret external data as part of the prompt context. Let be developer-provided (trusted) instructions, and be untrusted content (e.g., a web page, retrieved email, log entry, or tool API response). The system’s prompt typically takes the form or , where is the adversary-controlled data channel (Khodayari et al., 29 Apr 2026, Hines et al., 2024).
Core IPI attack scenario:
- The adversary is capable of inserting controlled content into .
- The system appends, concatenates, or integrates with the rest of the prompt used for LLM inference.
- Payloads are crafted such that, when present in context, the LLM abandons user intent and executes actions aligning with adversarial goals (Wang et al., 24 Feb 2026, Yu et al., 8 Jan 2026, He et al., 11 Mar 2026, Zhan et al., 2024).
Threat models vary:
- Passive IPI: Attacker plants harmful instruction in widely read locations (webpages, documentation, logs), hoping it will eventually be retrieved (Greshake et al., 2023, Khodayari et al., 29 Apr 2026, Shah, 15 Apr 2026).
- Active IPI: Attacker triggers, sends, or social-engineers the content into the system at a strategic moment (e.g., as an email to the user, a poisoned tool response, or a cloud log entry) (Shah, 15 Apr 2026).
- White-box IPI: Attacker leverages knowledge of system prompt structure, hierarchy, or agent design to circumvent parsing or isolation mechanisms (e.g., by forging chat templates) (Chang et al., 26 Sep 2025, Xie et al., 27 Oct 2025).
2. Attack Techniques, Taxonomy, and Prevalence
A diverse taxonomy of IPI techniques has been identified in both empirical and large-scale “in the wild” studies (Greshake et al., 2023, Khodayari et al., 29 Apr 2026, Hines et al., 2024, Zhan et al., 2024, Shah, 15 Apr 2026):
Attack vectors include:
- Instruction override: “Ignore previous instructions. Now…” (Khodayari et al., 29 Apr 2026, Chang et al., 26 Sep 2025).
- Template forging: Forging chat or role tags to escalate secret messages to higher-privilege context (ChatInject) (Chang et al., 26 Sep 2025).
- Workflow hijack: Embedding multi-turn persuasive or authority-driven conversational snippets (Chang et al., 26 Sep 2025).
- Data-leak, exfiltration, code execution: Commands for data theft or shell execution (e.g., via log poisoning/log-based attacks) (Shah, 15 Apr 2026, He et al., 11 Mar 2026).
- Stealth and obfuscation: Base64 encoding, homoglyph substitution, multistage triggers, camouflaging payload inside non-obvious carriers (comments, metadata, JSON-LD, HTTP headers) (Greshake et al., 2023, Khodayari et al., 29 Apr 2026).
Empirical studies demonstrate real-world prevalence:
- In a crawl of 1.2B web pages (Common Crawl 2025), over 15,000 validated IPI payloads were found, most embedded in non-rendered HTML (headers, comments, metadata), of which 87% were invisible to human users (Khodayari et al., 29 Apr 2026).
- In cloud environments, LLM debugging agents were found to execute adversarial commands verbatim from log content with up to 86.2% success (Llama 3.3 70B) under “active” conditions (Shah, 15 Apr 2026).
- Benchmarking (InjecAgent) across 30 agents revealed attack success rates of up to 85% on large open-source models, with even fine-tuned agents exhibiting nonzero rates (Zhan et al., 2024).
| Attack Type | Example Mechanism | Reference |
|---|---|---|
| Instruction override | “Ignore all previous instructions…” | (Chang et al., 26 Sep 2025, Khodayari et al., 29 Apr 2026) |
| Template forging | <im_start>system ... <im_end> |
(Chang et al., 26 Sep 2025) |
| Log poisoning | Inject shell/CLI code via cloud logs | (Shah, 15 Apr 2026) |
| Stealth/obfuscation | Base64, homoglyphs, CSS hiding | (Greshake et al., 2023, Khodayari et al., 29 Apr 2026) |
| Chat/role escalation | Forged tags, multi-turn role-play | (Chang et al., 26 Sep 2025) |
3. Detection Methodologies and Systemic Vulnerabilities
IPI exploits the architectural ambiguity between trusted and untrusted prompt segments. Key vulnerabilities stem from agentic workflows that concatenate context without authenticated provenance or isolation (Greshake et al., 2023, Hines et al., 2024, Shah, 15 Apr 2026).
Detection strategies:
- Behavioral and action-level probing: AttriGuard and similar approaches test why an action (e.g., tool call) was proposed by counterfactually suppressing steering information in untrusted context; if the action disappears under control attenuation, it is flagged as “observation-driven” (malicious) (He et al., 11 Mar 2026).
- Latent space anomaly detection: ICON detects over-focusing or “attention collapse” in the LLM’s latent space (abnormally low entropy in attention heads toward adversarial tokens), then surgically redistributes attention to restore correct behavior (Wang et al., 24 Feb 2026).
- Representation engineering: Classifier probes can detect abnormal hidden-state representations or entropy spikes just prior to unauthorized actions, enabling pre-commitment circuit breakers (Zhu et al., 4 Apr 2026).
- Instruction-following intent analysis: IntentGuard extracts the LLM’s own planned intents and traces their provenance, blocking any intentions linked to untrusted prompt regions (Kang et al., 30 Nov 2025).
- Zero-shot embedding drift: ZEDD and related methods detect prompt injection by quantifying shifts in high-dimensional embedding space between benign and suspect inputs (Sekar et al., 18 Jan 2026).
- External detectors and segment extractors: Segmentation or extraction modules remove sentence-level or phrase-level regions of untrusted context where detectors identify instruction-like content (Chen et al., 23 Feb 2025).
Table: Empirical detection performance (sample)
| Method | Approach | Detection Acc. | Attack Success Rate (ASR) | Reference |
|---|---|---|---|---|
| ICON | Latent attention trace | 0.4% | 0.4% | (Wang et al., 24 Feb 2026) |
| AttriGuard | Action-level attribution | ~100% (static) | 0.0% (static) | (He et al., 11 Mar 2026) |
| ZEDD | Embedding drift | >93% | <7% | (Sekar et al., 18 Jan 2026) |
| CachePrune | Neuron cache pruning | N/A | 7–15% (code/QA) | (Wang et al., 29 Apr 2025) |
| IntentGuard | Intent provenance | ≥92% | <9% (adv. attacks) | (Kang et al., 30 Nov 2025) |
| Segmentation | Sentence detector/remover | 99% | <1% | (Chen et al., 23 Feb 2025) |
4. Defense Architectures and Mitigation Strategies
Multiple architectural and prompt engineering defenses have been proposed and evaluated, with complex trade-offs between security/coverage, utility, computational cost, and over-refusal.
System-level and prompt-based strategies:
- Input provenance signaling (“spotlighting”): Delimit or encode untrusted input with markers, provenance bits, or one-way Base64 encodings. Proper configuration reduces ASR from >50% to <2% in experiments on GPT-family models (Hines et al., 2024).
- Planning/execution decoupling: IPIGuard first statically plans calls via a tool dependency graph (TDG), then strictly enforces only planned invocations, eliminating the capacity for hijacks outside the pre-approved workflow (An et al., 21 Aug 2025).
- Runtime masking and re-execution: MELON re-executes the agent with the user’s task masked or replaced by a “task-neutral” prompt; if generated actions are similar to the original, a successful attack is declared (Zhu et al., 7 Feb 2025).
- Fine-tuning for injection robustness: Systematic adversarial training on IPI examples substantially reduces attack success, but is costly and fragile to novel attack variants (Zhan et al., 2024).
- Parsing and field constraint enforcement: Strict extraction and validation of necessary fields from tool output filters out content not matching strict schema or logical requirements (Yu et al., 8 Jan 2026).
- Circuit breakers: Inserted at positions of latent ambiguity in agent workflows, they halt execution before hypotheses can be committed, informed by hidden state probing (Zhu et al., 4 Apr 2026).
Failure modes and limitations:
- Delimiting and “sandwich” defenses (prompt repetition, delimiters) are often bypassed by obfuscated, encoded, or template-based attacks (Chang et al., 26 Sep 2025, Hines et al., 2024, Zhu et al., 4 Apr 2026).
- Off-the-shelf detectors and guardrails fail when payloads are disguised within realistic carrier structures (headers, logs, chat roles) (Shah, 15 Apr 2026, Khodayari et al., 29 Apr 2026).
- High-utility preservation is only achieved in modern frameworks that balance proactive detection/mitigation with minimal interruption of benign workflow (ICON, AttriGuard, IntentGuard) (Wang et al., 24 Feb 2026, He et al., 11 Mar 2026, Kang et al., 30 Nov 2025).
5. Emergent Properties, Risks, and Real-World Impact
Empirical studies confirm that IPI attacks present both offensive and defensive strategic possibilities, and have transitioned from theoretical concern to observable phenomenon in the web and cloud production systems:
- Prevalence: Recurring injection templates and invisible prompt-embedding strategies are present in production-scale web, cloud, and email corpora, with a small set of “families” accounting for most observed attacks (e.g., top 54 templates cover 95% of 15,000+ web injections) (Khodayari et al., 29 Apr 2026).
- Agentic systems: Multi-step automated agents are more vulnerable in dynamic environments, with attack success rates exceeding 80% for representative vector attacks in open-source LLM backbones (Zhu et al., 4 Apr 2026, Zhan et al., 2024).
- Cross-system implications: Associated risks include privilege escalation, arbitrary code execution (“curl | bash”), data exfiltration, AI-bot detection evasion, and reputation manipulation (Shah, 15 Apr 2026, Khodayari et al., 29 Apr 2026, Greshake et al., 2023).
- Defensive friction: Some countermeasures, such as strong output-stripping or excessive input refusal, degrade the agent’s intended utility, blocking not only attacks but also legitimate workflows (Wang et al., 24 Feb 2026, Yu et al., 8 Jan 2026).
6. Open Challenges and Future Directions
Research continues into more robust IPI defenses, guided by persistent vulnerabilities and adversarial adaptation:
- Robust provenance enforcement: Soliciting out-of-band or architectural channels to segregate control instructions from untrusted data, moving beyond in-band signaling (Hines et al., 2024).
- Adaptive and adversarial attacks: IPI attack methods evolve in response to detection heuristics, leveraging white-box knowledge, prompt-leakage, and stealthy obfuscation (Xie et al., 27 Oct 2025, Chang et al., 11 Jan 2026).
- Practical deployment: Integrating lightweight, runtime-safe detection modules and circuit breakers (e.g., representation engineering) into production environments at minimal overhead and user intervention (Zhu et al., 4 Apr 2026).
- Benchmarking and certification: Ongoing need for standard datasets, agentic benchmark suites, and system-level certification (analogous to SBOM) to track security properties throughout LLM pipelines (Khodayari et al., 29 Apr 2026, Zhan et al., 2024, Yu et al., 8 Jan 2026).
- Empowerment and counter-control: Prompt injection may be also repurposed as a tool for identity preservation or grassroots resistance by indirect users, opening further research into ethical and technical boundaries (Glazko et al., 17 Oct 2025).
Key open questions:
- Theoretical limits to IPI detection vs. utility preservation (Greshake et al., 2023, He et al., 11 Mar 2026, Zhu et al., 7 Feb 2025).
- Generalization across unseen attacker templates, out-of-domain content, or new agent architectures (Wen et al., 8 May 2025, Chen et al., 23 Feb 2025).
- Multimodal extension—secure handling of vision, audio, or multimodal flows (Wang et al., 24 Feb 2026, Wen et al., 8 May 2025).
7. References (Sample)
- (Greshake et al., 2023): Not what you’ve signed up for: Compromising Real-World LLM-Integrated Applications with Indirect Prompt Injection
- (Hines et al., 2024): Defending Against Indirect Prompt Injection Attacks With Spotlighting
- (Zhan et al., 2024): InjecAgent: Benchmarking Indirect Prompt Injections in Tool-Integrated LLM Agents
- (Chang et al., 26 Sep 2025): ChatInject: Abusing Chat Templates for Prompt Injection in LLM Agents
- (Xie et al., 27 Oct 2025): QueryIPI: Query-agnostic Indirect Prompt Injection on Coding Agents
- (Zhu et al., 7 Feb 2025): MELON: Provable Defense Against Indirect Prompt Injection Attacks in AI Agents
- (Shah, 15 Apr 2026): LogJack: Indirect Prompt Injection Through Cloud Logs Against LLM Debugging Agents
- (Khodayari et al., 29 Apr 2026): Indirect Prompt Injection in the Wild: An Empirical Study of Prevalence, Techniques, and Objectives
- (Wang et al., 24 Feb 2026): ICON: Indirect Prompt Injection Defense for Agents based on Inference-Time Correction
For further methodological or experimental details, readers are directed to the original referenced arXiv works.