Papers
Topics
Authors
Recent
Search
2000 character limit reached

Indirect Prompt Injections in LLMs

Updated 11 May 2026
  • Indirect prompt injection is the exploitation of external data feeds to embed adversarial instructions, leading LLMs to execute unintended commands.
  • It involves a range of attack vectors including instruction override, template forging, and stealth tactics that hide malicious payloads in trusted contexts.
  • Defensive approaches include input provenance markers, runtime masking, and adversarial training to mitigate the significant risks posed by these attacks.

Indirect prompt injection (IPI) refers to the class of attacks against LLM applications, agents, or pipelines in which an adversary embeds malicious instructions within external input channels—such as retrieved web documents, tool responses, cloud logs, or user-generated content—that are later ingested by the model as part of its prompt context. Unlike direct prompt injection, where the user or attacker directly enters the adversarial instruction via the primary input interface, IPI exploits the LLM’s inability to distinguish between trusted instructions and untrusted data injected along secondary information paths. This vulnerability has been empirically demonstrated across a wide range of agentic LLM systems, code assistants, retrieval-augmented generation (RAG) applications, web-integrated tools, and even vision–LLMs, representing a major emerging risk in autonomous language agent architectures (Greshake et al., 2023, Khodayari et al., 29 Apr 2026, Zhan et al., 2024, Chang et al., 26 Sep 2025, Shah, 15 Apr 2026).

1. Definitions and Threat Models

IPI attacks exploit the tendency of LLM-driven systems to integrate and interpret external data as part of the prompt context. Let DD be developer-provided (trusted) instructions, and UU be untrusted content (e.g., a web page, retrieved email, log entry, or tool API response). The system’s prompt typically takes the form C=[D;U]C = [D; U] or X=SsysSuserSadvX = S_\text{sys} \| S_\text{user} \| S_\text{adv}, where SadvS_\text{adv} is the adversary-controlled data channel (Khodayari et al., 29 Apr 2026, Hines et al., 2024).

Core IPI attack scenario:

Threat models vary:

2. Attack Techniques, Taxonomy, and Prevalence

A diverse taxonomy of IPI techniques has been identified in both empirical and large-scale “in the wild” studies (Greshake et al., 2023, Khodayari et al., 29 Apr 2026, Hines et al., 2024, Zhan et al., 2024, Shah, 15 Apr 2026):

Attack vectors include:

Empirical studies demonstrate real-world prevalence:

  • In a crawl of 1.2B web pages (Common Crawl 2025), over 15,000 validated IPI payloads were found, most embedded in non-rendered HTML (headers, comments, metadata), of which 87% were invisible to human users (Khodayari et al., 29 Apr 2026).
  • In cloud environments, LLM debugging agents were found to execute adversarial commands verbatim from log content with up to 86.2% success (Llama 3.3 70B) under “active” conditions (Shah, 15 Apr 2026).
  • Benchmarking (InjecAgent) across 30 agents revealed attack success rates of up to 85% on large open-source models, with even fine-tuned agents exhibiting nonzero rates (Zhan et al., 2024).
Attack Type Example Mechanism Reference
Instruction override “Ignore all previous instructions…” (Chang et al., 26 Sep 2025, Khodayari et al., 29 Apr 2026)
Template forging <im_start>system ... <im_end> (Chang et al., 26 Sep 2025)
Log poisoning Inject shell/CLI code via cloud logs (Shah, 15 Apr 2026)
Stealth/obfuscation Base64, homoglyphs, CSS hiding (Greshake et al., 2023, Khodayari et al., 29 Apr 2026)
Chat/role escalation Forged tags, multi-turn role-play (Chang et al., 26 Sep 2025)

3. Detection Methodologies and Systemic Vulnerabilities

IPI exploits the architectural ambiguity between trusted and untrusted prompt segments. Key vulnerabilities stem from agentic workflows that concatenate context without authenticated provenance or isolation (Greshake et al., 2023, Hines et al., 2024, Shah, 15 Apr 2026).

Detection strategies:

  • Behavioral and action-level probing: AttriGuard and similar approaches test why an action (e.g., tool call) was proposed by counterfactually suppressing steering information in untrusted context; if the action disappears under control attenuation, it is flagged as “observation-driven” (malicious) (He et al., 11 Mar 2026).
  • Latent space anomaly detection: ICON detects over-focusing or “attention collapse” in the LLM’s latent space (abnormally low entropy in attention heads toward adversarial tokens), then surgically redistributes attention to restore correct behavior (Wang et al., 24 Feb 2026).
  • Representation engineering: Classifier probes can detect abnormal hidden-state representations or entropy spikes just prior to unauthorized actions, enabling pre-commitment circuit breakers (Zhu et al., 4 Apr 2026).
  • Instruction-following intent analysis: IntentGuard extracts the LLM’s own planned intents and traces their provenance, blocking any intentions linked to untrusted prompt regions (Kang et al., 30 Nov 2025).
  • Zero-shot embedding drift: ZEDD and related methods detect prompt injection by quantifying shifts in high-dimensional embedding space between benign and suspect inputs (Sekar et al., 18 Jan 2026).
  • External detectors and segment extractors: Segmentation or extraction modules remove sentence-level or phrase-level regions of untrusted context where detectors identify instruction-like content (Chen et al., 23 Feb 2025).

Table: Empirical detection performance (sample)

Method Approach Detection Acc. Attack Success Rate (ASR) Reference
ICON Latent attention trace 0.4% 0.4% (Wang et al., 24 Feb 2026)
AttriGuard Action-level attribution ~100% (static) 0.0% (static) (He et al., 11 Mar 2026)
ZEDD Embedding drift >93% <7% (Sekar et al., 18 Jan 2026)
CachePrune Neuron cache pruning N/A 7–15% (code/QA) (Wang et al., 29 Apr 2025)
IntentGuard Intent provenance ≥92% <9% (adv. attacks) (Kang et al., 30 Nov 2025)
Segmentation Sentence detector/remover 99% <1% (Chen et al., 23 Feb 2025)

4. Defense Architectures and Mitigation Strategies

Multiple architectural and prompt engineering defenses have been proposed and evaluated, with complex trade-offs between security/coverage, utility, computational cost, and over-refusal.

System-level and prompt-based strategies:

  • Input provenance signaling (“spotlighting”): Delimit or encode untrusted input with markers, provenance bits, or one-way Base64 encodings. Proper configuration reduces ASR from >50% to <2% in experiments on GPT-family models (Hines et al., 2024).
  • Planning/execution decoupling: IPIGuard first statically plans calls via a tool dependency graph (TDG), then strictly enforces only planned invocations, eliminating the capacity for hijacks outside the pre-approved workflow (An et al., 21 Aug 2025).
  • Runtime masking and re-execution: MELON re-executes the agent with the user’s task masked or replaced by a “task-neutral” prompt; if generated actions are similar to the original, a successful attack is declared (Zhu et al., 7 Feb 2025).
  • Fine-tuning for injection robustness: Systematic adversarial training on IPI examples substantially reduces attack success, but is costly and fragile to novel attack variants (Zhan et al., 2024).
  • Parsing and field constraint enforcement: Strict extraction and validation of necessary fields from tool output filters out content not matching strict schema or logical requirements (Yu et al., 8 Jan 2026).
  • Circuit breakers: Inserted at positions of latent ambiguity in agent workflows, they halt execution before hypotheses can be committed, informed by hidden state probing (Zhu et al., 4 Apr 2026).

Failure modes and limitations:

5. Emergent Properties, Risks, and Real-World Impact

Empirical studies confirm that IPI attacks present both offensive and defensive strategic possibilities, and have transitioned from theoretical concern to observable phenomenon in the web and cloud production systems:

  • Prevalence: Recurring injection templates and invisible prompt-embedding strategies are present in production-scale web, cloud, and email corpora, with a small set of “families” accounting for most observed attacks (e.g., top 54 templates cover 95% of 15,000+ web injections) (Khodayari et al., 29 Apr 2026).
  • Agentic systems: Multi-step automated agents are more vulnerable in dynamic environments, with attack success rates exceeding 80% for representative vector attacks in open-source LLM backbones (Zhu et al., 4 Apr 2026, Zhan et al., 2024).
  • Cross-system implications: Associated risks include privilege escalation, arbitrary code execution (“curl | bash”), data exfiltration, AI-bot detection evasion, and reputation manipulation (Shah, 15 Apr 2026, Khodayari et al., 29 Apr 2026, Greshake et al., 2023).
  • Defensive friction: Some countermeasures, such as strong output-stripping or excessive input refusal, degrade the agent’s intended utility, blocking not only attacks but also legitimate workflows (Wang et al., 24 Feb 2026, Yu et al., 8 Jan 2026).

6. Open Challenges and Future Directions

Research continues into more robust IPI defenses, guided by persistent vulnerabilities and adversarial adaptation:

  • Robust provenance enforcement: Soliciting out-of-band or architectural channels to segregate control instructions from untrusted data, moving beyond in-band signaling (Hines et al., 2024).
  • Adaptive and adversarial attacks: IPI attack methods evolve in response to detection heuristics, leveraging white-box knowledge, prompt-leakage, and stealthy obfuscation (Xie et al., 27 Oct 2025, Chang et al., 11 Jan 2026).
  • Practical deployment: Integrating lightweight, runtime-safe detection modules and circuit breakers (e.g., representation engineering) into production environments at minimal overhead and user intervention (Zhu et al., 4 Apr 2026).
  • Benchmarking and certification: Ongoing need for standard datasets, agentic benchmark suites, and system-level certification (analogous to SBOM) to track security properties throughout LLM pipelines (Khodayari et al., 29 Apr 2026, Zhan et al., 2024, Yu et al., 8 Jan 2026).
  • Empowerment and counter-control: Prompt injection may be also repurposed as a tool for identity preservation or grassroots resistance by indirect users, opening further research into ethical and technical boundaries (Glazko et al., 17 Oct 2025).

Key open questions:

7. References (Sample)

For further methodological or experimental details, readers are directed to the original referenced arXiv works.

Definition Search Book Streamline Icon: https://streamlinehq.com
References (20)

Topic to Video (Beta)

No one has generated a video about this topic yet.

Whiteboard

No one has generated a whiteboard explanation for this topic yet.

Follow Topic

Get notified by email when new papers are published related to Indirect Prompt Injections.